Extending the requests response class

21 March 2014

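Requests is a wonderful Python library, and one of the libraries I have found most comfortable to use lately. I use it every day, mostly for my scrapers.

In your scraping projects you probably have a handful of convenience functions, and you have probably just been copying them around and passing your response object to them as an argument. We can do better.

I'll show how to add a few simple methods to the Response class, so you can apply the same trick in your own projects with your own methods.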

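We start by defining a Response class with a few simple methods. The most important one is doc(): it parses the HTML once and caches the resulting tree, so the other methods don't have to re-parse the HTML on every call.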

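import inspect

import requests
from lxml import html


class Response(object):
    """Helper methods to be copied onto requests.models.Response."""

    def doc(self):
        # Parse the HTML once and cache the tree on the instance.
        if not hasattr(self, '_doc'):
            self._doc = html.fromstring(self.text)
        return self._doc

    def links(self):
        # Note: once patched in, this replaces the built-in `links` property
        # of requests' Response (the parsed Link headers).
        return self.doc().xpath('//a/@href')

    def images(self, filter_extensions=('jpg', 'jpeg', 'gif', 'png')):
        # Keep only image URLs ending in one of the given extensions.
        extensions = tuple(filter_extensions)
        return [src for src in self.doc().xpath('//img/@src')
                if src.endswith(extensions)]

    def title(self):
        # Return the stripped page title, or None if there is no <title>.
        title = self.doc().xpath('//title/text()')
        return title[0].strip() if title else None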
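
The caching idea behind doc() can be tried without lxml or a network connection. The following is a minimal sketch, not the author's code: CachedPage, parse_count, and the word-splitting stand-in for the HTML parser are hypothetical names made up for illustration.

```python
class CachedPage(object):
    """Hypothetical stand-in for a response object with a .text attribute."""

    def __init__(self, text):
        self.text = text
        self.parse_count = 0  # counts how often the "expensive" parse runs

    def _parse(self):
        # Stand-in for html.fromstring(): split the text into words.
        self.parse_count += 1
        return self.text.split()

    def doc(self):
        # Same pattern as doc() above: parse on first use, then reuse.
        if not hasattr(self, '_doc'):
            self._doc = self._parse()
        return self._doc

    def word_count(self):
        return len(self.doc())

    def first_word(self):
        words = self.doc()
        return words[0] if words else None


page = CachedPage('hello scraping world')
print(page.word_count())   # 3
print(page.first_word())   # hello
print(page.parse_count)    # 1: both helpers shared a single parse
```

Storing the parsed result on the instance means each response object pays the parsing cost at most once, no matter how many helper methods are called on it.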

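Now we need to patch the requests.Response class with the methods of our newly defined class. We'll use the getmembers() function from the inspect module with isfunction() as the predicate. (On Python 2, where functions looked up on a class are unbound methods, the equivalent is ismethod() together with method.im_func.)

# Copy every function defined on our helper class onto requests' Response.
for method_name, method in inspect.getmembers(Response, inspect.isfunction):
    setattr(requests.models.Response, method_name, method)

That's it. You can now use these convenience methods on any response object, as in this example:

r = requests.get('http://imgur.com/')
print(r.title())
print(r.images(filter_extensions=['png']))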
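
The patching step can also be tried on its own, without requests installed. This is a sketch written for Python 3, where functions defined on a class are plain functions and inspect.isfunction serves as the predicate. Target and Helpers are hypothetical toy classes standing in for requests.models.Response and the helper class above.

```python
import inspect


class Target(object):
    """Hypothetical stand-in for requests.models.Response."""

    def __init__(self, text):
        self.text = text


class Helpers(object):
    """Hypothetical helper methods we want every Target to have."""

    def shout(self):
        return self.text.upper()

    def word_count(self):
        return len(self.text.split())


# In Python 3, functions defined on a class are plain functions,
# so inspect.isfunction selects exactly the methods defined above.
patched = []
for name, func in inspect.getmembers(Helpers, inspect.isfunction):
    setattr(Target, name, func)
    patched.append(name)

t = Target('hello world')
print(patched)         # ['shout', 'word_count']
print(t.shout())       # HELLO WORLD
print(t.word_count())  # 2
```

Because the functions are attached to the class itself, instances created before the patch pick up the new methods as well.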

Now go ahead and make your response objects as powerful as you want them to be. If you are interested in other scraping tricks, take a look at my python web scraping resource.

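That is the end of the translation; what follows is my own rambling.

This is the first time I have seen the inspect module used this way.

The original post is short and its English is not hard, so I recommend reading it directly. I translated it because I found the article genuinely interesting.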
