An In-Depth Look at the urllib2 Module in Python

Date: 2021-05-22

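The Python standard library contains many useful utility modules, but the official documentation is often unclear about the details of using them. urllib2, the HTTP client library, is one example. This article summarizes some of the finer points of using urllib2.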

Proxy settings
Timeout settings
Adding specific headers to an HTTP request
Redirect
Cookie
Using the HTTP PUT and DELETE methods
Getting the HTTP status code
Debug log

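Proxy settings

By default, urllib2 uses the http_proxy environment variable to configure the HTTP proxy. To control the proxy explicitly in your program, independent of the environment variable, you can do the following: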

import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http" : 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)

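One detail to note here: urllib2.install_opener() sets urllib2's global opener. That is convenient for later calls, but it rules out finer-grained control, such as using two different proxy settings in the same program. A better practice is to leave the global setting alone and call the opener's open method directly instead of the global urlopen.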
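As a minimal sketch of that advice, the snippet below builds independent openers, each with its own proxy configuration, and never calls install_opener. The proxy hosts (proxy-a.example.com, proxy-b.example.com) and the helper make_opener are hypothetical names chosen for illustration. The try/except import is an assumption added so the same code also runs on Python 3, where these classes live in urllib.request.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 (assumption: same API subset)


def make_opener(proxy_url=None):
    """Return a private opener; proxy_url=None means 'use no proxy at all'."""
    proxies = {"http": proxy_url} if proxy_url else {}
    # Passing an explicit ProxyHandler replaces the default one, so the
    # http_proxy environment variable is ignored for this opener.
    return urllib2.build_opener(urllib2.ProxyHandler(proxies))


# Hypothetical proxy addresses, for illustration only.
opener_a = make_opener("http://proxy-a.example.com:8080")
opener_b = make_opener("http://proxy-b.example.com:8080")
direct_opener = make_opener()

# Each request goes through the opener it is called on, and the global
# urllib2.urlopen() is left untouched:
#     response = opener_a.open("http://example.com/")
```

Since nothing global is modified, code elsewhere in the program that calls urllib2.urlopen() keeps its default behavior.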

Timeout settings

In older versions of Python, the urllib2 API does not expose a timeout setting. The only way to set a timeout is to change the global socket timeout.

import urllib2
import socket

socket.setdefaulttimeout(10)  # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # an alternative way to do the same thing
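Because socket.setdefaulttimeout() changes a process-wide setting, every socket created afterwards is affected, not only those opened by urllib2. One way to limit that side effect, sketched below, is to set the value only for the duration of a call and then restore the previous one. The helper name default_timeout is hypothetical.

```python
import contextlib
import socket


@contextlib.contextmanager
def default_timeout(seconds):
    """Temporarily change the global socket timeout, then restore it."""
    previous = socket.getdefaulttimeout()  # None means no timeout is set
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # Restore the old value even if the body raised an exception.
        socket.setdefaulttimeout(previous)


# Usage: sockets created inside the block time out after 10 seconds.
#     with default_timeout(10):
#         response = urllib2.urlopen("http://example.com/")
```

The setting is still global while the block is active, so this does not isolate threads from each other. On Python 2.5 the with statement additionally requires from __future__ import with_statement.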

From Python 2.6 onward, the timeout can be set directly through the timeout parameter of urllib2.urlopen().
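For illustration, a small wrapper might pass timeout to urlopen() and treat a timeout like any other failed request. The function name fetch and its None-on-failure convention are assumptions of this sketch, not part of urllib2, and the fallback import is there only so the code also runs on Python 3.

```python
import contextlib
import socket

try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3 (assumption: same API subset)


def fetch(url, timeout=10):
    """Return the response body, or None if the request fails or times out."""
    try:
        # timeout is given in seconds and applies to blocking socket
        # operations such as connecting and reading.
        with contextlib.closing(urllib2.urlopen(url, timeout=timeout)) as response:
            return response.read()
    except (urllib2.URLError, socket.timeout):
        # Connection failures arrive wrapped in URLError; a timeout while
        # reading the body can surface as a bare socket.timeout.
        return None
```

Unlike socket.setdefaulttimeout(), the value applies to this one request only, so different calls can use different limits.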

import urllib2
response = urllib2.urlopen('http://

        pile(r'\d{4}\S\d{2}\S\d{2}\s\d{2}\S\d{2}')
        rsp=self.useragent(url)
        soup=BeautifulSoup(rsp)
        timespan=soup.find('div',{'class':'BlogStat'})
        timespan=str(timespan).strip().replace('\n','').decode('utf-8')
        match=re.search(r'\d{4}\S\d{2}\S\d{2}\s\d{2}\S\d{2}',timespan)
        timestr=str(datetime.date.today())
        if match:
            timestr=match.group()
            #print timestr
        ititle=soup.title.string
        div=soup.find('div',{'class':'BlogContent'})
        rss=PyRSS2Gen.RSSItem(
            title=ititle,
            link=url,
            description = str(div),
            pubDate = timestr
        )
        return rss

    def getcontent(self):
        rsp=self.useragent(self.baseurl)
        soup=BeautifulSoup(rsp)
        ul=soup.find('div',{'id':'RecentBlogs'})
        for li in ul.findAll('li'):
            div=li.find('div')
            if div is not None:
                alink=div.find('a')
                if alink is not None:
                    link=alink.get('href')
                    print link
                    html=self.enterpage(link)
                    self.myrss.items.append(html)

    def SaveRssFile(self,filename):
        finallxml=self.myrss.to_xml(encoding='utf-8')
        file=open(self.xmlpath,'w')
        file.writelines(finallxml)
        file.close()

if __name__=='__main__':
    rssSpider=RssSpider()
    rssSpider.getcontent()
    rssSpider.SaveRssFile('oschina.xml')

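As you can see, the script mainly uses BeautifulSoup to scrape the site and then PyRSS2Gen to generate the RSS feed, which is saved as an XML file.
Incidentally, here is the address of the RSS feed I generated:

http://104.224.129.109/myrss/oschina.xml

If you would rather not set this up yourself, you can simply subscribe to it in Feedly.
I run the script every 10 minutes.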
