Python Web Scraping: A Detailed Guide to Using urllib2

Date: 2021-05-22

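Web scraping means reading the network resource identified by a URL out of the network stream and saving it locally. Python has many libraries that can fetch web pages; we start with urllib2.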
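The "read from a URL, save locally" idea can be sketched in a few lines. This is a minimal sketch written for Python 3 with urllib.request (the Python 3 counterpart of urllib2); the helper name save_page and the timeout value are assumptions made for illustration, not part of the original article.

```python
import urllib.request


def save_page(url, path, timeout=10):
    """Read the resource at `url` and write its raw bytes to `path`.

    `save_page` is a hypothetical helper name used only for this sketch.
    Returns the number of bytes written.
    """
    # urlopen returns a file-like response object; read() yields bytes
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    # Write in binary mode so the bytes are stored unchanged
    with open(path, "wb") as out:
        out.write(data)
    return len(data)
```

A call such as save_page("http://www.itcast.cn", "page.html") would store the page body in page.html, assuming the site is reachable.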

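urllib2 is a module that ships with Python 2.x (no download needed; just import it).

urllib2 official documentation: https://docs.python.org/2/library/urllib2.html

urllib2 source code: Lib/urllib2.py in the CPython 2.7 source tree

In Python 3.x, urllib2 became urllib.request (its exception classes moved to urllib.error).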
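One common way to cope with the rename is an import fallback. The sketch below is only an illustration: aliasing urllib.request as urllib2 covers names such as urlopen and Request, but it is not a complete compatibility layer.

```python
try:
    import urllib2  # Python 2.x: the module described in this article
except ImportError:
    # Python 3.x: urlopen and Request live in urllib.request
    import urllib.request as urllib2

# Building a Request does not touch the network
request = urllib2.Request("http://www.itcast.cn")
print(request.get_full_url())
```

Code written this way can keep calling urllib2.urlopen(...) and urllib2.Request(...) on either interpreter.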

urlopen

Let's start with some code:

    # -*- coding: utf-8 -*-
    # 01.urllib2_urlopen.py

    # Import the urllib2 library
    import urllib2

    # Send a request to the given URL; the server's response is returned as a file-like object
    response = urllib2.urlopen("http://

    [...]

    patible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
    request = urllib2.Request(url, headers = header)

    # A specific header can also be added or modified by calling Request.add_header()
    request.add_header("Connection", "keep-alive")

    # A header can be inspected by calling Request.get_header()
    request.get_header(header_name = "Connection")

    response = urllib2.urlopen(request)
    print(response.code)  # the response status code
    html = response.read()
    print(html)

Randomly adding/modifying the User-Agent

    # -*- coding: utf-8 -*-
    # 05.urllib2_add_headers.py

    import urllib2
    import random

    url = "http://www.itcast.cn"

    ua_list = [
        "Mozilla/5.0 (Windows NT 6.1; ) Apple.... ",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0)... ",
        "Mozilla/5.0 (Macintosh; U; PPC Mac OS X.... ",
        "Mozilla/5.0 (Macintosh; Intel Mac OS... "
    ]

    # Pick one User-Agent string at random
    user_agent = random.choice(ua_list)

    request = urllib2.Request(url)

    # A specific header can also be added or modified by calling Request.add_header()
    request.add_header("User-Agent", user_agent)

    # When reading it back, capitalize only the first letter and lowercase the rest
    request.get_header("User-agent")

    response = urllib2.urlopen(request)
    html = response.read()
    print(html)
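The header handling described above behaves the same way in Python 3's urllib.request, and it can be tried without sending any request. The following is a minimal sketch under stated assumptions: the User-Agent strings are made-up placeholders, and build_request is a hypothetical helper name, neither of which comes from the original listing.

```python
import random
import urllib.request

# Placeholder User-Agent strings (assumptions for illustration only)
USER_AGENTS = [
    "ExampleBrowser/1.0 (illustrative placeholder)",
    "ExampleBrowser/2.0 (illustrative placeholder)",
]


def build_request(url, user_agents=USER_AGENTS):
    """Return a Request carrying a randomly chosen User-Agent header."""
    request = urllib.request.Request(url)
    request.add_header("User-Agent", random.choice(user_agents))
    request.add_header("Connection", "keep-alive")
    return request


# Constructing the Request does not contact the server
request = build_request("http://www.itcast.cn")

# Header names are stored with only the first letter capitalized
print(request.get_header("User-agent"))
print(request.get_header("User-Agent"))  # None: lookup uses the stored name
```

Passing the resulting object to urllib.request.urlopen(request) would then send the request with those headers.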

Note

The urllib2 module has been split across several modules in Python 3 named urllib.request and urllib.error
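Since the note names urllib.error, here is a short Python 3 sketch of where that module comes into play alongside urllib.request. The helper name fetch_or_none and the timeout value are assumptions made for this illustration.

```python
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=10):
    """Return the response body as bytes, or None if the URL cannot be opened.

    `fetch_or_none` is a hypothetical helper name used only for this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # HTTPError is a subclass of URLError, so it is caught first
        print("HTTP error:", err.code)
    except urllib.error.URLError as err:
        # Raised when the resource cannot be reached or opened
        print("Could not open URL:", err.reason)
    return None
```

In Python 2 the same two exception classes are available directly as urllib2.HTTPError and urllib2.URLError.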

That covers everything in this article; hopefully it helps with your learning.
