Date: 2021-05-22
1. Introduction
BeautifulSoup is a flexible, convenient, and efficient web-page parsing library that supports multiple underlying parsers. With it you can extract information from web pages without writing regular expressions.
Parsers commonly used with Python:

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of bad markup | Poor tolerance of malformed documents in versions before Python 2.7.3 / 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of bad markup | Requires the C library lxml |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires the C library lxml |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; produces valid HTML5 | Slow; pure Python (does not rely on external C extensions) |
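As a minimal sketch of the table above: the parser name is just the second argument to BeautifulSoup. html.parser ships with Python, while "lxml" and "html5lib" must be installed separately with pip before they can be passed here.

```python
from bs4 import BeautifulSoup

broken = "<p><b>Hello"  # deliberately unclosed tags

# The built-in parser repairs the markup; swap in "lxml" or
# "html5lib" (if installed) to compare speed and fault tolerance.
soup = BeautifulSoup(broken, "html.parser")
print(soup.p.b.string)   # Hello
```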
2. Quick Start
Given an HTML document, create a BeautifulSoup object:
```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
```
Print the full document, pretty-printed:
```python
print(soup.prettify())
```
Output:
```html
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
```
Navigating the parsed data:
```python
print(soup.title)              # the <title> tag and its contents
print(soup.title.name)         # the tag's name
print(soup.title.string)       # the text inside <title>
print(soup.title.parent.name)  # name of <title>'s parent tag (head)
print(soup.p)                  # the first <p> tag
print(soup.p['class'])         # the first <p> tag's class attribute
print(soup.a)                  # the first <a> tag
print(soup.find_all('a'))      # every <a> tag
print(soup.find(id="link3"))   # the tag whose id is 'link3'
```
Output:
```
<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
```
Extracting every link:
```python
for link in soup.find_all('a'):
    print(link.get('href'))
```
Output:
```
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
```
Getting all the text content:
```python
print(soup.get_text())
```
Output:
```
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
```
Completing missing tags and formatting the document:
```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.prettify())              # formats the document, completing missing tags
print(soup.title.string)            # the text inside the <title> tag
```
Tag selectors
Selecting elements
Using the same html document defined above:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title)        # selects the <title> tag
print(type(soup.title))  # its type
print(soup.head)
```
Getting the tag name
```python
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)
```
Getting tag attributes
```python
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])  # value of the name attribute on the first <p>
print(soup.p['name'])        # shorthand for the same lookup
```
Getting tag contents
```python
print(soup.p.string)
```
Nested selection
```python
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)
```
Child and descendant nodes
```python
from bs4 import BeautifulSoup

html = """
<html>
 <head><title>The Dormouse's story</title></head>
 <body>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>
   <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
   and
   <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
   and they lived at the bottom of a well.
  </p>
  <p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)   # the tag's direct children, as a list
```
Another way: the children attribute:
```python
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)   # an iterator over the tag's direct children
for i, child in enumerate(soup.p.children):   # i is the index, child the node
    print(i, child)
```
The output is the same as above, with an index added. Note that children returns an iterator object, not a list, so you must loop over it to see the child nodes.
Getting descendant nodes:
```python
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)   # an iterator over the tag's descendants
for i, child in enumerate(soup.p.descendants):   # i is the index, child the node
    print(i, child)
```
Parent and ancestor nodes
parent
```python
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)   # the tag's direct parent
```
parents
```python
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))   # all ancestors of the tag
```
Sibling nodes
```python
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))      # siblings after the tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the tag
```
Standard selectors
find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content.
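The recursive argument deserves a note: by default find_all descends through every level of the tree, while recursive=False restricts the search to direct children. A minimal sketch (html.parser is used here so the snippet runs without lxml installed):

```python
from bs4 import BeautifulSoup

html = '<div><ul><li>Foo</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.div.find_all('li'))                   # searches all descendants
print(soup.div.find_all('li', recursive=False))  # direct children only: []
```

The li is a grandchild of the div (its direct child is the ul), so the second call finds nothing.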
name
```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
 <div class="panel-heading">
  <h4>Hello</h4>
 </div>
 <div class="panel-body">
  <ul class="list" id="list-1">
   <li class="element">Foo</li>
   <li class="element">Bar</li>
   <li class="element">Jay</li>
  </ul>
  <ul class="list list-small" id="list-2">
   <li class="element">Foo</li>
   <li class="element">Bar</li>
  </ul>
 </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))           # every <ul> tag and its contents
print(type(soup.find_all('ul')[0]))  # the element's type
```
The next example finds all li tags under each ul tag:
```python
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
```
attrs
Finding elements by their attributes:
```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
 <div class="panel-heading">
  <h4>Hello</h4>
 </div>
 <div class="panel-body">
  <ul class="list" id="list-1" name="elements">
   <li class="element">Foo</li>
   <li class="element">Bar</li>
   <li class="element">Jay</li>
  </ul>
  <ul class="list list-small" id="list-2">
   <li class="element">Foo</li>
   <li class="element">Bar</li>
  </ul>
 </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))       # pass a dict of attributes to match
print(soup.find_all(attrs={'name': 'elements'}))
```
Both calls find the same content, because both attributes belong to the same tag.
Some attributes have shortcut keyword arguments:
```python
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))       # id is special and can be passed directly
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_
```
text
Selecting by text content:
Using the panel markup from the name example above:
```python
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))   # matches the text 'Foo', but returns strings, not tags
```
So text is convenient for checking whether content exists, but less convenient for locating the tags that contain it.
Methods
find
find takes exactly the same arguments as find_all, but returns only the first matching element instead of a list of all matches.
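A minimal sketch of the difference (hypothetical markup, html.parser for portability):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('li'))      # first match only: <li class="element">Foo</li>
print(soup.find_all('li'))  # list of every match
print(soup.find('span'))    # no match: find returns None, find_all returns []
```

The None return on a miss is worth remembering: chaining off soup.find(...) raises AttributeError when nothing matched.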
find_parents(), find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
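A short sketch, using hypothetical markup; find_parent can also take a tag name to find the nearest matching ancestor:

```python
from bs4 import BeautifulSoup

html = '<body><p class="story"><a id="link1">Elsie</a></p></body>'
soup = BeautifulSoup(html, 'html.parser')
a = soup.find(id='link1')

print(a.find_parent('p')['class'])         # nearest <p> ancestor: ['story']
print([t.name for t in a.find_parents()])  # every ancestor, innermost first
```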
find_next_siblings(), find_next_sibling()
find_next_siblings() returns all following sibling nodes; find_next_sibling() returns the first following sibling.
find_previous_siblings(), find_previous_sibling()
find_previous_siblings() returns all preceding sibling nodes; find_previous_sibling() returns the first preceding sibling.
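The four sibling methods can be sketched on hypothetical markup like this:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, 'html.parser')
second = soup.find(id='link2')

print(second.find_next_sibling('a')['id'])      # link3
print(second.find_previous_sibling('a')['id'])  # link1
print(len(second.find_next_siblings('a')))      # 1
```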
find_all_next(), find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.
find_all_previous(), find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.
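Unlike the sibling methods, these scan the rest of the document in order, crossing out of the current parent. A sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup

html = '<div><p>first</p></div><div><p>second</p><p>third</p></div>'
soup = BeautifulSoup(html, 'html.parser')
start = soup.p   # the <p>first</p> tag

# <p>second</p> lives in a different <div>, yet find_next still reaches it.
print(start.find_next('p').string)                   # second
print([p.string for p in start.find_all_next('p')])  # ['second', 'third']
```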
CSS selectors
Pass a CSS selector string straight to select() to make a selection.
Using the same panel markup as in the find_all examples above:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # . means class; the space means "descendant of"
print(soup.select('ul li'))                  # li tags inside ul tags
print(soup.select('#list-2 .element'))       # # means id: class=element inside id="list-2"
print(type(soup.select('ul')[0]))            # the node's type
```
Nested selection works the same way:
```python
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
```
Getting attributes
```python
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # index with [ ] to read an attribute
    print(ul.attrs['id'])  # equivalent spelling
```
Getting text content
```python
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
```
The get_text() method returns an element's text content.
Summary
Prefer the lxml parser; fall back to html.parser when necessary.
Tag-attribute navigation (soup.title, soup.p, ...) is fast but offers only weak filtering; use find() and find_all() to match a single result or multiple results.
If you are comfortable with CSS selectors, use select().
Memorize the common ways of reading attributes and text values.