As a web-scraping beginner, I'll be sharing my crawler learning journey day by day, in the hope that it helps anyone who needs it. This is my first blog post and I'm not yet familiar with all the conventions, so please bear with me; I'd also appreciate corrections to anything unreasonable in this article. Thank you!
1. Preparation
Target selection: this crawl targets a representative paginated site, Qiushibaike (糗事百科), and only scrapes the first four pages' titles (the author names) and joke text.
Browser: Chrome
Modules: requests, BeautifulSoup
Determining the URL: observe the URLs across pages to find the rule each page's URL follows. From the list below, the rule is page/<number>/.
https://www.qiushibaike.com/text/
https://www.qiushibaike.com/text/page/2/
https://www.qiushibaike.com/text/page/3/
https://www.qiushibaike.com/text/page/4/
Module imports:
import requests                # HTTP library for fetching pages
from bs4 import BeautifulSoup  # HTML parsing library
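Both modules are third-party; if they aren't installed yet, pip covers it (lxml is the parser used in the parsing step later):
pip install requests beautifulsoup4 lxml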
Getting the request headers (headers): in Chrome, right-click the page and choose Inspect, then open the Network tab and look under Headers to find the request headers.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
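Before looping over pages, it can help to confirm that this User-Agent gets a normal response. A minimal sanity check, assuming the site is reachable and not blocking the request:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
resp = requests.get("https://www.qiushibaike.com/text/", headers=headers)
print(resp.status_code)  # expect 200 if the request was accepted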
2. Crawling
Create a for loop:
import requests                # HTTP library for fetching pages
from bs4 import BeautifulSoup  # HTML parsing library

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
for i in range(1, 5):          # pages 1 through 4
    url = "https://www.qiushibaike.com/text/page/{}/".format(i)
Fetch the page source with requests:
import requests                # HTTP library for fetching pages
from bs4 import BeautifulSoup  # HTML parsing library

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
for i in range(1, 5):
    url = "https://www.qiushibaike.com/text/page/{}/".format(i)
    html = requests.get(url, headers=headers)  # send the GET request
    res = html.text                            # raw HTML of the page
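Note that requests does not raise an error on 4xx/5xx responses by itself, so a slightly more defensive loop body could look like this (raise_for_status() and apparent_encoding are standard requests features; these extra lines are optional hardening, not part of the original script):

    html = requests.get(url, headers=headers)
    html.raise_for_status()                    # abort on 4xx/5xx instead of parsing an error page
    html.encoding = html.apparent_encoding     # guard against a mis-detected text encoding
    res = html.text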
Parse the page:
import requests                # HTTP library for fetching pages
from bs4 import BeautifulSoup  # HTML parsing library

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
for i in range(1, 5):
    url = "https://www.qiushibaike.com/text/page/{}/".format(i)
    html = requests.get(url, headers=headers)
    res = html.text
    soup = BeautifulSoup(res, 'lxml')          # build a parse tree with the lxml parser
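The 'lxml' parser relies on the third-party lxml package; if it isn't installed, BeautifulSoup's built-in parser is a drop-in replacement (slower, but needs no extra install):

    soup = BeautifulSoup(res, 'html.parser')   # built-in fallback parser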
Scrape the authors and the joke content in order:
import requests                # HTTP library for fetching pages
from bs4 import BeautifulSoup  # HTML parsing library

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
for i in range(1, 5):
    url = "https://www.qiushibaike.com/text/page/{}/".format(i)
    html = requests.get(url, headers=headers)
    res = html.text
    soup = BeautifulSoup(res, 'lxml')
    titles = soup.select("a h2")               # select the tags holding the author names
    contents = soup.select(".content span")    # select the tags holding the joke text
    for a in titles:
        print(a.get_text())                    # output the text only, no tags
    for b in contents:
        print(b.get_text())
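Printed this way, all the authors on a page come out first, followed by all the jokes, so the two lists end up separated. One way to keep each author next to their joke is zip(), assuming the page yields one author tag per content tag (an optional tweak, not from the original post):

    for a, b in zip(titles, contents):
        print(a.get_text().strip())            # author
        print(b.get_text().strip())            # joke text
        print('-' * 30)                        # separator between items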