As a web-scraping beginner, I'll be sharing my crawler learning journey day by day, in the hope that it helps anyone who needs it. This is my first blog post and I'm not yet familiar with all the conventions, so please bear with me; I'd also appreciate corrections to anything unreasonable in this article. Thank you!
1. Preparation
Target selection: this crawl targets a representative paginated site, Qiushibaike (糗事百科), and only scrapes the first four pages' titles (the author names) and joke text.
Browser: Chrome
Modules: requests, BeautifulSoup
Determining the URL: observe the URLs across pages to find the rule each page's URL follows. From the list below, the rule is page/<number>/.
https://www.qiushibaike.com/text/
https://www.qiushibaike.com/text/page/2/
https://www.qiushibaike.com/text/page/3/
https://www.qiushibaike.com/text/page/4/
Module imports:
import requests                # HTTP library for fetching pages
from bs4 import BeautifulSoup  # HTML parsing library
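Both modules are third-party; if they aren't installed yet, pip covers it (lxml is the parser used in the parsing step later):
pip install requests beautifulsoup4 lxml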
Getting the request headers (headers): in Chrome, right-click the page and choose Inspect, then open the Network tab and look under Headers to find the request headers.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
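Before looping over pages, it can help to confirm that this User-Agent gets a normal response. A minimal sanity check, assuming the site is reachable and not blocking the request:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
resp = requests.get("https://www.qiushibaike.com/text/", headers=headers)
print(resp.status_code)  # expect 200 if the request was accepted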
2. Crawling
Create a for loop:
import requests                # HTTP library for fetching pages
from bs4 import BeautifulSoup  # HTML parsing library

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
for i in range(1, 5):          # pages 1 through 4
    url = "https://www.qiushibaike.com/text/page/{}/".format(i)
Fetch the page source with requests:
import requests                # HTTP library for fetching pages
from bs4 import BeautifulSoup  # HTML parsing library

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
for i in range(1, 5):
    url = "https://www.qiushibaike.com/text/page/{}/".format(i)
    html = requests.get(url, headers=headers)  # send the GET request
    res = html.text                            # raw HTML of the page
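Note that requests does not raise an error on 4xx/5xx responses by itself, so a slightly more defensive loop body could look like this (raise_for_status() and apparent_encoding are standard requests features; these extra lines are optional hardening, not part of the original script):

    html = requests.get(url, headers=headers)
    html.raise_for_status()                    # abort on 4xx/5xx instead of parsing an error page
    html.encoding = html.apparent_encoding     # guard against a mis-detected text encoding
    res = html.text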
Parse the page:
import requests                # HTTP library for fetching pages
from bs4 import BeautifulSoup  # HTML parsing library

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
for i in range(1, 5):
    url = "https://www.qiushibaike.com/text/page/{}/".format(i)
    html = requests.get(url, headers=headers)
    res = html.text
    soup = BeautifulSoup(res, 'lxml')          # build a parse tree with the lxml parser
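The 'lxml' parser relies on the third-party lxml package; if it isn't installed, BeautifulSoup's built-in parser is a drop-in replacement (slower, but needs no extra install):

    soup = BeautifulSoup(res, 'html.parser')   # built-in fallback parser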
Scrape the authors and the joke content in order:
import requests                # HTTP library for fetching pages
from bs4 import BeautifulSoup  # HTML parsing library

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
for i in range(1, 5):
    url = "https://www.qiushibaike.com/text/page/{}/".format(i)
    html = requests.get(url, headers=headers)
    res = html.text
    soup = BeautifulSoup(res, 'lxml')
    titles = soup.select("a h2")               # select the tags holding the author names
    contents = soup.select(".content span")    # select the tags holding the joke text
    for a in titles:
        print(a.get_text())                    # output the text only, no tags
    for b in contents:
        print(b.get_text())
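Printed this way, all the authors on a page come out first, followed by all the jokes, so the two lists end up separated. One way to keep each author next to their joke is zip(), assuming the page yields one author tag per content tag (an optional tweak, not from the original post):

    for a, b in zip(titles, contents):
        print(a.get_text().strip())            # author
        print(b.get_text().strip())            # joke text
        print('-' * 30)                        # separator between items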