Scraping od links from 忧郁的loli

Petra · Updated: 2024-09-20 · 953 views

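Perhaps because 忧郁的loli is too niche, searching online for a crawler for it turns up almost nothing. I found one on GitHub that uses selenium, but the site is small and its server is slow, and selenium itself slows browsing further, so crawling was very slow. I tried making selenium skip image loading, but the speed was still painful, so I decided to write my own.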
I am a beginner (and this is my first blog post). If you spot anything silly in the code, please point it out. Thanks.

Explanation

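The code below was written on the premise that, on the 获取国际链接 ("get international link") page, the request that fetches the download link cannot be found through the browser's Network panel. I found that the link used to obtain the download link is composed as "https://od.hhgal.com/ + game name + download file", where the download file is "game name + .rar" or "game name + part%d.rar". In the code, the pieces are joined as "game name/game name.rar", or "game name/game name.part%d.rar" for multi-part archives.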
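The composition rule can be sketched as a small helper. This is only a minimal sketch, not the author's code: `build_od_urls` and its parameters are hypothetical names, and the `/` separator and `.partN.rar` suffix are assumptions taken from how the script in this post joins the pieces.

```python
from urllib.parse import quote

OD_BASE = "https://od.hhgal.com/"  # base URL quoted in the text above


def build_od_urls(title, part_count=0):
    """Return the candidate request URLs for one game (hypothetical helper).

    part_count == 0 means a single "<title>.rar" archive; otherwise the
    archives are assumed to be "<title>.part1.rar" ... "<title>.part<N>.rar".
    """
    # Percent-encode the "<title>/<title>" prefix; quote() leaves "/" intact.
    prefix = OD_BASE + quote(title + "/" + title)
    if part_count == 0:
        return [prefix + ".rar"]
    return [prefix + ".part%d.rar" % i for i in range(1, part_count + 1)]


if __name__ == "__main__":
    for candidate in build_od_urls("demo", 2):
        print(candidate)
```

Keeping the URL construction in a pure function like this makes it easy to check the rule against a few known games before sending any requests.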

I just tested this on a server and found that the headers have to be adjusted as needed before the program will run.

Approach

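1. Locate elements with XPath and scrape each game's page link and title from the main index pages.
2. Visit each game page and scrape the file description to get the number of archive parts, in preparation for composing the links.
3. Send a request to the link that fetches the download link. This request is redirected to the actual download link, so the download link can be read from the Location header of the redirect response.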
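Step 3 can be illustrated without following the redirect at all. The sketch below is not the author's method: it swaps in the standard-library `http.client`, which never follows redirects on its own, and `fetch_redirect_target` is a hypothetical name. With `requests`, passing `allow_redirects=False` and reading the `Location` header of the response should give the same information, and it also avoids downloading the body of the redirect target.

```python
import http.client
from urllib.parse import urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def fetch_redirect_target(url, headers=None, timeout=30):
    """Send one GET request and return the Location header if the
    server answers with a redirect, otherwise None (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        # Rebuild the request target from the path and optional query string.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target, headers=headers or {})
        response = conn.getresponse()
        if response.status in REDIRECT_CODES:
            return response.getheader("Location")
        return None
    finally:
        conn.close()


# Usage (makes a real network request, so it is left commented out):
# print(fetch_redirect_target("https://od.hhgal.com/...", headers={"Referer": "..."}))
```

The URL passed in must already be percent-encoded, for example with `urllib.parse.quote` as the script in this post does.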

Code implementation

    import requests
    import urllib.parse
    from lxml import etree


    def judge_part(game_url, headers2):
        """Return the number of archive parts listed on a game page (0 = single .rar)."""
        game_page = requests.get(game_url, headers=headers2)
        html = etree.HTML(game_page.content.decode())
        files = html.xpath('//div[@class = "alert alert-info"]/span')[0]
        files_text = ''.join(files.itertext())
        if 'part' in files_text:
            # The last "partN" before the final "MD5" gives the number of parts
            part = files_text.split('MD5')[-2]
            part = int(part.split('part')[-1])
            return part
        else:
            return 0


    def get_download_url(title, part, headers3):
        """Request each candidate link and append the redirect target to record.txt."""
        url2 = 'https://od.hhgal.com/' + urllib.parse.quote(title + '/' + title)
        headers3["Referer"] = 'https://od.hhgal.com/' + urllib.parse.quote(title)
        success = 1
        with open('record.txt', 'a', encoding='utf-8') as f:
            f.write(title + '\n')
            if part == 0:
                each_url = url2 + urllib.parse.quote('.rar')
                try:
                    res = requests.get(each_url, headers=headers3)
                    # The first response in the history is the redirect itself
                    location = res.history[0].headers['location']
                    f.write(title + '.rar:' + location + '\n\n')
                except (requests.RequestException, IndexError, KeyError):
                    f.write(title + '.rar:' + 'GET_FAILED\n\n')
                    success = 0
            else:
                for i in range(1, part + 1):
                    each_url = url2 + urllib.parse.quote('.part%d.rar' % i)
                    try:
                        res = requests.get(each_url, headers=headers3)
                        location = res.history[0].headers['location']
                        f.write(title + '.part%d.rar:' % i + location + '\n')
                    except (requests.RequestException, IndexError, KeyError):
                        f.write(title + '.part%d.rar:' % i + 'GET_FAILED\n')
                        success = 0
                f.write('\n')
        return success


    url = 'https://www.hhgal.com/page/{}/'

    # Headers for the main index pages
    headers1 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Cookie": "simplefavorites=%5B%7B%22site_id%22%3A1%2C%22posts%22%3A%5B28759%5D%2C%22groups%22%3A%5B%7B%22group_id%22%3A1%2C%22site_id%22%3A1%2C%22group_name%22%3A%22Default+List%22%2C%22posts%22%3A%5B28759%5D%7D%5D%7D%5D; security_session_verify=25b6b420babdd4f283942a39ea7654b5; security_session_mid_verify=90f5f4ee1476480467ab9f633ee3fa3a; wpfront-notification-bar-landingpage=1; wordpress_test_cookie=WP+Cookie+check; PHPSESSID=7jrmhukq2dp2mgqjqde4q2o0hi",
        "Host": "www.hhgal.com",
        "Referer": "https://www.hhgal.com/",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
    }

    # Headers for the individual game pages
    headers2 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Cookie": "simplefavorites=%5B%7B%22site_id%22%3A1%2C%22posts%22%3A%5B28759%5D%2C%22groups%22%3A%5B%7B%22group_id%22%3A1%2C%22site_id%22%3A1%2C%22group_name%22%3A%22Default+List%22%2C%22posts%22%3A%5B28759%5D%7D%5D%7D%5D; security_session_verify=25b6b420babdd4f283942a39ea7654b5; security_session_mid_verify=90f5f4ee1476480467ab9f633ee3fa3a; wpfront-notification-bar-landingpage=1; wordpress_test_cookie=WP+Cookie+check; PHPSESSID=5oi82kfhndapa93l1e72m9pat8",
        "Host": "www.hhgal.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
    }

    # Headers for requesting the download links
    headers3 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Host": "od.hhgal.com",
        "Referer": "",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-site",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
    }

    # Empty the record file left over from a previous run
    open('record.txt', 'w', encoding='utf-8').close()

    for page in range(1, 91):
        print('Go to page %d' % page)
        # Testing showed that only the first page does not follow the paging URL rule
        if page == 1:
            response = requests.get('https://www.hhgal.com/', headers=headers1)
        else:
            response = requests.get(url.format(page), headers=headers1)
        text = response.content.decode()
        tree = etree.HTML(text)
        # Game titles
        titles = tree.xpath('//div[@class = "article well clearfix mybody3"]//h1/a/span[@class="animated_h1"]')
        titles = [i.text for i in titles]
        # Game page links
        game_urls = tree.xpath('//div[@class = "article well clearfix mybody3"]//h1/a')
        game_urls = [i.attrib['href'] for i in game_urls]
        print('Main page fetched')
        for title, game_url in zip(titles, game_urls):
            # Skip the site's changelog post, which is not a game
            if title == '详细更新日志':
                continue
            part = judge_part(game_url, headers2)
            print('Part count fetched')
            if get_download_url(title, part, headers3):
                print(title, ':SUCCEED')
            else:
                print(title, ':FAILED')

Extension ideas

1. Multithreaded crawling.
2. Resumable crawling: reconnect after an interruption and continue from where it stopped.
3. Import the collected resources into a database.
4. Optimize crawl speed for small sites.
5. Crawl the high-speed links. These contain a 4-digit number whose pattern I could not work out. Finding the pattern would be best. Failing that, brute force could be tried, but only after the optimization in item 4, because that redirect link is really slow.
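The first two extension ideas could be combined roughly as follows. This is a sketch under assumptions rather than a tested crawler: `crawl_all`, `load_done`, the `done.json` checkpoint file and the `fetch_one` callable are all hypothetical names, and `fetch_one(title)` stands in for whatever function fetches one game's links and returns True on success.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def load_done(path):
    """Load the set of titles finished in earlier runs (empty if no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return set(json.load(f))


def crawl_all(titles, fetch_one, checkpoint="done.json", workers=4):
    """Run fetch_one(title) in a thread pool, skipping titles already done.

    fetch_one is a hypothetical callable that returns True on success.
    Returns the set of all titles completed so far.
    """
    done = load_done(checkpoint)
    todo = [t for t in titles if t not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as todo
        for title, ok in zip(todo, pool.map(fetch_one, todo)):
            if ok:
                done.add(title)
                # Rewrite the checkpoint after every success so that an
                # interrupted run can resume from here
                with open(checkpoint, "w", encoding="utf-8") as f:
                    json.dump(sorted(done), f, ensure_ascii=False)
    return done
```

Failed titles are not recorded, so a later run retries them. Because the site is small and slow, a low `workers` value is probably the safer choice. Any function passed as `fetch_one` should also avoid writing to one shared file from several threads without a lock.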

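Notes

The redirect link is slow, so please be patient after starting the crawler.
Cases in which fetching fails:
1. A small number of link-fetching URLs do not follow the naming rule.
2. Some files are named part07 rather than part7.
3. The file name does not follow the naming rule.
4. Some games have no od link at all.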
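Failure case 2 might be reduced by trying more than one spelling of the part number. The helper below is a hypothetical sketch that assumes the only variation is zero-padding of the number. It does not address the other failure cases.

```python
def part_suffix_candidates(index, max_width=2):
    """Return suffixes such as ".part7.rar" and ".part07.rar" for one part.

    max_width is the largest zero-padded width to try (an assumption).
    Duplicates are dropped, so index 12 yields only ".part12.rar".
    """
    candidates = []
    for width in range(1, max_width + 1):
        suffix = ".part%s.rar" % str(index).zfill(width)
        if suffix not in candidates:
            candidates.append(suffix)
    return candidates


if __name__ == "__main__":
    print(part_suffix_candidates(7))
```

Each candidate costs an extra request against a slow server, so it seems reasonable to try the unpadded form first and fall back to the padded one only when the first request fails.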


Author: smallfishoil


