MDPI Scraping Demo: Saving 'title_link', 'author_list', 'cited_by', 'viewed_by' to a CSV File

Hazel · Updated: 2024-11-13 · 614 views

URL:

https://www.mdpi.com/search?sort=article_citedby&page_no=0&page_count=50&year_from=1996&year_to=2020&journal=cells&view=default

Page: [screenshot]

Demo:

# encoding: utf-8
"""
@author: lanxiaofang
@contact: fang@lanxf.cn
@software: PyCharm
@file: only_for_test.py
@time: 2020/5/2 17:19
"""
import random
import re
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup


class Cell(object):
    COLUMNS = ['title_link', 'author_list', 'cited_by', 'viewed_by']
    SEARCH_URL = ('https://www.mdpi.com/search?sort=article_citedby&page_no={}&page_count=50'
                  '&year_from=1996&year_to=2020&journal=cells&view=default')

    def __init__(self):
        self.header = {
            'Accept': 'application/json, text/plain, */*',
            # 'br' is left out: requests can only decode Brotli when an extra package is installed
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Referer': self.SEARCH_URL.format(0),
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; de) Opera 11.01',
        }
        self.path = 'cell_info.csv'
        # Empty the file first. If the file is open in another program, close it
        # beforehand, otherwise this fails with an "access denied" error.
        with open(self.path, 'w', encoding='utf-8') as f:
            f.truncate()
        # One dict per article. DataFrame.append was removed in pandas 2.0,
        # so rows are collected in a list and converted once when saving.
        self.rows = []

    def save_data(self):
        df = pd.DataFrame(self.rows, columns=self.COLUMNS)
        df.to_csv(self.path, index=False, mode='a', encoding='utf-8')
        print('----------save success')

    def get_data(self, start, end):
        for page_no in range(start, end):
            time.sleep(random.uniform(1, 5))  # random pause between requests
            try:
                html = requests.get(self.SEARCH_URL.format(page_no), headers=self.header, timeout=30)
                html.raise_for_status()
            except requests.RequestException as e:
                print('INFO *** ', e)
                continue
            html.encoding = 'utf-8'
            soups = BeautifulSoup(html.text, 'html.parser')
            # Visit every result on the page (the original indexed the result
            # list with the page number, so it read only one article per page).
            for article in soups.find_all('div', class_='article-content'):
                try:
                    title_link = article.find('a', class_='title-link').get_text(strip=True)
                    authors = article.select('div.authors span.inlineblock a')
                    author_list = [item.get_text(strip=True) for item in authors]
                    cited_by = 0
                    for c in article.find_all('a'):
                        text = c.get_text(strip=True)
                        if text.startswith('Cited by'):
                            cited_by = int(text[9:])  # [9:] keeps everything from index 9 to the end
                            break
                    viewed_by = 0
                    # A regex tolerates the variable whitespace around the view count
                    viewed = re.search(r'Viewed by\s*(\d+)', article.get_text())
                    if viewed:
                        viewed_by = int(viewed.group(1))
                    print(page_no, '-cited_by-', cited_by, '-viewed_by-', viewed_by,
                          '-title_link-', title_link, '-author_list-', author_list)
                    self.rows.append({'title_link': title_link, 'author_list': author_list,
                                      'cited_by': cited_by, 'viewed_by': viewed_by})
                except Exception as e:
                    print('INFO *** ', e)
        self.save_data()


if __name__ == '__main__':
    cell = Cell()
    cell.get_data(0, 10)  # scrapes page_no 0 through 9; raise the end value to fetch more pages

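① To slice a string from index i to the end: str[i:]
② Empty the file before saving data: f.truncate(8) keeps only the first 8 bytes of the file. f.truncate(0), or f.truncate() with no argument on a file that was just opened (position 0), empties the file.
For learning and reference only; performance has not been optimized.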
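The two notes above can be tried in isolation. The following is a minimal sketch, not part of the original demo: the sample string and the scratch file name `truncate_demo.txt` are made up for illustration.

```python
import os
import tempfile

# Note 1: slicing from index i to the end of the string
label = 'Cited by 128'          # made-up sample text
count = int(label[9:])          # characters from index 9 onward, here '128'

# Note 2: truncate(size) keeps only the first `size` bytes of the file
path = os.path.join(tempfile.mkdtemp(), 'truncate_demo.txt')  # hypothetical scratch file
with open(path, 'w', encoding='utf-8') as f:
    f.write('0123456789abcdef')  # 16 ASCII characters, so 16 bytes

with open(path, 'r+', encoding='utf-8') as f:
    f.truncate(8)                # file now holds '01234567'
size_after_8 = os.path.getsize(path)

with open(path, 'r+', encoding='utf-8') as f:
    f.truncate(0)                # file is now empty
size_after_0 = os.path.getsize(path)
```

Opening a file in 'w' mode already empties it, so an explicit truncate call after such an open is harmless but not strictly needed.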
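Fixed-offset slicing such as str[i:] breaks as soon as the whitespace around a label changes. As one possible alternative, a regular expression can pull out the number that follows a label. This is only a sketch under assumptions: the helper name `extract_count` is hypothetical, and the sample text is invented rather than copied from MDPI.

```python
import re


def extract_count(text, label):
    """Return the integer that follows `label` in `text`, or 0 if it is absent."""
    match = re.search(re.escape(label) + r'\s*(\d+)', text)
    return int(match.group(1)) if match else 0


# Invented sample text with irregular spacing, not real MDPI output
sample = 'Cited by 128 |\n   Viewed by   4521'
cited_by = extract_count(sample, 'Cited by')
viewed_by = extract_count(sample, 'Viewed by')
missing = extract_count(sample, 'Downloaded by')  # label not present, falls back to 0
```

Returning 0 for a missing label mirrors the demo, which initializes both counters to 0 before searching.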

 
                    


Author: 懒笑翻
                    
 
                
