[Web Crawler] Crawling Personal Google Scholar Data

Li Shen

First, a quick note on how the i10-index and h-index are calculated:


Suppose you have a list of citation counts:

a = [37, 23, 13, 9, 5, 4, 3, 2, 1, 1, 1, 0, 0]


The i10-index is simply the number of your papers with at least 10 citations, so a plain count is enough:

a = [37, 23, 13, 9, 5, 4, 3, 2, 1, 1, 1, 0, 0]
t = 0

# count the papers with at least 10 citations
for c in a:
    if c >= 10:
        t += 1

print(t)
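
The same count also fits in one line with a generator expression:

print(sum(1 for c in a if c >= 10))  # 3 for the list above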

The h-index means that you have x papers each cited at least x times. That sounds a bit circular, but in practice it is simple: sort the papers by citation count from high to low and walk down the list; the first time a paper's rank (ranks start at 1, not 0) is greater than its citation count, the rank minus one is the h-index. If that never happens, the h-index is simply the total number of papers.

a = [37, 23, 13, 9, 5, 4, 3, 2, 1, 1, 1, 0, 0]

h = 0
# a is already sorted from most to least cited: keep counting while the paper
# at 0-based index i still has more than i citations, then stop.
for i, c in enumerate(a):
    if c > i:
        h += 1
    else:
        break

print(h)
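
Because the counts are sorted in descending order, the condition can only flip from true to false once, so the h-index is also just the number of positions where the citation count beats the 0-based rank (sorting defensively here in case the input list is not ordered):

print(sum(c > i for i, c in enumerate(sorted(a, reverse=True))))  # 5 for the list above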


Recently I have been using BeautifulSoup to crawl my Google Scholar page, storing the results in a database, running some simple analysis on the backend, and displaying the results on the frontend. Metrics like the h-index can be crawled directly, but it was still nice to work out how the h-index is actually computed along the way. Learned something new~

My simple Google Scholar crawling code is attached below. Python does have a library called scholarly made specifically for crawling Google Scholar, but it is not very full-featured and I did not find it convenient. Hopefully Google opens up an official Google Scholar API at some point.....
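
For reference, fetching the same numbers with scholarly looks roughly like the sketch below. This assumes a recent 1.x release of the library; its interface has changed between versions, so the function and field names here may not match the version you have installed. The profile ID is the same one used in the script further down.

from scholarly import scholarly

# look the author up by Scholar profile ID, then populate metrics and papers
author = scholarly.search_author_id('pyTG14gAAAAJ')
author = scholarly.fill(author, sections=['indices', 'publications'])

print(author['citedby'], author['hindex'], author['i10index'])
for pub in author['publications'][:5]:
    print(pub['bib']['title'], pub.get('num_citations', 0))

The BeautifulSoup-based script that I actually run once a day follows.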

import time
from bs4 import BeautifulSoup
import re
import requests
from pandas.core.frame import DataFrame
from sqlalchemy import create_engine
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr

a = 365  # number of days to keep the crawler running

while a > 0:
    try:

        header = {
            'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20'}  # pretend to be a regular browser
        website = requests.get("https://scholar.google.com/citations?user=pyTG14gAAAAJ&hl=en", headers=header)  # fetch the profile page

        website_content = BeautifulSoup(website.content, 'lxml')  # parse the page HTML

        my_stats = website_content.find_all(class_="gsc_rsb_std")  # citation count, h-index and i10 numbers
        paper_ti = website_content.find_all(class_="gsc_a_at")  # titles of the top 20 most-cited papers
        paper_author_journal = website_content.find_all(class_="gs_gray")  # authors and journal of each paper
        paper_date = website_content.find_all(class_="gs_oph")  # publication year of each paper
        paper_citation = website_content.find_all(class_="gsc_a_ac gs_ibl")  # citation count of each paper

        ti_list = []
        t_list = []
        a_list = []
        j_list = []
        d_list = []
        c_list = []
        stat_list = []
        # Pull the text out of each element with a regex, collect it into lists, then turn the lists into DataFrames and write them to the database
        for ti in paper_ti:
            ti = str(ti)
            ti = re.findall(r'[>](.*?)[<]', ti)
            n_ti = ''.join(ti)
            ti_list.append(n_ti)

        for t in paper_author_journal:
            t = str(t)
            t = re.findall(r'[>](.*?)[<]', t)
            n_t = ''.join(t)
            t_list.append(n_t)

        # the gs_gray elements alternate for every paper: author line first, journal line second
        for au in t_list[::2]:
            a_list.append(au)
        for jour in t_list[1::2]:
            j_list.append(jour)

        for year in paper_date:
            year = str(year)
            year = re.findall(r'[,](.*?)[<]', year)
            n_year = ''.join(year)
            d_list.append(n_year)

        for cite in paper_citation:
            cite = str(cite)
            cite = re.findall(r'[>](.*?)[<]', cite)
            n_cite = ''.join(cite)
            if n_cite != '':
                n_cite = int(n_cite)
                c_list.append(n_cite)
            else:
                c_list.append(0)

        for stats in my_stats:
            stats = str(stats)
            stats = re.findall(r'[>](.*?)[<]', stats)
            n_stats = ''.join(stats)
            n_stats = int(n_stats)
            stat_list.append(n_stats)

        ids = list(range(len(ti_list)))
        n_ids = list(map(lambda x: x + 1, ids))

        data = {"id": n_ids, "Paper_ti": ti_list, "Paper_authors": a_list, "Paper_journal": j_list,
                "Paper_date": d_list, "Paper_citation": c_list}
        dataframe = DataFrame(data)

        stat_data = {"id": 1, "total_cite": stat_list[0], "total_hidx": stat_list[2], "total_i10": stat_list[4],
                     "cite_since_last5yr": stat_list[1], "hidx_since_last5yr": stat_list[3],
                     "i10_since_last5yr": stat_list[5]}
        stat_dataframe = DataFrame(stat_data, index=[0])

        engine = create_engine("mysql+pymysql://admin:PASSWORD@localhost/MY_DATABASE")
        dataframe.to_sql(name="myblog_mygooglescholar", con=engine, if_exists='replace', index=False)
        stat_dataframe.to_sql(name="myblog_googlescholarstats", con=engine, if_exists='replace', index=False)

        # Run on a schedule (set to crawl once a day) and log whether the crawl succeeded
        save_time = time.asctime()

        f = open("success.txt", 'a+')
        f.write(str(a) + ". Data successfully crawled!" + " ###### " + save_time + '\n')
        f.close()
        time.sleep(86400)
        a -= 1

    except Exception:
        save_time = time.asctime()

        f = open("error.txt", 'a+')
        f.write(str(a) + ". Error!" + " ###### " + save_time + '\n')
        f.close()
        time.sleep(86400)
        a -= 1

    # Email reminder: when 10 days are left, it is time to update the script. (email address and password replaced with placeholders)
    sender = 'xxxxxxxx@xx.com'
    psd = 'xxxxxxxxxxxxxxxx'
    smtp_server = 'smtp.xxxx.com'
    receiver = 'xxxxxxxx@xxxxxx.com'

    message = MIMEText("Note: 10 days left before the script needs to be updated.", 'plain', 'utf-8')
    message['From'] = formataddr(["Li Shen Blog", sender])
    message['To'] = formataddr(["Li Shen", receiver])
    message['Subject'] = "Blog crawler script update reminder!"

    if a == 10:
        try:
            save_time = time.asctime()

            server = smtplib.SMTP_SSL(smtp_server, 465)
            server.login(sender, psd)
            server.sendmail(sender, receiver, message.as_string())
            f = open("success.txt", 'a+')
            f.write(str(a) + ". Alert email has been sent!" + " ###### " + save_time + '\n')
            f.close()
        except smtplib.SMTPException:
            save_time = time.asctime()

            f = open("error.txt", 'a+')
            f.write(str(a) + ". Error!" + " ###### " + save_time + '\n')
            f.close()
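
For the backend analysis and frontend display mentioned above, the two tables the crawler writes can simply be read back with pandas. A minimal sketch, reusing the same connection string and table names as in the script (adjust them to your own setup):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://admin:PASSWORD@localhost/MY_DATABASE")

# the two tables written by the crawler above
papers = pd.read_sql("SELECT * FROM myblog_mygooglescholar", engine)
stats = pd.read_sql("SELECT * FROM myblog_googlescholarstats", engine)

print(stats[["total_cite", "total_hidx", "total_i10"]])
print(papers.sort_values("Paper_citation", ascending=False).head(10))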



July 27, 2019, 3:18 a.m.

Comments:


Li Shen

Of course, instead of an email reminder followed by a manual update, you can also make the counter reset itself automatically, something like:

a = 365
while a > 0:
    ...
    ...
    a -= 1
    if a == 10:
        a += 355