Scraping Google Scholar

Scraping Total Citations

The following snippet of code retrieves the total number of citations shown on a user's profile. Replace [USERNAME] with the profile username (found in the Scholar page URL, after user=) and [LANGUAGE] with the language you wish to use (en = English). The value can then be embedded in your website, on a button or inside the text as you wish. WordPress snippets make this easy: copy the code below into a new snippet and select the option “only display when inserted into a post or page”. A small shortcode will then be available to embed in your webpage that looks something like {code_snippet id=[id] name = [name] php format} (note: the curly brackets should be replaced with square brackets; I used curly ones to stop WordPress from trying to run the snippet!).

				
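<?php

// Replace [USERNAME] and [LANGUAGE] before use.
$profile = "https://scholar.google.com/citations?user=[USERNAME]&hl=[LANGUAGE]&oi=ao";

// file_get_contents() returns false if the request fails.
$contents = file_get_contents($profile);

// Second cell of the first row of the statistics table: total citations.
$citations_xpath = '//*[@id="gsc_rsb_st"]/tbody/tr[1]/td[2]';

$value = '';
if ($contents !== false && $contents !== '') {
    $dom = new DOMDocument();
    // @ suppresses the warnings libxml raises on real-world markup.
    @$dom->loadHTML($contents);

    $xpath = new DOMXPath($dom);
    $citations = $xpath->query($citations_xpath);

    // item(0) is null when the cell is missing, e.g. after a layout change.
    if ($citations !== false && $citations->item(0) !== null) {
        $value = $citations->item(0)->nodeValue;
    }
}

echo $value;

?>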
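If you would rather not edit the URL string by hand, the placeholder substitution can be wrapped in a small function. This is only a sketch: the function name `scholar_profile_url` and the ID `EXAMPLE_ID` are made up for illustration, and the query parameters are simply the ones used in the snippet above.

```php
<?php
// Hypothetical helper; the name and signature are illustrative only.
function scholar_profile_url(string $username, string $language = 'en'): string
{
    // http_build_query() URL-encodes each value; the separator is passed
    // explicitly so the result does not depend on the server's ini settings.
    $query = http_build_query([
        'user' => $username,
        'hl'   => $language,
        'oi'   => 'ao',
    ], '', '&');

    return 'https://scholar.google.com/citations?' . $query;
}

// "EXAMPLE_ID" is a made-up placeholder, not a real profile.
echo scholar_profile_url('EXAMPLE_ID', 'en');
```

The returned string can be passed to file_get_contents() in place of the hard-coded $profile value.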
				
			
Scraping H-Index

As before, the following code snippet acquires the h-index from the profile. Both [USERNAME] and [LANGUAGE] should be replaced.

				
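<?php

// Replace [USERNAME] and [LANGUAGE] before use.
$profile = "https://scholar.google.com/citations?user=[USERNAME]&hl=[LANGUAGE]&oi=ao";

// file_get_contents() returns false if the request fails.
$contents = file_get_contents($profile);

// Second cell of the second row of the statistics table: h-index.
$hindex_xpath = '//*[@id="gsc_rsb_st"]/tbody/tr[2]/td[2]';

$value = '';
if ($contents !== false && $contents !== '') {
    $dom = new DOMDocument();
    // @ suppresses the warnings libxml raises on real-world markup.
    @$dom->loadHTML($contents);

    $xpath = new DOMXPath($dom);
    $hindex = $xpath->query($hindex_xpath);

    // item(0) is null when the cell is missing, e.g. after a layout change.
    if ($hindex !== false && $hindex->item(0) !== null) {
        $value = $hindex->item(0)->nodeValue;
    }
}

echo $value;

?>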
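Since the citation count and the h-index sit in the same table, both can be read from a single copy of the page instead of downloading it twice. The sketch below assumes the row positions used in the two snippets above (row 1 = citations, row 2 = h-index); the function name `scholar_stats` and the sample markup are made up so the example runs without contacting Google Scholar.

```php
<?php
// Hypothetical helper: reads both values from one copy of the profile HTML.
function scholar_stats(string $html): array
{
    $stats = ['citations' => null, 'hindex' => null];
    if ($html === '') {
        return $stats; // nothing to parse
    }

    $dom = new DOMDocument();
    // @ suppresses the warnings libxml raises on real-world markup.
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // Row positions assumed from the snippets above.
    $rows = ['citations' => 1, 'hindex' => 2];
    foreach ($rows as $name => $row) {
        $nodes = $xpath->query('//*[@id="gsc_rsb_st"]/tbody/tr[' . $row . ']/td[2]');
        // item(0) is null when the cell is missing, e.g. after a layout change.
        $node = $nodes !== false ? $nodes->item(0) : null;
        if ($node !== null) {
            $stats[$name] = trim($node->nodeValue);
        }
    }
    return $stats;
}

// Made-up sample markup standing in for a downloaded profile page.
$sample = '<table id="gsc_rsb_st"><tbody>'
        . '<tr><td>Citations</td><td>120</td><td>95</td></tr>'
        . '<tr><td>h-index</td><td>6</td><td>5</td></tr>'
        . '</tbody></table>';

$stats = scholar_stats($sample);
echo ($stats['citations'] ?? 'n/a') . ' / ' . ($stats['hindex'] ?? 'n/a');
```

On a live page you would pass the result of file_get_contents() instead of $sample, after checking that it is not false.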
				
			
Scraping All Article Information

The following block of code extracts all the information required to construct the table that can be found in my portfolio: recent articles, authors, citations, etc. As before, [USERNAME] and [LANGUAGE] should be replaced. In addition, [SORT] can take either pubdate for the most recent papers or citedby for the most cited papers. By putting the key author's name in [KEY AUTHOR], you can make that name bold wherever it appears; e.g. for me it would be: $keyAuthor = "B Wooding".

				
					<?php

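$baseUrl = "https://scholar.google.com";
// Replace [USERNAME], [LANGUAGE] and [SORT] (pubdate or citedby) before use.
$profile = "/citations?user=[USERNAME]&view_op=list_works&hl=[LANGUAGE]&sortby=[SORT]";

$contents        = file_get_contents($baseUrl.$profile);
$citations_xpath = '//*[@id="gsc_a_b"]';

$dom = new DOMDocument();
// Fall back to an empty page if the request failed, so loadHTML() still gets a string.
@$dom->loadHTML($contents ?: '<html></html>');

$xpath = new DOMXPath($dom);

$table = $xpath->query($citations_xpath);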

// Escape scraped text before it is placed in the page.
$esc = function ($text) {
    return htmlspecialchars(trim($text), ENT_QUOTES, 'UTF-8');
};

$keyAuthor = "[KEY AUTHOR]";

$records = [];
foreach ($table as $row) {
    $trs = $row->getElementsByTagName('tr');

    foreach ($trs as $tr) {
        $tds  = $tr->getElementsByTagName('td');
        $td   = $tds->item(0);
        $link = $td ? $td->getElementsByTagName('a')->item(0) : null;
        if ($link === null) {
            continue; // not an article row
        }
        $divs = $td->getElementsByTagName('div');

        // Bold the key author wherever the name appears.
        $authors = $divs->item(0) ? $esc($divs->item(0)->nodeValue) : '';
        $authors = str_replace($esc($keyAuthor), '<strong>' . $esc($keyAuthor) . '</strong>', $authors);

        $records[] = [
            'title'      => $esc($link->nodeValue),
            'titleHref'  => $esc($baseUrl . $link->getAttribute('href')),
            'authors'    => $authors,
            'conference' => $divs->item(1) ? $esc($divs->item(1)->nodeValue) : '',
            'cites'      => $tds->item(1) ? $esc($tds->item(1)->nodeValue) : '',
            'year'       => $tds->item(2) ? $esc($tds->item(2)->nodeValue) : '',
        ];
    }
}

?>

<style>
  #php-table {
    font-family: sans-serif;
    color: #18181b;
    border-collapse: collapse;
    width: 100%;
  }

  #php-table thead th {
    padding: 1rem 0.8rem;
  }

  #php-table thead tr th:first-child {
    text-align: center;
  }

  #php-table th, #php-table td {
    border: 1px solid #ccc;
    text-align: left;
    padding: 0.6rem 0.8rem;
  }

  #php-table tr:nth-child(even) {
    background-color: #f6f6f6;
  }

  #php-table td > p {
    font-size: smaller;
    color: #777;
    margin: 0.4rem 0 0;
  }
</style>

<table id="php-table">
    <thead>
    <tr>
        <th>Title</th>
        <th>Cited&nbsp;by</th>
        <th>Year</th>
    </tr>
    </thead>
    <tbody>
    <?php
    foreach ($records as $record): ?>
        <tr>
            <td>
                <a target="_blank" href="<?= $record['titleHref'] ?>"><?= $record['title'] ?></a>
                <p><?= $record['authors'] ?></p>
                <p><?= $record['conference'] ?></p>
            </td>
            <td><?= $record['cites'] ?></td>
            <td><?= $record['year'] ?></td>
        </tr>
    <?php
    endforeach; ?>
    </tbody>
</table>
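The key-author highlighting described above can also be isolated in a small function, which makes it easy to escape the scraped text before the bold tags are added. This is a minimal sketch rather than the code used on my portfolio: the function name `highlight_author` is made up, and the co-author names in the usage line are invented sample data.

```php
<?php
// Hypothetical helper; the name is illustrative only.
function highlight_author(string $authors, string $keyAuthor): string
{
    // Escape first, so scraped text cannot inject markup into the page.
    $safeAuthors = htmlspecialchars($authors, ENT_QUOTES, 'UTF-8');
    $safeKey     = htmlspecialchars($keyAuthor, ENT_QUOTES, 'UTF-8');

    if ($safeKey === '') {
        return $safeAuthors; // nothing to highlight
    }

    // Then add the only markup we want: <strong> around each match.
    return str_replace($safeKey, '<strong>' . $safeKey . '</strong>', $safeAuthors);
}

// Invented author list; only "B Wooding" comes from the text above.
echo highlight_author('A Smith, B Wooding, C <Jones>', 'B Wooding');
```

Escaping before the replacement matters: doing it afterwards would also escape the strong tags and show them as literal text.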
				
			
