Scraping Google Scholar
Scraping Total Citations
The following snippet of code retrieves the total number of citations shown on a user's profile. Replace [USERNAME] with the profile username (found in the Scholar page URL) and [LANGUAGE] with the language you wish to use (en = English). The value can then be embedded in your website, on a button or inside the text, as you wish. WordPress snippets make this easy: copy the code below into a new snippet and select the option “only display when inserted into a post or page”. A shortcode then becomes available to embed in your webpage, looking something like {code_snippet id=[id] name=[name] php format} (note: the curly brackets should be replaced with square brackets; I used curly brackets to stop WordPress trying to run the snippet!).
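<?php
// Assumed reconstruction: the URL pattern and XPath are guesses at Google Scholar's markup.
$url = 'https://scholar.google.com/citations?user=[USERNAME]&hl=[LANGUAGE]';
$contents = file_get_contents($url);
$citations_xpath = '//table[@id="gsc_rsb_st"]//td[@class="gsc_rsb_std"]';
$dom = new DOMDocument();
@$dom->loadHTML($contents); // @ silences warnings from non-standard HTML
$xpath = new DOMXPath($dom);
$citations = $xpath->query($citations_xpath);
$value = $citations->item(0)->nodeValue; // first cell: all-time citations
echo $value;
?>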
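Google Scholar may return a CAPTCHA or an error page instead of the profile, in which case `item(0)` is `null` and the snippet above fails. Below is a minimal sketch of a more defensive version. The helper name `scholar_first_value` is hypothetical, and the sample markup (including the table id and cell class) is an assumption about Scholar's page structure, used here only so the example runs without a network request.

```php
<?php
// Hypothetical helper: returns the text of the first node matching $query,
// or $fallback when the page is empty or nothing matches.
function scholar_first_value(string $html, string $query, string $fallback = 'n/a'): string
{
    if ($html === '') {
        return $fallback; // loadHTML() rejects an empty string
    }
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return $fallback;
    }
    return trim($nodes->item(0)->nodeValue);
}

// Sample markup standing in for a downloaded profile page (assumed structure).
$sample = '<html><body><table id="gsc_rsb_st"><tr>'
    . '<td class="gsc_rsb_std">123</td></tr></table></body></html>';
echo scholar_first_value($sample, '//table[@id="gsc_rsb_st"]//td[@class="gsc_rsb_std"]');
```

With a real profile, pass the downloaded page as the first argument, cast to a string so that a failed download becomes an empty string and the fallback is shown instead of an error.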
Scraping H-Index
As before, the following code snippet acquires the h-index from the profile. Both [USERNAME] and [LANGUAGE] should be replaced.
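<?php
// Assumed reconstruction: the URL pattern and XPath are guesses at Google Scholar's markup.
$url = 'https://scholar.google.com/citations?user=[USERNAME]&hl=[LANGUAGE]';
$contents = file_get_contents($url);
$hindex_xpath = '(//table[@id="gsc_rsb_st"]//td[@class="gsc_rsb_std"])[3]';
$dom = new DOMDocument();
@$dom->loadHTML($contents); // @ silences warnings from non-standard HTML
$xpath = new DOMXPath($dom);
$hindex = $xpath->query($hindex_xpath);
$value = $hindex->item(0)->nodeValue; // third cell: all-time h-index
echo $value;
?>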
Scraping All Article Information
The following block of code extracts all the information required to construct the table that can be found in my portfolio: recent articles, authors, citations, etc. [USERNAME] and [LANGUAGE] should be replaced as before. Additionally, [SORT] can take either pubdate for the most recent papers or citedby for the most cited papers. By putting the key author's name in [KEY AUTHOR], you can make that name bold wherever it appears, e.g. for me I would have: $keyAuthor = "B Wooding".
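<?php
// Assumed reconstruction: the URL pattern, $baseUrl, and XPath are guesses at Google Scholar's markup.
$baseUrl = 'https://scholar.google.com';
$url = $baseUrl . '/citations?user=[USERNAME]&hl=[LANGUAGE]&sortby=[SORT]';
$contents = file_get_contents($url);
$citations_xpath = '//*[@id="gsc_a_b"]'; // body of the articles table
$dom = new DOMDocument();
@$dom->loadHTML($contents); // @ silences warnings from non-standard HTML
$xpath = new DOMXPath($dom);
$table = $xpath->query($citations_xpath);
$keyAuthor = "[KEY AUTHOR]";
$records = [];
foreach ($table as $row) {
    $trs = $row->getElementsByTagName('tr');
    foreach ($trs as $tr) {
        // First cell holds the title link, the author list, and the venue.
        $td = $tr->getElementsByTagName('td')->item(0);
        if ($td === null) {
            continue; // skip rows without data cells
        }
        $title = $td->getElementsByTagName('a')->item(0)->nodeValue;
        $titleHref = $td->getElementsByTagName('a')->item(0)->getAttribute('href');
        $authors = $td->getElementsByTagName('div')->item(0)->nodeValue;
        $conference = $td->getElementsByTagName('div')->item(1)->nodeValue;
        // Second and third cells hold the citation count and the year.
        $cites = $tr->getElementsByTagName('td')->item(1)->nodeValue;
        $year = $tr->getElementsByTagName('td')->item(2)->nodeValue;
        // Wrap the key author's name in bold tags wherever it appears.
        $authors = str_replace($keyAuthor, "<b>$keyAuthor</b>", $authors);
        $records[] = [
            'title' => $title,
            'titleHref' => $baseUrl.$titleHref,
            'authors' => $authors,
            'conference' => $conference,
            'cites' => $cites,
            'year' => $year,
        ];
    }
}
?>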
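<!-- Assumed reconstruction: the table markup was lost; only the headers and the five fields are original. -->
<table>
<tr><th>Title</th><th>Cited by</th><th>Year</th></tr>
<?php foreach ($records as $record): ?>
<tr>
<td>
<a href="<?= $record['titleHref'] ?>"><?= $record['title'] ?></a><br>
<?= $record['authors'] ?><br>
<?= $record['conference'] ?>
</td>
<td><?= $record['cites'] ?></td>
<td><?= $record['year'] ?></td>
</tr>
<?php endforeach; ?>
</table>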
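The key-author highlighting can also be sketched in isolation, as below. The function name `highlight_author` and the sample author list are hypothetical, and escaping the scraped text with `htmlspecialchars` before adding the tags is an added precaution rather than part of the original snippet.

```php
<?php
// Hypothetical helper: wraps every occurrence of $keyAuthor in <b> tags.
function highlight_author(string $authors, string $keyAuthor): string
{
    // Escape first so that scraped text cannot inject markup into the page.
    $safeAuthors = htmlspecialchars($authors, ENT_QUOTES, 'UTF-8');
    $safeKey = htmlspecialchars($keyAuthor, ENT_QUOTES, 'UTF-8');
    if ($safeKey === '') {
        return $safeAuthors; // nothing to highlight
    }
    return str_replace($safeKey, "<b>$safeKey</b>", $safeAuthors);
}

echo highlight_author('A Smith, B Wooding, C Jones', 'B Wooding');
// prints: A Smith, <b>B Wooding</b>, C Jones
```

Because `str_replace` matches plain substrings, a key of "B Wooding" would also match inside a longer name such as "AB Wooding", so it is worth using the name exactly as Scholar prints it.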
Thanks for your blog, nice to read. Do not stop.
Thanks Mark. I have just updated the post to link to the code if you would like to do it yourself.