Processing academic research literature at scale is an interesting problem: there’s a lot of it.
Millions of papers are published every year across tens of thousands of journals, and no one person has time to read them all, at least not with their own eyes.
But with computers, the Internet, and some clever code, we can tackle this data deluge. There’s no such thing as ‘too much science’, not when it comes to research papers. The volume, bandwidth, and disk space required simply aren’t challenging in 2015. You can download over a million papers from the PubMed Central open access subset with relative ease. It’s not ‘big data’, just gigabytes.
In this post, I’m going to outline one specific use-case of the ContentMine toolchain, taking you through the process from ‘haystack’ to ‘needles’. In this demonstration, the ‘haystack’ is a set of 49 PLOS ONE papers, and the ‘needles’ (data) I want to tease out from this mini-mass of literature are binomial names: the scientific names given by biologists to organisms, e.g. Homo sapiens, Gorilla gorilla, and E. coli.
Step One: downloading a sample
For demonstration purposes, rather than taking the whole of the open access subset, we’re just going to analyse 49 PLOS ONE papers, but the process could easily be scaled up if desired.
# download 49 papers as XML
$ getpapers --query 'species JOURNAL:"PLOS ONE" AND FIRST_PDATE:[2015-04-02 TO 2015-04-02]' -x --outdir plos-species
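Before moving on, it can be worth confirming how many papers actually arrived. The following is a minimal sketch, not part of the ContentMine tools: count_papers is a hypothetical helper, and it assumes getpapers writes one sub-directory per paper, each holding a fulltext.xml file (the file name that the next step reads).

```shell
# Hypothetical helper: count paper directories that contain a fulltext.xml.
# Assumes the layout <outdir>/<paper id>/fulltext.xml.
count_papers() {
    n=0
    for f in "$1"/*/fulltext.xml; do
        # An unmatched glob stays literal, so test that the file really exists.
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Only report when the download directory is present.
if [ -d plos-species ]; then
    echo "papers downloaded: $(count_papers plos-species)"
fi
```

For the query above, the reported number should match the 49 papers expected.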
Step Two: normalizing the content
This step isn’t particularly exciting, but it’s an important one. Content is published in a variety of ways and styles, and by many different providers, so for consistent analysis the ‘raw’ content from step one must first be normalized.
# normalize before you analyze!
$ norma -q plos-species/ -i fulltext.xml -o scholarly.html --transform nlm2html
Our normalization tool is called norma, and it’s very dependable. You can point it at a directory output from getpapers with any number of papers in it, from one to a million, and norma will chug away doing all the necessary pre-processing in next to no time at all. The output from this stage is formatted as Scholarly HTML, hence the “--transform” switch, which transforms the input from NLM XML to (scholarly) HTML; nlm2html for short. norma supports other types of transform as well.
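A quick way to check that normalization covered every paper is to look for directories that still lack the output file. This is a sketch under an assumption: that norma writes scholarly.html next to the fulltext.xml it read, which is what the -i and -o arguments above suggest. list_unnormalized is a hypothetical helper name, not a norma feature.

```shell
# Hypothetical helper: print each paper directory that has a fulltext.xml
# but no scholarly.html yet.
# Assumes norma writes scholarly.html alongside the input file.
list_unnormalized() {
    for d in "$1"/*/; do
        if [ -f "${d}fulltext.xml" ] && [ ! -f "${d}scholarly.html" ]; then
            echo "$d"
        fi
    done
}

# Only run when the download directory is present.
if [ -d plos-species ]; then
    list_unnormalized plos-species
fi
```

No output means every downloaded paper has a normalized counterpart.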
Step Three: filtering out the names
This is a more sophisticated step. There’s a lot going on ‘under the hood’ with this command. It’s not just regular-expression matching or dictionary-based lookup. ami2-species attempts to find all binomials, isolated genus names, and even abbreviated binomens, e.g. E. coli. It’s also just one of many ContentMine ami-plugins.
# Find the needles in the haystack, with some context too
$ ami2-species -q ./plos-species -i scholarly.html --sp.species --context 35 50 --sp.type binomial genus genussp
Don’t be afraid if you see some error messages. The default settings are quite verbose at the moment.
The content of each paper’s results.xml file looks something like this:
$ cat ./plos-species/PMC4383566/results/species/binomial/results.xml
<?xml version="1.0" encoding="UTF-8"?>
<result pre="val (in the dry treatment) between " exact="Ricinodendron heudelotii" match="Ricinodendron heudelotii" post=" seedlings in the first and second batches of the " name="binomial"/>
<result pre="-pioneer light demanding species ( " exact="Albizia zygia" match="Albizia zygia" post=", Pericopsis elata, Pouteria aningeri</i" name="binomial"/>
<result pre="ng species ( Albizia zygia, " exact="Pericopsis elata" match="Pericopsis elata" post=", Pouteria aningeri, Sterculia rhinopeta" name="binomial"/>
<result pre="ygia, Pericopsis elata, " exact="Pouteria aningeri" match="Pouteria aningeri" post=", Sterculia rhinopetala, Piptadeniastrum" name="binomial"/>
<result pre="ata, Pouteria aningeri, " exact="Sterculia rhinopetala" match="Sterculia rhinopetala" post=", Piptadeniastrum africanum). " name="binomial"/>
<result pre="/i>, Sterculia rhinopetala, " exact="Piptadeniastrum africanum" match="Piptadeniastrum africanum" post="). " name="binomial"/>
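Since each hit carries the matched name in its exact attribute, the names can be pulled out and tallied with standard tools. What follows is a rough sketch rather than a ContentMine feature: tally_names is a hypothetical helper, it treats the XML as plain text instead of parsing it properly, and it assumes every paper stores its binomial hits under results/species/binomial/results.xml as in the path shown above.

```shell
# Hypothetical helper: tally the exact="..." values in all binomial
# results.xml files under a directory, most frequent name first.
# Text-based matching only; assumes attribute values never contain a raw ".
tally_names() {
    find "$1" -type f -path '*/species/binomial/results.xml' -exec cat {} + \
        | grep -o ' exact="[^"]*"' \
        | sed -e 's/^ exact="//' -e 's/"$//' \
        | sort | uniq -c | sort -rn
}

# Only run when the results directory is present.
if [ -d plos-species ]; then
    tally_names plos-species
fi
```

Each output line is a count followed by a name, so the most frequently mentioned organisms in the sample appear at the top.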
It’s now easy to summarize findings across papers. How many of these 49 papers contain scientific names of organisms? 39, apparently.
# loop to print the number of lines in each paper's results files; a count of 6 means 0 name hits
$ for i in ./plos-species/*/ ; do echo "$i" ; find "$i" -type f -name 'results.xml' | xargs cat | wc -l ; done
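Counting lines relies on knowing how many header lines an empty set of results files contains. A slightly more direct alternative, sketched here as an assumption-laden example rather than the method used for the figure above, counts the result elements themselves and reports how many papers have at least one. count_hits and papers_with_hits are hypothetical helper names.

```shell
# Hypothetical helper: number of <result .../> elements in all results.xml
# files below one paper directory (text match, not a real XML parse).
count_hits() {
    find "$1" -type f -name 'results.xml' -exec cat {} + \
        | grep -o '<result ' | wc -l | tr -d ' '
}

# Hypothetical helper: number of paper directories with at least one hit.
papers_with_hits() {
    total=0
    for d in "$1"/*/; do
        if [ -d "$d" ] && [ "$(count_hits "$d")" -gt 0 ]; then
            total=$((total + 1))
        fi
    done
    echo "$total"
}

# Only report when the results directory is present.
if [ -d plos-species ]; then
    echo "papers with name hits: $(papers_with_hits plos-species)"
fi
```

The trailing space in the search pattern keeps it from matching any element whose name merely starts with "result".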
The ContentMine team plans to add functionality to make it easier to determine taxon co-occurrence and taxon name lookup, as well as easier ways of visualising the results. Watch this space!
The point of this post is to show that in three lines, using three ContentMine tools, you can effectively filter the literature for species names. No information overload here.