Evaluating Trends in Bioinformatics Software Packages

Genomics — the study of the genome — requires processing and analysis of large-scale datasets. For example, to identify genes that increase our susceptibility to a particular disease, scientists can read many genomes of individuals with and without this disease and try to compare differences among genomes. After running human samples on DNA sequencers, scientists need to work with overwhelming datasets.

Working with genomic datasets requires knowledge of a myriad of bioinformatic software packages. Because of rapid development of bioinformatic tools, it can be difficult to decide which software packages to use. For example, there are at least 18 software packages to align short unspliced DNA reads to a reference genome. Of course, scientists make this choice routinely and their collective knowledge is embedded in the literature.

To this end, I have been using ContentMine to find trends in bioinformatic tools. This blog post outlines my approach and demonstrates important challenges to address. Much of my current effort goes into obtaining as comprehensive a set of articles as possible, ensuring correct counting/matching methods for bioinformatic packages, and making this pipeline reproducible.



Audio interview with Peter Murray-Rust on the Data Skeptic Podcast (53 minutes)

In August, Peter Murray-Rust agreed to do an interview with Kyle Polich at Data Skeptic, “The podcast that is skeptical of and with data”. The interview was published online on 28th August 2015.

Data Skeptic is a podcast that alternates between short mini episodes with the host explaining concepts from data science to his non-data scientist wife, and longer interviews featuring practitioners and experts on interesting topics related to data, all through the eye of scientific skepticism.

ContentMine is a project which provides the tools and workflow to convert scientific literature into machine readable and machine interpretable data in order to facilitate better and more effective access to the accumulated knowledge of human kind. The program’s founder Peter Murray-Rust joins us this week to discuss ContentMine. Our discussion covers the project, the scientific publication process, copyright, and several other interesting topics.

Full transcript available here: ContentMine full transcript.

The draft transcript is ~98% accurate and Peter will edit this in due course.



Both the audio and transcript are licensed under Creative Commons CC-BY.

With so much valuable content, in due course we shall break this down into more manageable segments, but for now, enjoy the interview in full.

Just some of the topics covered:-


Entry points to the ContentMine toolchain

Where to get started when content mining? Well, it really depends on what you’ve currently got and/or where you plan to source your content from. In this bite-size post I’ll cover all 3 of the major entry points to the ContentMine toolchain.



We envisage that there will be 3 significant entry points to the ContentMine toolchain:

  1. From academic content aggregator websites via getpapers (most recommended route, if possible)
  2. From journal websites via quickscrape
  3. From user-supplied files on the local desktop (least recommended)


All three of these entry points pass content to norma, to normalise the to-be-mined-content to ContentMine standards and specifications, prior to analysis and visualisation by downstream ContentMine tools.


Here are some command-line examples of how each of these entry points works:


1.) The ideal workflow, if your subject matter / resource provider allows it, is to take standardised XML (e.g. NLM XML from EPMC) and work with this highly structured content. The example below is taken from a previous blog post on finding species.

getpapers --query 'species JOURNAL:"PLOS ONE" AND FIRST_PDATE:[2015-04-02 TO 2015-04-02]' \
          -x --outdir plos-species
norma -q plos-species/ -i fulltext.xml -o scholarly.html --transform nlm2html
#downstream analyses proceed on normalised content from here onwards...


2.) If your subject matter isn’t covered by IEEE / arXiv / Europe PubMedCentral, or the content is unavailable there for some other reason such as pesky embargo periods, then you can try to enter content into the ContentMine toolchain via quickscrape. In terms of file format, the order of preference is (from best to worst): XML > HTML > PDF. Sadly, many legacy subscription access publishers choose not to expose content in XML from their journal websites. The HTML workflow goes from publisher HTML -> tidied-up XHTML -> scholarly HTML.

#quickscrape usage on a Nature Communications paper
quickscrape --url http://www.nature.com/ncomms/journal/v1/n3/abs/ncomms1031.html \
            --scraper journal-scrapers/scrapers/nature.json --output natcomms
info: quickscrape 0.4.5 launched with...
info: - URL: http://www.nature.com/ncomms/journal/v1/n3/abs/ncomms1031.html
info: - Scraper: /home/ross/workspace/quickscrape/journal-scrapers/scrapers/nature.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://www.nature.com/ncomms/journal/v1/n3/abs/ncomms1031.html
info: [scraper]. URL rendered. http://www.nature.com/ncomms/journal/v1/n3/abs/ncomms1031.html.
info: [scraper]. download started. fulltext.html.
info: [scraper]. download started. ncomms1031-s1.pdf.
info: [scraper]. download started. fulltext.pdf.
info: URL processed: captured 11/25 elements (14 captures failed)
info: all tasks completed

tree natcomms/
└── http_www.nature.com_ncomms_journal_v1_n3_abs_ncomms1031.html
    ├── fulltext.html
    ├── fulltext.pdf
    ├── ncomms1031-s1.pdf
    └── results.json

1 directory, 4 files

#norma steps
norma -i fulltext.html -o fulltext.xhtml --cmdir natcomms/ --html jsoup
norma -i fulltext.xhtml -o scholarly.html --cmdir natcomms/ --transform nature2html

tree natcomms/
└── http_www.nature.com_ncomms_journal_v1_n3_abs_ncomms1031.html
    ├── fulltext.html
    ├── fulltext.pdf
    ├── fulltext.xhtml
    ├── ncomms1031-s1.pdf
    ├── results.json
    └── scholarly.html

1 directory, 6 files
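
The format preference mentioned in step 2 (XML > HTML > PDF) can be sketched as a small shell helper that picks the most preferred full-text file present in one paper directory. This is only an illustration: `best_fulltext` is a hypothetical function, not part of any ContentMine tool, and it assumes the `fulltext.xml`, `fulltext.html` and `fulltext.pdf` file names shown in the listings above.

```shell
#!/bin/sh
# Hypothetical helper (not a ContentMine command): print the most preferred
# full-text file found in a paper directory, in the order XML > HTML > PDF.
best_fulltext() {
    paper_dir=$1
    for candidate in fulltext.xml fulltext.html fulltext.pdf; do
        if [ -f "$paper_dir/$candidate" ]; then
            printf '%s\n' "$paper_dir/$candidate"
            return 0
        fi
    done
    # Nothing usable was found in this directory.
    return 1
}
```

Called on the scraped Nature Communications directory above, it would print the path to fulltext.html, since no fulltext.xml was captured there.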


3.) If all else fails, you can feed your files into our toolchain directly via norma, but this route doesn’t capture rich metadata about each item of user-supplied content, so it’s not an optimal pathway. Here’s an example of three random PDFs being prepared for analysis with norma:

#put content in direct via norma
norma -i A.pdf B.pdf C.pdf -o output/ctrees --cmdir

tree output/
└── ctrees
    ├── A_pdf
    │   └── fulltext.pdf
    ├── B_pdf
    │   └── fulltext.pdf
    └── C_pdf
        └── fulltext.pdf

4 directories, 3 files


…and that’s it. Three different entry points to the ContentMine toolchain: primarily designed with XML, HTML or PDF in mind, but other formats are available as input too.

Finding species in the literature haystack

Processing academic research literature at scale is an interesting problem: there’s a lot of it.

There are literally millions of papers published every year across many tens of thousands of journals and no one person has time to read them all, with their own eyes at least.

But with computers, the Internet, and some clever code, we can definitely tackle this data deluge. There’s no such thing as ‘too much science’ – not when it comes to research papers. The volume, bandwidth, or disk space required simply isn’t challenging in 2015. You can download over a million papers from the PubMedCentral open access subset with relative ease. It’s not ‘big data’, just gigabytes.


In this post, I’m going to outline one specific use-case of the ContentMine toolchain, taking you through the process from ‘haystack’ to ‘needles’. In this demonstration, the ‘haystack’ is a set of 49 PLOS ONE papers, and the ‘needles’ (data) I want to tease out from this mini-mass of literature are binomial nomenclature – the scientific names given by biologists to organisms e.g. Homo sapiens, Gorilla gorilla, and E. coli.


Step One: downloading a sample

For demonstration purposes, rather than taking the whole of the open access subset, we’re just going to analyse 49 PLOS ONE papers, but, if desired, this process could easily scale.

#download 49 papers as XML
getpapers --query 'species JOURNAL:"PLOS ONE" AND FIRST_PDATE:[2015-04-02 TO 2015-04-02]' -x  --outdir plos-species


Step Two: normalizing the content

This step isn’t particularly exciting, but it’s an important one. Content is published in a variety of ways and styles, and from many different providers – thus for consistency of analysis the ‘raw’ content from step one must first be normalized prior to any analysis.

# normalize before you analyze!
$ norma -q plos-species/ -i fulltext.xml -o scholarly.html --transform nlm2html

Our normalization tool is called norma, and it’s very dependable. You can point it at a directory output from getpapers with any number of papers within it, from one to a million, and norma will chug away doing all the necessary pre-processing in next to no time at all. The output from this stage is formatted as Scholarly HTML, hence the “--transform” switch, which transforms the input from NLM XML to (scholarly) HTML: nlm2html for short. Other types of transform are also supported by norma.
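
When normalizing many papers at once, it can be handy to check which paper directories did not end up with a scholarly.html file. The sketch below is one way to do that in plain shell; `missing_output` is a hypothetical helper, not a norma option, and it assumes that the normalized file is written directly inside each paper directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of norma): list paper directories that lack
# the expected output file. The file name defaults to scholarly.html.
missing_output() {
    project_dir=$1
    wanted=${2:-scholarly.html}
    for paper in "$project_dir"/*/; do
        # Skip the unexpanded pattern when the project directory is empty.
        [ -d "$paper" ] || continue
        [ -f "$paper$wanted" ] || printf '%s\n' "$(basename "$paper")"
    done
}
```

Running `missing_output plos-species` after the norma step would print the names of any papers that were not normalized, and nothing at all if every paper has its scholarly.html.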


Step Three: filtering out the names

This is a more sophisticated step. There’s a lot going on ‘under the hood’ with this command. It’s not just regular-expression matching or dictionary-based lookup. ami2-species attempts to find all binomials, isolated genus names, and even abbreviated binomens e.g. E. coli. It’s also just one of many ContentMine ami-plugins.

# Find the needles in the haystack, with some context too
$ ami2-species -q ./plos-species -i scholarly.html --sp.species --context 35 50 --sp.type binomial genus genussp

Don’t be afraid if you see some error messages. The default settings are quite verbose at the moment.



The content of the results.xml file for each paper looks something like this:

$ cat ./plos-species/PMC4383566/results/species/binomial/results.xml 
<?xml version="1.0" encoding="UTF-8"?>
<results title="binomial">
 <result pre="val (in the dry treatment) between " exact="Ricinodendron heudelotii" match="Ricinodendron heudelotii" post=" seedlings in the first and second batches of the " name="binomial"/>
 <result pre="-pioneer light demanding species ( " exact="Albizia zygia" match="Albizia zygia" post=", Pericopsis elata, Pouteria aningeri&lt;/i" name="binomial"/>
 <result pre="ng species ( Albizia zygia, " exact="Pericopsis elata" match="Pericopsis elata" post=", Pouteria aningeri, Sterculia rhinopeta" name="binomial"/>
 <result pre="ygia, Pericopsis elata, " exact="Pouteria aningeri" match="Pouteria aningeri" post=", Sterculia rhinopetala, Piptadeniastrum" name="binomial"/>
 <result pre="ata, Pouteria aningeri, " exact="Sterculia rhinopetala" match="Sterculia rhinopetala" post=", Piptadeniastrum africanum). " name="binomial"/>
 <result pre="/i&gt;, Sterculia rhinopetala, " exact="Piptadeniastrum africanum" match="Piptadeniastrum africanum" post="). " name="binomial"/>


It’s now easy to summarize findings across papers: how many of these 49 papers contain scientific names of organisms? 39 apparently.

# loop to print the number of lines in the results files of each paper; a count of 6 means 0 name hits
$ for i in ./plos-species/*/ ; do echo "$i" ; find "$i" -type f -name 'results.xml' | xargs cat | wc -l ; done
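
The loop above counts raw lines, so you have to remember how many lines an empty result set produces. As an alternative, here is a minimal sketch that counts the `<result .../>` elements themselves. `count_hits` is a hypothetical helper, not part of the ContentMine toolchain, and it assumes hits look like the sample output shown earlier in this post.

```shell
#!/bin/sh
# Hypothetical helper: print "<paper><TAB><number of hits>" for every paper
# directory, counting <result .../> elements in all of its results.xml files.
count_hits() {
    project_dir=$1
    for paper in "$project_dir"/*/; do
        # Skip the unexpanded pattern when the project directory is empty.
        [ -d "$paper" ] || continue
        # gsub returns the number of matches on each line; the pattern ends
        # with a space so the enclosing <results ...> element is not counted.
        hits=$(find "$paper" -type f -name 'results.xml' -exec cat {} + \
               | awk '{ n += gsub(/<result /, "&") } END { print n + 0 }')
        printf '%s\t%s\n' "$(basename "$paper")" "$hits"
    done
}
```

With this, `count_hits ./plos-species` prints one line per paper, and papers with no names found show a count of 0.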


The ContentMine team plans to add functionality to make it easier to determine taxon co-occurrence and taxon name lookup, as well as easier ways of visualising the results. Watch this space!
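
In the meantime, a rough tally of which names turn up most often across all papers can be sketched with standard Unix tools. `tally_names` below is a hypothetical helper, not a ContentMine command; it assumes each hit sits on its own line and carries an `exact="..."` attribute, as in the sample results.xml above.

```shell
#!/bin/sh
# Hypothetical helper: count how often each matched name occurs across all
# results.xml files under a project directory, most frequent first.
tally_names() {
    project_dir=$1
    find "$project_dir" -type f -name 'results.xml' -exec cat {} + \
        | sed -n 's/.* exact="\([^"]*\)".*/\1/p' \
        | sort | uniq -c | sort -rn
}
```

For example, `tally_names ./plos-species | head` would show the ten most frequently matched names. This is a plain text tally, so it does not reconcile abbreviated and full forms of the same name.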

The point of this post is to show that in 3 lines, using three ContentMine tools, you can effectively filter the literature for species. No information overload here.

Explaining the difference between getpapers and quickscrape

Having written a blog post about getpapers yesterday, I thought it might be useful to explain the difference in utility between getpapers and quickscrape.

I think of getpapers as a handy command-line tool for search & retrieval of relevant research. However, there are a variety of circumstances that can prevent getpapers from returning you the full text of some relevant papers; this is where quickscrape becomes very useful.

quickscrape is a command-line tool simply for retrieval of known research you want to download, with more power and flexibility of download techniques than getpapers. To some extent, it is in theory possible to get anything and everything you have legal access to, in bulk, via quickscrape. Now that’s what I mean by POWER!


Q: Is there a situation in which I might use both getpapers and quickscrape?

A: Yes! getpapers has functionality specifically designed for input into quickscrape. This is very useful when getpapers finds relevant closed access papers that publisher-imposed restrictions prevent EPMC from making available for full-text download.

A worked example: I want to mine the last 3 months of papers published in PNAS. PNAS typically imposes a 6-month embargo on research published in it, so EPMC cannot allow full-text download of recent PNAS research from EPMC. So you have to go via the PNAS journal website to get recent PNAS articles.

# Use getpapers to get a list of all recent PNAS articles
getpapers \
  --query 'JOURNAL:"PNAS" AND FIRST_PDATE:[2015-04-01 TO 2015-07-01]' \
  --outdir recentpnas

# Use quickscrape to download recent PNAS articles output by getpapers
quickscrape \
  --urllist recentpnas/fulltext_html_urls.txt \
  --scraper journal-scrapers/scrapers/pnas.json \
  --output recentpnasfull \
  --outformat bibjson

Perfect synergy, eh?
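
If you want to tidy the URL list before handing it to quickscrape, a small shell filter can help. The sketch below assumes the list is a plain text file with one URL per line; `clean_urls` is a hypothetical helper, not part of getpapers or quickscrape.

```shell
#!/bin/sh
# Hypothetical helper: drop blank lines and repeated URLs from a URL list,
# keeping the first occurrence of each URL in its original position.
clean_urls() {
    awk 'NF && !seen[$0]++' "$1"
}
```

You could redirect its output to a new file (any name you like) and pass that file to quickscrape via --urllist in place of the original list.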


Q: What’s a real use case in which someone would use quickscrape instead of getpapers?

A: When the journal (e.g. Acta Palaeontologica Polonica) or platform (e.g. bioRxiv) that the desired research is published on is not in Europe PubMedCentral (EPMC), arXiv, or IEEE.

Incidentally, there are two Acta Palaeontologica Polonica articles in EPMC and I have no idea why they are in EPMC to be honest! It would certainly make my life easier if EPMC / PMC were more widely scoped in terms of subjects/journals allowed in.

I’m not a biomedical researcher myself so unfortunately this is a common problem for me. There is no central aggregation of evolution, ecology or palaeontology journal content. If you want to do full-text mining on them, you have to aggregate the content yourself, with quickscrape!