Evaluating Trends in Bioinformatics Software Packages

Genomics — the study of the genome — requires processing and analysis of large-scale datasets. For example, to identify genes that increase our susceptibility to a particular disease, scientists can read many genomes of individuals with and without this disease and try to compare differences among genomes. After running human samples on DNA sequencers, scientists need to work with overwhelming datasets.

Working with genomic datasets requires knowledge of a myriad of bioinformatic software packages. Because of rapid development of bioinformatic tools, it can be difficult to decide which software packages to use. For example, there are at least 18 software packages to align short unspliced DNA reads to a reference genome. Of course, scientists make this choice routinely and their collective knowledge is embedded in the literature.

To this end, I have been using ContentMine to find trends in bioinformatic tools. This blog post outlines my approach and demonstrates important challenges to address. Much of my current effort goes into obtaining as comprehensive a set of articles as possible, ensuring correct counting/matching methods for bioinformatic packages, and making this pipeline reproducible.



ContentMining Conifers and Visualising Output: Extracting Semantic Triples

Last month I wrote about visualising ContentMine output. It was limited to showing and grouping ami results which — in this case — were simply words extracted from the text.

A few weeks back, I got to know the ContentMine fact dumps. These are not just extracted words: they are linked to Wikidata Entities, from which all sorts of identifiers and properties can be derived. This is visualised by factvis. For example, when you hover over a country’s name, the population appears.

Now, I’m experimenting with the next level of information: semantic triples. Because NLP is a step too far at the moment, I resorted to extracting these from tables. To see how this went in detail, please take a look at my blogpost. A summary: norma (the ContentMine program that normalises articles) has an issue with tables; table layout is very inconsistent across papers, which makes it hard to parse; and I’m currently providing Wikidata IDs for (hopefully) all three parts of the triple.

About that last one: I’ve made a dictionary of species names paired with Wikidata Entity IDs with this query. I limited the number of results because the Web Client times out if I don’t. I run the query locally using the following command:

curl -i -H "Accept: text/csv" --data-urlencode query@<queryFile>.rq -G https://query.wikidata.org/bigdata/namespace/wdq/sparql -o <outputFile>.csv

Then I can match the species names to the values I get when extracting from the table. If I can do this for the property as well, we’ll have a functioning program that creates Wikidata-grade statements from hundreds of articles in a matter of minutes.
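The matching step can be sketched roughly as below. This is a minimal illustration, not the code used for the demo: the column names `item` and `itemLabel`, the sample rows and the placeholder IDs are all assumptions. It also assumes that the HTTP response headers, which `curl -i` writes at the top of the output file, have been removed before parsing.

```python
import csv
import io

# Hypothetical stand-in for <outputFile>.csv; the column names and IDs are
# placeholders, not real Wikidata values.
SAMPLE_CSV = """item,itemLabel
http://www.wikidata.org/entity/Q_PLACEHOLDER_1,Pinus sylvestris
http://www.wikidata.org/entity/Q_PLACEHOLDER_2,Picea abies
"""


def load_species_dictionary(csv_text):
    """Map lower-cased species name -> entity ID (last path segment of the URI)."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        entity_id = row["item"].rsplit("/", 1)[-1]
        lookup[row["itemLabel"].strip().lower()] = entity_id
    return lookup


def match_cells(cells, lookup):
    """Return (cell, entity ID or None) for each table cell."""
    return [(cell, lookup.get(cell.strip().lower())) for cell in cells]


species_ids = load_species_dictionary(SAMPLE_CSV)
matches = match_cells(
    ["Pinus sylvestris", " picea abies ", "Unknown taxon"], species_ids
)
```

Cells with no entry come back paired with `None`, so unmatched names can be listed and checked by hand.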

There are still plenty of issues. Mapping Wikidata property names to text from tables, for example, or the fact that there are several species whose names are, when shortened, O. bicolor. I can’t know which one just from the context of the table. And all the table layout issues, although probably easier to fix, are still here. That, and upscaling, is what I’m going to focus on for the next month.
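The abbreviation problem can at least be detected automatically. The sketch below groups full names by their shortened form and reports the forms that map to more than one species. The genus names are invented for illustration; they are not from the actual species dictionary.

```python
from collections import defaultdict


def abbreviate(binomial):
    """Shorten 'Genus species' to 'G. species'."""
    genus, species = binomial.split(maxsplit=1)
    return f"{genus[0]}. {species}"


def build_abbreviation_index(names):
    """Map each abbreviated form to every full name that shortens to it."""
    index = defaultdict(list)
    for name in names:
        index[abbreviate(name)].append(name)
    return index


# Invented genus names, used only to illustrate the collision.
names = ["Oneus bicolor", "Otherus bicolor", "Pinus sylvestris"]
index = build_abbreviation_index(names)

# Abbreviations that cannot be resolved without extra context.
ambiguous = {abbr: full for abbr, full in index.items() if len(full) > 1}
```

Flagging these cases does not resolve them, but it keeps an ambiguous abbreviation from being silently linked to the wrong entity.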

But for now, there’s a demo similar to factvis, to preview the triples:


How ContentMine at Cambridge will use CrossRef’s API to mine Science

I’ve described how CrossRef works – now I’ll show how ContentMine will use it for daily mining.

ContentMine sets out to mine the whole scientific literature (“100 million facts”). Up till now we’ve been building the technical infrastructure, challenging for our rights, understanding the law, and ordering the kit. We’ve built and deployed a number of prototypes. But we are now ready to start indexing science in earnest.

Since ContentMining has been vastly underused, and because publisher actions have often chilled researchers and libraries, we don’t know in detail what people want and how they would tackle it. We think there are many approaches – here are a few:

  1. Daily examination of the complete daily literature (“Current awareness”).

  2. Query-based search of part of science and medicine. Thus a researcher might wish to study “obesity and smoking”. This is often likely to go back several years.

  3. Search for entities and identifiers. Which papers report Clinical trial data? What diseases are reported in Liberia?

  4. Search for associated data files.

(2-4) will be specific to the researcher, but (1) is general and we plan to do that. We’ll set up a set of search filters and apply them to every new paper that appears. For Open papers we can do anything, for Closed papers it has to be personal non-commercial research. A typical filter is for endangered species (I am personally concerned about endangered species and see it as a valid research topic). See http://iucn.contentmine.org/ where we index all papers in PloSOne and BMC for species and then look those up in the IUCN “red list”.

This filter currently aggregates all Open articles. Here’s our favourite species, Ursus maritimus, and when we search in our iucn.contentmine.org we find a key paper, http://www.biomedcentral.com/1471-2148/15/239 . But this can be extended to closed articles. So we can search the full-text of every paper for “Ursus maritimus” (and the other ca. 40,000 species on the IUCN list). Doing that over all papers (perhaps 150 million) is a problem, but it’s straightforward to do it for each day.

CrossRef estimate there are about 6000 journal articles a day (1440 minutes) – that won’t break any publisher servers – it’s ~1 per minute at worst even for Elsevier. So we’ll get all the daily papers and search them for species. We can store them on disk temporarily, then extract the polar bears, and then delete the files after a reasonable time.

So a daily search of papers is a trivial workload and a trivial impact on publisher servers. I’ve seen humans scanning several papers per minute.
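The rate argument is simple arithmetic and can be checked in a few lines. The only input is the CrossRef estimate quoted above; the second figure is the share of the daily total that one publisher would need for its own rate to reach one article per minute.

```python
ARTICLES_PER_DAY = 6000    # CrossRef's estimate quoted above
MINUTES_PER_DAY = 24 * 60  # 1440

# Articles per minute across all publishers combined.
overall_rate = ARTICLES_PER_DAY / MINUTES_PER_DAY

# Share of the daily total that a single publisher would need to publish
# for its own rate to reach 1 article per minute.
share_for_one_per_minute = MINUTES_PER_DAY / ARTICLES_PER_DAY

print(f"{overall_rate:.2f} articles/minute overall")
print(f"{share_for_one_per_minute:.0%} share gives 1 article/minute")
```

So the whole daily literature arrives at roughly four articles a minute, and a single publisher only reaches one a minute if it accounts for about a quarter of everything published that day.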

CrossRef have even set up a template for polar bears https://github.com/CrossRef/ursus-maritimus. It includes the license option, but we’ll ignore that (see previous posts). So the workflow is:

  1. Fetch the URLs

python run.py fetch

  2. Download URLs

./download.sh

Output will be in result/<date>. This also contains a copy of the URLs file.

We can also put this under a GUI wrapper.

Of course there will be other queries than polar bears so rather than download the papers every time for each query, we can batch the queries together (as long as they are PMR’s personal non-commercial research). Or PMR can copy the files temporarily to his non-commercial research disk and re-run the non-commercial research query to do research.

[I’m not being silly with the language. If I did someone else’s research for them, I might be challenged. You see how content-miners have to think always of the law first and science second. And how this could lead to flawed science planning. However I am very happy (a) to do joint personal research in any field that involves content-mining and (b) help any Cambridge scientist to do her non-commercial research on our joint Cambridge system.]

And we can publish the factual output. It may look something like http://iucn.contentmine.org/ . Everything there is either a fact or a snippet of < 200 characters surrounding the fact. We point to the original paper – that’s sufficient acknowledgement – and publish snippets as CC0. If you can read the original closed paper then you are one of the lucky 0.1% of the world’s population that subscribes to the scholarly literature or you can pay 40 USD for 24 hrs access with no re-use rights.
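As an illustration of the snippet rule, here is a minimal sketch that cuts a window of fewer than 200 characters around each occurrence of a term. It is not the ContentMine implementation; the function name and the way the context budget is split evenly before and after the match are my own assumptions.

```python
import re

MAX_SNIPPET_CHARS = 200  # limit taken from the "< 200 characters" figure above


def snippets(text, term, max_chars=MAX_SNIPPET_CHARS):
    """Return a window of fewer than max_chars characters around each match."""
    budget = max_chars - 1 - len(term)  # characters left over for context
    if budget < 0:
        raise ValueError("term is longer than the snippet limit")
    before = budget // 2
    after = budget - before
    results = []
    for match in re.finditer(re.escape(term), text):
        start = max(0, match.start() - before)
        end = min(len(text), match.end() + after)
        results.append(text[start:end])
    return results


# Synthetic text: the term surrounded by long runs of filler characters.
sample_text = "x" * 300 + "Ursus maritimus" + "y" * 300
found = snippets(sample_text, "Ursus maritimus")
```

A production version would probably trim to word or sentence boundaries rather than cutting mid-word.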

So how many fact-types will be extracted per day? That’s up to you. I’ll do about 10 – species, sequences, chemistry, phylogenetics, word frequencies, templates, clinical trials … all things where I am a legitimate expert accepted by the scientific world. As a responsible scientist I am required to publish these facts and to make them CC0.

So how many facts? Let’s say 10 per paper? You do the sums…
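For readers who would rather not do the sums by hand, here they are. This assumes the CrossRef figure of about 6000 papers a day and reads “10 per paper” as the total across all fact-types.

```python
PAPERS_PER_DAY = 6000        # CrossRef's estimate of daily journal articles
FACTS_PER_PAPER = 10         # the guess made above
TARGET_FACTS = 100_000_000   # ContentMine's stated goal

facts_per_day = PAPERS_PER_DAY * FACTS_PER_PAPER
days_to_target = TARGET_FACTS / facts_per_day

print(facts_per_day)          # 60000
print(round(days_to_target))  # 1667
```

At that rate the daily stream alone would take between four and five years to reach 100 million facts, so the number of facts found per paper matters a great deal.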

So we’ve ordered servers capable of managing the computational and disk loads for this daily activity.

And Cambridge could become an example site for the Hargreaves-enabled UK and an inspiration for reformers in the EU to get the laws changed.

Explaining CrossRef to ContentMiners


[I’ll publish a second blog explaining how CrossRef and ContentMine will be working together. This is about the mechanics of use, which are intricate.]

Yesterday I travelled to CrossRef to explore how we can work together to mine the literature for science.

We are Crossref, a not-for-profit membership organization for scholarly publishing working to make content easy to find, link, cite and assess. We do it in five ways: rallying the community; tagging the metadata; running a shared infrastructure; playing with new technology; and making tools and services to improve research communications.

CrossRef plays a critically important role in Scholarly infrastructure. It receives notification of (most) scholarly publications (e.g. journal articles), manages the metadata, and provides ways for everyone to access the metadata. Geoff Bilder runs CrossRef and last week at OpenCon gave a sensationally memorable talk about how scholarly infrastructure should be Open/Free. Geoff and I see eye-to-eye on everything that matters and we are very fortunate that he is running this service.

I learnt a great deal about CrossRef yesterday. It’s effectively a publishers’ organization (though libraries etc. can join) and numerically it’s dominated by a long tail of small publishers. There’s a board – this is however dominated by large publishers. Unlike (say) STMPublishers Association or CopyrightClearanceCenter, whose primary effect is to support publishers, CrossRef provides a real interface between publishers and readers (people and machines who read the literature).

CrossRef provides technology and help on how to access the literature. I warn potential users that they should read very carefully as there are publisher click-through licences that are unrenounceable. IMO this is very serious: people may click these without realising it, and then be bound for all time. I have not consciously clicked any and will not consciously click any. I spent quite a lot of time yesterday exploring what CrossRef require for the use of its service and what is additionally added by publishers. (I also query whether individuals can legally sign a click-through that relates to an organizational subscription – has your library authorised click-through for you?)

The CrossRef service provides a RESTful API (see their Github repository) which has a rich set of options, based essentially on metadata (titles, journals, authors, etc.) and not on fulltext, e.g.

Multiple filters can be specified in a single query. In such a case, different filters will be applied with AND semantics, while specifying the same filter multiple times will result in OR semantics – that is, specifying the filters:

  • is-update:true
  • from-pub-date:2014-03-03
  • funder:10.13039/100000001
  • funder:10.13039/100000050

would locate documents that are updates, were published on or after 3rd March 2014 and were funded by either the National Science Foundation (10.13039/100000001) or the National Heart, Lung, and Blood Institute (10.13039/100000050). These filters would be specified by joining each filter together with a comma:
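The comma-joined filter described in the quoted documentation can be built as in the sketch below. The filter names and values are the ones listed above; the base URL `https://api.crossref.org/works` is my assumption about the public endpoint and should be checked against the current CrossRef documentation.

```python
from urllib.parse import urlencode

# Filters from the example above, as (name, value) pairs. Repeating a name
# (funder) gives OR semantics; different names are combined with AND.
filters = [
    ("is-update", "true"),
    ("from-pub-date", "2014-03-03"),
    ("funder", "10.13039/100000001"),
    ("funder", "10.13039/100000050"),
]


def build_filter(pairs):
    """Join name:value pairs with commas, as the documentation describes."""
    return ",".join(f"{name}:{value}" for name, value in pairs)


filter_value = build_filter(filters)

# Assumption: the public works endpoint; verify before relying on it.
BASE_URL = "https://api.crossref.org/works"

# Keep ':', ',' and '/' unescaped so the URL stays readable.
query_url = f"{BASE_URL}?{urlencode({'filter': filter_value}, safe=':,/')}"
```

Nothing here touches the network; the sketch only shows how the four filters collapse into a single `filter` parameter.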


Before proceeding I’ll try to explain what I think is the current position and then correct this document if I am wrong in details.

  • An “API” is an agreed specification for retrieving information from a server, normally with a URL that implements this specification. An API is often the best technical way of accessing information but a considerable amount has to be taken on trust (“am I seeing the whole data?”, “am I anonymous to the provider?”, “how stable is the information over time?”). I am, for example, prepared to use EuropePMC’s API; I am not prepared to use Elsevier’s because it requires unacceptable licence conditions.

  • A licence is a legally binding contract between parties. My understanding is that CrossRef does not require anyone to sign CrossRef licences and I have not done so. Licences are a major sticking point between miners and publishers. This was the impasse in “Licences4Europe” and it has not been resolved. I, along with other authors and signatories to the Hague Declaration, believe that “The Right to Read is the Right to Mine”. Many publishers wish to impose licences that limit what and how we can mine, and may also allow charges and quotas to be imposed.

  • CrossRef is effectively acting as an agent for a number of publishers in providing services where miners may (not necessarily must) sign licences through click-through buttons. You should distinguish very carefully who has issued the licence and whether you should sign it.

Publishers have seriously confused APIs and Licences. They promote APIs as benefitting the miner, while omitting to point out that the miner has to sign an additional licence where they give up some or all of their rights. Many publishers also imply or state that it is illegal to scrape their landing pages and that the miner must use their API and therefore sign licences. Since Licences are very prominent on CrossRef’s site I warn miners that they should always find out whether they are mandatory, and if so challenge and refuse to sign them.

A (licence) token is a key that allows a particular miner to access content on a publisher’s site. This normally operates when the content is paywalled. The miner can obtain a token by:

  • specifying the publisher that they wish to mine

  • identifying themselves and/or their institution to (or through) CrossRef so their right to access can be checked by the publisher

  • receiving and storing the (multi-character) token – effectively a machine-readable key. I am not clear how long tokens live for. The token is then included in the query to a given publisher.

A researcher/miner therefore creates a query (without tokens) that expresses what they want and possibly adds tokens if required. The CrossRef API then performs the query and can return a number of fields (I am limiting discussion to journal articles):

  • a list of URLs (DOIs) that point to the fulltext of target articles and allow them to be accessed through the publisher-site. These URLs may represent a different access point from those from the landing page (web page). In principle the content from that access point should be the same as from the landing page (this is certainly not true for Elsevier at present, where PDFs and images are only available from the landing page, and XML only from the miningAPI).

  • The licence/s for that article (if a licence specifically exists)

  • The token allowing the miner to access the full text
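To make the returned fields a little more concrete, here is a sketch of pulling full-text URLs and licence URLs out of one metadata record. The record is hand-written with placeholder values, and the field names (`link`, `license`, `intended-application`) reflect my understanding of the works metadata, which should be treated as an assumption. Tokens are not modelled here.

```python
# Hypothetical, hand-written record shaped like a CrossRef works item.
# The DOI, URLs and values are placeholders, not real API output.
sample_item = {
    "DOI": "10.0000/example",
    "link": [
        {
            "URL": "https://publisher.example/fulltext.xml",
            "content-type": "application/xml",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://publisher.example/fulltext.pdf",
            "content-type": "application/pdf",
            "intended-application": "similarity-checking",
        },
    ],
    "license": [
        {"URL": "https://publisher.example/licence", "content-version": "vor"},
    ],
}


def mining_links(item):
    """Return the full-text URLs that the record marks as meant for text mining."""
    return [
        link["URL"]
        for link in item.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]


def licence_urls(item):
    """Return the licence URLs attached to the record, if any."""
    return [lic["URL"] for lic in item.get("license", [])]


links = mining_links(sample_item)
licences = licence_urls(sample_item)
```

Records with no `link` or `license` entry simply give empty lists, which matches the “if a licence specifically exists” caveat above.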

If this sounds complex, that’s because it is not simple and depends critically on details. It’s also not something that most people are involved in. To recap:

  • If the miner is prepared to sign away some of their rights: they list the publishers they are prepared to sign up to, create a query, receive the list of resultant metadata results (bibliography, titles, authors, link to landing pages?) and fulltext URLs and then download or otherwise search the fulltext on the publisher’s sites. They must remember that the licences may severely restrict what they can do at all stages of the process.

  • If the miner is not prepared to sign away their rights: they submit a query as above, and (I believe) get back the same list of metadata, but without the licences and tokens. They are then technically able to use the URLs to scrape the publishers’ landing pages. Whether the publisher will try to stop them by legal and technical means we shall probably find out.

There is no formal limit on how and why the miner can use CrossRef’s services.

Extracting 100 million facts from the Scientific literature -1

In TheContentMine, funded by the Shuttleworth Foundation, we aim to extract 100 million facts from the scientific literature. That’s a large number so here’s our thinking….

What is a fact?


Common Mist Frog (Litoria rheocola) of northern Australia. Credit: Damon Ramsey / Wikimedia / CC BY-SA

“Fact” means slightly different things to philosophers, lawyers, scientists and advertisers. Wikipedia describes facts in science,

a fact is a repeatable careful observation or measurement (by experimentation or other means), also called empirical evidence. Facts are central to building scientific theories.


a scientific fact is an objective and verifiable observation, in contrast with a hypothesis or theory, which is intended to explain or interpret facts.

In ContentMine we highlight the “objective”, i.e. people will agree that they are talking about the same things in the same way. We concentrate on facts reported in scientific papers, and my colleague Ross Mounce showed (Daily updates on IUCN Red List species) some excellent examples about a mistfrog [1]. Here are some examples he quoted:

  • Litoria rheocola is a small treefrog (average male body size: 2.0 g, 31 mm; average female body size: 3.1 g, 36 mm [20])
  • the common mistfrog (Litoria rheocola), an IUCN Endangered species [22] that occurs near rocky, fast-flowing rainforest streams in northeastern Queensland, Australia [23]…
  • we tracked frogs using harmonic direction finding [32,33].
  • individuals move along and at right angles to the stream
  • Fig 3. Distribution of estimated body temperatures of common mistfrogs (Litoria rheocola) within categories relevant to Batrachochytrium dendrobatidis growth in culture (<15°C, 15–25°C, >25°C). [Figure]

All of the above, including the components of the graph, are FACTS. They have the features:

  • they are objective. They may or may not be “true” – another author might dispute the sizes of the frogs or where they live – but the authors have stated them as facts.
  • they can be represented in formal representations without losing meaning or precision. There are normally very few different ways of representing such facts. “Alauda arvensis sing on the wing” is a fact. “Hail to thee blithe spirit, bird thou never wert” is not a fact.
  • they are uncopyrightable. We contend that all the facts we extract are uncopyrightable statements and therefore release them as CC0.

How do we represent facts? Generally they are a mixture of simple natural language statements and formal specifications. “A brown frog that lives in Queensland” is adequate; “L. rheocola. colour: brown; habitat: Queensland” says the same, slightly more formally.  Formal language is useful for us as it’s easier to extract. The form:

object: name;    property1: value1 ; property2: value2

is very common and very useful. Often it’s put in a table, graph or diagram. Transforming between these is one of the strengths of ContentMine software. The box plots could be put in words: “In winter in Windin Creek between 0 and 12% of the frogs had body temperatures below 15 Celsius”, but the plot may be more useful to some scientists (note the redundancy!).
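The “object: name; property: value” form is easy to turn into a data structure, which is part of why it is convenient to extract. The function below is a minimal sketch of that idea and is not taken from the ContentMine code; the function name is hypothetical.

```python
def parse_record(text):
    """Split 'key: value; key: value' text into a dict of stripped strings."""
    record = {}
    for part in text.split(";"):
        if not part.strip():
            continue  # tolerate a trailing or doubled semicolon
        key, sep, value = part.partition(":")
        if not sep:
            raise ValueError(f"no ':' in {part!r}")
        record[key.strip()] = value.strip()
    return record


fact = parse_record("object: L. rheocola;    colour: brown ; habitat: Queensland")
```

The irregular spacing in the input is deliberate: the same record comes out however the author laid it out.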

So the scientific observations – temperatures, locations, dates – are all Facts. The sentence contains 7 factual concepts: winter, Windin Creek, 0%, 12%, L. rheocola, body temperature, < 15 C. In ContentMine we refer to all of these as “Facts”. Perhaps more formally we might say:
section-i/para-j/sentence-k in DOIjournal.pone.0127851 contains Windin Creek

section-i/para-j/sentence-k in DOIjournal.pone.0127851 contains L. rheocola

Those who like RDF (we sometimes use it)  may regard these as triples (document-component contains entity). In similar manner the linked data as in Wikidata should be regarded as Facts (which is why we are working with Wikidata to export extracted facts there).
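The “document-component contains entity” statements can be held as plain triples before any RDF serialisation. The sketch below is one possible shape, with names of my own choosing. The DOI is the one given for the mistfrog paper in reference [1]; the `DOI:` prefix in the subject string is only a formatting assumption.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def contains_triples(doi, component, entities):
    """Build one (document-component, contains, entity) triple per entity."""
    subject = f"{component} in DOI:{doi}"
    return [Triple(subject, "contains", entity) for entity in entities]


triples = contains_triples(
    "10.1371/journal.pone.0127851",
    "section-i/para-j/sentence-k",
    ["Windin Creek", "L. rheocola"],
)
```

Replacing the entity strings with Wikidata IDs would turn these into statements that can be linked to other data.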

How many facts does a scientific paper contain?

Every entity in a document is a Fact. Each author, each species, each temperature, date, colour. A graph may have 100 facts, or 1000. Perhaps 100 facts / page? A 10-page paper might have 1000 facts. Some chemistry papers have 300 pages of supporting information. So if we read 1 million papers we might get a billion facts – our estimate of 100 million is not hyperbole.
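The estimate can be written out explicitly. All three inputs are the rough guesses made in the paragraph above, not measurements.

```python
FACTS_PER_PAGE = 100      # rough guess from the text
PAGES_PER_PAPER = 10      # rough guess from the text
PAPERS_READ = 1_000_000
TARGET_FACTS = 100_000_000

facts_per_paper = FACTS_PER_PAGE * PAGES_PER_PAPER
total_facts = facts_per_paper * PAPERS_READ

# Papers needed to reach the target at this density.
papers_for_target = TARGET_FACTS // facts_per_paper
```

On these guesses the 100 million target needs only 100,000 papers, a tenth of the million-paper scenario.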

[1] reported in PloSONE (Roznik EA, Alford RA (2015) Seasonal Ecology and Behavior of an Endangered Rainforest Frog (Litoria rheocola) Threatened by Disease. PLoS ONE 10(5): e0127851. doi:10.1371/journal.pone.0127851).



Daily updates on IUCN Red List species

The ContentMine team have been working on user stories for our daily stream of facts. One such stand-out user story that we can easily cater-for with our tools is that of conservation biologists & practitioners looking to stay up-to-date with the very latest literature on IUCN Red List species.

Take for instance the journal PLOS ONE. It’s open access but the high volume and broad subject scope mean that people sometimes struggle to keep-up with relevant content published there. Recently (May, 2015) an interesting article on an endangered species of frog was published in PLOS ONE; we shall use this as an example in this post henceforth: Roznik EA, Alford RA (2015) Seasonal Ecology and Behavior of an Endangered Rainforest Frog (Litoria rheocola) Threatened by Disease. PLoS ONE 10(5): e0127851. doi:10.1371/journal.pone.0127851


Common Mist Frog (Litoria rheocola) of northern Australia. Credit: Damon Ramsey / Wikimedia / CC BY-SA


This kind of peer-reviewed, published information is vitally important to conservation organisations. Typically, the Red List status of many groups is assessed and re-assessed by experts only every 5 years. It is extremely expensive, time-consuming, and tedious for humans to do these kinds of systematic literature reviews. We suggest that intelligent machines should do most of this screening work instead.

We think we could make the literature review process cheaper, more rigorous, continuous and transparent by publishing a daily stream of facts related to all Red List species. For the above paper, our 26 summary snippet facts extracted from the full text, labelled by section, might look something like the below (bold emphasis is mine to highlight the entity we would match). Note this reduces the full text from over 6000 words to a more bite-size summary of just ~700. Multiply this effect across thousands of papers and searches for thousands of different species and you might begin to understand the usefulness of this:

* Note that because PLOS ONE is an openly-licensed journal we can re-post as much context around each entity as we wish.


Text as extracted by our ami-species plugin :

From the Introduction section:

  • ...One such species is the common mistfrog (Litoria rheocola), an IUCN Endangered species [22] that occurs near rocky, fast-flowing rainforest streams in northeastern Queensland, Australia [23]…
  • Litoria rheocola is a small treefrog (average male body size: 2.0 g, 31 mm; average female body size: 3.1 g, 36 mm [20])
  • Habitat modification and fragmentation also threaten L.rheocola [23,27]...
  • Very little is known of the ecology and behavior of L. rheocola. Individuals of this species call and breed year-round, although reproductive behavior decreases during the coolest weather [19,23]…
  • We used harmonic direction finding [32,33] to track individual L. rheocola and study patterns of movement, microhabitat use, and body temperatures during winter and summer…
  • The goal of our study was to understand the behavior of L. rheocola, and how it is affected by season and by sites that vary in elevation.
  • The goal of our study was to understand the behavior of L. rheocola, and how it is affected by season and by sites that vary in elevation…
  • We provide the first detailed information on the ecology and behavior of L. rheocola and suggest ecological mechanisms for observed patterns of infection dynamics…

and from other sections:


  • Because L. rheocola are too small to carry radiotransmitters, we tracked frogs using harmonic direction finding [32,33].
  • However, this was unlikely to cause a bias toward shorter movements in our study; L. rheocola has strong site fidelity and when a frog was not found on a particular survey (or surveys), it was almost always subsequently found less than 2 m from its most recent known location.
  • Litoria rheocola is a treefrog, and individuals move along and at right angles to the stream and also climb up and down vegetation; therefore, they use all three dimensions of space, with their directions of movement largely unconstrained in the horizontal plane but largely restricted to movements up and down individual plants in the vertical direction.
  • These models lose and gain water at rates similar to frogs, and temperatures obtained from these permeable models are closely correlated with L. rheocola body temperatures [43].


  • Fig 1. Distances moved by common mistfrogs (Litoria rheocola).
  • Fig 2. Proximity of common mistfrogs (Litoria rheocola) to the stream.
  • Fig 3. Distribution of estimated body temperatures of common mistfrogs (Litoria rheocola) within categories relevant to Batrachochytrium dendrobatidis growth in culture (<15°C, 15–25°C, >25°C).
  • Fig 4. Mean estimated body temperatures of common mistfrogs (Litoria rheocola) over the 24-hr diel period.
  • Table 1. Characteristics of microhabitats used by common mistfrogs (Litoria rheocola) during the day and night at two rainforest streams (Frenchman Creek and Windin Creek) during the winter (cool/dry season) and summer (warm/wet season).
  • Table 2. Results of separate one-way ANOVAs comparing characteristics of nocturnal perch sites that were available to and used by common mistfrogs (Litoria rheocola) at two rainforest streams (Frenchman Creek and Windin Creek) during winter (cool/dry season).


  • Our study provides the first detailed information on the ecology and behavior of the common mistfrog (Litoria rheocola), an IUCN Endangered species [22].
  • Overall, we found that L.rheocola are relatively sedentary frogs that are restricted to the stream environment, and prefer sections of the stream with riffles, numerous rocks, and overhanging vegetation (Table 2).
  • Our data confirm that L. rheocola are active year-round, but their behavior varies substantially between seasons.
  • Retallick [31] also found that juvenile and adult L. rheocola in field enclosures altered their behavior by season in similar ways; frogs used elevated perches more often in summer, and aquatic microhabitats more often during winter.
  • Additionally, Hodgkison and Hero [30] observed more L. rheocola at the stream during warmer months, suggesting that during that period frogs used perch sites that were more exposed and elevated than those used during cooler months, when frogs were seen less frequently.
  • The sedentary behavior of L. rheocola also may increase the vulnerability of this species to chytridiomycosis, particularly during winter, when movements are reduced.
  • Our results suggest that seasonal differences in environmental temperatures and L. rheocola body temperatures should cause this species to be more likely to develop B. dendrobatidis infections during cooler months and at higher elevations (Figs 3 and 4); this matches observed patterns of infection prevalence [9].
  • Our study provides detailed information on the movements, microhabitat use, and body temperatures of uninfected L. rheocola, and reveals how these behaviors differ by season and between sites varying in elevation.


NOTE: We have shown all phrases in the PLoSONE paper about L. rheocola. However our ami software allows us to select particular sections for display, or to restrict our search and filtering (e.g. to the Methods section).
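Section-restricted filtering can be pictured as below. This is a sketch of the idea only and does not use or describe the ami interface. The section labels attached to each sentence are my assumptions, and the third sentence is invented to give a non-matching case.

```python
import re

# Matches the full binomial or the abbreviated form, with or without a space
# after the full stop (the paper uses both "L. rheocola" and "L.rheocola").
SPECIES = re.compile(r"\b(?:Litoria|L\.)\s*rheocola\b")

# (section label, sentence) pairs; labels are assumed for illustration.
sentences = [
    ("Introduction",
     "Habitat modification and fragmentation also threaten L.rheocola [23,27]."),
    ("Methods",
     "Because L. rheocola are too small to carry radiotransmitters, "
     "we tracked frogs using harmonic direction finding [32,33]."),
    ("Methods", "Surveys were carried out at two streams."),  # invented filler
]


def select(pairs, pattern, section=None):
    """Keep sentences that mention the species, optionally within one section."""
    return [
        (sec, text)
        for sec, text in pairs
        if (section is None or sec == section) and pattern.search(text)
    ]


all_hits = select(sentences, SPECIES)
methods_hits = select(sentences, SPECIES, section="Methods")
```

The same pattern could be generated for every species on the Red List, so one pass over a paper serves thousands of searches.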

This approach provides far far more information than is indicated in the abstract for the paper. Yet it also condenses it down to useful & relevant facts. We think this could be very useful to many…

To see the full stream of facts output by the ami-species plugin go to http://facts.contentmine.org/. It isn’t filtered specifically for IUCN RedList species yet, but if you’re interested in seeing this happen or something similar, please get in contact with us over at the forum: http://discuss.contentmine.org/ or via twitter @theContentMine.

ContentMine tools produce a supertree

Using ContentMine tools, the BBSRC-funded PLUTo project (Mounce, Murray-Rust, Wills) has created what we believe is the first ever machine-compiled supertree created entirely from figure images. To be precise, we accurately extracted phylogenetic relationships from 924 figure images published in the journal IJSEM to create this formal synthesis of knowledge.

Here it is below for your viewing pleasure, represented as an image, in radial format. It’s rather big, including 2269 taxa, so it’s a challenge just to fit it all in one screen!

So what next? Are we done? No. To give a sense of validity, we need to compare our machine-compiled supertree to trees of similar taxon-coverage such as the NCBI taxonomy tree and the SILVA ‘Living Tree Project’.  We have already done some work on the former. Our supertree as measured using Robinson-Foulds distance is only 1691 units different to the NCBI taxonomy tree composed of the same taxa. More randomisation work is needed to determine whether this distance is significantly different from that of any random tree with the same tip labels.
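For readers unfamiliar with the Robinson-Foulds distance, the sketch below computes it for two small rooted trees written as nested tuples: it counts the clades that appear in one tree but not the other. This is the rooted, clade-based variant (unrooted trees are usually compared by bipartitions instead) and it is not the code used in the PLUTo project.

```python
def clades(tree):
    """Return (tip labels, set of clades with more than one tip) for a tree."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade rooted at this node
    return leaves, found


def rf_distance(tree_a, tree_b):
    """Count clades present in exactly one of two trees on the same tips."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must have the same tip labels")
    return len(clades_a ^ clades_b)


tree_a = (("A", "B"), ("C", "D"))
tree_b = (("A", "C"), ("B", "D"))

same = rf_distance(tree_a, tree_a)
different = rf_distance(tree_a, tree_b)
```

The root clade is shared by any two trees on the same tips, so it never contributes to the count.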


It’s also time to start writing it all up for formal publication. We hope this will encourage more scientists to publish machine-readable phylogenetic data rather than just figure images, as syntheses like this would be much easier to produce from proper machine-readable phylogenetic data formats, rather than pixel-based raster graphics as we have used for this supertree. We also hope this provides a compelling example of research that you can do with a content mining philosophy; published academic literature is there not just to be read, but to be mined, and re-used too!