Daily updates on IUCN Red List species

The ContentMine team have been working on user stories for our daily stream of facts. One stand-out user story that we can easily cater for with our tools is that of conservation biologists and practitioners looking to stay up to date with the very latest literature on IUCN Red List species.

Take, for instance, the journal PLOS ONE. It’s open access, but its high volume and broad subject scope mean that people sometimes struggle to keep up with relevant content published there. Recently (May 2015) an interesting article on an endangered species of frog was published in PLOS ONE; we shall use it as the example throughout this post: Roznik EA, Alford RA (2015) Seasonal Ecology and Behavior of an Endangered Rainforest Frog (Litoria rheocola) Threatened by Disease. PLoS ONE 10(5): e0127851. doi:10.1371/journal.pone.0127851


Common Mist Frog (Litoria rheocola) of northern Australia. Credit: Damon Ramsey / Wikimedia / CC BY-SA


This kind of peer-reviewed, published information is vitally important to conservation organisations. Typically, the Red List status of many groups is assessed and re-assessed by experts only every 5 years. It is extremely expensive, time-consuming, and tedious for humans to do these kinds of systematic literature reviews. We suggest that intelligent machines should do most of this screening work instead.

We think we could make the literature review process cheaper, more rigorous, continuous, and transparent by publishing a daily stream of facts related to all Red List species. For the above paper, the 26 summary snippet facts extracted from the full text, labelled by section, might look something like the list below (bold emphasis is mine, to highlight the entity we would match). Note that this reduces the full text from over 6000 words to a more bite-size summary of just ~700. Multiply this effect across thousands of papers and searches for thousands of different species and you might begin to understand how useful this could be:

* Note that because PLOS ONE is an openly-licensed journal we can re-post as much context around each entity as we wish.


Text as extracted by our ami-species plugin:

From the Introduction section:

  • ...One such species is the common mistfrog (Litoria rheocola), an IUCN Endangered species [22] that occurs near rocky, fast-flowing rainforest streams in northeastern Queensland, Australia [23]…
  • Litoria rheocola is a small treefrog (average male body size: 2.0 g, 31 mm; average female body size: 3.1 g, 36 mm [20])
  • Habitat modification and fragmentation also threaten L.rheocola [23,27]...
  • Very little is known of the ecology and behavior of L. rheocola. Individuals of this species call and breed year-round, although reproductive behavior decreases during the coolest weather [19,23]…
  • We used harmonic direction finding [32,33] to track individual L. rheocola and study patterns of movement, microhabitat use, and body temperatures during winter and summer…
  • The goal of our study was to understand the behavior of L. rheocola, and how it is affected by season and by sites that vary in elevation.
  • The goal of our study was to understand the behavior of L. rheocola, and how it is affected by season and by sites that vary in elevation…
  • We provide the first detailed information on the ecology and behavior of L. rheocola and suggest ecological mechanisms for observed patterns of infection dynamics…

and from other sections:


  • Because L. rheocola are too small to carry radiotransmitters, we tracked frogs using harmonic direction finding [32,33].
  • However, this was unlikely to cause a bias toward shorter movements in our study; L. rheocola has strong site fidelity and when a frog was not found on a particular survey (or surveys), it was almost always subsequently found less than 2 m from its most recent known location.
  • Litoria rheocola is a treefrog, and individuals move along and at right angles to the stream and also climb up and down vegetation; therefore, they use all three dimensions of space, with their directions of movement largely unconstrained in the horizontal plane but largely restricted to movements up and down individual plants in the vertical direction.
  • These models lose and gain water at rates similar to frogs, and temperatures obtained from these permeable models are closely correlated with L. rheocola body temperatures [43].


  • Fig 1. Distances moved by common mistfrogs (Litoria rheocola).
  • Fig 2. Proximity of common mistfrogs (Litoria rheocola) to the stream.
  • Fig 3. Distribution of estimated body temperatures of common mistfrogs (Litoria rheocola) within categories relevant to Batrachochytrium dendrobatidis growth in culture (<15°C, 15–25°C, >25°C).
  • Fig 4. Mean estimated body temperatures of common mistfrogs (Litoria rheocola) over the 24-hr diel period.
  • Table 1. Characteristics of microhabitats used by common mistfrogs (Litoria rheocola) during the day and night at two rainforest streams (Frenchman Creek and Windin Creek) during the winter (cool/dry season) and summer (warm/wet season).
  • Table 2. Results of separate one-way ANOVAs comparing characteristics of nocturnal perch sites that were available to and used by common mistfrogs (Litoria rheocola) at two rainforest streams (Frenchman Creek and Windin Creek) during winter (cool/dry season).


  • Our study provides the first detailed information on the ecology and behavior of the common mistfrog (Litoria rheocola), an IUCN Endangered species [22].
  • Overall, we found that L.rheocola are relatively sedentary frogs that are restricted to the stream environment, and prefer sections of the stream with riffles, numerous rocks, and overhanging vegetation (Table 2).
  • Our data confirm that L. rheocola are active year-round, but their behavior varies substantially between seasons.
  • Retallick [31] also found that juvenile and adult L. rheocola in field enclosures altered their behavior by season in similar ways; frogs used elevated perches more often in summer, and aquatic microhabitats more often during winter.
  • Additionally, Hodgkison and Hero [30] observed more L. rheocola at the stream during warmer months, suggesting that during that period frogs used perch sites that were more exposed and elevated than those used during cooler months, when frogs were seen less frequently.
  • The sedentary behavior of L. rheocola also may increase the vulnerability of this species to chytridiomycosis, particularly during winter, when movements are reduced.
  • Our results suggest that seasonal differences in environmental temperatures and L. rheocola body temperatures should cause this species to be more likely to develop B. dendrobatidis infections during cooler months and at higher elevations (Figs 3 and 4); this matches observed patterns of infection prevalence [9].
  • Our study provides detailed information on the movements, microhabitat use, and body temperatures of uninfected L. rheocola, and reveals how these behaviors differ by season and between sites varying in elevation.


NOTE: We have shown all phrases in the PLOS ONE paper about L. rheocola. However, our ami software allows us to select particular sections for display, or to restrict our search and filtering to particular sections (e.g. the Methods section).

This approach provides far more information than the paper’s abstract indicates, yet it also condenses the paper down to useful and relevant facts. We think this could be very useful to many…

To see the full stream of facts output by the ami-species plugin, go to http://facts.contentmine.org/. It isn’t filtered specifically for IUCN Red List species yet, but if you’re interested in seeing this or something similar happen, please get in contact with us over at the forum: http://discuss.contentmine.org/ or via Twitter @theContentMine.

Audio interview with Peter Murray-Rust on the Data Skeptic Podcast (53 minutes)

In August, Peter Murray-Rust agreed to do an interview with Kyle Polich at Data Skeptic, “The podcast that is skeptical of and with data”. The interview was published online on 28th August 2015.

Data Skeptic is a podcast that alternates between short mini episodes with the host explaining concepts from data science to his non-data scientist wife, and longer interviews featuring practitioners and experts on interesting topics related to data, all through the eye of scientific skepticism.

ContentMine is a project which provides the tools and workflow to convert scientific literature into machine readable and machine interpretable data in order to facilitate better and more effective access to the accumulated knowledge of human kind. The program’s founder Peter Murray-Rust joins us this week to discuss ContentMine. Our discussion covers the project, the scientific publication process, copyright, and several other interesting topics.

Full transcript available here: ContentMine full transcript.

The draft transcript is ~98% accurate and Peter will edit this in due course.



Both the audio and transcript are licensed under Creative Commons CC-BY.

With so much valuable content, we shall break this down into more manageable segments in due course, but for now, enjoy the interview in full.

Just some of the topics covered:-


Finding species in the literature haystack

Processing academic research literature at scale is an interesting problem: there’s a lot of it.

There are literally millions of papers published every year across many tens of thousands of journals and no one person has time to read them all, with their own eyes at least.

But with computers, the Internet, and some clever code, we can definitely tackle this data deluge. There’s no such thing as ‘too much science’ – not when it comes to research papers. The volume, bandwidth, or disk space required simply isn’t challenging in 2015. You can download over a million papers from the PubMed Central open access subset with relative ease. It’s not ‘big data’, just gigabytes.


In this post, I’m going to outline one specific use-case of the ContentMine toolchain, taking you through the process from ‘haystack’ to ‘needles’. In this demonstration, the ‘haystack’ is a set of 49 PLOS ONE papers, and the ‘needles’ (data) I want to tease out from this mini-mass of literature are binomial nomenclature – the scientific names given by biologists to organisms e.g. Homo sapiens, Gorilla gorilla, and E. coli.


Step One: downloading a sample

For demonstration purposes, rather than taking the whole of the open access subset, we’re just going to analyse 49 PLOS ONE papers, though the same process could easily scale to far more.

# download 49 papers as XML
$ getpapers --query 'species JOURNAL:"PLOS ONE" AND FIRST_PDATE:[2015-04-02 TO 2015-04-02]' -x --outdir plos-species
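
Before moving on, it can be worth checking that the download produced what you expect. The snippet below is only a sketch using plain shell, not part of the ContentMine tools: count_fulltext is a hypothetical helper name, and it assumes getpapers wrote one directory per paper with a fulltext.xml inside, which is the layout the later commands in this post rely on.

```shell
# count_fulltext DIR
# Hypothetical helper: print how many paper directories directly under DIR
# contain a fulltext.xml file (assumed layout: DIR/<paper-id>/fulltext.xml).
count_fulltext() {
  n=0
  for f in "$1"/*/fulltext.xml; do
    # An unmatched glob stays literal, so test that the file really exists.
    if [ -f "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}
```

Running count_fulltext plos-species after the download should print the number of papers that actually arrived as XML.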


Step Two: normalizing the content

This step isn’t particularly exciting, but it’s an important one. Content is published in a variety of ways and styles, and from many different providers – thus for consistency of analysis the ‘raw’ content from step one must first be normalized prior to any analysis.

# normalize before you analyze!
$ norma -q plos-species/ -i fulltext.xml -o scholarly.html --transform nlm2html

Our normalization tool is called norma, and it’s very dependable. You can point it at a directory output by getpapers containing any number of papers, from one to a million, and norma will chug away doing all the necessary pre-processing in next to no time at all. The output from this stage is formatted as Scholarly HTML, hence the “--transform” switch, which transforms the input from NLM XML to (scholarly) HTML: nlm2html for short. norma supports other types of transform too.
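
If you want to confirm that every paper made it through this step, a quick check along the following lines may help. Again this is a sketch in plain shell rather than a norma feature: missing_normalized is a hypothetical name, and the file names fulltext.xml and scholarly.html are taken from the norma command above.

```shell
# missing_normalized DIR
# Hypothetical helper: list paper directories under DIR that have a
# fulltext.xml but no scholarly.html, i.e. papers not yet normalized.
missing_normalized() {
  for d in "$1"/*/; do
    if [ -f "${d}fulltext.xml" ] && [ ! -f "${d}scholarly.html" ]; then
      echo "$d"
    fi
  done
}
```

No output from missing_normalized plos-species means every downloaded paper has a scholarly.html alongside it.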


Step Three: filtering out the names

This is a more sophisticated step. There’s a lot going on ‘under the hood’ with this command. It’s not just parsing for regular expressions or dictionary-based lookup. ami2-species attempts to find all binomials, isolated genus names, and even abbreviated binomens, e.g. E. coli. It’s also just one of many of the ContentMine ami-plugins.

# Find the needles in the haystack, with some context too
$ ami2-species -q ./plos-species -i scholarly.html --sp.species --context 35 50 --sp.type binomial genus genussp

Don’t be afraid if you see some error messages. The default settings are quite verbose at the moment.



The content of each of the results.xml files for each of the papers looks something like this:

$ cat ./plos-species/PMC4383566/results/species/binomial/results.xml 
<?xml version="1.0" encoding="UTF-8"?>
<results title="binomial">
 <result pre="val (in the dry treatment) between " exact="Ricinodendron heudelotii" match="Ricinodendron heudelotii" post=" seedlings in the first and second batches of the " name="binomial"/>
 <result pre="-pioneer light demanding species ( " exact="Albizia zygia" match="Albizia zygia" post=", Pericopsis elata, Pouteria aningeri&lt;/i" name="binomial"/>
 <result pre="ng species ( Albizia zygia, " exact="Pericopsis elata" match="Pericopsis elata" post=", Pouteria aningeri, Sterculia rhinopeta" name="binomial"/>
 <result pre="ygia, Pericopsis elata, " exact="Pouteria aningeri" match="Pouteria aningeri" post=", Sterculia rhinopetala, Piptadeniastrum" name="binomial"/>
 <result pre="ata, Pouteria aningeri, " exact="Sterculia rhinopetala" match="Sterculia rhinopetala" post=", Piptadeniastrum africanum). " name="binomial"/>
 <result pre="/i&gt;, Sterculia rhinopetala, " exact="Piptadeniastrum africanum" match="Piptadeniastrum africanum" post="). " name="binomial"/>


It’s now easy to summarize findings across papers: how many of these 49 papers contain scientific names of organisms? 39 apparently.

# loop to print the total number of lines in each paper's results files; a count of 6 means 0 name hits
$ for i in ./plos-species/*/ ; do echo "$i" ; find "$i" -type f -name 'results.xml' | xargs cat | wc -l ; done
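
Counting lines works, but you have to remember what an empty result looks like. As an alternative sketch, the functions below count the result elements themselves. count_hits and papers_with_hits are hypothetical names, and the approach assumes each result sits on its own line, as in the excerpt above.

```shell
# count_hits DIR
# Hypothetical helper: print the number of lines containing a <result .../>
# element in all results.xml files under DIR (one result per line assumed).
count_hits() {
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  find "$1" -type f -name 'results.xml' -exec cat {} + | grep -c '<result ' || true
}

# papers_with_hits DIR
# Hypothetical helper: print how many paper directories directly under DIR
# have at least one hit.
papers_with_hits() {
  n=0
  for d in "$1"/*/; do
    [ -d "$d" ] || continue
    if [ "$(count_hits "$d")" -gt 0 ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}
```

papers_with_hits plos-species should give the same answer as the line-count loop, without having to know how many lines an empty set of results files contains.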


The ContentMine team plans to add functionality to make it easier to determine taxon co-occurrence and taxon name lookup, as well as easier ways of visualising the results. Watch this space!
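
In the meantime, a rough tally of which names turn up most often across the sample is possible with standard Unix tools. This is a sketch, not a ContentMine feature: tally_names is a hypothetical name, and it assumes one result element per line with the matched name in the exact attribute, as in the excerpt above.

```shell
# tally_names DIR
# Hypothetical helper: print each distinct value of the exact="..." attribute
# found in results.xml files under DIR, most frequent first, with its count.
tally_names() {
  find "$1" -type f -name 'results.xml' -exec cat {} + |
    sed -n 's/.* exact="\([^"]*\)".*/\1/p' |
    sort | uniq -c | sort -rn
}
```

Note that this treats an abbreviated form and the full binomial as different names, so the counts are only a first approximation.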

The point of this post is to show that in three lines, using three ContentMine tools, you can effectively filter the literature for species. No information overload here.
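
If you only care about a particular set of organisms (a list of threatened species, say), the results can be narrowed further. The sketch below is plain shell and awk, not an ami option: filter_watchlist is a hypothetical name, the watch list is assumed to be a text file with one scientific name per line, and matching is done on the exact attribute of each result line.

```shell
# filter_watchlist NAMES_FILE DIR
# Hypothetical helper: print the <result .../> lines found under DIR whose
# exact="..." attribute equals one of the names listed in NAMES_FILE
# (assumed format: one scientific name per line).
filter_watchlist() {
  find "$2" -type f -name 'results.xml' -exec cat {} + |
    awk -v names="$1" '
      # Load the watch list into a lookup table before reading any results.
      BEGIN { while ((getline line < names) > 0) wanted[line] = 1 }
      # Pull out the value of the exact attribute and keep listed names only.
      match($0, / exact="[^"]*"/) {
        name = substr($0, RSTART + 8, RLENGTH - 9)
        if (name in wanted) print
      }'
}
```

Because the comparison is an exact string match, abbreviated forms such as a genus initial plus species name would need to be added to the watch list separately.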