ContentMine Case Study: mining the tree of life

Scientists represent the relationships between different organisms as branching ‘phylogenetic trees’. These diagrams tell us a lot about how close different species are to each other, and the length of the branches can even tell us how long ago they shared a common ancestor. While we typically imagine these trees stretching back over millions of years, sometimes they can be quite short, for example when tracing back to the origins of disease outbreaks like Ebola or Zika virus.

These trees contain lots of useful information and they take a long time to calculate. Even with powerful computers, large trees can take hours or even days. If enough is known about the analysis and the data, many trees can be remixed into ‘supertrees’, potentially saving a lot of time and resources and enabling new analyses.

Unfortunately, most trees are presented in scientific papers only as images, and information in images cannot easily be extracted in a form that is ready for analysis. Disappointingly, only 4-25% of data from trees is made available in a reusable way. For the remaining 75-96%, the only way to create a dataset is to measure the branches painstakingly with a ruler (either digital or physical!) and transcribe the labels for species and gene names. This takes a lot of researcher time that could be more usefully spent thinking, analysing and explaining – the parts of research where humans excel!

ContentMine founder Dr Peter Murray-Rust and Dr Ross Mounce set out to automate the tedious data extraction process as far as possible, thus enabling more data to be extracted from the many thousands of trees that are published each year. BBSRC funded the development of PLUTo: Phyloinformatic Literature Unlocking Tools, and together Peter and Ross made software to extract relationships and branch lengths from tree diagrams and turn them into NeXML – a format that can be stored in structured databases online and analysed directly by computers.

They applied this to creating a supertree of bacteria (Fig 1) – synthesising 924 separate trees featuring 2269 taxa! As well as liberating a lot of data that could be useful for a number of analyses, the process uncovered many instances where species names had been misspelled, meaning that this approach could help correct errors in the scientific record as well as enriching data resources.

Figure 1: The final output tree from PLUTo – 2269 bacterial taxa!


ContentMine is a non-profit with a mission to bring content mining tools and open ‘facts’ to researchers. We believe that mining improves the efficiency and quality of research through:

  • increasing the breadth and depth of scientific information that can be surveyed
  • saving time, freeing researchers up for the bits they’re best at (like thinking and analysing results!)
  • liberating open data for analysis

If you think our experienced team can help with your research, get in touch!

Announce: Microbial Supertree through ContentMining

I haven’t blogged for some time as I have been writing Liberation Software (software to make knowledge and people free). Now we (Ross Mounce, Matt Wills and I) have got our first significant scientific result – a supertree:


I am going to leave Ross the opportunity to blog this in detail – he was hacking this late last night – so a brief overview:

For every new microorganism it’s obligatory to compare it with other organisms in an evolutionary (phylogenetic) tree. Here’s a typical one (don’t be frightened – everyone can understand this if they are familiar with evolutionary ideas). The image was published in (Citation: International Journal of Systematic and Evolutionary Microbiology (2009), 59, 972–980, DOI 10.1099/ijs.0.000364-0)



[I have added “root” and magnified some of it].

The figure lists 31 microorganisms (mainly bacteria) in the middle. Each has a binomial (scientific) name (Pyramidobacter piscolens), a strain identifier (W5455T), and an identifier in an RNA database (e.g. EU379932). The lines represent a “tree” with its root (not shown) at the left-hand side, and show the presumed divergence of the species. It’s certainly a useful classification; you can debate whether it’s a useful historical model of the actual evolution over many millions of years. Thus it says that Pyramidobacter piscolens is closely related to Jonquetella anthropi and much more distantly related to Escherichia coli, a bacterium in everybody’s gut.
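The three parts of a label described above can be pulled apart with a regular expression. This is only a sketch: the layout "Genus species STRAIN (ACCESSION)" and the accession pattern (one or two capital letters followed by five or six digits) are assumptions made for illustration, and the name `parse_label` is hypothetical. Labels in real figures vary a good deal.

```python
import re

# Assumed label layout: "Genus species STRAIN (ACCESSION)".
# The accession pattern is an assumption, not a rule taken from the paper.
LABEL_RE = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+(?P<species>[a-z]+)\s+"
    r"(?P<strain>\S+)\s+\((?P<accession>[A-Z]{1,2}\d{5,6})\)$"
)


def parse_label(text):
    """Split one tip label into its parts, or return None if it does not fit."""
    match = LABEL_RE.match(text.strip())
    return match.groupdict() if match else None


label = parse_label("Pyramidobacter piscolens W5455T (EU379932)")
print(label)
```

A label that does not match returns None, which is a cheap way to flag candidates for manual checking.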

Each paper provides one such tree, which could take significant amounts of computation (often hours, depending on the strictness). What we have done – and this is a strength of Matt’s group – is to bring thousands of such trees together. They weren’t calculated with this in mind, and we are adding value by extracting them from the literature and making comparisons and aggregations.

Ross downloaded about 4,300 papers and the trees in them. I wrote the software to extract trees from the images. This is not trivial – the images are made of pixels, with no explicit lines, characters or words – and this research is full of heuristics. For example, we can’t always distinguish “O” (oh) from “0” (zero), so there will be an unavoidable percentage of garbles.

BUT we have ways of detecting and correcting these (“cleaning”) and the most valuable are:

  • comparing the scientific name with the RNA ID
  • looking up the name in the NCBI’s Taxdump (a list of all biomedical species)

Ross has developed several methods of cleaning and we are reasonably confident that the error rate in species is no worse than 1 in 1000. (Note, by the way, that in a sibling image the authors have made a misprint: “Optiutus” should be “Opitutus”, so the primary literature also contains errors.)
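To give a flavour of the name lookup, here is a minimal sketch that matches a possibly garbled name against a list of known names using Python's standard `difflib`. It is not Ross's cleaning code. The short `KNOWN_GENERA` list is a hypothetical stand-in for names loaded from a taxonomy list such as Taxdump, and the `cutoff` value is an arbitrary assumption.

```python
import difflib

# Hypothetical stand-in for names loaded from a taxonomy list.
KNOWN_GENERA = ["Opitutus", "Pyramidobacter", "Jonquetella", "Escherichia"]


def clean_name(raw, known=KNOWN_GENERA, cutoff=0.8):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(raw, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(clean_name("Optiutus"))  # the misprint mentioned above
```

A name with no close match returns None, so it can be set aside rather than silently "corrected" to the wrong species.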

Everything we do is Open Notebook. We post stuff as soon as it is ready. We store it on Github (see above link, which has all 4300 trees), discuss it on ContentMine’s Discourse (you can see that every detail is made open), keep the software in many repos (fully Open and often updated several times a day), and Ross will be blogging it.

I hope to write much more frequently now.

Content Mining: we can now mine images (of phylogenetic trees and more)

The reason I use “content mining” and not “Text and Data Mining” is that science consists of more than text – images, audio, video, code and much more. Text is the best known and the most immediately tractable, and many scientific disciplines have developed Natural Language Processing (NLP). In our group Lezan Hawizy, Peter Corbett, David Jessop, Daniel Lowe and others have developed ChemicalTagger, OSCAR, Patent Analysis, and OPSIN. So ContentMine is exactly that – an org that mines content.

But words are often a poor way of representing science and images are common. A general approach to processing all images is very hard and 2 years ago I thought it was effectively impossible. However, with hard work some subsets can be tractable. Here we show you some of the possibilities in phylogenetic trees (evolutionary trees). What is described below is simple to follow and simple to carry out, but it took me some months of exploration to find the best strategy. And I owe a great debt to Noureddin Sadawi who introduced me to thinning – I haven’t used his code but his experience was invaluable.

But you don’t need to worry. Here’s a typical tree. It’s from PLoS ONE: “Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada”.


The tree has been wrapped into a circle with the Root at the centre and the leaves/tips on the edge of the circle. To transcribe this manually would take hours – we show it being done in a second.

There isn’t always a standard way of doing things but for many diagrams we have to:

  • flatten (remove shades of gray)
  • separate colours (often by flattening them)
  • threshold (remove noise and background)
  • thin (remove all pixels except the 1-pixel-thick backbone)
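The last two steps in the list can be sketched in a few lines of plain Python. The thinning here is the textbook Zhang-Suen algorithm, chosen only because it is well known. It is not necessarily what our software uses, and the function names and the `cutoff` of 128 are assumptions made for this example.

```python
def threshold(gray, cutoff=128):
    """Turn a grayscale grid (0-255) into 0/1, treating dark pixels as ink."""
    return [[1 if value < cutoff else 0 for value in row] for row in gray]


def neighbours(img, r, c):
    """Eight neighbours clockwise from north; outside the image counts as 0."""
    offsets = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]
    h, w = len(img), len(img[0])
    return [
        img[r + dr][c + dc] if 0 <= r + dr < h and 0 <= c + dc < w else 0
        for dr, dc in offsets
    ]


def thin(img):
    """Zhang-Suen thinning: peel boundary pixels until nothing more changes."""
    img = [row[:] for row in img]
    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(len(img)):
                for c in range(len(img[0])):
                    if not img[r][c]:
                        continue
                    n = neighbours(img, r, c)
                    north, east, south, west = n[0], n[2], n[4], n[6]
                    ink = sum(n)
                    # Number of 0 -> 1 transitions going once round the pixel.
                    turns = sum(1 for i in range(8) if n[i] == 0 and n[(i + 1) % 8] == 1)
                    if step == 0:
                        ok = north * east * south == 0 and east * south * west == 0
                    else:
                        ok = north * east * west == 0 and north * south * west == 0
                    if 2 <= ink <= 6 and turns == 1 and ok:
                        to_clear.append((r, c))
            # Clear after the scan so every test in a pass sees the same image.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# A dark bar, 3 pixels thick, on a white background.
gray = [[0 if 1 <= r <= 3 and 1 <= c <= 7 else 255 for c in range(9)] for r in range(5)]
binary = threshold(gray)
thinned = thin(binary)
for row in thinned:
    print("".join("#" if pixel else "." for pixel in row))
```

On this toy bar the result is a single row of ink. Note that the ends of the line are eroded slightly, which is typical of thinning.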

and here is the thinned diagram:


You’ll see that the lines are all still there but exactly 1 pixel thick. (We’ve lost a few colours, but that’s irrelevant for this example). Now we are going to look at the tree (and ignore the labels):


This has been selected automatically on pixel count, but we can also use bounding boxes and many shape characteristics.
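Selection on pixel count can be sketched as follows: group the ink pixels into connected blobs and keep the biggest one. The sketch assumes that the tree is the largest blob in the image and that the labels are smaller ones. The function names are hypothetical.

```python
from collections import deque


def components(img):
    """Group ink pixels (value 1) into 8-connected components."""
    h, w = len(img), len(img[0])
    seen = set()
    found = []
    for r in range(h):
        for c in range(w):
            if not img[r][c] or (r, c) in seen:
                continue
            seen.add((r, c))
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < h and 0 <= nx < w
                        if inside and img[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            found.append(pixels)
    return found


def largest_component(img):
    """Pick the component with the most pixels (assumed here to be the tree)."""
    return max(components(img), key=len)


image = [
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
tree_pixels = largest_component(image)
print(len(tree_pixels), "pixels in the largest component")
```

The same list of components could equally be filtered on bounding box or shape instead of `len`.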

We now analyse the structure and break it into connected components – a topological tree – by standard traversal methods. We end up with nodes and edges – this is a snapshot of an SVG.
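One common way to find the nodes in a thinned image is to count each ink pixel's neighbours: a pixel with one neighbour is a tip and a pixel with three or more is a branch point. The sketch below shows that idea only. It is an assumption about how such a traversal might start, not a description of our code, and `classify_pixels` is a hypothetical name.

```python
def classify_pixels(skeleton):
    """Return (tips, branch_points) of a 1-pixel-thick skeleton."""
    h, w = len(skeleton), len(skeleton[0])
    tips, branches = [], []
    for r in range(h):
        for c in range(w):
            if not skeleton[r][c]:
                continue
            # Count ink pixels among the (up to) eight neighbours.
            degree = sum(
                skeleton[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0) and 0 <= r + dr < h and 0 <= c + dc < w
            )
            if degree == 1:
                tips.append((r, c))
            elif degree >= 3:
                branches.append((r, c))
    return tips, branches


# A small "Y": two tips at the top, one at the bottom, one branch point.
skeleton = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
tips, branches = classify_pixels(skeleton)
print("tips:", tips)
print("branch points:", branches)
```

On real skeletons several adjacent pixels around a junction can each have three or more neighbours, so in practice such clusters would need merging into a single node.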


[The black lines are artifacts of Inkscape]. So we have identified every node and every edge. The next thing is to trace the edges – that’s easy if they are straight, but here they are curved. Ideally we plan to fit circles, but we’ll use segments for the time being:


The curves are actually straight-line segments, but… no matter.
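For readers who want to try replacing a traced run of pixels with a few straight segments, the Ramer-Douglas-Peucker algorithm is one standard choice. This is a sketch of that algorithm and not necessarily the method used to produce the picture. The `tolerance` value and function names are assumptions.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length


def simplify(points, tolerance):
    """Ramer-Douglas-Peucker: drop points closer than tolerance to the chord."""
    if len(points) < 3:
        return list(points)
    distances = [point_line_distance(p, points[0], points[-1]) for p in points[1:-1]]
    worst = max(range(len(distances)), key=distances.__getitem__)
    if distances[worst] <= tolerance:
        return [points[0], points[-1]]
    split = worst + 1  # index of the farthest point in the full list
    left = simplify(points[: split + 1], tolerance)
    right = simplify(points[split:], tolerance)
    return left[:-1] + right  # the split point appears once


# Pixels along an L-shaped edge collapse to two segments.
edge = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
segments = simplify(edge, tolerance=0.5)
print(segments)
```

A larger tolerance gives fewer, longer segments. A curved edge then becomes a short polyline, which is the effect visible above.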

It’s now a proper phylogenetic tree! And we can serialize it as Newick (or NeXML if we wanted).
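Newick is just nested brackets, so serializing is short. The sketch below assumes a tree held as nested dictionaries with hypothetical keys `name`, `length` and `children`. The species names come from the bacterial example earlier, but the branch lengths are invented purely for illustration.

```python
def to_newick(node):
    """Serialize a nested-dict tree as Newick text (without the final ';')."""
    label = node.get("name", "")
    children = node.get("children", [])
    if children:
        text = "(" + ",".join(to_newick(child) for child in children) + ")" + label
    else:
        text = label
    if "length" in node:
        text += ":" + format(node["length"], "g")
    return text


# Branch lengths here are made up; they are not measured from any figure.
tree = {
    "children": [
        {
            "length": 0.5,
            "children": [
                {"name": "Pyramidobacter_piscolens", "length": 0.1},
                {"name": "Jonquetella_anthropi", "length": 0.1},
            ],
        },
        {"name": "Escherichia_coli", "length": 0.9},
    ]
}
newick = to_newick(tree) + ";"
print(newick)
```

Underscores stand in for spaces in the names because unquoted Newick labels cannot contain spaces.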


And here is an interactive tree, made by pasting that string into an online tree viewer (try it yourself).


So – to summarize – we have taken a phylogenetic tree that may have taken hundreds of hours to compute, and extracted the key data. (Smart people will ask “what about the text labels?” – be patient, that’s coming).

… in a second.

That scales to over a million images per year on my single laptop! And the technology scales to many other disciplines and it’s completely Open Source (Apache2). So YOU can use it – as long as you give us the credit for writing it.