ContentMine Case Study: mining the tree of life

Scientists represent the relationships between different organisms as branching ‘phylogenetic trees’. These diagrams tell us a lot about how closely related different species are, and the lengths of the branches can even tell us how long ago they shared a common ancestor. While we typically imagine these trees stretching back over millions of years, sometimes they can be quite short, for example when tracing the origins of disease outbreaks like Ebola or Zika virus.

These trees contain lots of useful information and they take a long time to calculate. Even with powerful computers, large trees can take hours or even days. If enough is known about the analysis and the data, many trees can be remixed into ‘supertrees’, potentially saving a lot of time and resources and enabling new analyses.

Unfortunately, most trees are presented in scientific papers only as images, and information in images cannot easily be extracted in a form that is ready for analysis. Disappointingly, only 4–25% of data from trees is made available in a reusable way. For the remaining 75–96%, the only way to create a dataset is to measure the branches painstakingly with a ruler (either digital or physical!) and transcribe the labels for species and gene names. This takes a lot of researcher time that could be more usefully spent thinking, analysing and explaining – the parts of research where humans excel!

ContentMine founder Dr Peter Murray-Rust and Dr Ross Mounce set out to automate the tedious data extraction process as far as possible, enabling more data to be extracted from the many thousands of trees that are published each year. BBSRC funded the development of PLUTo: Phyloinformatic Literature Unlocking Tools, and together Peter and Ross made software to extract relationships and branch lengths from tree diagrams and turn them into NeXML – a format that can be stored in structured databases online and analysed directly by computers.
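To give a feel for what "relationships and branch lengths" look like once they leave the image, here is a minimal Python sketch. It is not PLUTo's code: the `Node` class, the `to_newick` function and the pixel measurements are all hypothetical, and it writes the compact Newick text format rather than NeXML purely to keep the example short. It assumes branch lengths have already been measured in pixels and that the figure's scale bar tells us how many units one pixel represents.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a tree; pixel_length is the drawn length of the branch above it."""
    name: str = ""
    pixel_length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def to_newick(node: Node, units_per_pixel: float) -> str:
    """Serialise a node and its descendants as a Newick fragment (no trailing ';')."""
    label = node.name
    if node.children:
        inner = ",".join(to_newick(c, units_per_pixel) for c in node.children)
        label = f"({inner}){node.name}"
    if node.pixel_length is not None:
        # Convert the measured pixel length into the units of the scale bar.
        label += f":{node.pixel_length * units_per_pixel:.3f}"
    return label


# Hypothetical measurements: a scale bar drawn 50 px long and labelled 0.1
units_per_pixel = 0.1 / 50

tree = Node(children=[
    Node("A", 100),
    Node(children=[Node("B", 25), Node("C", 75)], pixel_length=50),
])

newick = to_newick(tree, units_per_pixel) + ";"
print(newick)  # (A:0.200,(B:0.050,C:0.150):0.100);
```

The scale-bar conversion is the automated equivalent of the ruler: every branch is measured once in pixels and rescaled by the same factor.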

They applied this approach to create a supertree of bacteria (Fig 1), synthesising 924 separate trees featuring 2269 taxa! As well as liberating a lot of data that could be useful for a number of analyses, the process uncovered many instances where species names had been misspelled, meaning that this approach could help correct errors in the scientific record as well as enrich data resources.

Figure 1: The final output tree from PLUTo – 2269 bacterial taxa!


ContentMine is a non-profit with a mission to bring content mining tools and open ‘facts’ to researchers. We believe that mining improves the efficiency and quality of research through:

  • increasing the breadth and depth of scientific information that can be surveyed
  • saving time, freeing researchers up for the bits they’re best at (like thinking and analysing results!)
  • liberating open data for analysis

If you think our experienced team can help with your research, get in touch!

