Introducing Fellow Neo Christopher Chung: Computational Trends in Genomics

High-throughput genomics has fundamentally changed how life scientists investigate their research questions. Instead of studying a single candidate gene, we are able to measure DNA sequences and RNA expression of every gene in a large number of genomes. This technological advance requires us to become familiar with data wrangling, programming, and statistical analysis. To meet these new challenges, bioinformaticians have created a large number of software packages to deal with genomic data.

For example, a molecular biologist can nowadays sequence a full human genome, which results in billions of very short and randomly located DNA sequences. Such short DNA sequences (e.g., hundreds of base pairs each) must be aligned so that we can infer a full genome of over 3 billion base pairs. With so many computational methods available, it is often difficult to know the most appropriate tool for one’s specific need. I propose to track the usage of bioinformatics tools in sequence alignment, variant calling, and other genomic studies.

This work will lead to a new kind of review that is interactive and dynamic. Eventually, molecular biologists will have a convenient portal that quantitatively summarizes the trends of computational and statistical methods in genomics. Furthermore, the source code will be published on GitHub, so that the analysis can be reproduced and extended to questions about other research trends.
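As a rough illustration of the kind of trend tracking proposed here, the sketch below counts how many papers per year mention each tool in a short list. The tool list, the toy corpus, and the function name `count_tool_mentions` are assumptions made for this example only; they are not taken from the project itself, and a real analysis would need a curated tool dictionary and full-text input.

```python
import re
from collections import Counter, defaultdict

# Illustrative tool names only; a real study would use a curated dictionary.
TOOLS = ["BWA", "Bowtie", "GATK", "SAMtools"]


def count_tool_mentions(papers, tools=TOOLS):
    """Return {year: Counter(tool -> number of papers mentioning it)}.

    `papers` is an iterable of (year, text) pairs. Each paper is counted
    at most once per tool, however often the tool name appears in it.
    """
    # Whole-word, case-insensitive match for each tool name.
    patterns = {
        tool: re.compile(r"\b" + re.escape(tool) + r"\b", re.IGNORECASE)
        for tool in tools
    }
    trends = defaultdict(Counter)
    for year, text in papers:
        for tool, pattern in patterns.items():
            if pattern.search(text):
                trends[year][tool] += 1
    return trends


# Toy corpus (hypothetical sentences, not real papers).
papers = [
    (2014, "Reads were aligned with BWA and variants called with GATK."),
    (2015, "We used Bowtie for alignment; samtools was used for sorting."),
    (2015, "Alignment was performed using BWA."),
]

trends = count_tool_mentions(papers)
for year in sorted(trends):
    print(year, dict(trends[year]))
```

Counting papers rather than raw mentions keeps one verbose methods section from dominating a year's totals, which seems closer to what "usage" means in this context.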


Introducing Fellow Alexandre Hannud Abdo: Digging up Global Health facts

“Our goal is to mine facts from global health research and provide automated referenced summaries to practitioners and agents who don’t have the means or the time to navigate the literature.

Plus, for extreme situations where decisions can’t wait for new research to be confirmed, we’ll do our best to highlight which preliminary facts, as published in the latest conferences, share more traits with previously confirmed results.”

Hi! My name is Ale, I was born and raised in Brazil, where I also obtained my PhD in Physics and later participated in large-scale projects in epidemiology and public health. I currently live in France, where I work at LISIS (Interdisciplinary Laboratory on Sciences, Innovations and Societies) in a project about the evolution of the field of oncology.

I am extremely happy to join this first cohort of ContentMine Fellows. I participated in a ContentMine workshop in 2014 and have been following the progress of the project ever since, looking for an opportunity to collaborate, which has now materialized.

The research field of my proposal is Global Health, and our goal is as stated in the opening paragraph. It should be noted that I am not alone: my interest comes from a group formed during OpenCon 2015 by Neo Chung, who serendipitously was also awarded a ContentMine fellowship on an entirely different subject.

Being a fellow will enable me to develop this proposal within a motivating community, and will help my lab better understand the time I spend away from my primary job. It will also accelerate my learning of, and contributions to, the platform, and hopefully will help attract other researchers excited by this idea!

Introducing Fellow Alexandra Bannach-Brown: Text Mining in Preclinical Systematic Review

My research aims to understand in vivo modelling of depression using systematic review and meta-analysis.
Systematic review is an incredibly useful tool for assessing the relevant literature and achieving a clear overview of the current state of a field, which is becoming increasingly difficult as the number of papers published in scholarly journals increases exponentially (Bornmann & Mutz, 2014). It can also provide a better understanding of the laboratory methods used to induce the condition, the range of outcome measures used to assess depressive-like phenotypes, and the variables that might affect the efficacy of different treatments (de Vries et al., 2011).

However, systematic reviews are time-consuming and often not produced quickly enough to inform the field before they need to be updated (Tsafnat et al., 2014). Automation techniques such as machine learning and text mining can aid the systematic review process and reduce the workload at various stages, mainly screening and data extraction (Jonnalagadda et al., 2013).

The method of model induction, outcome measures, treatments tested, and study quality in preclinical investigations of depression are of key interest in this project. I aim to use ContentMine to help investigate these key areas by using dictionaries, such as genera and species, to aid with document classification and clustering. Identifying the language used when authors report measures to reduce bias in their experiments, using regular expressions, can improve document tagging and data extraction for study quality measures. Identifying these key features in papers can aid the systematic review process by making grouping easier, for example grouping similar models, outcome measures, or animal species, for follow-up analyses. These tools increase the efficiency with which systematic reviews are carried out, shortening the time from search date to publication, and thus allow more up-to-date reviews to be produced. All measures to automate and streamline the systematic review process can reduce human workload, decrease costs associated with human time and expedite scientific advances.
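To make the regular-expression idea concrete, here is a minimal sketch of tagging a passage with the bias-reduction measures it appears to report. The three categories, the patterns, and the function name `tag_bias_measures` are assumptions chosen for illustration; they are not ContentMine's dictionaries or the patterns used in this project, and real patterns would need validation against hand-annotated papers.

```python
import re

# Illustrative patterns only; deliberately simple and likely to produce
# false positives (for example, "blind" used in an unrelated sense).
BIAS_PATTERNS = {
    "randomisation": re.compile(
        r"\brandom(?:ly|i[sz]ed|i[sz]ation)\b", re.IGNORECASE
    ),
    "blinding": re.compile(r"\bblind(?:ed|ing)?\b", re.IGNORECASE),
    "sample_size_calculation": re.compile(
        r"\b(?:sample size calculation|power (?:analysis|calculation))\b",
        re.IGNORECASE,
    ),
}


def tag_bias_measures(text):
    """Return a sorted list of the measure names whose pattern matches `text`."""
    return sorted(
        name for name, pattern in BIAS_PATTERNS.items() if pattern.search(text)
    )


# Hypothetical methods sentence, written for this example.
example = (
    "Mice were randomly allocated to groups and outcome assessors "
    "were blinded to treatment."
)
tags = tag_bias_measures(example)
print(tags)
```

Tags like these could then serve as features for grouping or screening documents, with a human reviewer checking the matched sentences rather than reading every paper in full.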

Introducing Fellow Paola Masuzzo: Mining the literature to understand cell migration

I have just submitted my PhD in Biomedical Sciences at VIB and Ghent University (Belgium), with a thesis entitled “An open data exchange ecosystem: forging a new path for cell migration data analysis and mining”. During my PhD I have tried to make cell migration research more ‘open’: I have developed computational open source tools and algorithms for the storage, management, dissemination and analysis of cell migration experiments. Furthermore, I have tried to push this ‘open’ concept a bit beyond my own PhD, and have succeeded in engaging a few researchers in this fight: I am now working on MULTIMOT, an EU-H2020 funded project that aims to build an open data ecosystem for cell migration research, with the ultimate goal of increasing reproducibility and allowing analyses to take place on rich datasets that would otherwise remain unused.

As a ContentMine fellow, I want to text mine literature around cell migration and invasion, because I believe that there is a huge amount of information in all the papers continuously published in the field, and that this information simply cannot be processed by the human eye. Specifically, text mining cell migration articles will hopefully help in the following tasks:

  • automatically detect a set of core information reported when describing experiments in the field, and therefore construct a collection of minimum reporting requirements from these. These requirements can then be used to aid experimental and computational reproducibility.
  • check for nomenclature consistency, the use of common terms or ontologies to describe the same concept, again with the goal to increase reproducibility, and allow meta-analyses to take place.
  • construct a knowledge map that could capture the current status of information in the field, especially in terms of cell motility-related compounds and (cancer) cell lines.
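The nomenclature check in the second task could start from something as simple as the sketch below, which counts how often different names for the same concept are used across a set of texts. The synonym dictionary, the example sentences, and the function name `find_term_variants` are assumptions made for illustration; a real analysis would draw its terms from a community ontology rather than a hand-written list.

```python
from collections import defaultdict

# Hypothetical synonym dictionary: canonical concept -> variant names.
SYNONYMS = {
    "wound healing assay": [
        "wound healing assay",
        "scratch assay",
        "wound closure assay",
    ],
    "transwell assay": ["transwell assay", "Boyden chamber assay"],
}


def find_term_variants(texts, synonyms=SYNONYMS):
    """Return {concept: {variant: number of texts using that variant}}.

    Matching is a case-insensitive substring search, which is crude but
    enough to show where several names are in use for one concept.
    """
    usage = defaultdict(lambda: defaultdict(int))
    for text in texts:
        lowered = text.lower()
        for concept, variants in synonyms.items():
            for variant in variants:
                if variant.lower() in lowered:
                    usage[concept][variant] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {concept: dict(counts) for concept, counts in usage.items()}


# Toy sentences written for this example, not taken from real papers.
texts = [
    "Migration was measured with a scratch assay.",
    "A wound healing assay was performed after 24 hours.",
    "Cells were seeded for a scratch assay.",
]

usage = find_term_variants(texts)
for concept, counts in usage.items():
    print(concept, counts)
```

A concept that shows several variants with comparable counts is a candidate for a recommended common term in the minimum reporting requirements.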

Barriers to content mining: fake article webpages

Two ContentMine team members featured in Nature News this week reporting difficulties they have faced in text mining Wiley publications due to ‘Trap’ URLs, designed to catch people using automated downloading.

Full article: ‘Publisher under fire for fake article webpages’

Richard Smith-Unna, lead developer of ContentMine’s getpapers and quickscrape tools, encountered the issue while undertaking research that is fully supported by the UK copyright exception for text and data mining and about which his University librarians had been informed. Ross Mounce, lead researcher on the ContentMine-supported PLUTo project, has encountered similar issues before and states:

“I’m worried we’re seeing the beginning of an evolutionary arms race between legacy publishers and researchers.”

The methods employed were criticised as heavy-handed and unsophisticated by several commentators, with one librarian stating that they find the behaviour concerning “because it demonstrates that supporting research is not the chief priority of these publishers.”

ContentMine continues to support the rights of researchers to build on our collective scientific knowledge through mining the academic literature and to call out barriers to that mission.

ContentMine Case Study: mining the tree of life

Scientists represent the relationships between different organisms as branching ‘phylogenetic trees’. These diagrams tell us a lot about how close different species are to each other, and the length of the branches can even tell us how long ago they shared a common ancestor. While we typically imagine these trees stretching back over millions of years, sometimes they can be quite short, for example in tracing back to the origins of disease outbreaks like Ebola or Zika virus.

These trees contain lots of useful information and they take a long time to calculate. Even with powerful computers, large trees can take hours or days. If enough is known about the analysis and the data, many trees can be remixed into ‘supertrees’, potentially saving a lot of time and resources and enabling new analyses.

Unfortunately, most trees are presented in scientific papers only as images, and information from images cannot easily be extracted in a form that is ready for analysis. Disappointingly, only 4-25% of data from trees is made available in a reusable way; for the remaining 75-96%, the only way to create a dataset is to measure the tree painstakingly with a ruler (either digital or physical!) and transcribe the labels for species and gene names. This takes a lot of researcher time that could be more usefully spent thinking, analysing and explaining – the parts of research where humans excel!

ContentMine founder Dr Peter Murray-Rust and Dr Ross Mounce set out to automate the tedious data extraction process as far as possible, thus enabling more data to be extracted from the many thousands of trees that are published each year. BBSRC funded the development of PLUTo: Phyloinformatic Literature Unlocking Tools, and together Peter and Ross made software to extract relationships and branch lengths from tree diagrams and turn them into NeXML – a format that can be stored in structured databases online and analysed directly by computers.

They applied this to creating a supertree of bacteria (Fig 1) – synthesising 924 separate trees featuring 2269 taxa! As well as liberating a lot of data that could be useful for a number of analyses, the process uncovered many instances where species names had been misspelled, meaning that this approach could help correct errors in the scientific record as well as enriching data resources.

Figure 1: The final output tree from PLUTo – 2269 bacterial taxa!


ContentMine is a non-profit with a mission to bring content mining tools and open ‘facts’ to researchers. We believe that mining improves the efficiency and quality of research through:

  • increasing the breadth and depth of scientific information that can be surveyed
  • saving time, freeing researchers up for the bits they’re best at (like thinking and analysing results!)
  • liberating open data for analysis

If you think our experienced team can help with your research, get in touch!

Text and Data Mining (i.e. Content Mining) for Research and Innovation – What Europe Must Do Next

The Right to Read is the Right to Mine

This post is based upon an important report released at 08:00 CET on 31 May 2016.

The Lisbon Council launches Text and Data Mining for Research and Innovation: What Europe Must Do Next, an interactive policy brief which looks at the challenge and opportunity of text and data mining in a European context. Building on the Lisbon Council’s highly successful 2014 paper, which served as an important and early source of evidence on the uptake and interest in text and data mining among academics worldwide, the paper revisits the data two years later and finds that recent trends have only accelerated. Concretely, Asian and U.S. scholars continue to show a huge interest in text and data mining as measured by academic research on the topic. And Europe’s position is falling relative to the rest of the world. The paper looks at the legal complexity and red tape facing European scholars in the area, and calls for wholesale reform. The paper was prepared for and formally submitted as part of the European Commission’s Public Consultation on the Role of Publishers in the Copyright Value Chain and on the ‘Panorama Exception.’ Source.
 In an associated Press Release:-

 Text and Data Mining for Research and Innovation looks at the transformative role of text and data mining in academic research, benchmarks the role of European research against global competitors and reflects on the prospects for an enabling policy in the text and data mining field within the broader European political and economic context.

Among the key findings:

  • Asia leapfrogs EU in research on text and data mining. Over the last decade, Asia has replaced the European Union as the world’s leading centre for academic research on text and data mining as judged by number of publications. From 2011 to 2016, Asian scholars’ share of academic publications in the field rose to 32.4% of all global publications, up from 31.1% in 2000. The EU’s global share fell to 28.2%, down from 38.9% in 2000. North America remained in third place at 20.9% due to the relatively small size of the three-country region.
  • China ranks No.1 within Asia. As recently as 2000, Japan and Taiwan led Asia with 12.6% and 7% of all global text-and-data-mining-based publications. After a steady rise in interest, China now leads. On its own, it accounted for 11.7% of all global publications in 2015, up from zero in 2000. This gave China a No. 2 finish in the country rankings, second only to the United States. China’s ranking within Asia is now No. 1.
  • Chinese patents on data mining see unprecedented growth. China also led the global growth in the number of patents pertaining to data mining. While the number of patents granted by the U.S. Patent and Trademark Office (USPTO) remained relatively stable over the past decade, the number of patents granted for data-mining-related products by the State Intellectual Property Office of the People’s Republic of China (SIPO) rose to 149 in 2015, up from just one in 2005.
  • Chinese researchers are champions in patenting TDM procedures. Chinese researchers and organisations are patenting text-and-data-mining procedures at a faster rate than any other country in the world. This suggests that Chinese researchers attach a growing priority to the potential use of this new technique for stimulating scientific breakthroughs, disseminating technical knowledge and improving productivity throughout the scientific and technical community.
  • Middle East entering the game too. Some of the fastest growth and greatest interest was seen in relative newcomers: India, Iran and Turkey. Having shown virtually no interest in text and data mining as recently as 2000, the Middle East is now the world’s fourth largest region for research on text and data mining, led by Iran and Turkey.
  • Europe remains slow. Large European scientific, technical and medical publishers have added text-and-data-mining functionality to some dataset licences, but the overall framework in Europe remains slow and full of uncertainty. Many smaller publishers do not yet offer access of this type. And scholars complain that existing licences are too restrictive and do not allow for generating the advanced “big data” insights that come from detecting patterns across multiple datasets stored in different places or held by different owners.
  • Legal clarity also matters. Some countries apply the “fair-use” doctrine, which allows “exceptions” to existing copyright law, including for text and data mining. Israel, the Republic of Korea, Singapore, Taiwan and the United States are in this group. Others have created a new copyright “exception” for text and data mining – Japan, for instance, which adopted a blanket text-and-data-mining exception in 2009, and more recently the United Kingdom, where text and data mining was declared fully legal for non-commercial research purposes in 2014.
  • What Europe Must Do. New technologies make analysis of large volumes of text and other media potentially routine. But this can only happen if researchers have clearly established rights to use the relevant techniques, supported by the necessary skills and experience. Broadly speaking, the European ecosystem for engaging in text and data mining remains highly problematic, with researchers hesitant to perform valuable analysis that may or may not be legal. The end result: Europe is being leapfrogged by rising interest in other regions, notably Asia. European scholars are even forced, on occasion, to outsource their text and data mining needs to researchers elsewhere in the world, as has been reported repeatedly in past European Commission consultations. Anecdotally, we hear stories of university and research bureaux deliberately adding researchers in North America or Asia to consortia because those researchers will be able to do basic text and data mining so much more easily than in the EU.


Some reactions on social media:-

 .@lisboncouncil calls for modern rules to enable #TDM Here’s #IFLA’s (joint) call for this!

 Asia leapfrogs EU in research on text and data mining says latest @LisbonCouncil Report on #TDM

 EU falling behind on #TDM according to latest brief from @lisboncouncil

 This makes me mad Asia is doing #tdm Publishers like #elsevier and Wiley are stopping me and colleagues. @Senficon