Final Report: Analysing and visualising data from papers about conifers

Lars Willighagen, orcid:0000-0002-4751-4637

Final Report of my fellowship at the ContentMine.


My proposal was to extract facts about various conifer species by analysing text from papers, using text-analysis software and the tools provided by the ContentMine. These facts were then to be converted into JSON and made viewable through an HTML (+CSS/JS) interface. Expected statements were like: ‘Picea glauca is a species of the genus Picea’, which could be parsed to the triple: Picea glauca; property: genus; subject: Picea.


The main outcome of this project is a series of programs converting tables from research articles into Wikidata statements. The workflow is as follows. First, papers matching a user-provided query are fetched by the ContentMine’s getpapers. Second, the tables are extracted from the fetched papers and converted to assertions. This is done by filling empty cells in tables and then treating each row as an object, the first column being the name and the others property-value pairs. Different table designs are currently parsed in the same way, resulting in incorrect extraction of data, something that can be addressed by normalising the table structure beforehand. The resulting assertions are then converted to JSON, currently in a custom scheme, to allow the next steps.
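The row-to-assertion step described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the fill rule (copy the value from the cell above), the function names, the JSON keys `subject`, `property` and `value`, and the cell values in the example are all assumptions made for this sketch.

```python
import json
from typing import Dict, List


def fill_empty_cells(rows: List[List[str]]) -> List[List[str]]:
    """Fill each empty cell with the value from the cell directly above it.

    Assumed fill rule; the table is expected to be rectangular.
    """
    filled: List[List[str]] = []
    for row in rows:
        current = []
        for i, cell in enumerate(row):
            if cell.strip() == "" and filled:
                current.append(filled[-1][i])
            else:
                current.append(cell)
        filled.append(current)
    return filled


def rows_to_assertions(header: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Treat each row as an object: first column is the name, the rest are property-value pairs."""
    assertions = []
    for row in fill_empty_cells(rows):
        name = row[0]
        for prop, value in zip(header[1:], row[1:]):
            assertions.append({"subject": name, "property": prop, "value": value})
    return assertions


if __name__ == "__main__":
    # Placeholder table content, invented for illustration only.
    header = ["Species", "Property A", "Property B"]
    rows = [["Picea glauca", "a1", "b1"], ["", "a2", "b2"]]
    print(json.dumps(rows_to_assertions(header, rows), indent=2))
```

A table with a different design (for example, species as columns instead of rows) would produce wrong assertions here, which is exactly why normalising the table structure first matters.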

Finally, the JSON assertions are visualised in an HTML GUI. This includes a stepper form (see picture) where you can curate the assertion, link identifiers, and add it to Wikidata.


Getting these assertions from text, as I proposed, was harder. Tools I expected to find included in ContentMine software were nowhere to be found, but were planned, so implementing them myself did not seem a good use of my time. Luckily, the literature corpus does not contain as many statements about physical properties of conifers in plain text as I originally expected: most are in tables, figures or supplementary files, leading me to use those instead. The nice thing is that one of the main focuses of the ContentMine is parsing tables from PDF, so this will definitely be of general use.

Other work

During the project and to explore the design of the ContentMine, additional related components were developed:

  • ctj: program to convert and re-order AMI data to JSON, making it easier to read in JavaScript (mainly good for web applications);
  • ctj-cardlists: program to view AMI JSON (see above) in a Web GUI (demo); and
  • Citation.js: added functionality to parse BibJSON (used for quickscrape output) into CSL, for further formatting. See blog post.

The first two simplify handling AMI output in the browser, while the third makes it easier to display references in common formats.


All source code of the project outcomes is available on GitHub:

Progress was communicated during the project via the ContentMine Discourse page, on my personal blog (~20 posts), and on the general ContentMining blog (2 long posts).

Future work

The developed pipeline works but is not perfect. The pipeline to parse tables mentioned above requires further generalisation. This defines some logical next steps:

  • Finally adding it as an NPM module, making it (way) easier for people to use it;
  • Making searching easier in the HTML GUI (will need work further upstream too). Currently the list of assertions is split into pieces, making it hard to find anything. This can be fixed with a search index;
  • Normalising table structures to support more designs, rendering assertion extraction more reliable;
  • Making the process of curating assertions and linking identifiers easier by linking more identifiers, and showing context, i.e. the original tables; and
  • Some small performance and UX things.

Another important thing that is too big for a single bullet point is annotating abbreviations and references in the document before extracting the tables. It’s easier to curate statements like ‘[1] says this and this’ when you know ‘[1]’ references some known article. Another example: while a statement containing ‘P. glauca’ says nothing (there are 66+ species using that abbreviation), the article probably says which one it is somewhere outside the table, something that can be picked up if you annotate these before taking them out of context. Until such annotation is in place, the interactive stepper form remains a necessity.
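To make the abbreviation problem concrete, here is a rough Python sketch of how full names found elsewhere in an article could be used to expand an abbreviation like ‘P. glauca’. It is not part of the existing pipeline. The regular expression is deliberately naive (it treats any capitalised word followed by a lowercase word as a candidate binomial), and the function names are made up for this example.

```python
import re
from collections import defaultdict
from typing import Dict, Optional, Set

# Naive pattern: a capitalised word followed by a lowercase word.
# This also matches ordinary phrases, so a real annotator would need a species list.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")


def abbreviation_index(text: str) -> Dict[str, Set[str]]:
    """Map abbreviated names ('P. glauca') to the full names seen in the article text."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for genus, epithet in BINOMIAL.findall(text):
        index[f"{genus[0]}. {epithet}"].add(f"{genus} {epithet}")
    return index


def resolve(abbreviation: str, index: Dict[str, Set[str]]) -> Optional[str]:
    """Return the full name only if the article offers exactly one candidate."""
    candidates = index.get(abbreviation, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # unknown or ambiguous: leave it to the human curator
```

When `resolve` returns `None`, the statement still has to go through manual curation, which is the role the stepper form plays today.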


As noted, the work is far from done. Currently, it mainly shows a glimpse of what would have been possible had I spent more time on writing code. Short conclusions: CTJ is unpolished and slow. Because of a lack of customisation options, such as what data to use, you will almost always need to write custom code to avoid including tons of unnecessary data in your resulting JSON.

CTJ-Cardlists is actually pretty nice. It is slow, and it does not really show relations, but it does show an interesting overview of the literature corpus, like how often species are mentioned and which other species they are most often mentioned together with. You can easily draw reasonable conclusions like how often species names are misspelled. However, it would be more useful for this to have SQL queries or something similar. CTJ-Factvis shows even more potential, with the Wikidata integration. I do need to pay more attention to the fact that those assertions are alleged facts, and not regular ones, as I called them in earlier blog posts.
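The kind of overview meant here (how often species are mentioned together) boils down to counting pairs per article. A small Python sketch of that idea follows; it is not how CTJ-Cardlists is implemented, and the input shape (one list of mentioned species per article) and the function name are assumptions.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cooccurrence_counts(mentions_per_article: Iterable[List[str]]) -> "Counter[Tuple[str, str]]":
    """Count in how many articles each pair of species is mentioned together."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for mentions in mentions_per_article:
        # Deduplicate within an article and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(mentions)), 2):
            counts[pair] += 1
    return counts
```

Calling `cooccurrence_counts(...).most_common(10)` would then give the ten pairs that appear together in the most articles.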


In general, the fellowship went pretty well for me. In retrospect, I did a lot of the things I wanted to do, even though throughout the project it felt like there was so much left to do, and there is! I am really excited about the possibilities that emerged during the fellowship, and even in the last weeks. How cool would it be to extend this project with entire Web APIs and more? This is, for a big part, thanks to the support, feedback, and input of the amazing ContentMine team during the regular meetings, and the quick responses to various software issues. I also enjoyed blogging about my progress on my own blog and on the ContentMine blog.

Evaluating Trends in Bioinformatics Software Packages

Genomics — the study of the genome — requires processing and analysis of large-scale datasets. For example, to identify genes that increase our susceptibility to a particular disease, scientists can read many genomes of individuals with and without this disease and try to compare differences among genomes. After running human samples on DNA sequencers, scientists need to work with overwhelming datasets.

Working with genomic datasets requires knowledge of a myriad of bioinformatic software packages. Because of rapid development of bioinformatic tools, it can be difficult to decide which software packages to use. For example, there are at least 18 software packages to align short unspliced DNA reads to a reference genome. Of course, scientists make this choice routinely and their collective knowledge is embedded in the literature.

To this end, I have been using ContentMine to find trends in bioinformatic tools. This blog post outlines my approach and demonstrates important challenges to address. Much of my current effort goes into obtaining as comprehensive a set of articles as possible, ensuring correct counting/matching methods for bioinformatic packages, and making this pipeline reproducible.



ContentMining Conifers and Visualising Output: Extracting Semantic Triples

Last month I wrote about visualising ContentMine output. It was limited to showing and grouping ami results which — in this case — were simply words extracted from the text.

A few weeks back, I got to know the ContentMine fact dumps. These are not just extracted words: they are linked to Wikidata Entities, from which all sorts of identifiers and properties can be derived. This is visualised by factvis. For example, when you hover over a country’s name, the population appears.

Now, I’m experimenting with the next level of information: semantic triples. Because NLP is a step too far at the moment, I resorted to extracting these from tables. To see how this went in detail, please take a look at my blogpost. A summary: norma (the ContentMine program that normalises articles) has an issue with tables, table layout is very inconsistent across papers (which makes it hard to parse), and I’m currently providing Wikidata IDs for (hopefully) all three parts of the triple.

About that last one: I’ve made a dictionary of species names paired with Wikidata Entity IDs with this query. I limited the number of results because the Web Client times out if I don’t. I run the query locally using the following command:

curl -i -H "Accept: text/csv" --data-urlencode query@<queryFile>.rq -G -o <outputFile>.csv

Then I can match the species names to the values I get when extracting from the table. If I can do this for the property as well, we’ll have a functioning program that creates Wikidata-grade statements from hundreds of articles in a matter of minutes.
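As a rough illustration of this matching step, the Python sketch below loads a CSV dictionary and looks up values taken from a table. It is an assumption-laden example rather than the actual program: the column names `name` and `id` are hypothetical (the real query output may use different headers), and the identifier in the usage note is a placeholder, not a real Wikidata ID.

```python
import csv
import io
from typing import Dict, Iterable, Optional


def load_dictionary(csv_text: str, name_column: str = "name", id_column: str = "id") -> Dict[str, str]:
    """Build a lookup from lower-cased species name to entity ID.

    The column names are hypothetical defaults; adjust them to the real CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[name_column].strip().lower(): row[id_column].strip() for row in reader}


def link_species(values: Iterable[str], dictionary: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each table value to an entity ID, or None when there is no exact match."""
    return {value: dictionary.get(value.strip().lower()) for value in values}
```

For example, with a dictionary built from the text `"name,id\nPicea glauca,Q_PLACEHOLDER\n"`, the call `link_species(["Picea glauca", "P. glauca"], dictionary)` links the first value and leaves the abbreviated one as `None`, since only exact (case-insensitive) matches are attempted.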

There are still plenty of issues. Mapping Wikidata property names to text from tables, for example, or the fact that there are several species whose names, when shortened, are O. bicolor. I can’t know which one just from the context of the table. And all the table layout issues, although probably easier to fix, are still here. That, and upscaling, is what I’m going to focus on for the next month.

But for now, there’s a demo similar to factvis, to preview the triples:


Digging into cell migration literature

It’s now been a few weeks since I started working with the other fellows and people at ContentMine to dig into cell migration literature. I must admit it has been quite a challenge, because in the meantime I have submitted and successfully defended my PhD thesis!

Now, if you are wondering what my specific project was about, you can read about it here: the basic idea is to get a picture of (in)consistency in cell migration literature nomenclature, to build a set of minimal reporting requirements for the community.

So, I have started using the ContentMine pipeline more and more, and, like most of the other fellows, I have encountered a few problems here and there, and the team has been a fantastic support in fixing these issues. I have used the getpapers command so much that I can now run both basic and more advanced queries basically with my eyes closed (or a das keyboard ;)). For now, I have only used the default eupmc API, and, given the large number of papers available, I have decided to narrow down my search by downloading papers published between 2000 and 2016, describing in vitro cell migration studies.

This results in a search space of about 700 open access papers.

Having the full XML text, I have then used norma to normalize this and obtain scholarly html files. The first thing I wanted to check was the word frequencies, to get a rough idea of which words are used most, and in which sections of the papers. The ami-2word plugin seemed to be just perfect for this! However, when running the plugin with a stop-words file (a file containing some words that I would like to be ignored during the analysis, like the ones listed here), the file seems to get ignored (most likely because it cannot be parsed by the plugin). You can find this file here.
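To show what the expected behaviour looks like, here is a small stand-alone Python sketch of word counting with a stop-words list. It is not the ami-2word plugin and says nothing about how the plugin parses its stop-words file; the function name and the simple tokenisation (runs of ASCII letters only) are assumptions made for this example.

```python
import re
from collections import Counter
from typing import Iterable


def word_frequencies(text: str, stop_words: Iterable[str]) -> Counter:
    """Count lower-cased words in the text, skipping anything in the stop-words list."""
    ignored = {word.strip().lower() for word in stop_words}
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in ignored)
```

Running this over the text of each paper section and comparing the result with the plugin output is one way to check whether the stop-words file is really being ignored.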

I am now discussing this with the fellows and the entire team, and in the process of figuring out if I did something stupid, or if this is an issue we need to correct for, to make the tools at ContentMine even better!

The entire set of commands and results developed so far is in my github repo.

And here is what I want to do next:

  • fix the issue with the stop-words file and visualize the word frequencies across the papers (most likely using a word cloud)
  • use the ami-regex plugin with a set of expressions (terms) that I would like to search for in the papers
  • use the pyCProject to parse my CProject structure and the related CTree and convert these into Python data structures. In this way, downstream manipulation (filtering, visualization etc.) can be done using Python (I will definitely use a jupyter notebook that I will then make available online).

Paola, 2016 fellow

ContentMining Conifers and Visualising Output: Creating a Page for ContentMine Output

In the past few weeks I have made a few pages to visualise ContentMine output of articles on several topics. For this to work, I needed to develop a program to convert ContentMine output to a single (JSON) file to access the facts more easily, and load this in an HTML page. These pages currently contain lists of articles, genera and species. Now, you can quickly glance at a lot of articles, and see common recurrences (although currently most “common recurrences” are typos in plant names on the author’s end).

Example of a results page

For a more detailed description of my progress, see my blog.

Introducing Fellow Guanyang Zhang: Mining weevil-plant associations

I am a taxonomist, the kind of biologist who is charged with discovering, documenting and describing life on earth. I specialize in insects, the most diverse and successful form of life. My ContentMine Fellowship project will focus on mining weevil-plant associations from literature records. I will describe my project in the following.

Motivation. Comprising ~70,000 described and 220,000 estimated species, weevils (Curculionoidea) are one of the most diverse plant-feeding insect lineages and constitute nearly 5% of all known animals. Knowledge of host plant associations is critical for pest management, conservation, and comparative biological research. This knowledge is, however, scattered in 300 years of historical literature and difficult to access. This study will use ContentMine to extract, organize and synthesize knowledge of host plant associations of weevils from the literature. I have been doing literature mining manually and generated nearly 700 entries.


ContentMining Preclinical Models of Depression: 1st Update

My depression project, starting with an animal model, depression-specific search carried out in Pubmed and Embase in May 2016, includes 70,363 unique records after deduplication using EndNote. I am currently evaluating the options as to what is the best solution to get my full search (or as much of it as possible) into the ContentMine pipeline.

Here is an outline of the options I am considering and my thoughts:

  • Wait until the screening section of my systematic review is complete and import only the included studies.
    • Pros: Fewer papers to deal with, which has benefits computationally. Only the most relevant papers would go through the ContentMine pipeline.
    • Cons: Excluded papers would not be annotated with the ‘ami’ tool, so I would lose the information needed to test machine learning options for screening in systematic reviews more thoroughly.


  • Use the ‘quickscrape’ function on a list of manually collated URLs.
    • Pros: Records are smoothly imported into the ContentMine pipeline and in the correct formats.
    • Cons: Time. It takes a long time to manually collect the correct URLs and group them into categories for different scrapers. There will likely be issues in that some records will not have a URL. These records would slow the process as a more thorough library search would need to be conducted to find these records or the authors of the record would need to be contacted for full text. A third possibility is that the record does not have an electronic copy. These records could then not be processed by the ContentMine pipeline.


  • Run a search using EuPMC with the ‘getpapers’ function. I am currently retrieving about 20,000 records using the search string “(“depressive disorder” OR “depression” OR “depressive behavior” OR “depressive behaviour” OR “dysthymia” OR “dysthymic” AND “animal”)” which has been developed with a librarian at the University of Edinburgh to roughly correspond with the more complex original PubMed & Embase searches.
    • Pros: Records are quickly and easily imported into the ContentMine pipeline.
    • Cons: Would need to later reconcile the records downloaded with the EuPMC search with the included papers from screening in order to translate these forward in the systematic review pipeline, to the data extraction phase.


  • Retrieve pdfs for the records of my search using EndNote and run ‘norma’ to convert pdfs to text.
    • Pros: The process of downloading full text pdfs from EndNote will be carried out in any case, in order to import full text references into the systematic review database that we have at CAMARADES for further data extraction and meta-analysis.
    • Cons: Possible issues with different pdf formats and therefore the possible output from the PDF conversion. Will the reader be able to deal with the typical journal format of columns (particularly double columns)? How well does it deal with tables and figures? Can legends be retrieved from figures?


  • Run my PubMed search with the Open Access Filter and download the xml versions of the articles using the FTP service. Running my PubMed search using the open access filter retrieves 20,381 records which is comparable to the records retrieved using the amended EuPMC search with ‘getpapers’.
    • Considerations: Seeing as the records retrieved are comparable, I do not see much added advantage of using this facility over that of ContentMine EuPMC API.


  • Use CrossRef to retrieve full text documents, using a list of DOIs. This method can download up to 10,000 full text records at a time.
    • Pros: Another method of retrieving full text documents in a machine readable format. Could be a faster method compared to ‘quickscrape’ as there are DOIs for roughly a third of records.
    • Cons: This method has similar issues to the ‘quickscrape’ tool, namely the time it takes to manually collect the correct DOIs, as only about a third of records have corresponding DOIs that have been located thus far.


If anyone has any input or possible solutions I am unaware of or have not yet considered, please leave a comment down below.


As with the other fellows, I am also experiencing issues using the ‘getpapers’ function and getting time-out errors. This is an issue that lies with EuPMC and its API and this issue will be rectified shortly.


In the meantime, while I weigh the advantages and disadvantages of the above options, I am putting together dictionaries to aid annotation of my documents. The dictionaries I am putting together are: animal models of depression, molecular and cellular pathways, outcome measures (in particular behavioural, neurobiological and anatomical), and risk of bias terms in the animal modelling literature.