Final Report: Analysing and visualising data from papers about conifers

Lars Willighagen, orcid:0000-0002-4751-4637

Final Report of my fellowship at the ContentMine.

Proposal

My proposal was to extract facts about various conifer species by analysing the text of papers with text-analysis software and the tools provided by the ContentMine. These facts were then to be converted into JSON and made viewable through an HTML (+CSS/JS) interface. Expected statements were like: ‘Picea glauca is a species of the genus Picea’, which could be parsed to the triple: Picea glauca; property: genus; subject: Picea.

Work

The main outcome of this project is a series of programs converting tables from research articles into Wikidata statements. The workflow is as follows. First, papers matching a user-provided query are fetched by the ContentMine’s getpapers. Second, the tables are extracted from the fetched papers and converted to assertions. This is done by filling empty cells in tables and then treating each row as an object, the first column being the name and the others property-value pairs. Different table designs are currently all parsed in the same way, which results in incorrect extraction of data; this can be addressed by normalising the table structure beforehand. The resulting assertions are then converted to JSON, currently in a custom scheme, to allow the next steps.
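The row-to-assertion step described above can be sketched roughly as follows. This is a minimal illustration, not the code used in ctj-factvis: the function names are hypothetical, the output keys are not the project's custom JSON scheme, and filling an empty cell with the value from the cell above it is an assumption about how empty cells are handled. The table content is a toy example.

```python
def fill_empty_cells(rows):
    """Fill each empty cell with the value from the same column in the row above.

    Assumption: an empty cell means "same as above" (a common table layout).
    """
    filled = []
    previous = None
    for row in rows:
        current = list(row)
        if previous is not None:
            for i, cell in enumerate(current):
                if cell.strip() == "" and i < len(previous):
                    current[i] = previous[i]
        filled.append(current)
        previous = current
    return filled


def table_to_assertions(header, rows):
    """Treat each row as an object: first column is the name, the rest are property-value pairs."""
    assertions = []
    for row in fill_empty_cells(rows):
        name = row[0]
        for prop, value in zip(header[1:], row[1:]):
            assertions.append({"subject": name, "property": prop, "value": value})
    return assertions


# Toy table for illustration only
header = ["Species", "Genus", "Habit"]
rows = [
    ["Picea glauca", "Picea", "tree"],
    ["Picea abies", "", "tree"],  # empty Genus cell is filled from the row above
]

for assertion in table_to_assertions(header, rows):
    print(assertion)
```

A table with a different design (for example, species as columns instead of rows) would produce wrong assertions here, which is why normalising the table structure first matters.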

Finally, the JSON assertions are visualized in an HTML GUI. This includes a stepper form (see picture) where you can curate the assertion, link identifiers, and add it to Wikidata.

Source code: https://github.com/larsgw/ctj-factvis
Demo: https://larsgw.github.io/ctj-factvis

Getting these assertions from text, as I proposed, was harder. Tools I expected to find in the ContentMine software were not there yet, but were planned, so implementing them myself did not seem a good use of my time. Luckily, the literature corpus does not contain as many statements about physical properties of conifers in plain text as I originally expected: most are in tables, figures or supplementary files, which led me to use those instead. The nice thing is that one of the main focuses of the ContentMine is parsing tables from PDF, so this will definitely be of general use.

Other work

During the project and to explore the design of the ContentMine, additional related components were developed:

ctj: program to convert and re-order AMI data to JSON, making it easier to read in JavaScript (mainly good for web applications);
ctj-cardlists: program to view AMI JSON (see above) in a Web GUI (demo); and
Citation.js: added functionality to parse BibJSON (used for quickscrape output) into CSL, for further formatting. See blog post.

The first two simplified handling AMI output in the browser, while the third makes it easier to display references in common formats.
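To give an idea of what converting BibJSON into CSL involves, here is a minimal sketch in Python. It is not the Citation.js implementation (which is JavaScript); the function name is hypothetical, only a handful of fields are mapped, the CSL type is assumed to be a journal article, and the example record is made up.

```python
def bibjson_to_csl(record):
    """Map a few common BibJSON fields onto CSL-JSON fields."""
    csl = {"type": "article-journal"}  # assumption: every record is a journal article

    if "title" in record:
        csl["title"] = record["title"]

    # BibJSON stores authors as objects with a "name"; CSL accepts a "literal" name
    authors = [{"literal": a["name"]} for a in record.get("author", []) if "name" in a]
    if authors:
        csl["author"] = authors

    if "year" in record:
        csl["issued"] = {"date-parts": [[int(record["year"])]]}

    journal = record.get("journal", {})
    if "name" in journal:
        csl["container-title"] = journal["name"]

    for identifier in record.get("identifier", []):
        if identifier.get("type", "").lower() == "doi":
            csl["DOI"] = identifier["id"]

    return csl


# Made-up record for illustration only
example = {
    "title": "An example article",
    "author": [{"name": "A. Author"}],
    "year": "2016",
    "journal": {"name": "Example Journal"},
    "identifier": [{"type": "doi", "id": "10.0000/example"}],
}
print(bibjson_to_csl(example))
```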

Dissemination

All source code of the project outcomes is available on GitHub:

Progress was communicated during the project via the ContentMine Discourse page, on my personal blog (~20 posts), and on the general ContentMining blog (2 long posts).

Future work

The developed pipeline works but is not perfect. The pipeline to parse tables mentioned above requires further generalisation. This defines some logical next steps:

Finally adding it as an NPM module, making it (way) easier for people to use it;
Making searching easier in the HTML GUI (will need work further upstream too). Currently the list of assertions is split into pieces, making it hard to find anything. This can be fixed with a search index;
Normalising table structures to support more designs, rendering assertion extraction more reliable;
Making the process of curating assertions and linking identifiers easier by linking more identifiers, and showing context, i.e. the original tables; and
Some small performance and UX things.

Another important thing, too big for a single bullet point, is annotating abbreviations and references in the document before extracting the tables. It’s easier to curate statements like ‘[1] says this and this’ when you know ‘[1]’ references some known article. Another example: while a statement containing ‘P. glauca’ on its own says nothing (there are 66+ species using that abbreviation), the article probably says which one it is somewhere outside the table, something that can be picked up if you annotate these before taking them out of context. Until that is done, the interactive stepper form remains a necessity.
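A rough sketch of the abbreviation idea: collect full binomial names from the article text first, then use them to expand abbreviations found in the tables. This is a naive heuristic under stated assumptions, not part of the existing pipeline. The function names are hypothetical, the regular expression will also pick up ordinary capitalised word pairs, and the second name in the ambiguous example is a placeholder used only to show the ambiguous case.

```python
import re


def build_abbreviation_map(text):
    """Map 'P. glauca' style abbreviations to the full names found in the text."""
    # Naive pattern: capitalised word followed by a lowercase word of 3+ letters
    full_names = re.findall(r"\b([A-Z][a-z]+) ([a-z]{3,})\b", text)
    mapping = {}
    for genus, species in full_names:
        abbreviation = f"{genus[0]}. {species}"
        mapping.setdefault(abbreviation, set()).add(f"{genus} {species}")
    return mapping


def expand(abbreviation, mapping):
    """Return the full name only when exactly one candidate was seen in the article."""
    candidates = mapping.get(abbreviation, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # unknown or ambiguous: leave it to the curator


article = "Picea glauca was sampled. Growth of P. glauca was slow."
names = build_abbreviation_map(article)
print(expand("P. glauca", names))

# Placeholder second name, only to show that ambiguity is not resolved silently
ambiguous = build_abbreviation_map("Picea glauca and Pinus glauca were compared.")
print(expand("P. glauca", ambiguous))
```

Returning nothing for ambiguous cases keeps the decision with the person curating the assertion instead of guessing.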

Evaluation

As noted, the work is far from done. Currently, it mainly shows a glimpse of what is possible had I spent more time on writing code. Short conclusions: CTJ is unpolished and slow. Because of a lack of customisation options, such as what data to use, you will almost always need to write custom code to not have to include tons of unnecessary data in your resulting JSON.

CTJ-Cardlists is actually pretty nice. It is slow, and it does not really show relations, but it does show an interesting overview of the literature corpus, like how often species are mentioned and with what they are mentioned together most of the time. You can easily draw reasonable conclusions like how often species names are misspelled. However, it would be more useful for this to have SQL queries or something similar. CTJ-Factvis shows even more potential, with the Wikidata integration. I do need to pay more attention to the fact that those assertions are alleged facts, and not regular ones, as I called them in earlier blog posts.

Fellowship

In general, the fellowship went pretty well for me. In retrospect, I did a lot of the things I wanted to do, even though throughout the project it felt like there was so much left to do, and there is! I am really excited about the possibilities that emerged during the fellowship, even in the last weeks. How cool would it be to extend this project with entire Web APIs and more? This is, for a big part, thanks to the support, feedback, and input of the amazing ContentMine team during the regular meetings, and the quick responses to various software issues. I also enjoyed blogging about my progress on my own blog and on the ContentMine blog.

Evaluating Trends in Bioinformatics Software Packages

Genomics — the study of the genome — requires processing and analysis of large-scale datasets. For example, to identify genes that increase our susceptibility to a particular disease, scientists can read many genomes of individuals with and without this disease and try to compare differences among genomes. After running human samples on DNA sequencers, scientists need to work with overwhelming datasets.

Working with genomic datasets requires knowledge of a myriad of bioinformatic software packages. Because of rapid development of bioinformatic tools, it can be difficult to decide which software packages to use. For example, there are at least 18 software packages to align short unspliced DNA reads to a reference genome. Of course, scientists make this choice routinely and their collective knowledge is embedded in the literature.

To this end, I have been using ContentMine to find trends in bioinformatic tools. This blog post outlines my approach and demonstrates important challenges to address. Much of my current effort goes into obtaining as comprehensive a set of articles as possible, ensuring correct counting/matching methods for bioinformatic packages, and making this pipeline reproducible.


Continue reading Evaluating Trends in Bioinformatics Software Packages

MeSHing ContentMine

Ni!

A while back I set to convert the MeSH vocabulary tree into dictionaries usable by ContentMine.

What the MeSH ?

Medical Subject Headings is a hierarchical vocabulary of concepts organized by the U.S. National Library of Medicine, part of the U.S. National Institutes of Health. It is, as far as I know, one of the most complete vocabularies of terms considered relevant to medical research. It ranges from socio-demographic factors to diseases, organs, chemicals, bacteria, etc. You can search and navigate the hierarchy here.

Why the MeSH?

There are several reasons to use the MeSH vocabulary as a dictionary in your content mining research. It is the standard by which papers in PubMed are classified, it is quite comprehensive regarding medical subjects, it is widely adopted and recognized as a standard for mining medical literature, it gets regularly updated in a traceable manner, and it is carefully annotated and entirely hierarchical.

So, even though we now have Wikidata for everything and then some, in the medical field you usually don’t need to go further than what MeSH provides. Reviewers, and perhaps you yourself, may find it more reliable and more comparable to other research, and by bringing it to ContentMine we might be able to improve Wikidata with data from the MeSH vocabulary.

Converting XML MeSH

ContentMine’s dictionaries repository states that it can digest dictionaries in either XML or JSON format. Since the MeSH vocabulary is distributed in XML, I’ve decided to first convert it to XML for ContentMine as well, essentially using the format provided on the repo’s main page.

That leads us to this tiny little script, mesh2cmdict.py, which essentially does what it says it does. With some cool tweaks:

it quietly and temporarily downloads the XML MeSH file if you don’t provide it
it lets you select parts of the tree by using MeSH Tree Numbers in regular expressions

So if you’re looking to study diseases related to alcohol, you can create a specific dictionary with:

./mesh2cmdict.py --match 'C25.775.100|F03.900.100' alcoholdict.xml

Do you also want your dictionary to include a family of bacteria whose relationship to alcohol related diseases you’re interested in? Just complement the expression:

./mesh2cmdict.py --match '(C25.775.100|F03.900.100)|(B03.440.425.410.711|B03.851.595)' alcoholspirodict.xml

Great!
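To illustrate how selecting parts of the tree with regular expressions can work, here is a small sketch. It is not the code of mesh2cmdict.py: the function name is hypothetical, matching from the start of the tree number with re.match is an assumption, and the longer tree numbers below are made up for the example rather than checked against MeSH. Because tree numbers are hierarchical, a pattern that matches a prefix also selects everything below that node.

```python
import re


def select_tree_numbers(tree_numbers, pattern):
    """Keep the tree numbers that match the pattern from their start."""
    regex = re.compile(pattern)
    return [number for number in tree_numbers if regex.match(number)]


# Illustrative tree numbers; only the two short prefixes come from the commands above
tree_numbers = [
    "C25.775.100",
    "C25.775.100.250",
    "F03.900.100.350",
    "B03.440.425",
]

print(select_tree_numbers(tree_numbers, r"C25\.775\.100|F03\.900\.100"))
```

Note that an unescaped `.` in a regular expression matches any character, so `C25.775.100` is slightly more permissive than `C25\.775\.100`. For MeSH tree numbers this rarely matters in practice, but escaping the dots makes the intent explicit.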

Now could we include some geographical locations? That’s a good point to understand MeSH’s limitations as well. If you browse the tree, you’ll notice geographical locations worldwide are mostly limited to countries, and to states in the USA. Not very great depending on your objectives, so working with other dictionary sources can be preferable.

At the same time, there’s a lot more to MeSH, such as qualifiers and relatedness, which could be explored to create even richer dictionaries. We’ll get there, eventually.

Next time

Well, now that I can create these fun dictionaries, it’d be really cool to use them! Aaand that shall be the subject of my next post 😉

Happy holidays! \o/

Content mining for trend analysis

Let’s suppose you have assembled a large collection of papers (we’ll call that the corpus) as a starting point for a literature review. Some of the first questions would be of an exploratory nature: you would like to get an intuition of what’s really in there. “Is there a certain structure, possibly a hidden bias I need to take into account? What is the coverage, are there some ‘holes’ in the data set, perhaps some missing months, or should I include another keyword in the search? How do certain keyword frequencies develop over time, is there a trend appearing?” We can help with getting this initial overview, and speed up the process to get you working on the questions that really interest you.
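As a minimal sketch of the keyword-frequency question, assume each paper is available as a record with a publication year and its text. The record layout, the function name and the sample data are assumptions for illustration, not ContentMine output.

```python
from collections import Counter


def keyword_trend(papers, keyword):
    """Return {year: fraction of that year's papers that mention the keyword}."""
    totals = Counter()
    hits = Counter()
    for paper in papers:
        totals[paper["year"]] += 1
        if keyword.lower() in paper["text"].lower():
            hits[paper["year"]] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Made-up corpus for illustration only
papers = [
    {"year": 2014, "text": "A study of conifer growth."},
    {"year": 2014, "text": "A study of something else."},
    {"year": 2015, "text": "Conifer needles under drought."},
]

print(keyword_trend(papers, "conifer"))
```

Reporting a fraction per year rather than a raw count avoids mistaking a growing corpus for a growing trend. Years that are absent from the result point to possible holes in the data set.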

Continue reading Content mining for trend analysis

ContentMining Conifers and Visualising Output: Extracting Semantic Triples

Last month I wrote about visualising ContentMine output. It was limited to showing and grouping ami results which — in this case — were simply words extracted from the text.

A few weeks back, I got to know the ContentMine fact dumps. These are not just extracted words: they are linked to Wikidata Entities, from which all sorts of identifiers and properties can be derived. This is visualised by factvis. For example, when you hover over a country’s name, the population appears.

Now, I’m experimenting with the next level of information: semantic triples. Because NLP is a step too far at the moment, I resorted to extracting these from tables. To see how this went in detail, please take a look at my blogpost. A summary: norma (the ContentMine program that normalises articles) has an issue with tables, table layout is very inconsistent across papers (which makes it hard to parse), and I’m currently providing Wikidata IDs for (hopefully) all three parts of the triple.

About that last one: I’ve made a dictionary of species names paired with Wikidata Entity IDs with this query. I limited the number of results because the Web Client times out if I don’t. I run the query locally using the following command:

curl -i -H "Accept: text/csv" --data-urlencode query@<queryFile>.rq -G https://query.wikidata.org/bigdata/namespace/wdq/sparql -o <outputFile>.csv

Then I can match the species names to the values I get when extracting from the table. If I can do this for the property as well, we’ll have a functioning program that creates Wikidata-grade statements from hundreds of articles in a matter of minutes.
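The matching step could look roughly like this. It is a sketch under assumptions: the CSV column names `item` and `itemLabel` depend on how the SPARQL query is written, the function names are hypothetical, and the entity IDs in the sample are placeholders, not real Wikidata IDs.

```python
import csv
import io


def load_species_dictionary(csv_file):
    """Build {lowercased species name: Wikidata entity ID} from the query result CSV."""
    dictionary = {}
    for row in csv.DictReader(csv_file):
        # The query service returns entities as URIs; keep only the trailing ID
        entity_id = row["item"].rsplit("/", 1)[-1]
        dictionary[row["itemLabel"].strip().lower()] = entity_id
    return dictionary


def link_species(cell_text, dictionary):
    """Return the entity ID for a table cell, or None when the name is not known."""
    return dictionary.get(cell_text.strip().lower())


# Stand-in for the downloaded CSV file; the IDs are placeholders
sample_csv = io.StringIO(
    "item,itemLabel\n"
    "http://www.wikidata.org/entity/Q0000001,Picea glauca\n"
    "http://www.wikidata.org/entity/Q0000002,Picea abies\n"
)

species = load_species_dictionary(sample_csv)
print(link_species(" Picea glauca ", species))
print(link_species("P. glauca", species))
```

Exact matching like this does not resolve abbreviated names, which is one of the open issues.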

There are still plenty of issues. Mapping Wikidata property names to text from tables, for example, or the fact that there are several species whose names are, when shortened, O. bicolor. I can’t know which one is meant just from the context of the table. And all the table layout issues, although probably easier to fix, are still here. That, and upscaling, is what I’m going to focus on for the next month.

But for now, there’s a demo similar to factvis, to preview the triples:

Job Opportunity: ContentMine Operations Manager

ContentMine was founded in 2016 as a UK non-profit company limited by guarantee. Our mission is to establish content mining for research and for education as a widespread philosophy and practice through:

creating computer programs, protocols, practices, standards and educational materials that enable content mining,
training researchers and others in content mining,
encouraging research institutions and funders of research to support establishing freedom for anyone to engage in computational analysis of books, journals, databases and other knowledge sources for the purposes of education and research.

We develop open source software for mining the scientific literature and engage directly in supporting researchers to use mining, saving valuable time and opening up new research avenues.

For information, please visit http://contentmine.org/jobs

Position

We are seeking an Operations Manager to take overall operational responsibility for ContentMine’s development and execution of its mission, reporting to the Board of Directors and working closely with the ContentMine Founder, Dr Peter Murray-Rust. The successful candidate will develop deep knowledge of our core focus, operations, and business development opportunities and manage the transition of the organisation from a project to a sustainable non-profit with oversight of all major business areas from fundraising to communications and HR.

Salary

£40-45k pro rata, negotiable.

Time and Location

4 days per week, fixed term contract for four months in the first instance, with renewal subject to funding. The candidate should be a UK or EU national. Remote working is possible, but candidates within easy travelling distance of Cambridge are preferred.

Responsibilities

Leadership and Management:

Ensure ongoing excellence in delivery of the ContentMine mission, including program evaluation, and consistent quality of finance and administration. Manage fundraising, communications, and systems; recommend timelines and resources needed to achieve the strategic goals.
Actively engage and energize ContentMine board members, contractors, collaborators, Fellows, volunteers and funders.
Ensure effective systems to track progress, evaluate program components and report to the Board and funders.

Fundraising and Communications:

Expand revenue generating and fundraising activities to support existing program operations and planned developments.
Oversee and refine all aspects of communications—from web presence to external relations, with the goal of creating a stronger brand based on a recent graphical design exercise.
Use external presence and relationships to garner new opportunities.

Planning and New Business:

Build partnerships with research-oriented organisations including groups and institutes, scholarly societies and NGOs.
Establish relationships with potential collaborators and philanthropic funders.
Write grant applications and tender for client contracts.
Manage relationships and work allocations with partner organisations and contractors who bring new skills and capabilities to projects.

Person Specification

The Operations Manager will be thoroughly committed to ContentMine’s mission. All candidates should have proven leadership and relationship management experience. Concrete demonstrable experience and other qualifications include:

At least 5 years of management experience; track record of effectively leading an outcomes-based organization.
Ability to point to specific examples of having developed and actioned strategies that have taken an organization to the next stage of growth.
Commitment to delivering quality programs and data-driven program evaluation.
Excellence in organisational management including developing high-performance teams, setting and achieving strategic objectives, and managing a budget.
Fundraising experience with the ability to engage a wide range of stakeholders, particularly in the academic, non-profit, research and publishing sectors.
Strong written and verbal communication skills; a persuasive and passionate communicator with excellent interpersonal and multidisciplinary project skills.
Action-oriented, entrepreneurial, adaptable approach to business planning.
Ability to work effectively in collaboration with diverse groups of people.
Passion, integrity, positive attitude, mission-driven and self-directed focus are all desirable.

To apply

Please submit a cover letter and CV to admin@contentmine.org by 2 Dec 2016. Interviews will be held by 9 Dec. Informal enquiries should be directed to Dr Peter Murray-Rust (peter@contentmine.org).

Digging into cell migration literature

It’s now been a few weeks since I started working with the other fellows and people at ContentMine to dig into cell migration literature. I must admit it has been quite a challenge, because in the meantime I have submitted and successfully defended my PhD thesis!

Now, if you are wondering what my specific project was about, you can read about it here: the basic idea is to get a picture of (in)consistency in cell migration literature nomenclature, to build a set of minimal reporting requirements for the community.

So, I have been using the ContentMine pipeline more and more and, like most of the other fellows, I have encountered a few problems here and there; the team has been a fantastic support in fixing these issues. I have used the getpapers command so much that I can now run both basic and more advanced queries basically with my eyes closed (or a das keyboard ;)). For now, I have only used the default eupmc API and, given the large number of papers available, I have decided to narrow down my search to papers published between 2000 and 2016 that describe in vitro cell migration studies.

This results in a search space of about 700 open access papers.

Having the full XML text, I then used norma to normalize it and obtain scholarly html files. The first thing I wanted to check was the word frequencies, to get a rough idea of which words are used most, and in which sections of the papers. The ami-2word plugin seemed to be just perfect for this! However, when running the plugin with a stop-words file (a file containing some words that I would like to be ignored during the analysis, like the ones listed here), the file seems to get ignored (most likely because it cannot be parsed by the plugin). You can find this file here.
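For reference, this is what I would expect a word-frequency count with stop words to do, sketched in plain Python. It is independent of ami-2word and does not reproduce its behaviour; the function name, the tokenisation (lowercase runs of letters) and the sample text are assumptions for illustration.

```python
import re
from collections import Counter


def word_frequencies(text, stop_words):
    """Count lowercase words in the text, skipping any word in stop_words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stop_words)


# In practice the stop words would be read from a file, one word per line
stop_words = {"the", "of", "and", "in", "was"}
text = "The migration of cells was measured in the wound healing assay."

print(word_frequencies(text, stop_words).most_common(3))
```

Comparing counts with and without the stop-word set on the same text is a quick way to check whether a stop-words file is actually being applied.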

I am now discussing this with the fellows and the entire team, and in the process of figuring out if I did something stupid, or if this is an issue we need to correct for, to make the tools at ContentMine even better!

The entire set of commands and results developed so far are in my github repo.

And here is what I want to do next:

fix the issue with the stop-words file and visualize the word frequencies across the papers (most likely using a word cloud)
use the ami-regex plugin with a set of expressions (terms) that I would like to search for in the papers
use the pyCProject to parse my CProject structure and the related CTree and convert these into Python data structures. In this way, downstream manipulation (filtering, visualization etc.) can be done using Python (I will definitely use a jupyter notebook that I will then make available online).

Paola, 2016 fellow

ContentMine featured in Horizon magazine article “Copyright shift would put Europe ahead in ‘future of research’ data mining”

Horizon magazine featured an article on text and data mining and specifically the European Commission proposal for a copyright exception, currently covering “public or private organisations that are carrying out scientific research in the public interest”.

Read the full article >>

Dr Peter Murray-Rust is director of ContentMine, a not-for-profit organisation which has developed software that enables researchers to search through scientific papers on a particular subject. He gives the example of the Zika outbreak as an area where TDM can help to enhance knowledge.

‘We’re going to need to know a lot more about Zika, and much of it may already be in the scientific literature that’s been published but that we don’t read. We don’t read it because there’s so much, so we’ve built a machine, ContentMine, that will liberate the facts from the literature.’

ContentMining Conifers and Visualising Output: Creating a Page for ContentMine Output

In the past few weeks I have made a few pages to visualise ContentMine output of articles on several topics. For this to work, I needed to develop a program to convert ContentMine output to a single (JSON) file to access the facts more easily, and load this in an HTML page. These pages currently contain lists of articles, genera and species. Now, you can quickly glance at a lot of articles, and see common recurrences (although currently most “common recurrences” are typos in plant names on the author’s end).

For a more detailed description of my progress, see my blog.

Introducing Fellow Guanyang Zhang: Mining weevil-plant associations

I am a taxonomist, the kind of biologist who is charged with discovering, documenting and describing life on earth. I specialize in insects, the most diverse and successful form of life. My ContentMine Fellowship project will focus on mining weevil-plant associations from literature records. I describe my project below.

Motivation. Comprising ~70,000 described and 220,000 estimated species, weevils (Curculionoidea) are one of the most diverse plant-feeding insect lineages and constitute nearly 5% of all known animals. Knowledge of host plant associations is critical for pest management, conservation, and comparative biological research. This knowledge is, however, scattered in 300 years of historical literature and difficult to access. This study will use ContentMine to extract, organize and synthesize knowledge of host plant associations of weevils from the literature. I have been doing literature mining manually and generated nearly 700 entries.

Continue reading Introducing Fellow Guanyang Zhang: Mining weevil-plant associations
