Digging into cell migration literature

It has now been a few weeks since I started working with the other fellows and the people at ContentMine to dig into the cell migration literature. I must admit it has been quite a challenge, because in the meantime I have submitted and successfully defended my PhD thesis!

Now, if you are wondering what my specific project is about, you can read about it here: the basic idea is to get a picture of the (in)consistency of nomenclature in the cell migration literature, and from that to build a set of minimal reporting requirements for the community.

So, I have been using the ContentMine pipeline more and more and, like most of the other fellows, I have encountered a few problems here and there; the team has been a fantastic support in fixing these issues. I have used the getpapers command so much that I can now run both basic and more advanced queries basically with my eyes closed (or on a Das Keyboard ;)). For now, I have only used the default eupmc API and, given the large number of papers available, I have decided to narrow down my search by downloading only papers published between 2000 and 2016 that describe in vitro cell migration studies.

This results in a search space of about 700 open access papers.
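As a rough illustration of how such a narrowed-down search could be assembled, here is a minimal sketch in Python. The search terms are hypothetical (the exact query I ran is not reproduced here), the `PUB_YEAR` and `OPEN_ACCESS` fields are assumed to follow the Europe PMC search syntax, and the getpapers flags are assumptions to check against `getpapers --help` for your installed version. The helper names `build_eupmc_query` and `getpapers_command` are made up for this example.

```python
import shlex


def build_eupmc_query(terms, first_year, last_year, open_access=True):
    """Combine quoted free-text terms with a publication-year range.

    Field names (PUB_YEAR, OPEN_ACCESS) are assumed Europe PMC search fields.
    """
    parts = [f'"{term}"' for term in terms]
    parts.append(f"PUB_YEAR:[{first_year} TO {last_year}]")
    if open_access:
        parts.append("OPEN_ACCESS:y")
    return " AND ".join(parts)


def getpapers_command(query, outdir):
    """Return the argument list for a getpapers call requesting full-text XML.

    The flags are assumptions; verify them with `getpapers --help`.
    """
    return ["getpapers", "--api", "eupmc", "-q", query, "-o", outdir, "-x"]


# Hypothetical search terms and output directory, for illustration only.
query = build_eupmc_query(["cell migration", "in vitro"], 2000, 2016)
command = getpapers_command(query, "cell_migration")

# Print a shell-safe version of the command instead of running it.
print(shlex.join(command))
```

Building the command as a list keeps the query as one argument, so its quotes and brackets survive if it is later passed to `subprocess.run`.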

Having the full XML text, I then used norma to normalize it and obtain scholarly HTML files. The first thing I wanted to check was the word frequencies, to get a rough idea of which words are used most, and in which sections of the papers. The ami-2word plugin seemed to be just perfect for this! However, when running the plugin with a stop-words file (a file containing words that I would like to be ignored during the analysis, like the ones listed here), the file seems to get ignored (most likely because it cannot be parsed by the plugin). You can find this file here.
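To make clear what I expect from a stop-words file, here is a minimal sketch of a stop-word-filtered frequency count in plain Python. It is not how the ami plugin works internally. The file format (one word per line, `#` for comments) is an assumption, and `load_stop_words` and `word_frequencies` are names made up for this example.

```python
import re
from collections import Counter


def load_stop_words(text):
    """Parse stop words from text with one word per line.

    Blank lines and lines starting with '#' are skipped (assumed format).
    """
    words = set()
    for line in text.splitlines():
        line = line.strip().lower()
        if line and not line.startswith("#"):
            words.add(line)
    return words


def word_frequencies(text, stop_words=frozenset()):
    """Count lower-cased words (hyphenated words stay whole), skipping stop words."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Toy inputs, for illustration only.
stop_words = load_stop_words("# toy stop-word list\nthe\nof\nand\n")
sample = "The migration of cells and the invasion of cells"
frequencies = word_frequencies(sample, stop_words)
print(frequencies.most_common())
```

Running the same text with and without the stop-word set is a quick way to check whether a stop-words file is actually being applied.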

I am now discussing this with the fellows and the entire team, and I am in the process of figuring out whether I did something stupid, or whether this is an issue we need to fix to make the tools at ContentMine even better!

The entire set of commands and results developed so far is in my GitHub repo.

And here is what I want to do next:

  • fix the issue with the stop-words file and visualize the word frequencies across the papers (most likely using a word cloud)
  • use the ami-regex plugin with a set of expressions (terms) that I would like to search for in the papers
  • use pyCProject to parse my CProject structure and the related CTrees and convert these into Python data structures. In this way, downstream manipulation (filtering, visualization, etc.) can be done using Python (I will definitely use a Jupyter notebook that I will then make available online).
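To illustrate the last point, here is a minimal sketch of turning a CProject directory into a Python dictionary. It deliberately does not use the pyCProject API. It assumes only that a CProject is a directory with one sub-directory (CTree) per paper, holding files such as `fulltext.xml` and `scholarly.html`. The function names and the toy paper identifiers are made up for this example.

```python
import tempfile
from pathlib import Path


def read_cproject(root):
    """Map each CTree (sub-directory) name to the relative paths of its files."""
    project = {}
    for ctree in sorted(Path(root).iterdir()):
        if ctree.is_dir():
            project[ctree.name] = sorted(
                path.relative_to(ctree).as_posix()
                for path in ctree.rglob("*")
                if path.is_file()
            )
    return project


def ctrees_with(project, filename):
    """Return the names of the CTrees that contain the given file."""
    return [name for name, files in project.items() if filename in files]


# Build a tiny throwaway CProject so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("PMC0000001", "PMC0000002"):  # hypothetical identifiers
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / "fulltext.xml").write_text("<article/>")
    # Pretend only the first paper has been normalized so far.
    (Path(tmp) / "PMC0000001" / "scholarly.html").write_text("<html/>")
    demo = read_cproject(tmp)

print(ctrees_with(demo, "scholarly.html"))
```

A plain dictionary like this is easy to filter or load into a table, which is the kind of downstream manipulation I have in mind for the notebook.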

Paola, 2016 fellow

Introducing Fellow Paola Masuzzo: Mining the literature to understand cell migration

I have just submitted my PhD in Biomedical Sciences at VIB and Ghent University (Belgium), with a thesis entitled “An open data exchange ecosystem: forging a new path for cell migration data analysis and mining”. During my PhD I have tried to make cell migration research more ‘open’: I have developed computational open source tools and algorithms for the storage, management, dissemination and analysis of cell migration experiments. Furthermore, I have tried to push this ‘open’ concept a bit beyond my own PhD, and have succeeded in engaging a few researchers in this fight: I am now working in MULTIMOT, an EU-H2020 funded project that aims to build an open data ecosystem for cell migration research, with the ultimate goal to increase reproducibility, and allow analyses to take place on rich datasets that would otherwise remain unused.

As a ContentMine fellow, I want to text mine the literature around cell migration and invasion, because I believe that there is a huge amount of information in all the papers continuously published in the field, and that this information simply cannot be processed by the human eye. Specifically, text mining cell migration articles will hopefully help with the following tasks:

  • automatically detect a set of core information reported when describing experiments in the field, and therefore construct a collection of minimum reporting requirements from these. These requirements can then be used to aid experimental and computational reproducibility.
  • check for nomenclature consistency, the use of common terms or ontologies to describe the same concept, again with the goal to increase reproducibility, and allow meta-analyses to take place.
  • construct a knowledge map that could capture the current status of information in the field, especially in terms of cell motility-related compounds and (cancer) cell lines.