It’s now been a few weeks since I have started working with the other fellows and people at ContentMine to dig into cell migration literature. I must admit it has been quite a challenge, because I have meantime submitted and successfully defended my PhD thesis!
Now, if you are wondering what my specific project was about, you can read about it here: the basic idea is to get a picture of (in)consistency in cell migration literature nomenclature, to build a set of minimal reporting requirements for the community.
So, I have started using more and more the ContentMine pipeline, and, as most of the other fellows, I have encountered a few problems here and there, and the team has been of a fantastic support to fix these issues. I have used the getpapers command so much that I can now run both basic and more advanced queries basically with my eyes closed (or a das keyboard ;)). For now, I have only used the default eupmc API, and, given a lot of papers available, I have decided to narrow down my search downloading papers published between 2000 and 2016, describing in vitro cell migration studies.
This results into a search space of about 700 open access papers.
Having the full XML text, I have then used norma to normalize this and obtain scholarly html files. First thing I wanted to check is the word frequencies, to get a rough idea of which words are used mostly, and in which sections of the papers. The ami-2word plugin seemed to be just perfect for this! However, when running the plugin with a stop-words file (a file containing some words that I would like to be ignored during the analysis, like the ones listed here), the file seems to get ignored (most likely because it cannot be parsed by the plugin). You can find this file here.
I am now discussing this with the fellows and the entire team, and in the process of figuring out if I did something stupid, or if this is an issue we need to correct for, to make the tools at ContentMine even better!
The entire set of commands and results developed so far are in my github repo.
And here is what I want to do next:
- fix the issue with the stop-words file and visualize the word frequencies across the papers (most likely using a word cloud)
- use the ami-regex plugin with a set of expressions (terms) that I would like to search for in the papers
- use the pyCProject to parse my CProject structure and the related CTree and convert these into Python data structures. In this way, downstream manipulation (filtering, visualization etc.) can be done using Python (I will definitely use a jupyter notebook that I will then make available online).
Paola, 2016 fellow