MeSHing ContentMine


A while back I set out to convert the MeSH vocabulary tree into dictionaries usable by ContentMine.

What the MeSH?

Medical Subject Headings (MeSH) is a hierarchical vocabulary of concepts maintained by the U.S. National Library of Medicine, part of the U.S. National Institutes of Health. It is, as far as I know, one of the most complete vocabularies of terms considered relevant to medical research. It ranges from socio-demographic factors to diseases, organs, chemicals, bacteria and more. You can search and navigate the hierarchy here.

Why the MeSH?

There are several reasons to use the MeSH vocabulary as a dictionary in your content mining research. It is the standard by which papers in PubMed are classified, it is quite comprehensive regarding medical subjects, it is widely adopted and recognized as a standard for mining medical literature, it gets regularly updated in a traceable manner, and it is carefully annotated and entirely hierarchical.

So, even though we now have Wikidata for everything and then some, in the medical field you usually don’t need to go further than what MeSH provides. Reviewers, and perhaps you yourself, may find it more reliable and more comparable to other research. And by bringing it to ContentMine we might be able to improve Wikidata with data from the MeSH vocabulary.

Converting XML MeSH

ContentMine’s dictionaries repository states that it can digest dictionaries in either XML or JSON format. Since the MeSH vocabulary is distributed in XML, I decided to first convert it to XML for ContentMine as well, essentially using the format provided on the repo’s main page.
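To give an idea of what such a conversion involves, here is a minimal Python sketch. It is not the script discussed in this post. It assumes the MeSH descriptor file uses `DescriptorRecord` elements with a `DescriptorName/String` child, and it assumes a ContentMine dictionary is a `dictionary` element holding `entry` elements with `term` and `name` attributes; check the repo’s main page for the exact format. The sample records are trimmed-down illustrations, not an excerpt of the real file, and `mesh_to_dictionary` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, illustrative records in the shape of a MeSH descriptor file.
MESH_SAMPLE = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorName><String>Calcimycin</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>D03.633.100.221.173</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Abdomen</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>A01.923.047</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>"""


def mesh_to_dictionary(mesh_xml, title):
    """Build a <dictionary> element with one <entry> per MeSH descriptor.

    The output layout (term/name attributes) is an assumption about the
    ContentMine dictionary format, not a confirmed specification.
    """
    root = ET.fromstring(mesh_xml)
    dictionary = ET.Element("dictionary", {"title": title})
    for record in root.iter("DescriptorRecord"):
        name = record.findtext("DescriptorName/String")
        if name is None:
            continue  # skip records without a readable name
        # Lowercase the search term, keep the original spelling as the name.
        ET.SubElement(dictionary, "entry", {"term": name.lower(), "name": name})
    return dictionary


if __name__ == "__main__":
    result = mesh_to_dictionary(MESH_SAMPLE, "mesh")
    print(ET.tostring(result, encoding="unicode"))
```

The real MeSH file is large, so a production version would probably stream it (for example with `ET.iterparse`) instead of loading it in one go.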

That leads us to this tiny little script, which essentially does what it says it does, with some cool tweaks:

  • it quietly and temporarily downloads the XML MeSH file if you don’t provide it
  • it lets you select parts of the tree by using MeSH Tree Numbers in regular expressions

So if you’re looking to study diseases related to alcohol, you can create a specific dictionary with:

./ --match 'C25.775.100|F03.900.100' alcoholdict.xml

Do you also want your dictionary to include a family of bacteria whose relationship to alcohol-related diseases you’re interested in? Just extend the expression:

./ --match '(C25.775.100|F03.900.100)|(B03.440.425.410.711|B03.851.595)' alcoholspirodict.xml
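For intuition, selecting by tree number can be pictured as matching a regular expression against the start of each descriptor’s tree numbers, so that a whole subtree comes along with its root. The Python sketch below illustrates that idea under this assumption; it is not taken from the script, whose exact matching rules may differ. The descriptor labels and the `select` helper are hypothetical.

```python
import re

# Hypothetical descriptors mapped to their tree numbers (labels are made up).
SAMPLE = {
    "subtree root": ["C25.775.100", "F03.900.100"],
    "subtree child": ["C25.775.100.250"],
    "unrelated": ["D03.633.100.221.173"],
}


def select(descriptors, pattern):
    """Return the names whose tree numbers start with a match for pattern."""
    regex = re.compile(pattern)
    return sorted(
        name
        for name, numbers in descriptors.items()
        # re.match anchors at the start, so children of a matched
        # tree number are selected together with it.
        if any(regex.match(number) for number in numbers)
    )


if __name__ == "__main__":
    print(select(SAMPLE, r"C25\.775\.100|F03\.900\.100"))
    # An unescaped dot matches any character, not only a literal dot.
    print(bool(re.match("C25.775.100", "C25x775x100")))
    print(bool(re.match(r"C25\.775\.100", "C25x775x100")))
```

In a regular expression an unescaped `.` matches any character. With tree numbers this rarely causes trouble, since the separators are always dots, but escaping them as `\.` makes the intent explicit.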


Now, could we include some geographical locations? That’s a good place to understand MeSH’s limitations as well. If you browse the tree, you’ll notice that geographical locations worldwide are mostly limited to countries, and to states in the USA. That may not be enough, depending on your objectives, so working with other dictionary sources can be preferable.

At the same time, there’s a lot more to MeSH, such as qualifiers and relatedness, which could be explored to create even richer dictionaries. We’ll get there, eventually.

Next time

Well, now that I can create these fun dictionaries, it’d be really cool to use them! Aaand that shall be the subject of my next post 😉

Happy holidays! \o/


Introducing Fellow Alexandre Hannud Abdo: Digging up Global Health facts

“Our goal is to mine facts from global health research and provide automated referenced summaries to practitioners and agents who don’t have the means or the time to navigate the literature.

Plus, for extreme situations where decisions can’t wait for new research to be confirmed, we’ll do our best to highlight which preliminary facts, as published in the latest conferences, share more traits with previously confirmed results.”

Hi! My name is Ale, I was born and raised in Brazil, where I also obtained my PhD in Physics and later participated in large-scale projects in epidemiology and public health. I currently live in France, where I work at LISIS (Interdisciplinary Laboratory on Sciences, Innovations and Societies) in a project about the evolution of the field of oncology.

I am extremely happy to join this first cohort of ContentMine Fellows. I participated in a ContentMine workshop in 2014 and have been following the progress of the project ever since, looking for an opportunity to collaborate, which has now materialized.

The research field of my proposal is Global Health, and our goal is as stated in the opening paragraph. It should be noted that I am not alone, and my interest comes from a group formed during OpenCon 2015 by Neo Chung — who serendipitously was also awarded a ContentMine fellowship under an entirely different subject.

Being a fellow will enable me to develop this proposal within a motivating community, and will help my lab better understand the time I spend away from my primary job. It will also accelerate my learning and my contributions to the platform, and will hopefully help attract other researchers excited by this idea!