MeSHing ContentMine

Ni!

A while back I set out to convert the MeSH vocabulary tree into dictionaries usable by ContentMine.

What the MeSH?

Medical Subject Headings (MeSH) is a hierarchical vocabulary of concepts maintained by the U.S. National Library of Medicine, part of the U.S. National Institutes of Health. It is, as far as I know, one of the most complete vocabularies of terms considered relevant to medical research. It ranges from socio-demographic factors to diseases, organs, chemicals, bacteria and more. You can search and navigate the hierarchy here.

Why the MeSH?

There are several reasons to use the MeSH vocabulary as a dictionary in your content mining research. It is the standard by which papers in PubMed are classified, it is quite comprehensive regarding medical subjects, it is widely adopted and recognized as a standard for mining medical literature, it gets regularly updated in a traceable manner, and it is carefully annotated and entirely hierarchical.

So, even though we now have Wikidata for everything and then some, in the medical field you usually don’t need to go further than what MeSH provides. Reviewers, and perhaps you yourself, may find it more reliable and more comparable to other research. And by bringing it to ContentMine we might be able to improve Wikidata with data from the MeSH vocabulary.

Converting XML MeSH

ContentMine’s dictionaries repository states that it can digest dictionaries in either XML or JSON format. Since the MeSH vocabulary is distributed in XML, I decided to first convert it to XML for ContentMine as well, essentially using the format shown on the repo’s main page.
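To give an idea of what such a conversion involves, here is a minimal sketch, not the actual code of the script discussed in this post. It assumes the MeSH descriptor file nests `DescriptorRecord` elements with a `DescriptorName/String` child, and that a ContentMine dictionary is a `dictionary` element holding `entry` elements with `term` and `name` attributes. The sample record and the function name `mesh_to_cmdict` are hypothetical, made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a MeSH descriptor file (illustrative values only).
SAMPLE_MESH = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>D000000</DescriptorUI>
    <DescriptorName><String>Example Heading</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>C25.775.100.999</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>"""


def mesh_to_cmdict(mesh_xml, title):
    """Build a ContentMine-style dictionary (assumed layout) from MeSH XML."""
    root = ET.fromstring(mesh_xml)
    dictionary = ET.Element("dictionary", {"title": title})
    for record in root.iter("DescriptorRecord"):
        name = record.findtext("DescriptorName/String")
        if name is None:
            continue  # skip records without a usable heading
        # Assumption: the heading serves as both the search term and the name.
        ET.SubElement(dictionary, "entry", {"term": name, "name": name})
    return ET.tostring(dictionary, encoding="unicode")


print(mesh_to_cmdict(SAMPLE_MESH, "demo"))
```

The real MeSH file is large, so a streaming parser such as `ET.iterparse` would probably be a better fit than loading everything into memory at once.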

That leads us to this tiny little script, mesh2cmdict.py, which essentially does what it says it does. With some cool tweaks:

  • it quietly and temporarily downloads the XML MeSH file if you don’t provide it
  • it lets you select parts of the tree by using MeSH Tree Numbers in regular expressions
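The tree selection can be pictured roughly as follows. This is only a sketch under assumptions: I'm supposing the expression is matched from the start of each `TreeNumber` (so a prefix selects a whole subtree) and that a record is kept if any of its tree numbers matches. The sample data and the function `select_names` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature of the MeSH descriptor file, for illustration only.
SAMPLE = """<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorName><String>Heading A</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>C25.775.100.999</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Heading B</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>B03.440.999</TreeNumber>
      <TreeNumber>F03.900.100.999</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorName><String>Heading C</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>Z01.999</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>"""


def select_names(mesh_xml, pattern):
    """Return headings having at least one tree number matching the pattern."""
    regex = re.compile(pattern)
    selected = []
    for record in ET.fromstring(mesh_xml).iter("DescriptorRecord"):
        numbers = [node.text or "" for node in record.iter("TreeNumber")]
        # re.match anchors at the start, so a prefix selects the whole subtree.
        if any(regex.match(number) for number in numbers):
            selected.append(record.findtext("DescriptorName/String"))
    return selected


print(select_names(SAMPLE, "C25.775.100|F03.900.100"))
# prints ['Heading A', 'Heading B']
```

Since a MeSH heading can sit in several branches at once, checking every tree number of a record is what lets one expression pull in concepts from different corners of the hierarchy.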

So if you’re looking to study diseases related to alcohol, you can create a specific dictionary with:

./mesh2cmdict.py --match 'C25.775.100|F03.900.100' alcoholdict.xml

Do you also want your dictionary to include a family of bacteria whose relationship to alcohol-related diseases you’re interested in? Just extend the expression:

./mesh2cmdict.py --match '(C25.775.100|F03.900.100)|(B03.440.425.410.711|B03.851.595)' alcoholspirodict.xml
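A small caveat on those expressions: in a regular expression an unescaped dot matches any character, so 'C25.775.100' is a bit looser than it looks. If you want to be strict, a sketch like the one below builds the expression for you. `subtree_pattern` is a hypothetical helper, not something the script provides, and it assumes the expression is applied from the start of each tree number.

```python
import re


def subtree_pattern(*tree_numbers):
    """Build a strict expression matching the given nodes and their subtrees."""
    # re.escape turns each "." into a literal dot.
    alternatives = "|".join(re.escape(number) for number in tree_numbers)
    # The trailing group makes the prefix end exactly on a node boundary.
    return rf"(?:{alternatives})(?:\.|$)"


pattern = subtree_pattern("C25.775.100", "F03.900.100")
regex = re.compile(pattern)

print(bool(regex.match("C25.775.100.999")))             # strict: inside the subtree
print(bool(regex.match("C25x775x100")))                 # strict: dots must be dots
print(bool(re.match("C25.775.100", "C25x775x100")))     # loose: "." matches "x"
```

In practice the loose form rarely causes trouble, since tree numbers only contain letters, digits and dots, but the strict form avoids surprises when expressions get long.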

Great!

Now could we include some geographical locations? That’s a good place to see MeSH’s limitations as well. If you browse the tree, you’ll notice geographical locations worldwide are mostly limited to countries, and to states in the USA. That may not be enough, depending on your objectives, so working with other dictionary sources can be preferable.

At the same time, there’s a lot more to MeSH, such as qualifiers and relatedness, which could be explored to create even richer dictionaries. We’ll get there, eventually.

Next time

Well, now that I can create these fun dictionaries, it’d be really cool to use them! Aaand that shall be the subject of my next post 😉

Happy holidays! \o/

Published by

Ale Abdo

I am a knight who says... Ni! In my time away from doing this nasty thing, I play the role of a postdoctoral researcher at the Interdisciplinary Laboratory on Sciences, Innovations and Societies in Paris, France.
