MeSHing ContentMine

Ni!

A while back I set out to convert the MeSH vocabulary tree into dictionaries usable by ContentMine.

What the MeSH?

Medical Subject Headings (MeSH) is a hierarchical vocabulary of concepts organized by the U.S. National Library of Medicine, part of the U.S. National Institutes of Health. It is, as far as I know, one of the most complete vocabularies of terms considered relevant to medical research. It covers everything from socio-demographic factors to diseases, organs, chemicals, bacteria and more. You can search and navigate the hierarchy here.

Why the MeSH?

There are several reasons to use the MeSH vocabulary as a dictionary in your content mining research. It is the standard by which papers in PubMed are classified, it is quite comprehensive regarding medical subjects, it is widely adopted and recognized as a standard for mining medical literature, it gets regularly updated in a traceable manner, and it is carefully annotated and entirely hierarchical.

So, even though we now have Wikidata for everything and then some, in the medical field you usually don’t need to go further than what MeSH provides. Reviewers (and perhaps yourself) may find it more reliable and more comparable to other research. And by bringing it to ContentMine we might be able to improve Wikidata with data from the MeSH vocabulary.

Converting XML MeSH

ContentMine’s dictionaries repository states that it can digest dictionaries in either XML or JSON formats. Since the MeSH vocabulary is distributed in XML, I decided to first convert it to XML for ContentMine as well, essentially using the format shown on the repo’s main page.

That leads us to this tiny little script, mesh2cmdict.py, which essentially does what it says it does, with some cool tweaks:

  • it quietly and temporarily downloads the XML MeSH file if you don’t provide it
  • it lets you select parts of the tree by using MeSH Tree Numbers in regular expressions
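The selection step can be sketched roughly as follows. This is a minimal illustration and not the code of mesh2cmdict.py: the function name mesh_to_dictionary is hypothetical, the sample records are invented placeholders, and the dictionary/entry layout with id, term and name attributes is my assumption of what the ContentMine XML format looks like. Only the MeSH element names (DescriptorRecord, DescriptorUI, DescriptorName, TreeNumber) follow the descriptor file itself.

```python
import re
import xml.etree.ElementTree as ET

# Tiny stand-in for the MeSH descriptor file. The records are invented
# placeholders; only the element names follow the MeSH descriptor XML.
SAMPLE_MESH = """\
<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>X000001</DescriptorUI>
    <DescriptorName><String>Example alcohol-related disorder</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>C25.775.100.250</TreeNumber>
      <TreeNumber>F03.900.100.350</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorUI>X000002</DescriptorUI>
    <DescriptorName><String>Example unrelated organ</String></DescriptorName>
    <TreeNumberList>
      <TreeNumber>A03.620</TreeNumber>
    </TreeNumberList>
  </DescriptorRecord>
</DescriptorRecordSet>
"""


def mesh_to_dictionary(mesh_xml, pattern, title):
    """Build a dictionary element from descriptors whose tree number matches."""
    tree_re = re.compile(pattern)
    # Assumed output layout: <dictionary title=...><entry id= term= name=/>
    dictionary = ET.Element("dictionary", {"title": title})
    for record in ET.fromstring(mesh_xml).iter("DescriptorRecord"):
        tree_numbers = [t.text for t in record.iter("TreeNumber")]
        # re.match anchors at the start of the tree number, so a pattern
        # naming a node also selects everything below that node.
        if any(tree_re.match(t) for t in tree_numbers):
            name = record.findtext("DescriptorName/String")
            ET.SubElement(dictionary, "entry", {
                "id": record.findtext("DescriptorUI"),
                "term": name.lower(),
                "name": name,
            })
    return dictionary


alcohol = mesh_to_dictionary(
    SAMPLE_MESH, r"C25\.775\.100|F03\.900\.100", "alcoholdict"
)
print(ET.tostring(alcohol, encoding="unicode"))
```

Matching on the start of the tree number is what makes a single number stand for a whole branch of the hierarchy.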

So if you’re looking to study diseases related to alcohol, you can create a specific dictionary with:

./mesh2cmdict.py --match 'C25.775.100|F03.900.100' alcoholdict.xml

Do you also want your dictionary to include a family of bacteria whose relationship to alcohol related diseases you’re interested in? Just complement the expression:

./mesh2cmdict.py --match '(C25.775.100|F03.900.100)|(B03.440.425.410.711|B03.851.595)' alcoholspirodict.xml

Great!
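One detail worth keeping in mind when writing these expressions: in a regular expression an unescaped dot matches any character, and a pattern anchored only at the start also accepts longer numbers that merely share a prefix. The sketch below shows the difference with Python's re module. How mesh2cmdict.py itself applies the expression is not shown here, so take this as an illustration of regex behaviour only. The tree numbers in this post look regular enough that the loose form is unlikely to cause trouble, but escaping makes the intent explicit.

```python
import re

# Loose pattern, as typed on the command line: '.' matches any character.
loose = re.compile("C25.775.100")

# Stricter variant: literal dots, and the node must end there or be
# followed by another '.'-separated level.
strict = re.compile(re.escape("C25.775.100") + r"(\.|$)")

inside_subtree = "C25.775.100.250"
not_a_tree_number = "C25x775y100"
longer_sibling = "C25.775.1001"   # made-up value sharing the same prefix

print(bool(loose.match(inside_subtree)), bool(strict.match(inside_subtree)))
print(bool(loose.match(not_a_tree_number)), bool(strict.match(not_a_tree_number)))
print(bool(loose.match(longer_sibling)), bool(strict.match(longer_sibling)))
```

The first line reports a match for both patterns, while the other two inputs are accepted only by the loose pattern.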

Now, could we include some geographical locations? That’s a good point at which to understand MeSH’s limitations as well. If you browse the tree, you’ll notice geographical locations worldwide are mostly limited to countries, and to states in the USA. That is not great depending on your objectives, so working with other dictionary sources can be preferable.

At the same time, there’s a lot more to MeSH, such as qualifiers and relatedness, which could be explored to create even richer dictionaries. We’ll get there, eventually.

Next time

Well, now that I can create these fun dictionaries, it’d be really cool to use them! Aaand that shall be the subject of my next post 😉

Happy holidays! \o/

Content mining for trend analysis

Let’s suppose you have assembled a large collection of papers (we’ll call that a corpus) as a starting point for a literature review. Some of the first questions would be of an exploratory nature: you would like to get an intuition of what’s really in there. “Is there a certain structure, possibly a hidden bias I need to take into account? What is the coverage, are there some ‘holes’ in the data set, perhaps some missing months, or should I include another keyword in the search? How do certain keyword frequencies develop over time, is there a trend appearing?” We can help with getting this initial overview, and speeding up the process to get you working on the questions that really interest you.
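To make two of those exploratory questions concrete, here is a minimal sketch of counting keyword mentions per month and spotting months with no papers at all. It is not ContentMine's tooling: the corpus is a handful of invented (month, text) pairs, and the function names keyword_counts_by_month and missing_months are hypothetical.

```python
from collections import Counter

# Hypothetical corpus: (publication month, text) pairs invented for illustration.
corpus = [
    ("2016-01", "Zika virus outbreak and mosquito control"),
    ("2016-01", "Mosquito vectors in urban areas"),
    ("2016-03", "Zika infection during pregnancy"),
    ("2016-04", "Vaccine candidates for Zika virus"),
]


def keyword_counts_by_month(papers, keyword):
    """Count papers per month whose text mentions the keyword (case-insensitive)."""
    counts = Counter()
    for month, text in papers:
        if keyword.lower() in text.lower():
            counts[month] += 1
    return counts


def missing_months(papers):
    """List the months between the first and last paper that have no papers."""
    present = sorted({month for month, _ in papers})
    first_year, first_month = map(int, present[0].split("-"))
    last_year, last_month = map(int, present[-1].split("-"))
    expected = []
    year, month = first_year, first_month
    # Walk month by month from the earliest to the latest month in the corpus.
    while (year, month) <= (last_year, last_month):
        expected.append(f"{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return [m for m in expected if m not in present]


print(sorted(keyword_counts_by_month(corpus, "zika").items()))
print(missing_months(corpus))
```

On this toy corpus the second call reports 2016-02 as a hole in the coverage, which is the kind of gap that would prompt a second look at the search.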


Job Opportunity: ContentMine Operations Manager

 

ContentMine was founded in 2016 as a UK non-profit company limited by guarantee. Our mission is to establish content mining for research and for education as a widespread philosophy and practice through:

  • creating computer programs, protocols, practices, standards and educational materials that enable content mining,
  • training researchers and others in content mining,
  • encouraging research institutions and funders of research to support establishing freedom for anyone to engage in computational analysis of books, journals, databases and other knowledge sources for the purposes of education and research.

We develop open source software for mining the scientific literature and engage directly in supporting researchers to use mining, saving valuable time and opening up new research avenues.

For more information, please visit http://contentmine.org/jobs

Position

We are seeking an Operations Manager to take overall operational responsibility for ContentMine’s development and execution of its mission, reporting to the Board of Directors and working closely with the ContentMine Founder, Dr Peter Murray-Rust. The successful candidate will develop deep knowledge of our core focus, operations, and business development opportunities and manage the transition of the organisation from a project to a sustainable non-profit with oversight of all major business areas from fundraising to communications and HR.

Salary

£40-45k pro rata, negotiable.

Time and Location

4 days per week, fixed-term contract for four months in the first instance, with renewal subject to funding. The candidate should be a UK or EU national. Remote working is possible, but candidates within easy travelling distance of Cambridge are preferred.

Responsibilities

Leadership and Management:

  • Ensure ongoing excellence in delivery of the ContentMine mission, including program evaluation, and consistent quality of finance and administration. Manage fundraising, communications, and systems; recommend timelines and resources needed to achieve the strategic goals.
  • Actively engage and energize ContentMine board members, contractors, collaborators, Fellows, volunteers and funders.
  • Ensure effective systems to track progress, evaluate program components and report to the Board and funders.

Fundraising and Communications:

  • Expand revenue generating and fundraising activities to support existing program operations and planned developments.
  • Oversee and refine all aspects of communications—from web presence to external relations, with the goal of creating a stronger brand based on a recent graphical design exercise.
  • Use external presence and relationships to garner new opportunities.

Planning and New Business:

  • Build partnerships with research-oriented organisations including groups and institutes, scholarly societies and NGOs.
  • Establish relationships with potential collaborators and philanthropic funders.
  • Write grant applications and tender for client contracts.
  • Manage relationships and work allocations with partner organisations and contractors who bring new skills and capabilities to projects.

Person Specification

The Operations Manager will be thoroughly committed to ContentMine’s mission. All candidates should have proven leadership and relationship management experience. Concrete demonstrable experience and other qualifications include:

  • At least 5 years of management experience; track record of effectively leading an outcomes-based organization.
  • Ability to point to specific examples of having developed and actioned strategies that have taken an organization to the next stage of growth.
  • Commitment to delivering quality programs and data-driven program evaluation.
  • Excellence in organisational management including developing high-performance teams, setting and achieving strategic objectives, and managing a budget.
  • Fundraising experience with the ability to engage a wide range of stakeholders, particularly in the academic, non-profit, research and publishing sectors.
  • Strong written and verbal communication skills; a persuasive and passionate communicator with excellent interpersonal and multidisciplinary project skills.
  • Action-oriented, entrepreneurial, adaptable approach to business planning.
  • Ability to work effectively in collaboration with diverse groups of people.
  • Passion, integrity, positive attitude, mission-driven and self-directed focus are all desirable.

To apply

Please submit a cover letter and CV to admin@contentmine.org by 2 Dec 2016. Interviews will be held by 9 Dec. Informal enquiries should be directed to Dr Peter Murray-Rust (peter@contentmine.org).

Barriers to content mining: fake article webpages

Two ContentMine team members featured in Nature News this week, reporting difficulties they have faced in text mining Wiley publications due to ‘trap’ URLs designed to catch people using automated downloading.

Full article: ‘Publisher under fire for fake article webpages’

Richard Smith-Unna, lead developer of ContentMine’s getpapers and quickscrape tools, encountered the issue while undertaking research that is fully supported by the UK copyright exception for text and data mining and where his University librarians had been informed. Ross Mounce, lead researcher on the ContentMine-supported PLUTo project, has encountered similar issues before and states:

“I’m worried we’re seeing the beginning of an evolutionary arms race between legacy publishers and researchers.”

The methods employed were criticised as heavy-handed and unsophisticated by several commentators, with one librarian stating that they find the behaviour concerning “because it demonstrates that supporting research is not the chief priority of these publishers.”

ContentMine continues to support the rights of researchers to build on our collective scientific knowledge through mining the academic literature and to call out barriers to that mission.