FAQs (from potential users)

Q. This software sounds really interesting and we are potentially interested in using ContentMine to help with literature reviews.

A. We’ve already got two groups who are using CM for systematic reviews: clinical trials (Amy Price in Oxford) and animal tests (Malcolm Macleod, Edinburgh Neuroscience). These work to guidelines (CONSORT and ARRIVE) and have to review a large number of papers (tens of thousands), filtering them and then extracting data.


Q. Could you tell me a little bit more about how ContentMine works?

A. ContentMine can be used in many ways, including:

  • to download large numbers of files from a search
  • to normalize published HTML/PDF/XML into a form useful for mining
  • to download and (using `norma`) normalize supplementary/supporting information/data
  • to filter papers that meet a user-defined criterion. This is usually much more powerful than the filtering provided online
  • to retrieve a batch of papers for which you have the URLs/DOIs
  • to mine the results using built-in (`ami`) plugins
  • to develop and use your own plugins for `ami`
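
As an illustration of filtering by a user-defined criterion, here is a minimal Python sketch. It does not call ContentMine itself: the in-memory `corpus`, the `filter_papers` function and the `mentions_all` helper are all hypothetical names invented for this example, and in practice the text would come from the normalized files.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory corpus: paper identifier -> plain text.
# (Assumption for this sketch; real text would come from normalized full text.)
Corpus = Dict[str, str]


def filter_papers(corpus: Corpus, criterion: Callable[[str], bool]) -> List[str]:
    """Return the identifiers of papers whose text satisfies the criterion."""
    return [paper_id for paper_id, text in corpus.items() if criterion(text)]


def mentions_all(*terms: str) -> Callable[[str], bool]:
    """Build a criterion that requires every term to occur (case-insensitive)."""
    lowered = [term.lower() for term in terms]
    return lambda text: all(term in text.lower() for term in lowered)


corpus = {
    "paper-1": "A randomized controlled trial of aspirin in 120 patients.",
    "paper-2": "A review of zebrafish husbandry.",
    "paper-3": "Randomized allocation of mice to aspirin or placebo.",
}

selected = filter_papers(corpus, mentions_all("randomized", "aspirin"))
print(selected)  # ['paper-1', 'paper-3']
```

Because the criterion is just a function, it can be as simple or as elaborate as the review protocol requires.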


Q. Does CM require running a literature search first (e.g. in Pubmed)?

A. It’s one of the common entry points (`getpapers`) but it’s not required. We also trawl the daily literature and mine it.


Q. What kind of information/databases can ContentMine mine?

A. Metadata, fulltext (in a variety of formats), images, diagrams and supplemental data.


Q. Is it possible to pull out specific information from papers using ContentMine?

A. Yes. This is Information Extraction (IE). The commonest approach is to identify words and phrases in text, but we can also extract phrases (using Templates). There are pre-built tools for Species, Sequences, Identifiers, Genes, etc., and you can build your own Regular Expressions. In certain domains (chemistry, phylogenetics, plots, and certain images) we can extract complex scientific objects.
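
To give a feel for regular-expression extraction, here is a small Python sketch. The two patterns are assumptions written for this example and are not the ones shipped with `ami`: one matches ClinicalTrials.gov-style identifiers ("NCT" followed by eight digits), the other matches a count followed by "patients" or "participants".

```python
import re
from typing import List

# Illustrative patterns only (assumptions for this sketch, not ami's own).
# Registry identifiers of the form "NCT" + eight digits.
TRIAL_ID = re.compile(r"\bNCT\d{8}\b")
# A count followed by a word for the people studied, e.g. "1,250 patients".
PATIENT_COUNT = re.compile(
    r"\b(\d{1,3}(?:,\d{3})*|\d+)\s+(?:patients|participants)\b"
)


def extract_trial_ids(text: str) -> List[str]:
    """Return every trial identifier found in the text."""
    return TRIAL_ID.findall(text)


def extract_patient_counts(text: str) -> List[int]:
    """Return reported counts as integers, dropping thousands separators."""
    return [int(match.replace(",", "")) for match in PATIENT_COUNT.findall(text)]


text = (
    "We enrolled 1,250 patients (NCT01234567); "
    "a further 48 participants were recruited under NCT07654321."
)
print(extract_trial_ids(text))       # ['NCT01234567', 'NCT07654321']
print(extract_patient_counts(text))  # [1250, 48]
```

Real papers phrase things in many more ways than this, so patterns like these usually need tuning against a sample of the literature being reviewed.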


Q. In what format is the mined information then presented?

A. Any modern format. Primarily XML, but also JSON, and we’ve been asked for YAML.
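
As a rough illustration of how the same results can be written as either XML or JSON, here is a Python sketch using only the standard library. The record fields (`paper`, `match`, `type`) and the element names are assumptions made for this example; they do not reproduce ContentMine's actual output schema.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical result records; field names are assumptions for this sketch.
results: List[Dict[str, str]] = [
    {"paper": "paper-1", "match": "NCT01234567", "type": "trial-id"},
    {"paper": "paper-1", "match": "1,250 patients", "type": "patient-count"},
]


def to_json(records: List[Dict[str, str]]) -> str:
    """Serialize the records as a JSON array of objects."""
    return json.dumps(records, indent=2)


def to_xml(records: List[Dict[str, str]]) -> str:
    """Serialize the records as <result> elements carrying attributes."""
    root = ET.Element("results")
    for record in records:
        ET.SubElement(root, "result", dict(record))
    return ET.tostring(root, encoding="unicode")


print(to_json(results))
print(to_xml(results))
```

Both forms carry the same information, so the choice mostly depends on what the downstream tools expect.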

There’s a wide range. The main approaches include:

  •  indexing and filtering
  •  analysis of text, e.g. word frequency and classification
  •  information extraction (e.g. numbers of patients)
  •  extraction from diagrams (new and unique)
  •  templated extraction

and special domains

  •  chemistry
  •  phylogenetic trees
  •  species
  •  sequences
  •  electroneurophysiology
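
The word-frequency approach mentioned above can be sketched in a few lines of Python. This is a simplified stand-in rather than ContentMine's own implementation: the `word_frequencies` function and the very short stop-word list are assumptions chosen for this example.

```python
import re
from collections import Counter
from typing import List, Tuple

# A tiny stop-word list, chosen for this sketch only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "was", "were", "with"}


def word_frequencies(text: str, top_n: int = 5) -> List[Tuple[str, int]]:
    """Count lower-cased alphabetic words, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top_n)


text = (
    "Mice were treated with aspirin. Aspirin reduced inflammation in the mice, "
    "and inflammation markers fell in treated mice."
)
for word, count in word_frequencies(text, top_n=4):
    print(word, count)
```

Frequency tables like this are often a first step towards classification, since the most frequent content words give a quick indication of what a paper is about.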


Q. Can you provide seminars/workshops on how to use ContentMine?

A. Yes. We’ve done this for:

  • scientific projects we are collaborating with
  • libraries who want to know about mining
  • funders
  • people who want to run workshops themselves


Q. Do you charge for these workshops, and if yes, how much?

A. We need to recover costs and this depends a bit on how much work we have to do for a specific workshop.


Q. Is there a charge for using the ContentMine software?

A. No – it’s all completely Open Source (Apache2).

