Go ContentMine!

ContentMine depends on a lot of software, much of which we have had to write from scratch. You can learn about the software we’ve written here.

All our software is open source under liberal licenses (MIT or Apache 2.0), so you’re free to download it, inspect the source, modify it and contribute. Unless otherwise stated, the Shuttleworth Foundation owns the copyright to all our software.

Do ContentMining now!

IUCN redlist

Get a daily output of endangered species (experimental).


Use the ContentMining tools on the web and extract facts yourself (experimental).

Virtual Machine

Use our virtual machine locally on your PC.

Crawling


  • node-journalTOCs. A very simple Node.js client for the JournalTOCs API that exposes methods for searching journals by keyword or ISSN and retrieving articles by journal ISSN.

Crawling software is maintained by Richard Smith-Unna.

Scraping


  • What are journal scrapers?
  • scraperJSON. A JSON schema for defining web scrapers in a declarative way.
  • journal-scrapers. A collection of community-maintained scraperJSON scrapers for academic publishers and journals.
  • thresher. A Node.js scraperJSON scraping library.
  • quickscrape. A command-line interface for thresher.
  • ContentMine App. A standalone, cross-platform app that lets you use the ContentMine scraping technology to scrape academic journals.

Scraping software is maintained by Richard Smith-Unna.
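
To give a feel for what defining a scraper "in a declarative way" means, here is a minimal sketch in Python. It is an illustration only: the definition layout (`elements`, `tag`, `match`, `attribute`), the `DeclarativeScraper` class and the `scrape` function are hypothetical names invented for this example, and they are not the actual scraperJSON schema or the thresher API. The idea shown is that the scraper is plain data describing what to capture, while one generic engine does the extraction.

```python
import json
from html.parser import HTMLParser

# Hypothetical declarative definition (NOT the real scraperJSON schema):
# each named element says which tag to look at, which attributes must
# match, and which attribute holds the value to capture.
DEFINITION = json.loads("""
{
  "elements": {
    "title": {"tag": "meta", "match": {"name": "citation_title"}, "attribute": "content"},
    "doi":   {"tag": "meta", "match": {"name": "citation_doi"},   "attribute": "content"}
  }
}
""")


class DeclarativeScraper(HTMLParser):
    """Generic engine: collects values for every element in a definition."""

    def __init__(self, definition):
        super().__init__()
        self.elements = definition["elements"]
        self.results = {name: [] for name in self.elements}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for name, rule in self.elements.items():
            if tag != rule["tag"]:
                continue
            # Every key/value pair in "match" must be present on the tag.
            if all(attrs.get(key) == value for key, value in rule["match"].items()):
                captured = attrs.get(rule["attribute"])
                if captured is not None:
                    self.results[name].append(captured)


def scrape(html, definition=DEFINITION):
    """Apply a declarative definition to an HTML string."""
    scraper = DeclarativeScraper(definition)
    scraper.feed(html)
    scraper.close()
    return scraper.results


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<meta name="citation_title" content="An Example Article">'
        '<meta name="citation_doi" content="10.0000/example">'
        '</head><body></body></html>'
    )
    print(scrape(page))
```

Because the definition is data rather than code, supporting another journal in this sketch means writing a new definition, not a new program. That is the general appeal of a community-maintained collection of scraper definitions.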

Format conversion

  • Jumbo-converters. Takes raw structured text files (normally program output) and converts them to structured standardised formats.
  • PDF2SVG.
  • SVG2XML.
  • XHTML2STM. Converts XHTML to scientific/technical/medical semantic structured documents.
  • SVG builder. Takes SVG primitives and turns them into higher level graphics primitives (e.g. building rectangles or circles from paths).

Format conversion software is maintained by Peter Murray-Rust.

Image and data mining/fact extraction

  • AMI-OCR. Optical character recognition for scientific texts.
  • ImageAnalysis. Takes raw raster images and builds a hierarchy of pixel-based objects.
  • DiagramAnalyser. Takes the output of ImageAnalysis and turns it into diagram primitives like lines, segments and characters (via AMI-OCR).

Fact extraction software is maintained by Peter Murray-Rust.


We’re happy you’re thinking about contributing to ContentMine! All our code and training materials are organized in our GitHub repositories.

There are many ways to contribute:

  • by reporting an issue with the software or training materials. You can find more detailed information in the related GitHub repositories.
  • by suggesting new features
  • by writing a new journal scraper
  • by writing code and documentation
  • by closing issues
  • by writing about the software

Submit Great Issues

  • Before submitting a new issue, check to make sure a similar issue isn’t already open. If one is, contribute to that issue thread with your feedback.
  • When submitting a bug report, please provide as much detail as possible, e.g. a screenshot or gist that demonstrates the problem, the technology you are using, and any relevant links.