ContentMine Update and FORCE2015: we read and index the daily scholarly literature

We’ve been very busy and I haven’t blogged as much as I’d have liked. Here’s an update and news about immediate events.

Firstly, a welcome to Graham Steel (McDawg), who is joining us as community manager. Graham is a major figure in the UK and worldwide in fighting for Open. We’ve known each other for several years. Graham is a tireless, fearless fighter for access to scholarly information. He’s one of the #scholarlypoor (i.e. not employed by a rich university), so he doesn’t have access to the literature. Nonetheless he fights for justice and access.

Here’s a blog post from 4 years ago where I introduced him and McDawg. He’ll be with us this weekend at FORCE2015; more on that later.

We have made large advances in the ContentMine technology. I’m really happy with the architecture which Cottagelabs, Richard Smith-Unna and I have been hacking. Essentially we automate the process of reading the daily scientific literature – this is between 1000 and 4000 articles depending on what you count. Each is perhaps 5-20 pages, many with figures. Our tools (quickscrape, Norma, and AMI) carry out the process of

  • scraping (quickscrape): downloading all the components of a paper (XML, HTML, PDF, CSV, DOC, TXT, etc.)
  • normalising and tagging the papers (Norma). We convert PDF and XML to HTML5, which is essentially Scholarly HTML. We extract the figures and interpret them where possible. We also identify the sections and tag them, so – for example – we can look at just the Materials and Methods section, or just the licence.
  • indexing and transformation (AMI). AMI now has several well-tested plugins: chemistry, species, sequences, phylogenetic trees, and, more generally, regular expressions designed for community creation.
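The tagging stage can be sketched in a few lines. This is a hypothetical illustration of the idea behind a regex-based AMI plugin, not the actual AMI code: binomial species names follow a "Genus species" pattern, so a naive regex already surfaces candidates (the real tools also use curated dictionaries, precisely because a bare regex over-matches).

```python
import re

# Naive "Genus species" pattern: a capitalised word followed by a
# lowercase word. Deliberately simple; it will produce false positives.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{2,})\b")

def tag_species(text):
    """Return candidate binomial species names found in a text section."""
    return [" ".join(match) for match in BINOMIAL.findall(text)]

methods = "Samples of Mus musculus and Drosophila melanogaster were sequenced."
print(tag_species(methods))
# Note: "Samples of" also matches - hence the dictionaries in real plugins.
```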

Mark MacGillivray and colleagues have created a lovely faceted search index so it’s possible to ask scientific questions with a facility and precision that we think is completely novel.
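A minimal sketch of what faceted search over tagged papers looks like, with invented article identifiers and facet values (this is an illustration of the concept, not Mark MacGillivray's actual index): each facet maps a value to the set of articles carrying that tag, and a query is a set intersection across facets.

```python
from collections import defaultdict

# Hypothetical tagged articles, as the normalise/tag stage might emit them.
articles = [
    {"id": "pmc1", "species": {"Mus musculus"}, "section": {"methods"}},
    {"id": "pmc2", "species": {"Mus musculus", "Homo sapiens"}, "section": {"results"}},
    {"id": "pmc3", "species": {"Homo sapiens"}, "section": {"methods"}},
]

# Build the facet index: facet -> value -> set of article ids.
index = defaultdict(lambda: defaultdict(set))
for art in articles:
    for facet in ("species", "section"):
        for value in art[facet]:
            index[facet][value].add(art["id"])

# Faceted query: articles mentioning Mus musculus in a Methods section.
hits = index["species"]["Mus musculus"] & index["section"]["methods"]
print(sorted(hits))  # ['pmc1']
```

Intersection over pre-built facet sets is what makes such queries precise: each facet narrows the candidate set independently.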

We’re doing a workshop on this at FORCE2015 next Sunday (Jan 11) for 3 hours, with hacking thereafter. The software is now easily used in, or distributed as, virtual machines. Everything is Open, so there is no control by third parties. The workshop will start by searching for species, and then move on to building custom searches and browsing. For those who can’t be there, Graham/McDawg is hoping to create a livestream – but no promises.

I’ve spent a wonderful 3 days in Berlin with fellow Shuttleworth fellow Johnny West. Johnny’s OpenOil project is about creating Open information about the extractive industries. It turns out that the technology we are using in ContentMine is extremely useful for understanding corporate reports. So I’ve been hacking corporate structure diagrams, which are extremely similar to metabolic networks or software flowcharts.

More later, as we have to wrap up the hack….


Published by


Scotland's (main, but not only) #OpenScience #OpenAccess #OpenData #OpenSource #OpenKnowledge & #PatientAdvocate. Loves blogging. Glasgow, Scotland.
