Quarterly Report

What did you make?

See below.

Continuing public impression

“The Right to Read is The Right To Mine” continues to capture people’s imagination.


Website is launched! Still actively being edited.


Richard Smith-Unna (RSU), Mark MacGillivray (MMcG) and PMR have all added large enhancements to the architecture. Getpapers (RSU) is an exciting new approach. The tools have been deployed at several workshops and become easier to use each time.

Quickscrape and getpapers

getpapers came out of the EBI workshop and allows search and retrieval to be combined into a single action. Effectively the user simply types getpapers [name of journal or repository] [search query] and the system sends the query and retrieves the papers automatically. These can then be passed down the pipeline to NORMA, AMI and CAT. We have added scrapers for arXiv, IEE and Nature. Quickscrape and getpapers are accessible on GitHub.
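The idea of combining search and retrieval into a single action can be illustrated with a minimal sketch. This is not the getpapers code: the in-memory `REPOSITORIES` table and the functions `search`, `retrieve` and `get_papers` are hypothetical names invented for illustration, and a real tool would query a remote repository instead.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory "repositories": paper id -> (title, full text).
# This is an assumption for illustration; a real run would call a remote service.
REPOSITORIES: Dict[str, Dict[str, Tuple[str, str]]] = {
    "demo-repo": {
        "p1": ("Phylogenetic trees of bacteria", "<article>tree data</article>"),
        "p2": ("Chemical names in patents", "<article>chemistry</article>"),
    }
}


def search(repository: str, query: str) -> List[str]:
    """Return ids of papers whose title contains the query (case-insensitive)."""
    papers = REPOSITORIES[repository]
    return [pid for pid, (title, _) in papers.items()
            if query.lower() in title.lower()]


def retrieve(repository: str, paper_id: str) -> str:
    """Return the full text of one paper."""
    return REPOSITORIES[repository][paper_id][1]


def get_papers(repository: str, query: str) -> Dict[str, str]:
    """Search, then retrieve every hit, in a single call."""
    return {pid: retrieve(repository, pid) for pid in search(repository, query)}


if __name__ == "__main__":
    print(get_papers("demo-repo", "phylogenetic"))
```

The returned mapping of paper id to document is the kind of output that could then be handed to a later pipeline stage.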


NORMA takes raw documents and turns them into structured, validated, conformant “scholarly HTML”, our own novel implementation of HTML5 with RDFa or other tagging. Raw types such as PDF are converted into XML. The result of NORMA is a document which is immediately ready for mining, and could also be re-used in many other ways (e.g. transforming, extraction, snippets, styling). If all scientific documents were in NORMAlized form it would be a significant global change in scholarship.

PMR has written normalizers for IEE, arXiv and Nature to add to those for the Open Access journals. We will soon be able to offer Normalized science for much of the literature. PMR is revitalizing ScholarlyHTML with the original authors.


AMI is a complex toolbox for document conversion, structuring, semantification, indexing and extraction. It’s designed for any STM document but most recent work has been on Chemistry and Phylogenetics. AMI is now wholly refactored and uses a plugin architecture (rather than the old visitor one). This is much simpler, and we believe that communities should be able to write plugins with relative ease, which is an important feature. AMI has now had a lot more development in the following areas.


We have added the functionality of mining chemical names (OSCAR) and chemical processes (ChemicalTagger). We are integrating the rest of chemistry starting today.

Phylogenetic Trees

The phylo search and analysis has been robustified and applied to 5000 papers from Int. J. Syst. Evol. Microbiol. We have added Tesseract OCR to the toolkit, with a massive increase in power to analyze diagrams.

Word frequencies

PMR has implemented the classic document classification methods (BagOfWords, TermFrequencies) and will soon add InverseDocumentFrequency. This allows people to immediately analyze visually the concepts in a paper. BoW has now been deployed at several workshops and is our immediate lead into text-mining for newcomers.
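As a rough illustration of these two methods, the sketch below counts words in a text (bag of words) and divides each count by the total number of tokens (term frequency). It is not PMR's implementation; the function names and the simple lowercase, letters-only tokenizer are assumptions made for this example.

```python
import re
from collections import Counter
from typing import Dict


def bag_of_words(text: str) -> Counter:
    """Count each word, ignoring case; tokens are runs of the letters a-z."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


def term_frequencies(text: str) -> Dict[str, float]:
    """Return each word's count divided by the total number of tokens."""
    counts = bag_of_words(text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


if __name__ == "__main__":
    sample = "The tree and the root"
    # Show the most common words first, as a newcomer might inspect them.
    for term, n in bag_of_words(sample).most_common():
        print(term, n)
```

Sorting the counts with `most_common()` gives a quick view of the dominant terms in a paper, which is the kind of first look a workshop exercise might start from.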

Regular Expressions

This is one of our most powerful approaches as it’s easy for humans to edit. The Neuroscience group in Edinburgh were able to create 200 regular expressions in half an hour. This shows the appeal and power of the method.
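A minimal sketch of why this approach is easy to edit: the patterns live in a plain table of name and expression, so adding a new one is a one-line change. The pattern names and expressions below are hypothetical examples written for this illustration, not those created by the Edinburgh group.

```python
import re
from typing import Dict, List

# Hypothetical, human-editable table of named patterns (assumptions for illustration).
PATTERNS: Dict[str, str] = {
    "animal_count": r"\b(\d+)\s+(mice|rats)\b",
    "p_value": r"\bp\s*[<=]\s*0?\.\d+",
}


def apply_patterns(text: str, patterns: Dict[str, str] = PATTERNS) -> Dict[str, List[str]]:
    """Return, for each named pattern, the list of matching substrings in text."""
    results: Dict[str, List[str]] = {}
    for name, pattern in patterns.items():
        results[name] = [m.group(0)
                         for m in re.finditer(pattern, text, flags=re.IGNORECASE)]
    return results


if __name__ == "__main__":
    sentence = "We used 12 mice and 8 rats (p < 0.05)."
    for name, hits in apply_patterns(sentence).items():
        print(name, hits)
```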

Volunteer contributions

We ran a hack at Mozsprint and were delighted that 3 participants were able to understand the architecture and craft a new application (for latexml) which is an important addition for reading arXiv.

Workshop Components

As our workshops vary greatly in length, we have developed a component-like approach where different blocks can be fitted together. We have engaged Stefan Kasberger and Chris Kittel (Graz, AT) to develop this approach.


2015-03-16 Cochrane Collaboration Oxford UK. Cochrane does systematic reviews of clinical trials and is massively influential. This is likely to be an active ongoing collaboration and could be the mainstay of CM.

2015-03-30/31 EBI March – deployed our tools and effectively designed getpapers as part of our toolkit. This is also likely to be ongoing and a mainstay of our approach. (Nick Stenning from hypothes.is joined us.)

2015-04-13/14. Wellcome Trust. Effectively the public launch of ContentMine. Very good adoption of the tools and approach by the Early Career Researchers. Very good attendance of the great-and-good (Research councils, policy makers, etc.)

2015-05-29/30 – ContentMine Gathering. Team Catchup on many aspects of the project. Welcomed Stefan Kasberger and Chris Kittel from Austria. Planned architecture, use cases, markets, engagement, animalgarden photo comics.

Scholarly publications

Jenny and Charles have a paper on responsible content mining.


See here for about 20 slide decks. Most slidesets have several hundred views.

Where did you go?

Other than hack days…

* 2015-03-19 PMR Data Innovation (a collection of government and related orgs). Well received.
* 2015-04-09 Jenny gave a presentation at WHO Geneva, hoping for engagement.
* 2015-04-20 PMR gave an invited presentation at the University of Lille.
* 2015-05-14 PMR gave invited webinar to BioCADDIE.

What have you been talking about?

* Content Mining. Most of my talks are totally or partially about TheContentMine. The major theme is promoting TheRightToReadisTheRightToMine, especially in a European context. The theme was recently covered and widely read thanks to this article in The Guardian.
* Clinical Trials and ContentMining

Who is joining your thinking?

* LIBER (European Research Library Association) – we now have the H2020 grant ACTIVATED.
* European Bioinformatics Institute, continuing hacks plans. Also talking about integrating chemistry.
* Wikimedia and Wikidata ongoing.
* Amy Price (Oxford) and others in the field including Cochrane Institute, Ben Goldacre and others.
* Wellcome Trust and other funders.
* Edinburgh Neuroscience (Systematic reviews of animal testing). Hope to explore grants with them.
* Musti! We’ve done a hack on his literature review, engaged with his colleague Pamela at UCL and are planning a hack there.

What was your biggest win over the past 3 months – and what did you learn from it?

* That our architecture, tools and workshop strategy work. People are taking to them very easily, so we are very confident of scaling in those areas. We also got an extension for Ross Mounce’s project at Bath and have made exciting progress in analyzing pixel maps of phylogenetic trees. We plan a workshop later this year.

What was your biggest loss over the past 3 months – and what did you learn from it?

* probably sleep (again)

What is the shape of your team?

Core Team

* PMR. Leader/fellow. I spend every day thinking about theCM, and most of them hacking documents or code or events. In peripheral activities (UKPMC, various chemistry projects) it makes sense to promote myself as a SF and I am very proud to do so. I am on editorial and advisory boards.

* Jenny Molloy (ca 1 day/week, “Manager”) who is just finishing her thesis and intends to stay in Cambridge for 3-5 years. Universally regarded as absolutely wonderful.

* Ross Mounce (RM). Zero cost. Full time PostDoctoral Researcher at Bath Univ on PLUTo project, extracting data from bio-science literature using PMR and TheCM tools. Highly visible Open advocate and revolutionary thinker in the phylogenetic community with frequent meetings. Ross’s output is funded elsewhere but, being Open, is all accessible to TheCM. Ross visits PMR for a day about every 2 weeks.

* Richard Smith-Unna. Richard has been paid as a contractor but will also be highly valuable as a core member and I shall invite him next time we talk. He is technically very competent both at computation and as a scientist. He is also very pragmatic.

* Steph Unna. Joined to look after the documentation and website (a Herculean task already).

* Graham Steel. A tireless advocate for Open Science who is super excited to be part of ContentMine and throwing energy into it. Graham will manage engagement with collaborators, recording of events, tweeting, etc.

* Stefan Kasberger. Web design and workshop components.

* Chris Kittel. Web design and workshop components.


CottageLabs (CL), contractor. A cooperative of developers in the higher-ed informatics/library space. PMR has worked with MMcG and RJ for several years and been on joint JISC projects (e.g. Open Bibliography). Very highly regarded in JISC/HEFCE/Open Publishing (e.g. PLOS).
* Mark MacGillivray (MMcG). Final year part-time PhD at Edinburgh Informatics.
* Richard Jones (RJ). Ex Symplectics.
* Emanuel Toliv.
* CL has provided a server, built the website and designed, and implemented, the overall workflow architecture. They work closely with RSU on automating a deliverable service/process.


Advisory Board up and functioning in a positive regular manner:
* Professor Charles Oppenheim
* Laura James
* Heather Joseph (SPARC)
* John Wilbanks (SAGE)
* Joe McArthur (OA Button)
* Puneet Kishoor


We have sub-projects which we are growing…
* Clinical Trials. Run by Amy Price and Graham Steel. Amy is a mature student between Oxford and Florida. She is enthusiastic and well known in the Trials field but lacks technical knowledge in informatics – so a great example of our core market.
* Crystallography. The Crystallography Open Database is a radical movement to liberate crystallographic data and is well set up technically. We hope to do a daily trawl of crystal structures, including behind pay-walls. Hack day in 2 weeks’ time.
* Neuroscience/animal testing at Edinburgh.

What are the metrics you are tracking yourself against – and where are you?

We set ourselves real-world events (on fixed days) where we have to deliver:
* workshops. This has worked very well. We have managed (with a lot of effort) to run workshops which people have liked and which gradually become commoditised. We have a wide range of components and can address different communities.
* talks. Talks mainly from PMR, but also RSU and CO. PMR’s talks are very well received and have generated further engagement.
* demonstrations. Talks and workshops have live demonstrations and this is a very challenging way of setting deadlines and metrics. The team are now used to this and we meet deadlines.
* service. We have sort of soft-launched. We need to integrate the ingestion on a daily basis. This continues to “slip” (it’s always “next week”), but I think the Clinical Trials imperative will drive it.
* invites – about 2 for the next 3 months, which will do fine!

