Content Mining will be legal in UK; I inform Cambridge Library and the world of my plans

Early last week the UK House of Lords passed the final stages of a Statutory Instrument with exceptions to copyright. For me the most important was that those with legitimate access to electronic content can now use mining technology to extract data without permission from the owners. The actual legislation took less than a minute, but the process has been desperately fought by the traditional publishers who have attempted to require subscribers to get permission from them.


That means that I, who have legitimate access to the content of Cambridge University Library and their electronic subscriptions, can now use machines to read any or all of this without breaking copyright law. Moreover the publishers cannot override this with additional restrictive clauses in their contracts.

The new law restricts the use to “non-commercial” but this will not affect what I intend to do. To avoid any confusion I am publicly setting out my intentions; because I shall be using subscription content I am advising Cambridge University Library. I am not asking anyone’s permission because I don’t have to.

Yesterday I wrote to Yvonne Nobis, Head of Science Information in CUL.

I am informing you of my content mining research using subscription content in CUL. Please forward this to anyone else in CUL who may need to know. Also if there is any time this week I would be very happy to meet (or failing that Skype) – even for a short time.
As you know the UK government has passed a Statutory Instrument based on the Hargreaves review of copyright exempting certain activities from copyright, especially “data analytics” which covers content mining for facts. This comes into force on 2014-06-01.
I intend to use this to start non-commercial research and to publish the results under an OpenNotebookScience philosophy (i.e. publicly and immediately on the web as the work is done, not retrospectively). This involves both personal research in several scientific fields and also collaborations in 3-4 funded projects:
  •  PLUTo (BBSRC, Univ Bath) – Ross Mounce
  • Metabolism mining (Andy Howlett, Unilever funded PhD and also with Christoph Steinbeck EBI, Hinxton, UK)
  • Chemical mining (TSB grant) Mark Williamson.
We are also collaborators in the final application stage for an NSF grant collaboration for chemical biodiversity in Lamiaceae (mints, etc.). This is very exciting and mining may throw light on chemicals as signals of climate change.
I intend to mine responsibly and within UK law. I expect to mine about 1000-2000 papers per day – many will be subscription-based through CUL. I have access to these as I have an Emeritus position but as I am not paid by CU then this cannot be construed as commercial activity. Typically my software will ingest a paper, mine it for facts, and discard the paper – the process takes a few seconds.
As a responsible scientist I am required by scientific ethics and reproducibility/verifiability to make my results Open and this includes the following Facts:
  • bibliographic metadata of the article (but not the abstract)
  • citations (bibliographic references) within the article
  • factual lists of tables, figures and supplemental data.
  • sources of funding (to evaluate the motivations of researchers)
  • licences
  • scientific facts (below)
I shall not reproduce the whole content but shall reproduce necessary textual metadata without which the facts cannot be verified. These include:
  • figure and table captions (i.e. metadata)
  • experimental methodology (e.g. procedures carried out)
I shall not reproduce tables and figures. However my software is capable, for many papers, of interpreting tables and diagrams and extracting Factual information (e.g. in CSV files). [My output will be more flexible and re-usable than traditional pixel-based graphs.]
I expect to extract and interpret the following types of Facts:
  • biological species
  • place names and geo-locations (e.g. lat/long)
  • protein and nucleic acid sequences
  • chemical names and structure diagrams
  • phylogenetic (e.g. evolutionary) trees
  • scatterplots, bar graphs, pie charts, etc.
 and several others as the technology progresses.
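As a rough illustration of the ingest, mine and discard cycle described above, here is a minimal sketch in Python. It is not the ContentMine software: the genus lookup list, the two regular expressions and the function names are all hypothetical, and real extraction of species or coordinates needs far richer rules. It only shows the shape of the process: text goes in, a short list of facts comes out (here as CSV), and the text itself is not retained.

```python
import csv
import io
import re

# Hypothetical lookup list, for illustration only.
KNOWN_GENERA = {"Mentha", "Salvia", "Lavandula"}

# Assumed, deliberately simple patterns: a capitalised word followed by a
# lowercase word, and decimal-degree coordinates such as "52.2° N, 0.12° E".
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")
LAT_LONG = re.compile(
    r"(\d{1,2}(?:\.\d+)?)°\s*([NS]),?\s*(\d{1,3}(?:\.\d+)?)°\s*([EW])"
)


def mine_facts(text):
    """Return (fact_type, value) pairs found in text; the text is not kept."""
    facts = []
    for genus, epithet in BINOMIAL.findall(text):
        if genus in KNOWN_GENERA:
            facts.append(("species", f"{genus} {epithet}"))
    for lat, ns, lon, ew in LAT_LONG.findall(text):
        lat_val = float(lat) * (1 if ns == "N" else -1)
        lon_val = float(lon) * (1 if ew == "E" else -1)
        facts.append(("geo", f"{lat_val},{lon_val}"))
    return facts


def facts_to_csv(facts):
    """Serialise the extracted facts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["type", "value"])
    writer.writerows(facts)
    return buf.getvalue()


sample = "Mentha spicata was collected at 52.2° N, 0.12° E near Cambridge."
facts = mine_facts(sample)
csv_text = facts_to_csv(facts)
```

Only the facts and the CSV text survive the run; the sample sentence plays the role of the paper that is read and then discarded.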
The load on publishers’ servers is negligible (this has been analysed by Cameron Neylon of PLoS).
I stress that the output is qualitatively no different from centuries of extraction from the literature – it is the automation of the procedure. Facts are not copyrightable and nor will my output be.
I shall publish the results on my personal open web pages and repositories such as Github, and offer them to EuropePMC for incorporation if they wish. Everything I publish will be licensed under CC 0 (effectively public domain). I would also like to explore exposing the results through the CUL. I have already pioneered dspace@cam for large volumes of facts, but found that the search and indexing wasn’t appropriate at the time. If you have suggestions as to how the UL might help it could be a valuable example for other scholars.
I am not expecting any push-back or take-downs from publishers as this activity is now wholly legal.  The Statutory Instrument overrides any restrictive clauses from suppliers, including robots.txt. I therefore do not need or intend to ask anyone for permission. This will be a very public process – I have nothing to hide. However I wish to behave responsibly, the most likely problem being load on publishers’ servers. Richard S-U (Plant Sciences, Cambridge, copied) and I are developing crawling and scraping protocols which are publisher-friendly (e.g. delays and retries) – we  have also discussed this with PLoS (Cameron).
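The publisher-friendly crawling mentioned above (delays and retries) might look something like the following minimal Python sketch. It is an assumption about the general technique, not the protocol being developed with Richard: the function names, the 5-second default delay and the doubling back-off are all hypothetical choices made for illustration.

```python
import time
import urllib.request


def fetch_url(url, timeout=30):
    """Download one URL using only the standard library."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def polite_fetch(urls, delay=5.0, max_retries=3, fetch=fetch_url, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between requests.

    On a network error the request is retried after a progressively longer
    wait (delay * 2, delay * 4, ...); after max_retries failures the URL is
    recorded as None and the crawl moves on.
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries):
            try:
                results[url] = fetch(url)
                break
            except OSError:
                # urllib's URLError and HTTPError are subclasses of OSError.
                sleep(delay * 2 ** (attempt + 1))
        else:
            results[url] = None  # gave up on this URL
        sleep(delay)  # fixed pause before the next request
    return results
```

The `fetch` and `sleep` parameters exist so the behaviour can be checked without touching a real server; in normal use they would be left at their defaults.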
In the unlikely event of any problems from publishers I expect that CUL, as licensee/renter of content, would be the first point of contact. I will be happy to be available if CUL needs me. If publishers contact me directly I shall immediately refer them to CUL as CUL is the licensee.
I have written this in the first person (“I”) since the legislation emphasises personal use and because organised consortia may be seen as “commercial”. The law is for the UK. Fortunately the mining is wholly compatible:
  • I am a UK citizen from Birth
  • I live in the UK
  • I have a pension from the UK government (non-commercial activity)
  • My affiliation is with a UK university
  • The projects I outline are funded by UK organisations.
  • My collaborators are all UK.

I play a public domain version of “Rule Britannia!” incessantly and have a  Union Jack teddy bear. I shall however, vote for Britain to continue as a member of the EU and also urge my representatives (MEPs) to continue to press for similar legislation in  Europe. I personally thank Julian Huppert and David Willetts for their energy and consistency in pushing for this reform, which highlights the potential value of parliaments in a democracy.

I also thank my collaborators in the ContentMine, where I shall be demonstrating and discussing our technology, which is the best that I know of outside companies like G**gle. As an academic I welcome offers of collaboration, but stress that we cannot run a mining service for you (though we can show you how to run our toolkit). If the projects are interesting enough to excite me as a scientist I may be very happy to work with you as a co-investigator, though I cannot be paid for mining services.

Sadly, very few publishers come out of this with anything positive. Naturally the Open Access publishers (PLOS, BMC, eLife, MDPI, PeerJ, Ubiquity and others) have no problems as they can be and want to be mined. We have already had long discussions with them. The Royal Society (sic, not the RSC) has positively said that their content can be mined. All the rest, and especially the larger ones, have actively lobbied and FUDded to stop content mining. When you know that organisations are spending millions of dollars to stop you doing science it can be depressing, but we’ve had the faith to continue. I’m particularly proud of Jenny Molloy, Ross Mounce and others for their public energy in maintaining

“The Right To Read is the Right To Mine”

Now that the political battle (which has taken up 5 years of my life) is largely over, I’m devoting my energies to getting the ContentMine established as a universal resource and building a new generation of intelligent scientific software.

And you can be an equal part of it, if you wish.


TheContentMine: Progress and our Philosophy

TheContentMine is a project to extract all facts from the scientific literature. It has now been going for about 6 weeks – this is a soft-launch. We continue to develop it and record our progress publicly. It’s a community project and we are starting to get offers of help right now.  We welcome these but we shan’t be able to get everything going immediately.

We want people to know what they are committing to and what they can expect in return. So yesterday I drafted an initial Philosophy – we welcome comments.

Our philosophy is to create an Open resource for everyone created by everyone. Ownership and control of knowledge by unaccountable organisations is a major current threat; our strategy is to liberate and protect content.

The Content Mine is a community and we want you to know that your contribution will remain Open. We will build safeguards into The Content Mine to protect against acquisition.

We are a meritocracy. We are inspired by Open communities such as the Open Knowledge Foundation, Mozilla, Wikipedia and OpenStreetMap, all of whom have huge communities who have developed a trustable governance model.

We are going ahead on several fronts – “breadth-first”, although some areas have considerable depth. Just like Wikipedia or OSM you’ll come across stubs and broken links – it’s the sign of an Open growing organisation.

There’s so much to do, so we are meeting today to draft maps, guidelines, architecture. We’re gathering the community tools – wikis, mail lists, blogs, Github, etc. As the community grows we can scale in several directions:

  • primary source. Contributors can choose particular journals or institutions/theses to mine from.
  • subject/discipline. You may be interested in Chemistry or Phylogenetic Trees, Sequences or Species.
  • technology. Concentrate on OCR, Natural Language Processing, Crawling, Syntax or develop your own extraction techniques
  • advocacy and publicity. A major aim is to influence scientists and policy makers to make content Open
  • community – its growth and practice.

We are developing a number of subprojects which will demonstrate our technology and how the site will work. Hope to report more tomorrow.

Shuttleworth Fellowship: Month 2; synergy with the Digital Enlightenment can change the world

I’m now finishing the second month of my Shuttleworth Fellowship – the most important thing in my whole career. My project The Content Mine aims to liberate all the facts in the scientific literature.

That’s incredibly ambitious and I don’t know in detail how it’s going to happen – but I am confident it will.

This week we posted our website – and showed how we create content. What’s modern is that this is a community website – we’re inspired by Wikipedia and OpenStreetmap where volunteers can find their own area of interest and contribute. Since there is no other Open resource for content-mining we shall provide that – we have 100 pages and intend to go beyond 1000. Obviously you can help with that. And of course Wikipedia’s information is invaluable.

We have an incredible team:

  • Michelle Brook. Michelle is Manager and is making a massive impression with her work on Open Access.
  • Jenny Molloy. Jenny has co-authored the foundations of Open Content Mining and ran the first workshop last year.
  • Ross Mounce. Ross has championed Open Content Mining in Brussels and is developing software for mining phylogenetics.
  • Mark MacGillivray. Co-authored Open Bibliography and founded CottageLabs who are supporting our web presence and IT infrastructure.
  • Richard Smith-Unna. Founder of the volunteer scientist-developer community to which he is pitching ContentMine to support Crawling.

But we also have masses of informal links and collaborations. Because we are Open, people want to find out what we are doing and offer help. It’s possible that much of our requirements for crawling may be provided by the community – and that has been happening over the last week. We’ve had an important contribution to our approach to Optical Character Recognition. Today I was skyped with suggestions about Chemistry in the ContentMine.

This all happens because of the Digital Enlightenment. People round the world are seeing the possibilities of zero-cost software, efficient voluntary Open communities and the value of liberated Knowledge. There are many projects wanting to liberate bibliography, reform authoring, re-use bioscience, etc. Occasionally we wake up and think “wow! problem solved!”. If you think “we”, not “me”, the world changes.

The Fellows and Foundation are fantastic. I have an hour Skype every week with Karien, and another hour with the whole Fellowship. These are incredibly valuable.  With such a huge ambition we need focus.

There’s huge synergy with several formal and many informal projects. Once you decide that your software and output is Open, you can move several times faster. No tedious agreements to sign. No worries about secrecy, so no delays in making knowledge open. Of the formal projects:

  • Andy Howlett is doing the 3rd year of his PhD in the Unilever Centre here on metabolism. He can use the 10 years’ worth of Open Source we have developed and because his contributions are also Open we’ll benefit in return.
  • Mark Williamson is  using our software in similar fashion.
  • Ross Mounce and Matt Wills at Bath are running the PLUTo project. Because it’s completely Open they can use our software and we can re-use their results.
  • we are starting work with Chris Steinbeck at EBI on automated extraction of metabolites and phytochemistry from the literature.

Informally we are working with Volker Sorge (Birmingham) and Noureddin Sadawi (Brunel) on scientific computer vision and re-use of information for Blind and Visually Impaired people. With Egon Willighagen and John May on the (Open) Chemistry Development Kit. With the Crystallography Open Database…

How can it possibly work?

In the same way that Steve Coast “single-handedly” and with zero-cash built up OpenStreetmap.

  • promoting the concept. We are already well known in the community and people are watching and starting to participate.
  • by building horizontal scalability.  By dividing the problem into separate journals, we can build per-journal solutions. By identifying independent disciplines (chemistry, species, phylogenetics…) we can develop independently.
  • an Open modular software and information architecture. We build libraries and tools, not applications. So it’s easy to reconfigure. If people want a commandline approach we can offer that.
  • By re-using what’s already Open. We need a chemical database? don’t build it ourselves – work with EBI and Pubchem. An Open bibliography? work with Europe PubMedCentral.
  • by attracting and honouring volunteers. RichardSU has discovered the key point is to offer evening-sized problems. Developers don’t want to tackle a complex infrastructure – they want something where the task is clear and they can complete before they go to bed. And we have to make sure that they are promoted as first-class citizens.

Much of what we do will depend on what happens every week. A month ago I hadn’t planned for; or Longan Java OCR; or Peer Library; or JournalToCs; or BoofCV; or …

… YOU!

PS: You might wonder what a 72-year-old is doing running a complex knowledge project. RichardSU asked that on hacker-news and I’m pleased that others value my response. If Neelie Kroes can change the world at 72, so can I – and so can YOU.

If you are retired you’re exactly the sort of person who can make massive contributions to the Content Mine. And it’s fun.

Jenny Molloy Awarded an AMI

Jenny Molloy is a central figure in the Open community and has been particularly active in campaigning for Content Mining. We are delighted that she is part of our core team on the ContentMine project. AMI the kangaroo is the mascot of our content-mining software and when wonderful people do wonderful things they are awarded an AMI:


Previous AMI awardees are:

  • Ross Mounce
  • Michelle Brook
  • Helen Turvey (Shuttleworth)
  • Karien Bezuidenhout (Shuttleworth – presentation next month)

Jenny runs the Open Knowledge Science Working Group and has co-authored our principles and practice of Open Content Mining.  She advocates Open Data (slides).

The Content Mine website – how we create it. And the community can edit and contribute.

We are now about 6 weeks into The Content Mine project and have released our website. In the spirit of living a web-friendly life this is a living object which is planned to be:

  • easy to update and maintain
  • re-usable
  • communal and collaborative.
  • scalable


© Raimond Spekking / CC BY-SA-3.0 (via Wikimedia Commons)

To do that we have taken a novel approach to creating the site. We want the material to be easy to edit and create, with potentially lots of contributors. That’s not always easy if you have to have login access to the website.

The best software is often on collaborative FLOSS software sites. That’s because it’s had hundreds of years of knowledgeable users and developers. So I turned to Github and its wiki. A wiki is an excellent tool to develop one’s thoughts as the structure evolves as our insight develops. So I started off with a list of the most important things that I thought we would need and put them on the first page of the Wiki, which looks/ed like:



This is how you see it after an initial edit. It’s very functional, with lots of editing icons, etc. The blue phrases are links to other pages or external pages. I created about 100 pages on Sunday – some are stubs but most have text and links to other pages. And the value is that we are building up a structured resource. It’s a set of pages that can be re-used for tutorials, reference and, we hope, additions by volunteers.

However to make it more like a normal web page Mark MacGillivray and his Cottage Labs colleagues have created software for transferring Github content to a standard website. It can be automated so that, for example, we can update the website from the wiki every midnight. Here’s the same page:



(The picture is RNA from some of Ross Mounce’s Openly extracted phytotaxa scraping.) Mark’s done a great job in almost no time. That’s partly because CL are very smart and partly because CL build re-usable code. And it’s easy to change the look-and-feel.

Most people hate keeping websites up to date, but I like wikis. So I’ll be adding more pages which will help to explain content mining, and create a re-usable resource.

UK Copyright reforms set to become Law: Content-mining, parody and much more

I have been so busy over the last few days and the world has changed so much that I haven’t managed to blog one of the most significant pieces of news – the UK government has tabled its final draft on the review of copyright.

This is fantastic. It is set to reform scientific knowledge. It means that scientific Facts can be extracted and published without explicit permission. The new law will give us that. I’m going to comment in detail on the content-mining legislation, but a few important general comments:

  • UK is among the world leaders here. I understand Ireland is following, and the EU process will certainly be informed by UK. Let’s make it work so well and so valuably that it will transform the whole world.
  • This draft still has to be ratified before it becomes law on June 1st. It’s very likely to happen but could be derailed by (a) Cameron deciding to go to war (b) the LibDems split from the government (c) freak storms destroy Parliament (d) content-holder lobbyists kill the bill in underhand ways.
  • It’s not just about content-mining. It’s about copying for private re-use (e.g. CD to memory stick), and parody. Reading the list of new exceptions makes you realise how restrictive the law has become. Queen Anne in 1710 didn’t even consider format shifting between technologies. And e-books for disabled people??

So here’s guidance for the main issues in simple language:

and here are the details (I’ll be analysing the “data analytics” in detail in a later post):

And here’s the initial announcement – includes URLs to the IPO and government pages.

From: CopyrightConsultation
Sent: 27 March 2014 15:06
To: CopyrightConsultation
Subject: Exceptions to copyright law – Update following Technical Review

The Government has today laid before Parliament the final draft of the Exceptions to Copyright regulations. This is an important step forward in the Government’s plan to modernise copyright for the digital age. I wanted to take this opportunity to thank you for your response to the technical review and to tell you about the outcome of this process and documents that have been published.

As you will recall, the technical review ran from June to September 2013 and you were invited to review the draft legislation at an early stage and to provide comments on whether it achieved the policy objectives, as set out in Modernising Copyright in December 2012.

We found the technical review to be a particularly valuable process. Over 140 organisations and individuals made submissions and we engaged with a wide range of stakeholders before and after the formal consultation period. The team at the IPO have also worked closely with Government and Parliamentary lawyers to finalise the regulations.

No policy changes have been made, but as a result of this process we have made several alterations to the format and drafting of the legislation. To explain these changes, and the thinking behind them, the Government has published its response to the technical review alongside the regulations. This document sets out the issues that were raised by you and others, the response and highlights where amendments have been made.

It is common practice for related regulations such as these to be brought forward as a single statutory instrument. However, the Government is committed to enabling the greatest possible scrutiny of these changes and the nine regulations have been laid before parliament in five groups.  In deciding how to group the regulations, we have taken account of several factors, including any relevant legal interconnections and common themes. The rationale behind these groupings is set out in the Explanatory Memorandum.

The Government has also produced a set of eight ‘plain English’ guides that explain what the changes mean for different sectors. The guides explain the nature of these changes to copyright law and answer key questions, including many raised during the Government’s consultation process.  The guides cover areas including disability groups, teachers, researchers, librarians, creators, rights-holders and consumers. They also explain what users can and cannot do with copyright material.

The response to the Technical Review and the guidance can be accessed through the IPO’s website. This also provides links to the final draft regulations, explanatory memorandum and associated documents.

It is now for Parliament to consider the regulations, which will be subject to affirmative resolution in both Houses. If Parliament approves the regulations they will come into force on 1 June 2014.

Thank you again for your contribution.

Yours sincerely,

John Alty