Latest leak on EU copyright reforms including TDM and its scope – commercial included.

Yesterday evening, it released the following tweet:-

The document itself (PDF) is 182 pages in length and the section relating to text and data mining (TDM) can be found in pages 93 – 108.


4.3.1. What is the problem and why is it a problem?
Problem: Researchers are faced with legal uncertainty with regard to whether and under which conditions they can carry out TDM on content they have lawful access to.
Description of the problem: Text and Data Mining (TDM) is a term commonly used to describe the automated processing (“machine reading”) of large volumes of text and data to uncover new knowledge or insights. TDM can be a powerful scientific research tool to analyse big corpuses of text and data such as scientific publications or research datasets.


It has been calculated that the overall amount of scientific papers published worldwide may be increasing by 8 to 9% every year and doubling every 9 years. In some instances, more than 90% of research libraries’ collections in the EU are composed of digital content. This trend is bound to continue; however, without intervention at EU level, the legal uncertainty and fragmentation surrounding the use of TDM, notably by research organisations, will persist. Market developments, in particular the fact that publishers may increasingly include TDM in subscription licences and develop model clauses and practical tools (such as the Cross-Ref text and data mining service), including as a result of the commitments taken in the 2013 Licences for Europe process to facilitate it, may partly mitigate the problem. However, fragmentation of the Single Market is likely to increase over time as a result of MS adopting TDM exceptions at national level which could be based on different conditions.
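
As a rough check on the growth figures quoted above: a constant compound growth rate r implies a doubling time of ln 2 / ln(1 + r). The short Python sketch below assumes the growth rate stays constant, which the quoted document does not itself claim, and the function name is purely illustrative.

```python
import math


def doubling_time_years(annual_growth_rate: float) -> float:
    """Years needed for a quantity to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)


# The two ends of the quoted "8 to 9% every year" range.
for rate in (0.08, 0.09):
    years = doubling_time_years(rate)
    print(f"{rate:.0%} per year -> doubles in about {years:.1f} years")
```

At 8% the doubling time comes out at roughly 9 years, in line with the quoted figure; at 9% it is closer to 8 years.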

Four options are suggested on TDM reform.

Option 1 – Fostering industry self-regulation initiatives without changes to the EU legal framework.

Option 2 – Mandatory exception covering text and data mining for non-commercial scientific research purposes.

Option 3 – Mandatory exception applicable to public interest research organisations covering text and data mining for the purposes of both non-commercial and commercial scientific research.

Option 4 – Mandatory exception applicable to anybody who has lawful access (including both public interest research organisations and businesses) covering text and data mining for any scientific research purposes.

The recommendation is for Option 3, which allows public interest research organisations (universities and research institutes) to mine for non-commercial AND commercial purposes. This appears mainly to support industrially funded research. On the whole it seems to be slight progress.

Option 3 is the preferred option. This option would create a high level of legal certainty and reduce transaction costs for researchers with a limited impact on right holders’ licensing market and limited compliance costs. In comparison, Option 1 would be significantly less effective and Option 2 would not achieve sufficient legal certainty for researchers, in particular as regards partnerships with private operators (PPPs). Option 3 allows reaching the policy objectives in a more proportionate manner than Option 4, which would entail significant foregone costs for rightholders, notably as regards licences with corporate researchers. In particular, Option 3 would intervene where there is a specific evidence of a problem (legal uncertainty for public interest organisations) without affecting the purely commercial market for TDM where intervention does not seem to be justified. In all, Option 3 has the best costs-benefits trade off as it would bring higher benefits (including in terms of reducing transaction costs) to researchers without additional foregone costs for rightholders as compared to Option 2 (Option 3 would have similar impacts on right holders but through a different legal technique i.e. scope of the exception defined through the identification of specific categories of beneficiaries rather than through the “non-commercial” purpose condition). The preferred option is also coherent with the EU open access policy and would achieve a good balance between copyright as a property right and the freedom of art and science.

Text and Data Mining (i.e. Content Mining) for Research and Innovation – What Europe Must Do Next


The Right to Read is the Right to Mine

This post is based upon an important report released at 08h00 CET on 31 May 2016.

The Lisbon Council launches Text and Data Mining for Research and Innovation: What Europe Must Do Next, an interactive policy brief which looks at the challenge and opportunity of text and data mining in a European context. Building on the Lisbon Council’s highly successful 2014 paper, which served as an important and early source of evidence on the uptake and interest in text and data mining among academics worldwide, the paper revisits the data two years later and finds that recent trends have only accelerated. Concretely, Asian and U.S. scholars continue to show a huge interest in text and data mining as measured by academic research on the topic. And Europe’s position is falling relative to the rest of the world. The paper looks at the legal complexity and red tape facing European scholars in the area, and calls for wholesale reform. The paper was prepared for and formally submitted as part of the European Commission’s Public Consultation on the Role of Publishers in the Copyright Value Chain and on the ‘Panorama Exception.’ Source.
In an associated Press Release:-

Text and Data Mining for Research and Innovation looks at the transformative role of text and data mining in academic research, benchmarks the role of European research against global competitors and reflects on the prospects for an enabling policy in the text and data mining field within the broader European political and economic context.

Among the key findings:

  • Asia leapfrogs EU in research on text and data mining. Over the last decade, Asia has replaced the European Union as the world’s leading centre for academic research on text and data mining as judged by number of publications. From 2011 to 2016, Asian scholars’ share of academic publications in the field rose to 32.4% of all global publications, up from 31.1% in 2000. The EU’s global share fell to 28.2%, down from 38.9% in 2000. North America remained in third place at 20.9% due to the relatively small size of the three-country region.
  • China ranks No.1 within Asia. As recently as 2000, Japan and Taiwan led Asia with 12.6% and 7% of all global text-and-data-mining-based publications. After a steady rise in interest, China now leads. On its own, it accounted for 11.7% of all global publications in 2015, up from zero in 2000. This gave China a No. 2 finish in the country rankings, second only to the United States. China’s ranking within Asia is now No. 1.
  • Chinese patents on data mining see unprecedented growth. China also led the global growth in the number of patents pertaining to data mining. While the number of patents granted by the U.S. Patent and Trademark Office (USPTO) remained relatively stable over the past decade, the number of patents granted for data-mining-related products by the State Intellectual Property Office of the People’s Republic of China (SIPO) rose to 149 in 2015, up from just one in 2005.
  • Chinese researchers are champions in patenting TDM procedures. Chinese researchers and organisations are patenting text-and-data-mining procedures at a faster rate than those of any other country in the world. This suggests that Chinese researchers attach a growing priority to the potential use of this new technique for stimulating scientific breakthroughs, disseminating technical knowledge and improving productivity throughout the scientific and technical community.
  • Middle East entering the game too. Some of the fastest growth and greatest interest was seen in relative newcomers: India, Iran and Turkey. Having shown virtually no interest in text and data mining as recently as 2000, the Middle East is now the world’s fourth largest region for research on text and data mining, led by Iran and Turkey.
  • Europe remains slow. Large European scientific, technical and medical publishers have added text-and-data-mining functionality to some dataset licences, but the overall framework in Europe remains slow and full of uncertainty. Many smaller publishers do not yet offer access of this type. And scholars complain that existing licences are too restrictive and do not allow for generating the advanced “big data” insights that come from detecting patterns across multiple datasets stored in different places or held by different owners.
  • Legal clarity also matters. Some countries apply the “fair-use” doctrine, which allows “exceptions” to existing copyright law, including for text and data mining. Israel, the Republic of Korea, Singapore, Taiwan and the United States are in this group. Others have created a new copyright “exception” for text and data mining – Japan, for instance, which adopted a blanket text-and-data-mining exception in 2009, and more recently the United Kingdom, where text and data mining was declared fully legal for non-commercial research purposes in 2014.
  • What Europe Must Do. New technologies make analysis of large volumes of text and other media potentially routine. But this can only happen if researchers have clearly established rights to use the relevant techniques, supported by the necessary skills and experience. Broadly speaking, the European ecosystem for engaging in text and data mining remains highly problematic, with researchers hesitant to perform valuable analysis that may or may not be legal. The end result: Europe is being leapfrogged by rising interest in other regions, notably Asia. European scholars are even forced, on occasion, to outsource their text and data mining needs to researchers elsewhere in the world, as has been reported repeatedly in past European Commission consultations. Anecdotally, we hear stories of university and research bureaux deliberately adding researchers in North America or Asia to consortia because those researchers will be able to do basic text and data mining so much more easily than in the EU.


OpenForum Europe publishes High Level Policy Paper on text and data mining

On 4 May 2016, OpenForum Europe (OFE) published an extremely significant policy paper on text and data mining.

OFE paper

From the OFE website:-

For the past months, OFE has been involved in an intensive research process regarding the various arguments and approaches relating to text and data mining (TDM) in Europe, which culminated with the paper published today, titled “An analytical review of text and data mining practices and approaches in Europe”.

The Commission should aim to achieve coherence in the legal provisions which it seeks to apply to TDM, with no consideration of ‘commercial’ versus ‘non-commercial’ purposes. Europe needs a regime which enables any researcher, citizen, company or other entity to engage in TDM activities, using material to which they have lawful access. The exact commercial rewards can be managed at subsequent stages, depending on the implementation of the mining outcome. The protection could be considered at the point at which some clearly commercially beneficial project, product, service, business or company has emerged.

The report itself mentions that Peter Murray-Rust contributed to it:-

This paper is based on extensive desk research, including most of the benchmark reports, such as the European Commission funded Expert Group Report (2014), the study by De Wolf and Partners (2014), the UK IPO’s ‘Exceptions to Copyright’ brief (October 2014), as well as numerous other reports, position papers, articles and blog posts [1]. The initial findings have been discussed at the Round Table that OFE organised in October 2015, the conclusions of which are available in the follow-up White Paper. The desk research and Round Table discussion have been complemented by a series of interviews with academics, researchers, start-ups, and more established companies (including publishers and infrastructure providers) [2].

[1] A comprehensive list can be provided upon request.
[2] The interviews were conducted between September 2015 and February 2016, with the following experts (in alphabetical order): Geoffrey Bilder (CrossRef), Vivian Chan (Sparrho), Elizabeth Crossick (RELX), Lucie Guibault (IViR), Prof. Ian Hargreaves, Rachael Lammey (CrossRef), Thomas Margoni (Openminted), Peter Murray-Rust (Content Mine), Cameron Neylon (Public Library of Science), Julia Reda (MEP), Tim Stok (RELX), Kalliopi Spyridaki (SAS).

From the conclusions of the paper:-

Even if TDM is to be allowed through a generalised exception, APIs will still be needed to do the actual mining. Trusted third party platforms which make APIs available should be encouraged. Having a trusted third party in the mining process could provide a middle ground where publishers feel more confident that their content is not about to be misappropriated, and where miners feel they can engage in TDM without their project being put at risk of plagiarism or other sharp practice.

Bringing all stakeholders around a table would appear to be the most advisable solution, not least because there remains a degree of mistrust between some publishers and some researchers. Sometimes the presence of diverging interests can motivate such tension, but in other cases there can indeed be factors or aspects to which one category of stakeholder rightfully points, but which are not always foreseeable or even obvious for other categories of stakeholder.

In order to be sustainable and to avoid the need for future legislative updates, the provision should be drafted in neutral terms, sufficient to withstand the passage of time and likely evolution of the associated technology.

France pushing for UK-like exception for TDM with possible library role

Details emerged on Twitter yesterday afternoon about copyright developments taking place in France.




Further details emerged overnight. The current situation is ongoing, but it would appear that progress might be being made.


Will France get its text and data mining exception?


Quoting in part from the above blog post by Pierre-Carl Langlais:-


The French parliament is currently examining a significant law on « The Digital Republic ». It covers a wide array of topics: open data, open access, Freedom of panorama, data portability, digital platform regulation… Contrary to the initial plan of the French government, several MPs have added another item to this list: a text and data mining exception. While it has been approved by the French lower house, the final fate of the exception is still uncertain at this stage.

The current intellectual property regulation in Europe actually restrains an extraordinary scientific opportunity: being able to grasp the main features, ideas and entities of millions of texts. Both the massive digitization of corpora and the sophistication of mining and clustering tools make it possible to deploy “distant reading” on an unprecedented scale. The Text2genome project can therefore draw an immense state of the art from genomic literature; ContentMine aims to liberate 100 million facts from academic publications. I’m both an advocate and a practitioner of text and data mining. My main focus concerns 19th century generalist and scientific periodicals, which are fortunately in the public domain, but I would be very happy to extend my scope to the beginning of the 20th century.

The French exception is roughly similar to its British counterpart. Both are limited to scientific research; both aim to link the right to read and the right to mine, by giving a licence to extract to researchers who have lawful access to a resource. The French version nevertheless adds a restriction regarding the preservation of the output. So far the British exception does not seem to set any temporal limitation on the keeping of the “local” databases.

A first draft of the French exception actually envisaged having the output fully deleted once the research was over. This would have proved quite impractical. Like every scientific work, text and data mining projects are perfectible: the output is so large that it can be of use for many unexpected purposes. Fortunately, the current amendment comes to better terms: the output would be kept by “certified” organisations (the French National Library is very likely to be one of them), and could therefore be reused.


The above post by Pierre-Carl Langlais has been updated. In part:-


Yet, the fight was not unequal. An extraordinary procedure gave scientific communities the ability to speak up in favor of better terms for the open access law and in favor of the reintroduction of the mining exception: apparently for the first time in France, a law was examined, evaluated and completed on an online platform. The “République Numérique” website hosted several thousand contributions from virtually everyone: lobbies, companies, public institutions, communities and ordinary citizens… Several of them argued in favor of a mining exception, and they were extremely well received.


Mining proposal of the Couperin Consortium on République Numérique

This open consultation has given a lot of additional momentum to innovative measures that would otherwise have been held up by powerful lobbies, or simply not conceived in the first place. MPs seem to have been particularly sensitive to original perspectives. Full Freedom of Panorama (that includes commercial use) was actively supported by one fifth of the lower house (more than one hundred MPs actually signed at least one favorable amendment). While MPs were more open-minded, scientific communities proved bolder: a few days ago, the President of Universities Collective (CPU) and the National Centre for Scientific Research (CNRS) signed a strong commitment toward an open access law and a mining exception.

Rightholders’ lobbies are still far from powerless. They were able to have a significant innovation partly removed from the law: a positive recognition of “informational commons” (a wider notion that covers the public domain of “works”, the public domain of “information” and voluntary Free knowledge such as Wikipedia). Nevertheless, they have never been in such a weak position in the recent past.

All in all, the game remains fully open. The current state of the “open access” law reflects this tense debate between coalitions of actors of increasingly equal strength and influence: while scientific institutions, public libraries and open knowledge communities were able to add data to the scope of the law (something their German counterparts have not succeeded in), the publishers have, no less successfully, vetoed the inclusion of collected conference contributions. The suspense should still last till the end of the legislative procedure (probably no more than a month).





2016: A bold new future for research?

This post today on the Wellcome Trust blog includes work by ContentMine on how researchers could benefit from copyright reform for TDM:-


Text and data mining also offers the opportunity to both speed up research and lower the overall cost. Preliminary analyses by ContentMine suggest that using these tools could help streamline systematic reviews – the gold standard of evidence-based medicine – by reducing the time taken to filter the thousands of papers required, reducing researchers’ workloads by at least 50%.

However, copyright law presents a barrier to using data mining tools, because computers have to make a copy of the material in order to mine it. As the vast majority of published research is protected by copyright, this puts data mining in conflict with copyright law. While a researcher may have access to material via licenses held by their institution, and is legally entitled to read it, they are not entitled to mine it. But we believe there is no reason why the right to read should not also be the right to mine.

European reform of copyright for Content Mining

Today the European Commission announced their action plan for copyright reform within the Digital Single Market strategy. There’s been a lot of whooping by the reforming twitterati, but I am less overjoyed. I’ll give it 1 cheer out of 3. Chris Hartgerink – a content-miner and statistician – has blogged a good overview here. There are also blogs from the League of European Research Universities (LERU) and Copyright for Creativity (this last one has a lot of useful yes/maybe/no evaluation).

I and others like Chris want to mine the information, which is only available at rich universities that subscribe to expensive journals. In the Netherlands Chris tried to do this but the publisher Elsevier told the University to tell Chris to stop doing his research and the University told him to stop. This is, presumably, because his university has signed a restrictive contract with Elsevier giving up their right to do it. So we have the blatant injustice that a publisher, to which the University pays huge amounts of money, effectively controls mining-based research.

In the UK we don’t have this problem because in 2014 the government amended the law (“Hargreaves exception”) to allow mining for non-commercial research purposes. And added a clause that this right could not be overridden by contractual restrictions. So I can mine the literature and I and colleagues are tooling up to do so.

This exception has been seen as a focus for European reform. Julia Reda, MEP, who drafted an excellent proposal for the Parliament earlier this year, wanted to go further and allow commercial mining (which makes sense if you want to grow an information-based scientific industrial base). But there has been huge – really huge – opposition from the “content owners” and every new draft gets watered down.

Today we have the EC’s response. It’s a draft – so there will be more changes. There’s still a lot of “we should think about X” rather than “we are going to do X”, but it’s more positive than I expected.

  • On the plus side it argues for all types of activity, and uses “mandatory exception” which I think means that we can ignore contractual clauses that override it. It also argues for “legal certainty” which we absolutely need – I have no guarantee that what I am doing is absolutely legal (and no one can tell me).
  • On the minus side it restricts the use to “public interest research organisations”. No one knows what these are, and that’s a terrible place to be. I think it means universities and national laboratories. It might include medical charities like CRUK – who knows. But even with a broad sweep it’s likely only to be 500,000 people, whereas we have > 20 million adults. So this would only apply to 2% of the population – the other 98% are disenfranchised. That’s unacceptable in any code. Pragmatically it could mean that these institutions had to manage and regulate the mining. I would have to apply to Cambridge – the process might be arcane, and might be seriously influenced by content-owners. Universities are ultra-risk averse and the simplest thing to do in case of doubt is to forbid it.

And there’s a non-legal aspect. I have yet to meet anyone else in the UK who is using the UK Hargreaves legislation to legitimate their mining. (There may be people doing it without telling the University but this isn’t because of the legislation – it’s possible they don’t even know about it). And I’ve not heard of any University that has set up training courses. Meanwhile publishers are making it as hard as possible to mine by adding CAPTCHAs and other traps. So some questions for readers:

  • do you know of any UK university which has facilitated mining in the last year? (WITHOUT signing additional contracts)
  • do you know of anyone in the UK who has been banned from research using mining?

The point is that even changing the EC law – and that won’t happen fast – still needs the Universities to help their researchers rather than acting as police for the publishers.