Content mining for trend analysis

Let’s suppose you have assembled a large collection of papers (we’ll call that a corpus) as a starting point for a literature review. The first questions are usually exploratory in nature: you want to get an intuition of what’s really in there. “Is there a certain structure, possibly a hidden bias I need to take into account? What is the coverage – are there ‘holes’ in the data set, perhaps some missing months, or should I include another keyword in the search? How do certain keyword frequencies develop over time – is a trend appearing?” We can help with getting this initial overview, speeding up the process so you can get to work on the questions that really interest you.
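
As a flavour of what this looks like in practice (a minimal sketch only, not ContentMine code – the record fields, dates and keyword below are invented), counting papers per month, overall and per keyword, already answers the coverage and trend questions:

```python
# Minimal sketch of an exploratory pass over a corpus, assuming each record
# is a dict with a "date" (YYYY-MM-DD) and a "text" field (title + abstract).
# Field names and the example keyword are hypothetical.
from collections import Counter
from datetime import datetime

def monthly_counts(corpus, keyword=None):
    """Count papers per month, optionally only those mentioning `keyword`."""
    counts = Counter()
    for record in corpus:
        if keyword and keyword.lower() not in record["text"].lower():
            continue
        month = datetime.strptime(record["date"], "%Y-%m-%d").strftime("%Y-%m")
        counts[month] += 1
    return counts

corpus = [
    {"date": "2016-01-12", "text": "Text and data mining of clinical trials"},
    {"date": "2016-02-03", "text": "Zika virus epidemiology"},
    {"date": "2016-02-20", "text": "Data mining for trend analysis"},
]

total = monthly_counts(corpus)                  # coverage: spot missing months
mining = monthly_counts(corpus, "data mining")  # keyword trend over time
for month in sorted(total):
    print(month, total[month], mining.get(month, 0))
```

Plotting the two series against each other is usually enough to spot gaps in coverage or an emerging trend before any deeper analysis.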

Latest leak on EU copyright reforms including TDM and its scope – commercial included.

Statewatch is a non-profit organisation founded in 1991 that monitors the state and civil liberties in the European Union.

Yesterday evening, it tweeted a link to the leaked document.

The document itself (PDF) is 182 pages long, and the section relating to text and data mining (TDM) can be found on pages 93–108.

INTRO

4.3. TEXT AND DATA MINING
4.3.1. What is the problem and why is it a problem?
Problem: Researchers are faced with legal uncertainty with regard to whether and under which conditions they can carry out TDM on content they have lawful access to.
Description of the problem: Text and Data Mining (TDM) is a term commonly used to describe the automated processing (“machine reading”) of large volumes of text and data to uncover new knowledge or insights. TDM can be a powerful scientific research tool to analyse big corpuses of text and data such as scientific publications or research datasets.

The leaked document continues:-

It has been calculated that the overall amount of scientific papers published worldwide may be increasing by 8 to 9% every year and doubling every 9 years. In some instances, more than 90% of research libraries’ collections in the EU are composed of digital content. This trend is bound to continue; however, without intervention at EU level, the legal uncertainty and fragmentation surrounding the use of TDM, notably by research organisations, will persist. Market developments, in particular the fact that publishers may increasingly include TDM in subscription licences and develop model clauses and practical tools (such as the Cross-Ref text and data mining service), including as a result of the commitments taken in the 2013 Licences for Europe process to facilitate it, may partly mitigate the problem. However, fragmentation of the Single Market is likely to increase over time as a result of MS adopting TDM exceptions at national level which could be based on different conditions.
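
As a quick sanity check on those figures: growth of 8–9% a year implies a doubling time of ln 2 / ln(1.08) ≈ 9 years to ln 2 / ln(1.09) ≈ 8 years, so the “doubling every 9 years” claim is consistent with the stated growth rate.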

Four options for TDM reform are suggested.

Option 1 – Fostering industry self-regulation initiatives without changes to the EU legal framework.

Option 2 – Mandatory exception covering text and data mining for non-commercial scientific research purposes.

Option 3 – Mandatory exception applicable to public interest research organisations covering text and data mining for the purposes of both non-commercial and commercial scientific research.

Option 4 – Mandatory exception applicable to anybody who has lawful access (including both public interest research organisations and businesses) covering text and data mining for any scientific research purposes.

The recommendation is for Option 3, which allows Public Interest Research Organisations (universities and research institutes) to mine for both non-commercial and commercial purposes. This appears mainly to support industrially funded research. On the whole it seems to be slight progress.

Option 3 is the preferred option. This option would create a high level of legal certainty and reduce transaction costs for researchers with a limited impact on right holders’ licensing market and limited compliance costs. In comparison, Option 1 would be significantly less effective and Option 2 would not achieve sufficient legal certainty for researchers, in particular as regards partnerships with private operators (PPPs). Option 3 allows reaching the policy objectives in a more proportionate manner than Option 4, which would entail significant foregone costs for rightholders, notably as regards licences with corporate researchers. In particular, Option 3 would intervene where there is a specific evidence of a problem (legal uncertainty for public interest organisations) without affecting the purely commercial market for TDM where intervention does not seem to be justified. In all, Option 3 has the best costs-benefits trade off as it would bring higher benefits (including in terms of reducing transaction costs) to researchers without additional foregone costs for rightholders as compared to Option 2 (Option 3 would have similar impacts on right holders but through a different legal technique i.e. scope of the exception defined through the identification of specific categories of beneficiaries rather than through the “non-commercial” purpose condition). The preferred option is also coherent with the EU open access policy and would achieve a good balance between copyright as a property right and the freedom of art and science.

Some reactions thus far via social media.

Text and Data Mining (i.e. Content Mining) for Research and Innovation – What Europe Must Do Next

The Right to Read is the Right to Mine

This post is based upon an important report released at 08h00 CET on 31 May 2016.

The Lisbon Council launches Text and Data Mining for Research and Innovation: What Europe Must Do Next, an interactive policy brief which looks at the challenge and opportunity of text and data mining in a European context. Building on the Lisbon Council’s highly successful 2014 paper, which served as an important and early source of evidence on the uptake of and interest in text and data mining among academics worldwide, the paper revisits the data two years later and finds that recent trends have only accelerated. Concretely, Asian and U.S. scholars continue to show a huge interest in text and data mining as measured by academic research on the topic. And Europe’s position is falling relative to the rest of the world. The paper looks at the legal complexity and red tape facing European scholars in the area, and calls for wholesale reform. The paper was prepared for and formally submitted as part of the European Commission’s Public Consultation on the Role of Publishers in the Copyright Value Chain and on the ‘Panorama Exception.’ Source.
 In an associated Press Release:-

 Text and Data Mining for Research and Innovation looks at the transformative role of text and data mining in academic research, benchmarks the role of European research against global competitors and reflects on the prospects for an enabling policy in the text and data mining field within the broader European political and economic context.

Among the key findings:

  • Asia leapfrogs EU in research on text and data mining. Over the last decade, Asia has replaced the European Union as the world’s leading centre for academic research on text and data mining as judged by number of publications. From 2011 to 2016, Asian scholars’ share of academic publications in the field rose to 32.4% of all global publications, up from 31.1% in 2000. The EU’s global share fell to 28.2%, down from 38.9% in 2000. North America remained in third place at 20.9% due to the relatively small size of the three-country region.
  • China ranks No.1 within Asia. As recently as 2000, Japan and Taiwan led Asia with 12.6% and 7% of all global text-and-data-mining-based publications. After a steady rise in interest, China now leads. On its own, it accounted for 11.7% of all global publications in 2015, up from zero in 2000. This gave China a No. 2 finish in the country rankings, second only to the United States. China’s ranking within Asia is now No. 1.
  • Chinese patents on data mining see unprecedented growth. China also led the global growth in the number of patents pertaining to data mining. While the number of patents granted by the U.S. Patent and Trademark Office (USPTO) remained relatively stable over the past decade, the number of patents granted for data-mining-related products by the State Intellectual Property Office of the People’s Republic of China (SIPO) rose to 149 in 2015, up from just one in 2005.
  • Chinese researchers are champions in patenting TDM procedures. Chinese researchers and organisations are patenting text-and- data-mining procedures at a faster rate than any other country in the world. This suggests that Chinese researchers attach a growing priority to the potential use of this new technique for stimulating scientific breakthroughs, disseminating technical knowledge and improving productivity throughout the scientific and technical community.
  • Middle East entering the game too. Some of the fastest growth and greatest interest was seen in relative newcomers: India, Iran and Turkey. Having shown virtually no interest in text and data mining as recently as 2000, the Middle East is now the world’s fourth largest region for research on text and data mining, led by Iran and Turkey.
  • Europe remains slow. Large European scientific, technical and medical publishers have added text-and-data-mining functionality to some dataset licences, but the overall framework in Europe remains slow and full of uncertainty. Many smaller publishers do not yet offer access of this type. And scholars complain that existing licences are too restrictive and do not allow for generating the advanced “big data” insights that come from detecting patterns across multiple datasets stored in different places or held by different owners.
  • Legal clarity also matters. Some countries apply the “fair-use” doctrine, which allows “exceptions” to existing copyright law, including for text and data mining. Israel, the Republic of Korea, Singapore, Taiwan and the United States are in this group. Others have created a new copyright “exception” for text and data mining – Japan, for instance, which adopted a blanket text-and-data-mining exception in 2009, and more recently the United Kingdom, where text and data mining was declared fully legal for non-commercial research purposes in 2014.
  • What Europe Must Do. New technologies make analysis of large volumes of text and other media potentially routine. But this can only happen if researchers have clearly established rights to use the relevant techniques, supported by the necessary skills and experience. Broadly speaking, the European ecosystem for engaging in text and data mining remains highly problematic, with researchers hesitant to perform valuable analysis that may or may not be legal. The end result: Europe is being leapfrogged by rising interest in other regions, notably Asia. European scholars are even forced, on occasion, to outsource their text and data mining needs to researchers elsewhere in the world, as has been reported repeatedly in past European Commission consultations. Anecdotally, we hear stories of university and research bureaux deliberately adding researchers in North America or Asia to consortia because those researchers will be able to do basic text and data mining so much more easily than in the EU.

 

Some reactions on social media:-

 .@lisboncouncil calls for modern rules to enable #TDM https://t.co/35oueTMprC. Here’s #IFLA‘s (joint) call for this https://t.co/DICEepMkKz!

 Asia leapfrogs EU in research on text and data mining says latest @LisbonCouncil Report on #TDM https://t.co/isOpo1nkm2

 EU falling behind on #TDM according to latest brief from @lisboncouncil https://t.co/ploUaNlyid

 This makes me mad Asia is doing #tdm Publishers like #elsevier and Wiley are stopping me and colleagues. @Senficon

OpenForum Europe publishes High Level Policy Paper on text and data mining

On 4 May 2016, OpenForum Europe (OFE) published an extremely significant policy paper on text and data mining.

From the OFE website:-

For the past months, OFE has been involved in an intensive research process regarding the various arguments and approaches relating to text and data mining (TDM) in Europe, which culminated with the paper published today, titled “An analytical review of text and data mining practices and approaches in Europe”.

The Commission should aim to achieve coherence in the legal provisions which it seeks to apply to TDM, with no consideration of ‘commercial’ versus ‘non-commercial’ purposes. Europe needs a regime which enables any researcher, citizen, company or other entity to engage in TDM activities, using material to which they have lawful access. The exact commercial rewards can be managed at subsequent stages, depending on the implementation of the mining outcome. The protection could be considered at the point at which some clearly commercially beneficial project, product, service, business or company has emerged.

From the report itself, mention is made that Peter Murray-Rust contributed to it:-

This paper is based on extensive desk research, including most of the benchmark reports, such as the European Commission funded Expert Group Report (2014), the study by De Wolf and Partners (2014), the UK IPO’s ‘Exceptions to Copyright’ brief (October 2014), as well as numerous other reports, position papers, articles and blog posts [1]. The initial findings have been discussed at the Round Table that OFE organised in October 2015, the conclusions of which are available in the follow-up White Paper. The desk research and Round Table discussion have been complemented by a series of interviews with academics, researchers, start-ups, and more established companies (including publishers and infrastructure providers) [2].

[1] A comprehensive list can be provided upon request.
[2] The interviews were conducted between September 2015 and February 2016, with the following experts (in alphabetical order): Geoffrey Bilder (CrossRef), Vivian Chan (Sparrho), Elizabeth Crossick (RELX), Lucie Guibault (IViR), Prof. Ian Hargreaves, Rachael Lammey (CrossRef), Thomas Margoni (Openminted), Peter Murray-Rust (Content Mine), Cameron Neylon (Public Library of Science), Julia Reda (MEP), Tim Stok (RELX), Kalliopi Spyridaki (SAS).

From the conclusions of the paper:-
Even if TDM is to be allowed through a generalised exception, APIs will still be needed to do the actual mining. Trusted third party platforms which make APIs available should be encouraged. Having a trusted third party in the mining process could provide a middle ground where publishers feel more confident that their content is not about to be misappropriated, and where miners feel they can engage in TDM without their project being put at risk of plagiarism or other sharp practice.
Bringing all stakeholders around a table would appear to be the most advisable solution, not least because there remains a degree of mistrust between some publishers and some researchers. Sometimes the presence of diverging interests can motivate such tension, but in other cases there can indeed be factors or aspects to which one category of stakeholder rightfully points, but which are not always foreseeable or even obvious for other categories of stakeholder.
In order to be sustainable and to avoid the need for future legislative updates, the provision should be drafted in neutral terms, sufficient to withstand the passage of time and likely evolution of the associated technology.
OFE are a great organization, and you can also follow their work on Twitter via @OpenForumEurope.

TDM at European Parliament – tweet-like report

Great meeting at the Brussels EP yesterday. Would have liked to tweet but didn’t have a password – there *were* tweets by the MEPs. So I wrote my notes like the tweets I would have made. Maybe useful to some, mystifying to others…

Also Julia Reda MEP was there at the start!

Here’s the panel (7-8) run by Catherine Stihler MEP (who chaired well and let everyone else speak)

Marco Giorello, Head of Copyright Unit, DG Connect

Problem: data analytics techniques involve making copies
These copies are relevant to copyright
Legal situation unclear; some exceptions for temporary copying, and copying for research purposes
(a) contractual conditions and policies
(b) legislation – UK exception – because there was already a research exception (but leads to Euro fragmentation).
Other states have a “research exception”, e.g. France and ?Germany – we don’t want 15 different legislations
Dec 2015 – EC trying to find balance – PIRO [Public Interest Research Organisation; yes, I don’t know what that is either, so I asked later…] – to address Univs and research insts.
But aware that Univs have private partners
UK “non-commercial” has caused problems.
Not only about copyright – but also technology, standards…

John Boswell, SAS (software company) – analysis of data.
TDM is just one form of data analysis. Copyright is wider, because movies, images, voice are all covered by copyright
analysis of 1 million docs to extract sentiment and time series does not implicate (C).
(C) is protection of the expression of an idea. Analysing this does not copy the expression or create a derivative work. (C) must not prevent TDM. Issue much bigger than Universities. World has so much (C) – ca 300,000 every minute on FB, Tweets, Instagram, etc. Much covered by copyright
Analysis of social media is a major good. Govs can use social media to predict economics
Debate must realise that TDM does not implicate (C)

Theresa Comodini Cachia (MEP and meeting convener)

Don’t wish to have a debate on copyright vs TDM
Startups need protection from copyright and also need to use TDM
Startup innovation is an EU priority – social and economic development
TDM will lead to new economic development
Reda report focussed on academic research.
innovation not just economic but also health and social
would give a good push to innovation

Jakub Czakon (Stermedia) – (data analyst Physics + finance + chess)
loves data
TDM = data -> information -> knowledge
example s/w that matches CVs onto job offers
extract important info from data
try to match qualifications – find connections and distances between documents (see the sketch after this list)
health care – diagnosis of tumour – used machine learning and public data – found public competition training set.
looks for cells and local structure. Created diagnostic indicators.
facial recognition
these skills and startups are critical for Europe
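
As a rough illustration of “distances between documents” (a minimal sketch, not Stermedia’s actual system – the sample texts are invented), cosine similarity over TF-IDF vectors is one common way to match a CV against job offers:

```python
# Hypothetical example: rank job offers by similarity to a CV using TF-IDF
# vectors and cosine similarity (scikit-learn). Texts are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cv = "Python developer, machine learning, text mining, chemistry background"
offers = [
    "Data scientist: machine learning and text mining on scientific papers",
    "Front-end developer: JavaScript, CSS, design",
    "Cheminformatics engineer: Python, chemistry data pipelines",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([cv] + offers)   # row 0 is the CV
scores = cosine_similarity(vectors[0], vectors[1:])[0]

# Print offers from most to least similar to the CV
for offer, score in sorted(zip(offers, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {offer}")
```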

Adriana Homolova – data journalist and visualisation
dataScience >> data analysis (insight into data) >> data analytics (analysing large amounts of data) >> data mining
uses AI.
NeuralNets, RandomForests, NearestNeighbours (see the toy example after these notes)
Data mining is starting in journalism
journalism qualitative vs quantitative – “Interview data”
makes journalism stronger
data analysis used to filter professors for side jobs for “interesting people”
e.g. 3 side jobs per prof
BBC analysed tennis for match-fixing for repeated underperforming
published on github
revolutionary in journalism
Panama Papers got 400 (competing) journalists to abandon secrecy – “newsroom collaboration”
data are the raw material of our age.
copyright can do much harm.
data analytics are an extension of our thought processes
we must look how to open up – e.g. copyleft
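
The classifiers name-checked in these notes are all available off the shelf; as a minimal, invented toy example (not anything presented at the meeting), here are a random forest and a nearest-neighbour model distinguishing two topics in short text snippets:

```python
# Invented toy example: classify short snippets by topic with two of the
# algorithms mentioned above (random forest, nearest neighbours).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

texts = [
    "minister resigns over leaked offshore accounts",      # politics
    "parliament debates new copyright exception",          # politics
    "striker scores twice in cup final",                   # sport
    "tennis player investigated for match fixing",         # sport
]
labels = ["politics", "politics", "sport", "sport"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

for model in (RandomForestClassifier(n_estimators=50, random_state=0),
              KNeighborsClassifier(n_neighbors=1)):
    model.fit(X, labels)
    test = vectorizer.transform(["MEPs vote on text and data mining"])
    print(type(model).__name__, model.predict(test)[0])
```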

Jean-Francois Dechamp DG Research and Innovation
both policy creation and funding agency
FutureTDM and OpenMinTed
objective – best conditions to do their job
researchers are both producers and consumers
researchers often don’t own the copyright of their research
competition fierce – merger of Springer and Nature
data journals
publishers => service providers

Sergey Filippov, Lisbon Council (Brussels innovation think tank)
Report 2 years ago on TDM in Academic and Research Communities in Europe
Academic pubs ~1.5 million / year, 60 million in total
“Publish or perish” leads to distraction from teaching and poor research
Traditional k/w search; TDM can recognise concepts, facts, relations, preparatory
idea -> lit rev (TDM) -> hypothesis (TDM) -> data methodology -> analysis -> conclusions
what’s problem? copyright …
researched this…
scientific publications: 1200 pubs, 47% from US, EU 26%; EU cited less than US
applicable to all subjects, not just hard sciences
10-fold increase in Data mining, TDM papers in last 5 years
US 21%, EU 28%, CN 10%, IN 13%
Patents in data mining huge growth in China
Then he interviewed 20 researchers
most people don’t know about TDM or aren’t tech-savvy
many worried about copyright
leads to results of lower quality
academic want exceptions
R2RR2M – the Right to Read is the Right to Mine
growth in CN and IN and US
Europeans concerned but worried about clarity
if we don’t manage to get TDM used, then far-reaching negative implications for EU

Questions:

Christoph Bruch, Open Science Coordination Office of the Helmholtz Association:

a lot of researchers want assurance
Must not be universities only
(to Marco, EC) must not limit how society can use information
a limit will do very much damage

Marco – commercial vs non-commercial. Current draft is not final.
Why not business activities? Exception would also be (C) but for certain classes of beneficiaries.
must look at (C) with care
cause friction
Pharma already use licences
Existing lucrative market for re-use, so EC can’t easily sweep it away
attempt to give full legal certainty
will be positive for academia and neutral for others

Boswell (SAS) – there is a broad exception for TDM as “fair use” if not used for another purpose
interim step – new work is not a copy of the expression
in the EC temporary copies should be covered by Art. 5(1) of the InfoSoc Directive
PPPs with universities – lines are blurred
Should not draw lines between univs and others

PM-R gave TDMer point of view and asked about PIRO – more later

@TheContentMine preparing for large-scale high-throughput mining (TDM)

The ContentMine (contentmine.org) has almost finished the infrastructure and software for automatic daily mining of the scientific literature. We hope to start testing in the next few days. I’ll try to post frequent information.

The software has been developed by the ContentMine Team, wonderfully funded by the Shuttleworth Foundation. The people involved include:

  • Mark MacGillivray
  • Anusha Ranganathan
  • Richard Smith-Unna
  • Tom Arrow
  • Peter Murray-Rust
  • Chris Kittel
  • and voluntary contributions

The daily operation (as opposed to user-driven getpapers) consists of:

  • DOIs and URLs provided by CrossRef
  • downloading software
  • indexing of fulltext documents (closed as well as open, legal under the UK “Hargreaves” exception)
  • fact extraction
  • display

We’ll detail this later.
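
In the meantime, here is a minimal sketch of the first step only (assuming the public CrossRef REST API at api.crossref.org; this is illustrative, not the ContentMine code itself, and the date and page size are arbitrary examples): paging through the DOIs registered on a given day.

```python
# Minimal sketch: fetch DOIs registered with CrossRef on a given day via the
# public REST API (api.crossref.org). Not the ContentMine pipeline itself.
import requests

def dois_for_day(day, rows=100):
    """Yield DOIs created on `day` (YYYY-MM-DD), using cursor-based paging."""
    url = "https://api.crossref.org/works"
    params = {
        "filter": f"from-created-date:{day},until-created-date:{day}",
        "rows": rows,
        "cursor": "*",
    }
    while True:
        message = requests.get(url, params=params, timeout=30).json()["message"]
        items = message["items"]
        if not items:
            break
        for item in items:
            yield item["DOI"]
        params["cursor"] = message["next-cursor"]  # continue with the next page

if __name__ == "__main__":
    for i, doi in enumerate(dois_for_day("2016-06-01")):
        print(doi)
        if i >= 20:   # just show the first few
            break
```

The later stages (download, indexing, fact extraction, display) hang off the DOI list produced here.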

The sources include:

  • open repositories such as EuropePubMedCentral
  • arxiv and other repositories
  • closed documents to which Cambridge University subscribes. We are working intimately with Cambridge University Library staff and offer public applause and thanks.

All closed work will be carried out on closed machines run by the University’s computer officers, primarily in Chemistry – again, public thanks to this wonderful group. We take great care to ensure that no unauthorised access is possible and that there is an audit trail of what we do and have done.

It is difficult to predict the daily volume. Mark MacGillivray has found it to vary between 300 and 80,000 documents a day. My guess is about 2,000–7,000 on average.

This is NOT a resource problem. The whole scientific literature for a year can be held on a terabyte disk. The processing time is small – perhaps 1000 documents a minute on our system. The whole literature can be done within a long coffee break.

The impact on publisher servers is minimal: at, say, 5,000 articles/day even the largest publisher would only get about 1 request per minute. The others would see trivial traffic (1 request every 5–10 minutes). There is no case that our responsible TDM would cause any problems at all.
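
To spell the arithmetic out: 5,000 requests spread over 1,440 minutes is roughly 3.5 requests per minute in total; if the largest publisher accounts for roughly a quarter to a third of the articles, it sees on the order of 1 request per minute, and smaller publishers correspondingly less.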

And, just to reassure everyone, my colleagues and I are working hard to stay completely within the law as we see it. We are not stealing content.