Statewatch is a non-profit organisation founded in 1991 that monitors the state and civil liberties in the European Union.
Yesterday evening, it released the following tweet:
The document itself (PDF) is 182 pages long, and the section relating to text and data mining (TDM) can be found on pages 93–108.
4.3. TEXT AND DATA MINING
4.3.1. What is the problem and why is it a problem?
Problem: Researchers are faced with legal uncertainty with regard to whether and under which conditions they can carry out TDM on content they have lawful access to.
Description of the problem: Text and Data Mining (TDM) is a term commonly used to
describe the automated processing (“machine reading”) of large volumes of text and data to uncover new knowledge or insights. TDM can be a powerful scientific research tool to analyse big corpuses of text and data such as scientific publications or research datasets.
It has been calculated that the overall amount of scientific papers published worldwide may be increasing by 8 to 9% every year and doubling every 9 years. In some instances, more than 90% of research libraries’ collections in the EU are composed of digital content. This trend is bound to continue; however, without intervention at EU level, the legal uncertainty and fragmentation surrounding the use of TDM, notably by research organisations, will persist. Market developments, in particular the fact that publishers may increasingly include TDM in subscription licences and develop model clauses and practical tools (such as the Cross-Ref text and data mining service), including as a result of the commitments taken in the 2013 Licences for Europe process to facilitate it may partly mitigate the problem. However, fragmentation of the Single Market is likely to increase over time as a result of MS adopting TDM exceptions at national level which could be based on different conditions.
Four options for TDM reform are suggested.
Option 1 – Fostering industry self-regulation initiatives without changes to the EU legal framework.
Option 2 – Mandatory exception covering text and data mining for non-commercial scientific research purposes.
Option 3 – Mandatory exception applicable to public interest research organisations covering text and data mining for the purposes of both non-commercial and commercial scientific research.
Option 4 – Mandatory exception applicable to anybody who has lawful access (including both public interest research organisations and businesses) covering text and data mining for any scientific research purposes.
The recommendation is for Option 3, which allows public interest research organisations (universities and research institutes) to mine for both non-commercial AND commercial purposes. This appears mainly intended to support industrially funded research. On the whole it represents slight progress.
Option 3 is the preferred option. This option would create a high level of legal certainty and reduce transaction costs for researchers with a limited impact on right holders’ licensing market and limited compliance costs. In comparison, Option 1 would be significantly less effective and Option 2 would not achieve sufficient legal certainty for researchers, in particular as regards partnerships with private operators (PPPs). Option 3 allows reaching the policy objectives in a more proportionate manner than Option 4, which would entail significant foregone costs for rightholders, notably as regards licences with corporate researchers. In particular, Option 3 would intervene where there is a specific evidence of a problem (legal uncertainty for public interest organisations) without affecting the purely commercial market for TDM where intervention does not seem to be justified. In all, Option 3 has the best costs-benefits trade off as it would bring higher benefits (including in terms of reducing transaction costs) to researchers without additional foregone costs for rightholders as compared to Option 2 (Option 3 would have similar impacts on right holders but through a different legal technique i.e. scope of the exception defined through the identification of specific categories of beneficiaries rather than through the “non-commercial” purpose condition). The preferred option is also coherent with the EU open access policy and would achieve a good balance between copyright as a property right and the freedom of art and science.
Some reactions thus far via social media.
I am Lars, from the Netherlands, where I currently live. I applied to this fellowship to learn new things and to combine the ContentMine with two previous projects I never got to finish, and I got really excited by the idea and by the ContentMine at large.
Practically, it is about collecting data about conifers and visualising it in a dynamic HTML page. This is done in three parts. The first part is to fetch, normalise, and index papers with the ContentMine tools, and automatically process them to find relations between data, probably by analysing sentences with tools such as (a modified) OSCAR, (a modified) ChemicalTagger, simply regular expressions, or, if it proves necessary, a more advanced NLP tool like SyntaxNet.
For example, an article that would work well here is Activation of defence pathways in Scots pine bark after feeding by pine weevil (Hylobius abietis). It shows a nice interaction network between Pinus sylvestris, Hylobius abietis and chemicals in the former related to attacks by the latter.
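The regular-expression route can be sketched in a few lines. The sketch below is a minimal, hypothetical example of sentence-level co-occurrence matching: the species and chemical dictionaries, the example sentence, and the `find_relations` helper are all illustrative assumptions, not part of the actual project code.

```python
import re

# Hypothetical dictionaries; in practice these would come from
# curated species and chemical name lists.
SPECIES = ["Pinus sylvestris", "Hylobius abietis"]
CHEMICALS = ["diterpene", "stilbene"]

# A made-up sentence of the kind found in the example article.
sentence = ("Feeding by Hylobius abietis increased stilbene "
            "levels in Pinus sylvestris bark.")

def find_relations(text, species, chemicals):
    """Return (species, chemical) pairs that co-occur in one sentence."""
    relations = []
    for sp in species:
        if re.search(re.escape(sp), text):
            for ch in chemicals:
                if re.search(re.escape(ch), text, re.IGNORECASE):
                    relations.append((sp, ch))
    return relations

print(find_relations(sentence, SPECIES, CHEMICALS))
```

Co-occurrence alone over-generates relations, which is exactly why the proposal mentions stepping up to tools like ChemicalTagger or SyntaxNet when plain regex proves too coarse.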
The second part is to write a program to convert all data to a standardised format. The third part is to use the data to make a database. Because the relation between found data is known, it will have a structure comparable to Wikidata and similar databases. This will be shown on a dynamic website, and when the data is reliable and the error rate is small enough, it may be exported to Wikidata.
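A standardised format with known relations between data points can be as simple as subject–predicate–object statements with provenance, loosely mirroring Wikidata's statement model. The record shape and the example values below (including the placeholder DOI) are purely illustrative assumptions.

```python
# A minimal, hypothetical standardised record: each extracted fact
# becomes a subject-predicate-object statement plus its source, so
# the structure maps naturally onto Wikidata-style statements.
def to_statement(subject, predicate, obj, source):
    return {
        "subject": subject,
        "predicate": predicate,
        "object": obj,
        "source": source,
    }

stmt = to_statement("Pinus sylvestris", "produces", "stilbene",
                    "placeholder-source-id")
print(stmt["subject"], stmt["predicate"], stmt["object"])
```

Keeping provenance on every statement is what would later make a Wikidata export defensible: each claim can point back to the paper it was mined from.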
High-throughput genomics has fundamentally changed how life scientists investigate their research questions. Instead of studying a single candidate gene, we are able to measure DNA sequences and RNA expression of every gene in a large number of genomes. This technology advance requires us to become familiar with data wrangling, programming, and statistical analysis. To meet new challenges, bioinformaticians have created a large number of software packages to deal with genomic data.
For example, a molecular biologist could nowadays sequence a full human genome that results in billions of very short and randomly located DNA sequences. Such short DNA sequences (e.g., ~hundreds base pairs) must be aligned, so that we can infer a full genome that is over 3 billion base pairs. Among many available computational methods, it is often difficult to know the most appropriate tool for one’s specific need. I propose to track the usage of bioinformatics tools used in sequence alignment, variant calling, and other genomic studies.
This work will lead to a new kind of review that is interactive and dynamic. Eventually, molecular biologists will have a convenient portal that quantitatively summarizes the trends of computational and statistical methods in genomics. Furthermore, the source code will be published on GitHub, so that analyses of other research trends can be replicated and extended.
“Our goal is to mine facts from global health research and provide automated referenced summaries to practitioners and agents who don’t have the means or the time to navigate the literature.
Plus, for extreme situations where decisions can’t wait for new research to be confirmed, we’ll do our best to highlight which preliminary facts, as published in the latest conferences, share more traits with previously confirmed results.”
Hi! My name is Ale, I was born and raised in Brazil, where I also obtained my PhD in Physics and later participated in large-scale projects in epidemiology and public health. I currently live in France, where I work at LISIS (Interdisciplinary Laboratory on Sciences, Innovations and Societies) in a project about the evolution of the field of oncology.
I am extremely happy to join this first cohort of ContentMine Fellows. I participated in a ContentMine workshop in 2014 and have been following the progress of the project ever since, looking for an opportunity to collaborate which now materializes.
The research field of my proposal is Global Health, and our goal is as stated in the opening paragraph. It should be noted that I am not alone, and my interest comes from a group formed during OpenCon 2015 by Neo Chung — who serendipitously was also awarded a ContentMine fellowship under an entirely different subject.
Being a fellow will enable me to develop this proposal within a motivating community, with my lab better understanding my time away from my primary job. It will also accelerate my learning and contributing to the platform, and hopefully will help attract other researchers excited by this idea!
My research focuses on understanding in vivo modelling of depression using systematic review and meta-analysis.
Systematic review is an incredibly useful tool to assess the relevant literature and achieve a clear overview of a field, which is becoming increasingly difficult as the number of papers published in scholarly journals grows exponentially (Bornmann & Mutz, 2014, www). It can also provide better understanding of the laboratory methods used to induce the condition, the range of outcome measures used to assess depressive-like phenotypes, and the variables that might impact on the efficacy of different treatments (de Vries et al., 2011, www).
However, systematic reviews are time consuming and often not produced quickly enough to inform the field before they need to be updated (Tsafnat et al., 2014, www). Automation techniques such as machine learning and text mining can aid the systematic review process and reduce the workload at various stages, mainly screening and data extraction (Jonnalagadda et al., 2013, www).
The methods of model induction, outcome measures, treatments tested, and study quality in preclinical investigations of depression are of key interest in this project. I aim to use ContentMine to help investigate these key areas by using dictionaries, such as genera and species, to aid with document classification and clustering. Identifying the language used when authors report measures to reduce bias in their experiments, using regular expressions, can improve document tagging and data extraction for study quality measures. Identifying these key features in papers can aid the systematic review process by making grouping easier, for example grouping similar models, outcome measures, or animal species, for follow-up analyses. These tools increase the efficiency with which systematic reviews are carried out, shortening the time from search date to publication, and thus allow more up-to-date reviews to be produced. All measures to automate and streamline the systematic review process can reduce human workload, decrease costs associated with human time and expedite scientific advances.
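The regular-expression tagging of bias-reporting language can be sketched as follows. This is a hypothetical illustration: the two patterns below are simplistic stand-ins, and a real project would develop and validate its patterns against a hand-annotated corpus.

```python
import re

# Hypothetical patterns for risk-of-bias reporting language.
BIAS_PATTERNS = {
    "randomisation": re.compile(
        r"\brandom(ly|ised|ized|isation|ization)\b", re.IGNORECASE),
    "blinding": re.compile(r"\bblind(ed|ing)?\b", re.IGNORECASE),
}

def tag_bias_reporting(text):
    """Report which risk-of-bias measures a methods passage mentions."""
    return {name: bool(p.search(text)) for name, p in BIAS_PATTERNS.items()}

# Invented methods-section snippet for illustration.
methods = ("Animals were randomly allocated to groups and outcome "
           "assessment was performed by a blinded investigator.")
print(tag_bias_reporting(methods))
```

Tags like these can then feed the grouping step described above, letting papers be clustered by study quality alongside model and species.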
I have just submitted my PhD in Biomedical Sciences at VIB and Ghent University (Belgium), with a thesis entitled “An open data exchange ecosystem: forging a new path for cell migration data analysis and mining”. During my PhD I have tried to make cell migration research more ‘open’: I have developed computational open source tools and algorithms for the storage, management, dissemination and analysis of cell migration experiments. Furthermore, I have tried to push this ‘open’ concept a bit beyond my own PhD, and have succeeded in engaging a few researchers in this fight: I am now working in MULTIMOT, an EU-H2020 funded project that aims to build an open data ecosystem for cell migration research, with the ultimate goal to increase reproducibility, and allow analyses to take place on rich datasets that would otherwise remain unused.
As a ContentMine fellow, I want to text mine the literature around cell migration and invasion, because I believe that there is a huge amount of information in all the papers continuously published in the field, and that this information simply cannot be processed by the human eye. Specifically, text mining cell migration articles will hopefully help with the following tasks:
- automatically detect the core information reported when describing experiments in the field, and from this construct a collection of minimum reporting requirements. These requirements can then be used to aid experimental and computational reproducibility.
- check for nomenclature consistency, the use of common terms or ontologies to describe the same concept, again with the goal to increase reproducibility, and allow meta-analyses to take place.
- construct a knowledge map that could capture the current status of information in the field, especially in terms of cell motility-related compounds and (cancer) cell lines.
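The nomenclature-consistency task amounts to mapping the varied terms authors use onto preferred labels. The sketch below is a minimal illustration; the synonym table is invented, and in practice the mapping would come from a community ontology rather than a hand-written dictionary.

```python
# Hypothetical synonym table mapping author terms to preferred labels.
SYNONYMS = {
    "cell motility": "cell migration",
    "cell locomotion": "cell migration",
    "invasiveness": "cell invasion",
}

def normalise(term):
    """Map a mined term to its preferred label, case-insensitively."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)

print(normalise("Cell locomotion"))
```

Normalised terms make the downstream goals tractable: meta-analyses can aggregate over one label instead of three, and the knowledge map gets one node per concept rather than one per spelling.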
Two ContentMine team members featured in Nature News this week reporting difficulties they have faced in text mining Wiley publications due to ‘Trap’ URLs, designed to catch people using automated downloading.
Full article: ‘Publisher under fire for fake article webpages’
Richard Smith-Unna, lead developer of ContentMine’s getpapers and quickscrape tools, encountered the issue while undertaking research that is fully supported by the UK copyright exception for text and data mining and of which his university librarians had been informed. Ross Mounce, lead researcher on the ContentMine-supported PLUTo project, has encountered similar issues before and states:
“I’m worried we’re seeing the beginning of an evolutionary arms race between legacy publishers and researchers.”
The methods employed were criticised as heavy-handed and unsophisticated by several commentators, with one librarian stating that they find the behaviour concerning “because it demonstrates that supporting research is not the chief priority of these publishers.”
ContentMine continues to support the rights of researchers to build on our collective scientific knowledge through mining the academic literature and to call out barriers to that mission.