Horizon magazine featured an article on text and data mining and specifically the European Commission proposal for a copyright exception, currently covering “public or private organisations that are carrying out scientific research in the public interest”.
Dr Peter Murray-Rust is director of ContentMine, a not-for-profit organisation which has developed software that enables researchers to search through scientific papers on a particular subject. He gives the example of the Zika outbreak as an area where TDM can help to enhance knowledge.
‘We’re going to need to know a lot more about Zika, and much of it may already be in the scientific literature that’s been published but that we don’t read. We don’t read it because there’s so much, so we’ve built a machine, ContentMine, that will liberate the facts from the literature.’
In the past few weeks I have made a few pages to visualise ContentMine output for articles on several topics. For this to work, I needed to develop a program that converts ContentMine output to a single JSON file, so the facts can be accessed more easily, and then to load that file in an HTML page. These pages currently contain lists of articles, genera and species. You can now quickly glance over a lot of articles and see common recurrences (although currently most “common recurrences” are typos in plant names on the authors’ side).
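A minimal sketch of such a conversion step might look like the following. It assumes each article directory in a ContentMine project folder holds a `results.json` file of extracted facts; the actual layout of ContentMine output differs by tool and version, so the glob pattern and file name here are illustrative.

```python
import json
from pathlib import Path

def merge_facts(cproject_dir, out_file):
    """Collect per-article fact files into one JSON document.

    Assumes each article directory contains a 'results.json' of
    extracted facts (hypothetical layout; adjust to real output).
    """
    merged = []
    for fact_file in sorted(Path(cproject_dir).glob("*/results.json")):
        with open(fact_file, encoding="utf-8") as f:
            facts = json.load(f)
        # Use the directory name (e.g. a PMC ID) as the article key.
        merged.append({"article": fact_file.parent.name, "facts": facts})
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2)
    return merged
```

The single merged file can then be fetched by the HTML page with one request instead of one per article.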
For a more detailed description of my progress, see my blog.
I am a taxonomist, the kind of biologist charged with discovering, documenting and describing life on Earth. I specialize in insects, the most diverse and successful form of life. My ContentMine Fellowship project will focus on mining weevil-plant associations from literature records. I describe my project in the following.
Motivation. Comprising ~70,000 described and 220,000 estimated species, weevils (Curculionoidea) are one of the most diverse plant-feeding insect lineages and constitute nearly 5% of all known animals. Knowledge of host plant associations is critical for pest management, conservation, and comparative biological research. This knowledge is, however, scattered in 300 years of historical literature and difficult to access. This study will use ContentMine to extract, organize and synthesize knowledge of host plant associations of weevils from the literature. I have been doing literature mining manually and generated nearly 700 entries.
My depression project starts from an animal-model, depression-specific search carried out in PubMed and Embase in May 2016, which includes 70,363 unique records after deduplication using EndNote. I am currently evaluating which option is the best way to get my full search (or as much of it as possible) into the ContentMine pipeline.
Here is an outline of the options I am considering and my thoughts:
Wait until the screening section of my systematic review is complete and import only the included studies.
Pros: Fewer papers to deal with, which has computational benefits. Only the most relevant papers would go through the ContentMine pipeline.
Cons: I would not get the annotations produced by the ‘ami’ tool, which I need in order to more thoroughly test machine learning options for screening in systematic reviews.
Use the ‘quickscrape’ function on a list of manually collated URLs.
Pros: Records are smoothly imported into the ContentMine pipeline and in the correct formats.
Cons: Time. It takes a long time to manually collect the correct URLs and group them into categories for the different scrapers. Some records will likely not have a URL; these would slow the process, as a more thorough library search would need to be conducted to find them, or the authors would need to be contacted for the full text. A third possibility is that a record has no electronic copy at all; such records could not be processed by the ContentMine pipeline.
Run a search on EuPMC with the ‘getpapers’ function. I am currently retrieving about 20,000 records using the search string (“depressive disorder” OR “depression” OR “depressive behavior” OR “depressive behaviour” OR “dysthymia” OR “dysthymic” AND “animal”), which was developed with a librarian at the University of Edinburgh to roughly correspond to the more complex original PubMed and Embase searches.
Pros: Records are quickly and easily imported into the ContentMine pipeline.
Cons: I would later need to reconcile the records downloaded with the EuPMC search against the papers included after screening, in order to carry these forward in the systematic review pipeline to the data extraction phase.
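The ‘getpapers’ tool queries the Europe PMC REST API, so the shape of such a search can be sketched independently of the CLI. The helper below builds a boolean query of the kind shown above and the corresponding Europe PMC search URL; the function names and the default `"animal"` clause are my own illustrative choices, not part of either tool.

```python
from urllib.parse import urlencode

# Public Europe PMC REST search endpoint (the same service getpapers uses).
EUPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_query(terms, extra='AND "animal"'):
    """OR together quoted terms and append an extra clause."""
    quoted = " OR ".join(f'"{t}"' for t in terms)
    return f"({quoted} {extra})"

def search_url(query, page_size=25):
    """Return a Europe PMC search URL requesting JSON results."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return EUPMC_SEARCH + "?" + urlencode(params)
```

Fetching `search_url(...)` returns a JSON page of hits, which makes it easy to check the record count against the PubMed/Embase searches before committing to a full download.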
Retrieve PDFs for the records of my search using EndNote and run ‘norma’ to convert the PDFs to text.
Pros: The process of downloading full-text PDFs from EndNote will be carried out in any case, in order to import full-text references into the systematic review database we have at CAMARADES for further data extraction and meta-analysis.
Cons: Possible issues with different PDF formats and therefore with the output of the PDF conversion. Will the converter be able to deal with the typical journal layout of columns (particularly double columns)? How well does it deal with tables and figures? Can legends be retrieved from figures?
Run my PubMed search with the Open Access filter and download the XML versions of the articles using the FTP service. Running my PubMed search with the open access filter retrieves 20,381 records, which is comparable to the number retrieved using the amended EuPMC search with ‘getpapers’.
Considerations: Seeing as the records retrieved are comparable, I do not see much added advantage of using this facility over that of ContentMine EuPMC API.
Use CrossRef to retrieve full text documents, using a list of DOIs. This method can download up to 10,000 full text records at a time.
Pros: Another method of retrieving full text documents in a machine-readable format. It could be faster than ‘quickscrape’, as DOIs have already been located for roughly a third of the records.
Cons: This method has similar issues to the ‘quickscrape’ tool, namely the time it takes to manually collect the correct DOIs; only about a third of the records have corresponding DOIs located thus far.
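For the DOI route, the public CrossRef REST API exposes a `/works/{doi}` record whose optional `link` field lists publisher-deposited full-text URLs. The sketch below separates URL extraction (testable offline) from the network call; the function names and the User-Agent string are my own, and many records simply have no `link` entry, which matches the coverage problem described above.

```python
import json
from urllib.request import urlopen, Request

CROSSREF_WORKS = "https://api.crossref.org/works/"

def fulltext_links(message):
    """Pull (content-type, URL) pairs out of a CrossRef works record.

    The 'link' field is only present when the publisher has deposited
    full-text links, so this often returns an empty list.
    """
    return [(l.get("content-type"), l["URL"]) for l in message.get("link", [])]

def fetch_links(doi):
    """Fetch one CrossRef record and return its full-text links."""
    # A descriptive User-Agent is polite for the CrossRef API.
    req = Request(CROSSREF_WORKS + doi, headers={"User-Agent": "tdm-sketch/0.1"})
    with urlopen(req) as resp:
        record = json.load(resp)
    return fulltext_links(record["message"])
```

Looping `fetch_links` over a DOI list gives a quick picture of how many records actually offer machine-readable full text before any bulk download is attempted.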
If anyone has any input or possible solutions I am unaware of or have not yet considered, please leave a comment down below.
As with the other fellows, I am also experiencing time-out errors when using the ‘getpapers’ function. This issue lies with EuPMC and its API and will be rectified shortly.
In the meantime, while I weigh the advantages and disadvantages of the above options, I am putting together dictionaries to aid annotation of my documents. The dictionaries cover: animal models of depression, molecular and cellular pathways, outcome measures (in particular behavioural, neurobiological and anatomical), and risk-of-bias terms in the animal modelling literature.
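Such dictionaries are small XML term lists. As a sketch, the generator below emits an XML dictionary in the general shape used by ContentMine tooling (a `dictionary` element containing `entry` elements with a `term` attribute); the exact attributes the ‘ami’ tool expects should be checked against its documentation, so treat this shape as an assumption.

```python
import xml.etree.ElementTree as ET

def make_dictionary(title, terms):
    """Build an XML term dictionary from a list of terms.

    Shape assumed: <dictionary title="..."><entry term="..."/>...
    Duplicates are dropped and terms sorted for stable output.
    """
    root = ET.Element("dictionary", title=title)
    for term in sorted(set(terms)):
        ET.SubElement(root, "entry", term=term)
    return ET.tostring(root, encoding="unicode")
```

Keeping the source terms in a plain text file per dictionary (one term per line) makes it easy to regenerate the XML as the term lists grow.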
I have been transcribing host plant association data from Colonnelli’s (2004) catalogue of the Ceutorhynchinae (Fig. 1). With more than 1,320 species, the Ceutorhynchinae is a relatively small (we are talking about weevils!) subfamily of weevils. Enzo Colonnelli worked for more than 20 years to bring this catalogue to fruition.
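Transcribed associations like these are easiest to reuse later if kept in a simple tabular format from the start. The column set below is hypothetical (the catalogue itself dictates what can actually be captured), but a sketch of the record structure might be:

```python
import csv

# Hypothetical column set for one transcribed association record.
FIELDS = ["weevil_species", "host_plant", "reference", "page"]

def write_records(records, fh):
    """Write association records (dicts keyed by FIELDS) as CSV."""
    writer = csv.DictWriter(fh, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
```

A flat CSV like this can later be checked against mined results, or aggregated per host plant genus, without any further parsing.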
The document itself (PDF) is 182 pages long, and the section relating to text and data mining (TDM) can be found on pages 93–108.
4.3. TEXT AND DATA MINING
4.3.1. What is the problem and why is it a problem?
Problem: Researchers are faced with legal uncertainty with regard to whether and under which conditions they can carry out TDM on content they have lawful access to.
Description of the problem: Text and Data Mining (TDM) is a term commonly used to describe the automated processing (“machine reading”) of large volumes of text and data to uncover new knowledge or insights. TDM can be a powerful scientific research tool to analyse big corpuses of text and data such as scientific publications or research datasets.
It has been calculated that the overall amount of scientific papers published worldwide may be increasing by 8 to 9% every year and doubling every 9 years. In some instances, more than 90% of research libraries’ collections in the EU are composed of digital content. This trend is bound to continue; however, without intervention at EU level, the legal uncertainty and fragmentation surrounding the use of TDM, notably by research organisations, will persist. Market developments, in particular the fact that publishers may increasingly include TDM in subscription licences and develop model clauses and practical tools (such as the Cross-Ref text and data mining service), including as a result of the commitments taken in the 2013 Licences for Europe process to facilitate it, may partly mitigate the problem. However, fragmentation of the Single Market is likely to increase over time as a result of MS adopting TDM exceptions at national level which could be based on different conditions.
Four options are suggested on TDM reform.
Option 1 – Fostering industry self-regulation initiatives without changes to the EU legal framework.
Option 2 – Mandatory exception covering text and data mining for non-commercial scientific research purposes.
Option 3 – Mandatory exception applicable to public interest research organisations covering text and data mining for the purposes of both non-commercial and commercial scientific research.
Option 4 – Mandatory exception applicable to anybody who has lawful access (including both public interest research organisations and businesses) covering text and data mining for any scientific research purposes.
The recommendation is for option 3 which allows Public Interest Research Organisations (Universities and research institutes) to mine for Non-Commercial AND Commercial purposes. This appears mainly to support industrially funded research. On the whole it seems to be slight progress.
Option 3 is the preferred option. This option would create a high level of legal certainty and reduce transaction costs for researchers with a limited impact on right holders’ licensing market and limited compliance costs. In comparison, Option 1 would be significantly less effective and Option 2 would not achieve sufficient legal certainty for researchers, in particular as regards partnerships with private operators (PPPs). Option 3 allows reaching the policy objectives in a more proportionate manner than Option 4, which would entail significant foregone costs for rightholders, notably as regards licences with corporate researchers. In particular, Option 3 would intervene where there is specific evidence of a problem (legal uncertainty for public interest organisations) without affecting the purely commercial market for TDM where intervention does not seem to be justified. In all, Option 3 has the best cost-benefit trade-off as it would bring higher benefits (including in terms of reducing transaction costs) to researchers without additional foregone costs for rightholders as compared to Option 2 (Option 3 would have similar impacts on right holders but through a different legal technique, i.e. scope of the exception defined through the identification of specific categories of beneficiaries rather than through the “non-commercial” purpose condition). The preferred option is also coherent with the EU open access policy and would achieve a good balance between copyright as a property right and the freedom of art and science.
I am Lars, from the Netherlands, where I currently live. I applied to this fellowship to learn new things and to combine ContentMine with two previous projects I never got to finish; I got really excited by the idea and by ContentMine at large.
Practically, the project is about collecting data about conifers and visualising it in a dynamic HTML page. This is done in three parts. The first part is to fetch, normalise and index papers with the ContentMine tools, and to automatically process them to find relations between data, probably by analysing sentences with tools such as (a modified) OSCAR, (a modified) ChemicalTagger, simple RegEx or, if it proves necessary, a more advanced NLP tool like SyntaxNet.
The second part is to write a program to convert all data to a standardised format. The third part is to use the data to build a database. Because the relations between the found data are known, it will have a structure comparable to Wikidata and similar databases. This will be shown on a dynamic website, and when the data are reliable and the error rate is small enough, they may be exported to Wikidata.
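As a first pass at the RegEx route mentioned above, a naive matcher for Latin binomials (the species names the conifer data hangs off) could look like this. It is deliberately crude: the pattern over-matches ordinary capitalised phrases, so a real pipeline would check candidates against a genus dictionary.

```python
import re

# Naive Latin binomial pattern: a capitalised genus word followed by a
# lowercase epithet of three or more letters. This over-matches (e.g.
# sentence-initial words like "The cones"); validate against a genus
# list before trusting the output.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s([a-z]{3,})\b")

def find_binomials(sentence):
    """Return candidate 'Genus species' strings found in a sentence."""
    return [f"{genus} {species}" for genus, species in BINOMIAL.findall(sentence)]
```

The false positives this produces are exactly why tools like OSCAR or a genus lookup are worth layering on top of plain RegEx.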