Elsevier’s new API approach to Content Mining should be avoided by all Librarians

Yesterday Elsevier updated its approach to Text and Data Mining. This is a rapid response. Elsevier’s material is at http://www.elsevier.com/connect/how-does-elseviers-text-mining-policy-work-with-new-uk-tdm-law and is italicised here. My emphasis in Elsevier’s text is [thus]. My comments are interleaved.

TL;DR [summary] Elsevier’s new approach is unnecessary and should be avoided by all libraries and researchers.

Some arguments below suggest that mining is better and easier with Elsevier’s APIs. This is an untested assertion. There are many Free and Open tools that can mine content. Unix tools are quite satisfactory and we have developed Free and Open content mining tools at http://contentmine.org . 

There is a SUMMARY at the end.

How does Elsevier’s text mining policy work with new UK TDM law?

By Gemma Hersh | Posted on 9 June 2014

In January, Elsevier announced a new text and data mining policy, which allows academic researchers at subscribing institutions to text mine subscribed content for non-commercial research purposes.

PMR: we and others showed that this was deeply and utterly flawed and contained many clauses which were solely for Elsevier’s benefit.

Last week, a new UK text and data mining copyright exception came into force which allows researchers with lawful access to works to make copies of these for the purposes of non-commercial text and data mining. Accordingly, it’s a good opportunity to reflect on how our policy and the exception work together.

PMR: YES, and my blog posts will reflect on it. Note that I have lawful access to all works I want to mine for my non-commercial research processes.

Elsevier and the UK TDM copyright exception

A new UK text and data mining copyright exception came into force on June 1st. What is it and how do Elsevier’s systems accommodate this requirement?

  • An exception to copyright is when someone is allowed to copy a work without seeking the permission of the rights holder. In this instance, researchers with lawful access to works published by Elsevier can copy these without asking,  [using tools we have provided for this purpose], provided they are doing the copying to carry out non-commercial text and data mining.

PMR: The highlighted phrase is completely spurious. We can copy the material with OUR tools, which are Open, or with anyone else’s, such as GNU/Linux tools. Many readers may misread this phrase as part of the legislation. It is FUD and its introduction is completely irresponsible.

  • Elsevier offers an Application Programming Interface (API) to facilitate text and data mining of content held on Science Direct. This API makes the process [easier and more efficient] for researchers compared to manual downloading and mining of articles. It also helps us to provide a good experience to human readers and to miners at the same time.

PMR: This is an untested assertion, written from a marketing perspective rather than based on an actual study. It is unlikely to be easier than Free/Open tools, which work for all publishers’ output.

  • Under the UK legislation, publishers can use “reasonable measures to maintain the stability and security” of their networks, and so the [requirement to use] this API is fully compatible with the copyright exception. 

PMR: So this appears to be a MANDATORY API; if we do not use it Elsevier will take action. This is INCOMPATIBLE with the new legislation, which allows miners to ignore restrictions imposed by publishers.

  • Our approach to TDM remains under review and continual refinement. We have already made changes based on [researcher feedback during our pilot] and will continue to do so in order to support researchers.

PMR: Where is this “researcher feedback”? NAME them and publish the full details. No one has consulted me or many of the other proponents of unrestricted mining under the law. It’s always possible to find someone who will provide support for some case, but that’s neither scientific nor responsible.

  • We believe text and data mining is important for advancing science, and we are keen to provide tools to support researchers who wish to mine no matter where they are located.

PMR: This is vacuous marketing mumble.

Related resources

Elsevier has provided [text and data mining support for researchers since 2006].

PMR: Not for me. I spent years trying to get a reasonable approach. https://blogs.ch.cam.ac.uk/pmr/2011/11/27/textmining-my-years-negotiating-with-elsevier/

We designed our policy framework to span across all legal environments as research is global, and this framework complements the UK exception. Since the beginning of the year, in accordance with our policy, we have started to include text and data mining rights for non-commercial purposes in all new ScienceDirect subscription agreements and upon renewal for existing academic customers. [The UK law adds weight to our position; we are ensuring that those with “lawful access” (in UK legislation speak) have the right to mine our works].

PMR: The UK law allows ME to mine Elsevier content WITHOUT the rights included in contracts. Read those clauses carefully, LIBRARIANS. It is highly likely that you will be giving up some of MY rights.

Contrary to what some have suggested, [our policy was not designed to undermine library lobbying for copyright exceptions for text and data mining], but rather to position us to continue to offer flexible and scalable solutions to support researchers no matter where they are based.

PMR: Last year the massed mainstream publishers INCLUDING ELSEVIER fought against the European libraries, funders, JISC, SURF, etc. to require licences for content mining. “Licences 4 Europe”. The talks in Brussels broke down. Neelie Kroes stated that licences were not the answer.

PMR: So it was all a misunderstanding? Elsevier wasn’t fighting us? Orwell calls this DOUBLESPEAK. Just reading the previous sentence should convince you that publishers are not “our partners”.

What the law alone cannot do – in the UK or elsewhere – [is resolve some of the technical sticking points that often frustrate a researcher’s mining experience]. That’s why our policy facilitates text mining via an Application Programming Interface (API).

PMR: The Free/Open software can already deal with the technical sticking points.

The advantages of using APIs for text mining

As users of many popular websites will know, it is standard best practice for users (well, their machines) to be asked to use APIs or other download mechanisms when the website in question holds a lot of content. That’s the case with ScienceDirect, which holds over 12.5 million articles and almost 20,000 books, and we are among many other large platforms, including Wikipedia, PubMed Central and Twitter, in asking for our API to be used for downloading and mining content. We do this to provide researchers with an optimum text mining experience.

PMR: Wikipedia and PubMedCentral (on whose advisory board I am) have public and democratic approaches to governance and control. Elsevier’s API is developed without any significant community input. If I saw an Elsevier API Advisory Board, with public minutes and transparency of the stature of PubMedCentral’s, I would be prepared to engage.

PMR: APIs also allow websites to monitor (snoop on) who uses the API, for what purpose and when. They also allow the provider to present the particular view (often limited or distorted) that they wish to promote.

For starters, access via the API provides full-text content of ScienceDirect in XML and plaintext formats, [which researchers tell us they prefer to HTML] for mining.

PMR Weasel words (Wikipedia term). I (PMR) find good standards-conformant HTML totally acceptable and often superior. I will be happy to report publicly whether Elsevier’s HTML is standards-conformant.

Similarly, experience in our pilots has indicated that text miners prefer API access for automated text mining for several other reasons, one being that content is available from our APIs without all of the extraneous information that is added to web pages intended for human consumption but which make text mining more difficult (e.g., presentational JavaScript, navigational controls and images, website branding, advertisements). Access via our API also provides content to researchers in stable, well-documented formats; by contrast, HTML coding can change at any time, making it arduous to keep “screen-scraping” scripts up to date.

PMR: Human readers are no doubt clamouring for the extraneous information, yearning for website branding, and reading the site for the advertisements. Our content mining tools can avoid this clutter.
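To make the point about clutter concrete, here is a minimal sketch of how a miner might strip scripts, navigation and branding from an HTML page using only the Python standard library. It is an illustration, not the ContentMine code: the names `TextExtractor` and `extract_text` and the list of tags treated as clutter are assumptions, and a real publisher page would likely need a site-specific list.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as clutter. This list is an
# illustrative assumption, not taken from any publisher's markup.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # how many clutter elements we are inside
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every clutter element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string with clutter elements removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `extract_text` on a downloaded page returns the remaining text joined by single spaces; anything more ambitious (sections, figures, tables) would need a proper document model.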

It’s not just text miners who benefit from our API, but users of ScienceDirect who are there to read content rather than download and mine it. Their user experience of ScienceDirect can be maintained at the highest level, as bulk downloading needed for mining is done elsewhere, via our API. If bulk downloading over a short period of time took place on the ScienceDirect site, [the system’s stability would be compromised, affecting researchers of every hue]. By contrast, our API is designed to cope with high-frequency requests from automated bots and crawlers in a very efficient manner which enables us to scale our systems to meet demand.

PMR: I shan’t comment on what human ScienceDirect readers want; Cameron Neylon has already demolished the idea that commercial publishers cannot provide robust servers for all types of use.

PMR: I do not understand why the hue (=colour) of researchers is important. In the UK and many other countries this is objectionable language and should not appear on a reputable publisher’s site. Please apologise and remove or I shall report this.

The Explanatory Notes published alongside the UK legislation make clear that publishers are able to impose “reasonable measures to maintain the stability and security” of their networks, as long as researchers are able to benefit from the exception to carry out non-commercial research. In other words, researchers with lawful access to works can copy these for the purposes of non-commercial text and data mining, and publishers have a role to play in managing this process. [1] [The “reasonable measures” include requesting that miners carry out text mining via a separate API], in line with Elsevier’s existing policy, and we have received numerous reassurances from the UK Government [2] [that use of our API will be in compliance with the law].

PMR [1] You may request but you may not require.

PMR [2] And ignoring your API is ALSO in compliance with the law.

PMR. If the law is interpreted as “the publisher decides whether an activity is compliant with the law” then the law is pointless.

We will continue to monitor how our API is used and to make tweaks and changes to our policy in response to community feedback. We have already made several adjustments. For example, we no longer request a project description as part of the API registration process, and we now allow TDM output to be hosted in an institutional repository. We also know, for example, [that researchers would like to mine third-party images and graphics that they cannot currently download automatically via our API].

PMR: Yes. I would like to mine images and I will mine images. If Elsevier does not provide images through their API this is an unassailable argument for getting them directly from the website as the law allows.

[We of course make this content available to researchers on request],

PMR You didn’t (“of course”) make anything available to me during the three years I “negotiated” with you.

 

... but we are looking at how we might ensure that the rights of [third-party content owners] are respected whilst at the same time providing researchers with all of the content they want immediately via our API.

PMR. More FUD. We have a complete right to mine third-party content as well. Elsevier’s “ensuring rights” is a process that is of indeterminate duration.

And we are a signatory to the new CrossRef Prospect text and data mining service, which aims to allow researchers to mine content from a range of publishers through one single portal.

PMR CrossRef is set up by publishers and guided by the publishers who finance it.

Further, we’re looking at how we ensure that researchers [know what they can and cannot do with content, or where to go for further information], without giving the impression that we are claiming ownership over non-copyrightable facts and data.

PMR: I know what I can do and where I can go without Elsevier’s help. And it’s likely that miners may choose to come to http://contentmine.org and similar community sites for information provided by the community for the community.

 

We’ve already altered our output terms, so that researchers can redistribute 200 characters in addition to text entity matches; [researchers] told us that our previous inclusion of text entity matches within that 200 character limit sometimes caused problems when displaying lengthy chemical formulas.

PMR “Researchers” was actually me. It’s polite to credit sources.
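For readers wondering what the revised output term means in practice, here is a minimal sketch under one reading of it: up to 200 characters of surrounding text may be redistributed, and the matched entity itself (for example a long chemical formula) does not count against that limit. The even split of context before and after the match, and the names `snippet` and `snippets_for`, are assumptions made for illustration; Elsevier’s terms as quoted here do not specify them.

```python
import re

# Characters of surrounding context allowed, NOT counting the match itself
# (one reading of the output terms quoted above).
CONTEXT_LIMIT = 200


def snippet(text, start, end, limit=CONTEXT_LIMIT):
    """Return the match text[start:end] plus at most `limit` context characters."""
    before = limit // 2       # assumed: half of the context precedes the match
    after = limit - before    # and the rest follows it
    left = max(0, start - before)
    right = min(len(text), end + after)
    return text[left:right]


def snippets_for(pattern, text, limit=CONTEXT_LIMIT):
    """Build one snippet for every regular-expression match in the text."""
    return [snippet(text, m.start(), m.end(), limit)
            for m in re.finditer(pattern, text)]
```

Under the earlier rule as described, the formula would have counted against the 200 characters, so a lengthy formula could use up most of the allowance before any context was shown.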

In short, we will continue to do what we have always done: work with the research community to support their research, listen to feedback and respond to changing needs. Our text and data mining policy is a reflection of this and will continue to evolve accordingly.

PMR More FUD and mumble.

SUMMARY.

  • LIBRARIANS: DO NOT SIGN AWAY ANY RIGHTS.
  • NO-ONE ABSOLUTELY NEEDS ELSEVIER’S API.
  • IF ELSEVIER “MANDATES” AN API WE CAN IGNORE IT UNDER UK LAW.
  • ELSEVIER’S CURRENT API PROVIDES MUCH LESS THAN THE WEBSITE.
  • THERE ARE FREE/OPEN TOOLS THAT ARE AN ACCEPTABLE ALTERNATIVE APPROACH.

Published by

steelgraham

Scotland's (main, but not only) #OpenScience #OpenAccess #OpenData #OpenSource #OpenKnowledge & #PatientAdvocate Loves blogging http://figshare.com/blog Glasgow, Scotland.
