[I’ll publish a second blog explaining how CrossRef and ContentMine will be working together. This is about the mechanics of use, which are intricate.]
Yesterday I travelled to CrossRef to explore how we can work together to mine the literature for science.
CrossRef describes itself as follows: “We are Crossref, a not-for-profit membership organization for scholarly publishing working to make content easy to find, link, cite and assess. We do it in five ways: rallying the community; tagging the metadata; running a shared infrastructure; playing with new technology; and making tools and services to improve research communications.”
CrossRef plays a critically important role in scholarly infrastructure. It receives notification of (most) scholarly publications (e.g. journal articles), manages the metadata, and provides ways for everyone to access the metadata. Geoff Bilder runs CrossRef and last week at OpenCon gave a sensationally memorable talk about how scholarly infrastructure should be Open/Free. Geoff and I see eye-to-eye on everything that matters and we are very fortunate that he is running this service.
I learnt a great deal about CrossRef yesterday. It’s effectively a publishers’ organization (though libraries etc. can join) and numerically it’s dominated by a long tail of small publishers. There’s a board, which is however dominated by large publishers. Unlike (say) STMPublishers Association or CopyrightClearanceCenter, whose primary effect is to support publishers, CrossRef provides a real interface between publishers and readers (people and machines who read the literature).
CrossRef provides technology and help on how to access the literature. I warn potential users that they should read very carefully, as there are publisher click-through licences that cannot be renounced. IMO this is very serious: people may click these without realising it, and then be bound for all time. I have not consciously clicked any and will not consciously click any. I spent quite a lot of time yesterday exploring what CrossRef requires for the use of its service and what is added by publishers. (I also query whether individuals can legally sign a click-through that relates to an organizational subscription – has your library authorised click-through for you?)
The CrossRef service provides a RESTful API (see their Github repository) which has a rich set of options, based essentially on metadata (titles, journals, authors, etc.) and not on fulltext, e.g. (from their documentation):
Multiple filters can be specified in a single query. In such a case, different filters will be applied with AND semantics, while specifying the same filter multiple times will result in OR semantics – that is, specifying the filters:
would locate documents that are updates, were published on or after 3rd March 2014 and were funded by either the National Science Foundation (10.13039/100000001) or the National Heart, Lung, and Blood Institute (10.13039/100000050). These filters would be specified by joining each filter together with a comma:
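The comma-joined form described above can be sketched as follows. This is a minimal illustration under stated assumptions: the filter names `is-update`, `from-pub-date` and `funder`, and the endpoint `https://api.crossref.org/works`, are not spelled out in this post and should be checked against CrossRef's own documentation before use. The helper `build_filter` is a made-up name.

```python
from urllib.parse import urlencode

# Assumed endpoint; it is not given in this post.
WORKS_ENDPOINT = "https://api.crossref.org/works"


def build_filter(filters):
    """Join (name, value) pairs into one comma-separated filter string.

    Per the quoted documentation, distinct filter names are combined
    with AND, while a repeated name (here `funder`) is combined with OR.
    That combination happens on the server; this function only builds
    the string.
    """
    return ",".join(f"{name}:{value}" for name, value in filters)


# Filter names below are assumptions, not taken from this post.
example_filters = [
    ("is-update", "true"),             # documents that are updates
    ("from-pub-date", "2014-03-03"),   # published on or after 3 March 2014
    ("funder", "10.13039/100000001"),  # National Science Foundation
    ("funder", "10.13039/100000050"),  # National Heart, Lung, and Blood Institute
]

filter_string = build_filter(example_filters)
# Keep ':', ',' and '/' unescaped so the URL stays readable.
query_url = WORKS_ENDPOINT + "?" + urlencode({"filter": filter_string}, safe=":,/")

print(filter_string)
print(query_url)
```

Nothing here sends a request or involves a licence or token; it only shows how several filters collapse into a single `filter` parameter.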
Before proceeding I’ll try to explain what I think is the current position and then correct this document if I am wrong in details.
An “API” is an agreed specification for retrieving information from a server, normally with a URL that implements this specification. An API is often the best technical way of accessing information but a considerable amount has to be taken on trust (“am I seeing the whole data?”, “am I anonymous to the provider?”, “how stable is the information over time?”). I am, for example, prepared to use EuropePMC’s API; I am not prepared to use Elsevier’s because it requires unacceptable licence conditions.
A licence is a legally binding contract between parties. My understanding is that CrossRef does not require anyone to sign CrossRef licences and I have not done so. Licences are a major sticking point between miners and publishers. This was the impasse in “Licences4Europe” and it has not been resolved. I, along with other authors and signatories to the Hague Declaration, believe that “The Right to Read is the Right to Mine”. Many publishers wish to impose licences that limit what and how we can mine, and may also allow charges and quotas to be imposed.
CrossRef is effectively acting as an agent for a number of publishers in providing services where miners may (not necessarily must) sign licences through click-through buttons. You should distinguish very carefully who has issued the licence and whether you should sign it.
Publishers have seriously confused APIs and Licences. They promote APIs as benefitting the miner, while omitting to point out that the miner has to sign an additional licence where they give up some or all of their rights. Many publishers also imply or state that it is illegal to scrape their landing pages and that the miner must use their API and therefore sign licences. Since Licences are very prominent on CrossRef’s site I warn miners that they should always find out whether they are mandatory, and if so challenge and refuse to sign them.
A (licence) token is a key that allows a particular miner to access content on a publisher’s site. This normally operates when the content is paywalled. The miner can obtain a token by:
specifying the publisher that they wish to mine
identifying themselves and/or their institution to (or through) CrossRef so their right to access can be checked by the publisher
receiving and storing the (multi-character) token – effectively a machine-readable key. I am not clear how long tokens live for. The token is then included in the query to a given publisher.
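The last step, including a stored token in a request to a publisher, might look roughly like this. It is a sketch only: it assumes the token travels as an HTTP request header, and the header name `CR-Clickthrough-Client-Token` is an assumption that this post does not confirm. The function name and the token value are invented for illustration.

```python
# Assumed header name; this post does not say how the token is transmitted.
TOKEN_HEADER = "CR-Clickthrough-Client-Token"


def fulltext_request_headers(token=None):
    """Return the extra HTTP headers for a full-text request.

    With no token the request carries nothing extra; with a token the
    stored key is attached so the publisher can check the miner's
    right to access.
    """
    headers = {}
    if token:
        headers[TOKEN_HEADER] = token
    return headers


# Placeholder value, not a real token.
print(fulltext_request_headers("EXAMPLE-TOKEN-0000"))
print(fulltext_request_headers())
```

Keeping the token optional mirrors the point that a miner may, but need not, obtain one.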
A researcher/miner therefore creates a query (without tokens) that expresses what they want and possibly adds tokens if required. The CrossRef API then performs the query and can return a number of fields (I am limiting discussion to journal articles):
a list of URLs (DOIs) that point to the fulltext of target articles and allow them to be accessed through the publisher-site. These URLs may represent a different access point from those from the landing page (web page). In principle the content from that access point should be the same as from the landing page (this is certainly not true for Elsevier at present, where PDFs and images are only available from the landing page, and XML only from the miningAPI).
The licence/s for that article (if a licence specifically exists)
The token allowing the miner to access the full text
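To make the returned fields concrete, here is a hedged sketch of pulling the full-text URLs and licence URLs out of one result record. The field names `DOI`, `link` and `license` are assumptions about the JSON that the CrossRef API returns, and the record itself is invented, so check both against a real response.

```python
def summarise_work(item):
    """Collect the DOI, full-text URLs and licence URLs from one record.

    Missing `link` or `license` fields are treated as empty lists, which
    is what a miner would see for an article with no explicit licence.
    """
    links = [link["URL"] for link in item.get("link", [])]
    licences = [lic["URL"] for lic in item.get("license", [])]
    return {"doi": item.get("DOI"), "fulltext_urls": links, "licences": licences}


# Invented record for illustration only; not real CrossRef data.
sample_item = {
    "DOI": "10.5555/example",
    "link": [
        {
            "URL": "https://publisher.example/fulltext/example.xml",
            "content-type": "application/xml",
        }
    ],
    "license": [{"URL": "https://publisher.example/tdm-licence"}],
}

summary = summarise_work(sample_item)
print(summary)

# A record with no link or licence information still yields a usable summary.
print(summarise_work({"DOI": "10.5555/other"}))
```

A miner would still need to read each licence URL before fetching anything; the code only surfaces what is attached to the record.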
If this sounds complex, that’s because it is, and it depends critically on details. It’s also not something that most people are involved in. To recap:
If the miner is prepared to sign away some of their rights: they list the publishers whose licences they are prepared to sign, create a query, receive the resulting metadata (bibliography, titles, authors, link to landing pages?) and fulltext URLs, and then download or otherwise search the fulltext on the publishers’ sites. They must remember that the licences may severely restrict what they can do at all stages of the process.
If the miner is not prepared to sign away their rights: they submit a query as above and (I believe) get back the same list of metadata, but without the licences and tokens. They are then technically able to use the URLs to scrape the publishers’ landing pages. Whether the publishers will try to stop them by legal and technical means we shall probably find out.
There is no formal limit on how and why the miner can use CrossRef’s services.