How ContentMine at Cambridge will use CrossRef’s API to mine Science

I’ve described how CrossRef works – now I’ll show how ContentMine will use it for daily mining.

ContentMine sets out to mine the whole scientific literature for “100 million facts”. Until now we’ve been building the technical infrastructure, challenging for our rights, understanding the law, and ordering the kit. We’ve built and deployed a number of prototypes. But we are now ready to start indexing science in earnest.

Since ContentMining has been vastly underused, and because publisher actions have often chilled researchers and libraries, we don’t know in detail what people want and how they would tackle it. We think there are many approaches – here are a few:

  1. Daily examination of the complete daily literature (“current awareness”).

  2. Query-based search of part of science and medicine. Thus a researcher might wish to study “obesity and smoking”. This is often likely to go back several years.

  3. Search for entities and identifiers. Which papers report clinical trial data? What diseases are reported in Liberia?

  4. Search for associated data files.

(2-4) will be specific to the researcher, but (1) is general and we plan to do that. We’ll set up a set of search filters and apply them to every new paper that appears. For Open papers we can do anything; for Closed papers it has to be personal non-commercial research. A typical filter is for endangered species (I am personally concerned about endangered species and see it as a valid research topic). We index all papers in PLoS ONE and BMC for species and then look those up in the IUCN “red list”.

This filter currently aggregates all Open articles. Here’s our favourite species, Ursus maritimus, and when we search for it we find a key paper. But this can be extended to Closed articles. So we can search the full text of every paper for “Ursus maritimus” (and the other ca. 40,000 species on the IUCN list). Doing that over all papers (perhaps 150 million) is a problem, but it’s straightforward to do it for each day.
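A minimal sketch of such a species filter, assuming each paper is already available as plain text and the red-list names are held in a Python set. The three-name list and the function name `find_species` are illustrative assumptions, not ContentMine's actual code.

```python
import re

# Illustrative subset only; the real IUCN list has ca. 40,000 species.
RED_LIST = {"Ursus maritimus", "Panthera tigris", "Gorilla beringei"}

# Candidate binomials: capitalised genus followed by a lower-case epithet.
BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]+\b")

def find_species(text, names=RED_LIST):
    """Return the sorted red-list names that occur in the text."""
    # Collapse whitespace so a name split across a line break still matches.
    flat = " ".join(text.split())
    candidates = {m.group(0) for m in BINOMIAL.finditer(flat)}
    return sorted(candidates & set(names))
```

Extracting candidate binomials once and intersecting them with the name set avoids running 40,000 separate searches over every paper.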

CrossRef estimate there are about 6000 journal articles a day. Spread over the 1440 minutes in a day, that is about 4 articles per minute across all publishers combined, and ~1 per minute at worst for any single publisher, even Elsevier. That won’t break any publisher servers. So we’ll get all the daily papers and search them for species. We can store them on disk temporarily, then extract the polar bears, and then delete the files after a reasonable time.
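The "store temporarily, then delete" step could look something like the sketch below. The 7-day retention period and the name `purge_old_files` are my assumptions for illustration; the post only says "a reasonable time".

```python
import time
from pathlib import Path

def purge_old_files(directory, max_age_days=7, now=None):
    """Delete files under directory older than max_age_days; return their names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    removed = []
    for path in Path(directory).rglob("*"):
        # Compare each file's last-modified time against the cutoff.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Run once a day after the filters have finished, this keeps only the most recent downloads on disk.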

So a daily search of papers is a trivial workload and a trivial impact on publisher servers. I’ve seen humans scanning several papers per minute.

CrossRef have even set up a template for polar bears. It includes the license option, but we’ll ignore that (see previous posts). So the workflow is:

  1. Fetch the URLs

python fetch

  2. Download URLs ./

Output will be in result/<date>. This also contains a copy of the URLs file.
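For readers who want to see what step 1 might involve, here is a hedged sketch against the public CrossRef REST API. It is not the fetch script mentioned above. The choice of filters (`from-created-date`, `until-created-date`, `has-full-text`) and the function names are my assumptions, and the license filter is deliberately left out.

```python
from urllib.parse import urlencode

API = "https://api.crossref.org/works"

def daily_query_url(day, rows=1000, cursor="*"):
    """Build a works query for records created on one day (YYYY-MM-DD)
    that carry full-text links."""
    filters = (
        f"from-created-date:{day},"
        f"until-created-date:{day},"
        "has-full-text:true"
    )
    params = {"filter": filters, "rows": rows, "cursor": cursor}
    return API + "?" + urlencode(params)

def fulltext_urls(response):
    """Collect full-text link URLs from a parsed CrossRef JSON response."""
    urls = []
    for item in response.get("message", {}).get("items", []):
        # Records without a "link" field simply contribute nothing.
        for link in item.get("link", []):
            urls.append(link["URL"])
    return urls
```

The JSON returned for `daily_query_url(...)` would be parsed and passed to `fulltext_urls`, and the resulting list written out as the URLs file for step 2.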

We can also put this under a GUI wrapper.

Of course there will be other queries than polar bears, so rather than download the papers again for each query, we can batch the queries together (as long as they are PMR’s personal non-commercial research). Or PMR can copy the files temporarily to his non-commercial research disk and re-run the non-commercial research query to do research.

[I’m not being silly with the language. If I did someone else’s research for them, I might be challenged. You see how content-miners have to think always of the law first and science second. And how this could lead to flawed science planning. However I am very happy (a) to do joint personal research in any field that involves content-mining and (b) help any Cambridge scientist to do her non-commercial research on our joint Cambridge system.]

And we can publish the factual output. Everything in it is either a fact or a snippet of < 200 characters surrounding the fact. We point to the original paper – that’s sufficient acknowledgement – and publish snippets as CC0. If you can read the original closed paper then you are one of the lucky 0.1% of the world’s population that subscribes to the scholarly literature, or you can pay 40 USD for 24 hrs access with no re-use rights.
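Cutting a snippet of under 200 characters around a fact can be sketched as follows. The record layout (`fact`, `snippet`, `source`) and both function names are hypothetical, chosen only to illustrate the idea.

```python
def snippet(text, term, max_len=199):
    """Return at most max_len characters of context centred on the
    first occurrence of term, or None if term is absent."""
    pos = text.find(term)
    if pos < 0:
        return None
    # Split the remaining character budget evenly on both sides of the term.
    pad = max(0, (max_len - len(term)) // 2)
    start = max(0, pos - pad)
    end = min(len(text), pos + len(term) + pad)
    return text[start:end]

def fact_record(doi, text, term):
    """Bundle a fact, its snippet, and a pointer to the original paper."""
    return {"fact": term, "snippet": snippet(text, term), "source": doi}
```

The default of 199 keeps every snippet strictly below the 200-character limit, and the `source` field carries the pointer back to the original paper.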

So how many fact-types will be extracted per day? That’s up to you. I’ll do about 10 – species, sequences, chemistry, phylogenetics, word frequencies, templates, clinical trials … all things where I am a legitimate expert accepted by the scientific world. As a responsible scientist I am required to publish these facts and to make them CC0.

So how many facts? Let’s say 10 per paper? You do the sums…
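For anyone who would rather not do the sums by hand, here they are, using only the figures quoted in this post. The time to reach 100 million facts is my own extrapolation and assumes the daily rate stays constant.

```python
ARTICLES_PER_DAY = 6000        # CrossRef's estimate
FACTS_PER_PAPER = 10           # the guess above
TARGET_FACTS = 100_000_000     # the "100 million facts" goal

facts_per_day = ARTICLES_PER_DAY * FACTS_PER_PAPER
days_to_target = TARGET_FACTS / facts_per_day

print(facts_per_day)           # 60000
print(round(days_to_target))   # 1667
```

So on these assumptions, the daily stream alone yields 60,000 facts a day and would take roughly four and a half years to reach the target.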

So we’ve ordered servers capable of managing the computational and disk loads for this daily activity.

And Cambridge could become an example site for the Hargreaves-enabled UK and an inspiration for reformers in the EU to get the laws changed.


Published by

the bear

I have another blog in real life...
