The ContentMine (contentmine.org) has almost finished the infrastructure and software for automatic daily mining of the scientific literature. We hope to start testing in the next few days. I’ll try to post frequent updates.
The software has been developed by the ContentMine Team, wonderfully funded by the Shuttleworth Foundation. The people involved include:
- Mark MacGillivray
- Anusha Ranganathan
- Richard Smith-Unna
- Tom Arrow
- Peter Murray-Rust
- Chris Kittel
- and voluntary contributions
The daily operation (as opposed to user-driven getpapers) consists of:
- DOIs and URLs provided by CrossRef
- downloading software
- indexing of fulltext documents (closed as well as open, legal under the UK “Hargreaves” exception)
- fact extraction
We’ll detail this later.
The sources include:
- open repositories such as Europe PubMed Central
- arXiv and other repositories
- closed documents to which Cambridge University subscribes. We are working intimately with Cambridge University Library staff and offer public applause and thanks.
All closed work will be carried out on closed machines run by the University’s computer officers, primarily in Chemistry, and again public thanks to this wonderful group. We take great care to restrict access so that unauthorised use is not possible, and we keep an audit trail of what we do and have done.
It is difficult to predict the daily volume. MarkMacG has found it to vary between 300 and 80,000 documents a day. My guess is about 2000-7000 on average.
This is NOT a resource problem. The whole scientific literature for a year can be held on a terabyte disk. The processing time is small, perhaps 1000 documents a minute on our system. A whole day's literature can be processed within a long coffee break.
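As a rough check on those figures, here is a back-of-envelope sketch. It uses only numbers quoted in this post (1000 documents a minute, a terabyte disk, the daily volumes above); the function names are made up for illustration, and the 5000 documents/day used for the storage estimate is simply a round figure inside my guessed range, not a measurement.

```python
# Back-of-envelope arithmetic; all inputs are rough figures from the post.
DOCS_PER_MINUTE = 1000           # approximate throughput quoted above
DISK_BYTES = 10**12              # one terabyte (decimal)


def processing_minutes(docs, docs_per_minute=DOCS_PER_MINUTE):
    """Minutes needed to process `docs` documents at a constant rate."""
    return docs / docs_per_minute


def bytes_per_document(docs_per_day, days=365, disk_bytes=DISK_BYTES):
    """Average size each document may have if a year's worth fits on the disk."""
    return disk_bytes / (docs_per_day * days)


if __name__ == "__main__":
    # Quietest day, guessed typical range, and busiest day seen so far.
    for daily in (300, 2000, 7000, 80000):
        print(f"{daily:>6} docs/day -> {processing_minutes(daily):.1f} min")
    # Assumed round figure of 5000 docs/day for the storage estimate.
    print(f"{bytes_per_document(5000) / 1000:.0f} kB per document on average")
```

Even the busiest day observed (80,000 documents) would take a little over an hour at that rate, and a typical day only a few minutes.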
The impact on publisher servers is minimal. At, say, 5000 articles/day, even the largest publisher would only get 1 request per minute. The others would be trivial (1 request every 5-10 minutes). There is no case that our responsible text and data mining (TDM) would cause any problems at all.
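The request rates can be turned around to see what share of the daily total each rate corresponds to. This is only a sketch under two assumptions that are mine, not measured: one request per article, and requests spread evenly over 24 hours. The function names are hypothetical.

```python
# Relating request intervals to a publisher's share of the daily articles.
# Assumptions: one request per article, requests spread evenly over the day.
MINUTES_PER_DAY = 24 * 60


def implied_share(articles_per_day, minutes_between_requests):
    """Fraction of the day's articles a publisher would hold to receive
    one request every `minutes_between_requests` minutes."""
    requests_per_day = MINUTES_PER_DAY / minutes_between_requests
    return requests_per_day / articles_per_day


def minutes_between(articles_per_day, share):
    """Average gap in minutes between requests to a publisher holding
    `share` (0..1) of the day's articles."""
    return MINUTES_PER_DAY / (articles_per_day * share)


if __name__ == "__main__":
    for gap in (1, 5, 10):
        share = implied_share(5000, gap)
        print(f"1 request every {gap:>2} min -> {share:.1%} of 5000 articles/day")
```

So "1 request per minute" at 5000 articles/day corresponds to a publisher with a bit under 30% of the daily output, and the 5-10 minute intervals to publishers with roughly 3-6%.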
And, just to reassure everyone, I and colleagues are working hard to stay completely within the law as we see it. We are not stealing content.