I have written today to my collaborators in ContentMine – staff, volunteers, advisory board and Shuttleworth funders and mentors. It’s on the legal aspects of mining. It’s long, but laws are complex. It’s meant to put everyone’s minds at rest – us, universities, Shuttleworth, etc. It’s not authoritative, but may be a useful guide. We’d love to have your feedback. tl;dr I’ve assessed the main problems and most people should assume we have taken a responsible and public approach.
ContentMine is preparing to mine the complete scholarly literature every day – about 10,000 scholarly articles. People from inside CM and from outside have recently raised the question of whether CM is breaking or intends to break the law. This has arisen in part because of our intention to use the UK Copyright exception to mine the whole literature, and because of speculation about the possible use of our technology by “illegal” sites such as Sci-Hub.
NOTE: I am not a lawyer (IANAL) but I have spoken to several and am aware of general principles and practice.
The simple answer is simple:
CM does not intend to break the law and intends not to break the law.
And to my colleagues:
Do not worry. You will not end up in court. If anyone does – and it is unlikely – it will be me and I am prepared.
I shall expand on this in blog posts, but please be assured that I am actively assessing areas where the laws might be broken, especially inadvertently. Note, of course, that there are many other laws which we have to observe on a continual basis, including health and safety, employment, racial discrimination, libel, immigration, etc. I get frequent updates from the Chemistry Department as to what procedures we have to observe. You, I, contentmine.org and everyone are bound to observe and practice these laws. They are complex in detail, extent and interpretation, and we generally manage by knowing the outline of the law. We don’t steal, and we don’t read the small print of what is and is not a theft (e.g. “illegal borrowing”). But in others, e.g. animal experiments or immigration, the small print is critical. “Ignorance of the law is no defence”.
But I will take the responsibility of guiding you and making sure that you don’t transgress inadvertently.
The laws particularly relevant to contentmine.org in question include:
* copyright law
* sui generis database rights (Europe only)
* computer fraud law
* technological protection measures (TPM) and digital rights management (DRM)
* national security laws
Most of these laws have a concern about geo-location. We shall attempt to make sure that all our activities are carried out by UK staff, “in the UK”, on UK machines. But what is legal here may be illegal elsewhere and vice versa. Note also that many laws, especially new ones, cannot have definite answers until they are tested in a court case. Lawyers may give opinions (for fees) but ultimately the court decides.
These laws are complex and often recent and – like many laws – it is possible to transgress unknowingly. We have to educate ourselves and to behave responsibly in actions and language. If anyone is unsure they should raise the issue.
Note that by discussing this in public we will show our good faith and also be alerted by others to potential problems and misinterpretations.
Copyright law is exceedingly complex and also depends on the country. What is legal in the US may not be in Britain and vice versa. It includes:
* the process of copying for the purpose of mining for non-commercial research
* storage of copied material
* republication of the (transformed) output as part of the research/audit/verifiability requirement.
We continually discuss this with lawyers and with librarians. No one can predict precisely what is allowed and what is not – it may depend on “impact on the market of the rights-holder”. All law includes a balance of risks – it is my responsibility and (for some content) the librarians’ to make sure that we have a balanced assessment.
We believe that our mining is fully allowed under the UK 2014 reform (“Hargreaves”). It would not be allowed if we took money from commercial companies and mined the literature solely for their benefit. Europe has noted that much research is a public/private partnership (I worked for 15 years in the Cambridge Unilever Centre, for example). Was this non-commercial? I would take the view that all the projects I worked on were. If I was paid extra to do private contract research for a company which would not be published it would be commercial.
Since I and ContentMine are probably the only group in the UK at present who publicly intend to use Hargreaves there is no case law to answer these questions. We read the current public discourse and form a balanced judgment.
What copyright material can we hold on our machines? It is common for researchers to have thousands of copies of copyright material on their machines and no one is challenged. Unlike them, our material is in a secure computer room in Cambridge with physical access only by trusted staff and e-access only to 2-3 named and authorised people. If anyone wishes to “steal” the literature from our server we will actively prevent and report this. We are not, of course, ourselves redistributing any of the University subscription content other than facts and fair quotations. If, as we hope, the resource becomes useful in the University, we will work with library staff to create a legally acceptable approach where any Cambridge scholar can use the system.
How long can we hold it for? Mining is often an iterative process, so we may wish to re-run searches with new parameters. It would be a technical waste to have to re-download everything every day. It would also put additional workload on the publisher’s servers. We can’t give an answer in days or months or years until we know what the likely usage patterns are.
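As a rough illustration of why holding a local copy avoids repeated downloads, here is a minimal sketch of a download cache. It is not ContentMine’s actual code: the class name `LocalArticleCache`, the `fetch` callable and the cache directory are all assumptions made up for this example.

```python
import hashlib
from pathlib import Path
from typing import Callable


class LocalArticleCache:
    """Keep one local copy of each downloaded document, keyed by its URL.

    Hypothetical helper for illustration only.
    """

    def __init__(self, cache_dir: str, fetch: Callable[[str], bytes]):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch      # downloader supplied by the caller (assumed)
        self.downloads = 0      # number of times the remote server was contacted

    def _path_for(self, url: str) -> Path:
        # Hash the URL so that any URL maps to a safe file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.cache_dir / digest

    def get(self, url: str) -> bytes:
        path = self._path_for(url)
        if path.exists():
            # Already held locally: re-running a search costs the publisher nothing.
            return path.read_bytes()
        data = self.fetch(url)
        self.downloads += 1
        path.write_bytes(data)
        return data
```

With something like this, re-running a search with new parameters reads from local disk, and the publisher’s server is only contacted the first time each document is needed.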
What can we republish? Since facts are uncopyrightable we can publish them without permission (although in Europe we cannot systematically republish the contents of databases protected by sui generis rights. Journals and supplemental data are not databases). But:
is not a useful fact.
“The average snout-vent-length (SVL, see https://sizes.com/natural/lizards.htm ) of the common lizards (Zootoca vivipara) found on Borchester Common ( https://en.wikipedia.org/wiki/Borchester ) was 42 mm (+- 5) measured by 3 independent researchers using the Graduated Ruler and Eyeball Method (see http://www.wikihow.com/Use-a-Ruler )”
is a useful fact. We intend to publish some or all of the facts we extract without formal permission from the publisher.
Note that a fact does not have to be “true”. I don’t actually know the sizes of newborn sandlizards. But what I have stated is a fact. The result might be a misprint for 142 mm (which is possible for an adult). It is still a (potentially falsifiable) fact. It remains a fact regardless of further lizard research.
I will blog more on facts as “facts” are uncopyrightable.
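To make the lizard example concrete, here is a minimal sketch of how a measurement of that kind might be pulled out of a sentence as a structured fact. It is only an illustration under stated assumptions: the function name `extract_measurement`, the regular expression and the output keys are invented for this example and are not ContentMine’s tooling.

```python
import re
from typing import Optional

# Matches text such as "42 mm (+- 5)": a number, a length unit, and an uncertainty.
MEASUREMENT = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mm|cm|m)\s*"
    r"\(\+-\s*(?P<uncertainty>\d+(?:\.\d+)?)\)"
)


def extract_measurement(sentence: str) -> Optional[dict]:
    """Return the first 'value unit (+- uncertainty)' found, or None."""
    match = MEASUREMENT.search(sentence)
    if match is None:
        return None
    return {
        "value": float(match.group("value")),
        "unit": match.group("unit"),
        "uncertainty": float(match.group("uncertainty")),
    }


sentence = (
    "The average snout-vent-length of the common lizards (Zootoca vivipara) "
    "found on Borchester Common was 42 mm (+- 5) measured by 3 independent researchers"
)
fact = extract_measurement(sentence)
```

The extracted record holds only the number, the unit and the uncertainty; the surrounding context (species, place, method) is what makes it a useful fact, so a real pipeline would need to capture that as well.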
* sui generis database rights. We do NOT currently intend to systematically extract facts from factual databases described as such and specifically created for the purpose of holding facts.
* computer fraud laws. We scrupulously avoid breaking these laws. They carry the additional feature that they are criminal, and so prosecution would be by the police. The UK takes these very seriously and wishes to extend the maximum term of imprisonment to 10 years: http://arstechnica.com/tech-policy/2016/04/uk-file-sharing-10-years-jail-time/ (I personally protest against this, but I do it legally). You should therefore take especial care not to share files “illegally”. This means that ContentMine cannot have any dealings with Sci-Hub as it is seen by many as an “illegal” filesharing site. Read Ars Technica:
<quote>The UK government has responded to that issue by saying that it accepts there are concerns, and writes: “the policy intention is that criminal offences should not apply to low level infringement that has a minimal effect or causes minimum harm to copyright owners, in particular where the individuals involved are unaware of the impact of their behaviour.”
Another major worry was the use of the term “affect prejudicially” in judging copyright infringements, which many felt was too vague and could mean a single infringing file would fulfil the requirement—for example, if it were widely shared online. Many thought this set the threshold for committing an offence far too low.
The UK government said it was not aware of any cases where minor infringement had resulted in a criminal prosecution, but “agrees that the undefined term ‘affect prejudicially’ could give rise to an element of ambiguity.” The government is now proposing to introduce “re-worded offence provisions” to address that.</quote>
It is extremely unlikely that we will trigger this law as we don’t deliberately intend to break it and deliberately don’t intend to break it. However #icanhazpdf is almost certainly “illegal” and also breaks the rules of the University. I have never used #icanhazpdf in either direction and never sent files to people who weren’t subscribed. ContentMine staff should not use #icanhazpdf.
In some cases crawling has been held to be a violation of the CFA acts of various flavours. I am not aware of any cases where scholarly publishers have used this to prosecute bona fide researchers, nor where the police have.
Note also that many publishers know that I and others (e.g. Crystallography Open Database) have been crawling their sites for many years and by implication permit it. This includes Nature, Elsevier, American Chemical Society, Royal Society of Chemistry, Acta Crystallographica, Science. We are careful to adhere to responsible mining practice (see https://contentmining.files.wordpress.com/2015/06/responsible-content-mining-1.pdf ).
Aaron Swartz’s case was – for many, including me – a serious miscarriage of justice. From Wikipedia:
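As an illustration of two habits commonly associated with responsible crawling (honouring a site’s robots.txt and spacing out requests), here is a minimal sketch. It is not taken from the responsible content mining document linked above; the class name `PoliteCrawlPolicy`, the user-agent string and the delay value are assumptions for this example. It uses Python’s standard `urllib.robotparser` module.

```python
import time
import urllib.robotparser


class PoliteCrawlPolicy:
    """Check robots.txt rules and enforce a minimum gap between requests.

    Hypothetical helper for illustration only.
    """

    def __init__(self, robots_txt: str, user_agent: str, min_delay_seconds: float,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.min_delay_seconds = min_delay_seconds
        self._clock = clock            # injectable so the policy can be tested
        self._sleep = sleep
        self._last_request = None
        self._robots = urllib.robotparser.RobotFileParser()
        self._robots.parse(robots_txt.splitlines())

    def allowed(self, url: str) -> bool:
        # True if the site's robots.txt permits this user agent to fetch the URL.
        return self._robots.can_fetch(self.user_agent, url)

    def wait_turn(self) -> None:
        # Sleep just long enough to keep min_delay_seconds between requests.
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_delay_seconds - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now


# Example rules; a real crawler would download the site's own robots.txt.
EXAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
policy = PoliteCrawlPolicy(EXAMPLE_ROBOTS, "ExampleMiner", min_delay_seconds=2.0)
```

A crawler would call `policy.allowed(url)` before each request and `policy.wait_turn()` immediately before fetching. Whether a publisher’s robots.txt or terms add further conditions is a separate question that code like this cannot answer.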
<quote>In the wake of the prosecution and subsequent suicide of Aaron Swartz, lawmakers have proposed to amend the Computer Fraud and Abuse Act. Representative Zoe Lofgren has drafted a bill that would help “prevent what happened to Aaron from happening to other Internet users”. Aaron’s Law (H.R. 2454, S. 1196) would exclude terms of service violations from the 1984 Computer Fraud and Abuse Act and from the wire fraud statute, despite the fact that Swartz was not prosecuted based on Terms of Service violations.
In addition to Lofgren’s efforts, Representatives Darrell Issa and Jared Polis (also on the House Judiciary Committee) raised questions about the government’s handling of the case. Polis called the charges “ridiculous and trumped up,” referring to Swartz as a “martyr.” Issa, who also chairs the House Oversight Committee, announced an investigation of the Justice Department’s prosecution.
As of May 2014, Aaron’s Law was stalled in committee, reportedly due to tech company Oracle’s financial interests.</quote>
* TPM and DRM
These are technical methods of preventing access to material and can include firewalls, encryption, specific tools, and possibly Captcha. We have bought legal advice and the result is not clear about whether Hargreaves allows us to circumvent them. The rule for all of us is that if there is any technical barrier to mining we should identify it and alert the librarians and possibly computer officers. Deliberately breaking this law could have serious consequences. Rest assured that I will publicize and comment on publishers who impose TPM.
* national security. It is very unlikely that we shall trigger this very serious offence. However, overzealous prosecutors or government departments – particularly in the US – have used such provisions.
There is a simplistic tendency of some companies and government departments to demonize all “hacking” as security violations. My laptop carries “Wget is not a crime” https://ttdphx.com/2014/10/23/digital-rights-wget-is-not-a-crime/ , after
was jailed for its use. See Slashdot for the link to Snowden and hackerbabble:
ContentMine is in the business of scraping websites – scholarly publishers, academic departments, etc. Is this legal? People have been prosecuted for scraping (see https://devcentral.f5.com/articles/web-scraping-data-collection-or-illegal-activity from a company selling anti-scraping software). Wiley and Elsevier caused Tilburg to cut off Chris Hartgerink for downloading (“stealing”) material to which he had legal access. Their accusations have not been made public and it seems most unlikely he had done anything illegal. However I have scraped publishers for 12 years (for legally accessible materials) with no complaints and I do not expect any.
* incitement to commit a crime. In general it is a serious offence to encourage others to break the law. See http://www.cps.gov.uk/legal/h_to_k/inchoate_offences/#a01 for the official (and complex) UK law. For example I believe that any formal contact with Sci-Hub or recommendation to use it could be interpreted as a crime. Whether the same applies to breaking contract law is less clear, but ContentMine will not knowingly break this either.
Please let me know whether I have omitted an important item or have misrepresented one.