



Wikipedia Science Conference

2/9/2015 – 3/9/2015

We are pleased to report that at least four of the ContentMine team will be speaking at this event.

First up will be Peter Murray-Rust with a keynote talk.


Ross Mounce will be assisting.


He will be followed by Jenny Molloy with “Challenges and opportunities for Wikipedia and Wikidata in synthetic biology”.


And then Stefan Kasberger on Wikipedia in an Open Science workflow.




FORCE2016 Conference

17/4/2016 – 19/4/2016

Peter Murray-Rust will be attending this event.

The FORCE2016 Research Communication and e-Scholarship Conference brings together a diverse group of people interested in changing the way in which scholarly and scientific information is communicated and shared. The goal is to maximize efficiency and accessibility. The conference is non-traditional, with all stakeholders coming to the table for open discussion on an even playing field in support of innovation and coordination across perspectives. The conference is intended to create new partnerships and collaborations and support implementation of ideas generated at the conference and subsequent working groups.

I Annotate 2016 Conference

19/5/2016 – 20/5/2016

Peter Murray-Rust will be attending this event.

Web Annotation continues to gain traction as a compelling collaborative activity with a diverse range of uses and users that will take society substantially forward. I Annotate 2016 will focus particularly on the vision of an interoperable annotation fabric that can serve diverse use cases and the challenges, technologies, standards, best practices, integrations and other elements necessary to accelerate its adoption as a powerful new paradigm. This year, our first in Europe, we’ll focus on bringing annotation to the larger international community, and building these relationships and networks across the annotation ecosystem.

CSV Conference 2016

3/5/2016 – 4/5/2016

People involved with ContentMine, including Mark MacGillivray, Chris Kittel, Stefan Kasberger and Richard Smith-Unna, are attending this event in various capacities.

Follow along on Twitter via #csvconf

csv,conf,v2 is a community-driven data conference. It’s an event that’s not literally about the CSV file format, but rather about what CSV represents with regard to our wider community ideals (data interoperability, hackability, simplicity, etc.).

In keeping with the first CSV conference, we have put together a heavily curated program that maintains an unconference feel. This will include quick, rapid-fire, 20-minute presentations hand-selected by the program committee. Talks will cover a range of data-related topics from a diverse range of speakers. Our focus this year is connecting key areas of open science, data journalism, and open government with the wider data software/tools community.

ContentMine at WOSP2014: Text and Data Mining III: What Elsevier’s Chris Shillum thinks we can do; Responsible Mining

My last post (http://blogs.ch.cam.ac.uk/pmr/2014/09/15/wosp2014-text-and-data-mining-ii-elseviers-presentation-gemma-hersh/) described the unacceptable and intransigent attitude of Elsevier’s Gemma Hersh at WOSP2014: Text and Data Mining of Scientific Documents.

But there was another face to Elsevier at the meeting: Chris Shillum (http://www.slideshare.net/cshillum). Charles Oppenheim and I talked with him throughout lunch, and Richard Smith-Unna also spoke with him later. None of the following is on public record, but it’s a reasonable approximation to what he told us.

Firstly, he disagrees with Gemma Hersh that we HAVE to use Elsevier’s API and sign their Terms and Conditions (which give away several rights and severely limit what we can do and publish). We CAN mine Elsevier’s content through the web pages that we have the right to read, and we cannot be stopped by law. We have a reasonable duty of care to respect the technical integrity of their systems, but that’s all we have to worry about.

I *think* we can move on from that. If so, thanks, Chris. And if so, it’s a model for other publishers.

We told him what we were planning to do – read every paper as it is published and extract the facts.

CS: Elsevier publishes ca. 1000 papers a day.

PMR: that’s about one per minute; that won’t break your servers.

CS: But it’s not how humans behave….

I think this means that if we appear to their servers as just another human, then they don’t have a load problem. For the record, I sometimes download manually, in sequence, as many Elsevier papers as I can, either to (a) check the licence or (b) see whether the figures contain chemistry or sequences. Both are legitimate human activities, whether or not one is a subscriber. For (a) I can manage 3 per minute, about 200/hour – I have to scroll to the end of the paper, because that’s often where the “all rights reserved, © Elsevier” notice appears. For (b) it’s about 40/hour if there is an average of 5 diagrams per paper (I can tell within 500 milliseconds whether a diagram contains chemistry, sequences, phylogenetic trees, dose-response curves…). [Note: I don’t do this for fun – it isn’t – but because I am fighting for our digital rights.]

But it does get boring and error-prone, which is why we use machines.
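The “human-like pace” idea above – roughly one request per minute – can be sketched as a simple rate limiter. This is purely an illustrative sketch of responsible-mining pacing, not any publisher’s sanctioned mechanism; the class and parameter names are my own assumptions.

```python
import time


class PoliteRateLimiter:
    """Caps request frequency so an automated miner stays at a
    human-like pace (e.g. no more than ~1 request per minute)."""

    def __init__(self, min_interval_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self.clock = clock    # injectable for testing
        self.sleep = sleep    # injectable for testing
        self._last = None     # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
        self._last = self.clock()
```

A crawler would call `limiter.wait()` before each download; because the clock and sleep functions are injected, the pacing logic can be verified without real delays.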

The second problem is what we can publish. The problem is copyright. If I download a complete Closed Access paper from Elsevier and post it on a public website I am breaking copyright. I accept the law. (I am only going to talk about the formal law, not morals or ethics at this stage).

If I read a scientific paper I can publish facts. I MAY be able to publish some text as comment, either as metadata or because I am critiquing the text. Here’s an example (http://www.lablit.com/article/11):

In 1953, the following sentence appeared near the end of a neat little paper by James Watson and Francis Crick proposing the double helical structure of DNA (Nature 171: 737–738 (1953)):

“It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.”

Of course, this bon mot is now wildly famous amongst scientists, probably as much for its coyness and understatement …

Lablit can reasonably assume that this is fair comment and quote C+W (copyright whom? They didn’t have transfer in those days). I can quote Lablit in similar fashion. In the US this is often justified under “fair use”, but there is no such protection in the UK. Fair use is, anyway, extremely fuzzy.

So Elsevier gave a guide as to what they allow IF you sign their TDM restrictions. Originally it was 200 characters, but I pointed out that many entities (facts), such as chemical names or biological sequences, are larger. So Chris clarified that Elsevier would allow 200 characters of surrounding context. This could mean something like the following (Abstract, http://www.mdpi.com/2218-1989/2/1/39), showing ContentMine markup:

“…Generally the secondary metabolite capability of <a href="http://en.wikipedia.org/wiki/Aspergillus_oryzae">A.oryzae</a> presents several novel end products likely to result from the domestication process…”

That’s 136 characters without the Named Entity “<a…/a>”.
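One way to check a snippet against such a context budget is to strip the entity markup and count what remains. A minimal sketch, assuming the entity is marked up as an inline `<a>` element as in the example above; the function names and the 200-character limit are taken from the discussion, not from any Elsevier specification.

```python
import re

# Matches one inline named-entity annotation, e.g.
# <a href="http://en.wikipedia.org/wiki/Aspergillus_oryzae">A.oryzae</a>
ENTITY = re.compile(r'<a\b[^>]*>.*?</a>', re.DOTALL)

def context_length(snippet):
    """Number of characters of surrounding context, i.e. the
    snippet length with the named-entity markup removed."""
    return len(ENTITY.sub('', snippet))

def within_budget(snippet, limit=200):
    """True if the surrounding context fits the (assumed) 200-character allowance."""
    return context_length(snippet) <= limit
```

Applied to the abstract snippet above, this counts only the text surrounding the marked-up entity, which is the quantity the 200-character allowance is meant to cap.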

Which means that we need guidelines for Responsible Content Mining.

That’s what JISC have asked Jenny Molloy and me to do (and we’ve now invited Charles Oppenheim to be a third author). It’s not easy, as there are few agreed current practices. It’s possible for us to summarise what people currently do and what has been challenged by rights holders, and then to suggest what is reasonable. Note that this is not a negotiation. We are not at that stage.

So I’d start with:

  1. The right to read is the right to mine.
  2. Researchers should take reasonable care not to violate the integrity of publishers’ servers.
  3. Copyright law applies to all parts of the process, although it is frequently unclear.
  4. Researchers should take reasonable care not to violate publishers’ copyright.
  5. Facts are uncopyrightable.
  6. Science requires the publication of source material as far as possible to verify the integrity of the process. This may conflict with (3/4).


CC BY publishers such as PLoS and BMC will only be concerned with (2). Therefore they act as a yardstick, and that’s why we are working with them. Cameron Neylon of PLoS has publicly stated that single text miners cause no problems for PLoS servers, and there are checks to counter irresponsible crawling.

Unfortunately (4) cannot be solved by discussions with publishers, as they have shown themselves to be in conflict with Libraries, Funders, JISC, etc. Therefore I shall proceed by announcing what I intend to do. This is not a negotiation, and it’s not asking for permission. It’s allowing a reasonable publisher to state any reasonable concerns.

And I hope we can see that as a reasonable way forward.