Peter talks about ContentMine and WikiData at Wikimania





Wikipedia Science Conference

2/9/2015 – 3/9/2015

We are pleased to report that at least four of the ContentMine team will be speaking at this event.

First up will be Peter Murray-Rust with a Keynote Talk


Ross Mounce will be assisting.

Followed by Jenny Molloy “Challenges and opportunities for Wikipedia and Wikidata in synthetic biology”

And then Stefan Kasberger on Wikipedia in an Open Science workflow




Wikipedia and Wikidata. Massive Open resources for Science.

I think Wikipedia is a wonderful creation of the XXIst Century, the Digital Enlightenment. It has arisen out of the massive cultural change enabled by digital freedom – the technical ability for over half the world (and hopefully soon almost all) to read and write what they want.

I was invited to give the plenary lecture at Wikipedia Science – a new venture, and one which was wonderfully successful. Here’s me, promoting “The Right to Read is The Right to Mine“.


I’m not sure whether there is a recording or transcript – I’d certainly value them as I don’t read prepared speeches.

My theme was that Wikidata – one of the dozen major sections of Wikimedia – should be the first stopping place for people who want to find and re-use scientific data. It doesn’t mean that WD necessarily contains all the data itself, but it will have structured, validated links to where the data can be found.

Here are my slides, which contain praise for Wikim/pedia, the problems of closed information, and the technology of liberating it through ContentMining. In ContentMine we are mining the daily literature for science, and Wikidata will be one of the places that we shall look to for recording the results.

One of the great aspects of Wikipedia is that it has an Open approach to governance. Last year at Wikimania I was impressed by the self-analysis of Wikipedia – how can we run a distributed, vibrant, multicultural, multidisciplinary organisation? If anyone can find the answer it’s Wikimedia.

But running societies has never been and never will be easy. People will always disagree about what is right and what is wrong; what will work and what won’t.

And that’s what the next post is about. Wikipedia has embarked on a collaboration with Elsevier to read the closed literature. Many people think it’s a good way forward. Others like Michael Eisen and I think it’s a dereliction of our fundamental values.

It’s healthy that we debate this loudly in public. During that process we may lose friends and make new ones, but we advance our communal processes.

What’s supremely unhealthy is that large, closed, monopolistic capitalist organisations make decisions in private, collude with governments, and constrain and control the Digital Enlightenment.

Should Wikipedia work with Elsevier?

[This story has erupted in the last 2 days – if it had been earlier I would have covered it at my talk to Wikipedia Science.]

[TL;DR. Elsevier has granted accounts to 45 top editors at Wikipedia so they can read closed access publications as part of their editing. I strongly oppose this and say why. BTW I consider myself a committed Wikipedian.]

Glyn Moody in Ars Technica has headlined:

“WikiGate” raises questions about Wikipedia’s commitment to open access

Glyn mailed me for my opinion and the piece, which is accurate, also highlights Michael Eisen’s opposition to the new move. I’ll cut and paste large chunks and then add additional comment.

Scientific publisher Elsevier has donated 45 free ScienceDirect accounts to “top Wikipedia editors” to aid them in their work. Michael Eisen, one of the founders of the open access movement, which seeks to make research publications freely available online, tweeted that he was “shocked to see @wikipedia working hand-in-hand with Elsevier to populate encylopedia w/links people cannot access,” and dubbed it “WikiGate.” Over the last few days, a row has broken out between Eisen and other academics over whether a free and open service such as Wikipedia should be partnering with a closed, non-free company such as Elsevier.

Eisen’s fear is that the free accounts to ScienceDirect will encourage Wikipedia editors to add references to articles that are behind Elsevier’s paywall. When members of the public seek to follow such links, they will be unable to see the article in question unless they have a suitable subscription to Elsevier’s journals, or they make a one-time payment, usually tens of pounds for limited access.

Eisen went on to tweet: “@Wikipedia is providing free advertising for Elsevier and getting nothing in return,” and that, rather than making it easy to access materials behind paywalls, “it SHOULD be difficult for @wikipedia editors to use #paywalled sources as, in long run, it will encourage openness.” He called on Wikipedia’s co-founder, Jimmy Wales, to “reconsider accommodating Elsevier’s cynical use of @Wikipedia to advertise paywalled journals.” His own suggestion was that Wikipedia should provide citations, but not active links to paywalled articles.

Agreed. It is not only providing free advertising, but worse, it implicitly legitimizes Elsevier’s control of the scientific literature. Rather than making it MORE accessible to the citizens of the world, it makes it LESS.

Eisen is not alone in considering the Elsevier donation a poisoned chalice. Peter Murray-Rust is Reader Emeritus in Molecular Informatics at the University Of Cambridge, and another leading campaigner for open access. In an email to Ars, he called the free Elsevier accounts “crumbs from the rich man’s table. It encourages a priesthood. Only the best editors can have this. It’s patronising, ineffectual. And I wouldn’t go near it.”

This arbitrary distinction between the 45 top editors and everyone else is seriously divisive. Even if this were a useful approach (it isn’t), why should Elsevier decide who can, and who can’t, be a top Wikipedia editor? Wikipedia has rightful concerns about who and how editors are “appointed” – it’s meritocratic and, though imperfect, any other solution (cf. Churchill on democracy) is worse.

You may think I am overreacting – that Elsevier will behave decently and collaboratively. I’ve spent 6 years trying to “negotiate” with Elsevier about Content Mining – and it’s one smokescreen after another. They want to develop and retain control over scholarship.

And I have additional knowledge. I’ve been campaigning for reform in Europe (including UK) and everywhere the publishers are fighting us. Elsevier wants me and collaborators to “licence” the right to mine – these licences are designed to make Elsevier the central control. I would strongly urge any Wikipedian to read the small print and then run a mile.

This isn’t the first time that Wikipedia has worked closely with a publisher in this way. The Wikipedia Library “helps editors access reliable sources to improve Wikipedia.” It says that it supports “the broader move towards open access,” but it also arranges Access Partnerships with publishers: “You would provide a set number of qualified and prolific Wikipedia editors free access to your resources for typically 1 year.” As Wikipedia Library writes: “We also love to collaborate on social media, press releases, and blog posts highlighting our partnerships.”

It is that cosy relationship with publishers and their paywalled articles that Eisen is concerned about, especially the latest one with Elsevier, whom he described in a tweet as “#openaccess’s biggest enemy.” Eisen wrote: “it is a corruption of @Wikipedia’s principles to get in bed with Elsevier, and it will ultimately corrupt @Wikipedia.” But in a reply to Wikipedia Library on Twitter, Eisen also emphasised: “don’t get me wrong, i love @wikipedia and i totally understand everything you are doing.”

Murray-Rust was one of the keynote speakers at the recent Wikipedia Science Conference, held in London, which was “prompted by the growing interest in Wikipedia, Wikidata, Commons, and other Wikimedia projects as platforms for opening up the scientific process.” The central question raised by WikiGate is whether the Wikipedia Library project’s arrangements with publishers like Elsevier that might encourage Wikipedia editors to include more links to paywalled articles really help to bring that about.

Elsevier and other mainstream publishers have no intention of major collaboration, nor of releasing the bulk of their material to the world. Witness the 35-year-old paper that predicted that Ebola could break out in Liberia. It’s still behind an Elsevier paywall.

[These problems aren’t confined to Elsevier; many of the major publishers do similar things to restrict the flow of knowledge. When it appeared that ContentMining might become a reality, Wiley recently added “Captchas” to its site to prevent ContentMining. But Elsevier is the largest and most unyielding publisher, often taking the lead in devising restrictions, and so it gets most coverage.]

Wikimedian Martin Poulter, who is the organiser of the Wikipedia Science Conference, has no doubts. In an email, he told Ars: “Personally, I think the Wikipedia Library project (which gives Wikipedia editors free access to pay-walled or restricted resources like Science Direct) is wonderful. As a university staff member, I don’t use it myself, but I’m glad Wikipedians outside the ivory towers get to use academic sources. Wikipedia aims to be an open-access summary of reliable knowledge—not a summary of open-access knowledge. The best scholarly sources are often not open-access: Wikipedia has to operate in this real world, not the world we ideally want.”

The debate will continue publicly in Wikip/media. That’s good.

The STM publishers, Rightslink, and similar organisations are working to lobby politicians and librarians to prevent the liberation of knowledge. That must be fought every day.

Wikimania: I argue for human-machine symbiotes to read and understand science

I have the huge opportunity to present a vision of the future of science at @WikimaniaLondon (Friday 2014-08-08:1730). I am deeply flattered. I am also deeply flattered that the Wikipedians have created a page about me (which means I never have to write a bio!). And that I have been catalogued as an Activist in the Free culture and open movements.

I have always supported Wikipedia. [Purists, please forgive “Wikipedia” as a synonym for Wikimedia, Wikispecies, Wikidata…] Ten years ago I wrote in support of WP (recorded in ):

The bit of Wikipedia that I wrote is correct.

That was offered in support of Wikipedia – its process, its people and its content. (In Wikipedia itself I would never use “I”, but “we”, but for the arrogant academics it gets the message across). I’ll now revise it:

For facts in physical and biological science I trust Wikipedia.

Of course WP isn’t perfect. But neither is any other scientific reference. The difference is that Wikipedia:

  • builds on other authorities
  • is continually updated

Where it is questionable then the community can edit it. If you believe, as I do, that WP is the primary reference work of the Digital Century then the statement “Wikipedia is wrong” is almost meaningless. It’s “we can edit or annotate this WP entry to help the next reader make a better decision”.

We are seeing a deluge of scientific information. This is a good thing, as 80% of science is currently wasted. The conventional solution, disappointingly echoed by Timo Hannay (whom I know well and respect), is that we need a priesthood to decide what is worth reading:

“subscription business models at least help to concentrate the minds of publishers on the poor souls trying to keep up with their journals.” [PMR: Nature is the archetypal subscription model, and is owned by Macmillan, who also owns Timo Hannay’s Digital Science]. “The only practical solution is to take a more differentiated approach to publishing the results of research. On one hand funders and employers should encourage scientists to issue smaller numbers of more significant research papers. This could be achieved by placing even greater emphasis on the impact of a researcher’s very best work and less on their aggregate activity.”

In other words the publishers set up an elite priesthood (which they have already) and academics fight to get their best work published. Everything else is lowgrade. This is so utterly against the Digital Enlightenment – where everyone can be involved – that I reject it totally.

I have a very different approach – knock down the ivory towers; dissolve the elitist publishers (the appointment of Kent Anderson to Science Magazine locks us in dystopian stasis).

Instead we must open scholarship to the world.  Science is for everyone. The world experts in Binomial names (Latin names) of dinosaurs are 4 years old. They have just as much right to our knowledge as professors and Macmillan.

So the next premise is

Most science can be understood by most human-machine symbiotes.

A human-machine scientific symbiote is a social machine consisting of (explained later):

  1. one (or preferably more) humans
  2. a discovery mechanism
  3. a reader-computer
  4. a knowledgebase

This isn’t science fiction. They exist today in primitive form. A hackathon is a great example of a symbiote – a group of humans hacking on a communal problem and sharing tools and knowledge. They are primitive not because of the technology, but because of our lack of vision and restrictive practices. They have to be built from OPEN components (“free to use, re-use, and redistribute”). So let’s take the components:

  1. Humans. These will come from those who think in a Digitally Enlightened way. They need to be open to sharing, to putting group above self, to exposing their efforts, to not being frightened, and to regarding “failure” as a valuable experience. Unfortunately such humans are beaten down by academia throughout much of the education process and through research; non-collaboration is often a virtue, as is conformity. Disregard of the scholarly poor is universal. So either Universities must change or the world outside will change and leave them isolated and irrelevant.
  2. Discovery. We’ve got used to universal knowledge through Google. But Google isn’t very good for science – it only indexes words, not chemical structures or graphs or identifiers or phylogenetic trees … We must build our own discovery system for science. It’s a simpler task than building a Google – there’s 1.5 million papers a year, add theses and grey literature and it’s perhaps 2 million documents. That’s about 5000 a day or 3 a minute. I can do that on my laptop. (I’m concentrating on documents here – data needs different treatment).

The problem would be largely solved if we had an Open Bibliography of science (basically a list of all published scientific documents). That’s easy to conceive and relatively easy to build. The challenge is sociopolitical – libraries don’t do this any more – they buy or rent products from commercial companies, who have their own non-open agendas. So we shall probably have to do this as a volunteer community – largely like OpenStreetMap – but there are several ways we can speed it up using the crowd and exhaust data from other processes (such as Open Access Button and PeerLibrary).

And an index. When we discover a fact we index it. We need vocabularies and identifier systems. In many subjects these exist and are OPEN but in many more they aren’t – so we have to build them or liberate them. All of this is hard, drawn out sociopolitical work. But when the indexes are built, then they create the scientific search engines of the future. They are nowhere near as large and complex as Google. We citizens can build this if we really want.
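The back-of-envelope throughput estimate in the Discovery item can be checked in a few lines. This is only a sketch of the arithmetic: the figure of 2 million documents a year is the rough estimate given above, and the function name `rates` is made up for illustration.

```python
# Rough estimate from the text: about 2 million scientific documents a year
# (papers plus theses and grey literature).
DOCS_PER_YEAR = 2_000_000


def rates(docs_per_year: int) -> tuple[float, float]:
    """Return (documents per day, documents per minute) for a yearly total."""
    per_day = docs_per_year / 365
    per_minute = per_day / (24 * 60)
    return per_day, per_minute


per_day, per_minute = rates(DOCS_PER_YEAR)
print(f"{per_day:.0f} documents/day, {per_minute:.1f} documents/minute")
```

That comes out at a little over 5000 a day and between 3 and 4 a minute, in line with the round figures quoted, and well within what a single laptop can fetch and index.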

3. A machine reader-computer. This is software which reads and processes the document for you. Again it’s not science fiction, just hard work to build. I’ve spent the last 2 years building some of it, and there are others. It’s needed because the technical standard of scholarly publishing is often appalling – almost no-one uses Unicode and standard fonts, which makes PDF awful to read. Diagrams which were created as vector diagrams are trashed to bitmaps (PNGs and even worse JPEGs). This simply destroys science. But, with hard work, we are recovering some of this into semantic form. And while we are doing it we are computing a normalised version. If we have chemically intelligent software (we do!) we compute the best chemical representation. If we have math-aware software (we do) we compute the best version. And we can validate and check for errors and…

4. A knowledge base. The machine can immediately look up any resource – as long as it’s OPEN. We’ve seen an increasing number of Open resources (examples in chemistry are Pubchem (NIH) and ChEBI and ChEMBL (EBI)).

And of course Wikipedia. The quality of chemistry is very good. I’d trust any entry with a significant history and number of edits to be 99% correct in its infobox (facts).

So our knowledgebase is available for validation, computation and much else. What’s the mass of 0.1 mole of NaCl? Look up WP infobox and the machine can compute the answer. That means that the machine can annotate most of the facts in the document – we’re going to examine this on Friday.
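As a minimal sketch of the NaCl question, assume the machine has already looked up the atomic masses it needs. The `ATOMIC_MASS` table below is a hand-typed stand-in for values read from an open knowledgebase (standard atomic weights, rounded), and the function names are hypothetical; no real Wikipedia or Wikidata API is called here.

```python
# Stand-in for values a machine would read from an open knowledgebase.
# Standard atomic weights in g/mol, rounded.
ATOMIC_MASS = {"Na": 22.990, "Cl": 35.45}


def molar_mass(composition: dict[str, int]) -> float:
    """Molar mass in g/mol for a composition such as {"Na": 1, "Cl": 1}."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


def mass_in_grams(moles: float, composition: dict[str, int]) -> float:
    """Mass in grams of the given amount of substance."""
    return moles * molar_mass(composition)


nacl = {"Na": 1, "Cl": 1}
print(f"{mass_in_grams(0.1, nacl):.2f} g")  # 0.1 mol x 58.44 g/mol = 5.84 g
```

The point is not the arithmetic, which is trivial, but that every number in it comes from an Open source that the machine can look up and cite.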

What’s Panthera leo? I didn’t know, but WP does. It’s the lion. So WP starts to make a scientific paper immediately understandable. I’d guess that a paper has hundreds of facts – we shall find out shortly.

But, alas, the STM publishers are trying to stop us doing this. They want to control it. They want to licence the process. Licence means control, not liberation.

But, in the UK, we can ignore the STM publisher lobbyists. Hargreaves allows us to mine papers for factual content without permission.

And Ross Mounce and I have started this. With papers on bacteria. We can extract tens of thousands of binomial names for bacteria.
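To give a flavour of what extracting binomial names involves, here is a toy sketch. It is not the software Ross and I are using: the pattern, the tiny `KNOWN_GENERA` list and the function name are all assumptions made for illustration. A real system would check candidates against an open taxonomy rather than a hand-typed set.

```python
import re

# Hypothetical, tiny genus list; a real system would use an open taxonomy.
KNOWN_GENERA = {"Escherichia", "Bacillus", "Staphylococcus"}

# Candidate binomial: a capitalised word followed by a lowercase word of 3+ letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def extract_binomials(text: str) -> list[str]:
    """Return distinct 'Genus species' names whose genus is in KNOWN_GENERA."""
    found: list[str] = []
    for genus, species in BINOMIAL.findall(text):
        name = f"{genus} {species}"
        if genus in KNOWN_GENERA and name not in found:
            found.append(name)
    return found


sample = (
    "Cultures of Escherichia coli and Bacillus subtilis were grown. "
    "The Bacillus subtilis strain grew faster."
)
print(extract_binomials(sample))  # ['Escherichia coli', 'Bacillus subtilis']
```

The genus check matters: the bare pattern also matches ordinary phrases such as "The cat", so the knowledgebase is what turns candidate strings into names.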

But where can we find out what these names mean?

Maybe you can suggest somewhere… :-)