Shuttleworth Gathering Budapest, Content Mine Dogfood

Twice a year the Shuttleworth Fellowship meets in a Gathering – it could be anywhere in the world (subject to a minimum-travel-cost algorithm). This is my first and we are in Budapest – one of Europe’s loveliest cities. (I’ve been here before, luckily, as our programme has been very full and we only got out once formally, for a river cruise.)

It’s Chatham House Rule so no details, but see our web page for the 13 fellows. This is one of the most coherent, inspiring groups I have ever been in. So much is common ground – we agree on doing Open; the questions are why, what and how, and we’ve explored those. I’ve found so much in common – we are in the area of liberating knowledge and inspiring innovation, mixed with democracy and justice. I’m finding out how to build communities, and about annotation and education, while being able to help with computer vision, information extraction, metadata, etc.

We each ran a 75-minute slot on “eating our own dogfood”. NOT a lecture. We had to bring the practice of our project and ask the others – everyone – to grok it and hack it. Often this was in small groups, and for mine we had 5 groups of 5. Here’s my rough summary with comments:

  • Why are we doing ContentMining? economics, openness/democracy, innovations, disruption.  Hargreaves

Very useful discussion (as would be expected)

  • Manual markup (highlighters) of two articles

Worked very well. Lots of questions about “should we mark this?”. 

  • Demo (PMR) of semantic content (chemistry)

  • Crawling exercise (manual)

Good involvement. “Why doesn’t publisher X have an RSS feed?”, etc.

  • Scraping exercise (manual and software)

Again worked very well

  • Extraction (software and manual design)

Mainly concentrated on manual markup but showed chemical tagger, etc.

  • Where are we going?

 

I deliberately put far too much in – so people could test whether the software worked, etc. But the main idea was to see how non-biologists managed. I chose a paper on the evolutionary biology of lions in Africa and everyone got the point. In fact it reinforced how needlessly exclusive scientific language is. The first part of the introduction could be rewritten without loss to read something like

“African Lions are dying out because of hunting and environment change. DNA analyses show that lions in different parts of Africa have evolved in different ways. By studying the DNA and historical specimens we can understand the evolution and perhaps use this for conservation.”

There wasn’t enough time for everyone to run the software – deliberately – but we got very useful feedback. I shall be tweaking it over the weekend to make sure it’s working for our Vienna workshop.

Shuttleworth Fellowship: Month 2; synergy with the Digital Enlightenment can change the world

I’m now finishing the second month of my Shuttleworth Fellowship – the most important thing in my whole career. My project The Content Mine aims to liberate all the facts in the scientific literature.

That’s incredibly ambitious and I don’t know in detail how it’s going to happen – but I am confident it will.

This week we posted our website – and showed how we create content. What’s modern is that this is a community website – we’re inspired by Wikipedia and OpenStreetMap, where volunteers can find their own area of interest and contribute. Since there is no other Open resource for content-mining we shall provide one – we have 100 pages and intend to go beyond 1000. Obviously you can help with that. And of course Wikipedia’s information is invaluable.

We have an incredible team:

  • Michelle Brook. Michelle is Manager and is making a massive impression with her work on Open Access.
  • Jenny Molloy. Jenny has co-authored the foundations of Open Content Mining and ran the first workshop last year.
  • Ross Mounce. Ross has championed Open Content Mining in Brussels and is developing software for mining phylogenetics.
  • Mark MacGillivray. Co-authored Open Bibliography and founded CottageLabs who are supporting our web presence and IT infrastructure.
  • Richard Smith-Unna. Founder of the volunteer scientist-developer community solvers.io to which he is pitching ContentMine to support Crawling.

But we also have masses of informal links and collaborations. Because we are Open, people want to find out what we are doing and offer help. It’s possible that many of our requirements for crawling may be met by the community – and that has been happening over the last week. We’ve had an important contribution to our approach to Optical Character Recognition. Today I was skyped with suggestions about Chemistry in the ContentMine.

This all happens because of the Digital Enlightenment. People round the world are seeing the possibilities of zero-cost software, efficient voluntary Open communities and the value of liberated Knowledge. There are many projects wanting to liberate bibliography, reform authoring, re-use bioscience, etc. Occasionally we wake up and think “wow! problem solved!”. If you think “we”, not “me”, the world changes.

The Fellows and Foundation are fantastic. I have an hour on Skype every week with Karien, and another hour with the whole Fellowship. These are incredibly valuable. With such a huge ambition we need focus.

There’s huge synergy with several formal and many informal projects. Once you decide that your software and output are Open, you can move several times faster. No tedious agreements to sign. No worries about secrecy, so no delays in making knowledge open. Of the formal projects:

  • Andy Howlett is doing the 3rd year of his PhD in the Unilever Centre here on metabolism. He can use the 10 years’ worth of Open Source we have developed and because his contributions are also Open we’ll benefit in return.
  • Mark Williamson is using our software in similar fashion.
  • Ross Mounce and Matt Wills at Bath are running the PLUTo project. Because it’s completely Open they can use our software and we can re-use their results.
  • We are starting work with Chris Steinbeck at EBI on automated extraction of metabolites and phytochemistry from the literature.

Informally we are working with Volker Sorge (Birmingham) and Noureddin Sadawi (Brunel) on scientific computer vision and re-use of information for Blind and Visually Impaired people. With Egon Willighagen and John May on the (Open) Chemistry Development Kit. With the Crystallography Open Database…

How can it possibly work?

In the same way that Steve Coast “single-handedly” and with zero cash built up OpenStreetMap:

  • By promoting the concept. We are already well known in the community, and people are watching and starting to participate.
  • By building horizontal scalability. By dividing the problem into separate journals, we can build per-journal solutions. By identifying independent disciplines (chemistry, species, phylogenetics…) we can develop each one independently.
  • By building an Open modular software and information architecture. We build libraries and tools, not applications, so it’s easy to reconfigure them. If people want a command-line approach we can offer that.
  • By re-using what’s already Open. Need a chemical database? We don’t build it ourselves – we work with EBI and PubChem. An Open bibliography? We work with Europe PubMed Central.
  • By attracting and honouring volunteers. RichardSU has discovered that the key point is to offer evening-sized problems. Developers don’t want to tackle a complex infrastructure – they want something where the task is clear and that they can complete before they go to bed. And we have to make sure that they are promoted as first-class citizens.
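
To make the “horizontal scalability” and “libraries, not applications” points a bit more concrete, here is a minimal Python sketch of how per-journal scrapers and per-discipline extractors might plug into a shared core. It is purely illustrative and is not ContentMine code: the registry names, the `example-journal` key, the toy term lists and the `mine` function are all assumptions made up for this example.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registries: one scraper per journal, one extractor per discipline.
SCRAPERS: Dict[str, Callable[[str], str]] = {}
EXTRACTORS: Dict[str, Callable[[str], List[str]]] = {}


def scraper(journal: str):
    """Register a function that turns one journal's raw page into plain text."""
    def register(func):
        SCRAPERS[journal] = func
        return func
    return register


def extractor(discipline: str):
    """Register a function that pulls one discipline's facts out of plain text."""
    def register(func):
        EXTRACTORS[discipline] = func
        return func
    return register


def find_terms(text: str, terms: List[str]) -> List[str]:
    """Return the terms that occur in the text (case-insensitive), sorted."""
    lowered = text.lower()
    return sorted(term for term in terms if term.lower() in lowered)


@scraper("example-journal")
def scrape_example_journal(raw_html: str) -> str:
    # Toy rule: drop tags and collapse whitespace. A real per-journal
    # scraper would know that journal's page layout.
    return " ".join(re.sub(r"<[^>]+>", " ", raw_html).split())


@extractor("species")
def extract_species(text: str) -> List[str]:
    # Toy dictionary lookup; the term list is invented for illustration.
    return find_terms(text, ["Panthera leo", "Homo sapiens"])


@extractor("chemistry")
def extract_chemistry(text: str) -> List[str]:
    # Toy dictionary lookup; the term list is invented for illustration.
    return find_terms(text, ["ethanol", "glucose"])


def mine(journal: str, raw_html: str) -> Dict[str, List[str]]:
    """Scrape with the journal's scraper, then run every registered extractor."""
    text = SCRAPERS[journal](raw_html)
    return {name: extract(text) for name, extract in EXTRACTORS.items()}


if __name__ == "__main__":
    page = "<p>DNA of <i>Panthera leo</i> in Africa</p>"
    print(mine("example-journal", page))
```

The point of the design is that a volunteer can add a new journal or a new discipline as one small, self-contained function, which is about the size of an evening-sized problem, without touching anything else.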

Much of what we do will depend on what happens every week. A month ago I hadn’t planned for solvers.io; or Longan Java OCR; or Peer Library; or JournalToCs; or BoofCV; or …

… YOU!

PS: You might wonder what a 72-year-old is doing running a complex knowledge project. RichardSU asked that on Hacker News and I’m pleased that others value my response. If Neelie Kroes can change the world at 72, so can I – and so can YOU.

If you are retired you’re exactly the sort of person who can make massive contributions to the Content Mine. And it’s fun.