Wellcome Trust Workshop

13/4/2015 – 14/4/2015


Peter Murray-Rust @petermurrayrust
Jenny Molloy @jenny_molloy
The ContentMine Team @TheContentMine

Location: Darwin Room, Wellcome Trust, 215 Euston Road, London NW1 2BE
Dates: 13-14 April 2015
13 April 2015: Training Workshop, 10:00 – 18:00
14 April 2015: Hackday & Policy Panel Session, 09:00 – 17:00


Contact us via @TheContentMine or *protected email*

Please read the Pre-workshop Installation Instructions

We would also appreciate your feedback

Wellcome Trust Workshop – 2 Days of Hacking, Networking & Collaboration

This is a follow-on from our earlier post, published ahead of our Workshop at the Wellcome Trust.

On the 13th & 14th April 2015, we hosted our most recent Workshop, this time in London at the Wellcome Trust. This particular Workshop received a very high level of interest from the moment that we released tickets in late January.

Here was the Training Workshop Agenda for Monday 13th.

And the Workshop Hackday and Policy Panel Session Agenda for the following day can be found here.

Day 1 covered tutorials, simple hands-on technology sessions, aspects of policy and protocols, and project planning.

Day 2 involved hacking projects for 6 hours in groups, followed by a 2-hour policy/advocacy session with key UK and EU attendees.

At the start of Day 1, we heard from Robert Kiley, Head of Digital Services at the Wellcome Trust, one of the world’s largest medical research charities.

There were three separate sessions after that in the morning, including:-

Think like a content miner

  • Hands-on activity facilitated by ContentMine staff introducing entity extraction techniques, precision and recall.
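For readers who did not attend, precision and recall can be illustrated with a minimal sketch. The entity names and the "gold" annotation set below are made up for illustration and do not come from the workshop materials.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted entities."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    # Precision: what fraction of the extracted items are correct?
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what fraction of the correct items were extracted?
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: species names a miner pulled from a paper,
# compared with the names a human annotator marked as correct.
found = {"Mus musculus", "Homo sapiens", "Figure 2"}
gold = {"Mus musculus", "Homo sapiens", "Rattus norvegicus"}

precision, recall = precision_recall(found, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the miner returned one false hit ("Figure 2") and missed one species, so both measures come out at two thirds.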

After lunch, there were four separate sessions, including:-

Legality of content mining: what can I mine?

  • Presentation and Q&A by ContentMine staff covering copyright, database and contract law. Special attention was paid to the UK copyright exception.


Responsible content mining: how should I mine?

  • Presentation and Q&A by ContentMine staff covering server limits, online scraping etiquette and responsible technology use.

These sessions also included a very important one from Mark MacGillivray.


http://catalogue.cottagelabs.com/browse

The formal part of the day ended with:-

Hackday pitches and forming teams

  • Presentations by individuals and groups followed by discussion in newly formed teams facilitated by ContentMine staff.

Around a dozen of the group then retired to The Euston Flyer for some socializing.

Day 2 involved hacking in teams from 09:00 until 15:30.

Thereafter, delegates from a dozen funders, policy makers, etc. joined us. These included Wellcome Trust, HEFCE, BIS, RLUK, British Library, the Royal Society and Nature Publishing Group.

This was followed with:-

Introduction to content mining

  • Presentation delivered by Peter Murray-Rust to new attendees.

Next was:-

Presentation of hackday projects

  • Presentations delivered by participants, including future scope for development of their projects.

We heard from:-

1 Michael Baron

A potentially very widely appreciated use of content mining would be for more in-depth literature review. Currently, PubMed’s search algorithm is limited to the content of titles and abstracts. Publications cannot be screened on, e.g., keywords in the method section.

The scenario we explored at the ContentMine workshop centred on a bio-bank for respiratory human tissues. We aimed to find new collaborators that could make use of the resource.

For this, two lists of keywords were compiled: one to search PubMed Central for publications in the field, and a second to scan the papers’ method sections for relevant techniques and materials.

Eventually, author information was to be extracted from papers with relevant methods in the field of respiratory research, for example from groups that have so far worked with mouse tissue.

A similar approach could be very useful for literature collection that is general in scope but very specific in its criteria: find all papers in the field of ‘x’ that used technique ‘y’ or cell-line ‘z’, or any other information that is not likely to be found in the abstract or title.
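The second step, screening method sections against a keyword list, might be sketched as follows. This is an illustration only and not the code written at the workshop. It assumes each paper has already been fetched and split so that its methods text is available under a hypothetical "methods" key, and the papers, authors and keywords are invented.

```python
import re


def methods_hits(paper, keywords):
    """Return the keywords found in a paper's methods text.

    Matching is case-insensitive and on whole words or phrases.
    """
    text = paper.get("methods", "")
    return [
        kw for kw in keywords
        if re.search(r"\b" + re.escape(kw) + r"\b", text, flags=re.IGNORECASE)
    ]


# Hypothetical papers, as they might look after a field-level search.
papers = [
    {"title": "Airway remodelling in asthma",
     "authors": ["A. Example"],
     "methods": "Lung tissue from C57BL/6 mice was fixed and sectioned."},
    {"title": "Modelling cough dynamics",
     "authors": ["B. Example"],
     "methods": "We simulated airflow using a finite element model."},
]
# Hypothetical second keyword list: techniques and materials of interest.
technique_keywords = ["mice", "lung tissue", "bronchial biopsy"]

# Keep author information only for papers whose methods match.
candidates = []
for paper in papers:
    hits = methods_hits(paper, technique_keywords)
    if hits:
        candidates.append((paper["authors"], paper["title"], hits))

for authors, title, hits in candidates:
    print(authors, title, hits)
```

Only the first paper is kept, because only its methods text mentions terms from the second list.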

2 Frank Hellwig

Frank showed how text mining can be used where, besides journal articles and books, relevant sources often include large numbers of documents that are not available in XML or HTML but only in formats such as PDF or Word. After converting PDFs to Scholarly HTML, he demonstrated the use of regular expressions to identify documents and to present, in a single list, snippets of text extracted from a document corpus based on whether certain terms and their variants occur. This enables time-saving searching and inspection.
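A minimal sketch of this kind of snippet extraction is shown here. It is not Frank's actual code: it assumes the documents have already been converted to plain text, and the document names, contents and search pattern are made up.

```python
import re


def find_snippets(documents, pattern, context=30):
    """Yield (doc_id, snippet) for every match of pattern.

    Each snippet includes up to `context` characters either side of the match.
    """
    regex = re.compile(pattern, flags=re.IGNORECASE)
    for doc_id, text in documents.items():
        for match in regex.finditer(text):
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(text))
            yield doc_id, text[start:end].strip()


# Hypothetical corpus, keyed by document name.
documents = {
    "report_a": "The committee reviewed open access policies for funded research.",
    "report_b": "No change to licensing was proposed this year.",
    "report_c": "Open-access publishing costs rose, and openaccess mandates followed.",
}

# One pattern covers the variants "open access", "open-access" and "openaccess".
snippets = list(find_snippets(documents, r"open[\s-]?access"))

for doc_id, snippet in snippets:
    print(f"{doc_id}: ...{snippet}...")
```

Collecting all matches into one list makes it possible to skim the relevant passages without opening each document in turn.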

3 Elise Ruark

4 Jasper Poort & Dafne Zuleima Morgado Ramirez


“On the second day of the workshop we wanted to use the ContentMine software for two purposes. First, we wanted to try to describe and summarize the currently available (open-access) scientific literature on a specific topic (data description). Second, we hoped that this summary could generate new ideas for further literature searches (data exploration).

In the first stage, we used specific criteria to generate a list of the web addresses of relevant scientific publications, using the PLoS ONE website (but other sources such as Europe PubMed Central could also be used). In our example, we used the search terms ‘human’, ‘robotic’ and ‘exoskeleton’. However, we note that any combination of search terms could be used at this stage, for example, to search for specific gene and disease combinations, or to search for conjunctions of brain areas and cognitive functions.

In the second stage, we used a sequence of different ContentMine tools. We first extracted the full text from each study (‘scraping’ step), and then converted each study to the same format (‘scholarly html’, ‘normalization’ step). Finally, we used the ‘bag of words’ tools to count word frequencies in each study. The ‘bag of words’ tool has the option to use a list of common English ‘stop words’, such as ‘the’, ‘for’ and ‘in’, that can be excluded, because they do not reflect relevant content.

In the third stage, we took the list of word frequency counts within each study, and wrote code to summarize and visualize this across studies (see http://www.matthewgthomas.co.uk/my-research/contentmine-workshop/ for a more detailed description). We first generated a list of all the unique words that occurred across all studies. Next, we sorted the list. We started with sorting by the simple word count (or ‘raw term frequency’). This returned a list of the most frequent words, and also listed for each word in which of the studies it occurred. We then visualized the relation between the most frequent words and the studies in which they occurred with network graphs, where the nodes in the network represented the words and studies, and the connections indicated which words were linked to which study. This gave us both a sense of the relations between different studies, and also indicated which words and concepts were highly relevant in our selection of studies. As suggested by the workshop organizers, we also used an alternative sorting method, the so-called ‘term frequency–inverse document frequency’, which down-weights the ‘raw term frequency’ of a word by how frequently that word occurs across all studies, in order to highlight more uniquely ‘informative’ words (words that do not occur in every study).

In the next stage, one could take a subset of the words that are considered most relevant, and use it to start a new search.

In conclusion, we believe that this type of analysis is a valuable tool to summarize an ever growing body of literature, and that it has the potential to inspire new insights in relevant topics and concepts for scientific research questions”.
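The word-counting and re-weighting steps described in this account can be sketched in a few lines. This is a simplified illustration and not the ContentMine ‘bag of words’ tool or the group's own code. The stop-word list, the study texts and the plain `count * log(N / document frequency)` weighting are assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Small illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "for", "in", "a", "of", "and", "to", "is"}


def bag_of_words(text):
    """Count word frequencies in one study, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def tf_idf(bags):
    """Weight each raw count by log(N / number of studies containing the word)."""
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags.values():
        doc_freq.update(bag.keys())
    return {
        doc: {w: count * math.log(n_docs / doc_freq[w]) for w, count in bag.items()}
        for doc, bag in bags.items()
    }


# Hypothetical study texts standing in for normalized full text.
studies = {
    "study_1": "The exoskeleton supports the human gait in stroke rehabilitation.",
    "study_2": "A robotic exoskeleton for the human arm.",
    "study_3": "Human trials of the exoskeleton controller.",
}

bags = {name: bag_of_words(text) for name, text in studies.items()}
scores = tf_idf(bags)

for name, weights in scores.items():
    top = sorted(weights, key=weights.get, reverse=True)[:3]
    print(name, top)
```

Words such as ‘exoskeleton’ and ‘human’ occur in every study, so their weight drops to zero, while words found in only one study rise to the top of the list.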

5 Helena Patching

And the final session of the whole event was:-

Panel discussion on accelerating uptake of content mining

  • Panel and Q&A with audience including workshop participants.

This session included:-

  • TDM policy and advocacy, Robert Kiley
  • Why are Wellcome interested in TDM?
  • Ben White of LIBER and an IPO employee gave a 3-minute talk on the questions posed below, and then opened it up for discussion.
  • What do you think needs to happen to drive TDM innovation in the UK?
  • How do we reconcile the non-commercial nature of the UK copyright exception for TDM with the clear intent for copyright reform to bring economic benefits?
  • One way each organisation/individual can contribute to promoting TDM in the UK and Europe

Peter Murray-Rust has already blogged about this event, see:-

TheContentMine is Ready for Business and will make scientific and medical facts available to everyone on a massive scale.

All of the pictures of this event can be found here on Flickr under CC-BY.

TheContentMine is Ready for Business and will make scientific and medical facts available to everyone on a massive scale.



It’s a year since I started TheContentMine (contentmine.org) – a project funded by the Shuttleworth Foundation. In ContentMine we intend to extract all the world’s scientific and medical facts from the scholarly literature and make them available to everyone under permissive Open licences. We have been so busy – writing code, lobbying politically, building the team, designing the system, giving workshops, creating content, writing tutorials, etc. that I haven’t had time to blog.

This week we launched, without fanfare, at a workshop sponsored by Robert Kiley of the Wellcome Trust:


[RK presented with an AMI, the mascot of TheContentMine]

Robert (and WT) have been magnificent in supporting ContentMining. He has advocated, organised, corralled, pushed, challenged over many years. The success of the workshop owes a great deal to him.

On Monday and Tuesday (2015-04-13/14) we ran a 2-day workshop: training, hacking and advocacy/policy. We advertised the workshop, primarily for Early Career Researchers, and were overwhelmed: FOUR TIMES oversubscribed [1]. Jenny Molloy organised the days, roughly as follows:

  • Day 1
      • tutorials and simple hands-on sessions about technology
      • aspects of policy and protocols
      • planning projects
  • Day 2
      • hacking projects for 6 hours
      • 2-hour policy/advocacy session with key UK and EU attendees.

It worked very well and showed that ContentMine is now viable in many areas:

  • We have unique software that has a completely new approach to searching scientific and medical literature.
  • We have an infrastructure that allows automatic processing of the literature through CRAWLing, SCRAPE-ing, NORMAlising and MINING (AMI).
  • We have a back-end/server CATalogue (contracted through CottageLabs) which has ingested and analysed a million articles.
  • We have novel search interfaces and display of results.
  • We have established that THE RIGHT TO READ IS THE RIGHT TO MINE in the UK.
  • We have built a team, and shown how to build communities.
  • We have tested training sessions that can be used to train trainers and spread the word.
  • And we are credible at the policy level.


[Part of the policy session]

We are delighted that a dozen funders, policy makers, etc. came. They included JISC, IPO, LIBER, RLUK, RCUK, HEFCE, CUL, WT, BIS, UbiquityPress, NatureNews. The discussion took for granted that ContentMining is critically important and addressed how it could be supported and encouraged.

My slides for the policy session are at http://www.slideshare.net/petermurrayrust/content-mining-at-wellcome-trust.

I will blog more details later and show more pictures, and so will Graham McDawg Steel. But the highlight for me was the speed and efficiency of the Early Career Researchers in adopting, using, modifying and promoting the system. They came mainly from bioscience/medical fields and ranged from UNIX geeks to those who hadn’t seen a command line. In their projects they were able to make the CM software work for them and extract facts from the literature. One group wrote additional processing software, another created a novel display with D3.

Best of all, they said they’d be happy to learn how to run a workshop and take the ideas and software (which is completely Open: Apache2/CC BY/CC0) to their communities.

NOTE: Hargreaves allows UK researchers to mine ANYTHING (that they have legal right to read) for non-commercial use. The publishers cannot stop them, either by technical means or contracts with libraries.

This should make the UK the content-mining capital of the world. Please join us!

Wellcome Trust Workshop – Before the Event


On the 13th & 14th April 2015, we will be hosting our next Workshop in London at the Wellcome Trust. This particular Workshop received a very high level of interest from the moment that we released tickets in late January.

Here is the Training Workshop Agenda for Monday 13th.

And the Workshop Hackday and Policy Panel Session Agenda for the following day can be found here.

The following members of the ContentMine team will be present:-

Peter Murray-Rust, Jenny Molloy, Mark MacGillivray, Richard Smith-Unna, Stephanie Smith-Unna and Graham Steel

25 individuals from a broad and diverse range of academic backgrounds will be in attendance. In addition, 10-15 individuals from funding bodies, institutions, learned societies and other policy-related organisations will join the final afternoon session on the 14th, including Wellcome Trust, HEFCE, BIS, RLUK, British Library, the Royal Society and Nature Publishing Group.

Workshop Objectives

  1. Raise awareness of content mining among researchers.
  2. Train 20-25 researchers in using the ContentMine pipeline tools.
  3. Promote legal and responsible content mining practices.
  4. Prototype or at least explore developing a range of new applications and tools that might be useful to funders and researchers.
  5. Showcase the scope of potential applications for content mining to workshop participants and invited policy staff.
  6. Discuss and suggest requirements to drive uptake of content mining e.g. training, policy, use cases, commercial applications.

Day one is a full day from 10:00 – 18:00. Afterwards, we have reserved space for around 25 people for an informal social event to discuss hackday projects. This will be held at the Euston Flyer, which is only a few minutes’ walk from Wellcome.

Day two is also a full day from 09:00 – 17:00. There are currently no plans for any social activities at the end of day two.

Anticipated Workshop Outcomes

  1. Research or assessment performed using ContentMine tools by a proportion of workshop participants.
  2. Production of two or more prototype applications designed to be useful to the funders and/or researchers.
  3. Continuing collaboration and development of a proportion of hackday prototypes by participants and ContentMine staff.
  4. One or more ContentMine trainers recruited from pool of participants, increasing the reach of training for content mining.
  5. Better informed policy staff as a result of panel discussion and publication of outputs.
  6. Demand for training by further organisations or individuals linked to the Wellcome Trust, therefore potential for increased impact beyond this particular workshop.

We shall publish a further blog post after the event!