Wellcome Trust Workshop – 2 Days of Hacking, Networking & Collaboration

This is a follow-on from this earlier post ahead of our Workshop at the Wellcome Trust.

On the 13th & 14th April 2015, we hosted our most recent Workshop, this time in London at the Wellcome Trust. This particular Workshop received a very high level of interest from the moment that we released tickets in late January.

Here is the Training Workshop Agenda for Monday 13th.

And the Workshop Hackday and Policy Panel Session Agenda for the following day can be found here.

Day1 was about tutorials, simple hands-on work with the technology, aspects of policy and protocols, and planning projects.

Day2 involved hacking projects for 6 hours in groups followed by a 2-hour policy/advocacy session with key UK and EU attendees.

At the start of Day1, we heard from Robert Kiley, Head of Digital Services at the Wellcome Trust, one of the world’s largest medical research charities.

There were three separate sessions after that in the morning, including:-

Think like a content miner

  • Hands-on activity facilitated by ContentMine staff introducing entity extraction techniques, precision and recall.
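For readers new to the terms, precision and recall can be illustrated with a minimal sketch. The function name and the species names below are made up for illustration and are not taken from the workshop materials.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for a set of extracted entities
    compared against a hand-annotated gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: what fraction of the things we extracted were correct?
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: what fraction of the correct things did we manage to extract?
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data: entities a miner pulled out of a paper,
# versus the entities a human annotator marked in the same paper.
extracted = {"Mus musculus", "Homo sapiens", "London"}
gold = {"Mus musculus", "Homo sapiens", "Danio rerio", "Rattus norvegicus"}

precision, recall = precision_recall(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three extracted entities are correct, and two of the four annotated entities were found, so a miner can look precise while still missing half of what is in the text.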

After lunch, there were four separate sessions, including:-

Legality of content mining: what can I mine?

  • Presentation and Q&A by ContentMine staff covering copyright, database and contract law. Special attention was paid to the UK copyright exception.


Responsible content mining: how should I mine?

  • Presentation and Q&A by ContentMine staff covering server limits, online scraping etiquette and responsible technology use.
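As a rough illustration of what scraping etiquette can look like in practice, here is a minimal sketch using only the Python standard library. The class name `PoliteFetcher`, the user agent string and the example robots.txt rules are all assumptions made for this example; it is not the ContentMine tooling, and the actual download step is deliberately left out.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteFetcher:
    """Checks robots.txt rules and enforces a minimum delay between requests."""

    def __init__(self, robots_lines, user_agent="example-miner", min_delay=1.0):
        self.user_agent = user_agent
        self.robots = RobotFileParser()
        self.robots.parse(robots_lines)
        # Respect the site's Crawl-delay if it declares a longer one than ours.
        declared = self.robots.crawl_delay(user_agent)
        self.min_delay = max(min_delay, declared or 0)
        self._last_request = None

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.robots.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Sleep until min_delay seconds have passed since the previous request."""
        if self._last_request is not None:
            remaining = self.min_delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()


# Hypothetical robots.txt content for a publisher's site.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]
fetcher = PoliteFetcher(robots_lines, min_delay=1.0)
print(fetcher.allowed("https://example.org/articles/1"))
print(fetcher.allowed("https://example.org/private/data"))
print(fetcher.min_delay)
```

In a real miner one would call `wait_turn()` before every request to the same server and skip any URL for which `allowed()` returns `False`.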

These sessions also included a very important one from Mark MacGillivray.



http://catalogue.cottagelabs.com/browse

The formal part of the day ended with:-

Hackday pitches and forming teams

  • Presentations by individuals and groups followed by discussion in newly formed teams facilitated by ContentMine staff.

Around a dozen of the group then retired to The Euston Flyer for some socializing.

Day2 involved hacking in teams from 09:00 until 15:30.

Thereafter, a number of delegates from a dozen funders, policy makers and other organisations joined us. These included Wellcome Trust, HEFCE, BIS, RLUK, British Library, the Royal Society and Nature Publishing Group.

This was followed with:-

Introduction to content mining

  • Presentation delivered by Peter Murray-Rust to new attendees.


Next was:-

Presentation of hackday projects

  • Presentations delivered by participants, including future scope for development of their projects.

We heard from:-

1 Michael Baron

One use of content mining that could be very widely appreciated is a more in-depth literature review. Currently, PubMed’s search algorithm is limited to the content of titles and abstracts, so publications cannot be screened on, for example, keywords in the methods section.

The scenario we explored at the ContentMine workshop centred on a bio-bank for human respiratory tissues. We aimed to find new collaborators who could make use of the resource.

For this, two lists of keywords were compiled: one to search PubMed Central for publications in the field, and a second to scrape the papers’ methods sections for relevant techniques and materials.

Eventually, author information was to be extracted from papers with relevant methods in the field of respiratory research, for example from groups that have so far worked with mouse tissue.

A similar approach could be very useful for literature collection with very specific criteria: find all papers in the field of ‘x’ that used technique ‘y’ or cell-line ‘z’, or any other information that is not likely to be found in the abstract or title.
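The two-list idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the paper records, author names and keyword lists are invented, and a simple local filter on title and abstract stands in for the PubMed Central search that the team actually used for the first step.

```python
import re


def mentions_any(text, keywords):
    """Return the keywords that occur in text as whole words (case-insensitive)."""
    found = []
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(keyword)
    return found


def screen_papers(papers, field_keywords, method_keywords):
    """Keep papers that are in the field AND mention a relevant method."""
    hits = []
    for paper in papers:
        # Step 1: is the paper in the field? (stand-in for the literature search)
        field_text = paper["title"] + " " + paper["abstract"]
        if not mentions_any(field_text, field_keywords):
            continue
        # Step 2: does the methods section mention a technique or material?
        methods_found = mentions_any(paper["methods"], method_keywords)
        if methods_found:
            hits.append({"title": paper["title"],
                         "authors": paper["authors"],
                         "matched": methods_found})
    return hits


# Invented example records; a real run would parse these from full-text articles.
papers = [
    {"title": "Airway remodelling in asthma",
     "abstract": "We studied structural changes in lung tissue.",
     "methods": "Mouse lung tissue was fixed and stained using immunohistochemistry.",
     "authors": ["A. Example", "B. Example"]},
    {"title": "Cardiac imaging in heart failure",
     "abstract": "An echocardiography study.",
     "methods": "Sections were analysed using immunohistochemistry.",
     "authors": ["C. Example"]},
    {"title": "A survey of respiratory symptoms",
     "abstract": "A questionnaire-based study.",
     "methods": "Participants completed an online questionnaire.",
     "authors": ["D. Example"]},
]
field_keywords = ["asthma", "lung", "respiratory"]
method_keywords = ["mouse lung tissue", "immunohistochemistry"]

hits = screen_papers(papers, field_keywords, method_keywords)
for hit in hits:
    print(hit["authors"], hit["matched"])
```

Only the first record survives both filters: the second is outside the field, and the third is in the field but uses none of the listed methods.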

2 Frank Hellwig

Frank showed how text mining can be used in cases where, besides journal articles and books, relevant sources often include large numbers of documents that are not available in XML or HTML but only in, for example, PDF or Word format. After converting PDFs to Scholarly HTML, he demonstrated the use of regular expressions to identify documents and to present, in one list, snippets of text extracted from a document corpus based on whether certain terms and their variants occur, enabling time-saving searching and inspection.
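To give a flavour of the snippet idea, here is a minimal sketch in Python. It is not Frank's code: the function name, the file names, the sample text and the regular expression are all invented for illustration, and the documents are assumed to have been converted to plain text already.

```python
import re


def find_snippets(documents, pattern, context=40):
    """Return (doc_id, snippet) pairs for every match of pattern,
    with up to `context` characters of surrounding text on each side."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    results = []
    for doc_id, text in documents.items():
        for match in regex.finditer(text):
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            # Collapse line breaks and repeated spaces so the list reads cleanly.
            snippet = " ".join(text[start:end].split())
            results.append((doc_id, snippet))
    return results


# Invented corpus: document name mapped to its extracted text.
documents = {
    "report.pdf": "The committee discussed text mining of the archive and agreed next steps.",
    "memo.docx": "Licences should permit text-mining. Data mining was not covered.",
    "notes.pdf": "No relevant terms appear in this document.",
}

# One pattern covers variants: "text mining", "text-mining", "textmining", "text mined".
pattern = r"\btext[\s-]?min(?:ing|ed|e)\b"

snippets = find_snippets(documents, pattern)
for doc_id, snippet in snippets:
    print(f"{doc_id}: ...{snippet}...")
```

Widening or narrowing `context` changes how much surrounding text is shown for each hit, which is what makes the resulting list quick to scan.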

3 Elise Ruark

4 Jasper Poort & Dafne Zuleima Morgado Ramirez


“In the second day of the workshop we wanted to use the ContentMine software for two purposes. First, we wanted to try to describe and summarize the currently available (open-access) scientific literature on a specific topic (data description). Second, we hoped that this summary could generate new ideas for further literature searches (data exploration).

In the first stage, we used specific criteria to generate a list of the web addresses of relevant scientific publications, using the PLoS ONE website (but other sources such as Europe PubMed Central could also be used). In our example, we used the search terms ‘human’, ‘robotic’ and ‘exoskeleton’. However, we note that any combination of search terms could be used at this stage, for example, to search for specific gene and disease combinations, or to search for conjunctions of brain areas and cognitive functions.

In the second stage, we used a sequence of different ContentMine tools. We first extracted the full text from each study (‘scraping’ step), and then converted each study to the same format (‘scholarly html’, ‘normalization’ step). Finally, we used the ‘bag of words’ tools to count word frequencies in each study. The ‘bag of words’ tool has the option to use a list of common English ‘stop words’, such as ‘the’, ‘for’ and ‘in’, that can be excluded, because they do not reflect relevant content.

In the third stage, we took the list of word frequency counts within each study, and wrote code to summarize and visualize this across studies (see http://www.matthewgthomas.co.uk/my-research/contentmine-workshop/ for a more detailed description). We first generated a list of all the unique words that occurred across all studies. Next, we sorted the list. We started with sorting by the simple word count (or ‘raw term frequency’). This returned a list of the most frequent words, and also listed for each word in which of the studies it occurred. We then visualized the relation between the most frequent words and the studies in which they occurred with network graphs, where the nodes in the network represented the words and studies, and the connections indicated which words were linked to which study. This gave us both a sense of the relations between different studies, and also indicated which words and concepts were highly relevant in our selection of studies. As suggested by the workshop organizers, we also used an alternative sorting method, the so-called ‘term frequency–inverse document frequency’, which down-weights the ‘raw term frequency’ of a word by how frequently that word occurs across all studies, in order to highlight more uniquely ‘informative’ words (words that do not occur in every study).

In the next stage, one could take a subset of the words that are considered most relevant, and use it to start a new search.

In conclusion, we believe that this type of analysis is a valuable tool to summarize an ever growing body of literature, and that it has the potential to inspire new insights in relevant topics and concepts for scientific research questions”.
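The counting and weighting steps described by Jasper and Dafne can be sketched as follows. This is a minimal illustration, not the ContentMine ‘bag of words’ tool or the team's own code: the stop word list, the three example sentences and the particular tf-idf formula (raw count multiplied by log(N / number of studies containing the word)) are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

# A tiny, illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "for", "in", "a", "an", "and", "of", "to", "is", "with"}


def bag_of_words(text, stop_words=STOP_WORDS):
    """Count word frequencies in one study, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stop_words)


def tf_idf(bags):
    """Weight each raw count by log(N / number of studies containing the word)."""
    n_studies = len(bags)
    study_freq = Counter()
    for bag in bags.values():
        study_freq.update(bag.keys())
    return {
        study: {word: count * math.log(n_studies / study_freq[word])
                for word, count in bag.items()}
        for study, bag in bags.items()
    }


# Invented stand-ins for the normalized full text of three studies.
studies = {
    "study1": "The exoskeleton supports the human gait in rehabilitation.",
    "study2": "A robotic exoskeleton for the human arm.",
    "study3": "Human trials of a robotic exoskeleton with gait sensors.",
}

bags = {study: bag_of_words(text) for study, text in studies.items()}
scores = tf_idf(bags)

# Word-study pairs, i.e. the connections of the network graph described above.
edges = [(word, study) for study, bag in bags.items() for word in bag]

for word, score in sorted(scores["study1"].items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.2f}")
```

With this weighting, words that appear in every study (here ‘human’ and ‘exoskeleton’, which were the search terms) score zero, while words unique to one study rise to the top of its list.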

5 Helena Patching

And the final session of the whole event was:-

Panel discussion on accelerating uptake of content mining

  • Panel and Q&A with audience including workshop participants.

This session included:-

  • TDM policy and advocacy, Robert Kiley
  • Why are Wellcome interested in TDM?
  • Ben White of LIBER and an IPO employee had 3 minutes to talk to the questions posed below, and then opened it up for discussion.
  • What do you think needs to happen to drive TDM innovation in the UK?
  • How do we reconcile the non-commercial nature of the UK copyright exception for TDM with the clear intent for copyright reform to bring economic benefits?
  • One way each organisation/individual can contribute to promoting TDM in the UK and Europe

Peter Murray-Rust has already blogged about this event, see:-

TheContentMine is Ready for Business and will make scientific and medical facts available to everyone on a massive scale.

All of the pictures of this event can be found here on Flickr under CC-BY.


Published by


Scotland's (main, but not only) #OpenScience #OpenAccess #OpenData #OpenSource #OpenKnowledge & #PatientAdvocate Loves blogging http://figshare.com/blog Glasgow, Scotland.
