This is a follow-on from this earlier post, written ahead of our Workshop at the Wellcome Trust.
On the 13th & 14th April 2015, we hosted our most recent Workshop, this time in London at the Wellcome Trust. This particular Workshop received a very high level of interest from the moment that we released tickets in late January.
Here was the Training Workshop Agenda for Monday 13th.
And the Workshop Hackday and Policy Panel Session Agenda for the following day can be found here.
Day1 covered tutorials: simple hands-on sessions on the technology, aspects of policy and protocols, and planning projects.
Day2 involved hacking projects for 6 hours in groups followed by a 2-hour policy/advocacy session with key UK and EU attendees.
At the start of Day1, we heard from Robert Kiley, Head of Digital Services at the Wellcome Trust, one of the world’s largest medical research charities.
- Hands-on activity facilitated by ContentMine staff introducing entity extraction techniques, precision and recall.
- Legality of content mining: what can I mine?
- Responsible content mining: how should I mine?
These sessions also included a very important one from Mark MacGillivray.
http://catalogue.cottagelabs.com/browse
The formal part of the day ended with:
- Hackday pitches and forming teams
Day2 involved hacking in teams from 09:00 until 15:30
This was followed by:
- Introduction to content mining
A potentially very widely appreciated use of content mining would be for a more in-depth literature review. Currently, PubMed’s search algorithm is limited to the content of titles and abstracts. Publications cannot be screened by, for example, keywords in the methods section.
The scenario we explored at the ContentMine workshop centred on a bio-bank for respiratory human tissues. We aimed to find new collaborators who could make use of the resource.
For this, two lists of keywords were compiled: one to search PubMed Central for publications in the field, and a second to scrape the papers’ methods sections for relevant techniques and materials.
Eventually, author information was to be extracted from papers with relevant methods in the field of respiratory research, for example from groups that have worked with mouse tissue so far.
A similar approach could be very useful for general, but very specific, literature collection: find all papers in the field of ‘x’ that used technique ‘y’ or cell-line ‘z’, or any other information that is not likely to be found in the abstract or title.
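The second keyword list can be illustrated with a minimal Python sketch. It assumes each paper is already available as plain text with a recognisable "Methods" heading on its own line. The keyword list, the sample paper and the function names are hypothetical and are not the code the team wrote on the day.

```python
import re

# Hypothetical second keyword list: techniques and materials of interest.
METHOD_KEYWORDS = ["mouse tissue", "lung biopsy", "immunohistochemistry"]

# Headings assumed to open and close a methods section; real papers vary widely.
METHODS_START = re.compile(r"^\s*(materials and methods|methods)\s*$", re.I | re.M)
NEXT_HEADING = re.compile(
    r"^\s*(results|discussion|conclusions?|references)\s*$", re.I | re.M
)


def methods_section(full_text):
    """Return the text between the methods heading and the next major heading."""
    start = METHODS_START.search(full_text)
    if start is None:
        return ""
    end = NEXT_HEADING.search(full_text, start.end())
    stop = end.start() if end else len(full_text)
    return full_text[start.end():stop]


def matched_keywords(full_text, keywords=METHOD_KEYWORDS):
    """List the keywords that occur in the methods section (case-insensitive)."""
    section = methods_section(full_text).lower()
    return [kw for kw in keywords if kw.lower() in section]


# Made-up paper text, for illustration only.
paper = """Title: Airway remodelling in asthma
Abstract
We studied airway remodelling.
Methods
Mouse tissue was fixed and analysed by immunohistochemistry.
Results
Lung biopsy data from earlier work are discussed.
"""

print(matched_keywords(paper))  # ['mouse tissue', 'immunohistochemistry']
```

"Lung biopsy" is not reported because it appears only in the results, which is exactly the distinction a title-and-abstract search cannot make.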
Frank showed how text mining can be used in cases where, besides journal articles and books, relevant sources often include large numbers of documents that are not available in XML or HTML but only in, for example, PDF or Word format. After converting PDFs to Scholarly HTML, he demonstrated the use of regular expressions to identify documents and to present, in a single list, snippets of text extracted from a document corpus based on whether certain terms and their variants occur. This enables time-saving searching and inspection.
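The snippet-listing idea can be sketched in a few lines of Python. This is not Frank's code: it assumes the documents have already been converted to plain text, and the corpus, term variants and function names are invented for illustration.

```python
import re


def term_pattern(variants):
    """Compile one case-insensitive pattern matching any variant as a whole word."""
    alternatives = "|".join(re.escape(v) for v in variants)
    return re.compile(r"\b(?:%s)\b" % alternatives, re.IGNORECASE)


def snippets(corpus, variants, width=30):
    """Return (document name, snippet) pairs for every match in the corpus.

    corpus maps a document name to its plain text; width is the number of
    characters of context kept on each side of a match.
    """
    pattern = term_pattern(variants)
    found = []
    for name, text in sorted(corpus.items()):
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            # Collapse line breaks and repeated spaces inside the snippet.
            found.append((name, " ".join(text[start:end].split())))
    return found


# Hypothetical two-document corpus and term variants.
corpus = {
    "report_a.txt": "The committee discussed text mining and data-mining exceptions.",
    "report_b.txt": "No relevant terms here.",
}
variants = ["text mining", "data mining", "data-mining"]

for name, snippet in snippets(corpus, variants):
    print(name, "|", snippet)
```

Listing the variants explicitly keeps the pattern easy to read. A looser pattern such as `data[- ]?mining` would catch more spellings at the cost of more false positives.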
4 Jasper Poort & Dafne Zuleima Morgado Ramirez
“In the second day of the workshop we wanted to use the ContentMine software for two purposes. First, we wanted to try to describe and summarize the currently available (open-access) scientific literature on a specific topic (data description). Second, we hoped that this summary could generate new ideas for further literature searches (data exploration).
In the first stage, we used specific criteria to generate a list of the web addresses of relevant scientific publications, using the PLoS ONE website (but other sources such as Europe PubMed Central could also be used). In our example, we used the search terms ‘human’, ‘robotic’ and ‘exoskeleton’. However, we note that any combination of search terms could be used at this stage, for example, to search for specific gene and disease combinations, or to search for conjunctions of brain areas and cognitive functions.
In the second stage, we used a sequence of different ContentMine tools. We first extracted the full text from each study (‘scraping’ step), and then converted each study to the same format (‘scholarly html’, ‘normalization’ step). Finally, we used the ‘bag of words’ tools to count word frequencies in each study. The ‘bag of words’ tool has the option to use a list of common English ‘stop words’, such as ‘the’, ‘for’ and ‘in’, that can be excluded, because they do not reflect relevant content.
In the third stage, we took the list of word frequency counts within each study, and wrote code to summarize and visualize this across studies (see http://www.matthewgthomas.co.uk/my-research/contentmine-workshop/ for a more detailed description). We first generated a list of all the unique words that occurred across all studies. Next, we sorted the list. We started with sorting by the simple word count (or ‘raw term frequency’). This returned a list of the most frequent words, and also listed for each word in which of the studies it occurred. We then visualized the relation between the most frequent words and the studies in which they occurred with network graphs, where the nodes in the network represented the words and studies, and the connections indicated which words were linked to which study. This gave us both a sense of the relations between different studies, and also indicated which words and concepts were highly relevant in our selection of studies. As suggested by the workshop organizers, we also used an alternative sorting method, the so-called ‘term frequency–inverse document frequency’, which down-weights the ‘raw term frequency’ of a word by how frequently the word occurs across all studies, in order to highlight more uniquely ‘informative’ words (words that do not occur in each study).
In the next stage, one could take a subset of the words that are considered most relevant, and use it to start a new search.
In conclusion, we believe that this type of analysis is a valuable tool to summarize an ever growing body of literature, and that it has the potential to inspire new insights in relevant topics and concepts for scientific research questions”.
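The word-counting and weighting steps described in the report above can be sketched in plain Python. This is an illustration, not the ContentMine ‘bag of words’ tool or the team's own code: the stop-word list is a tiny stand-in, the three "studies" are invented sentences, and the weighting uses one common TF-IDF variant (raw count multiplied by log(N / document frequency)), which may differ from the formula used at the workshop.

```python
import math
import re
from collections import Counter

# A tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "for", "in", "a", "of", "and", "to", "is"}


def bag_of_words(text):
    """Count word frequencies in one study, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def tf_idf(bags):
    """Weight each raw term frequency by log(N / document frequency).

    bags maps a study name to its Counter of word counts. A word that occurs
    in every study gets weight 0, so it drops to the bottom of the ranking.
    """
    n_studies = len(bags)
    doc_freq = Counter()
    for counts in bags.values():
        doc_freq.update(counts.keys())  # each study counts once per word
    return {
        name: {w: c * math.log(n_studies / doc_freq[w]) for w, c in counts.items()}
        for name, counts in bags.items()
    }


# Invented example texts standing in for normalized full-text studies.
studies = {
    "study_1": "The exoskeleton supports the human gait in rehabilitation.",
    "study_2": "A robotic exoskeleton for the human arm.",
    "study_3": "Robotic control of the exoskeleton actuator.",
}

bags = {name: bag_of_words(text) for name, text in studies.items()}
weights = tf_idf(bags)

# Rank the words of one study from most to least 'informative'.
ranked = sorted(weights["study_1"].items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Here ‘exoskeleton’ occurs in all three studies and so receives a weight of zero, while words unique to one study, such as ‘gait’, rise to the top of that study's ranking.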
Peter Murray-Rust has already blogged about this event, see:-
TheContentMine is Ready for Business and will make scientific and medical facts available to everyone on a massive scale.
All of the pictures of this event can be found here on Flickr under CC-BY.