Responsible content mining: how should I mine?
- Presentation and Q&A by ContentMine staff covering server limits, online scraping etiquette and responsible technology use.
These sessions also included a particularly important presentation from Mark MacGillivray.
THIS IS MASSIVE
http://catalogue.cottagelabs.com/browse
The formal part of the day ended with:-
Hackday pitches and forming teams
- Presentations by individuals and groups followed by discussion in newly formed teams facilitated by ContentMine staff.
Around a dozen of the group then retired to The Euston Flyer for some socializing.
Day 2 involved hacking in teams from 09:00 until 15:30.
Thereafter, delegates from around a dozen funders, policy makers and other organisations joined us. These included the Wellcome Trust, HEFCE, BIS, RLUK, the British Library, the Royal Society and Nature Publishing Group.
This was followed with:-
Introduction to content mining
- Presentation delivered by Peter Murray-Rust to new attendees
Next was:-
Presentation of hackday projects
- Presentations delivered by participants, including future scope for development of their projects.
We heard from:-
1 Michael Baron
A potentially widely appreciated use of content mining would be more in-depth literature review. Currently, PubMed’s search algorithm is limited to the content of titles and abstracts. Publications cannot be screened on, for example, keywords in the method section.
The scenario we explored at the ContentMine workshop centred on a bio-bank for human respiratory tissues. We aimed to find new collaborators who could make use of the resource.
For this, two lists of keywords were compiled: one to search PubMed Central for publications in the field, and a second to scrape the papers’ method sections for relevant techniques and materials.
Eventually, author information was to be extracted from papers in the field of respiratory research that used relevant methods, for example from groups that have so far worked with mouse tissue.
A similar approach could be very useful for broad but highly specific literature collection: find all papers in the field of ‘x’ that used technique ‘y’ or cell-line ‘z’, or any other information that is not likely to be found in the abstract or title.
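The second step of this idea, screening method sections against a keyword list, can be sketched in a few lines of Python. This is only an illustration under assumptions: the paper records, identifiers, author names and the `papers_using` helper are hypothetical, and simple case-insensitive substring matching stands in for whatever matching the team actually used.

```python
# Hypothetical in-memory corpus: each paper maps section names to plain text.
# In practice these sections would come from scraped, normalised full text.
papers = [
    {
        "id": "paper-A",
        "authors": ["A. Example"],
        "sections": {
            "abstract": "We study airway remodelling.",
            "methods": "Precision-cut lung slices were prepared from mouse tissue.",
        },
    },
    {
        "id": "paper-B",
        "authors": ["B. Example"],
        "sections": {
            "abstract": "Unlike mouse tissue, human samples are scarce.",
            "methods": "Human bronchial epithelial cells were cultured.",
        },
    },
]


def papers_using(papers, keywords, section="methods"):
    """Return (id, authors, matched keywords) for papers whose given
    section mentions at least one keyword (case-insensitive substring)."""
    hits = []
    for paper in papers:
        text = paper["sections"].get(section, "").lower()
        matched = sorted(k for k in keywords if k.lower() in text)
        if matched:
            hits.append((paper["id"], paper["authors"], matched))
    return hits


# Assumed keyword list for techniques and materials of interest.
keywords = ["mouse tissue", "lung slices"]
hits = papers_using(papers, keywords)
for paper_id, authors, matched in hits:
    print(paper_id, authors, matched)
```

Note that paper-B mentions ‘mouse tissue’ only in its abstract, so it is not returned: the point of the exercise is that the match is restricted to the method section.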
2 Frank Hellwig
Frank showed how text mining can be used in cases where, besides journal articles and books, the relevant sources include large numbers of documents that are not available in XML or HTML but only in formats such as PDF or Word. After converting PDFs to Scholarly HTML, he demonstrated the use of regular expressions to identify documents and to present, in a single list, snippets of text extracted from a document corpus wherever certain terms and their variants occur. This enables time-saving searching and inspection.
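A rough sketch of this kind of snippet listing, assuming the documents have already been converted to plain text, might look like the following. The corpus contents, file names, search pattern and the `snippets` helper are all made up for illustration and are not taken from Frank's demonstration.

```python
import re


def snippets(corpus, pattern, width=30):
    """List (document id, matched term, surrounding text) for every match
    of the pattern, keeping `width` characters of context on each side."""
    regex = re.compile(pattern, re.IGNORECASE)
    rows = []
    for doc_id, text in corpus.items():
        for match in regex.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            rows.append((doc_id, match.group(0), text[start:end].strip()))
    return rows


# Hypothetical corpus of already-converted documents.
corpus = {
    "report.pdf": "The committee reviewed text mining practice. "
                  "Text-mining tools were then compared.",
    "memo.docx": "No relevant terms appear here.",
}

# One pattern covering variants: "text mining", "text-mining", "textmine", ...
pattern = r"\btext[\s-]?min(?:ing|ed|e)\b"

rows = snippets(corpus, pattern)
for doc_id, term, context in rows:
    print(f"{doc_id}: [{term}] ...{context}...")
```

Collecting all matches into one list of rows is what makes quick inspection possible: documents without a match simply do not appear.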
3 Elise Ruark
4 Jasper Poort & Dafne Zuleima Morgado Ramirez
“In the second day of the workshop we wanted to use the ContentMine software for two purposes. First, we wanted to try to describe and summarize the currently available (open-access) scientific literature on a specific topic (data description). Second, we hoped that this summary could generate new ideas for further literature searches (data exploration).
In the first stage, we used specific criteria to generate a list of the web addresses of relevant scientific publications, using the PLoS ONE website (but other sources such as Europe PubMed Central could also be used). In our example, we used the search terms ‘human’, ‘robotic’ and ‘exoskeleton’. However, we note that any combination of search terms could be used at this stage, for example, to search for specific gene and disease combinations, or to search for conjunctions of brain areas and cognitive functions.
In the second stage, we used a sequence of different ContentMine tools. We first extracted the full text from each study (‘scraping’ step), and then converted each study to the same format (‘scholarly html’, ‘normalization’ step). Finally, we used the ‘bag of words’ tools to count word frequencies in each study. The ‘bag of words’ tool has the option to use a list of common English ‘stop words’, such as ‘the’, ‘for’ and ‘in’, that can be excluded, because they do not reflect relevant content.
In the third stage, we took the list of word frequency counts within each study, and wrote code to summarize and visualize this across studies (see http://www.matthewgthomas.co.uk/my-research/contentmine-workshop/ for a more detailed description). We first generated a list of all the unique words that occurred across all studies. Next, we sorted the list. We started with sorting by the simple word count (or ‘raw term frequency’). This returned a list of the most frequent words, and also listed for each word in which of the studies it occurred. We then visualized the relation between the most frequent words and the studies in which they occurred with network graphs, where the nodes in the network represented the words and studies, and the connections indicated which words were linked to which study. This gave us both a sense of the relations between different studies, and also indicated which words and concepts were highly relevant in our selection of studies. As suggested by the workshop organizers, we also used an alternative sorting method, the so-called ‘term frequency–inverse document frequency’, which down-weights the ‘raw term frequency’ of a word by how frequently that word occurs across all studies, in order to highlight more uniquely ‘informative’ words (words that do not occur in each study).
In the next stage, one could take a subset of the words that are considered most relevant, and use it to start a new search.
In conclusion, we believe that this type of analysis is a valuable tool to summarize an ever growing body of literature, and that it has the potential to inspire new insights in relevant topics and concepts for scientific research questions”.
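The word-counting and weighting steps described in this write-up can be sketched as follows. This is not the ContentMine ‘bag of words’ tool or the team's own code: the stop-word list, the three example texts and both helper functions are assumptions, and the weighting shown (raw count multiplied by log(N / document frequency)) is just one common tf-idf variant.

```python
import math
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"the", "for", "in", "a", "of", "and", "is", "to"}


def bag_of_words(text, stop_words=STOP_WORDS):
    """Count word frequencies in one study, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)


def tf_idf(bags):
    """Weight each raw count by log(N / number of studies containing the word),
    so that words present in every study score zero."""
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags.values():
        doc_freq.update(bag.keys())  # each study counts once per word
    return {
        doc_id: {w: count * math.log(n_docs / doc_freq[w])
                 for w, count in bag.items()}
        for doc_id, bag in bags.items()
    }


# Hypothetical stand-ins for normalised full texts.
studies = {
    "study-1": "The exoskeleton supports the human gait in robotic rehabilitation.",
    "study-2": "A robotic exoskeleton for the human arm.",
    "study-3": "Human trials of the robotic exoskeleton and the ankle actuator.",
}

bags = {doc_id: bag_of_words(text) for doc_id, text in studies.items()}
scores = tf_idf(bags)

for doc_id, weights in scores.items():
    top = sorted(weights, key=weights.get, reverse=True)[:3]
    print(doc_id, top)
```

In this toy corpus the search terms themselves (‘human’, ‘robotic’, ‘exoskeleton’) occur in every study and so receive a weight of zero, while words such as ‘gait’ or ‘arm’ that appear in only one study rise to the top, which is the effect the alternative sorting method is meant to achieve.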
5 Helena Patching
And the final session of the whole event was:-
Panel discussion on accelerating uptake of content mining.
- Panel and Q&A with audience including workshop participants.
This session included:-
- TDM policy and advocacy, Robert Kiley
- Why are Wellcome interested in TDM?
- Ben White of LIBER and an IPO employee spoke for 3 minutes on the questions posed below, and then opened it up for discussion.
- What do you think needs to happen to drive TDM innovation in the UK?
- How do we reconcile the non-commercial nature of the UK copyright exception for TDM with the clear intent for copyright reform to bring economic benefits?
- One way each organisation/individual can contribute to promoting TDM in the UK and Europe
Peter Murray-Rust has already blogged about this event, see:-
All of the pictures of this event can be found here on Flickr under CC-BY.