Announcing ContentMine Fellowships!

ContentMine is a scholarly assistant for the 21st Century – we’re a team of researchers and developers building an open source pipeline for mining facts from the scientific literature: from genes to species, diseases to chemicals. We’re looking for early adopters who want to fast-forward their literature-based research and extract information from thousands of papers for collation and analysis. If you have a research project that involves manually searching through thousands of documents, we could help!

In this first round, we will fund up to five fellows for six months to work on research projects related to the life sciences.

Eligibility

Anyone with an interesting idea to explore using the scientific literature and a basic set of programming skills is welcome to apply. We welcome and encourage applicants from outside academia. Because the software is still in an alpha state, we require applicants to have basic knowledge of:

  • UNIX command line
  • At least one programming language such as JavaScript, Java, Python, R, MATLAB or another data science/statistical programming language.
  • Version control using Git and GitHub

What ContentMine Fellows can expect from us:

  • £1000 and some ContentMine merchandise!
  • A one-day webinar workshop to help you get started with the software.
  • Fortnightly support calls with the ContentMine developers.
  • Access to support via ContentMine Slack (chat app).
  • Priority support and bug fixes to help keep your research up and running.
  • Excitement and enthusiasm for your project! We are researchers and we love science.

What we expect from ContentMine Fellows:

  • We promote open notebook science! You will record your progress via GitHub and the ContentMine discussion forum.
  • Three blog posts summarising progress over the course of six months, one of which will be the final report.
  • Willingness to explore new methods of research and research communication.
  • Attendance at fortnightly calls with members of the ContentMine team to help each other and discuss bugs or features for the software.
  • Detailed bug reporting and feedback on our software.

Application Process

Please submit the following as email attachments to Jenny Molloy (via contact [at] contentmine [dot] org): a one-page summary of your research idea, a CV, and a cover letter explaining your eligibility and why you would like to be a ContentMine Fellow. Shortlisted applicants will be asked to perform a simple task with the ContentMine software and attend a brief online interview. Applications close on 3 June 2016; interviews will take place shortly afterwards, and the fellowships will run 1 July – 31 December 2016.

ContentMine Cambridge CMunity Meetup

8/9/2015

ContentMine aims to extract 100 million facts from the scientific literature. To achieve this, we are building a CMunity of content miners in Cambridge and will teach you how to extract data and text from papers using our open source tools.

Whether you’re interested in finding chemistry, phylogenetic trees, DNA sequences, tracking species, gathering papers for a systematic review, automatically extracting data from graphs and more, come and join us for the informal and informative meetup at the Panton Arms.

All are welcome – whether you are new to mining and want to find out what it could offer you and your research, or an experienced coder interested in adding to our tool chain, we look forward to seeing you there!

October 2015 ContentMine Cambridge CMunity Meetup

6/10/2015

ContentMine aims to extract 100 million facts from the scientific literature. To achieve this, we are building a CMunity of content miners in Cambridge and will teach you how to extract data and text from papers using our open source tools.

Whether you’re interested in finding chemistry, phylogenetic trees, DNA sequences, tracking species, gathering papers for a systematic review, automatically extracting data from graphs and more, come and join us for the informal and informative meetup at the Panton Arms.

All are welcome – whether you are new to mining and want to find out what it could offer you and your research, or an experienced coder interested in adding to our tool chain, we look forward to seeing you there!

Daily updates on IUCN Red List species

The ContentMine team have been working on user stories for our daily stream of facts. One stand-out user story that we can easily cater for with our tools is that of conservation biologists and practitioners looking to stay up to date with the very latest literature on IUCN Red List species.

Take for instance the journal PLOS ONE. It’s open access, but the high volume and broad subject scope mean that people sometimes struggle to keep up with relevant content published there. Recently (May 2015) an interesting article on an endangered species of frog was published in PLOS ONE; we use it as the running example in this post: Roznik EA, Alford RA (2015) Seasonal Ecology and Behavior of an Endangered Rainforest Frog (Litoria rheocola) Threatened by Disease. PLoS ONE 10(5): e0127851. doi:10.1371/journal.pone.0127851

 

Common Mist Frog (Litoria rheocola) of northern Australia. Credit: Damon Ramsey / Wikimedia / CC BY-SA

 

This kind of peer-reviewed, published information is vitally important to conservation organisations. Typically, the Red List status of many groups is assessed and re-assessed by experts only every 5 years. It is extremely expensive, time-consuming, and tedious for humans to do these kinds of systematic literature reviews. We suggest that intelligent machines should do most of this screening work instead.

We think we could make the literature review process cheaper, more rigorous, continuous and transparent by publishing a daily stream of facts related to all Red List species. For the above paper, the 26 summary snippet facts extracted from the full text, labelled by section, might look something like the below (bold emphasis is mine to highlight the entity we would match). Note this reduces the full text from over 6000 words to a more bite-size summary of just ~700. Multiply this effect across thousands of papers and searches for thousands of different species and you might begin to understand the usefulness of this:

* Note that because PLOS ONE is an openly-licensed journal we can re-post as much context around each entity as we wish.

 

Text as extracted by our ami-species plugin:

From the Introduction section:

  • ...One such species is the common mistfrog (Litoria rheocola), an IUCN Endangered species [22] that occurs near rocky, fast-flowing rainforest streams in northeastern Queensland, Australia [23]…
  • Litoria rheocola is a small treefrog (average male body size: 2.0 g, 31 mm; average female body size: 3.1 g, 36 mm [20])
  • Habitat modification and fragmentation also threaten L.rheocola [23,27]...
  • Very little is known of the ecology and behavior of L. rheocola. Individuals of this species call and breed year-round, although reproductive behavior decreases during the coolest weather [19,23]…
  • We used harmonic direction finding [32,33] to track individual L. rheocola and study patterns of movement, microhabitat use, and body temperatures during winter and summer…
  • The goal of our study was to understand the behavior of L. rheocola, and how it is affected by season and by sites that vary in elevation.
  • The goal of our study was to understand the behavior of L. rheocola, and how it is affected by season and by sites that vary in elevation…
  • We provide the first detailed information on the ecology and behavior of L. rheocola and suggest ecological mechanisms for observed patterns of infection dynamics…

and from other sections:

Methods:

  • Because L. rheocola are too small to carry radiotransmitters, we tracked frogs using harmonic direction finding [32,33].
  • However, this was unlikely to cause a bias toward shorter movements in our study; L. rheocola has strong site fidelity and when a frog was not found on a particular survey (or surveys), it was almost always subsequently found less than 2 m from its most recent known location.
  • Litoria rheocola is a treefrog, and individuals move along and at right angles to the stream and also climb up and down vegetation; therefore, they use all three dimensions of space, with their directions of movement largely unconstrained in the horizontal plane but largely restricted to movements up and down individual plants in the vertical direction.
  • These models lose and gain water at rates similar to frogs, and temperatures obtained from these permeable models are closely correlated with L. rheocola body temperatures [43].

Results:

  • Fig 1. Distances moved by common mistfrogs (Litoria rheocola).
  • Fig 2. Proximity of common mistfrogs (Litoria rheocola) to the stream.
  • Fig 3. Distribution of estimated body temperatures of common mistfrogs (Litoria rheocola) within categories relevant to Batrachochytrium dendrobatidis growth in culture (<15°C, 15–25°C, >25°C).
  • Fig 4. Mean estimated body temperatures of common mistfrogs (Litoria rheocola) over the 24-hr diel period.
  • Table 1. Characteristics of microhabitats used by common mistfrogs (Litoria rheocola) during the day and night at two rainforest streams (Frenchman Creek and Windin Creek) during the winter (cool/dry season) and summer (warm/wet season).
  • Table 2. Results of separate one-way ANOVAs comparing characteristics of nocturnal perch sites that were available to and used by common mistfrogs (Litoria rheocola) at two rainforest streams (Frenchman Creek and Windin Creek) during winter (cool/dry season).

Discussion:

  • Our study provides the first detailed information on the ecology and behavior of the common mistfrog (Litoria rheocola), an IUCN Endangered species [22].
  • Overall, we found that L.rheocola are relatively sedentary frogs that are restricted to the stream environment, and prefer sections of the stream with riffles, numerous rocks, and overhanging vegetation (Table 2).
  • Our data confirm that L. rheocola are active year-round, but their behavior varies substantially between seasons.
  • Retallick [31] also found that juvenile and adult L. rheocola in field enclosures altered their behavior by season in similar ways; frogs used elevated perches more often in summer, and aquatic microhabitats more often during winter.
  • Additionally, Hodgkison and Hero [30] observed more L. rheocola at the stream during warmer months, suggesting that during that period frogs used perch sites that were more exposed and elevated than those used during cooler months, when frogs were seen less frequently.
  • The sedentary behavior of L. rheocola also may increase the vulnerability of this species to chytridiomycosis, particularly during winter, when movements are reduced.
  • Our results suggest that seasonal differences in environmental temperatures and L. rheocola body temperatures should cause this species to be more likely to develop B. dendrobatidis infections during cooler months and at higher elevations (Figs 3 and 4); this matches observed patterns of infection prevalence [9].
  • Our study provides detailed information on the movements, microhabitat use, and body temperatures of uninfected L. rheocola, and reveals how these behaviors differ by season and between sites varying in elevation.

 

NOTE: We have shown all phrases in the PLOS ONE paper about L. rheocola. However, our ami software allows us to select particular sections for display, or to restrict our search and filtering (e.g. to the Methods section).
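To make the idea concrete, here is a minimal Python sketch of matching a species name and restricting the search to chosen sections. It is an illustration only and is not how the ami-species plugin is implemented: the function names, the dictionary layout of the paper and the naive sentence splitter are all assumptions made for this example. The sample sentences are taken from the snippets above, apart from two filler sentences that are marked as invented.

```python
import re

# Naive sentence boundary: end punctuation, whitespace, then a capital letter.
# This keeps "L. rheocola" together because "r" is lower case.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def species_pattern(genus, epithet):
    """Match the full binomial and its abbreviations,
    e.g. 'Litoria rheocola', 'L. rheocola' and 'L.rheocola'."""
    full = re.escape(genus) + r"\s+"
    abbreviated = re.escape(genus[0]) + r"\.\s*"
    return re.compile(r"\b(?:" + full + "|" + abbreviated + ")"
                      + re.escape(epithet) + r"\b")


def extract_snippets(sections, genus, epithet, only_sections=None):
    """Return (section, sentence) pairs that mention the species.

    sections      -- hypothetical layout: {section name: section text}
    only_sections -- optional set of section names to restrict the search
    """
    pattern = species_pattern(genus, epithet)
    snippets = []
    for name, text in sections.items():
        if only_sections is not None and name not in only_sections:
            continue
        for sentence in SENTENCE_BREAK.split(text):
            if pattern.search(sentence):
                snippets.append((name, sentence.strip()))
    return snippets


paper = {
    "Introduction": (
        "Amphibians are declining worldwide. "  # filler, invented for the example
        "Habitat modification and fragmentation also threaten L.rheocola [23,27]. "
        "Very little is known of the ecology and behavior of L. rheocola."
    ),
    "Methods": (
        "Because L. rheocola are too small to carry radiotransmitters, "
        "we tracked frogs using harmonic direction finding [32,33]. "
        "Surveys took place at night."  # filler, invented for the example
    ),
}

all_snippets = extract_snippets(paper, "Litoria", "rheocola")
methods_only = extract_snippets(paper, "Litoria", "rheocola",
                                only_sections={"Methods"})

for section, sentence in all_snippets:
    print(f"[{section}] {sentence}")
```

A real pipeline would need a proper sentence splitter and a species dictionary rather than a single hand-written pattern, but the shape of the output is the same: short, section-labelled snippets around each matched entity.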

This approach provides far more information than is indicated in the abstract for the paper. Yet it also condenses it down to useful and relevant facts. We think this could be very useful to many…

To see the full stream of facts output by the ami-species plugin go to http://facts.contentmine.org/. It isn’t filtered specifically for IUCN Red List species yet, but if you’re interested in seeing this happen or something similar, please get in contact with us over at the forum: http://discuss.contentmine.org/ or via twitter @theContentMine.

Audio interview with Peter Murray-Rust on the Data Skeptic Podcast (53 minutes)

In August, Peter Murray-Rust agreed to do an interview with Kyle Polich at Data Skeptic, “The podcast that is skeptical of and with data”. The interview was published online on 28th August 2015.

Data Skeptic is a podcast that alternates between short mini episodes with the host explaining concepts from data science to his non-data scientist wife, and longer interviews featuring practitioners and experts on interesting topics related to data, all through the eye of scientific skepticism.

ContentMine is a project which provides the tools and workflow to convert scientific literature into machine readable and machine interpretable data in order to facilitate better and more effective access to the accumulated knowledge of human kind. The program’s founder Peter Murray-Rust joins us this week to discuss ContentMine. Our discussion covers the project, the scientific publication process, copyright, and several other interesting topics.

Full transcript available here: ContentMine full transcript.

The draft transcript is ~98% accurate and Peter will edit this in due course.


 

Both the audio and transcript are licensed under Creative Commons CC-BY.

With so much valuable content, we shall in due course break this down into smaller, more manageable segments, but for now, enjoy the interview in full.

Just some of the topics covered:-

 

Guidance on how to use our Communities (CMunity)

In July 2015 we launched our CMunity web resources, created by integrating the Discourse platform with our own.

About the CMunities category.

Some general guidelines for using these resources follow.

This is a Civilized Place for Public Discussion

Please treat this discussion forum with the same respect you would a public park. We, too, are a shared community resource — a place to share skills, knowledge and interests through ongoing conversation.

These are not hard and fast rules, merely aids to the human judgment of our community. Use these guidelines to keep this a clean, well-lighted place for civilized public discourse.

Improve the Discussion

Help us make this a great place for discussion by always working to improve the discussion in some way, however small. If you are not sure your post adds to the conversation, think over what you want to say and try again later.

The topics discussed here matter to us, and we want you to act as if they matter to you, too. Be respectful of the topics and the people discussing them, even if you disagree with some of what is being said.

One way to improve the discussion is by discovering ones that are already happening. Please spend some time browsing the topics here before replying or starting your own, and you’ll have a better chance of meeting others who share your interests.

Keep It Tidy

Make the effort to put things in the right place, so that we can spend more time discussing and less cleaning up. So:

  • Don’t start a topic in the wrong category.
  • Don’t cross-post the same thing in multiple topics.
  • Don’t post no-content replies.
  • Don’t divert a topic by changing it midstream.
  • Don’t sign your posts — every post has your profile information attached to it.

Rather than posting “+1” or “Agreed”, use the Like button. Rather than taking an existing topic in a radically different direction, use Reply as a Linked Topic.

Post Only Your Own Stuff

You may not post anything digital that belongs to someone else without permission. You may not post descriptions of, links to, or methods for stealing someone’s intellectual property (software, video, audio, images), or for breaking any other law.

Powered by You

This site is operated by your friendly local staff and you, the community. If you have any further questions about how things should work here, open a new topic in the meta category and let’s discuss! If there’s a critical or urgent issue that can’t be handled by a meta topic or flag, contact us via the staff page.

Terms of Service

Yes, legalese is boring, but we must protect ourselves – and by extension, you and your data – against unfriendly folks. We have a Terms of Service describing your (and our) behavior and rights related to content, privacy, and laws. To use this service, you must agree to abide by our TOS.

The Content Mine one year on


This is my first blog post on the splendid new ContentMine website, and thanks to all those who have worked so hard. It’s exactly a year since we launched The Content Mine in Austria:

We launch The Content Mine In Vienna, Interviews, Talks and our first public Workshop

Here are some of the comments I made in that post:

Last week was one of the most exciting in my life – but also among the hardest I have worked. I travelled from Budapest to Vienna to be the guest of the Austrian Science Fund (FWF) and to give a lecture: .. I changed the title to “Open Notebook Science” in honour of the late Jean-Claude Bradley and to promote his ideas. My talk’s on Slideshare.

Then the “launch” of The Content Mine ( http://contentmine.org ), my Shuttleworth Fellowship project, which aims to extract 100,000,000 facts from the scientific literature. The philosophy is not that *I* do this but that *WE* do this.

have reliable, compelling, distributable software. That’s hard. But we’ve got one of the best small teams in the world – it would be harder to think of a better one. That’s because we are developer-scholars – we are not only very experienced in the coding and design of information, but we are also expert in our own right in our fields (Chemistry, Phylogenetics, Plant Genetics, and Informatics/Scholarly Publishing). That means we know where we are going, know what works (or rather what *doesn’t* work!) and know who else in the world is doing similar stuff. And because I’m funded by the Shuttleworth Foundation there’s a guarantee that we won’t get bought by Elsevier or Macmillan or Thomson-Reuters. I wouldn’t swap any of the team for ten million dollars – that’s how important they are to my life.

we are reaching out through workshops. We’re doing several this summer [2014] – Edinburgh, Berlin/OKFest, Wikimania, OK Brazil, and one or two more yet to be finalised. We’re informed by the Software Carpentry philosophy, where we run a workshop for a sponsor, and during the workshop train apprentices. Then these apprentices will be able to help run new workshops and then their own workshops.

Then the next day an all-day hack run by OKFN Austria (Stefan Kasberger [now on our team!] and Peter Kraker (Panton Fellow)) – report here. A wonderful hackspace (metalab), couches, soft drinks + honour payment, bits of kit lying around – graffiti – you know the sort of thing.

And then at the end 4 invited speakers (including PMR). We are very impressed by OKFN Austria – the day drew perhaps 25 people. And a lovely city.

But Exhausting! At the end I crashed for a long night. (In writing my Shuttleworth Quarterly report I was asked “What was your greatest loss during this quarter?” Answer: SLEEP!)

A year on we have remained true to the vision – we have run lots of workshops, developed lots of software and made lots of friends. And I now feel we have achieved “critical mass”. People know about us, want to share the vision and want mutual help. I now hope to blog frequently on this blog and to discuss strategy, technology and the wider social and political dimensions of content mining.

[Credits: Judith Murray-Rust, OKF Volunteers, license CC BY 4.0]