Content Mining for Science at OAI9


Next week The Content Mine will be at the CERN Workshop on Innovations in Scholarly Communication (OAI9), presenting our poster. (I’ll outline the various sections, with larger images, in later posts.) In summary:

  • COMMUNITY is at the heart. We are here for your benefit. Our strategy is to support subcommunities in learning and deploying the technology. Our models are Wikipedia, OpenStreetMap, OpenKnowledge (OKFN), and many other communities which scale by having the freedom to innovate around central technologies and visions.
  • We have an exciting new deployable architecture (top panel). This shows our main strategy:
    • CRAWL. We now have many sources of scholarship, both Open and Closed. Remember that we are legally allowed to mine closed material to which we have legal access and also to publish Facts from this process as these are uncopyrightable.
    • SCRAPE. We can scrape material from many publishers, and because of our growing community we can rapidly add those which are needed. A scraper can take as little as 15 minutes to write. Scraping downloads not only the full text but also figures, supplemental data and other resources.
    • NORMAlize. Unfortunately, with a few honourable exceptions, the technical processes of most publishers destroy information. To mine it we have to recover this loss, often heuristically. The result is publisher-independent ScholarlyHTML. Mining normalized material is vastly easier than mining raw publications.
    • MINE. AMI makes it straightforward to interface your algorithms with the ContentMine. We expect communities of practice, in their own scientific domains, to write specific plugins.
    • CATalogue. The results of mining go directly into an ElasticSearch catalogue, which can be searched online.
    • Canary. The tools above can be launched online through a web-interface, Canary.
  • Mining algorithms and tools (mid-level panels). We have a wide range of new approaches to mining, beyond the traditional “text-mining” (first panel). We can extract images, tables and diagrams from papers and supplementary material, and in many cases extract valuable semantic scientific information. The examples below show:
    • DATA PLOTS. In many cases data is published in graphical form, effectively destroying its content. Even sighted humans can only extract it using rulers or by clicking pixels. Here we show the complete automatic (< 1 second) extraction of data from a PDF file with effectively no loss of semantics or precision. We are also tackling BAR plots and FLOWCHARTS.
    • PHYLOGENETIC TREES. Phylogenetic trees cost hundreds of millions of USD to calculate, but the data are only published as graphics. In many cases we can extract the trees from those graphics and create accurate, structured information.
    • CHEMISTRY, COMPOUNDS and REACTIONS (not shown). This is a particular, and largely unique, strength. My group in Chemistry showed that it was possible to extract 100,000 reactions per day from the patent literature.
  • Outreach and services. Because content-mining has not been generally available to scientists, we run workshops; we’ve found that people can install our tools and start mining after 30 minutes’ instruction. Contact us if you are interested. Content-mining is now a very active political topic, especially in Europe, where advocates promote its value and many mainstream publishers lobby against giving greater freedom to those who want to use the results of the scientific work they have paid for. Support us in our mantra:
  • “The RIGHT to READ is the RIGHT to MINE”
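To make the SCRAPE step concrete, here is a minimal sketch of the kind of per-publisher scraper a community member might write in a quarter of an hour. It uses only Python’s standard library; the tag and class names, and the example page, are invented for illustration and do not correspond to any real publisher or to our actual scraper toolkit.

```python
from html.parser import HTMLParser

class LandingPageScraper(HTMLParser):
    """Toy per-publisher scraper: collect the full-text PDF link and the
    figure images from a journal landing page.  A real scraper encodes
    one publisher's actual HTML conventions; these are made up."""
    def __init__(self):
        super().__init__()
        self.fulltext = None
        self.figures = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and a.get("href", "").endswith(".pdf"):
            self.fulltext = a["href"]
        if tag == "img" and "figure" in a.get("class", ""):
            self.figures.append(a.get("src"))

# An invented landing page standing in for a downloaded one.
page = """<html><body>
<a href="/article/10.0001/demo.pdf">PDF</a>
<img class="figure" src="/figures/fig1.png">
<img class="figure" src="/figures/fig2.png">
</body></html>"""

scraper = LandingPageScraper()
scraper.feed(page)
print(scraper.fulltext)   # /article/10.0001/demo.pdf
print(scraper.figures)    # ['/figures/fig1.png', '/figures/fig2.png']
```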
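The NORMAlize step can be pictured as a mapping from each publisher’s private markup onto one shared vocabulary. This sketch uses invented class names on both sides — it is the idea of normalization, not the real ScholarlyHTML schema or our norma tool:

```python
# Hypothetical publisher-specific class names mapped onto a shared,
# publisher-independent vocabulary (illustrative, not the real schema).
CLASS_MAP = {
    "art-title": "title",
    "authblock": "contributor",
    "abstr": "abstract",
}

def normalize(html: str) -> str:
    """Rewrite publisher-specific class attributes to the shared ones."""
    for publisher_class, shared_class in CLASS_MAP.items():
        html = html.replace('class="%s"' % publisher_class,
                            'class="%s"' % shared_class)
    return html

raw = '<div class="abstr">Facts are uncopyrightable.</div>'
print(normalize(raw))  # <div class="abstract">Facts are uncopyrightable.</div>
```

Once every input looks the same, a single mining plugin works across all publishers — which is why mining normalized material is so much easier than mining raw publications.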
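As an illustration of the kind of plugin a community of practice might contribute to the MINE step, here is a toy species-name extractor. The regex is deliberately naive and is not AMI’s actual implementation; it only shows the shape of a plugin — normalized text in, Facts out:

```python
import re

# Naive binomial-name pattern: a capitalized genus followed by a
# lower-case epithet.  Illustrative only; real plugins use curated
# dictionaries and far more careful patterns.
SPECIES = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

def species_plugin(text):
    """Return the candidate species names found in normalized text."""
    return sorted(set(SPECIES.findall(text)))

text = "Samples of Homo sapiens and Mus musculus were sequenced."
print(species_plugin(text))  # ['Homo sapiens', 'Mus musculus']
```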
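The CATalogue step receives each mined Fact as a JSON document for ElasticSearch to index. A sketch of what one such document might look like — the field names are illustrative, not our production schema:

```python
import json

# A single mined Fact, shaped as a JSON document ready for indexing.
# Field names and values are invented for illustration.
fact = {
    "fact": "Mus musculus",
    "type": "species",
    "doi": "10.0001/demo",
    "plugin": "species",
    "context": "samples of Mus musculus were sequenced",
}
body = json.dumps(fact)
print(body)
```

Because each document carries its source DOI and surrounding context, a search hit in the catalogue always points back to the paper it was mined from.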
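The DATA PLOTS extraction rests on a simple idea: once the axis ticks and their labelled values have been located in the image, every detected data-point pixel maps linearly to data coordinates. A sketch with invented pixel positions (not our actual plot-extraction code):

```python
def calibrate(p0, v0, p1, v1):
    """Given two pixel positions and their labelled axis values,
    return a pixel -> value function for one linear axis."""
    scale = (v1 - v0) / (p1 - p0)
    return lambda p: v0 + (p - p0) * scale

# Invented calibration: x axis pixel 100 -> 0.0, pixel 500 -> 10.0;
# y axis runs upward in data but downward in pixel coordinates.
x_of = calibrate(100, 0.0, 500, 10.0)
y_of = calibrate(400, 0.0, 80, 1.0)

point_px = (300, 240)  # a detected data-point centroid, in pixels
print((x_of(point_px[0]), y_of(point_px[1])))  # approximately (5.0, 0.5)
```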

Published by

the bear

I have another blog in real life...
