Mining Images for Identifiers

Figure images in scholarly articles typically contain a wealth of interesting data. In terms of textual data, identifiers such as GenBank/ENA accession numbers are relatively common in certain types of figure images. Take, for example, the figure below, reproduced from an Open Access article in IJSEM:

Figure 3, Nakai et al (2014) doi: 10.1099/ijs.0.060798-0 Licence: CC-BY

 

Most tip labels of the phylogeny above specify the exact accession number of the nucleotide sequence data that underlies the result, e.g. AB540021, the accession number for an Oligoflexus tunisiensis 16S rRNA partial sequence. This aids the reproducibility of the research presented.

But at scale, how can we discover, index and/or re-use this identifier information? It’s worth noting that for this paper, and all others like it in IJSEM, only the accession number of the taxon of interest is printed in the textual content of the paper or its supplementary materials. All the other accession numbers appear only in this image, as pixels, utterly unindexed and undiscoverable. So if you want to ask: which papers have used or referred to accession number ‘JF181808’? To answer that, you really have to delve into OCR technology to unlock this data from the image pixels.

With the ContentMine toolset, I’ve been using tesseract-ocr to perform OCR on a set of over 4,000 of these phylogenetic tree figures from the journal IJSEM. From this, it’s becoming obvious that these accession numbers were never designed with OCR-retrieval in mind! Unlike some other identifier systems, these identifiers have no ‘check digit’ to help validate them, so some OCR errors are much easier to detect and correct than others.
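A first line of defence is simply to check each OCR’d string against the basic accession number shape. Here’s a minimal sketch in Python; the regex covers only the common 1-letter + 5-digit and 2-letter + 6-digit nucleotide forms, not every INSDC format, and the function name is my own:

```python
import re

# Basic GenBank/ENA nucleotide accession shapes:
#   1 letter + 5 digits (e.g. U12345), or 2 letters + 6 digits (e.g. AB540021).
# Illustrative only -- it does not cover RefSeq, WGS or protein formats.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")

def is_valid_accession(candidate):
    """Return True if the OCR'd string matches the basic accession pattern."""
    return bool(ACCESSION_RE.match(candidate))
```

With this, `AB540021` and `DQ289039` pass, while `A8486128` and `FJ71077Z` are immediately flagged as OCR errors.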

 

Some examples of OCR error

  • (A8486128) — if OCR returns this eight-character string, we know it is an error of some sort, because GenBank/ENA accession numbers have a specific format which this string does not quite conform to: it is eight characters long, but its second character is a digit, where the format requires a letter. It is highly likely that the ‘8’ was originally a ‘B’ in the source image, so in this instance automatic error correction can safely be applied post-OCR.
  • (FJ71077Z) — in a similar example, the last character is a letter, not a digit, which does not conform to the prescribed accession number pattern. In this instance the ‘Z’ must be corrected to a ‘2’ to recover the real accession number.
  • (AB2B8551) — this example demonstrates the difference in difficulty between error detection and error correction. The 4th character should be a digit, but OCR has incorrectly read it as a ‘B’. That is easy to detect. But what was the original digit that the ‘B’ replaced? OCR commonly mistakes a ‘3’, ‘6’ or ‘8’ for a ‘B’, depending on the font used. Without further lookup and/or checking, this error cannot be safely corrected with certainty.
  • (DQ289039) — this is an example of difficult error detection and difficult error correction. If OCR incorrectly returns an ‘8’ in the numerical section of the identifier instead of a ‘3’ or a ‘6’, it is difficult to detect that an error has occurred at all, because the string is still a valid accession number. Furthermore, the most likely alternatives, DQ239039 and DQ269039, are also valid accession numbers. To detect and correct this error one really needs to cross-reference against another data source, e.g. the taxon name or strain information that is also given in the diagram.
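The position-aware reasoning in the examples above can be sketched as code. Note that the confusion maps below are illustrative assumptions on my part, not the result of a systematic study of tesseract’s errors, and the routine deliberately refuses to guess when a substitution is ambiguous (as with ‘B’ in a digit position):

```python
# Illustrative OCR confusion maps for the 2-letter + 6-digit accession shape.
DIGIT_TO_LETTER = {"8": "B", "0": "O", "1": "I", "5": "S", "2": "Z"}
LETTER_TO_DIGIT = {"O": "0", "I": "1", "S": "5", "Z": "2"}
# 'B' in a digit position is deliberately absent: it could be 3, 6 or 8.

def try_correct(candidate):
    """Attempt safe, position-aware correction of an 8-character OCR string.

    Returns the corrected accession, or None if any character cannot be
    corrected unambiguously."""
    if len(candidate) != 8:
        return None
    fixed = []
    for i, ch in enumerate(candidate):
        if i < 2:  # the first two positions must be letters
            if ch.isalpha():
                fixed.append(ch)
            elif ch in DIGIT_TO_LETTER:
                fixed.append(DIGIT_TO_LETTER[ch])
            else:
                return None
        else:      # the remaining six positions must be digits
            if ch.isdigit():
                fixed.append(ch)
            elif ch in LETTER_TO_DIGIT:
                fixed.append(LETTER_TO_DIGIT[ch])
            else:
                return None  # ambiguous, e.g. 'B' -> 3, 6 or 8
    return "".join(fixed)
```

Run against the examples above, `A8486128` becomes `AB486128` and `FJ71077Z` becomes `FJ710772`, while `AB2B8551` comes back as None, exactly the detect-but-don’t-correct behaviour we want. The DQ289039 case sails straight through, of course; no string-level check can catch it.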

 

Automatic error detection is thus somewhat easier than automatic error correction, and neither is infallible. In processing this accession data from the 4,000+ figure images I must work hard to provide a valid estimate of the accuracy of both my error detection and my error correction routines, for the data to be usable.
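One straightforward way to produce such an estimate is to hand-check a random sample of the OCR’d strings and score the automatic routines against that ground truth. A minimal sketch; the function and its inputs are hypothetical, for illustration only:

```python
def detection_scores(flagged, truly_wrong):
    """Precision and recall of an error-detection routine.

    `flagged` is the set of string ids the routine flagged as errors;
    `truly_wrong` is the set of ids a human verified to be real OCR errors."""
    true_positives = len(flagged & truly_wrong)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_wrong) if truly_wrong else 0.0
    return precision, recall
```

Precision tells you how much you can trust a flag; recall tells you how many real errors (like the DQ289039 case) slip past unflagged.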

 

What can we do with this identifier data? One example…

 

There’s a trend towards increased data citation in the biological sciences at the moment. Use and re-use of molecular sequence data in phylogenetic analyses is almost never formally cited. But by extracting these accession numbers from figure images, combined with some simple lookup code to determine authors, we can retrospectively look at whose data people have been using: who was responsible for originally sequencing the phylogenetically useful sequences (typically 16S rRNA) from these organisms. Using some very rough, first-pass OCR data obtained from phylogenetic tree images in 3864 IJSEM papers, I can generate a tantalizing ‘leaderboard’ of microbial sequence contributors (most sequences have multiple authors, like academic papers):

 

Author Name (String)    Number of Contributed Sequences (used in 3864 IJSEM papers)

Stackebrandt,E.         5472
Rainey,F.A.             3749
Collins,M.D.            2890
Woese,C.R.              2810
Schumann,P.             2787
Swings,J.               2244
Yoon,J.H.               2241
Kroppenstedt,R.M.       1936
Christen,R.             1885
Vancanneyt,M.           1694
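Once accession numbers have been matched to author lists (e.g. by fetching the corresponding GenBank records), the leaderboard itself is just a tally. A sketch, assuming a pre-built accession-to-authors mapping; the sample data below is illustrative, not real lookup output:

```python
from collections import Counter

def author_leaderboard(accession_authors):
    """Tally how many sequences each author contributed.

    `accession_authors` maps accession number -> list of author names
    (a hypothetical, pre-built mapping from GenBank record lookups)."""
    counts = Counter()
    for accession, authors in accession_authors.items():
        counts.update(authors)  # every co-author gets credit for the sequence
    return counts.most_common()

sample = {
    "AB540021": ["Nakai,R.", "Naganuma,T."],
    "DQ289039": ["Stackebrandt,E.", "Rainey,F.A."],
    "FJ710772": ["Stackebrandt,E."],
}
# author_leaderboard(sample)[0] -> ('Stackebrandt,E.', 2)
```

The same accession appearing in several papers counts once per paper in my real tally, which is why the numbers above can exceed the number of unique sequences.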

 

It’s no surprise to see Carl Woese in the top 10, as he was one of the early pioneers of the use of 16S rRNA for bacterial phylogenetics. The others were news to me, as I’m not a specialist in the area. Amazingly, Stackebrandt appears to have contributed to the sequencing of more than one sequence per paper, on average, across my sample. This isn’t too surprising in itself: there are often 30 or more sequences used in a single phylogenetic analysis. But still, that’s an impressive impact on the field!

Could this be the start of altmetrics for data-usage? I don’t know. But it sure is interesting to see who’s behind all this sequencing data that is so commonly re-used by others. Credit where credit is due, even if it’s many years after the fact.


Content Mining: Extraction of data from Images into CSV files – step 0

Last week I showed how we can automatically extract data from images. The example was a phylogenetic tree, and although lots of people think these are wonderful, even more will have switched off. So now I’m going to show how we can analyse a “graph” and extract a CSV file. This will be in instalments, so that you will be left on a daily cliff-edge… (actually it’s because I am still refining and testing the code). I am taking the example from “Acoustic Telemetry Validates a Citizen Science Approach for Monitoring Sharks on Coral Reefs” (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0095565) [I’ve not read it, but I assume they got volunteers to see how long they could evade being eaten, with and without the control].

Anyway here’s our graph. I think most people can understand it. There’s:

  • an x-axis, with ticks, numbers (0-14), title (“Sharks detected”) and units (“Individuals/day”)
  • a y-axis, with ticks, numbers (0-20), title (“Sharks observed”) and units (“Individuals/day”)
  • 12 points (black diamonds)
  • 12 error bars (like TIE fighters), appearing to be symmetric
  • one “best line” through the points

[Image: the raw plot]

We’d like to capture this as CSV. If you want to sing along, follow: http://www.bitbucket.org/petermr/diagramanalyzer/org.xmlcml.diagrams.plot.PlotTest (the link will point to a static version – i.e. not updated as I add code).

This may look simple, but let’s magnify it:

[Image: magnified view of the plot, showing gray antialiasing pixels]

Whatever has happened? The problem is that we have a finite number of pixels. We might paint them black (0) or white (255) but this gives a jaggy effect which humans don’t like. So the plotting software adds gray pixels to fool your eye. It’s called antialiasing (not a word I would have thought of). So this means the image is actually gray.

Interpreting a gray scale image is tough, and most algorithms can only count up to 1 (binary), so we “binarize” the image. That means each pixel becomes either 0 (black) or 1 (white). This has the advantage that the file/memory footprint can be much smaller, and also that we can do topological analyses as in the last blog post. But it throws information away, and if we are looking at (say) small characters this can be problematic. However, it’s a standard first step for many people, and we’ll take it.

The simplest way to binarize a gray scale (which goes from 0 to 255 in unit steps) is to classify 0-127 as “black” and 128-255 as “white”. So let’s do that:
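In code, that fixed-threshold binarization is a one-liner. Here is a sketch over a flat list of gray values (real images would be 2-D pixel arrays, but the idea is identical; the default threshold of 128 matches the 0–127/128–255 split described above):

```python
def binarize(gray_pixels, threshold=128):
    """Map gray values 0-255 to 0 (black) or 1 (white).

    Values 0..threshold-1 become 0 (black); threshold..255 become 1 (white)."""
    return [0 if value < threshold else 1 for value in gray_pixels]

# binarize([0, 64, 127, 128, 200, 255]) -> [0, 0, 0, 1, 1, 1]
```

The antialiasing grays simply get snapped to whichever side of the threshold they fall on, which is exactly how the jagged binary image below arises.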

[Image: the plot binarized with the default threshold (unthinned)]

 

Now if we zoom in we can see the pixels are binary:

[Image: zoomed-in view showing purely binary pixels]

So this is the next step on our journey: how are we going to turn this into a CSV file? Not quite as simple as I have made it out to be – keep your brain in gear…

I’ll leave you on the cliff edge…