ContentMining Conifers and Visualising Output: Extracting Semantic Triples

Last month I wrote about visualising ContentMine output. It was limited to showing and grouping ami results which — in this case — were simply words extracted from the text.

A few weeks back, I got to know the ContentMine fact dumps. These are not just extracted words: they are linked to Wikidata Entities, from which all sorts of identifiers and properties can be derived. This is visualised by factvis. For example, when you hover over a country’s name, the population appears.

Now, I’m experimenting with the next level of information: semantic triples. Because NLP a step too far at the moment, I resorted to extracting these from tables. To see how this went in detail, please take a look at my blogpost. A summary: norma — the ContentMine program that normalises articles — has an issue with tables, table layout is very inconsistent across papers (which makes it hard to parse), and I’m currently providing Wikidata IDs for (hopefully) all three parts of the triple.

About that last one: I’ve made a dictionary of species names paired with Wikidata Entity IDs with this query. I limited the number of results because the Web Client times out if I don’t. I run the query locally using the following command:

curl -i -H "Accept: text/csv" --data-urlencode query@<queryFile>.rq -G https://query.wikidata.org/bigdata/namespace/wdq/sparql -o <outputFile>.csv

Then I can match the species names to the values I get when extracting from the table. If I can do this for the property as well, we’ll have a functioning program that creates Wikidata-grade statements from hundreds of articles in a matter of minutes.

There are still plenty of issues. Mapping Wikidata property names to text from tables, for example, or the fact that there are several species whose name are, when shorted, O. bicolor. I can’t know which one just from the context of the table. And all the table layout issues, although probably easier to fix, are still here. That, and upscaling, is what I’m going to focus on for the next month.

But for now, there’s a demo similar to factvis, to preview the triples:

ctj-factvis

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s