Examining variation in XML section tagging

This post is a brief overview, with examples, of the variation one can encounter in the section tagging of full-text academic journal articles in XML. The ContentMine aims to provide tools with which to mine academic articles by section (as an option), e.g. abstract, introduction, materials and methods, results, conclusion, acknowledgements, and references. Automatically identifying and tagging the distinct sections of an article is not as easy as you might think.

New software published this year allows Europe PMC to offer search by section on their full text open access subset:

Kafkas, Ş., Pi, X., Marinos, N., Talo’, F., Morrison, A., and McEntyre, J. R. 2015. Section level search functionality in Europe PMC. Journal of Biomedical Semantics

From testing against a set of 100 randomly-chosen, open access XML full text articles, the authors report that their open source software can tag sections with over 99% precision and over 96% recall. Impressive stuff. But why not 100%? In this post I hope to give you an insight into why tagging-up distinct sections of XML full text articles isn’t a completely solved problem yet.
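For readers less familiar with the two metrics, here is a minimal sketch of how precision and recall could be computed for a section tagger. The function name and the toy data are my own illustration and are not taken from the paper.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of (article_id, label) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: what fraction of the tags we emitted were correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the true sections did we manage to tag?
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: the tagger misses one section but makes no wrong calls.
gold = {("a1", "intro"), ("a1", "methods"), ("a1", "results"), ("a2", "intro")}
predicted = {("a1", "intro"), ("a1", "methods"), ("a2", "intro")}

p, r = precision_recall(predicted, gold)
print(p, r)
```

A tagger that stays silent on sections it cannot recognise keeps its precision high while its recall drops, which is the pattern the reported figures show.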

Current examples of sectioning from NLM XML

From what I’ve looked at so far (and I haven’t looked at ALL tags, just some), I think PeerJ has logical and consistent section tagging. It has the traditional IMRAD sections and order (IMRAD is shorthand for “introduction, materials & methods, results and discussion”), as well as less commonly tagged-up sections such as the funding statement (funding-group).

<abstract>
<funding-group>
<sec sec-type="intro">
<sec sec-type="materials|methods">
<sec sec-type="results">
<sec sec-type="discussion">
<sec sec-type="conclusions">
<sec sec-type="supplementary-material" ... >
<ack> #acknowledgements
<sec sec-type="additional-information">
<ref-list content-type="authoryear">
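A listing like the one above can be pulled out of an article with a few lines of Python. This is only a sketch using the standard library's ElementTree; the sample document is a cut-down, hypothetical article, and I am assuming the main sections sit directly under a body element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed article used only for illustration.
SAMPLE = """<article>
  <front><article-meta>
    <abstract><p>...</p></abstract>
    <funding-group/>
  </article-meta></front>
  <body>
    <sec sec-type="intro"><title>Introduction</title></sec>
    <sec sec-type="materials|methods"><title>Materials &amp; Methods</title></sec>
    <sec sec-type="results"><title>Results</title></sec>
    <sec><title>Discussion</title></sec>
  </body>
</article>"""


def top_level_sections(xml_text):
    """Return (sec-type, title) for each <sec> directly under <body>."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    if body is None:
        return []
    sections = []
    for sec in body.findall("sec"):
        title_el = sec.find("title")
        # itertext() also picks up text inside inline markup such as <italic>.
        title = "".join(title_el.itertext()).strip() if title_el is not None else ""
        sections.append((sec.get("sec-type"), title))
    return sections


for sec_type, title in top_level_sections(SAMPLE):
    print(sec_type, "|", title)
```

Sections without a sec-type attribute come back as None, which makes the unlabelled ones easy to spot.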

I was a little disappointed with eLife XML. They appear to tag only the IM of IMRAD, with no named section for the results and discussion. I shall talk to Ian about this next time I see him on my way to Cambridge.

<abstract>
<funding-group>
<sec sec-type="intro" ... >
<sec id="s2"> #Typically (always?) the start of the results section
<sec sec-type="materials|methods"  ... >
<sec sec-type="funding-information">
<sec sec-type="additional-information">
<sec sec-type="supplementary-material">
<ack>
<ref-list>

PLOS ONE similarly eschews a named section for results and discussion. In the article I chose for the example (below), the results and discussion section has the id “sec014” and the conclusions section has the id “sec021”. In other papers these ids will almost certainly belong to different types of section, so such labels are not as helpful as they could be.

<abstract>
<funding-group>
<custom-meta-group><custom-meta id="data-availability">
<sec sec-type="intro" ... >
<sec sec-type="materials|methods" ... >
<sec id="sec014"><title>Results &amp; Discussion</title>
<sec id="sec021"><title>General Discussion and Conclusions</title>
<sec sec-type="supplementary-material" ... >
<ack>
<ref-list>
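When the sec-type attribute is missing, as with the PLOS ONE results section above, one fallback is to guess the section from its title. The sketch below is my own illustration, not how any particular tool does it, and the keyword table is a hypothetical starting point that would need to be much larger in practice.

```python
# Hypothetical keyword table mapping a label to title fragments that suggest it.
TITLE_KEYWORDS = {
    "intro": ("introduction", "background"),
    "methods": ("method", "experimental section"),
    "results": ("result",),
    "discussion": ("discussion",),
    "conclusions": ("conclusion",),
}


def labels_from_title(title):
    """Return every label whose keywords occur in the section title."""
    lowered = title.lower()
    return {
        label
        for label, keywords in TITLE_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    }


print(labels_from_title("Results & Discussion"))
print(labels_from_title("General Discussion and Conclusions"))
```

Returning a set rather than a single label matters here, because combined sections such as “Results & Discussion” legitimately belong to more than one category.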

The F1000 Research article I examined (PMC4431385) had an unlabelled discussion and conclusions section:

<abstract>
<funding-group>
<sec sec-type="intro">
<sec sec-type="methods">
<sec sec-type="results">
<sec><title>Discussion and conclusions</title>
<ack>
<ref-list>

BMC Evolutionary Biology seems to label only MRAD. The introduction, or ‘Background’ section as they call it, is not labelled with sec-type="intro" as it is by most other OA publishers:

<abstract>
<sec><title>Background</title>
<sec sec-type="methods">
<sec sec-type="results">
<sec sec-type="discussion">
<sec sec-type="conclusion">
<sec sec-type="supplementary-material">
<sec><title>Acknowledgments</title>
<sec><title>Authors’ contributions</title>
<sec><title>Competing interests</title>
<sec><title>Availability of supporting data</title>
<sec><title>Acknowledgements</title>
<ref-list>

MDPI’s Toxins has precisely zero IMRAD sections labelled, nor could I find any discernible funding statements. Below is one example article:

<abstract>
<sec><title>1. Introduction</title><p>The contamination of feed
<sec><title>2. Results and Discussion</title><sec><title>2.1. P
<sec><title>2.2. Zearalenone Adsorption Screening and Correlati
<sec><title>3. Experimental Section </title><sec><title>3.1. My
<sec><title>3.2. CEC and Exchangeable Base Cations</title><p>A 
<sec><title>3.3. Other Characterization Tests</title><p>To meas
<sec><title>3.4. Zearalenone Adsorption Screening</title><p>A s
<sec><title>3.5. Statistical Analysis</title><p>Analysis of var
<sec><title>4. Conclusions</title><p>Twenty-seven frequently-us
<ack>
<ref-list>
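With titles like these, the numbering is the only structural hint left. A small sketch of how one might split the number from the title text and keep only the top-level sections follows; the regular expression and function name are my own, and a title that genuinely starts with a number (a year, say) would fool it.

```python
import re

# One or more dot-separated integers, an optional trailing dot, then the title text.
NUMBERED = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(.*\S)\s*$")


def split_numbered_title(title):
    """Return (numbering tuple, title text); the tuple is empty if unnumbered."""
    match = NUMBERED.match(title)
    if not match:
        return (), title.strip()
    numbers = tuple(int(part) for part in match.group(1).split("."))
    return numbers, match.group(2)


titles = [
    "1. Introduction",
    "2. Results and Discussion",
    "3. Experimental Section ",
    "3.4. Zearalenone Adsorption Screening",
    "3.5. Statistical Analysis",
    "4. Conclusions",
]

# A single number means a top-level section; "3.4" is a subsection of section 3.
top_level = [text for numbers, text in map(split_numbered_title, titles) if len(numbers) == 1]
print(top_level)
```

The stripped title text can then be matched against whatever vocabulary of section names you trust.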

NPG’s Nature varied in labelling depending upon the article type. ‘Letters’ do not appear to have labelled “intro” sections (too short?), whereas full ‘Articles’, in Nature’s parlance, do have labelled introduction sections.

<abstract>
<sec sec-type="intro" ... >
<sec sec-type="methods" ... >
<sec sec-type="supplementary-material" ... >
<sec sec-type="results" ... >
<sec sec-type="discussion" ... >
<sec sec-type="extended-data" ... >
<ack ... >
<ref-list>

Elsevier’s Biological Conservation also appears to go for MRAD labelling:

<abstract ... >
<sec sec-type="methods" ... >
<sec sec-type="materials|methods" ... >
<sec sec-type="results" ... >
<sec sec-type="discussion" ... >
<sec sec-type="conclusions" ... >
<sec sec-type="supplementary-material" ... >
<funding-source ... >
<ack ... >
<ref-list>

Wiley’s Evolution also follows MRAD:

<abstract>
<sec sec-type="methods"><title>Methods</tit
<sec sec-type="results"><title>Results</tit
<sec sec-type="discussion"><title>Discussio
<sec sec-type="supplementary-material">
<ack>
<ref-list>

Hindawi’s Anemia XML resembled MDPI’s approach – no labelled IMRAD sections. In article PMC4334859 (XML snippets below) there wasn’t even an <abstract> tagged section. In fairness I examined another article from this journal (PMC4312619) and found that this one did have an <abstract> tagged section.

<sec id="sec1"><title>1. Background</title><p>Iron deficiency 
<sec id="sec2"><title>2. Methods </title><p>This study was con
<sec id="sec3"><title>3. Result</title><sec id="sec3.1"><title
<sec id="sec3.2"><title>3.2. Hematological and Ferritin Status
<sec id="sec3.3"><title>3.3. Grouping Study Participants</titl
<sec id="sec3.4"><title>3.4. Correlations between Mothers and 
<sec id="sec4"><title>4. Discussion</title><p>In our study, we
<sec id="sec5"><title>5. Conclusion </title><p>Median hemoglob
<sec sec-type="conflict"><title>Conflict of Interests</title><
<sec><title>Authors' Contribution</title><p>Betelihem Terefe
<ack>
<ref-list>
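Given that even the abstract can go missing, it seems wise to check which elements are actually present before mining by section. Here is a minimal sketch; the list of expected tags and the sample document are hypothetical choices of mine.

```python
import xml.etree.ElementTree as ET

# Tags we would like to find somewhere in the article (an illustrative choice).
EXPECTED = ("abstract", "ack", "ref-list")

# Hypothetical trimmed article with no <abstract>.
SAMPLE = """<article>
  <front><article-meta/></front>
  <body>
    <sec id="sec1"><title>1. Background</title></sec>
    <sec id="sec2"><title>2. Methods</title></sec>
  </body>
  <back><ack><p>Thanks.</p></ack><ref-list/></back>
</article>"""


def missing_elements(xml_text, expected=EXPECTED):
    """Return the expected tags that do not occur anywhere in the document."""
    root = ET.fromstring(xml_text)
    # iter(tag) walks the whole tree; next(..., None) is None when nothing matches.
    return [tag for tag in expected if next(root.iter(tag), None) is None]


print(missing_elements(SAMPLE))
```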

LWW’s Academic Medicine also had little labelling: no funding section, and no use of the <ack> tag.

<abstract>
<sec><title>Problem</title><p>Annually affe
<sec><title>Approach</title><p>Research ind
<sec><title>Outcomes</title><p>By October 2
<sec><title>Next Steps</title><p>Future eva
<sec><title>Problem</title><p>Sepsis is a g
<sec><title>Approach</title><sec><title>Cre
<sec><title>User experience.</title><p>Sept
<sec><title>Design rationale.</title><p>A m
<sec><title>Harnessing emerging trends and 
<sec><title>Gamification.</title><p>Septris
<sec><title>Technology.</title><p>The desig
<sec><title>Evaluating Septris</title><p>Gi
<sec sec-type="methods"><title>Evaluation m
<sec><title>Measuring worldwide game dissem
<sec><title>Measuring impact on learner kno
<sec><title>Outcomes</title><p>Septris had 
<sec><title>Next Steps</title><p>To our kno
<sec><title/><p><italic>Acknowledgments:</i
<ref-list>

Conclusions

It’s a mixed bag: some publisher XML is clearly better or worse than others on an objective basis. I have a lot of respect for Şenay Kafkas’s work on tagging sections, now that I’ve seen just how varied they can be!

In my biased opinion, as a systematist, I think Pensoft has the best XML markup. Their website provides TaxPub marked-up content, with special bits that are super useful to taxonomists. Check out how descriptive this is:

<abstract>
<sec sec-type="Citation">
        <sec sec-type="Introduction">
        <sec sec-type="materials|methods">
            <sec sec-type="Sources of material and abbreviations">
            <sec sec-type="Measurements and indices">
            <sec sec-type="Genetic analysis">
        <sec sec-type="Results">
            <sec sec-type="Key to species of Crematogaster fraxatrix-group">
        <sec sec-type="Taxonomy">
                <tp:treatment-sec sec-type="Type locality">
                <tp:treatment-sec sec-type="Type-specimens">
                <tp:treatment-sec sec-type="Measurements and indices">
                <tp:treatment-sec sec-type="Diagnosis">
                <tp:treatment-sec sec-type="Worker description">
                <tp:treatment-sec sec-type="Distribution">
                <tp:treatment-sec sec-type="Etymology">
                <tp:treatment-sec sec-type="Type material examined">
                <tp:treatment-sec sec-type="Other material examined">
                <tp:treatment-sec sec-type="Measurements and indices">
                <tp:treatment-sec sec-type="Diagnosis">
                <tp:treatment-sec sec-type="Worker description">
                <tp:treatment-sec sec-type="Distribution">
                <tp:treatment-sec sec-type="Type locality">
                <tp:treatment-sec sec-type="Type-specimens">
                <tp:treatment-sec sec-type="Other material examined">
                <tp:treatment-sec sec-type="Measurements and indices">
                <tp:treatment-sec sec-type="Diagnosis">
                <tp:treatment-sec sec-type="Worker description">
                <tp:treatment-sec sec-type="Distribution">
                <tp:treatment-sec sec-type="Etymology">
	<ack>
	<ref-list>

An issue Pensoft & NLM might want to confer on is how some of that perfect XML seems to get mangled on its way into PubMed Central. When downloaded from Europe PubMed Central, the XML for Pensoft articles appears to lose the sec-type label on the introduction section, which is a pity:

<abstract>
<sec><title>Introduction</title>
<sec sec-type="Citation">
...
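One way to spot this kind of loss is to compare the same article from both sources and list the sections whose sec-type has disappeared. The sketch below matches sections by title, which is a simplifying assumption of mine, and the two inputs are hypothetical toy documents.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy versions of one article from two sources.
PUBLISHER = """<article><body>
  <sec sec-type="Introduction"><title>Introduction</title></sec>
  <sec sec-type="Results"><title>Results</title></sec>
</body></article>"""

REPOSITORY = """<article><body>
  <sec><title>Introduction</title></sec>
  <sec sec-type="Results"><title>Results</title></sec>
</body></article>"""


def sec_types_by_title(xml_text):
    """Map each section title to its sec-type attribute (None if absent)."""
    root = ET.fromstring(xml_text)
    result = {}
    for sec in root.iter("sec"):
        title_el = sec.find("title")
        if title_el is not None:
            result["".join(title_el.itertext()).strip()] = sec.get("sec-type")
    return result


def lost_labels(publisher_xml, repository_xml):
    """Return sections labelled in the publisher XML but unlabelled in the other copy."""
    before = sec_types_by_title(publisher_xml)
    after = sec_types_by_title(repository_xml)
    return {
        title: sec_type
        for title, sec_type in before.items()
        if sec_type is not None and title in after and after[title] is None
    }


print(lost_labels(PUBLISHER, REPOSITORY))
```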

Do get in contact with me @rmounce if you feel these snippets unfairly represent your markup. This wasn’t intended to be a systematic survey, merely a brief examination of the variation one can encounter.
