Entry points to the ContentMine toolchain

Where should you start when content mining? It really depends on what you've already got and where you plan to source your content from. In this bite-size post I'll cover the three major entry points to the ContentMine toolchain.


We envisage that there will be 3 significant entry points to the ContentMine toolchain:

  1. From academic content aggregator websites via getpapers (most recommended route, if possible)
  2. From journal websites via quickscrape
  3. From user-supplied files on your local desktop, fed directly to norma (least recommended)


All three of these entry points pass content to norma, which normalises the content to be mined to ContentMine standards and specifications prior to analysis and visualisation by downstream ContentMine tools.


Here are some command-line examples of how each of these entry points works:


1.) The ideal workflow, if your subject matter or resource provider allows it, is to take standardised XML (e.g. NLM XML from Europe PubMed Central, EPMC) and work with this highly structured content. The example below is taken from a previous blog post on finding species.

getpapers --query 'species JOURNAL:"PLOS ONE" AND FIRST_PDATE:[2015-04-02 TO 2015-04-02]' \
          -x --outdir plos-species
norma -q plos-species/ -i fulltext.xml -o scholarly.html --transform nlm2html
#downstream analyses proceed on normalised content from here onwards...
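
After the norma step it can be handy to confirm that every article actually received a scholarly.html. The following is only a rough sketch, not part of the ContentMine tools: `missing_output` is a hypothetical helper name, and it assumes each article sits in its own subdirectory of the project folder, as the plos-species/ example suggests.

```shell
#!/bin/sh
# List article directories under a project folder that lack a given file.
# Assumption: one subdirectory per article inside the project folder.
# missing_output is a hypothetical helper, not a ContentMine command.
missing_output() {
    project_dir=$1
    wanted=${2:-scholarly.html}
    for article_dir in "$project_dir"/*/; do
        # Skip the unexpanded glob when the project folder has no subdirectories.
        [ -d "$article_dir" ] || continue
        if [ ! -f "${article_dir}${wanted}" ]; then
            # Print the directory without its trailing slash.
            printf '%s\n' "${article_dir%/}"
        fi
    done
}

# Example: report articles that still have no normalised output.
# missing_output plos-species scholarly.html
```

An empty result means every article directory contains the file you asked about.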


2.) If your subject matter isn’t covered by IEEE / arXiv / Europe PubMed Central, or the content isn’t available there for some other reason such as pesky embargo periods, then you can try to enter content into the ContentMine toolchain via quickscrape. In terms of file format, the order of preference (from best to worst) is XML > HTML > PDF. Sadly, many legacy subscription-access publishers choose not to expose content in XML on their journal websites. The HTML workflow goes from publisher HTML -> tidied-up XHTML -> scholarly HTML.

#quickscrape usage on a Nature Communications paper
quickscrape --url http://www.nature.com/ncomms/journal/v1/n3/abs/ncomms1031.html \
            --scraper journal-scrapers/scrapers/nature.json --output natcomms
info: quickscrape 0.4.5 launched with...
info: - URL: http://www.nature.com/ncomms/journal/v1/n3/abs/ncomms1031.html
info: - Scraper: /home/ross/workspace/quickscrape/journal-scrapers/scrapers/nature.json
info: - Rate limit: 3 per minute
info: - Log level: info
info: urls to scrape: 1
info: processing URL: http://www.nature.com/ncomms/journal/v1/n3/abs/ncomms1031.html
info: [scraper]. URL rendered. http://www.nature.com/ncomms/journal/v1/n3/abs/ncomms1031.html.
info: [scraper]. download started. fulltext.html.
info: [scraper]. download started. ncomms1031-s1.pdf.
info: [scraper]. download started. fulltext.pdf.
info: URL processed: captured 11/25 elements (14 captures failed)
info: all tasks completed

tree natcomms/
└── http_www.nature.com_ncomms_journal_v1_n3_abs_ncomms1031.html
    ├── fulltext.html
    ├── fulltext.pdf
    ├── ncomms1031-s1.pdf
    └── results.json

1 directory, 4 files

#norma steps
norma -i fulltext.html -o fulltext.xhtml --cmdir natcomms/ --html jsoup
norma -i fulltext.xhtml -o scholarly.html --cmdir natcomms/ --transform nature2html

tree natcomms/
└── http_www.nature.com_ncomms_journal_v1_n3_abs_ncomms1031.html
    ├── fulltext.html
    ├── fulltext.pdf
    ├── fulltext.xhtml
    ├── ncomms1031-s1.pdf
    ├── results.json
    └── scholarly.html

1 directory, 6 files
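
Since a scrape can leave several formats side by side, the XML > HTML > PDF preference mentioned above can be applied with a few lines of shell. This is just an illustrative sketch: `best_fulltext` is a hypothetical helper, and it assumes the files carry the names fulltext.xml, fulltext.html and fulltext.pdf seen in the examples.

```shell
#!/bin/sh
# Print the most preferred full-text file present in an article directory,
# following the XML > HTML > PDF order of preference.
# best_fulltext is a hypothetical helper, not part of quickscrape or norma.
best_fulltext() {
    article_dir=$1
    for candidate in fulltext.xml fulltext.html fulltext.pdf; do
        if [ -f "$article_dir/$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1   # no recognised full-text file found
}

# Example, using the directory from the scrape above:
# best_fulltext natcomms/http_www.nature.com_ncomms_journal_v1_n3_abs_ncomms1031.html
```

For the Nature Communications scrape shown here, no XML was captured, so a check like this would fall back to the HTML file, which is the one the norma steps start from.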


3.) If all else fails, you can feed your files into our toolchain directly via norma, but this route doesn’t capture rich metadata about each item of user-supplied content, so it’s not an optimal pathway. Here’s an example of three arbitrary PDFs being prepared for analysis with norma:

#put content in directly via norma
norma -i A.pdf B.pdf C.pdf -o output/ctrees --cmdir

tree output/
└── ctrees
    ├── A_pdf
    │   └── fulltext.pdf
    ├── B_pdf
    │   └── fulltext.pdf
    └── C_pdf
        └── fulltext.pdf

4 directories, 3 files
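
To make the resulting layout concrete, here is a small sketch that imitates it by hand: each input PDF gets its own directory named after the file, and the PDF itself is stored as fulltext.pdf. This is not how norma is implemented, only a mimic of the tree shown above; `stage_pdf` is a hypothetical helper and the naming rule (A.pdf becomes A_pdf) is inferred from that tree.

```shell
#!/bin/sh
# Copy a PDF into the per-document layout shown in the tree above:
#   A.pdf  ->  <outdir>/A_pdf/fulltext.pdf
# stage_pdf is a hypothetical helper that only imitates the directory shape.
stage_pdf() {
    pdf=$1
    outdir=$2
    name=$(basename "$pdf")        # e.g. A.pdf
    stem=${name%.pdf}              # e.g. A
    doc_dir="$outdir/${stem}_pdf"  # e.g. output/ctrees/A_pdf
    mkdir -p "$doc_dir"
    cp "$pdf" "$doc_dir/fulltext.pdf"
}

# Example:
# for f in A.pdf B.pdf C.pdf; do stage_pdf "$f" output/ctrees; done
```

Seeing the layout this way also shows why this route is the least recommended: all that travels with each document is its original file name.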


…and that’s it. Three different entry points to the ContentMine toolchain, primarily designed with XML, HTML or PDF in mind, although other formats are accepted as input too.

