Explaining the difference between getpapers and quickscrape

Having written a blog post about getpapers yesterday, I thought it might be useful to explain the difference in utility between getpapers and quickscrape.

I think of getpapers as a handy command-line tool for search & retrieval of relevant research. However, a variety of circumstances can prevent getpapers from returning the full text of some relevant papers. This is where quickscrape becomes very useful.

quickscrape is a command-line tool purely for retrieval of known research you want to download, with more powerful and flexible download techniques than getpapers. In theory, it is possible to get almost anything you have legal access to, in bulk, via quickscrape. Now that’s what I mean by POWER!


Q: Is there a situation in which I might use both getpapers and quickscrape?

A: Yes! getpapers has functionality specifically designed to produce input for quickscrape. This can be very useful when getpapers finds relevant closed access papers that EPMC cannot make available for full-text download because of publisher-imposed restrictions.

A worked example: I want to mine the last 3 months of papers published in PNAS. PNAS typically imposes a 6-month embargo on research published in it, so EPMC cannot offer full-text download of recent PNAS research. So I have to go via the PNAS journal website to get recent PNAS articles.

# Use getpapers to get a list of all recent PNAS articles
getpapers \
  --query 'JOURNAL:"PNAS" AND FIRST_PDATE:[2015-04-01 TO 2015-07-01]' \
  --outdir recentpnas

# Use quickscrape to download recent PNAS articles output by getpapers
quickscrape \
  --urllist recentpnas/fulltext_html_urls.txt \
  --scraper journal-scrapers/scrapers/pnas.json \
  --output recentpnasfull \
  --outformat bibjson

Perfect synergy, eh?
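
Between the two steps it can be worth a quick sanity check of the URL list before handing it to quickscrape. The following is only a minimal sketch, not part of either tool: `clean_urllist` is a hypothetical helper name of my own, and it assumes the list is a plain text file with one URL per line. It keeps lines that start with http:// or https://, drops duplicates while preserving order, and prints how many URLs remain.

```shell
# Hypothetical helper (not part of getpapers or quickscrape).
# Assumes the input is plain text with one URL per line.
clean_urllist() {
  infile="$1"
  outfile="$2"
  # Keep only lines starting with http:// or https://,
  # and only the first occurrence of each URL (order is preserved).
  awk '/^https?:\/\// && !seen[$0]++' "$infile" > "$outfile"
  # Print the number of URLs that survived, without padding spaces.
  wc -l < "$outfile" | tr -d ' '
}

# Example usage (assumes the getpapers step has already been run):
# clean_urllist recentpnas/fulltext_html_urls.txt recentpnas/clean_urls.txt
```

The cleaned file could then be passed to quickscrape via --urllist in place of the original list, and the printed count gives a rough idea of how many articles to expect.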


Q: What’s a real use case in which someone would use quickscrape instead of getpapers?

A: When the journal (e.g. Acta Palaeontologica Polonica) or platform (e.g. bioRxiv) that the desired research is published on is not covered by Europe PubMed Central (EPMC), arXiv, or IEEE.

Incidentally, there are two Acta Palaeontologica Polonica articles in EPMC and I have no idea why they are in EPMC to be honest! It would certainly make my life easier if EPMC / PMC were more widely scoped in terms of subjects/journals allowed in.

I’m not a biomedical researcher myself, so unfortunately this is a common problem for me. There is no central aggregation of evolution, ecology or palaeontology journal content. If you want to do full-text mining on it, you have to aggregate the content yourself, with quickscrape!




