My depression project starts from an animal-model, depression-specific search carried out in PubMed and Embase in May 2016, which contains 70,363 unique records after deduplication in EndNote. I am currently evaluating how best to get my full search (or as much of it as possible) into the ContentMine pipeline.
Here is an outline of the options I am considering and my thoughts:
- Wait until the screening section of my systematic review is complete and import only the included studies.
- Pros: Fewer papers to deal with, which has computational benefits. Only the most relevant papers would go through the ContentMine pipeline.
- Cons: I would not get the annotations produced by the ‘ami’ tool for the full set of search results, which I need in order to more thoroughly test machine learning options for screening in systematic reviews.
- Use the ‘quickscrape’ function on a list of manually collated URLs.
- Pros: Records are smoothly imported into the ContentMine pipeline and in the correct formats.
- Cons: Time. It takes a long time to manually collect the correct URLs and group them into categories for the different scrapers. Some records will likely have no URL; these would slow the process, as a more thorough library search would be needed to find them, or the authors would need to be contacted for the full text. A third possibility is that no electronic copy of the record exists, in which case it could not be processed by the ContentMine pipeline.
- Run a search using EuPMC with the ‘getpapers’ function. I currently retrieve about 20,000 records using the search string “(“depressive disorder” OR “depression” OR “depressive behavior” OR “depressive behaviour” OR “dysthymia” OR “dysthymic” AND “animal”)”, which was developed with a librarian at the University of Edinburgh to roughly correspond to the more complex original PubMed & Embase searches.
- Pros: Records are quickly and easily imported into the ContentMine pipeline.
- Cons: I would later need to reconcile the records downloaded through the EuPMC search with the papers included at screening, in order to carry them forward in the systematic review pipeline to the data extraction phase.
- Retrieve PDFs for the records of my search using EndNote and run ‘norma’ to convert the PDFs to text.
- Pros: Full-text PDFs will be downloaded through EndNote in any case, in order to import full-text references into the systematic review database that we have at CAMARADES for further data extraction and meta-analysis.
- Cons: Possible issues with different PDF formats and therefore with the output of the PDF conversion. Will the reader be able to deal with the typical journal layout of columns (particularly double columns)? How well, if at all, does it deal with tables and figures? Can legends be retrieved from figures?
- Run my PubMed search with the Open Access filter and download the XML versions of the articles using the FTP service. Running my PubMed search with the Open Access filter retrieves 20,381 records, which is comparable to the number retrieved using the amended EuPMC search with ‘getpapers’.
- Considerations: Seeing as the numbers of records retrieved are comparable, I do not see much advantage in using this facility over the EuPMC API that ContentMine already uses.
- Use CrossRef to retrieve full text documents, using a list of DOIs. This method can download up to 10,000 full text records at a time.
- Pros: Another method of retrieving full text documents in a machine-readable format. Could be a faster method than ‘quickscrape’, as there are DOIs for roughly a third of records.
- Cons: This method has similar issues to the ‘quickscrape’ tool, namely the time it takes to manually collect the correct DOIs, as only about a third of records have corresponding DOIs that have been located thus far.
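Two of the bookkeeping steps mentioned in the options above, reconciling downloaded records with the papers included at screening and splitting a DOI list into batches of at most 10,000, could be sketched roughly as follows. This is only a minimal illustration under assumptions: records are represented as plain dictionaries with a hypothetical `doi` key, matching is done on DOI alone (records without a DOI would need another identifier), and none of the function names come from ContentMine, EndNote or CrossRef.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Record = Dict[str, str]  # hypothetical record shape, e.g. {"id": "...", "doi": "..."}

# Common ways a DOI is written in exported reference lists.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw: Optional[str]) -> Optional[str]:
    """Lower-case a DOI and strip URL or 'doi:' prefixes; None if empty."""
    if not raw:
        return None
    doi = raw.strip().lower()
    for prefix in _DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi.strip() or None


def reconcile(
    included: Iterable[Record], downloaded: Iterable[Record]
) -> Tuple[List[Record], List[Record], List[Record]]:
    """Split included records into (matched, missing, no_doi).

    matched: DOI also present among the downloaded records
    missing: has a DOI, but it was not downloaded
    no_doi:  cannot be matched on DOI at all
    """
    downloaded_dois = set()
    for record in downloaded:
        doi = normalise_doi(record.get("doi"))
        if doi:
            downloaded_dois.add(doi)

    matched, missing, no_doi = [], [], []
    for record in included:
        doi = normalise_doi(record.get("doi"))
        if doi is None:
            no_doi.append(record)
        elif doi in downloaded_dois:
            matched.append(record)
        else:
            missing.append(record)
    return matched, missing, no_doi


def batches(items: Iterable, size: int = 10_000) -> List[list]:
    """Split items into consecutive lists of at most `size` elements."""
    items = list(items)
    return [items[i:i + size] for i in range(0, len(items), size)]
```

The `no_doi` group matters here because only about a third of the records currently have a DOI; those records would have to be matched some other way, for example on PubMed ID or title.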
If anyone has any input, or knows of possible solutions that I am unaware of or have not yet considered, please leave a comment below.
As with the other fellows, I am also getting time-out errors when using the ‘getpapers’ function. The problem lies with EuPMC and its API and will be rectified shortly.
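Until the fix arrives, one generic stopgap for time-outs is to retry the failing step with an increasing pause between attempts. The sketch below is an assumption-laden illustration, not part of ‘getpapers’: `retry_with_backoff` and its parameters are hypothetical names, and `task` stands for whatever callable performs the download and raises `TimeoutError` when it times out.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    task: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call task(); on TimeoutError wait base_delay * 2**n seconds and retry.

    The last TimeoutError is re-raised once all attempts are used up.
    `sleep` is injectable so the waiting can be skipped in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return task()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the failure
            sleep(base_delay * 2 ** attempt)  # waits 2 s, 4 s, 8 s, ... by default
    raise RuntimeError("unreachable")  # loop always returns or raises
```

Doubling the delay gives an overloaded service progressively more room to recover, rather than hitting it again immediately.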
In the meantime, while I weigh the advantages and disadvantages of the above options, I am putting together dictionaries to aid annotation of my documents. The dictionaries I am putting together are: animal models of depression, molecular and cellular pathways, outcome measures (in particular behavioural, neurobiological and anatomical), and risk of bias terms in the animal modelling literature.
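To give a feel for what dictionary-based annotation does, here is a minimal sketch that counts how often each dictionary term occurs in a piece of text. It does not use ‘ami’ or its dictionary file format; the function names are hypothetical and the example terms are illustrative placeholders rather than entries from my actual dictionaries.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def compile_dictionary(terms: Iterable[str]) -> "re.Pattern[str]":
    """Build one case-insensitive, whole-word pattern from a list of terms."""
    cleaned = {t.strip().lower() for t in terms if t.strip()}
    if not cleaned:
        raise ValueError("dictionary contains no terms")
    # Longest terms first, so multi-word terms win over shorter overlaps.
    ordered = sorted(cleaned, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def count_terms(
    text: str, dictionaries: Dict[str, Iterable[str]]
) -> Dict[str, Counter]:
    """Return, per dictionary name, a Counter of matched terms (lower-cased)."""
    result = {}
    for name, terms in dictionaries.items():
        pattern = compile_dictionary(terms)
        result[name] = Counter(m.group(0).lower() for m in pattern.finditer(text))
    return result


if __name__ == "__main__":
    # Illustrative placeholder terms only.
    example_dictionaries = {
        "outcome_measures": ["forced swim test", "sucrose preference"],
        "animal_models": ["chronic mild stress"],
    }
    sample = (
        "Rats exposed to chronic mild stress showed reduced sucrose preference. "
        "The Forced Swim Test was repeated."
    )
    for name, counts in count_terms(sample, example_dictionaries).items():
        print(name, dict(counts))
```

Per-dictionary counts like these are the kind of feature that could later feed into machine learning experiments for screening.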