Genomics, the study of the genome, requires processing and analyzing large-scale datasets. For example, to identify genes that increase our susceptibility to a particular disease, scientists can read the genomes of many individuals with and without the disease and look for differences among them. After running human samples on DNA sequencers, they are left with overwhelmingly large datasets.
Working with genomic datasets requires knowledge of a myriad of bioinformatic software packages. Because of rapid development of bioinformatic tools, it can be difficult to decide which software packages to use. For example, there are at least 18 software packages to align short unspliced DNA reads to a reference genome. Of course, scientists make this choice routinely and their collective knowledge is embedded in the literature.
To tap into that collective knowledge, I have been using ContentMine to find trends in bioinformatic tools. This blog post outlines my approach and highlights important challenges that remain. Much of my current effort goes into obtaining as comprehensive a set of articles as possible, ensuring that bioinformatic packages are matched and counted correctly, and making the pipeline reproducible.
(1) Using getpapers, I have been downloading a corpus of research articles from Europe PubMed Central that contain “RNAseq”, “microarray”, “sequencing”, and other keywords. Only open access research articles (including protocols and commentaries) in English are considered.
As expected, I saw an explosion of such papers over time. The number of articles that include “RNAseq” is shown below. Of course, since “RNAseq” may go by “RNA-seq”, “RNA sequencing”, “next generation sequencing”, and other names, the search needs to be expanded to these keywords as well. For now, I’ll demonstrate my approach in the later steps using the smaller set of articles returned by searching “RNAseq”.
Also as expected, I found a larger number of articles including “microarray”, a technology that was introduced in the late 1990s, has become a uniform and unambiguous term, and has been widely available since 2000. Because downloading full XML articles was limited by available and allocated memory, getpapers needs to be run in batches by publication year.
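The batching by publication year can be sketched as follows. This is a minimal Python illustration, not the commands actually used for this post: it only builds one getpapers command per year and does not run anything. The function name, the output directory naming, and the exact query string (using the Europe PMC search fields PUB_YEAR and OPEN_ACCESS) are assumptions.

```python
def yearly_getpapers_commands(keyword, first_year, last_year, outdir_prefix):
    """Build one getpapers command per publication year (nothing is executed)."""
    commands = []
    for year in range(first_year, last_year + 1):
        # Assumed query: the post does not give the exact query string.
        query = f'"{keyword}" AND PUB_YEAR:{year} AND OPEN_ACCESS:y'
        commands.append([
            "getpapers",
            "--query", query,
            "--outdir", f"{outdir_prefix}_{year}",  # hypothetical directory naming
            "--xml",  # request the full-text XML
        ])
    return commands


commands = yearly_getpapers_commands("RNAseq", 2014, 2016, "rnaseq")
for cmd in commands:
    print(" ".join(cmd))
```

Each list could then be passed to something like subprocess.run one year at a time, so that no single download has to hold the whole corpus.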
(2) I need a list of bioinformatic software packages to search for in genomic research articles. There are, in fact, far too many tools (including many that are obsolete). For starters, I decided to use the “List of RNA-Seq bioinformatics tools” from Wikipedia. Its twenty-four categories include “Quality Control”, “Errors Correction”, “Unspliced Aligner”, “Assembly Evaluation”, “Structural Variation”, and “Functional Analysis”.
However, this list is incomplete and potentially biased. Furthermore, some software packages are designed for multiple purposes and cannot be directly compared. As shown in a later section, for example, even though htseq is listed as a quality control tool, it is more general in application and can be used in other steps of genomic data analysis. It is therefore difficult to assess which processing and analysis steps are carried out by a particular tool. In the future, it would be worthwhile to expand and automate this step using a comprehensive directory of biological data analysis packages (e.g., OmicTools).
(3) I used the R programming language to process these articles. In particular, dplyr, tidytext, and tokenizers are used to organize and analyze the text. Essentially, each XML file is read and parsed into sentences and words. Although there are R packages that properly parse the XML format, they took ~100 sec per article. Therefore, I simply read each file as plain text, which takes ~0.05 sec.
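To illustrate the plain-text shortcut, here is a minimal sketch in Python (the post itself used R with dplyr, tidytext, and tokenizers, so this is a stand-in, not the actual pipeline). The helper names and the regular expressions are assumptions chosen for brevity; real sentence splitting is harder than this.

```python
import re


def strip_tags(raw):
    """Crudely drop XML tags so that only the text content remains."""
    return re.sub(r"<[^>]+>", " ", raw)


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_words(sentence):
    """Lowercase word tokens; hyphens inside a name (e.g. rna-seq) are kept."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())


# A toy XML snippet standing in for one downloaded article.
raw = "<article><p>Reads were aligned with Bowtie. Counts came from HTSeq.</p></article>"
sentences = split_sentences(strip_tags(raw))
words = [split_words(s) for s in sentences]
print(sentences)
print(words)
```

For a real file, the raw string would come from reading the file as text instead of from a literal.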
(4) Frequencies of bioinformatics tools in research articles are calculated. Basically, I searched every article in the corpus for the names of software packages. Among the 3919 articles containing the keyword “RNAseq”, I obtained the number of articles mentioning each software package used for quality control and alternative splicing.
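The counting step can be sketched roughly as below, again in Python rather than R. It counts how many articles mention each name at least once, using a plain case-insensitive substring search. The function name and the three toy article texts are made up for illustration.

```python
from collections import Counter


def count_articles_mentioning(articles, tool_names):
    """Count, for each tool name, the articles whose text contains it."""
    counts = Counter()
    for text in articles:
        lowered = text.lower()
        for tool in tool_names:
            # Naive substring test, in the spirit of a plain grep.
            if tool.lower() in lowered:
                counts[tool] += 1
    return counts


# Toy texts, not taken from the corpus.
articles = [
    "Quality was checked with FastQC and reads were counted with HTSeq.",
    "We ran FastQC before alignment.",
    "No quality control step was reported.",
]
counts = count_articles_mentioning(articles, ["fastqc", "htseq", "rseqc"])
print(counts.most_common())
```

Counting articles rather than total occurrences keeps one paper that repeats a name many times from dominating the result.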
These plots are not accurate representations of trends in genomics and bioinformatics. Firstly, because of the naive search using grep, short or generic package names are over-counted: a plain pattern match also hits unrelated words that merely contain the name. This can be clearly seen in the “alternative splicing” barplot, where mats appears in 452 articles, most likely because the pattern also matches inside ordinary words such as “formats”. I need to account for the context in the future. Secondly, some articles may only include citations or URLs, without naming the software packages used. I am looking to incorporate such information, although slightly different bibliography formats and styles pose another challenge.
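One small step toward handling context is to require that a name appear as a whole token rather than inside a longer word. The sketch below, in Python, contrasts the two behaviours. The function names are hypothetical, and this is only one possible fix, not something already in the pipeline.

```python
import re


def mentions_substring(text, name):
    """Naive test: the name appears anywhere, even inside another word."""
    return name.lower() in text.lower()


def mentions_whole_word(text, name):
    """Stricter test: the name is not flanked by letters or digits."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


unrelated = "Input files were converted between several formats."
relevant = "Differential splicing was detected with MATS."

print(mentions_substring(unrelated, "mats"))   # naive search hits "formats"
print(mentions_whole_word(unrelated, "mats"))  # whole-word search does not
print(mentions_whole_word(relevant, "mats"))
```

Whole-word matching still cannot tell a tool name from an ordinary word spelled the same way, so it reduces false matches without removing them.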
Counting mentions per article also allows me to look at temporal and regional trends of software packages. Putting aside the limitations outlined above, frequencies of quality control tools used in genomic studies are broken down by publication year.
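A breakdown by publication year could look roughly like this in Python. It assumes each article is available as a (year, text) pair; how the year is extracted from the XML is not covered here, and the records are toy data.

```python
from collections import defaultdict


def counts_by_year(records, tool_names):
    """Return {year: {tool: number of articles mentioning it}}."""
    table = defaultdict(lambda: {tool: 0 for tool in tool_names})
    for year, text in records:
        lowered = text.lower()
        for tool in tool_names:
            if tool.lower() in lowered:
                table[year][tool] += 1
    return dict(table)


# Toy (year, text) pairs, not taken from the corpus.
records = [
    (2014, "Read quality was assessed with FastQC."),
    (2015, "FastQC and HTSeq were used."),
    (2015, "Counts were generated with HTSeq."),
]
by_year = counts_by_year(records, ["fastqc", "htseq"])
for year in sorted(by_year):
    print(year, by_year[year])
```

Dividing each count by the number of articles published in that year would give proportions, which are easier to compare while the corpus itself grows over time.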
Overall, I have used a corpus of genomic research articles to find trends in bioinformatics software packages. This type of trend analysis of the literature will complement the reviews and expert opinions that are often used to guide the choice of computational tools in genomics. I continue to work on overcoming the challenges outlined in this post. In particular, it is evident that context matters when text mining a corpus of publications, in order to accurately count and categorize software packages.