Evaluating Trends in Bioinformatics Software Packages

Genomics, the study of the genome, requires processing and analysis of large-scale datasets. For example, to identify genes that increase our susceptibility to a particular disease, scientists can sequence the genomes of many individuals with and without that disease and compare the differences among them. After running human samples on DNA sequencers, scientists are left with overwhelming amounts of data.

Working with genomic datasets requires knowledge of a myriad of bioinformatic software packages. Because bioinformatic tools are developed so rapidly, it can be difficult to decide which packages to use. For example, there are at least 18 software packages for aligning short unspliced DNA reads to a reference genome. Of course, scientists make this choice routinely, and their collective knowledge is embedded in the literature.

To this end, I have been using ContentMine to find trends in the use of bioinformatic tools. This blog post outlines my approach and highlights the main challenges to address. Much of my current effort goes into obtaining as comprehensive a set of articles as possible, ensuring correct counting/matching methods for bioinformatic packages, and making the pipeline reproducible.


(1) Using getpapers, I have been downloading a corpus of research articles that contain “RNAseq”, “microarray”, “sequencing”, and other keywords from Europe PubMed Central. Only open-access research articles (including protocols and commentaries) written in English are considered.
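A minimal sketch of the download step, run from R via system(), is shown below. The getpapers flags (-q for the query, -o for the output directory, -x for full-text XML) and the OPEN_ACCESS filter syntax are assumptions; check getpapers --help for your version.

```r
# Download open-access full-text XML for articles matching "RNAseq".
# Flags and query syntax are assumptions; adjust to your getpapers version.
system('getpapers -q "RNAseq AND OPEN_ACCESS:Y" -o rnaseq_corpus -x')
```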

As expected, I saw an explosion of such papers over time. The number of articles that include “RNAseq” is shown below. Of course, since “RNAseq” may also appear as “RNA-seq”, “RNA sequencing”, “next generation sequencing”, and so on, the search needs to be expanded to these keywords as well. For now, however, I’ll demonstrate the later steps of my approach using the smaller set of articles returned by searching for “RNAseq”.

[Figure: number of open-access articles containing “RNAseq”, by publication year]

As expected, I found a larger number of articles containing “microarray”, a term that was introduced in the late 1990s, has become uniform and unambiguous, and has been in wide use since 2000. Because downloading the full XML articles was limited by available and allocated memory, I had to run getpapers in batches by publication year.
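The batching can be scripted, for example as below; Europe PMC’s PUB_YEAR filter and the getpapers flags are assumptions to check against your own setup.

```r
# Download the corpus one publication year at a time to keep each batch small.
# PUB_YEAR and OPEN_ACCESS are Europe PMC query fields (assumed syntax).
for (year in 2000:2017) {
  query  <- sprintf("microarray AND OPEN_ACCESS:Y AND PUB_YEAR:%d", year)
  outdir <- sprintf("microarray_%d", year)
  system(sprintf('getpapers -q "%s" -o %s -x', query, outdir))
}
```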

[Figure: number of open-access articles containing “microarray”, by publication year]

(2) I needed a list of bioinformatic software packages to search for in genomic research articles. There are, in fact, far too many tools (including many that are obsolete). For starters, I decided to use the “List of RNA-Seq bioinformatics tools” from Wikipedia. Its twenty-four categories include “Quality Control”, “Error Correction”, “Unspliced Aligner”, “Assembly Evaluation”, “Structural Variation”, and “Functional Analysis”.

However, this list is neither complete nor free of bias. Furthermore, some software packages are designed for multiple purposes and cannot be compared directly. For example, as shown in a later section, even though htseq is listed as a quality control tool, it is more general in application and can be used in other steps of genomic data analysis. It is therefore difficult to assess which processing and analysis steps a particular tool performs. In the future, it would be worthwhile to expand and automate this step using a comprehensive directory of biological data analysis packages (e.g., OmicTools).

(3) I used the programming language R to process these articles. In particular, dplyr, tidytext, and tokenizers are used to organize and analyze the text. Essentially, each XML file is read and parsed into sentences and words. Although there are R packages that properly parse the XML format, doing so takes ~100 seconds per article; therefore, I simply read each file as plain text, which takes ~0.05 seconds.
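A minimal sketch of this step is shown below, assuming getpapers has written one fulltext.xml per article under a rnaseq_corpus/ directory (the directory name is illustrative).

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(tokenizers)

# Locate the full-text XML files written by getpapers (assumed layout).
files <- list.files("rnaseq_corpus", pattern = "fulltext\\.xml$",
                    recursive = TRUE, full.names = TRUE)

# Read each XML file as plain text (~0.05 s) instead of parsing it (~100 s).
corpus <- tibble(file = files) %>%
  mutate(text = vapply(file,
                       function(f) paste(readLines(f, warn = FALSE), collapse = " "),
                       character(1)))

# Split into sentences, then words, for downstream matching.
sentences <- corpus %>%
  mutate(sentence = tokenize_sentences(text)) %>%
  select(file, sentence) %>%
  unnest(cols = sentence)

words <- sentences %>%
  unnest_tokens(word, sentence)
```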

(4) Frequencies of bioinformatics tools in research articles are calculated. Basically, I searched all the articles in the corpus for the names of the software packages. Among the 3919 articles containing the keyword “RNAseq”, I obtained frequencies of the software packages used for quality control and for alternative splicing.
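The counting step, building on the corpus tibble from the previous sketch, looks roughly like this (the tool names are an illustrative subset of the Wikipedia list):

```r
# For each tool, count the number of articles that mention its name at least once.
# This is the naive grep-style matching discussed below.
tools <- c("FastQC", "HTSeq", "RSeQC", "Trimmomatic", "MATS")  # illustrative subset

tool_counts <- sapply(tools, function(tool) {
  sum(grepl(tool, corpus$text, ignore.case = TRUE))
})

sort(tool_counts, decreasing = TRUE)
```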

[Figure: frequencies of quality control tools among articles containing “RNAseq”]

[Figure: frequencies of alternative splicing tools among articles containing “RNAseq”]

These plots are not yet accurate representations of trends in genomics and bioinformatics. First, because of the naive search using grep, software packages with generic names may be over-counted. This can be clearly seen in the “alternative splicing” barplot, where mats appears in 452 articles, most likely because “mats” also matches ordinary text. I need to account for the context in the future. Second, some articles may only include citations or URLs without naming the software packages in the text. I am looking to incorporate such information as well, although varying bibliography formats and styles pose another challenge.
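As a first step towards accounting for context, one option, sketched below rather than fully implemented, is to require a whole-word match in a sentence that also contains a cue word such as “software”, “package”, or “tool”.

```r
# Count distinct articles in which the tool name appears as a whole word in a
# sentence that also contains a context cue; the cue words are an assumption.
count_with_context <- function(tool, sentence_df) {
  name_pattern <- paste0("\\b", tool, "\\b")
  cue_pattern  <- "software|package|tool|pipeline|version"
  hit <- grepl(name_pattern, sentence_df$sentence, ignore.case = TRUE) &
         grepl(cue_pattern,  sentence_df$sentence, ignore.case = TRUE)
  length(unique(sentence_df$file[hit]))
}

count_with_context("MATS", sentences)
```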

(5) Beyond simply counting how often certain software packages are used in genomic studies, I am interested in trends over time and across different regions, among other attributes. The metadata obtained via getpapers (as JavaScript Object Notation files) are used to find publication years, the authors’ origins, journals, and other information. I used the jsonlite package to read the metadata into R.
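For example, jsonlite can read the results file that getpapers writes alongside the corpus; the file name (eupmc_results.json) and the field names below are assumptions based on Europe PMC’s record format.

```r
library(jsonlite)
library(dplyr)

# Load the per-article metadata written by getpapers (assumed file name).
meta <- fromJSON("rnaseq_corpus/eupmc_results.json", flatten = TRUE)

# Keep a few fields of interest; pmcid and pubYear are assumed Europe PMC fields.
meta_df <- meta %>%
  transmute(pmcid = pmcid,
            year  = as.integer(pubYear))
# Journal titles, author lists, etc. sit in nested fields (e.g. journalInfo,
# authorList) and can be extracted in the same way.
```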

This allows me to look at temporal and regional trends in the use of software packages. Putting aside the potential limitations outlined above, the frequencies of quality control tools used in genomic studies are broken down by publication year below.

[Figure: frequencies of quality control tools, broken down by publication year]
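The per-year counts behind a plot like this can be assembled by joining the text matches with the metadata, roughly as follows (building on the earlier sketches; the PMCID-named article directories are an assumption about getpapers’ output).

```r
# Per-year article counts for one illustrative tool (FastQC), joined by PMCID.
corpus %>%
  mutate(pmcid    = basename(dirname(file)),  # assumes one directory per article, named by its PMCID
         has_tool = grepl("\\bFastQC\\b", text, ignore.case = TRUE)) %>%
  inner_join(meta_df, by = "pmcid") %>%
  group_by(year) %>%
  summarise(n_articles = sum(has_tool))
```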

Overall, I have used a corpus of genomic research articles to find trends in bioinformatics software packages. This type of literature-wide trend analysis will complement the reviews and expert opinions that are often used to guide the choice of computational tools in genomics. I will continue to work on the challenges outlined in this post. In particular, it is evident that context matters when text mining a corpus of publications, in order to accurately count and categorize software packages.


4 thoughts on “Evaluating Trends in Bioinformatics Software Packages”

  1. Very cool. Did this include research in ancient DNA? aDNA is particularly difficult to work with because it is usually degraded into very short fragmented strands requiring a lot of post-sequencing processing. I am just starting in this field and trying to figure out the best software options before I invest a lot of time learning how to use them.


    1. I’m unsure whether you need to use specialized software to deal with ancient DNA.

      But if you have a corpus of publications in your field and a list of software packages, you can check what software is frequently used. Please check/fork my github repository 🙂

