Evaluating Trends in Bioinformatics Software Packages

Genomics, the study of the genome, requires processing and analysis of large-scale datasets. For example, to identify genes that increase our susceptibility to a particular disease, scientists can sequence the genomes of many individuals with and without the disease and look for differences between the two groups. Sequencing human samples on DNA sequencers, however, leaves scientists with overwhelmingly large datasets to work with.

Working with genomic datasets requires knowledge of a myriad of bioinformatic software packages. Because bioinformatic tools are developed so rapidly, it can be difficult to decide which software packages to use. For example, there are at least 18 software packages to align short unspliced DNA reads to a reference genome. Of course, scientists make this choice routinely, and their collective knowledge is embedded in the literature.

To this end, I have been using ContentMine to find trends in bioinformatic tools. This blog post outlines my approach and demonstrates important challenges to address. Much of my current effort goes into obtaining as comprehensive a set of articles as possible, ensuring correct counting and matching methods for bioinformatic packages, and making this pipeline reproducible.
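To make the counting and matching challenge concrete, here is a minimal sketch in Python. It is not the ContentMine pipeline or its API. The function name `count_tool_mentions`, the short tool list, and the sample sentences are all assumptions made up for illustration. It counts how many articles mention each tool at least once, using word boundaries so that one tool name is not matched inside another.

```python
import re
from collections import Counter

# Illustrative selection of aligner names (an assumption, not the full list).
TOOLS = ["Bowtie", "BWA", "SOAP2"]


def count_tool_mentions(articles, tools=TOOLS, ignore_case=False):
    """Count the number of articles that mention each tool at least once."""
    flags = re.IGNORECASE if ignore_case else 0
    counts = Counter()
    for text in articles:
        for tool in tools:
            # \b word boundaries keep "Bowtie" from matching inside "Bowtie2".
            pattern = r"\b" + re.escape(tool) + r"\b"
            if re.search(pattern, text, flags):
                counts[tool] += 1
    return counts


# Made-up sentences standing in for article full text.
articles = [
    "Reads were aligned with BWA and Bowtie.",
    "We used Bowtie2 for alignment.",
    "Alignment was performed using bwa mem.",
]

strict = count_tool_mentions(articles)
relaxed = count_tool_mentions(articles, ignore_case=True)
print(dict(strict))
print(dict(relaxed))
```

Even on three sentences the matching rules change the totals: a case-sensitive search misses the lowercase "bwa", while a naive substring search would wrongly credit "Bowtie" for a "Bowtie2" mention. Decisions like these are why correct counting deserves careful attention.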




Introducing Fellow Neo Christopher Chung: Computational Trends in Genomics

High-throughput genomics has fundamentally changed how life scientists investigate their research questions. Instead of studying a single candidate gene, we are able to measure DNA sequences and RNA expression of every gene in a large number of genomes. This technological advance requires us to become familiar with data wrangling, programming, and statistical analysis. To meet new challenges, bioinformaticians have created a large number of software packages to deal with genomic data.

For example, a molecular biologist can nowadays sequence a full human genome, which yields billions of very short DNA sequences from random locations. Such short DNA sequences (e.g., hundreds of base pairs) must be aligned so that we can infer a full genome of over 3 billion base pairs. With so many computational methods available, it is often difficult to know the most appropriate tool for one’s specific need. I propose to track the usage of bioinformatics tools in sequence alignment, variant calling, and other genomic studies.

This work will lead to a new kind of review that is interactive and dynamic. Eventually, molecular biologists will have a convenient portal that quantitatively summarizes the trends of computational and statistical methods in genomics. Furthermore, the source code will be published on GitHub, so that the analysis can be reproduced and extended to questions about other research trends.