Content mining for trend analysis

Let’s suppose you have assembled a large collection of papers (we’ll call it the corpus) as a starting point for a literature review. Some of the first questions will be exploratory: you would like to get an intuition of what’s really in there. “Is there a certain structure, possibly a hidden bias I need to take into account? What is the coverage? Are there ‘holes’ in the data set, perhaps some missing months, or should I include another keyword in the search? How do certain keyword frequencies develop over time? Is there a trend appearing?” We can help you get this initial overview and speed up the process, so that you can start working on the questions that really interest you.

How does it work?

The process starts with creating sets of keywords that are of special relevance to you. We aggregate these sets of keywords in dictionaries. We can then apply an automated text mining workflow to extract frequency counts for each paper, by exactly matching the words of each dictionary against the full text of every paper, so that every exact occurrence is counted. The data is then combined with the articles’ metadata, so that each extracted keyword can be associated with its source paper.
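
As a rough illustration of the matching step described above, here is a minimal Python sketch. It is not our production workflow: the dictionary contents, the field names (`id`, `date`, `text`) and the function names are all made up for this example, matching is case-insensitive, and only single-word keywords are handled.

```python
import re
from collections import Counter

# Hypothetical dictionaries (name -> set of single-word keywords); illustrative only.
DICTIONARIES = {
    "disease": {"zika", "dengue"},
    "vector": {"mosquito", "aedes"},
}


def count_keywords(text, dictionaries):
    """Count exact (case-insensitive) keyword matches in one paper's full text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    token_counts = Counter(tokens)
    result = {}
    for name, keywords in dictionaries.items():
        # Keep only the keywords that actually occur in this paper.
        result[name] = {
            kw: token_counts[kw] for kw in sorted(keywords) if token_counts[kw] > 0
        }
    return result


def mine_corpus(papers, dictionaries):
    """Attach each keyword count to the metadata of its source paper."""
    rows = []
    for paper in papers:
        counts = count_keywords(paper["text"], dictionaries)
        for name, keyword_counts in counts.items():
            for keyword, n in keyword_counts.items():
                rows.append({
                    "paper_id": paper["id"],
                    "date": paper["date"],
                    "dictionary": name,
                    "keyword": keyword,
                    "count": n,
                })
    return rows
```

Each returned row is one keyword count tied to a paper ID and a date, which is the shape the time-based analyses further down rely on. Multi-word terms or stemming would need a different matching strategy.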

One possible outcome of this process is a set of visualizations that convey, in at-a-glance graphs, the hard facts within the overwhelming number of publications you’re working with. For example, you might like to know where your papers come from (from our demo on the Zika virus):


This graph is created from the publications’ metadata in the sample dataset. It is a grouped count of the journal titles, sorted in descending order. The blue line indicates the cumulative percentage relative to the total count. There is a long tail of journals mentioned only once or twice, which we have omitted from this graphic.
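
The numbers behind such a graph can be computed in a few lines. The following is a sketch only, assuming the journal titles are available as a plain list. The function name and the `min_count` parameter (used to cut off the long tail) are our own inventions for this example.

```python
from collections import Counter


def journal_pareto(journals, min_count=1):
    """Return (journal, count, cumulative %) rows, sorted by descending count."""
    counts = Counter(journals)
    total = sum(counts.values())
    rows = []
    running = 0
    for journal, n in counts.most_common():
        running += n
        # The cumulative share is relative to ALL papers, even if the tail is hidden.
        if n >= min_count:
            rows.append((journal, n, round(100.0 * running / total, 1)))
    return rows
```

Because the rows are sorted by count, dropping journals below `min_count` only trims the end of the list and leaves the cumulative percentages of the remaining rows unchanged.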

Or you might want a high-level overview of the absolute and relative distributions of the predefined dictionaries in your paper collection. The upper time series shows the absolute counts of facts (the extracted keyword matches), aggregated by their source dictionary. The lower time series shows the relative share of facts per dictionary, as a fraction of the daily total count.



Or you might like to see which keywords are most dominant overall, and which ones have been trending upward recently. The left column shows the top 10 facts, determined by absolute counts over the whole period. The right column shows the top 10 uptrending facts, determined by the sum of percentage changes of daily counts. Similar approaches could be employed to identify completely new terms that start to appear in the corpus.
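
Both rankings can be sketched as follows. This is a simplified illustration, not the exact code behind the demo: the input layout (keyword mapped to a day-to-count mapping) and all function names are assumptions, and day-to-day steps that start from a zero count are skipped because their percentage change is undefined.

```python
def top_absolute(series_by_keyword, n=10):
    """Rank keywords by their total count over the whole period."""
    totals = {kw: sum(series.values()) for kw, series in series_by_keyword.items()}
    return sorted(totals, key=lambda kw: (-totals[kw], kw))[:n]


def trend_score(series):
    """Sum of day-to-day relative changes; steps starting from zero are skipped."""
    values = [series[day] for day in sorted(series)]
    return sum((cur - prev) / prev for prev, cur in zip(values, values[1:]) if prev)


def top_uptrending(series_by_keyword, n=10):
    """Rank keywords by their summed relative changes."""
    scores = {kw: trend_score(series) for kw, series in series_by_keyword.items()}
    return sorted(scores, key=lambda kw: (-scores[kw], kw))[:n]
```

Note that a rare keyword whose count doubles from 2 to 4 scores higher than a frequent one that grows from 100 to 110, which is why the two columns can look very different.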


Some starters for an explorative trend analysis

This is not an in-depth tutorial, but rather an outline of the necessary steps for detecting meaningful trends with an exploratory, iterative approach.


As a first step, we want to establish a baseline of publishing rates and total word counts, based on monthly and yearly binning of publications. This will help us normalize our time series against the background rate of publications, so that a dictionary does not appear to be uptrending when the rise is only due to overall growth in the number of publications. As a second step, we count the keyword frequencies per time step and normalize them against the baselines. We can now identify increasing, plateauing or declining popularity or activity via the period-to-period change rates of keyword frequencies. We can tweak the analysis by choosing different binning periods (e.g. daily, monthly, yearly), or by smoothing the change rate over different ranges of periods (from one period to the next, or as an average over two or more).
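
The two steps can be sketched like this. Treat it as a starting point under a few assumptions: dates are ISO strings (`YYYY-MM-DD`), the baseline is the total word count per month, and the function names and the `window` parameter are hypothetical names chosen for this example.

```python
from collections import defaultdict


def monthly_bin(date_str):
    """Map an ISO date 'YYYY-MM-DD' to its month 'YYYY-MM'."""
    return date_str[:7]


def normalized_series(facts, baseline, bin_key=monthly_bin):
    """facts: iterable of (date, count); baseline: {period: total word count}."""
    raw = defaultdict(int)
    for date, count in facts:
        raw[bin_key(date)] += count
    # Divide by the background volume so overall growth in publishing cancels out.
    return {p: raw.get(p, 0) / baseline[p] for p in sorted(baseline) if baseline[p]}


def change_rates(series):
    """Period-to-period relative change; None where the previous value is zero."""
    periods = sorted(series)
    rates = {}
    for prev, cur in zip(periods, periods[1:]):
        rates[cur] = (series[cur] - series[prev]) / series[prev] if series[prev] else None
    return rates


def smooth(rates, window=2):
    """Average the change rate over the last `window` periods."""
    periods = sorted(rates)
    out = {}
    for i in range(window - 1, len(periods)):
        chunk = [rates[p] for p in periods[i - window + 1:i + 1]]
        if None not in chunk:
            out[periods[i]] = sum(chunk) / window
    return out
```

Switching to yearly bins only requires a different `bin_key` (for example `lambda d: d[:4]`) together with a yearly baseline.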

Bursts of activity

Another point worth investigating is unusual, short-lived bursts of activity around a certain dictionary. We can treat them as “outliers” in the time series, relative to the background activity of this dictionary. To find them, we start with normalized keyword frequency counts, and then look for frequency counts that are a certain number of standard deviations away from the average frequency. Here, too, there are parameters we can play with: one is again the binning period, another is the number of standard deviations a count must deviate by to qualify as an “outlier”. Of course, the nature of the unusual activity would have to be identified by close reading, or by pulling in other data.
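
A simple version of this outlier test could look like the sketch below. The function name and the `n_sigma` parameter are our own, the population standard deviation is one possible choice among several, and only upward deviations are flagged here because we are interested in bursts.

```python
from statistics import mean, pstdev


def find_bursts(series, n_sigma=2.0):
    """Return {period: z-score} for periods far above the average frequency."""
    values = list(series.values())
    if len(values) < 2:
        return {}
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return {}  # perfectly flat series: nothing stands out
    return {
        period: (value - mu) / sigma
        for period, value in series.items()
        if value - mu > n_sigma * sigma
    }
```

Keep in mind that a strong burst also inflates the average and the standard deviation it is measured against, so in short series a lower `n_sigma` may be needed to catch it.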

New terms

From time to time we’d like to know what is completely new in our field. One way to find out is to look at terms that have only recently started appearing in our collection of publications. We can use two definitions of “new” here: one is the first appearance of a term in the corpus, the other is the point at which usage of the term first surpasses a certain threshold. The first approach can be seen as looking for a first conceptual introduction, while the second can be seen as finding out when the term gained relevance in its respective domain.

Since we don’t know what we’re looking for, our approach has to be to heuristically filter out what we already know. As a first step, we start with the absolute frequency counts of all words within the corpus. In the second step, we filter out common words from the language we’re working with, as well as keywords we “already know”. From the remaining frequency counts, we filter out terms that first occurred in the data set more than a certain amount of time (months, years) ago. This yields a set of keywords that haven’t been used until recently. Alternatively, we can filter out all keywords that had already occurred above a certain (absolute or relative) threshold before that point in time. This yields a set of keywords that only recently gained popularity, similar to the uptrending analysis, but this time with unknown keywords.

Combining these filters and looking for keywords that appeared early on in the dataset, yet are still in the second group, could point to concepts that were introduced a long time ago but haven’t really taken off until recently. Parameters we can adapt here are the amount of time we want to go back (e.g. 3, 7, 10 years), and the relative or absolute threshold keywords need to surpass to count as relevant. These parameters will affect the number of keywords marked as “new”.
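
The filters described in this section can be combined roughly as follows. This is a sketch with made-up names throughout: `term_counts` maps each term to its counts per year (as `YYYY` strings), `cutoff` is the year from which a term counts as “recent”, and `threshold` is an absolute count.

```python
def first_seen(term_counts):
    """First period in which each term occurs at all."""
    return {
        term: min(p for p, c in periods.items() if c > 0)
        for term, periods in term_counts.items()
        if any(c > 0 for c in periods.values())
    }


def first_above(term_counts, threshold):
    """First period in which each term reaches the threshold (if it ever does)."""
    return {
        term: min(p for p, c in periods.items() if c >= threshold)
        for term, periods in term_counts.items()
        if any(c >= threshold for c in periods.values())
    }


def new_terms(term_counts, cutoff, threshold, stopwords=frozenset(), known=frozenset()):
    """Return (newly introduced, newly popular, late bloomers) as sets of terms."""
    # Filter out common words and keywords we already know.
    candidates = {
        t: c for t, c in term_counts.items() if t not in stopwords and t not in known
    }
    seen = first_seen(candidates)
    popular = first_above(candidates, threshold)
    newly_introduced = {t for t, p in seen.items() if p >= cutoff}
    newly_popular = {t for t, p in popular.items() if p >= cutoff}
    # Introduced long ago, but only recently above the threshold.
    late_bloomers = {t for t in newly_popular if seen[t] < cutoff}
    return newly_introduced, newly_popular, late_bloomers
```

Moving `cutoff` further back or lowering `threshold` will enlarge all three sets, so it is worth trying a few settings before reading too much into any single list.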

Play around with your data!

Hopefully this post gave you some starting points for a trend analysis of scientific literature. In most cases, you’ll proceed iteratively and in an exploratory fashion, with the analysis evolving around your findings in the data. If you have a specific use case and would like to involve us, we’d be happy to talk about the task – contact us at

