What can we learn about networks of collaboration between research organizations and researchers through content mining? At Mozilla Festival 2015 we had a look at this question with the help of the ContentMine tools.
Collaboration networks can be analysed in many different ways, such as looking at grant identifiers, author affiliations or citations. Our aim for the session was to try something different and visualize which organisations tend to collaborate on a more informal level. We singled out the “Acknowledgement” section of papers as a possible data source. Because papers are divided into sections, we can look at one section specifically and exclude other possible contexts of a mention. If, for example, an organization or researcher is named in the “Methods” section, the mention more likely refers to a technology or method used in the research.
For that, we built a network of relationships between organizations, persons or locations that are mentioned together in the “Acknowledgements” section of a paper. A network in general consists of nodes which are connected by links. The initial network in our case consists of two types of nodes: papers on the one hand, and names of organisations, persons and places (entities) on the other. A link between a paper and an entity is created when the acknowledgements section mentions this entity by name. In this type of network, there are no links between nodes of the same type. To get from one organization to another, we always have to go via a paper.
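A minimal sketch of such a two-type network in plain Python might look like this. The paper identifiers and the `acknowledgements` mapping are made up for illustration and are not taken from our data set, and `build_bipartite` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical input: paper id -> entities named in its acknowledgements section.
acknowledgements = {
    "paper_1": ["University of Basel", "Swiss National Science Foundation"],
    "paper_2": ["University of Basel", "Hospital of Aarau"],
}

def build_bipartite(acks):
    """Return the paper-entity links plus a lookup table for each node type."""
    links = set()
    paper_to_entities = defaultdict(set)
    entity_to_papers = defaultdict(set)
    for paper, entities in acks.items():
        for entity in entities:
            # Links only ever connect a paper to an entity,
            # never two nodes of the same type.
            links.add((paper, entity))
            paper_to_entities[paper].add(entity)
            entity_to_papers[entity].add(paper)
    return links, dict(paper_to_entities), dict(entity_to_papers)

links, paper_to_entities, entity_to_papers = build_bipartite(acknowledgements)
print(sorted(entity_to_papers["University of Basel"]))
```

Keeping both lookup tables makes it cheap to ask either "which entities does this paper mention?" or "which papers mention this entity?".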
But we are interested in how organizations are connected to each other. So we create another network consisting only of organization nodes. In this second network, a link is created between two organizations when they are mentioned in the same acknowledgement section. Now we can go from one organization directly to another, and we can look at how organizations are connected by being mentioned together (co-occurrence).
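The step from the paper-entity network to a co-occurrence network can be sketched as follows. This is an illustration under assumptions, not the exact notebook code: the input mapping is invented, `cooccurrence` is a hypothetical helper, and counting how many papers each pair shares is an extra we add here as a link weight.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: paper id -> organizations named in its acknowledgements.
acknowledgements = {
    "paper_1": ["University of Basel", "Swiss National Science Foundation"],
    "paper_2": [
        "University of Basel",
        "Hospital of Aarau",
        "Swiss National Science Foundation",
    ],
}

def cooccurrence(acks):
    """Count, for every pair of organizations, the papers that mention both."""
    weights = Counter()
    for entities in acks.values():
        # Sorting gives each pair one canonical order, so (A, B) and (B, A)
        # are counted as the same link; set() ignores repeated mentions.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights

weights = cooccurrence(acknowledgements)
for (a, b), count in sorted(weights.items()):
    print(a, "<->", b, "shared papers:", count)
```

Any pair with a count of at least one becomes a link in the organization-only network.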
To create an experimental data set we downloaded about 2,300 papers from the Open Access journal Trials with getpapers. We queried for terms like “drug, trial, patent” and limited the search to the years between 2012 and 2015. Norma converted the downloaded XML files from the search results into scholarly.html. The acknowledgement section was retrieved from the scholarly.html, and the names of entities were parsed out of each section with the help of the Stanford Named Entity Recognizer.
Here is one of the final networks, centred around the Swiss National Science Foundation. The node size is determined by the count of connections a node has. So for example in the lower right corner we see that the Hospital of Aarau, the University of Basel and the University Hospital of Basel are mentioned in at least one paper together. All three of them are also mentioned together with the Swiss National Science Foundation.
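Centring a network on one organization and sizing nodes by their number of connections could be done roughly like this. The edge list below only mirrors the example named in the text plus one unrelated, invented pair; `neighbours`, `ego_network` and `degrees` are hypothetical helper names.

```python
from collections import defaultdict

SNSF = "Swiss National Science Foundation"

# Illustrative co-occurrence links, not the real data set.
edges = [
    ("Hospital of Aarau", "University of Basel"),
    ("Hospital of Aarau", "University Hospital of Basel"),
    ("University of Basel", "University Hospital of Basel"),
    ("Hospital of Aarau", SNSF),
    ("University of Basel", SNSF),
    ("University Hospital of Basel", SNSF),
    ("Organisation X", "Organisation Y"),  # invented pair, unrelated to the centre
]

def neighbours(edge_list):
    """Map every node to the set of nodes it is directly linked to."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def ego_network(edge_list, centre):
    """Keep only links among the centre node and its direct neighbours."""
    keep = neighbours(edge_list)[centre] | {centre}
    return [(a, b) for a, b in edge_list if a in keep and b in keep]

def degrees(edge_list):
    """Number of connections per node, usable as the node size in a plot."""
    return {node: len(linked) for node, linked in neighbours(edge_list).items()}

ego = ego_network(edges, SNSF)
sizes = degrees(ego)
print(sizes)
```

The unrelated pair drops out because neither organization is linked to the centre node.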
And one for the Department of Health, which is so big that for readability we can’t label the nodes any longer. It contains more than 400 nodes and branches out, demonstrating the reach of the Department of Health.
Finally we looked at some networks and tried to explain them. It was possible to identify some more central persons and organisations through which many connections ran. Another interesting point was the network structure: we could distinguish between more centralized, tight-knit networks of collaboration and more distributed, dispersed networks. And of course we discussed the legitimacy of our conclusions, since it is not possible to say much about the nature of a collaboration just from the existence of a link. To draw any more detailed conclusions, we have to re-read the Acknowledgements and create some context for our facts.
Try to create your own network and see what you can find! You can use the IPython notebook we developed for Mozfest. You can find the notebook and documentation on how to set up your system here.