Evaluating Trends in Bioinformatics Software Packages

Genomics — the study of the genome — requires processing and analysis of large-scale datasets. For example, to identify genes that increase our susceptibility to a particular disease, scientists can read many genomes of individuals with and without this disease and try to compare differences among genomes. After running human samples on DNA sequencers, scientists need to work with overwhelming datasets.

Working with genomic datasets requires knowledge of a myriad of bioinformatic software packages. Because of rapid development of bioinformatic tools, it can be difficult to decide which software packages to use. For example, there are at least 18 software packages to align short unspliced DNA reads to a reference genome. Of course, scientists make this choice routinely and their collective knowledge is embedded in the literature.

To this end, I have been using ContentMine to find trends in bioinformatic tools. This blog post outlines my approach and demonstrates important challenges to address. Much of my current effort goes into obtaining as comprehensive a set of articles as possible, ensuring correct counting/matching methods for bioinformatic packages, and making this pipeline reproducible.
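To give a flavour of the counting/matching problem, here is a minimal sketch in Python. It is not the ContentMine pipeline: the package names are illustrative choices of mine, the function name is hypothetical, and the articles are toy strings. It counts how many articles mention each package at least once, matching case-sensitively on whole tokens so that "bowtie-shaped" or "Bowtie2" is not counted as "Bowtie".

```python
import re
from collections import Counter

# Illustrative package names (an assumption); a real run would use a curated list.
PACKAGES = ["Bowtie", "BWA", "SOAP2"]


def count_articles_mentioning(articles, packages=PACKAGES):
    """Return, for each package, the number of articles mentioning it at least once."""
    # Case-sensitive match that must not be glued to other letters or digits.
    patterns = {
        name: re.compile(r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])")
        for name in packages
    }
    counts = Counter({name: 0 for name in packages})
    for text in articles:
        for name, pattern in patterns.items():
            if pattern.search(text):
                counts[name] += 1  # count articles, not individual mentions
    return counts


# Toy article texts, made up for this sketch.
articles = [
    "Reads were aligned with Bowtie and then with BWA.",
    "We used BWA for alignment.",
    "The bowtie-shaped structure was observed.",  # lowercase word, not the package
]
counts = count_articles_mentioning(articles)
print(dict(counts))
```

Counting articles rather than raw mentions keeps one long methods section from dominating the trend. Short or ambiguous package names would still need extra care.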


Job Opportunity: ContentMine Operations Manager


ContentMine was founded in 2016 as a UK non-profit company limited by guarantee. Our mission is to establish content mining for research and for education as widespread philosophy and practice through:

  • creating computer programs, protocols, practices, standards and educational materials that enable content mining,
  • training researchers and others in content mining,
  • encouraging research institutions and funders of research to support establishing freedom for anyone to engage in computational analysis of books, journals, databases and other knowledge sources for the purposes of education and research.

We develop open source software for mining the scientific literature and engage directly in supporting researchers to use mining, saving valuable time and opening up new research avenues.

We are seeking an Operations Manager to take overall operational responsibility for ContentMine’s development and execution of its mission, reporting to the Board of Directors and working closely with the ContentMine Founder, Dr Peter Murray-Rust. The successful candidate will develop deep knowledge of our core focus, operations, and business development opportunities and manage the transition of the organisation from a project to a sustainable non-profit with oversight of all major business areas from fundraising to communications and HR.


£40-45k pro rata, negotiable.

Time and Location

4 days per week, fixed term contract for four months in the first instance, with renewal subject to funding. The candidate should be a UK or EU national. Remote working is possible, but candidates within easy travelling distance of Cambridge are preferred.


Leadership and Management:

  • Ensure ongoing excellence in delivery of the ContentMine mission, including program evaluation, and consistent quality of finance and administration. Manage fundraising, communications, and systems; recommend timelines and resources needed to achieve the strategic goals.
  • Actively engage and energize ContentMine board members, contractors, collaborators, Fellows, volunteers and funders.
  • Ensure effective systems to track progress, evaluate program components and report to the Board and funders.

Fundraising and Communications:

  • Expand revenue generating and fundraising activities to support existing program operations and planned developments.
  • Oversee and refine all aspects of communications—from web presence to external relations, with the goal of creating a stronger brand based on a recent graphical design exercise.
  • Use external presence and relationships to garner new opportunities.

Planning and New Business:

  • Build partnerships with research-oriented organisations including groups and institutes, scholarly societies and NGOs.
  • Establish relationships with potential collaborators and philanthropic funders.
  • Write grant applications and tender for client contracts.
  • Manage relationships and work allocations with partner organisations and contractors who bring new skills and capabilities to projects.

Person Specification

The Operations Manager will be thoroughly committed to ContentMine’s mission. All candidates should have proven leadership and relationship management experience. Concrete demonstrable experience and other qualifications include:

  • At least 5 years of management experience; track record of effectively leading an outcomes-based organization.
  • Ability to point to specific examples of having developed and actioned strategies that have taken an organization to the next stage of growth.
  • Commitment to delivering quality programs and data-driven program evaluation.
  • Excellence in organisational management including developing high-performance teams, setting and achieving strategic objectives, and managing a budget.
  • Fundraising experience with the ability to engage a wide range of stakeholders, particularly in the academic, non-profit, research and publishing sectors.
  • Strong written and verbal communication skills; a persuasive and passionate communicator with excellent interpersonal and multidisciplinary project skills.
  • Action-oriented, entrepreneurial, adaptable approach to business planning.
  • Ability to work effectively in collaboration with diverse groups of people.
  • Passion, integrity, positive attitude, mission-driven and self-directed focus are all desirable.

To apply

Please submit a cover letter and CV to admin@contentmine.org by 2 Dec 2016. Interviews will be held by 9 Dec. Informal enquiries should be directed to Dr Peter Murray-Rust (peter@contentmine.org).

ContentMine featured in Horizon magazine article “Copyright shift would put Europe ahead in ‘future of research’ data mining”

Horizon magazine featured an article on text and data mining and specifically the European Commission proposal for a copyright exception, currently covering “public or private organisations that are carrying out scientific research in the public interest”.

Dr Peter Murray-Rust is director of ContentMine, a not-for-profit organisation which has developed software that enables researchers to search through scientific papers on a particular subject. He gives the example of the Zika outbreak as an area where TDM can help to enhance knowledge.

‘We’re going to need to know a lot more about Zika, and much of it may already be in the scientific literature that’s been published but that we don’t read. We don’t read it because there’s so much, so we’ve built a machine, ContentMine, that will liberate the facts from the literature.’

Sci-hub and Legal aspects of ContentMining

I have written today to my collaborators in ContentMine – staff, volunteers, advisory board and Shuttleworth funders and mentors. It’s on the legal aspects of mining. It’s long, but laws are complex. It’s meant to put everyone’s minds at rest – us, universities, Shuttleworth, etc. It’s not authoritative, but may be a useful guide. We’d love to have your feedback. tl;dr I’ve assessed the main problems and most people should assume we have taken a responsible and public approach.
ContentMine is preparing to mine the complete scholarly literature every day – about 10,000 scholarly articles. People from inside CM and from outside have recently raised the question of whether CM is breaking or intends to break the law. This has arisen in part because of our intention to use the UK Copyright exception to mine the whole literature, and because of speculation about the possible use of our technology by “illegal” sites such as Sci-Hub.

NOTE: I am not a lawyer (IANAL) but I have spoken to several and am aware of general principles and practice.

The answer is simple:

CM does not intend to break the law and intends not to break the law.

And to my colleagues:
Do not worry. You will not end up in court. If anyone does – and it is unlikely – it will be me and I am prepared.

I shall expand on this in blog posts, but please be assured that I am actively assessing areas where the laws might be broken, especially inadvertently. Note, of course, that there are many other laws which we have to observe on a continual basis, including health and safety, employment, racial discrimination, libel, immigration, etc. I get frequent updates from the Chemistry Department as to what procedures we have to observe. You, I, contentmine.org and everyone are bound to observe and practice these laws. They are complex in detail, extent and interpretation, and we generally manage by knowing the outline of the law. We don’t steal, and we don’t read the small print of what is and is not a theft (e.g. “illegal borrowing”). But in other areas, e.g. animal experiments or immigration, the small print is critical. “Ignorance of the law is no defence”.

But I will take the responsibility of guiding you and making sure that you don’t transgress inadvertently.

The laws particularly relevant to contentmine.org include:

* copyright law

* sui generis database rights (Europe only)

* computer fraud law

* technological protection measures (TPM) and digital rights management (DRM)

* national security laws

Most of these laws have a concern about geo-location. We shall attempt to make sure that all our activities are carried out by UK staff, “in the UK”, on UK machines. But what is legal here may be illegal elsewhere and vice versa. Note also that many laws, especially new ones, cannot have definite answers until they are tested in a court case. Lawyers may give opinions (for fees) but ultimately the court decides.
These laws are complex and often recent and – like many laws – it is possible to transgress unknowingly. We have to educate ourselves and to behave responsibly in actions and language. If anyone is unsure they should raise the issue.
Note that by discussing this in public we will show our good faith and also be alerted by others to potential problems and misinterpretations.
Copyright law is exceedingly complex and also depends on the country. What is legal in the US may not be in Britain and vice versa. It includes:
* the process of copying for the purpose of mining for non-commercial research
* storage of copied material
* republication of the (transformed) output as part of the research/audit/verifiability requirement.

We continually discuss this with lawyers and with librarians. No one can predict precisely what is allowed and what is not – it may depend on “impact on the market of the rights-holder”. All law includes a balance of risks – it is my responsibility and (for some content) the librarians’ to make sure that we have a balanced assessment.

We believe that our mining is fully allowed under the UK 2014 reform (“Hargreaves”). It would not be allowed if we took money from commercial companies and mined the literature solely for their benefit. Europe has noted that much research is a public/private partnership (I worked for 15 years in the Cambridge Unilever Centre, for example). Was this non-commercial? I would take the view that all the projects I worked on were. If I was paid extra to do private contract research for a company which would not be published it would be commercial.

Since I and ContentMine are probably the only group in the UK at present who publicly intend to use Hargreaves there is no case law to answer these questions. We read the current public discourse and form a balanced judgment.

What copyright material can we hold on our machines? It is common for researchers to have thousands of copies of copyright material on their machines and no one is challenged. Unlike them, our material is in a secure computer room in Cambridge with physical access only by trusted staff and e-access only to 2-3 named and authorised people. If anyone wishes to “steal” the literature from our server we will actively prevent and report this. We are not, of course, ourselves redistributing any of the University subscription content other than facts and fair quotations. If, as we hope, the resource becomes useful in the University, we will work with library staff to create a legally acceptable approach where any Cambridge scholar can use the system.

How long can we hold it for? Mining is often an iterative process, so we may wish to re-run searches with new parameters. It would be a technical waste to have to re-download everything every day. It would also put additional workload on the publishers’ servers. We can’t give an answer in days or months or years until we know what the likely usage patterns are.

What can we republish? Since facts are uncopyrightable we can publish them without permission (although in Europe we cannot systematically republish the contents of databases protected by sui generis rights; journals and supplemental data are not databases). But:


is not a useful fact.

“The average snout-vent-length (SVL, see https://sizes.com/natural/lizards.htm ) of the common lizards (Zootoca vivipara) found on Borchester Common ( https://en.wikipedia.org/wiki/Borchester )  was 42 mm (+- 5) measured by 3 independent researchers using the Graduated Ruler and Eyeball Method (see http://www.wikihow.com/Use-a-Ruler )”

is a useful fact. We intend to publish some or all of the facts we extract without formal permission from the publisher.

Note that a fact does not have to be “true”. I don’t actually know the sizes of newborn sandlizards. But what I have stated is a fact. The result might be a misprint for 142 mm (which is possible for an adult). It is still a (potentially falsifiable) fact. It remains a fact regardless of further lizard research.
I will blog more on facts as “facts” are uncopyrightable.
* sui generis database rights. We do NOT currently intend to systematically extract facts from factual databases described as such and specifically created for the purpose of holding facts.
* computer fraud laws. We scrupulously avoid breaking these laws. They carry the additional feature that they are criminal, and so prosecution would be by the police. The UK takes these very seriously and wishes to extend the maximum term of imprisonment to 10 years: http://arstechnica.com/tech-policy/2016/04/uk-file-sharing-10-years-jail-time/ (I personally protest against this, but I do it legally). You should therefore take especial care not to share files “illegally”. This means that ContentMine cannot have any dealings with Sci-Hub as it is seen by many as an “illegal” filesharing site. Read Ars Technica:

<quote>The UK government has responded to that issue by saying that it accepts there are concerns, and writes: “the policy intention is that criminal offences should not apply to low level infringement that has a minimal effect or causes minimum harm to copyright owners, in particular where the individuals involved are unaware of the impact of their behaviour.”

Another major worry was the use of the term “affect prejudicially” in judging copyright infringements, which many felt was too vague and could mean a single infringing file would fulfil the requirement—for example, if it were widely shared online. Many thought this set the threshold for committing an offence far too low.

The UK government said it was not aware of any cases where minor infringement had resulted in a criminal prosecution, but “agrees that the undefined term ‘affect prejudicially’ could give rise to an element of ambiguity.” The government is now proposing to introduce “re-worded offence provisions” to address that.


It is extremely unlikely that we will trigger this law as we don’t deliberately intend to break it and deliberately don’t intend to break it. However #icanhazpdf is almost certainly “illegal” and also breaks the rules of the University. I have never used #icanhazpdf in either direction and never sent files to people who weren’t subscribed. ContentMine staff should not use #icanhazpdf.

In some cases crawling has been held to be a violation of the CFA acts of various flavours. I am not aware of any cases where scholarly publishers have used this to prosecute bona fide researchers, nor where the police have.

Note also that many publishers know that I and others (e.g. Crystallography Open Database) have been crawling their sites for many years and by implication permit it. This includes Nature, Elsevier, American Chemical Society, Royal Society of Chemistry, Acta Crystallographica, Science. We are careful to adhere to responsible mining practice (see https://contentmining.files.wordpress.com/2015/06/responsible-content-mining-1.pdf )

Aaron Swartz’s case was – for many, including me – a serious miscarriage of justice. From Wikipedia:

(https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act#Aaron_Swartz )

<quote>In the wake of the prosecution and subsequent suicide of Aaron Swartz, lawmakers have proposed to amend the Computer Fraud and Abuse Act. Representative Zoe Lofgren has drafted a bill that would help “prevent what happened to Aaron from happening to other Internet users”.[35] Aaron’s Law (H.R. 2454, S. 1196[36]) would exclude terms of service violations from the 1984 Computer Fraud and Abuse Act and from the wire fraud statute, despite the fact that Swartz was not prosecuted based on Terms of Service violations.[37]

In addition to Lofgren’s efforts, Representatives Darrell Issa and Jared Polis (also on the House Judiciary Committee) raised questions about the government’s handling of the case. Polis called the charges “ridiculous and trumped up,” referring to Swartz as a “martyr.”[38] Issa, who also chairs the House Oversight Committee, announced an investigation of the Justice Department’s prosecution.[38][39]

As of May 2014, Aaron’s Law was stalled in committee, reportedly due to tech company Oracle’s financial interests.[40]


* TPM and DRM

These are technical methods of preventing access to material and can include firewalls, encryption, specific tools, and possibly Captcha. We have bought legal advice and the result is not clear about whether Hargreaves allows us to circumvent them. The rule for all of us is that if there is any technical barrier to mining we should identify it and alert the librarians and possibly computer officers. Deliberately breaking this law could have serious consequences. Rest assured that I will publicize and comment on publishers who impose TPM.

* national security. It is very unlikely that we shall trigger this very serious offence. However, overzealous prosecutors or government departments – particularly in the US – have used such provisions.

There is a simplistic tendency of some companies and government departments to demonize all “hacking” as security violations. My laptop carries “Wget is not a crime” https://ttdphx.com/2014/10/23/digital-rights-wget-is-not-a-crime/ , after

was jailed for its use. See Slashdot for the link to Snowden and hackerbabble:

* scraping

ContentMine is in the business of scraping websites – scholarly publishers, academic departments, etc. Is this legal? People have been prosecuted for scraping (https://devcentral.f5.com/articles/web-scraping-data-collection-or-illegal-activity from a company selling anti-scraping software). Wiley and Elsevier caused Tilburg to cut off Chris Hartgerink for downloading (“stealing”) material to which he had legal access. Their accusations have not been made public and it seems most unlikely he had done anything illegal. However I have scraped publishers for 12 years (for legally accessible materials) with no complaints and I do not expect any.
* incitement to commit a crime.
In general it is a serious offence to encourage others to break the law. See http://www.cps.gov.uk/legal/h_to_k/inchoate_offences/#a01 for the official (and complex) UK law. For example I believe that any formal contact with Sci-Hub or recommendation to use it could be interpreted as a crime. Whether the same applies to breaking contract law is less clear, but ContentMine will not knowingly break this either.
Please let me know whether I have omitted an important item or have misrepresented one.

Tummy bug 2: The scientific literature teaches us about Isospora

In the previous post we showed how ContentMine could give immediate knowledge about a scientific topic – we analysed “Isospora”, which is a nasty tummy bug. Let’s just read Wikipedia to get some idea of the language we’ll need:


Life Cycle


  • An oocyst with one sporoblast is released in stool of infected person
  • After the oocyst has been released, the sporoblast matures further and divides into two
  • After the sporoblasts divide they create a cyst wall and become sporocysts
  • The sporocysts each divide twice, resulting in four sporozoites
  • Transmission occurs when these mature oocysts are ingested
  • The sporocysts excyst in the small intestine where sporozoites are released
  • The sporozoites then invade epithelial cells and schizogony is initiated
  • When the schizonts rupture, merozoites are released and continue to invade more epithelial cells
  • Trophozoites develop into schizonts, containing many merozoites
  • After about one week, development of male and female gametocytes begins in the merozoites
  • Fertilization results in the development of oocysts, which are released in the stool [1][6]

The sporulation time of this parasite’s egg is usually 1–4 days, and the entire life cycle takes about 9–10 days.[7]

Wow! That’s complicated! But that’s because Life is complicated! These parasites have complex life cycles. You have to learn the terms – but it’s no harder than learning the terms in a new game, or a law case, or soccer strategy. You just need to want to do it! And Wikipedia will help. Wikipedia is always there. These parasites are all Apicomplexans and here’s their language https://en.wikipedia.org/wiki/Apicomplexan_life_cycle#oocyst


So if you are interested in more than just Isospora, use ContentMine to search for “Apicomplexan”.

Most of the papers have well defined messages. The first was about opportunistic infections in HIV patients. Read the word cloudlet for each paper here and see if you can guess the subject of papers 2,3,4,5,6. If you know the species behind the Latin names that helps. If you don’t, use your friend Wikipedia.


Here’s my thinking:

  1. Already done
  2. Caninum, Parasitology, Vets – probably about dogs. Toxoplasma I’ve heard of – it’s a parasite and https://en.wikipedia.org/wiki/Toxoplasma_gondii confirms it. Never heard of Neospora or Hammondia but I wouldn’t eat them. Check – https://en.wikipedia.org/wiki/Neospora , https://en.wikipedia.org/wiki/Hammondia_hammondi – yes, they are both Apicomplexa, the latter a parasite of cats. Did we get it right?

Canine faecal contamination and parasitic risk in the city of Naples (southern Italy).

  3. Seems to be about ferrets and mink (Mustela) getting influenza. Ferrets develop fatal influenza after inhaling small particle aerosols of highly pathogenic avian influenza virus A/Vietnam/1203/2004 (H5N1).

It is. But why are people worried about ferrets getting sick?? Because influenza uses non-human hosts such as birds and ferrets so we might get it from them. And when I was in the pharma industry they used ferrets as a model of human disease.
Where’s the Isospora?
The animals lacked signs of epizootic catarrhal enteritis, and were negative by microscopy for enteric protozoans such as Eimeria and Isospora species using fecasol, a sodium nitrate fecal flotation solution (EVSCO Pharmaceuticals, Buena, NJ).


Translation: we made sure the test animals didn’t have other infections that could distort our research (and we told you how we did it).

  4. I know Gallus is a hen. And we’re going to add an icon and a mouseover on the table so you don’t need to look it up. Eimeria is an apicomplexan, and because it occurs 6 times in the paper it’s pretty important. I’m guessing it’s about parasites of hens. But what’s the rest? There are lots of genes and my guess is that they are being used for comparative genetics or possibly modes of action.
    I don’t know what “QTL” is. I probably should, but why bother when we have Wikipedia?


A quantitative trait locus (QTL) is a section of DNA (the locus) that correlates with variation in a phenotype (the quantitative trait).[1] The QTL typically is linked to, or contains, the genes that control that phenotype.

Rough translation: the phenotype is what we feel, touch, smell, observe in an organism, and the QTL is that part of the genes that affects it.
So the paper is probably about genomic studies on parasites and chickens. Let’s look: QTL detection for coccidiosis (Eimeria tenella) resistance in a Fayoumi × Leghorn F₂ cross, using a medium-density SNP panel.

Rough translation: analysing the genome of chickens for regions that confer resistance to the most serious parasite. Eimeria is an apicomplexan, so I expect the paper mentions a range of them, including Isospora. (Yes: “Coccidia are sub-classified into several genera, including Eimeria, Isospora, Cryptosporidium, Toxoplasma and Sarcocystis.”) So we’re becoming experts on Apicomplexan names!

  5. Turdus, Coccothraustes … Thrushes and Hawfinch. Also the cloudlet shows “birds” and “iron”. “Deadly Outbreak of Iron Storage Disease (ISD) in Italian Birds of the Family Turdidae”. This is the paper where they examine the birdshit for parasites…


So that seems a lot of work – and we are only 5 papers through. But some of those are relevant to Natalie and some aren’t – her false positives. So can we get ContentMine to select just the ones she needs?
We hope so. If the paper has a lot about apicomplexans it’s probably relevant. If it’s about other diseases such as HIV or flu it’s probably not. So we could remove those automatically.

And that would save a lot of time. And hopefully help us learn bioscience in an efficient manner.
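Here is a rough sketch in Python of what such an automatic filter might look like. It is only an illustration of the idea above, not how ContentMine does it: the two term lists, the function names and the "more on-topic than off-topic words" rule are all assumptions of mine, and the two texts are stitched together from the titles and quotes earlier in this post.

```python
import re

# Illustrative term lists (assumptions); a real filter would use curated dictionaries.
ON_TOPIC = {"isospora", "eimeria", "toxoplasma", "neospora",
            "cryptosporidium", "apicomplexan", "coccidia"}
OFF_TOPIC = {"hiv", "influenza", "h5n1"}


def relevance(text):
    """Count on-topic and off-topic word occurrences in a paper's text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    on = sum(1 for w in words if w in ON_TOPIC)
    off = sum(1 for w in words if w in OFF_TOPIC)
    return on, off


def is_probably_relevant(text):
    """Crude rule: keep a paper if apicomplexan terms outnumber other-disease terms."""
    on, off = relevance(text)
    return on > off


chicken = ("QTL detection for coccidiosis (Eimeria tenella) resistance. "
           "Coccidia include Eimeria, Isospora and Toxoplasma.")
ferret = ("Ferrets develop fatal influenza after inhaling avian influenza virus (H5N1). "
          "Animals were negative for Eimeria and Isospora.")

print(is_probably_relevant(chicken), is_probably_relevant(ferret))
```

The ferret text mentions Eimeria and Isospora once each but influenza terms more often, so this rule drops it, which matches our reading of that paper as a false positive.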

@TheContentMine preparing for large-scale high-throughput Mining (TDM)

The ContentMine (contentmine.org) has almost finished the infrastructure and software for automatic daily mining of the scientific literature. We hope to start testing in the next few days. I’ll try to post frequent information.

The software has been developed by the ContentMine Team, wonderfully funded by the Shuttleworth Foundation. The people involved include:

  • Mark MacGillivray
  • Anusha Ranganathan
  • Richard Smith-Unna
  • Tom Arrow
  • Peter Murray-Rust
  • Chris Kittel
  • and voluntary contributions

The daily operation (as opposed to user-driven getpapers) consists of:

  • DOIs and URLs provided by CrossRef
  • downloading software
  • indexing of fulltext documents (closed as well as open, legal under the UK “Hargreaves” exception)
  • fact extraction
  • display

We’ll detail this later.

The sources include:

  • open repositories such as EuropePubMedCentral
  • arxiv and other repositories
  • closed documents to which Cambridge University subscribes. We are working intimately with Cambridge University Library staff and offer public applause and thanks.

All closed work will be carried out on closed machines run by the University’s computer officers, primarily in Chemistry, and again public thanks to this wonderful group. We take great care to limit access so that no unauthorised access is possible and that there is also an audit trail of what we do and have done.

It is difficult to predict the daily volume. MarkMacG has found it to vary between 300 and 80,000 documents a day. My guess is about 2000-7000 on average.

This is NOT a resource problem. The whole scientific literature for a year can be held on a terabyte disk. The processing time is small – perhaps 1000 documents a minute on our system. A whole day’s literature can be done within a long coffee break.

The impact on publisher servers is minimal. At, say, 5000 articles/day even the largest publisher would only get 1 request per minute. The others would be trivial (1 request every 5-10 minutes). There is no case that our responsible TDM would cause any problems at all.
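For anyone who wants to check the arithmetic, here is a back-of-envelope sketch in Python. The 5000 articles/day and 1000 documents/minute figures are the ones quoted above; the 30% share for the largest publisher is purely my assumption for illustration, and the function names are made up.

```python
MINUTES_PER_DAY = 24 * 60


def requests_per_minute(articles_per_day, publisher_share):
    """Average request rate seen by one publisher, assuming one request per article."""
    return articles_per_day * publisher_share / MINUTES_PER_DAY


def processing_minutes(documents, docs_per_minute=1000):
    """Time to process a batch at the rough throughput quoted above."""
    return documents / docs_per_minute


total_rate = requests_per_minute(5000, 1.0)    # all publishers together
largest_rate = requests_per_minute(5000, 0.3)  # ASSUMED 30% share for the largest publisher

print(f"all publishers: {total_rate:.2f} requests/minute")
print(f"largest publisher (assumed 30%): {largest_rate:.2f} requests/minute")
print(f"processing 5000 documents: {processing_minutes(5000):.0f} minutes")
print(f"processing 80,000 documents: {processing_minutes(80000):.0f} minutes")
```

Under these assumptions the largest publisher sees roughly one request a minute. A typical day's documents take about five minutes to process, and an 80,000-document day about 80 minutes.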

And, just to reassure everyone, I and colleagues are working hard to stay completely within the law as we see it. We are not stealing content.