I am a taxonomist, one of the biologists charged with discovering, documenting and describing life on Earth. I specialize in insects, the most diverse and successful form of life. My ContentMine Fellowship project will focus on mining weevil-plant associations from literature records. I describe the project below.
Motivation. Comprising ~70,000 described and an estimated 220,000 species, weevils (Curculionoidea) are one of the most diverse plant-feeding insect lineages and constitute nearly 5% of all known animals. Knowledge of host plant associations is critical for pest management, conservation, and comparative biological research. This knowledge is, however, scattered across 300 years of historical literature and difficult to access. So far I have mined the literature manually and generated nearly 700 entries. This study will use ContentMine to extract, organize and synthesize knowledge of host plant associations of weevils from the literature.
Goals and significance. I hope to use reproducible computational methods to extract information about weevil-plant associations from the literature, with documented precision and recall. Searching for weevil-plant associations is an instance of the more general computational problem of detecting entity co-occurrence, so this project will have implications beyond weevils.
Description of content. The main content will be text in PDF files. The types of literature are taxonomic descriptions and natural history studies. I intend to work with English, Chinese and Spanish literature, as I can read all three languages. Taxonomic descriptions are usually well structured. Host plant information, if available, is typically presented in a separate section called “biology”, “life history”, “remarks” or sometimes simply “host plants”. Descriptions of plant associations are short statements such as “5 specimens on Mimosa farinosa Grisebach” and “Adults and larvae live in decayed female cones of Araucaria araucana (Araucariaceae)”. Plant names are usually cited at the rank of species, genus or family. A species name is always a binomial (consisting of two parts). Both species and genus names are italicized or underlined, and the genus name has its first letter capitalized. Higher-level taxonomic names of plants have standard endings, for example, -aceae for family and -eae for tribe. Like plant names, weevil names have standard endings at higher-level ranks and are binomials at the species level.
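The naming conventions above can be written down as patterns. The following is a minimal Python sketch, not a ContentMine component. It assumes plain text in which italics have been lost, so a capitalized word followed by a lowercase word is only a candidate binomial. The stop list NOT_EPITHETS and both function names are hypothetical and are there only to suppress obvious sentence-initial false positives such as “Adults and”.

```python
import re

# Candidate binomial: capitalized genus, then a lowercase epithet of 3+ letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")
# Plant family names carry the standard ending -aceae.
PLANT_FAMILY = re.compile(r"\b[A-Z][a-z]+aceae\b")

# Hypothetical, deliberately incomplete stop list (an assumption of this sketch).
NOT_EPITHETS = {"and", "are", "was", "were", "from", "the", "that", "with", "has", "have"}


def find_binomials(text):
    """Return candidate 'Genus epithet' strings found in text."""
    return [
        f"{genus} {epithet}"
        for genus, epithet in BINOMIAL.findall(text)
        if epithet not in NOT_EPITHETS
    ]


def find_plant_families(text):
    """Return words that end in -aceae."""
    return PLANT_FAMILY.findall(text)


if __name__ == "__main__":
    for s in (
        "5 specimens on Mimosa farinosa Grisebach",
        "Adults and larvae live in decayed female cones of "
        "Araucaria araucana (Araucariaceae)",
    ):
        print(find_binomials(s), find_plant_families(s))
```

Because the pattern cannot see italics, candidates would still need to be checked against a name list before being accepted as plant or weevil names.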
Mining strategies. Searches will capitalize on the regular patterns of animal and plant taxonomic names, which can be captured with regular expressions. I would first filter the literature by searching for plant names with standard higher-level endings (e.g., -aceae). For positive hits, the next step depends on the scenario; I outline three. In the first, the plant species name is followed by its family name (or another higher-level name), e.g., “Araucaria araucana (Araucariaceae)”. This is easy to work with. In the second, the plant species name is used by itself (e.g., “Collected from Buchloe dactyloides”). In this example the name Buchloe dactyloides alone does not tell us whether it is a plant or a weevil, so the name would be looked up in ThePlantList or Google for verification. In the third, the higher-level name is used alone (e.g., “Adults on the coniferous families Podocarpaceae and Phyllocladaceae”). Literature that does not contain any higher-level plant names would either contain no plant information or use only common names. An example of the latter is “on walnut and hickory trees”. A list of common names will be built and searched for. Articles that contain neither scientific nor common names of plants can be recorded as containing no plant information.
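The scenarios above can be sketched as a small statement classifier. This is an illustration under stated assumptions, not the project's implementation. KNOWN_PLANTS is a hypothetical local set standing in for a lookup in ThePlantList, COMMON_NAMES is a hypothetical two-entry common-name list, and the function name and category labels are invented for this example.

```python
import re

BINOMIAL = r"[A-Z][a-z]+ [a-z]{3,}"   # candidate "Genus epithet"
FAMILY = r"[A-Z][a-z]+aceae"          # plant family ending

SPECIES_WITH_FAMILY = re.compile(rf"\b({BINOMIAL}) \(({FAMILY})\)")
FAMILY_RE = re.compile(rf"\b{FAMILY}\b")
BINOMIAL_RE = re.compile(rf"\b{BINOMIAL}\b")

# Hypothetical stand-ins for an external name lookup and a common-name list.
KNOWN_PLANTS = {"Buchloe dactyloides"}
COMMON_NAMES = {"walnut", "hickory"}


def classify(statement):
    """Return (scenario_label, matched_names) for one statement."""
    # Scenario 1: species name followed by its family in parentheses.
    m = SPECIES_WITH_FAMILY.search(statement)
    if m:
        return ("species_with_family", [m.group(1), m.group(2)])
    # Scenario 3: a higher-level name used alone.
    families = FAMILY_RE.findall(statement)
    if families:
        return ("higher_level_only", families)
    # Scenario 2: a bare binomial, accepted only if the lookup confirms it.
    plants = [b for b in BINOMIAL_RE.findall(statement) if b in KNOWN_PLANTS]
    if plants:
        return ("species_only_verified", plants)
    # Fallback: common names, else no plant information.
    words = set(re.findall(r"[a-z]+", statement.lower()))
    common = sorted(words & COMMON_NAMES)
    if common:
        return ("common_name", common)
    return ("no_plant_information", [])


if __name__ == "__main__":
    for s in (
        "Araucaria araucana (Araucariaceae)",
        "Collected from Buchloe dactyloides",
        "Adults on the coniferous families Podocarpaceae and Phyllocladaceae",
        "on walnut and hickory trees",
    ):
        print(classify(s))
```

Checking the most specific pattern first keeps a statement like “Araucaria araucana (Araucariaceae)” from being reported as a family-only hit.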
Challenges. The first challenge is the difficulty of parsing negations. For example, the statement “Aus bus has not yet been found on any Asteraceae” cannot easily be recognized as a negative statement by a parsing program. The second challenge is that, for articles containing multiple species, it would be difficult to match the right plant with the right weevil. Additional challenges include variant forms of scientific names, such as trinomials for subspecies, e.g., Myllocerus undecimpustulatus undatus, and names containing a subgenus, e.g., Apion (Perapion) curtirostre. It would also be important to distinguish general statements made about a whole group of weevil species, say, a genus, from species-specific statements. Finally, names in older literature may be synonyms, and they will need to be reconciled with currently used valid names. I look forward to addressing these challenges as part of the ContentMine experience.
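Two of these challenges lend themselves to a small illustration: variant name forms and negation cues. The Python sketch below is only a starting point under stated assumptions. parse_name expects an already isolated name string rather than running text, and has_negation_cue merely flags a statement for manual review using a hypothetical, incomplete cue list. It does not resolve what is being negated.

```python
import re

# Genus, optional "(Subgenus)", species epithet, optional subspecies epithet.
NAME = re.compile(
    r"(?P<genus>[A-Z][a-z]+)"
    r"(?: \((?P<subgenus>[A-Z][a-z]+)\))?"
    r" (?P<species>[a-z]{3,})"
    r"(?: (?P<subspecies>[a-z]{3,}))?"
)

# Hypothetical cue list; real negation handling needs more than keywords.
NEGATION_CUES = re.compile(r"\b(?:not|never|no|absent)\b", re.IGNORECASE)


def parse_name(name):
    """Split an isolated scientific name into its parts, or return None."""
    m = NAME.fullmatch(name.strip())
    return m.groupdict() if m else None


def has_negation_cue(statement):
    """Return True if the statement contains a simple negation keyword."""
    return NEGATION_CUES.search(statement) is not None


if __name__ == "__main__":
    print(parse_name("Myllocerus undecimpustulatus undatus"))
    print(parse_name("Apion (Perapion) curtirostre"))
    print(has_negation_cue("Aus bus has not yet been found on any Asteraceae"))
```

Keeping the subgenus and subspecies as separate optional fields means a trinomial and its binomial can later be compared on genus and species alone.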