Given a set of predicted protein-coding genes for a newly sequenced genome, functional annotation involves assigning putative functions to the predicted genes. Two ways in which this can be done are assigning protein names and Gene Ontology (GO; Gene Ontology Consortium, 2010) terms to the predicted proteins. Here we describe a computational pipeline for assigning protein names and GO terms to predicted proteins in parasitic worm (nematode and platyhelminth) genomes, which transfers names and GO terms from orthologues in other species.
When assigning protein names, the UniProt protein naming rules (www.uniprot.org/docs/nameprot) are followed where possible. These rules recommend that a good and stable name for a protein is "as neutral as possible"; that a protein name "should be, as far as possible, unique and attributed to all orthologs"; and that a protein name "should not contain a specific characteristic of the protein, and in particular it should not reflect the function or role of the protein, nor its subcellular location, its domain structure, its tissue specificity, its molecular weight or its species of origin". In our protocol, a protein name is assigned to each predicted protein based on curated names in UniProt (Bairoch & Apweiler, 2000) for human, zebrafish, Drosophila melanogaster, Caenorhabditis elegans, and Schistosoma mansoni orthologues identified from a database of gene families (e.g. one built using Ensembl Compara; Vilella et al. 2009), or (if no information is found from orthologues) based on InterPro (Hunter et al. 2012) domains.
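The naming logic described above (use a curated orthologue name if one exists, otherwise fall back to InterPro domains) can be illustrated with a minimal Python sketch. The function name `assign_name`, the species priority order, and the "domain-containing protein" fallback wording are assumptions made for illustration; the text above lists the source species but does not specify an order of preference.

```python
from typing import Dict, List, Optional

# Assumed order of preference for source species (hypothetical; the protocol
# text lists these species without stating a priority).
SOURCE_SPECIES = ["human", "zebrafish", "D. melanogaster", "C. elegans", "S. mansoni"]


def assign_name(orthologue_names: Dict[str, str],
                interpro_domains: List[str]) -> Optional[str]:
    """Pick a protein name for one predicted protein.

    orthologue_names maps a source species to the curated UniProt name of an
    orthologue in that species; interpro_domains lists InterPro domain names
    found in the predicted protein.
    """
    # First choice: a curated name from an orthologue in a source species.
    for species in SOURCE_SPECIES:
        name = orthologue_names.get(species)
        if name:
            return name
    # Fallback: build a name from the first InterPro domain (assumed wording).
    if interpro_domains:
        return f"{interpro_domains[0]} domain-containing protein"
    # No information available from either source.
    return None


# Illustrative inputs only; these are not real annotations.
print(assign_name({"C. elegans": "Example protein A"}, ["Example"]))
print(assign_name({}, ["Example"]))
```

A real pipeline would also need to decide what to do when orthologues from different species carry conflicting names, which this sketch sidesteps by taking the first match.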
Figure 1 shows an example of using our protein naming pipeline for four Strongyloides ratti genes that belong to the tubulin polyglutamylase family (underlined in pink), where four different protein names were assigned to them (in pink), based on names of their C. elegans or human orthologues. Since each of the S. ratti genes belonged to a different subfamily of the tubulin polyglutamylase family, they were assigned different names.
Advantages of our approach are that it avoids taking the protein name from the top BLAST (Altschul et al. 1997) hit (which may not have a meaningful name, and/or may not be an orthologue if the gene of interest belongs to a large gene family); it transfers protein names from curated UniProt entries, so these names should be well constructed; it transfers protein names from orthologues, as recommended by UniProt; and, although the pipeline was designed with parasitic worms in mind, it can easily be adapted for other taxonomic groups (e.g. protozoans).
Previous approaches for assigning names to predicted proteins include assigning a name based on top BLAST hits. For example, for the Echinococcus multilocularis gene set, Tsai et al. (2013) found the top ten BLASTP hits of a predicted protein in GenBank (Benson 2018) and found a consensus between their protein names, downweighting uninformative names and giving higher weight to the parts of names that agree between hits. A similar approach to ours is used by Ensembl (Zerbino et al. 2018) to transfer gene names (as opposed to protein names) from curated databases (HGNC (Yates et al. 2017), MGI (Eppig et al. 2017), ZFIN (Bradford et al. 2015)) to orthologues in other species.
In our protocol, GO terms are assigned in two ways: by transferring GO terms from human, zebrafish, C. elegans, and Drosophila melanogaster orthologues (again, identified from a database of gene families), and by running InterProScan (Jones et al. 2014) to obtain InterPro2GO terms.
To maximise the amount of GO annotation, terms are transferred from all orthologues, not just one-to-one orthologues (unlike the Ensembl Compara GO-transfer approach for annotating vertebrate orthologues; Vilella et al. 2009). The Compara GO-transfer pipeline is designed to transfer GO terms between relatively closely related vertebrate species. In contrast, our pipeline is designed to transfer GO terms across animal phyla (e.g. from D. melanogaster or human to a parasitic worm). Instead of transferring GO terms directly between orthologues, the last common ancestor terms (in the GO hierarchy) of the GO terms of orthologues from two different species (e.g. a C. elegans orthologue and a D. rerio orthologue) are transferred. These GO terms are more likely to be conserved across the more distantly related species in this data set, and thus more likely to be accurate predictions for the query protein (e.g. from a parasitic worm such as Brugia timori). Like our approach for protein names, our pipeline for assigning GO terms was designed with parasitic worms in mind, but could easily be adapted for other taxonomic groups.
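The idea of transferring last common ancestor terms can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's actual implementation: the term hierarchy `TOY_PARENTS` is a made-up toy graph rather than the real GO ontology, the function names are hypothetical, and both the comparison of every pair of terms and the discarding of the root term are assumptions.

```python
from itertools import product
from typing import Dict, Set

# Toy term hierarchy (child -> parents). Hypothetical, NOT the real GO graph.
TOY_PARENTS: Dict[str, Set[str]] = {
    "biological_process": set(),
    "metabolic process": {"biological_process"},
    "transport": {"biological_process"},
    "protein modification": {"metabolic process"},
    "protein phosphorylation": {"protein modification"},
    "protein glycosylation": {"protein modification"},
}
ROOT = "biological_process"


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def last_common_ancestors(a: str, b: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Most specific terms that are ancestors of (or equal to) both a and b."""
    common = ancestors(a, parents) & ancestors(b, parents)
    # Drop any common term that is a proper ancestor of another common term.
    return {t for t in common
            if not any(t in ancestors(o, parents) - {o} for o in common)}


def transfer_terms(terms_sp1: Set[str], terms_sp2: Set[str],
                   parents: Dict[str, Set[str]]) -> Set[str]:
    """Terms to transfer, given the terms of orthologues from two species."""
    result: Set[str] = set()
    # Assumption: every pair of terms (one per species) is compared.
    for a, b in product(terms_sp1, terms_sp2):
        result |= last_common_ancestors(a, b, parents)
    result.discard(ROOT)  # assumption: the root term is uninformative
    return result


print(transfer_terms({"protein phosphorylation"}, {"protein glycosylation"}, TOY_PARENTS))
```

In this toy example, one orthologue is annotated with "protein phosphorylation" and the other with "protein glycosylation", so only the shared, less specific term "protein modification" is transferred to the query protein.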
Vilella et al. (2009) showed that the similar approach developed by Ensembl, for transferring GO terms between one-to-one orthologues among vertebrates, gave more detailed GO terms than InterPro2GO when they transferred GO terms from human and mouse genes to vertebrate orthologues. Thus, we believe that our approach for transferring GO terms between animal orthologues, supplemented by GO terms from InterPro2GO, will give an accurate and complete set of GO annotations for a parasitic worm genome.