Annotating enzymes from amino acid sequence is not straightforward, and several published methods tackle the problem with different approaches. This protocol combines some of these complementary methods to obtain a high-confidence set of enzymes encoded in the genome.
See figure in Figures section. Figure 1: Overview of the protocol. EC sets in the green box are high-confidence ECs. EC set C (in the red box) is a lower-confidence set with fewer false negatives.
As shown in Figure 1, the first step is to annotate enzymes from the protein sequences using multiple annotation methods. Most of these methods draw on different characteristics of the protein sequence and assign an enzyme activity, formally represented as an Enzyme Commission (EC) number, to a protein if it passes certain qualification thresholds. KAAS1, which uses the KEGG database2, assigns KEGG Orthology (KO) IDs, which can then be translated to the corresponding EC IDs using information from the KEGG database. PRIAM3 uses position-specific scoring matrices derived for ECs in the ENZYME database4. Because we consider KAAS and PRIAM to have a potential for higher false-positive rates, we use only the intersection of their EC annotations. On the other hand, annotations from the BRENDA database5, which are literature-curated, and predictions from the DETECT pipeline6, a prediction approach that accounts for sequence diversity across enzyme families, are considered robust. Thus, the KAAS/PRIAM intersection is combined with annotations from DETECTv1 and BRENDA to yield a high-confidence set of ECs (EC set A in Figure 1). This set is unlikely to suffer from high false-positive rates, because DETECT considers probabilities of all sequences and BRENDA is a literature-curated set of enzymes. However, owing to the stringency of the annotation parameters, some ECs may be missed in some draft genomes (i.e. false negatives).
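The set logic of this first step can be sketched in a few lines of Python. This is a minimal illustration, not the protocol's own implementation: it assumes each tool's output has already been parsed into a plain set of EC strings for one genome, that "combined" means a set union, and that the KO-to-EC translation is available as a dictionary built from KEGG data. The function names `kos_to_ecs` and `build_ec_set_a` and all argument names are hypothetical.

```python
def kos_to_ecs(ko_ids, ko_to_ec):
    """Translate KEGG Orthology IDs to EC numbers.

    ko_to_ec is an assumed mapping {KO ID: iterable of EC strings};
    KO IDs without an EC entry are skipped.
    """
    ecs = set()
    for ko in ko_ids:
        ecs.update(ko_to_ec.get(ko, ()))
    return ecs


def build_ec_set_a(kaas_ecs, priam_ecs, detect_ecs, brenda_ecs):
    """Return EC set A for one genome.

    KAAS and PRIAM are trusted only where they agree (intersection);
    DETECT and BRENDA are taken as they are. Treating "combined" as a
    union is an assumption of this sketch.
    """
    kaas_priam_agreement = set(kaas_ecs) & set(priam_ecs)
    return kaas_priam_agreement | set(detect_ecs) | set(brenda_ecs)
```

With this formulation, an EC reported by only one of KAAS or PRIAM enters EC set A only if DETECT or BRENDA also reports it.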
For applications that are sensitive to false negatives (e.g. determining amino acid auxotrophies), a second step adds further ECs to EC set A. This is done using the pathway hole-filling algorithm of the Pathway Tools package7. Pathway Tools reconstructs metabolic networks using reference pathways together with an input EC set (EC set A in this case). Based on the coverage of reference pathways and the input genome sequences, the hole-filling algorithm identifies genes that are likely to encode candidate false-negative ECs. These candidates are pruned by retaining only those supported by KEGG, PRIAM, and EFICaZ8, and the retained ECs are added to EC set A. Together, the high-confidence EC annotations (EC set A) and those identified through the hole-filling procedure constitute the final set of high-quality EC predictions (EC set B).
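The pruning of hole-filler candidates can be sketched as a simple support filter. This is an illustrative sketch under stated assumptions: the Pathway Tools candidates and the KEGG, PRIAM, and EFICaZ predictions are assumed to be already parsed into sets of EC strings, and the function name `build_ec_set_b` and the `min_support` parameter are hypothetical. The text does not say whether a candidate needs support from all three tools or from any of them, so the sketch defaults to requiring all and exposes the threshold.

```python
def build_ec_set_b(ec_set_a, hole_filler_ecs, support_sets, min_support=None):
    """Return EC set B: EC set A plus pruned hole-filler candidates.

    support_sets: list of EC sets, e.g. [kegg_ecs, priam_ecs, eficaz_ecs].
    min_support: how many of those sets must contain a candidate EC.
                 Defaults to all of them (an assumption of this sketch).
    """
    support_sets = [set(s) for s in support_sets]
    required = len(support_sets) if min_support is None else min_support
    accepted = {
        ec
        for ec in hole_filler_ecs
        if sum(ec in s for s in support_sets) >= required
    }
    return set(ec_set_a) | accepted
```

Lowering `min_support` trades fewer false negatives for a higher risk of false positives, so the value should follow whatever rule the protocol actually applies.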
For large-scale comparative studies, we also use gene family information to infer additional ECs that were not identified by the annotation pipeline. Since sequence orthology does not always correlate with shared enzyme activity, we infer ECs only when there are no conflicts in the orthogroup's EC annotation and the annotation is supported by high-quality genomes within the gene family. Note that these low-confidence ECs (EC set C) are used only to validate results obtained from EC set B, rather than for de novo analysis.
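The orthogroup-based inference can be illustrated with the sketch below. It rests on several assumptions that the text does not fix: an orthogroup is given as a mapping from genome ID to the set of ECs annotated on that genome's member genes (an empty set if none), "no conflicts" is read as all annotated members carrying exactly the same ECs, and the number of high-quality genomes required as support is a tunable threshold. The function name `infer_orthogroup_ecs` and its parameters are hypothetical.

```python
def infer_orthogroup_ecs(member_ecs, high_quality_genomes, min_hq_support=1):
    """Infer ECs for unannotated members of one orthogroup.

    member_ecs: {genome ID: set of EC strings}, empty set if unannotated.
    high_quality_genomes: set of genome IDs regarded as high quality.
    Returns {genome ID: inferred EC set} for unannotated members only,
    or an empty dict if the orthogroup does not qualify.
    """
    annotated = {g: frozenset(ecs) for g, ecs in member_ecs.items() if ecs}
    distinct_annotations = set(annotated.values())
    # Require exactly one annotation shared by every annotated member.
    if len(distinct_annotations) != 1:
        return {}
    consensus = next(iter(distinct_annotations))
    hq_support = sum(1 for g in annotated if g in high_quality_genomes)
    if hq_support < min_hq_support:
        return {}
    return {g: set(consensus) for g, ecs in member_ecs.items() if not ecs}
```

ECs inferred this way would feed EC set C only, in line with their use for validating results from EC set B.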