Eutherian comparative genomic analysis protocol
The eutherian comparative genomic analysis protocol RRID:SCR_014401 integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis into one framework of eutherian gene descriptions. The protocol included 3 original genomics and protein molecular evolution tests: tests of reliability of public eutherian genomic sequences using genomic sequence redundancies, tests of contiguity of public eutherian genomic sequences using multiple pairwise genomic sequence alignments, and tests of protein molecular evolution using relative synonymous codon usage statistics.
1. Gene annotations
The eutherian gene annotations included gene identifications in public genomic sequence assemblies, analyses of gene features, tests of reliability of public eutherian genomic sequences and tests of contiguity of public eutherian genomic sequences.
1.1. All analyses and manipulations of nucleotide and protein sequences used sequence alignment editor BioEdit.
1.2. The eutherian reference genomic sequence data sets were accessible in National Center for Biotechnology Information's (NCBI) GenBank, as well as in Ensembl genome browser.
1.3. The identifications of potential coding sequences used public eutherian reference genomic sequence assemblies and NCBI's BLAST program including BLAST Genomes and Ensembl genome browser’s BLAST or BLAT programs.
1.4. The analyses of gene features used potential coding sequences and direct evidence of eutherian gene annotations accessible in NCBI's nr, est_human, est_mouse and est_others databases.
1.5. The tests of reliability of public eutherian genomic sequences analysed potential coding sequences using good laboratory practice in Sanger DNA sequencing method. The first test steps analysed nucleotide sequence coverages of potential coding sequences using NCBI's BLAST program and processed Sanger DNA sequencing reads or traces accessible in NCBI's Trace Archive. The second test steps discriminated complete coding sequences and putative coding sequences. Specifically, the tests described potential coding sequences as complete coding sequences only if consensus trace nucleotide sequence coverages were available for every nucleotide. Alternatively, if consensus trace nucleotide sequence coverages were not available for every nucleotide, the potential coding sequences were described as putative coding sequences that were not used in analyses. In particular, the good laboratory practice in Sanger DNA sequencing method required that minimal consensus trace nucleotide sequence coverage included 2 identical trace nucleotide sequences.
1.6. The tests of contiguity of public eutherian genomic sequences included multiple pairwise genomic sequence alignments. The tests used public eutherian reference genomic sequences encoding complete coding sequences and mVISTA's program AVID. In eutherian genomic sequences, the tests analysed translated exon numbers, as well as their chimerisms and relative orders and orientations. The tests of contiguity of eutherian public genomic sequences did not use masking of transposable elements in public eutherian reference genomic sequence assemblies.
1.7. The curated eutherian gene collections were deposited in European Nucleotide Archive as third party data gene data sets. The revised and updated eutherian gene classifications and nomenclatures used guidelines of human gene nomenclature and guidelines of mouse gene nomenclature.
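The decision rule in step 1.5 can be illustrated with a minimal Python sketch. It is not part of the protocol, which used NCBI's BLAST program and Trace Archive interactively. The function name classify_coding_sequence and the trace_hits input format (a 0-based start offset plus the trace bases aligned without gaps) are assumptions made only for illustration.

```python
MIN_IDENTICAL_TRACES = 2  # minimal consensus coverage stated in step 1.5


def classify_coding_sequence(cds, trace_hits, min_identical=MIN_IDENTICAL_TRACES):
    """Label a potential coding sequence as 'complete' or 'putative'.

    cds        -- potential coding sequence as a nucleotide string
    trace_hits -- hypothetical input: list of (start, bases) pairs, where
                  start is a 0-based offset into cds and bases are the trace
                  nucleotides aligned without gaps from that offset
    """
    # Count, for every nucleotide, the traces that agree with it.
    support = [0] * len(cds)
    for start, bases in trace_hits:
        for offset, base in enumerate(bases):
            pos = start + offset
            if 0 <= pos < len(cds) and base.upper() == cds[pos].upper():
                support[pos] += 1
    # Complete only if every nucleotide reaches the minimal coverage.
    if cds and min(support) >= min_identical:
        return "complete"
    return "putative"


if __name__ == "__main__":
    cds = "ATGGCCTAA"
    hits = [(0, "ATGGCC"), (0, "ATGGCCTAA"), (3, "GCCTAA")]
    print(classify_coding_sequence(cds, hits))
```

In this sketch a single unsupported or discordant nucleotide is enough to label the whole sequence as putative, which mirrors the requirement that coverage be available for every nucleotide.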
2. Phylogenetic analysis
The phylogenetic analysis included protein and nucleotide sequence alignments, calculations of phylogenetic trees and calculations of pairwise nucleotide sequence identities.
2.1. The complete coding sequences were translated using BioEdit and then aligned at amino acid level using ClustalW in protein amino acid sequence alignments. The protein amino acid sequence alignments were manually corrected, and nucleotide sequence alignments were prepared accordingly using BioEdit.
2.2. The calculations of phylogenetic trees used nucleotide sequence alignments and MEGA program.
2.3. Using nucleotide sequence alignments, the pairwise nucleotide sequence identities of eutherian complete coding sequences were calculated using BioEdit. The statistical analyses using Microsoft Office Excel statistical functions included calculations of average pairwise nucleotide sequence identities (ā) and their average absolute deviations (āad), as well as largest (amax) and smallest (amin) pairwise nucleotide sequence identities.
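The statistics in step 2.3 can be sketched in Python as follows. The protocol itself used BioEdit and Microsoft Office Excel, so this is only an illustration under stated assumptions: identity is taken as the fraction of identical positions over alignment columns in which at least one sequence has a nucleotide, which may differ from how BioEdit treats gaps, and the average absolute deviation is the mean of absolute differences from the average. The function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_identity(seq_a, seq_b):
    """Fraction of identical positions between two aligned sequences.

    Assumption: columns where both sequences carry a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def identity_statistics(alignment):
    """Average, average absolute deviation, largest and smallest identity."""
    values = [pairwise_identity(a, b) for a, b in combinations(alignment, 2)]
    average = mean(values)
    return {
        "a_mean": average,
        "a_ad": mean([abs(v - average) for v in values]),
        "a_max": max(values),
        "a_min": min(values),
    }


if __name__ == "__main__":
    print(identity_statistics(["ATGC", "ATGC", "ATGA"]))
```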
3. Protein molecular evolution analysis
The protein molecular evolution analysis included analyses of protein amino acid sequence features and tests of protein molecular evolution using relative synonymous codon usage statistics.
3.1. The protein amino acid sequence features were annotated manually, including analyses of common cysteine amino acid residue patterns among eutherian major protein clusters.
3.2. The tests of protein molecular evolution using relative synonymous codon usage statistics integrated patterns of nucleotide sequence similarities with protein primary structures. Using nucleotide sequence alignments, MEGA calculated relative synonymous codon usage statistics as ratios between observed and expected amino acid codon counts (R = Counts / Expected counts). The amino acid codons with R ≤ 0.7 were described as not preferable amino acid codons. In reference protein amino acid sequences, the tests described invariant amino acid sites (invariant alignment positions), forward amino acid sites (variant alignment positions that did not include amino acid codons with R ≤ 0.7) and compensatory amino acid sites (variant alignment positions that included amino acid codons with R ≤ 0.7).
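The classification in step 3.2 can be sketched in Python. The protocol calculated the statistics in MEGA, so this is only an illustration under several assumptions: the standard genetic code; expected counts taken as the total count of an amino acid divided by the number of its synonymous codons (the conventional relative synonymous codon usage definition); invariance judged at amino acid level; and codons containing gaps or ambiguous nucleotides ignored. The function names rscu and classify_sites are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code in the usual TCAG ordering (assumption).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(sequences):
    """R = observed count / expected count for each sense codon."""
    counts = Counter()
    for seq in sequences:
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3].upper()
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    values = {}
    for aa in set(AMINO) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total:
            expected = total / len(family)
            for codon in family:
                values[codon] = counts[codon] / expected
    return values


def classify_sites(sequences, threshold=0.7):
    """Label each codon column as invariant, forward or compensatory."""
    values = rscu(sequences)
    labels = []
    for i in range(0, min(len(s) for s in sequences) - 2, 3):
        column = [s[i:i + 3].upper() for s in sequences]
        column = [c for c in column if CODON_TABLE.get(c, "*") != "*"]
        residues = {CODON_TABLE[c] for c in column}
        if len(residues) <= 1:
            labels.append("invariant")
        elif any(values[c] <= threshold for c in column):
            labels.append("compensatory")
        else:
            labels.append("forward")
    return labels


if __name__ == "__main__":
    alignment = ["ATGAAAAAGGAT", "ATGAAAGAAAAT", "ATGAAAGAAGAT"]
    print(classify_sites(alignment))
```

In the example alignment the third codon column mixes lysine and glutamate and contains AAG with R = 0.5, so it is labelled compensatory. The fourth column is variant but contains no codon with R ≤ 0.7, so it is labelled forward.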