Eutherian comparative genomic analysis protocol
The eutherian comparative genomic analysis protocol (RRID:SCR_014401) integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis into one framework of eutherian gene descriptions.
The eutherian gene annotations included gene identifications in public reference genomic sequence assemblies, analyses of gene features, multiple pairwise genomic sequence alignments and tests of reliability of eutherian public genomic sequences. The protocol used the sequence alignment editor BioEdit in all analyses of nucleotide and protein sequences. The identifications of potential coding sequences used eutherian reference genomic sequence assemblies downloaded from the National Center for Biotechnology Information (NCBI) GenBank or the Ensembl genome browser, as well as NCBI's BLAST programs and the Ensembl genome browser's BLAST or BLAT web tools. In analyses of gene features, the protocol used direct evidence of gene annotations available in NCBI's nr, est_human, est_mouse and est_others databases. The protocol established tests of reliability of eutherian public genomic sequences that used potential coding sequences. In the first test steps, the protocol analysed the nucleotide sequence coverage of each potential coding sequence, using NCBI's BLASTN program and primary experimental genomic sequence information deposited in NCBI's Trace Archive. In the second test steps, the protocol described potential coding sequences as complete coding sequences only if consensus trace coverage was available for every nucleotide in each potential coding sequence. Otherwise, the protocol described potential coding sequences as putative coding sequences, which were not used in analyses. The protocol used complete coding sequences in all analyses, and deposited them in the European Nucleotide Archive as curated third party data gene data sets. The guidelines of human gene nomenclature and mouse gene nomenclature were used in revisions and updates of gene classifications. In multiple pairwise genomic sequence alignments, the protocol used the AVID alignment program implemented in mVISTA. In the base sequences used in these alignments, transposable elements were masked using RepeatMasker. Finally, the pairwise nucleotide sequence identities of predicted promoter regions, calculated using BioEdit, were used in statistical analyses (Microsoft Office Excel).
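The decision rule of the second reliability test steps can be illustrated with a minimal Python sketch. It is not the protocol's own implementation: the function name, the representation of BLASTN hits against Trace Archive reads as 1-based (start, end) intervals on the potential coding sequence, and the min_depth parameter are all hypothetical, and "consensus trace coverage" is assumed here to mean that at least min_depth trace hits span each nucleotide.

```python
from typing import List, Tuple


def classify_coding_sequence(
    length: int,
    trace_hits: List[Tuple[int, int]],
    min_depth: int = 1,
) -> str:
    """Label a potential coding sequence as 'complete' or 'putative'.

    length      -- number of nucleotides in the potential coding sequence
    trace_hits  -- 1-based inclusive (start, end) intervals covered by
                   trace reads (hypothetical input format)
    min_depth   -- assumed minimum number of traces per nucleotide
    """
    if length <= 0:
        raise ValueError("length must be positive")

    # Count how many trace hits cover each nucleotide position.
    depth = [0] * length
    for start, end in trace_hits:
        lo, hi = sorted((start, end))  # tolerate reverse-strand hits
        for pos in range(max(lo, 1), min(hi, length) + 1):
            depth[pos - 1] += 1

    # Complete only if every nucleotide reaches the required depth.
    return "complete" if min(depth) >= min_depth else "putative"


if __name__ == "__main__":
    print(classify_coding_sequence(10, [(1, 6), (5, 10)]))  # full coverage
    print(classify_coding_sequence(10, [(1, 4), (6, 10)]))  # position 5 uncovered
```

A single uncovered nucleotide is enough to label the sequence as putative, which mirrors the "every nucleotide" condition stated above.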
In phylogenetic analyses, the protocol included protein and nucleotide sequence alignments, calculations of phylogenetic trees and calculations of pairwise nucleotide sequence identity patterns. First, the complete coding sequences were translated using BioEdit, and then aligned at the amino acid level using ClustalW implemented in BioEdit. After inspections and manual corrections of protein primary sequence alignments, the protocol prepared nucleotide sequence alignments. The MEGA program was used in phylogenetic tree calculations. The protocol used the neighbour-joining, minimum evolution, maximum parsimony and unweighted pair group method with arithmetic mean (UPGMA) methods. The pairwise nucleotide sequence identities of nucleotide sequence alignments, calculated using BioEdit, were used in statistical analyses (Microsoft Office Excel). For each nucleotide sequence alignment, the protocol calculated the average pairwise identity and its average absolute deviation, as well as the largest and smallest pairwise identities. Finally, the protocol discriminated between eutherian major gene clusters that included evidence of differential gene expansions and those that did not.
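The summary statistics of pairwise nucleotide sequence identities described above can be sketched in Python as follows. The protocol itself used BioEdit and Microsoft Office Excel; this sketch only illustrates the arithmetic. The function names are hypothetical, and the gap handling (columns with gaps in both sequences are skipped, a gap opposite a nucleotide counts as a mismatch) is an assumption that may differ from BioEdit's identity matrix.

```python
from itertools import combinations
from statistics import mean
from typing import Dict


def pairwise_identity(a: str, b: str, gap: str = "-") -> float:
    """Fraction of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap and y == gap:
            continue  # assumed: shared gap columns are not compared
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return matches / compared


def identity_summary(alignment: Dict[str, str]) -> Dict[str, float]:
    """Average, average absolute deviation, largest and smallest identity."""
    values = [
        pairwise_identity(alignment[p], alignment[q])
        for p, q in combinations(alignment, 2)
    ]
    if not values:
        raise ValueError("at least two sequences are required")
    average = mean(values)
    deviation = mean([abs(v - average) for v in values])
    return {
        "average": average,
        "average_absolute_deviation": deviation,
        "largest": max(values),
        "smallest": min(values),
    }


if __name__ == "__main__":
    toy = {"seq1": "ACGT", "seq2": "ACGT", "seq3": "ACGA"}  # toy data only
    print(identity_summary(toy))
```

The average absolute deviation is taken around the average pairwise identity, as in the description of the statistics above.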
Protein molecular evolution analysis
The protocol established tests of protein molecular evolution that integrated patterns of nucleotide sequence similarities with protein primary structures. The protein and nucleotide sequence alignments were used in the tests. First, for each nucleotide sequence alignment, the protocol calculated codon usage statistics using MEGA. The ratios between observed and expected amino acid codon counts determined the relative synonymous codon usage statistics (R = Observed codon counts / Expected codon counts). In nucleotide sequence alignments, the amino acid codons with R ≤ 0.7 were designated as not preferable amino acid codons. Using the protein and nucleotide sequence alignments, the protocol then described the reference protein sequence amino acid sites as invariant amino acid sites (invariant alignment positions), forward amino acid sites (variant alignment positions that did not include amino acid codons with R ≤ 0.7) or compensatory amino acid sites (variant alignment positions that included amino acid codons with R ≤ 0.7).
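The test described above can be illustrated with a minimal Python sketch. The protocol calculated codon usage statistics with MEGA; here the expected count of a codon is assumed to be the total count of its amino acid divided by the number of synonymous codons in the standard genetic code, which is the usual relative synonymous codon usage definition. The function names are hypothetical. The sketch also assumes an in-frame codon alignment, invariance judged at the amino acid level, and a label for every alignment column rather than only for reference sequence sites.

```python
from collections import Counter
from itertools import product
from typing import Dict, List

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(alignment: List[str]) -> Dict[str, float]:
    """R = observed / expected codon counts over the whole alignment."""
    counts: Counter = Counter()
    for seq in alignment:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips gapped or ambiguous codons
                counts[codon] += 1
    result: Dict[str, float] = {}
    for aa in set(CODON_TABLE.values()) - {"*"}:
        synonyms = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in synonyms)
        if total == 0:
            continue
        expected = total / len(synonyms)  # assumed uniform expectation
        for c in synonyms:
            result[c] = counts[c] / expected
    return result


def classify_sites(alignment: List[str], threshold: float = 0.7) -> List[str]:
    """Label each codon column as invariant, forward or compensatory."""
    values = rscu(alignment)
    labels = []
    for k in range(len(alignment[0]) // 3):
        column = [seq[3 * k:3 * k + 3].upper() for seq in alignment]
        residues = {CODON_TABLE.get(c, "-") for c in column}
        if len(residues) == 1:
            labels.append("invariant")
        elif any(values.get(c, float("inf")) <= threshold for c in column):
            labels.append("compensatory")  # variant, has a codon with R <= 0.7
        else:
            labels.append("forward")  # variant, no codon with R <= 0.7
    return labels


if __name__ == "__main__":
    toy = ["GCTATGAAGAAA", "GCTATGGATAAA", "GCTTGGGATAAA"]  # toy data only
    print(classify_sites(toy))
```

In the toy alignment, the codon AAG occurs once among four lysine codons, so R = 1 / (4 / 2) = 0.5, and the variant column that contains it is labelled compensatory.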