Eutherian comparative genomic analysis protocol
The eutherian comparative genomic analysis protocol integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis into one framework of eutherian gene descriptions (Figure 1).
Gene annotations
The eutherian gene annotations included gene identifications in public reference genomic sequence assemblies, analyses of gene features, multiple pairwise genomic sequence alignments and tests of reliability of eutherian public genomic sequences. The protocol used the sequence alignment editor "BioEdit":http://www.mbio.ncsu.edu/BioEdit/bioedit.html in all analyses of nucleotide and protein sequences. The identifications of potential coding sequences used eutherian reference genomic sequence assemblies downloaded from "National Center for Biotechnology Information (NCBI) GenBank":ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/ or "Ensembl genome browser":http://www.ensembl.org , as well as "NCBI's BLAST programs":ftp://ftp.ncbi.nlm.nih.gov/blast/ and "Ensembl genome browser's BLAST or BLAT web tools":http://www.ensembl.org . In analyses of gene features, the protocol used direct evidence of gene annotations available in "NCBI's nr, est_human, est_mouse and est_others databases":https://www.ncbi.nlm.nih.gov . The protocol established tests of reliability of eutherian public genomic sequences that used potential coding sequences. The first test step analysed the nucleotide sequence coverage of each potential coding sequence, using the BLASTN program and primary experimental genomic sequence information deposited in "NCBI's Trace Archive":https://www.ncbi.nlm.nih.gov/Traces/trace.cgi . The second test step described a potential coding sequence as a complete coding sequence only if consensus trace coverage was available for every nucleotide of that sequence. Otherwise, the protocol described it as a putative coding sequence and excluded it from analyses. The protocol used only complete coding sequences in all analyses, and deposited them in the "European Nucleotide Archive":https://www.ebi.ac.uk/ena as curated "third party data gene data sets":https://www.ebi.ac.uk/ena/about/tpa-policy .
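The decision rule of the second test step can be illustrated with a minimal Python sketch. It assumes that the BLASTN hits against the Trace Archive have already been reduced to 1-based, inclusive intervals on the potential coding sequence, and that only hits supporting the consensus were retained. The function names and the interval representation are hypothetical and are not part of the protocol.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: 1-based, inclusive (start, end) on the coding sequence.
Interval = Tuple[int, int]


def uncovered_regions(cds_length: int, hits: Iterable[Interval]) -> List[Interval]:
    """Return the stretches of the coding sequence not covered by any trace hit."""
    if cds_length < 1:
        raise ValueError("cds_length must be a positive integer")
    gaps: List[Interval] = []
    next_uncovered = 1  # first position not yet known to be covered
    for start, end in sorted(hits):
        # Clip each hit to the boundaries of the coding sequence.
        start, end = max(start, 1), min(end, cds_length)
        if start > end:
            continue  # hit lies entirely outside the coding sequence
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= cds_length:
        gaps.append((next_uncovered, cds_length))
    return gaps


def classify_coding_sequence(cds_length: int, hits: Iterable[Interval]) -> str:
    """Label a potential coding sequence 'complete' only if every nucleotide is covered."""
    return "complete" if not uncovered_regions(cds_length, hits) else "putative"
```

For example, `classify_coding_sequence(10, [(1, 5), (5, 10)])` yields `"complete"`, whereas a single uncovered nucleotide is enough to yield `"putative"`; `uncovered_regions` reports which positions lack trace support.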
The guidelines of "human gene nomenclature":http://www.genenames.org/about/guidelines and "mouse gene nomenclature":http://www.informatics.jax.org/mgihome/nomen/gene.shtml were used in revisions and updates of gene classifications. In multiple pairwise genomic sequence alignments, the protocol used "mVISTA's AVID":http://genome.lbl.gov/vista/index.shtml . In base sequences used in multiple pairwise genomic sequence alignments, transposable elements were masked by "RepeatMasker":http://www.repeatmasker.org/ . Finally, the pairwise nucleotide sequence identities of predicted promoter regions calculated using "BioEdit":http://www.mbio.ncsu.edu/BioEdit/bioedit.html were used in statistical analyses (Microsoft Office Excel).
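As an illustration of a pairwise nucleotide sequence identity calculation, the following minimal Python sketch computes the fraction of identical positions between two aligned sequences. The gap handling is an assumption made for this sketch (columns with a gap in both sequences are skipped, and a gap opposite a nucleotide counts as a mismatch) and may differ from the calculation performed by BioEdit. The function name is hypothetical.

```python
def pairwise_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fraction of identical positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap and b == gap:
            continue  # assumption: columns gapped in both sequences are ignored
        compared += 1
        if a == b:
            matches += 1  # a gap opposite a nucleotide never matches
    if compared == 0:
        raise ValueError("no comparable alignment positions")
    return matches / compared
```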
Phylogenetic analysis
In phylogenetic analyses, the protocol included protein and nucleotide sequence alignments, calculations of phylogenetic trees and calculations of pairwise nucleotide sequence identity patterns. First, the complete coding sequences were translated using "BioEdit":http://www.mbio.ncsu.edu/BioEdit/bioedit.html , and then aligned at the amino acid level using ClustalW implemented in "BioEdit":http://www.mbio.ncsu.edu/BioEdit/bioedit.html . After inspections and manual corrections of protein primary sequence alignments, the protocol prepared nucleotide sequence alignments. "MEGA":http://www.megasoftware.net was used in phylogenetic tree calculations. The protocol used the neighbour-joining, minimum evolution, maximum parsimony and unweighted pair group method with arithmetic mean (UPGMA) methods. The pairwise nucleotide sequence identities of nucleotide sequence alignments calculated using "BioEdit":http://www.mbio.ncsu.edu/BioEdit/bioedit.html were used in statistical analyses (Microsoft Office Excel). For each nucleotide sequence alignment, the protocol calculated average pairwise identities and their average absolute deviations, as well as largest pairwise identities and smallest pairwise identities. Finally, the protocol discriminated between eutherian major gene clusters with and without evidence of differential gene expansions.
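The four summary statistics calculated for each nucleotide sequence alignment can be sketched in Python as follows. The protocol performed these calculations in Microsoft Office Excel; the function name and the dictionary keys below are hypothetical, and the input is assumed to be the list of pairwise identities (as fractions) for one alignment.

```python
from statistics import fmean
from typing import Dict, Iterable


def identity_pattern(identities: Iterable[float]) -> Dict[str, float]:
    """Summarise the pairwise identities of one nucleotide sequence alignment."""
    values = list(identities)
    if not values:
        raise ValueError("at least one pairwise identity is required")
    average = fmean(values)
    # Average absolute deviation: mean distance of each identity from the average.
    deviation = fmean([abs(v - average) for v in values])
    return {
        "average": average,
        "average_absolute_deviation": deviation,
        "largest": max(values),
        "smallest": min(values),
    }
```

For three pairwise identities of 0.5, 0.75 and 1.0, the average is 0.75 and the average absolute deviation is (0.25 + 0 + 0.25) / 3, or approximately 0.167.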
Protein molecular evolution analysis
The protocol established tests of protein molecular evolution integrating patterns of nucleotide sequence similarities with protein primary structures. The protein and nucleotide sequence alignments were used in tests. First, for each nucleotide sequence alignment, the protocol calculated codon usage statistics using "MEGA":http://www.megasoftware.net . The ratios between observed and expected amino acid codon counts determined relative synonymous codon usage statistics (R). In each nucleotide sequence alignment, the amino acid codons with R ≤ 0.7 were designated as not preferable amino acid codons. Using the protein and nucleotide sequence alignments, the protocol then described reference protein sequence amino acid sites as invariant amino acid sites (invariant alignment positions), forward amino acid sites (variant alignment positions that did not include amino acid codons with R ≤ 0.7) or compensatory amino acid sites (variant alignment positions that included amino acid codons with R ≤ 0.7).
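The classification of amino acid sites can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the protocol: R is computed with the standard definition (observed codon count divided by the count expected under equal usage of the synonymous codons, with the standard genetic code), stop codons and gapped codons receive no R value, invariance is judged at the amino acid level, and every alignment column is classified rather than only the positions of a reference sequence. All names are hypothetical, and the R values may differ from those reported by MEGA.

```python
from collections import Counter
from itertools import product
from typing import Dict, List

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(seq: str) -> List[str]:
    """Split an in-frame aligned coding sequence into codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def rscu(alignment: List[str]) -> Dict[str, float]:
    """Relative synonymous codon usage (R) over all sequences of one alignment."""
    counts = Counter(
        codon for seq in alignment for codon in split_codons(seq) if codon in CODON_TABLE
    )
    values: Dict[str, float] = {}
    for aa in set(AMINO_ACIDS) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from the alignment: R is undefined
        expected = total / len(family)
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


def classify_sites(alignment: List[str], threshold: float = 0.7) -> List[str]:
    """Label each codon column as invariant, forward or compensatory."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must have equal length")
    r_values = rscu(alignment)
    labels = []
    for column in zip(*(split_codons(seq) for seq in alignment)):
        # Assumption: gapped or ambiguous codons count as a separate state "-".
        states = {CODON_TABLE.get(codon, "-") for codon in column}
        if len(states) == 1:
            labels.append("invariant")
        elif any(c in r_values and r_values[c] <= threshold for c in column):
            labels.append("compensatory")
        else:
            labels.append("forward")
    return labels
```

For the toy alignment `["GCTGCTAAAAAA", "GCTGCTAAGAAA", "GCTGATAGAAAA"]`, the codon AAG has R = 0.4, so the third column (Lys, Lys, Arg) is labelled compensatory, while the second column (Ala, Ala, Asp) contains no codon with R ≤ 0.7 and is labelled forward.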