Gene families specific to, or with significantly changed membership in, particular lineages compared to outgroups may reflect important lineage-specific changes in biology. Here we describe a computational protocol to identify gene families that vary greatly in gene count across a species tree. This protocol can be used to analyse families from an in-house database of gene families (e.g. built using the Ensembl Compara pipeline, Vilella et al 2009).
To identify gene families that vary greatly in gene count across a species tree, we use three metrics to capture aspects of this variability. However, to control for differences in gene counts due to fragmented assemblies (e.g. genes split across multiple contigs), we use summed protein length per species in a family as a proxy for gene counts in these metrics. The first metric, coefficient of variation (Cv), simply measures the variability in summed protein length (per species) in a family. The other two metrics, Z-score (Zmax) and enrichment coefficient (Emax), reflect whether there is a gene family expansion in a particular group of species, relative to the rest of the species tree. For this, species are classified into a set of non-overlapping species groups of interest that ideally correspond to monophyletic clades of the species tree (e.g. nematodes, flatworms, and arthropods in a species tree of animals).
The three metrics are defined as follows:
(i) coefficient of variation,
See figure in Figures section.where the numerator (s) is the standard deviation in the summed protein length per species in the family, and the denominator is the mean of the summed protein length per species in the family.
(ii) maximum Z-score,
See figure in Figures section.where T is the set of non-overlapping species groups, c is a species group in the set T. For each species group c in the set T, we calculate Z as the difference between the mean summed protein length (per species) in that species group c and the overall mean (the numerator), divided by the standard deviation in summed protein length per species among the species outside group c (the denominator). Zmax is the maximum of these Z values over all tested species groups. Note that the standard deviation used here is for the species outside species group c, so that it is not affected by any gene family expansion that occurred in the family within the species group c.
(iii) maximum enrichment coefficient,
See figure in Figures section.That is, for each species group in the set T, we calculate E as the ratio between the mean of the summed protein length (per species) in that species group and the mean summed protein length outside that species group. Emax is the maximum of these E values over all tested species groups.
High values of the three metrics may be used to detect families with different patterns of variability. High values of Zmax and Emax may indicate a gene family expansion in a particular clade of the species tree, but high values of Cv may highlight cases that they miss, such as a gene family that has independently expanded in different clades of the species tree. In addition, a gene family that has expanded in a particular clade of the species tree can only have high values of Zmax and Emax if the gene family also includes genes from species outside this clade, whereas high values of Cv can be used to detect large variability within a single clade even if the family lacks members from outside that clade.
The protocol has two steps. In Step A, the set of all input gene families is filtered to remove likely transposable element families, assuming that these are not of interest to most users, since expansions of transposable element gene families are relatively unlikely to contribute to lineage-specific changes in biology.
In Step B, we define a set of non-overlapping species groups, which are ideally monophyletic clades of the species tree (e.g. nematodes, flatworms), in which we wish to identify lineage-specific expansions (e.g. nematode-specific expansions, flatworm-specific expansions). In Step B, we also define a subset of the species as having high-quality assemblies. Then, the three metrics Cv, Emax and Zmax are calculated for each family in an in-house database of gene families. Families with high values of each metric are good candidates for families with lineage-specific expansions, and can be manually examined in follow-up analyses.
Our protocol controls for differences in gene counts due to fragmented assemblies, firstly by using summed protein length per species in a family as a proxy for gene count, and also by calculating the three metrics based only on those species that have high-quality assemblies (as defined in Step B).
A related approach for finding lineage-specific gene family expansions is to identify Pfam (Finn et al 2014) domains that are enriched in a certain species group compared to the rest of the species tree, and then to identify the expanded gene families containing that domain (e.g. see Tsai et al 2013). In contrast, our current protocol does not require that the genes in an expanded family have any predicted protein domains, so can detect expansions in uncharacterised or relatively poorly characterised gene families.
A more similar approach to the current protocol was used to identify Spirometra lineage-specific expansions by Bennett et al 2014, by ranking gene families by the ratio of the total cumulative length of encoded Spirometra erinaceieuropaei genes in a family to the cumulative length of the corresponding Echinococcus multilocularis genes in the same family. This is a very similar idea to Emax, but Emax for a family is extended to be a ratio between the mean summed protein length in a species group, compared to the mean summed protein length outside that species group. This should be a more accurate metric for detecting expansions because it can include information from more species (e.g. several Spirometra-lineage species and several Echinococcus-lineage species).