The procedure of gene expression genome-wide association testing is performed once normalized (or standardized) gene expression and genotypic data are obtained. Detailed description of quality control measures to obtain gene expression and genotypic data is beyond the scope of this protocol and can be found in ref. 1.
1- Input files for the procedure
1-1- Gene expression data
Obtain raw gene expression data
Do a log2 transformation followed by appropriate data standardization or normalization using JMP Genomics that provides several methods for normalizing gene expression data sets3.
Retain the transcripts with expression above background levels. This procedure is described in ref. 2.
1-2- Genotypic data
Extract genotypic data after stringent quality control. In ref. 1, a total of 579,144 SNPs were generated using BeadStudio (Illumina).
Generate molecular properties (minor allele frequency, heterozygosity….etc) for each marker using JMP Genomics.
Numerically code the entire allelic data set using JMP Genomics as 0, 1, or 2 where each number represents the number of copies of the minor allele.
Retain markers with minor allele frequencies MAF > 5% for association testing. In ref. 1, a total of 516,792 have a MAF > 5%.
Save both gene expression and genotypic data sets in JMP Genomics wide format with the samples as rows and molecular entities (numerical genotypes and gene expression measurements) as columns. Each data set must contain a column of individual identifiers which is used for merging the data sets together. Save the data also in JMP Genomics tall format along with an accompanying experimental design data set for certain analyses as described in JMP Genomics user guide3 where details about the different input file formats are described. Create also a data set comprising all gene expression and SNP genotype data as described below.
2- Variables included in the association model
2-1- Matrix of relatedness measures
- Calculate relatedness Âij between each pair of individuals using the method described in ref. 4 for all individual pairs where Âij is averaged over l = 1 to n loci:
Âij = [? (x il – 2 p ). ((x jl –2 p ) / 2 pq )] / n
where x il = 0, 1 or 2 according to whether individual i has genotype aa, Aa or AA at locus l, p (q) is allele frequency of A (a), and 2p is the mean of x l. Use only autosomal SNPs that satisfy quality control criteria for Hardy-Weinberg equilibrium (P value > 0.0001) and minor allele frequency (MAF > 0.05) to generate the matrix of relatedness measures.
2-2- Genotypic principal component eigenvectors and clustered ethnicity
Perform principal component analyses on the entire genotypic dataset using the Eigenstrat method5 as implemented in JMP Genomics.
Retain scores from the significant genotypic PCs for inclusion in the association tests.
Inclusion of clustered ethnicity based on PC plots is optional. For example, the 194 samples studied in ref. 1 were clustered into four clusters to account for location in an unbiased manner relative to ethnicity. Deciding to include clustered ethnicity or not as a variable is best done on a case-by-case basis depending on the observed population structure patterns.
3- Association testing
- First use the JMP Genomics Cross Correlation process to test for Pearson correlation between all pairs of expression measurements and numerically coded SNP genotypes (>11 billion pairs in ref. 1) using the following basic correlation model, where Baseline is the mean measure of transcript abundance, and the error ? is assumed to be normally distributed with a mean of zero
Expression = Baseline + SNP + ? (Model 1)
To significantly reduce computational time and the need for disk space for data storage save only hits below 0.05. This cut-off yielded 690 million significantly correlated SNP/gene pairs using the dataset in ref.1.
If storage and computing power is not an issue all P values should be saved.
Bring the 10,000 most significant SNP-expression associations from this model (~P < 10-7) forward for two further analyses in a mixed model framework.
Apply more complex models to each of the 10,000 most significant SNP-expression pairs using the JMP Genomics Q-K Mixed Model process. For example the following model is used in ref. 1 to account for various effects including relatedness and population structure and various interaction effects:
Expression = Baseline + Location + Gender + Relatedness + Significant Genotypic Principal Components + Clustered Ethnicity + SNP + SNP** Location + Gender** Clustered Ethnicity + Gender*Location + ? (Model2)
The JMP Genomics Q-K Mixed Model process fits a model as described in ref 6. to test for association between the gene expression variants and SNP genotypes while adjusting for fixed effects and the random effect relatedness. JMP Genomics calls the MIXED procedure in SAS/STAT to perform these analyses.
First create a data set comprising all expression and SNP genotype data as input for the Q-K Mixed Model process. Create a second data set containing the columns of the relatedness matrix named Col1, Col2,…, Coln as well as a Row column indicating the row of the matrix .
In ref.1, because of the large number of SNP-expression pairs to be run, a JSL script was used to auto-generate the required JMP Genomics setting files. Each settings file contained the parameters needed for a single run of Q-K Mixed Model for one model for a single SNP-expression pair. All settings were run as a group in the JMP Genomics Workflow Builder and results were gathered into a single data set.
Model 2 adjusted for relatedness, by specifying the column of the relatedness data set containing the relatedness measurements as a Random Effect, and using the PROC MIXED options LDATA= and TYPE=LIN(1) for the Random Statement Options.