Part I. Computational Analysis for Subtype Identification
In this section, we will describe the methods used to preprocess the microarray data, merge different gene expression microarray datasets, NMF analysis for the identification of subtypes and SAM analysis to identify subtype-specific gene signature.
A. Preprocessing of Microarray Data
Preprocess the CEL files from different platforms of Affymetrix microarrays before further analysis. Use R (frame work for statistical computing) and Bioconductor2 (open source software for bioinformatics) for the preprocessing.
1. Preprocessing using Bioconductor package "affy". Make sure that the required CEL files are in the same directory where R has been started. The R code for this step is as follows:
Code for preprocessing CEL files by robust multiarray (RMA) statistics
library(affy)
data <- ReadAffy()
data_rma <- rma(data)
data_exp <- exprs(data_rma)
Note: The expression values are in log2 scale.
Code for mapping probesets to gene symbol
library(hgu133plus2.db)
x = unlist(as.list(hgu133plus2SYMBOL)
Note that here "hgu133plu2.db" has been used as an example for those microarrays done using Affymetrix GeneChip Human U133Plus 2.0 array. Install and include the relevant library depending on the type of Affymetrix array used. x will contain the gene symbol corresponding each probe.
data_exp_1 <- cbind(x, data_exp)
write.table(data_exp_1,"hgu133plus2_genesonly.txt", row.names=FALSE,sep="\t",quote=FALSE)
This will provide a table of RMA processed gene expression profile data with gene symbols. The samples are in columns and genes are in rows.
2. Correcting batch effect. We have microarrays from microdissected (UCSF) PDA performed in different batches (years - 2006, 2007 and 2008). Correct batch effects using R based software "ComBat":http://jlab.byu.edu/ComBat/Abstract.html 3. Prepare two files - gene expression index file and sample information file. Gene expression index file is a delimited text file with first column containing gene names or probe IDs and rest of the columns with samples and gene expression indices. Sample information file contains first column with Array name, second column with sample name and third column with batch numbers (eg., 1, 2, ...).
Download and install "ComBat":http://jlab.byu.edu/ComBat/Abstract.html as described in the website. Use "ComBat":http://jlab.byu.edu/ComBat/Abstract.html as follows.
Note: Make sure to execute "ComBat":http://jlab.byu.edu/ComBat/Abstract.html in its installed directory or include the directory in the Source comment below.
Source("ComBat.R")
ComBat("your expression file name", "your sample information")
The results with batch adjusted data will be written in the same directory.
3. Assessing Array Quality. We performed normalized unscaled standard error (NUSE)4 to assess the quality of the arrays. NUSE is available as a part of "affyPLM":http://www.bioconductor.org/packages/2.6/bioc/vignettes/affyPLM/inst/doc/QualityAssess.pdf (a model based quality control assessment of Affymetrix GeneChips) Bioconductor package. Assess microarray quality as follows.
library(affy)
library(affyPLM)
data<-ReadAffy()
Note: Make sure that the required CEL files are in the same directory where R has been started.
Pset<-fitPLM(data)
Note - fitPLM fits a specified robust linear model to the probe level data. It calculates the standard error (SE) estimates for each gene on each array.
NUSE(Pset, main = "NUSE")
NUSE accounts for differences in variability between genes across arrays. It standardizes the SE for each gene on each array calculated from fitPLM such that the median SE for that gene is 1 across all arrays. The above comment will produce a graph with y-axis representing the median SE for each array and helps to compare across arrays. Arrays with median SE or NUSE score of lesser than 1 + 0.25 and greater than 1 - 0.25 were further considered as better quality microarrays for our study.
In addition, obtain the NUSE statistics using the following R code.
NUSE(Pset, type = "stats")
Those arrays with specified NUSE criteria can be used for further analysis.
4. Merging of different microarray datasets. We used distance weighted discrimination (DWD)5 6 algorithm to merge different microarray datasets from different microarray platforms or laboratory. For this purpose, download Java based DWD and install it as described on the DWD description page. Before DWD-merging, screen each dataset for probesets/genes with standard deviation (SD) greater than 0.8. Also, select unique genes in the case where many probesets represent a single gene. This can be done by selecting only the gene-specific probe with the highest SD among all probes for that given gene. Additionally, remove those probes that do not map to any gene. The R code for this screening is available in the Attachments or Figures section of this article. You can download and execute accordingly. Column (samples) normalize (N(0,1)) and row (probe or gene) normalize (by median centering) each dataset. Use the R code below for normalizing and merging datasets using DWD.
library(impute)
dataUnique<-read.delim("file name", stringsAsFactors=FALSE)
Note - dataUnique should read the file generated from above SD filtering with unique genes as described above. The columns of the tab-delimited file should be samples (except first column being gene symbols) and rows should be genes.
imp<-impute.knn(dataUnique)
Column normalization
colMax <- apply(dataUnique,2,max, na.rm=TRUE);
_colMin <- apply\(dataUnique,2,min, na.rm=TRUE);_
_scaleFactor <- 1.0/\(colMax-colMin);_
_for \(i in 1:dim\(dataUnique) ==\[== 2]) \{_
_dataUnique\[,i] <- \(dataUnique\[,i]-colMin\[i])*scaleFactor\[i];_
_}_
Row median centering
rowMed <- apply(dataUnique,1,median, na.rm=TRUE);
dataUnique<-dataUnique-rowMed
_outFile1<- paste(strsplit(exprFile,".txt"),"_sd",sdCutoff,"col_Norm_row_Med",".txt",sep="");
_dataUnique1<-cbind\(data\[,1],dataUnique)_
_colnames\(dataUnique1) ==\[== 1]<-"Genes"_
write.table(file="outFile", dataUnique1, quote=FALSE, row.names=FALSE, sep="\t");
_}_
Note: Ensure that the final output file is numeric (e.g., avoid 1e-2, instead 0.01).
Merge the two individual SD screened and preprocessed datasets using installed Java based "DWD":https://genome.unc.edu/dwd/. Remove the second row with dataset identifier added by DWD in the output file. Perform row median centering of the merged dataset before proceeding to the next step.
5. NMF analysis. The subtypes of PDA were identified in merged and row median centered microarray datasets using the non-negative matrix factorization (NMF) algorithm1. R code for the NMF algorithm was downloaded from GenePattern website at the Broad Institute. Perform the NMF analysis as follows:
source("/directory/nmfconsensus.R")
Note: make sure to include the directory where NMF R code is located.
nmfconsensus<-(input.ds="merged file name in tab-delimited format",k.init=2,k.final=5,num.clusterings=20,maxniter=500, error.function="euclidean")
The output graphics and other required files will be saved in the directory where the code is executed. The "consensus.all.k.plot.pdf" file provides a sample x sample consensus matrix for different k's and cophentic coefficient plot. Files with names ending with ".gct" provide the classes for each sample.
SAM Analysis. "Significant analysis of microarrays":http://www-stat.stanford.edu/~tibs/SAM/ 7 was performed using a Bioconductor package, "siggenes":http://www.bioconductor.org/packages/2.0/bioc/html/siggenes.html. The analysis was performed to identify gene signatures specific to each PDA subtypes identified by NMF analysis. Perform the analysis as follows:
library(siggenes)
dataMerged<-read.delim("DWD merged file name", stringsAsFactors=FALSE)
Numeric matrix was generated from above file.
dataMerged_m<-matrix(as.numeric(unlist(dataMerged ==[== ,2:dim(dataMerged) ==[== 2]])),nrow=dim(dataMerged) ==[== 1], ncol=(dim(dataMerged) ==[== 2])-1)
Note - The file used for this purpose is DWD merged files in tab-delimited format.
colnames(data_m)<-colnames(data)[2:(dim(dataMerged]
rownames(data_m)<-data[,1]
Reading the classes for each sample from NMF output file
con<-read.delim("consensus.k.3.gct", stringsAsFactors=FALSE)
m<-match(con[,1],colnames(dataMerged))
dataMerged_m1<-dataMerged_m[,m]
sam.out<-sam(dataMerged_m1,con[,2],rand=123,gene.names=rownames(dataMerged_m1))
genes<-summary(sam.out,12.2)@row.sig.genes
Note - number "12.2" is an example of delta value from SAM analysis and it varies from analysis to analysis and can be chosen based on false discovery rate (FDR) and false positive scores. We chose FDR < 0.001 and false positive being almost zero.
Final dataset with SAM selected genes
dataMerged_genes<-cbind(rownames(dataMerged_m1),dataMerged_m1[genes, ])
write.table(dataMerged_genes,"SAM genes selected dataset.txt", sep="\t", quote=FALSE,row.names=FALSE)
Finally, generate a Heatmap with above data file using "Cluster":http://rana.lbl.gov/EisenSoftware.htm software8 and viewed using GenePattern's "Hierarchical Clustering Viewer":http://www.broadinstitute.org/cgi-bin/cancer/software/genepattern/modules/gp_modules.cgi 9
Map the signatures from SAM onto other datasets as follows:
data_others<-read.delim("other datasets name.txt", stringsAsFactors=FALSE
map<-match(names(genes),data_others[,1])
w<-which(!is.na(map))
data_others_SAM<-data_others[map[w],]
write.table(data_others,"other datasets mapped with PDAssigner genes.txt",sep="\t",quote=FALSE,row.names=FALSE)
Perform the Hierarchical clustering using the procedure described above.
Part II. Statistics for Clinical Outcome.
Use overall survival data from patients for univariate and multivariate analysis.
Use the log-rank test for univariate associations with survival. Perform survival (Kaplan-Meier) curve and log-rank test using web-based GenePattern software SurvivalCurve and SurvivalDifference, respectively. Provide a tab-delimited file with sample id, class for each sample (subtype information in our case), censor (0 - censored and 1 - not censored) and overall survival (in days) as an input for the program.
Use the Cox proportional hazards model for multivariate modeling of survival using the R package survival.
library(survival)
sur<-read.delim("survival.txt", stringsAsFactors=FALSE)
Note: survival.txt should have one column each for sample name, class or subtype information, censored data (0 - censor, 1 - not censored), time (in days, weeks or years), grade, stage in the same order.
coxph(formula = Surv(as.numeric(sur[,4]), as.numeric(sur[,3]))~as.factor(sur[,2]), data = sur)
where sur[,4] should be time, sur[,3] should be censored data (0 or 1), sur[,2] should be class or subtype (if not in factors, as.factor will convert to factors). Similarly, grade or stage can be analyzed.
Apply Fisher’s exact test (available in R package, stats) to investigate the relationships among subtype, stage and grade.