*Note that all code within this protocol is italicized and has been written so it can be directly pasted into RStudio
Load necessary packages:
Import the files:
1. Set the RStudio working directory to the location of the gene counts and metadata files.
2. Import the gene counts file. Note: In the case of large datasets like the one used here, we recommend importing the gzipped (.gz) file. The software can unzip the file as it is imported, saving memory. Avoid viewing after opening the file and instead use the following to import :
tcga_RSEM_Hugo_norm_count <- read_delim("tcga_RSEM_Hugo_norm_count.gz",
"\t", escape_double = FALSE, trim_ws = TRUE)
3. Import the metadata file:
metadata <- read_csv("metadata.csv")
Prepare counts dataframe:
1. Make a dataframe of the counts:
df <- as.data.frame(tcga_RSEM_Hugo_norm_count)
2. Set the rownames equal to the first column, then delete the first column:
row.names(df) <- df$sample
3. Back-transform the data. Gene counts are in units of Log2(counts+1) and we need just 'counts' as input for the downstream normalization:
df2 <- 2^df
df3 <- df2-1
df <- df3
4. Check to ensure that the number of variables in the counts dataframe is equal to the number of observations in the metadata. It is critical these numbers are the same (here that number is 10,535 for each).
Perform TMM normalization and Voom transformation:
dge <- DGEList(counts=df, group=metadata$Cohort)
keep <- rowSums(cpm(dge)>1) >= 3
dge <- dge[keep, keep.lib.sizes=FALSE]
GD <- factor(metadata$Cohort)
design <- model.matrix(~0+GD)
dge <- calcNormFactors(dge)
v <- voom(dge,design,plot=TRUE)
fit <- lmFit(v, design)
*Note: these commands will generally have long (~5 minute) run times for a dataset of this magnitude. The steps will run more quickly if working with smaller data (a single cohort for example).
Generate results as Log2(Normalized CPM+1):
1. Generate the normalized counts per million (CPM):
TCGA_PanCan_CPM <- cpm(dge)
2. Add pseudocount of +1:
cpm_plus1 <- TCGA_PanCan_CPM+1
3. Transform to Log2 scale:
Log2CPM_plus1_TCGA_PanCan <- log2(cpm_plus1)
Log2CPM_plus1_TCGA_PanCan <- as.data.frame(Log2CPM_plus1_TCGA_PanCan)
Export the results:
*Note: This will export a file approximately 5GB in size. To simply obtain the expression of a single gene of interest this step can be skipped and the protocol can be resumed at the next section.
Obtain normalized expression of a single gene of interest across all cohorts:
*Note: Ensure the row names are the same identifier type as gene of interest's input. (Gene symbol, Ensembl identifier, etc.) Here we use the gene CYP3A5 as an example.
Log2CPM1_CYP3A5 <- Log2CPM_plus1_TCGA_PanCan["CYP3A5",]
Transpose and reformat to obtain a final dataframe having Sample ID, Expression, and Cohort:
Log2CPM1_CYP3A5 <- t(Log2CPM1_CYP3A5)
ID <- row.names(Log2CPM1_CYP3A5)
Log2CPM1_CYP3A5 <- cbind(Log2CPM1_CYP3A5,ID)
colnames(Log2CPM1_CYP3A5)[colnames(Log2CPM1_CYP3A5)=="CYP3A5"] <- "Log2CPM1_CYP3A5"
merged <- as.data.frame(merge(Log2CPM1_CYP3A5, metadata, by.x="ID", by.y="ID", all.x=TRUE))
Change the expression values to numeric instead of a factor (to work correctly with plotting results):
values_numeric <- as.numeric(levels(merged$Log2CPM1_CYP3A5))[merged$Log2CPM1_CYP3A5]
merged <- as.data.frame(cbind(merged, values_numeric))
merged_CYP3A5 <- merged
Export the expression results of GeneX across all cohorts:
Plot the results ranked by median expression:
ggplot(merged_CYP3A5, aes(x = fct_reorder(Cohort, Log2CPM1_CYP3A5, .fun = median, .desc =FALSE,na.rm=TRUE), y = Log2CPM1_CYP3A5)) + geom_boxplot(color="steelblue4", fill="steelblue4",outlier.size=0.8, alpha=0.1, lwd=0.6, fatten=1) + theme(axis.text.x=element_text(angle=0, hjust=1, face="bold", size=10), axis.text.y=element_text(face="bold", size=10)) + geom_point(color="steelblue4", size=0.8, alpha=0.1)+ theme_bw()+ coord_flip() + xlab("TCGA Cohort")
*Note: Plots can be easily exported as high-quality SVG images from the RStudio interface.