1. Perform MicroRNAs differential expression analysis
To perform differential microRNA expression analysis, we used a combination of Kolmogorov-Smirnov (KS) test and fold-change (FC). For the KS test we used the function ks.boot implemented in the R package ‘Matching’8. The thresholds for this step are chosen by a permutation-based estimate of the false discovery rate (FDR), i.e. the estimated percentage of microRNAs identified by chance. For each pair of chosen KS P-value and FC thresholds, the FDR was computed reshuffling 1000 times the samples constituting the microRNA dataset. The mean value of microRNAs significantly differentially expressed in these 1000 experiments was computed and then compared with the number of microRNAs differentially expressed in our step of the pipeline.
2. Perform Target transcripts enrichment analysis
In the second MMRA step, for each microRNA differentially expressed in a given CRC subtype, we performed a target enrichment analysis in the gene signature corresponding to the subtype in which the microRNA was differentially expressed. MicroRNA’s target transcripts were predicted following the procedure discussed in Riba and colleagues9. To evaluate an enrichment of predicted targets in the signature of the miRNA associated subtype, we calculated a Bonferroni-adjusted Hypergeometric test P-value and the observed/expected (O/E) ratio. To choose optimal P-value and O/E ratio thresholds, we implemented a FDR computation as follows.
3. Perform network analysis
Network analysis was performed using the ARACNe information-theoretic algorithm for inferring transcriptional interactions5. The software was downloaded (
http://wiki.c2b2.columbia.edu/califanolab/index.php/Software/ARACNE) and included in the pipeline to infer interactions between each microRNA selected by the previous steps and any mRNA from the paired dataset. For each microRNA selected at the previous steps, data preparation for ARACNEinvolved the setting up of an expression matrix (X) row-wise combining the entire mRNA expressionTCGA dataset with the expression values of the single microRNA under analysis. To generate a matrix compatible with the standard ARACNE pre-processing steps, we inverted log2 transformation of the expression dataset: naming Xij the elements of the expression matrix previously described, we obtained the called “linear expression matrix” Y through the following operation Yij=2Xij. Then, standardARACNE pre-processing involves quantile normalization of the dataset Y, log2 transformation and filtering of those genes with a standard deviation lower than 1.2. For MMRA the only edges of interest are those connecting the microRNA to mRNAs, therefore the algorithm is run imposing the microRNA as the only hub of the network. The chosen MI P-value significance threshold (10-7) and bootstrapping P-value threshold (10-12 after 100 bootstrapped networks) are the originally recommended ones10. Subsequently, each of the consensus networks constructed around the selected microRNAs (the “regulons”), is tested for significant enrichment in subtype signature genes respect to a random null model. To this end the Master Regulator Analysis (MRA) algorithm is used as previously described11,12, evaluating the statistical significance (P-values computed by Fisher’s exact test, FET) of the overlap between the “regulon” of each microRNA, and the gene signature of the subtype in which the microRNA was identified as differentially expressed at the previous steps. To assess the sensitivity and specificity of our approach, we built a null model selecting the microRNAs that were expressed (detected in more than 45 of the 450 samples) but not differential in any subtype of any classifier (signal to noise ratio – i.e. fold-change over standard deviation – < 0.05). The regulons of the microRNAs constituting the null model were also required to have an intersection with any regulon of the previously selected candidate microRNAs lower than 70%. Then, we chose the MRA Pvalue threshold of the MRApvalues obtained in the null model.4. Perform stepwise linear regression analysis
In this step, to filter out weak microRNA-mRNA relations within the regulons, MMRA employs stepwise linear regression (SLR), a procedure previously adopted for transcription factor / target analysis11,12. The SLR procedure involved the construction of a linear model for each signature gene, as follows: the log2-expression level of the gene was considered the response variable, and the log2-expression levels of microRNAs linked by ARACNE to the gene were considered as the explanatory variables. Then, a stepwise algorithm is used to select the best minimal set of explanatory variables within the model. Akaike information criterion (AIC) was used as the stop criterion. The output of SLR was reorganized at the microRNA level, to include, for each microRNA, a list of response variables (subtype signature genes associated by ARACNE) to which it was associated by SLR. The extent of modulation of a given subtype by a given microRNA can then be estimated as the fraction of signature genes for that subtype whose expression is approximated by the microRNA according to SLR analysis (positive or negative coefficient). To estimate a significance threshold for this step we considered the distribution of the results for all the selected microRNAs in all the colorectal cancer subtypes. These results are expected to include a small subset of true associations, also selected across the previous steps, and a larger set of random associations. We therefore selected the 90th percentile of the fraction values. To generate the final output of the MMRA pipeline, significant fractions of associated subtype genes are provided only for the microRNA-subtype associations also selected in the previous steps.