Gene expression profiling could help physicians better understand the cellular morphology, resistance to chemotherapy and overall clinical outcome of a disease [1,2]. Individualized treatment guided by such profiling may significantly increase survival by optimizing the treatment procedure according to clinical pathogenesis. Ein-Dor et al. [3] pointed out that the gene lists selected for patients of the same clinical type but from different groups differed widely and had only a few genes in common. This lack of overlap between predictive signatures from different studies with the same goal may be explained by the presence of more predictive genes than are required to design an accurate predictor [4]. However, the microarray technique itself has been shown to be highly reproducible within and across two high-volume laboratories [5].
Numerous statistical procedures, including the t-test [6,7], analysis of variance [8], Pearson correlation [9], the Wilcoxon signed-rank test [10,11] and the Mann-Whitney U test [12,13], have been used to compare microarray data. However, the validity of these conventional statistical methods for two-group comparison of gene signatures has never been evaluated using carefully selected data sets. A novel algorithm with software support is presented herein for a more realistic and comprehensive interpretation of gene signatures.
Computational method and theory of CalcHEPI
The HEPI score is computed as HEPI = Σ [(Ni(0→t) × Sj(0→1))/Nt] × 100, where Ni is the number of genes with score Sj. The subscript ‘i’ ranges from 0 to the total number of genes in the signature, and ‘j’ ranges from 0 (minimum score) to 1 (maximum score). Nt is the total number of genes in the signature. First, all the expression ratios are categorized according to a logical scale to obtain the respective Ni and Sj values. The percent contribution of each set of genes (genes with the same expression score) is then computed, and these contributions are summed to give the HEPI score. The fold-change strategy used in HEPI scores is robust, accurate and reproducible. Although the concept of fold-change has been described for microarray experiments, it has never been used for the collective interpretation of gene signatures. Technically, the ratio of the color intensities of each spot (probe) measures the relative expression of the corresponding gene under two different experimental conditions. In general, a gene is said to be differentially expressed if the ratio of its expression levels between the control and treated groups crosses certain thresholds. The most widely accepted expression ratios for up- and down-regulated genes are >1.5 and <0.5, respectively [11,17,18]. This protocol adopts the same cut-off values and proposes additional sub-grading within each category. HEPI scores are simple to interpret, easy to compare and well suited to visual cross-checking.
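The computation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CalcHEPI implementation, which is an Excel macro. The function names `score_ratio` and `hepi_score` are hypothetical, and so are the sub-grade boundaries and the 0.2-step scores in `UP_GRADES` and `DOWN_GRADES`: the text gives only the outer cut-offs of >1.5 and <0.5 and does not list the finer grades.

```python
from collections import Counter

# Hypothetical sub-grades as (threshold, score) pairs. Only the outermost
# cut-offs (>1.5 for up, <0.5 for down) come from the text; the finer
# boundaries and the 0.2-step scores are assumptions for illustration.
UP_GRADES = [(5.0, 1.0), (4.0, 0.8), (3.0, 0.6), (2.0, 0.4), (1.5, 0.2)]
DOWN_GRADES = [(0.1, 1.0), (0.2, 0.8), (0.3, 0.6), (0.4, 0.4), (0.5, 0.2)]


def score_ratio(ratio):
    """Map one expression ratio to a score S_j between 0 and 1."""
    for threshold, score in UP_GRADES:      # strongest up-regulation first
        if ratio > threshold:
            return score
    for threshold, score in DOWN_GRADES:    # strongest down-regulation first
        if ratio < threshold:
            return score
    return 0.0                              # 0.5 <= ratio <= 1.5: norm-regulated


def hepi_score(ratios):
    """HEPI = sum over score classes of (N_i * S_j / N_t) * 100."""
    if not ratios:
        raise ValueError("at least one expression ratio is required")
    n_total = len(ratios)                               # N_t
    counts = Counter(score_ratio(r) for r in ratios)    # maps S_j -> N_i
    return sum(n_i * s_j / n_total * 100 for s_j, n_i in counts.items())
```

Under these assumed grades, a four-gene signature with ratios 1.0, 1.0, 2.5 and 0.05 falls into score classes 0, 0, 0.4 and 1.0, so `hepi_score([1.0, 1.0, 2.5, 0.05])` comes to approximately 35. Because each class contributes as a percentage of Nt, the result does not depend on the size of the signature.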
Software design
The CalcHEPI software was developed on the Microsoft Excel platform because of Excel’s flexibility, universal availability and macro-based automation. Moreover, the spreadsheet layout of Excel is well suited to storing and analyzing microarray data as well as to developing microarray analysis software. Data selection is controlled by an input box that allows users to select the paired expression values from anywhere in the worksheet (Fig. 1). The software then uses Excel’s worksheet formula functions together with a macro subroutine to compute HEPI scores (Fig. 2). The percent contributions of norm-regulated (green), down-regulated (blue) and up-regulated (red) genes are also shown as a color-coded bar. The output gives a comprehensive view of the results in terms of both qualitative (up- or down-regulation) and quantitative (gradation in fold-change) analysis of the gene signature, together with a quick-review indicator bar. The clarity and integrity of the report format are helpful for cross-evaluation. HEPI scores are valid for array signatures of any size because they are calculated from the percentage (not the number) of differentially expressed genes on a 10-point scale (5 points for up-regulation and 5 for down-regulation).
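The percent contributions behind the color-coded bar can be illustrated with a short Python sketch. It is not the Excel macro itself. The function names `regulation_percentages` and `indicator_bar`, the text rendering with the letters N, D and U in place of the green, blue and red segments, and the default bar width are all assumptions. Only the cut-offs of >1.5 and <0.5 are taken from the text.

```python
def regulation_percentages(ratios, up_cutoff=1.5, down_cutoff=0.5):
    """Return the percentage of norm-, down- and up-regulated genes."""
    if not ratios:
        raise ValueError("at least one expression ratio is required")
    n_total = len(ratios)
    n_up = sum(1 for r in ratios if r > up_cutoff)
    n_down = sum(1 for r in ratios if r < down_cutoff)
    n_norm = n_total - n_up - n_down        # everything between the cut-offs
    return {
        "norm": 100 * n_norm / n_total,
        "down": 100 * n_down / n_total,
        "up": 100 * n_up / n_total,
    }


def indicator_bar(percentages, width=20):
    """Render the percentages as a text bar (N = norm, D = down, U = up)."""
    n_down = round(width * percentages["down"] / 100)
    n_up = round(width * percentages["up"] / 100)
    n_norm = max(0, width - n_down - n_up)  # remainder goes to norm-regulated
    return "N" * n_norm + "D" * n_down + "U" * n_up
```

For a signature in which half the genes are norm-regulated and a quarter each are down- and up-regulated, the bar at the default width consists of ten N, five D and five U characters. Working in percentages rather than counts keeps bars comparable between signatures of different sizes.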
Software validation
The functional accuracy and reliability of the software were validated using simulated and real gene signature data for two-group comparisons. Six pairs of expression data sets were specifically designed to represent various degrees of similarity or difference (details not shown). Among them, the two groups in pair 4 are not significantly different, whereas the groups in pair 6 differ maximally. All six pairs were subjected to nonparametric comparisons with the Mann-Whitney U test, Kolmogorov-Smirnov test, Kruskal-Wallis test, Wilcoxon signed-rank test, sign test, Friedman test and Kendall W test using SPSS (version 10). The real expression data of published signatures, including ovarian carcinoma [14], ulcerative colitis [15], leukemia [16] and adenocarcinoma [6], were also analyzed with the above nonparametric tests as well as with CalcHEPI. The characteristics of these real signatures have been summarized in our earlier report [10].
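For readers unfamiliar with the rank-based tests listed above, the statistic behind one of them, the Mann-Whitney U test, can be sketched in plain Python. The analyses reported here were run in SPSS; the code below only illustrates how the U statistic is obtained from pooled ranks and does not compute a p-value. The function names `average_ranks` and `mann_whitney_u` are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1        # mean of 1-based ranks pos+1..end+1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent samples."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both groups must contain at least one value")
    ranks = average_ranks(list(group_a) + list(group_b))   # pooled ranking
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a                   # the two statistics sum to n_a * n_b
    return u_a, u_b
```

Two completely separated groups, such as `[1, 2, 3]` and `[4, 5, 6]`, give U values of 0 and 9, the most extreme result possible for three values per group. Because the statistic depends only on ranks, it is insensitive to the magnitude of the fold-change.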