1. Import spot volumes and clinical/demographic data
Spot volumes, exported from the ImageMaster software or other image analysis softwares in .csv (comma separated values) format, are read as a data matrix. Each row represents one of the L gels, whereas columns from 3 to N+2 are spot volumes for each of the N detected spots. Column 1 contains the gel labels, whereas column 2 contains the group factor (e.g., control vs. affected). A second matrix is read in .csv format, containing again one row per gel. Columns represent clinical and demographic data (age, gender, years from onset, disease staging, (any) drug daily dose, treated or not with a given drug, etc). Columns are read as independent vectors for further analysis.
2. Normalize 2-DE spot volumes
In each gel, a set of spots common to all gels (if L is small) or to 75% of gels (if L is large) is identified for normalization purposes. The sum of these spot volumes is considered as a loading/staining reference, thus providing L normalization factors. Each line in the spot volume matrix is then divided by the appropriate normalization factor to obtain a new matrix of normalized spot volumes.
3. Replace missing values
After normalization, missing spot values are replaced by the mean value of the spot volume of the group (i.e., control or affected) or, if the mean was lower than the 98th percentile, by the minimum value observed in the group 12. This ensures to safely prevent the artefactual assignment of missing spots (qualitative differences) as false negatives. To this purpose, the data matrix is split into separate submatrices for distinct groups, i.e., controls (co), early-onset patients (eo) and late-onset patients (lo).
4. Check the quality of data sets
Like many other biological observables, 2-DE spot volumes show a log-normal distribution, with less-intense spots more frequent than more-intense ones. To simply verify this, spot volumes are transformed logarithmically and the standard deviation is calculated within the group. "Figure 1":http://www.nature.com/protocolexchange/system/uploads/2372/original/Figure1.jpg?1355321820 shows almost constant values of standard deviations as a function of log(volume), as expected for a log-normal distribution. This means that, in principle, a parametric test is suitable for comparison between groups after logarithmic transformation of spot volumes 16.
Then, the biological variability of the subjects in each group is evaluated. Pairs of 2-DE gels of specimens from different subjects in the same group are compared by Pearson linear correlation of corresponding spot volumes after logarithmic transformation. If the number of specimens (i.e., L) is large, it is preferable to take a single subject per group as a reference, so that all gels in the group are compared to this one. For each comparison, the Pearson correlation coefficient is shown below the scatter graph. Dispersions of the studentized residuals against ranked magnitudes (QQ) plots are shown on the left of each correlation ( "Figure 2":http://www.nature.com/protocolexchange/system/uploads/2373/original/Figure2.jpg?1355322300 ). Spots deviating from linearity indicate non-normal distribution of residuals. A linear behavior indicates linear proportionality between gels. Dashed lines indicate 95% CI. On this basis, a direct proportionality between gels from different subjects may be assumed 16. Otherwise, the gel shall be discarded.
5. Identify spots associated to confounding factors
Confounding factors may be either grouping classes (gender, assuming or not a given treatment, familiarity) or real numbers (age, daily drug dose, age at onset, disease duration, disease staging). In the first case, a nonparametric Wilcoxon test is performed on spot volumes grouped according to the classes, so to determine which spots are significantly different in a class with respect to the other one ( "Figure 3":http://www.nature.com/protocolexchange/system/uploads/2374/original/Figure3.jpg?1355322720 ).
Otherwise, a linear correlation analysis is performed to determine which spots are significantly correlated with the magnitude of the confounding factor (e.g., daily drug dose or age). Important to notice, the analysis of correlation with age should be performed on control subjects only ( "Figure 4":http://www.nature.com/protocolexchange/system/uploads/2374/original/Figure3.jpg?1355322720 ). Similarly, in the drug dose correlation only patients should be included, treated or not with that drug.
Eventually, spots that significantly correlate with one or more confounding factor(s) are removed from the dataset.
6. Calculate p values for all comparisons
This is the most critical step of the procedure. If data are log-normally distributed, a parametric test may be applied after logarithmic transformation of the dataset. A Welch test is preferred, since distinct spots may have different variance. Otherwise, a non-parametric test such as the Wilcoxon test may be a safer choice 16. Working with three groups, five contrast are suggested, in principle. Three are binary contrast (one group vs. another one), a fourth compares patients (both eo and lo) vs. control subjects and the last contrast compares a peculiar group (e.g., eo) vs. all other subjects. In this way, ten p values are computed for each spot (five Welch test and five Wilcoxon test). If the purpose is to identify a biomarker able to discriminate patients from controls, then the threshold for significance should be 0.05/N, where N is the number of spots, and a non-parametric variant of the classic ANOVA test might be preferable 16. However, our procedure might be employed to identify all those spots that are different in the contrast, in order to build a classification function as a linear combination of all the selected spots or of a subset of them.
7. Build a linear discriminant function to classify the individuals and verify the predictive model
Predictive models for the classification of patients with respect to control subjects, or of patients with a particular disease subtype with respect to idiopathic disease patients (e.g., early-onset vs. late-onset) are built by linear discriminant analysis (LDA) 16. Volumes of spots selected as described above are linearly combined in a discriminating function, thus providing a likelihood score for the disease. The predictive model is verified by leave-one-out cross-validation. Each subject is iteratively removed from the dataset and classified on the basis of the spot volumes observed in the remaining individuals. This procedure permits to obtain reliable estimates of sensitivity and specificity. A disease likelihood score is attributed to each subject and a ROC curve 16 is calculated by plotting sensitivity and specificity as a function of the cutoff value.
The contribution of each spot may be of different relevance, depending on the separation of mean volume values and their LDA coefficient. The product of these two parameters indicates the intrinsic contribution of every spot to the discriminating model, so to permit the selective removal of worst-contributing spots. Eventually, a report table is produced to summarize the descriptive statistics results, the fold of change in log2 units, the LDA coefficient and the weighted contribution of each spot to the model (i.e., the product of the difference of mean values times the LDA coefficient) ( "Table 1":http://www.nature.com/protocolexchange/system/uploads/2376/original/Table1.jpg?1355330640 ). On this basis, a subset of most relevant spots may be selected instead of the list including all identified spots.
8. Make a power analysis
Spots to be included in the classification function have been selected on the basis of a non-parametric test. However, linear discriminant analysis takes into account the separation of two normal distributions. It may be useful, at this point, to check which one would be the appropriate number of subjects to be enrolled in a verification study, assuming a normal distribution of each spot volume across the individuals.
Note: the complete source is available "here":http://www.nature.com/protocolexchange/system/uploads/2394/original/protocol.R?1356083285 as a text file.