For each step of the four stages of xMarkerFinder, detailed instructions of execution commands and input/output files, as well as default parameters are provided here, as well as in our GitHub repository (https://github.com/tjcadd2020/xMarkerFinder). Further characterization and elaboration of output files are in the Anticipated results part.
Stage 1: Differential signature identification
For Stage 1, input files include merged metadata and microbial count matrices. First, data cleansing is conducted prior to all analyses aiming to convert observed counts to compositional relative abundances, and to exclude rare signatures in the discovery set. Differential analyses are then performed, in each cohort, respectively, and across cohorts. The resultant differential significance file is served for visualization and the identified differential signatures are used later in Stage 2 for further feature selection and model construction.
1 Data normalization. Convert microbial counts to relative abundance profiles of all datasets involved, including discovery and test set, as well as datasets of other diseases.
$ Rscript 1_Normalization.R -W /workplace/ -p abundance.txt -o TEST
Users should specify these parameters or enter the default values, subsequent repetitions of which are not listed.
-W the Workplace of this whole protocol
-p the input microbial count profile
-o prefix of output files
Input files:
abundance.txt: merged microbial count profile of all datasets.
Output files:
relative_abundance.txt: normalized relative abundance profile of input dataset. Relative abundance profiles are used as input files for all subsequent analyses, except for Step 11, which requires raw count matrices.
2 Data filtering. Filter microbial signatures with low occurrence rates across cohorts.
$ Rscript 2_Filtering.R -W /workplace/ -m train_metadata.txt -p relative_abundance.txt -b Cohort -t 2 -o TEST
-m the input metadata file
-p the input microbial relative abundance file (output file of Step 1)
-b the column name of batch(cohort) in metadata (in example file: Cohort)
-t the minimum number of cohorts where features have to occur (default: 2)
-O prefix of output files
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
relative_abundance.txt: normalized relative abundance profile of the training dataset.
Output files:
filtered_abundance.txt: filtered relative abundance profile of the training dataset, used as the input file for following steps.
3 Confounder analysis. PERMANOVA test based on Bray-Curtis dissimilarity is performed to assess the correlation between metadata and microbial profiles and returns the coefficient of determination (R2) value and p value of each metadata index. Whichever index that contributes the most is considered as the major confounder used later in Step 4. PCoA plot with Bray-Curtis dissimilarity is provided.
$ Rscript 3_Confounder_analysis.R -W /workplace/ -m train_metadata.txt -p filtered_abundance.txt -g Group -o TEST
-m input metadata file
-p input filtered microbial relative abundance file
-g the column name of experimental interest(group) in metadata (in example file: Group)
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
filtered_abundance.txt: filtered relative abundance profile after preprocessing.
Output files:
metadata_microbiota.txt: the confounding effects caused by clinical information, used to determine the major batch and covariates.
pcoa_plot.pdf: the PCoA plot with Bray-Curtis dissimilarity between groups.
4 Differential analysis. Based on the major confounder and covariates found in Step 3, cross-cohort differential signature analysis is conducted. Volcano plot of identified differential signatures is also provided.
$ Rscript 4_Differential_analysis.R -W /workplace/ -m train_metadata.txt -p filtered_abundance.txt -g Group -b Cohort -c covariates.txt -t 0.05 -o TEST
-g the column name of experimental interest(group) in metadata (in example file: Group)
-b the column name of major confounder in metadata (in example file: Cohort)
-c input covariates file (tab-delimited format containing all covariates)
-t the threshold of p value for plotting (default: 0.05)
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
filtered_abundance.txt: filtered relative abundance profile after preprocessing.
covariates.txt: covariates identified in Step 3 (newly generated tab-delimited file where each row is a covariate, example file is provided).
Output files:
differential_significance_single_cohort.txt: the differential significance result in individual cohorts.
differential_significance.txt: meta-analytic testing results aggregating differential testing results in individual cohorts, used for visualization.
differential_signature.txt: significantly differential signatures between groups derived from input filtered microbial profiles, used as input files for feature selection.
differential_plot.pdf: the volcano plot of input differential significance file.
Stage 2: Model construction
5 Classifier selection. This step provides optional classifier selection for subsequent steps where the performances of each ML algorithm are generally assessed using all differential signatures. The output file contains the cross-validation AUC, specificity, sensitivity, accuracy, precision and F1 score of all classification models built with these various algorithms. Users should specify a selected classifier in all following steps.
$ python 5_Classifier_selection.py -W /workplace/ -m train_metadata.txt -p differential_signature.txt -g Group -e exposure -s 0 -o TEST
-p input differential signature file (output file of Step 4)
-g the column name of experimental interest(group) in metadata (in example file: Group)
-e the experiment group(exposure) of interest (in example data: CRC)
-s random seed (default:0)
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
differential_signature.txt: significantly differential signatures between groups.
Output files:
classifier_selection.txt: the overall cross-validation performance of all classifiers using differential signatures, used to determine the most suitable classifier.
6 Feature selection. xMarkerFinder introduces Triple-E, a curated three-step feature selection procedure.
6a) Feature effectiveness evaluation.
The first step of Triple-E evaluates the predictive capability of each feature via constructing individual classification models, respectively. Users should specify an algorithm here and in every future step as the overall classifier for the whole protocol from the following options: LRl1, LRl2, SVC, KNN, DT, RF, and GB. Features with cross-validation AUC above the threshold (default: 0.5) are defined as effective features and are returned in the output file.
$ python 6a_Feature_effectiveness_evaluation.py -W /workplace/ -m train_metadata.txt -p differential_signature.txt -g Group -e exposure -b Cohort -c classifier -s 0 -t 0.5 -o TEST
-p input differential signature file (output file of Step 4)
-b the column name of batch(cohort) in metadata (default: Cohort)
-c selected classifier
-t AUC threshold for defining if a feature is capable of prediction (default:0.5)
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
differential_signature.txt: significantly differential signatures between groups.
Output files:
feature_auc.txt: cross-validation AUC values of individual features.
effective_feature.txt: features derived from differential signatures that are capable of predicting disease states, used as input file of the following step.
6b) Collinear feature exclusion.
The second step of Triple-E aims to exclude collinear issue caused by highly correlated features based on the result of Step 6a and returns the uncorrelated-effective features.
$ python 6b_Collinear_feature_exclusion.py -W /workplace/ -p effective_feature.txt -t 0.7 -o TEST
-p input effective feature file (output file of Step 6a)
-t correlation threshold for collinear feature exclusion (default:0.7)
Input files:
effective_feature.txt: features with classification capability.
Output files:
feature_correlation.txt: spearman correlation coefficients of each feature pair.
uncorrelated_effective_feature.txt: features derived from input effective features excluding highly collinear ones, used as input file of the following step.
6c) Recursive feature elimination.
The last step of Triple-E recursively eliminates the weakest feature per loop to sort out the minimal panel of candidate biomarkers.
$ python 6c_Recursive_feature_elimination.py -W /workplace/ -m train_metadata.txt -p uncorrelated_effective_feature.txt -g Group -e exposure -c classifier -s 0 -o TEST
-p input uncorrelated-effective feature file (output file of Step 6b)
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
uncorrelated_effective_feature.txt: independent features derived from effective features.
Output files:
candidate_biomarker.txt: identified optimal panel of candidate biomarkers, used as model input for all subsequent steps.
6* Boruta feature selection.
Besides Triple-E feature selection procedure, we provide an alternative method, feature selection with the Boruta algorithm.
$ Rscript alt_6_Boruta_feature_selection.R -W /workplace/ -m train_metadata.txt -p differential_signature.txt -g Group -s 0 -o TEST
-p input differential signature profile (output file of Step 4) or uncorrelated-effective feature file (output file of Step 6b)
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
differential_signature.txt: differential signatures used for feature selection (could also be uncorrelated-effective features from Step 6b).
Output files:
boruta_feature_imp.txt: confirmed feature importances via Boruta algorithm.
boruta_selected_feature.txt: selected feature profile, which could also be used as input candidate biomarkers for all subsequent steps.
7 Hyperparameter tuning.
Based on the selected classifier and candidate biomarkers, the hyperparameters of the classification model are adjusted via bayesian optimization method based on cross-validation AUC. The output files contain the tuned hyperparameters and the multiple performance metric values of the constructed best-performing model, as well as its cross-validation AUC plot.
$ python 7_Hyperparameter_tuning.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -g Group -e exposure -c classifier -s 0 -o TEST
-p input candidate biomarker profile (output file of Step 6)
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt (or boruta_selected_feature.txt for all subsequent steps): the optimal panel of candidate biomarkers.
Output files:
best_param.txt: the best hyperparameter combination of the classification model.
optimal_cross_validation.txt: the overall cross-validation performance of the best-performing model.
cross_validation_auc.pdf: the visualization of the cross-validation AUC of the best-performing model.
Stage 3: Model validation
8 Internal validations.
As stated above, this step provides extensive internal validations to assess the robustness and reproducibility of identified candidate biomarkers in different cohorts via intra-cohort, cohort-to-cohort, and LOCO validations. Output files contain multiple performance metrics used to assess these candidate biomarkers internally (Table 1). Corresponding grid plots are provided.
$ python 8_Validation.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -g Group -e exposure -b Cohort -c classifier -s 0 -o TEST
-p input optimal candidate biomarker file (output file of Step 6)
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate biomarkers.
Output files:
validation_metric.txt: the performance of candidate biomarkers in internal validations via each model performance metric (Table 1).
validation_metric.pdf: the visualization of input validation_metric.txt.
9 External validations.
9a) Independent test.
As the best-performing candidate biomarkers and classification model are established, the test dataset is used to externally validate their robustness and generalizability. The input external metadata and microbial relative profiles need to be in the same format as initial input files for the training dataset. This step returns the overall performance of the model and its AUC plot.
$ python 9a_Test.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -a test_metadata.txt -x test_relative_abundance.txt -g Group -e exposure -c classifier -r best_param.txt -s 0 -o TEST
-a input external metadata file for the test dataset
-x input external microbial relative abundance file as the test dataset
-r input optimal hyperparameter file (output file of Step 7)
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate biomarkers.
test_metadata.txt: the clinical metadata of the external test dataset.
test_relative_abundance.txt: the relative abundance matrix of the external test dataset.
best_param.txt: the hyperparameter combination of the best-performing classification model.
Output files:
test_result.txt: the overall performance of model in external test dataset.
test_auc.pdf: the visualization of the AUC value in test_result.txt.
9b) Specificity assessment.
To further assess biomarkers’ disease specificity, they are used to construct classification models to discriminate between other microbiome-related diseases and corresponding controls. Cross-validation AUC values of these classification models and corresponding visualization are returned.
$ python 9b_Specificity.py -W /workplace/ -p candidate_biomarker.txt -a other_metadata.txt -x other_relative_abundance.txt -g Group -e CTR -b Cohort -c classifier -r best_param.txt -s 0 -o TEST
-a input metadata file of samples from other diseases
-x input microbial relative abundance file of samples from other diseases
-e the control group name (in example file: CTR)
-b the column name of cohort (in example file: Cohort)
Input files:
candidate_biomarker.txt: the optimal panel of candidate biomarkers.
other_metadata.txt: the clinical metadata of samples for other diseases.
other_relative_abundance.txt: the microbial relative abundance matrix of other diseases.
Output files:
specificity_result.txt: AUC values of models constructed with candidate biomarkers in other microbiome-related diseases.
specificity_auc.pdf: the visualization of the specificity_result.txt.
9b*) Alternative specificity assessment.
Random samples of case and control class of other diseases are added into the classification model, respectively, both labelled as “control”, the variations of corresponding AUCs of which are calculated and used for visualization.
$ python alt_9b_Specificity.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -q test_metadata.txt -l test_relative_abundance.txt -a other_metadata.txt -x other_relative_abundance.txt -g Group -e CTR -b Cohort -c classifier -r best_param.txt -n 5 -s 0 -o TEST
-q input external metadata file for the test dataset
-l input external microbial relative abundance file as the test dataset
-a input metadata file of samples from other diseases
-x input microbial relative abundance file of samples from other diseases
-e the control group name (in example file: CTR)
-b the column name of cohort (in example file: Cohort)
-n the number of samples to add into the model each time
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of candidate biomarkers.
test_metadata.txt: the clinical metadata of the external test dataset.
test_relative_abundance.txt: the relative abundance matrix of the external test dataset.
other_metadata.txt: the clinical metadata of samples for other diseases.
other_relative_abundance.txt: the relative abundance matrix of other diseases.
Output files:
specificity_add_result.txt: AUC values of the best-performing model applied in the external test dataset with additional case or control samples from other diseases.
specificity_add_auc.pdf: the visualization of the specificity_add_result.txt.
Stage 4: Biomarker Interpretation
10 Biomarker importance. Permutation feature importance is employed here to evaluate biomarkers’ contributions in the best-performing classification model.
$ python 10_Biomarker_importance.py -W /workplace/ -m train_metadata.txt -p candidate_biomarker.txt -g Group -e exposure -c classifier -r best_param.txt -s 0 -o TEST
-p input candidate biomarkers (output file of Step 6)
-r input optimal hyperparameter file (output file of Step 7)
Input files:
train_metadata.txt: the clinical metadata of the training dataset.
candidate_biomarker.txt: the optimal panel of selected candidate biomarkers.
best_param.txt: the hyperparameter combination of the best-performing classification model.
Output files:
biomarker_importance.txt: permutation feature importance of biomarkers via ten permutations.
biomarker_importance.pdf: the visualization of biomarker importance file.
11 Microbial co-occurrence network.
11a) Convert.
As the input file for microbial co-occurrence network construction needs to be microbial count profile in .tsv format where each row describes a microbial signature and each column represents a sample (could be converted profiles of all features, differential signatures, or candidate biomarkers according to users’ need, and null values needed to be set as 0) and header needs to start with “#OTU ID”, an additional file conversion script is provided for easier implementation.
$ python 11a_Convert.py -W /workplace/ -p abundance.txt -s selected_feature.txt -o TEST
-p input feature raw count file before normalization.
-s selected features for constructing microbial co-occurrence network (could be differential signatures or candidate biomarkers, output file of Step 4 or 6).
Input files:
abundance.txt: microbial raw count profile before normalization.
selected_feature.txt: selected features for constructing microbial co-occurrence network (output file of Step 4 or 6)
Output files:
convert.tsv: the converted file appropriate for constructing microbial co-occurrence network.
11b) Microbial co-occurrence network.
Microbial co-occurrence network is constructed using FastSpar with 50 iterations and the output files contain the correlation coefficients and p values between each signature pair.
$ ./11b_Microbial_network.sh -W /workplace/ –i convert.tsv -o TEST -t 4
-i input feature count file
-t threads of available computational source
Input files:
convert.tsv: microbial count profile in .tsv format where each row describes a microbial signature and each column represents a sample and the header needs to start with “#OTU ID”. Example input file is provided and users are recommended to use Step 11a to convert files into appropriate formats.
-t the threads of available computational source when running.
Output files:
median_correlation.tsv: the correlation coefficients between each input signature pair.
pvalues.tsv: the statistical significance of median_correlation.tsv.
11c) Microbial co-occurrence network plot.
The visualization of microbial co-occurrence network is performed using Gephi.
(i) Preprocess of the results of Step 11b to ensure that Step 11c only draws significant correlations (p < 0.05) with absolute correlation coefficients above 0.5 (as default).
$ python 11c_Microbial_network_plot.py -W /workplace/ –c median_correlation.tsv -p pvalues.tsv -t 0.5 -o TEST
-c input correlation profile (output file of Step 11b)
-p input p value profile (output file of Step 11b)
-t input correlation coefficient threshold (default: 0.5)
Input files:
median_correlation.tsv: the correlation coefficients profile (output file of Step 11b).
pvalues.tsv: the statistical significance of median_correlation.tsv (output file of Step 11b).
Output files:
microbial_network.csv: adjusted network profile for Gephi, only significant correlations reserved.
(ii) Open Gephi and click the “File” button, choose the “Import spreadsheet” option, and then choose the adjusted network profile.
(iii) Import the prepared file as undirected network.
(iv) Choose a preferable layout type to form the basic network and press the “stop” button when the network becomes stable (Fruchterman Reingold style is recommended).
(v) For further optimization of the network, appearances of nodes and edges could be adjusted according to users’ need, as well as the labels of nodes. Fig. 9 provides an optimized network example14.
12 Multi-omics correlation. If users have multi-omics or multidimensional microbial profiles of the same dataset, the correlation between different omics or dimensions are calculated via HAllA.
$ ./12_Multi_omics_correlation.sh -W /workplace/ -i microbial_abundance_1.txt -d microbial_abundance_2.txt -o TEST
-i input microbial abundance file 1
-d input microbial abundance file 2
Input files:
microbial_abundance_1.txt: microbial abundance profile 1.
microbial_abundance_2.txt: microbial abundance profile 2. These two input files should have the same samples (columns) but different features (rows).
Output files:
results/all_associations.txt: associations between different omics or dimensions.
results/hallagram.png: the visualization of all_associations.txt with only significant associations highlighted.