1) This R script is easiest to use in RStudio. Please be sure to have these versions (or newer) installed: R v4.3.0, RStudio 2023.06.0+421, XCode v14.3.1, and then load the R script.
2) Set the working directory to be the location where the input data is saved. To use the provided example data extract the zip and set the working directory to the location in which it was extracted.
3) Assuming the user has set the working directory to the location with the example input files, the script as written should execute without change and generate output for each section. Some packages may need to be installed by the user. If a "there is no package called ‘[package name]" error occurs when loading a library, run install.packages("[package name]").
4) Find the code associated with the analysis you want to perform. Follow per-package instructions as described in detail below, with additional stepwise instructions provided as comments in the R script itself.
Packages include:
Rarefication for 16S data
This section shows an R solution for rarifying 16S data (or any other count data) to a uniform number of read counts. This section takes 1 input file containing read counts, OTUs/ASVs down the rows and samples across the columns (e.g.: Input_example-Rareification.txt). An output file named "Rarefied_output_example.txt" will be produced in the working directory.
Diversity calculation
This method generates Shannon and Bray-Curtis diversity tables for a table of read counts. This section takes 1 input file containing read counts, OTUs/ASVs down the rows and samples across the columns (e.g.: Input_example-Rareification.txt). Output tables named "OUTPUT - shannon_diversity.txt" and "OUTPUT_braycurtis_diversity.txt" will be produced in the working directory.
Dirichlet-multinomial mixture model (DMM) clustering
This method uses the Dirichlet-multinomial mixture model to analyze data and cluster/identify samples with distinct communities of microbiome taxa. This code requires one input file containing read counts, OTUs/ASVs/Genomes down the rows, and samples across the columns (see: Input_example-readcounts_min3samples.txt) and produces several outputs. The script is initially set to test cluster sizes 1-6, but the user can modify this to suit their work. The output file "readcounts_DMM_modelfit_min3samples.tsv" will provide AIC/BIC/Laplace scores per tested cluster size (where lower scores indicate better clustering). This will also produce files listing the samples assigned to each cluster, for each number of clusters, in files named "readcounts_DMM_k#_membership_min3samples.tsv" where '#' indicates the cluster size. It will also produce files showing the contribution per taxonomic group for each component in files named "readcounts_min3samples_DMM_communityDriver_k_#.tsv" where '#' again indicates the cluster size tested.
Non-metric multidimensional scaling (NMDS) with metaNMDS
This section runs NMDS clustering, based on a cluster count guided by the results of DMM. This will provide X and Y coordinates for an NMDS plot, as well as a distance matrix (that will be used in the Silhouette index and prediction strength calculations). This requires an input table of read counts per sample per taxa (e.g. Input_example-readcounts_min3samples.txt). This metaNDMS method also allows the user to specific the number of desired clusters (default setting is '2'), which may be guided by DMM results. The output distance matrix is named 'NMDS_dist_table_readcounts_min3samples.tsv'. The scatterplot data is also provided in the output table 'NMDS_coordinates_readcounts_min3samples.tsv'.
Silhouette coefficient and prediction strength
This section assigns silhouette width and prediction strength to previously-generated cluster assignments, based on a distance matrix (for example, based on output from DMM and NMDS, above). The input files are a cluster membership table (e.g.: Input_example-cluster_membership.tsv) and a distance matrix (e.g.: Input_example-NMDS_dist_table_readcounts_min3samples.tsv). The example script will produce an output file containing silhouette coefficient metrics per cluster and overall stats, named 'readcounts_min3samples_DMMk2-5_distNMDS_Silhouette.txt'. Then, the prediction strength section takes as input the dataframe holding the dissimilarity matrix, and also allows the user to set the range of cluster counts to test in the call to the prediction.strength function (e.g. the default setting is testing size 2-5 clusters, M=100 sets the number of times to resample). This produces an output file of metrics including the prediction strength per cluster. This file by default is called 'readcounts_min3samples_k2-5_distNMDS_PredictionStrength.txt'.
Random Forest
This section runs the random forest algorithm on an input table of relative abundances per taxa per sample, with samples down the rows, features across the columns, relative abundance data in the matrix, and the second column indicating one of two classes per sample (e.g.: Input_example-RF.txt; count-based data can be used but is not necessary). Outputs include an importance table (listing the features most strongly guiding the classification), a confusion table (with the number of correct and incorrect classifications, and overall accuracy error), as well as an ROC curve plot that includes a Mann-Whitney U statistic for the significance of the prediction accuracy. Random forest is useful for determining how well the overall microbiome profile can predict classification into one of two groups, and the mean decrease in accuracy (MDA) values can be used to rank the taxa based on the classification.
ANCOMBC2
This section runs ANCOMBC2 to identify differentially abundant taxa in microbiome data. This method requires 2 input files: a read counts table (e.g.: Input_example-ANCOMBC2-readcounts.txt) and a metadata table with a grouping per sample (e.g.: Input_example-ANCOMBC2-comparisons.txt). Two output files will be generated in the working directory (ANCOMBC2_output_prime.tsv, ANCOMBC2_output_sens.tsv). The "output_prime" file contains the p values and the p values adjusted for multiple testing (q values). The "Comparison" values should be assigned to features, and not the "Intercept" values.