A. Requirements for pipeline Input
To run the full CloVR-ITS analysis pipeline, at least two different inputs have to be provided by the user: one or multiple sequence file\(s) and a samplemetadata mapping file. Sequence data may consist of a single fasta file that contains pooled and barcoded sequences from multiple samples, or multiplepre-processed fasta files with each sample being represented by a single fasta file. No two fasta headers within any submitted file may be identical. Themapping file provides sample-associated metadata information used for beta diversity analysis.A.1 Mapping file requirements for single sequence file input
Example for tab-delimited table to be used as mapping file in combination with a single sequence input file \(pooled, barcoded, unprocessed 454 sequences),in QIIME format:
#SampleID BarcodeSequence LinkerPrimerSequence Treatment_p Description
Sample_1 AGCACGAGCCTA CATGCTGCCTCCCGTAGGAGT Control mouse_ID_300
Sample_2 AGCACGAGCCTA CATGCTGCCTCCCGTAGGAGT Diabetic mouse_ID_354
Sample_3 AACTCGTCGATG CATGCTGCCTCCCGTAGGAGT Control mouse_ID_355
Sample_4 ACAGACCACTCA CATGCTGCCTCCCGTAGGAGT Diabetic mouse_ID_356
The following rules apply:
1. All entries are tab-delimited.
2. All entries in every column are defined \(no empty fields).
3. The header line begins with the following fields:
#SampleID<tab>BarcodeSequence<tab>LinkerPrimerSequence
4. The header line must end with the field Description, i.e. the total number of columns is four or more.
5. The BarcodeSequence and LinkerPrimerSequence fields have valid IUPAC DNA characters.
6. There are no duplicate header fields and no duplicate entries in the #SampleID column.
7. No header fields or corresponding entries contain invalid characters \(only alphanumeric and underscore characters allowed).
8. There are no duplicates when the primer and barcodes are appended.
A.2 Mapping file requirements for multiple sequence file input
Example for tab-delimited table to be used as mapping file in combination with multiple sequence input files \(one FASTA file per sample):
#File SampleName ph_p Treatment Temperature_p Description
A.fasta sampleA high control mild patientA
B.fasta sampleB high sick medium patientB
C.fasta sampleC low treated high patientC
The following rules apply:
1. All entries are tab-delimited.
2. All entries in every column are defined \(no empty fields).
3. The header line begins with: #File<tab>SampleName.
4. The #File column contains the names of all input fasta files and does not contain duplicate entries.
5. There are no duplicate header fields.
6. No header fields or corresponding entries contain invalid characters \(only alphanumeric and underscores characters allowed).
A.3 Pairwise comparisons with Metastats
To utilize the Metastats statistical methodology, which detects differential abundances of taxa between two sample groups, the associated header field mustend with "_p", \(e.g. "Treatment_p", or "ph_p"). If a header with the "_p" ending exists, pairwise Metastats calculations will be carried out between allgroups specified in the corresponding column \(provided that a group contains at least three samples).
A.4 Providing quality scores with sequence data
To include quality scores as input, for each input fasta file <prefix>.fasta there must exist a separate quality score file <prefix>.qual. Forexample, if the input fasta files are A.fasta, B.fasta and C.fasta, then there must also exist A.qual, B.qual, and C.qual for quality filtering to beperformed
\[1]. The quality score files are tagged similarly to the input fasta files before starting a pipeline.B. Sequence preprocessing
Input data are initially assessed for quality and chimeric sequences. Problematic sequences are removed before subsequent processing.B.1 File consistency check
All input fasta files are first checked for consistency with the input mapping file. If a fasta file listed within the mapping file does not exist, or ifan input fasta file is not listed in the mapping file, the pipeline will halt with an error. Likewise consistency is checked for any input quality scorefiles.
B.2 Splitting by samples and quality filtering
To check each read from the sequence pool for quality and to sort sequences based on the sample-specific barcodes, the QIIME script split_libraries.py isused with the following parameters:
--min-seq-length 100\(sets the minimum sequence length to 100 bp)
--max-seq-length 2000\(sets the minimum sequence length to 2000 bp)
--barcode-type variable_length\(disables barcode corrections and allows for unique barcodes with varying lengths)
--max-homopolymer 8\(sets the maximum homopolymer length to 8 bp)
--min-qual-score 25\(sets the minimum average quality score to 25, applies only when quality scores are provided to the pipeline)
--max-ambig 0\(sets the maximum number of ambiguous bases allowed to 0)The output of this component \(seqs.fna) is a single set of filtered reads identified by sample and meeting the above quality criteria.
B.3 Selection of high identity clusters
To assist in de novo chimera detection and downstream taxonomic analysis, sequences are clustered into high identity OTUs using a 99% identitythreshold and the QIIME command pick_otus.py. We allow for reverse complement searching by UCLUST here. The longest sequences in each stringent cluster areselected as OTU representatives using pick_rep_set.py. The relative abundance of each OTU is denoted with each representative sequence for UCHIME.
B.4 Chimera identification and removal
To detect putative chimeric sequences in the filtered data, representative sequences are input to UCHIME \(using de novo mode with defaultparameters). Representatives assigned as chimeras propagate the assignment across their clusters, and a single list of all putative chimeras is output. Allchimeric sequences are then removed from consideration before the next step in the pipeline.
C. Sequence processing
C.1 Sequence clustering
The QIIME script pick_otus.py is used to cluster all non-chimeric reads from all samples into genus-level operational taxonomic units \(OTUs) based on anucleotide sequence identity threshold. The clustering program for this step is UCLUST \[
2] and the nucleotide sequence identity threshold for all reads within an OTU is 85%. UCLUST is set to examine both the forward and reverse complementsequences during clustering.C.2 Alpha-diversity analysis.
Genus-level OTUs created by the QIIME commands above are reorganized and input to Mothur which uses the scripts read.otu, rarefaction.single, andsummary.single to generate rarefaction curves and estimators of species diversity for each sample. Finally a custom program called Leech is used to plotall rarefaction curves together defined by varying color schemes related to the input mapping file.C3. Taxonomic annotation of high identity clusters
All non-chimeric representative sequences from the 99% clusters generated in step
B.3 are searched against a custom database of ITSreference sequences from known species \(
clovritsdb v1.0) using BLASTN with the following options: "-e 1.0e-5" \(e-value threshold), "-b 10" \(numberof hits to show) and "-m 8" \(tabular output). Each sequence is assigned to the taxonomic lineage of the best BLAST alignment covering at least 90% of thequery sequence length and matching with a minimum identity of 90%, 85%, 75%, 70%, and 60% identity for species, genus, family, order, and class-levelassignments, respectively. Representatives without alignments of sufficient coverage or identity at a specific taxonomic-level are denoted as"Unclassified." Hits are propagated across the corresponding clusters.
D. Additional analysis using Metastats and the R statistical package
The output from the taxonomic classification of each sequence from all samples by the BLAST-based classification step is further analyzed and graphicallyrepresented using the Metastats program \[
6] and customized scripts in the R programming language.D.1 Detection of differentially abundant features
Metastats uses count data from annotated sequences to compare two populations in order to detect differentially abundant features \[
6]. BLASTN results are processed to detect different taxonomic groups at multiple levels \(class, order, family, genus, species). Metastats produces atab-delimited table displaying the mean relative abundance of a feature, variance and standard error together with a p value and q value to describesignificance of the detected variations \(see project website: http://metastats.cbcb.umd.edu/). Note Metastats can run analyses of 1 sample vs. 1 sample, orN samples vs. M samples, where N and M are greater than 1. It cannot perform a comparison of 1 sample vs. 2 samples.D.2 Stacked histogram generation
Custom R scripts are used to normalize taxonomic group counts to relative abundances. Stacked histograms of the relative abundances are generated in the.pdf format, if there are at most 50 samples and at most 25 taxon groups. Beyond these limits a visualized histogram is not generated.
D.3 Unsupervised sample clustering
A custom R script called skiff is used to normalize taxon counts and to calculate distance matrices for samples and taxonomic groups, using a Euclideandistance metric. Complete-linkage \(furthest neighbor) clustering is employed to create dendrograms of samples and taxa in the .pdf format. The R packagesRColorBrewer and gplots are included in this task.
D.4 Pie chart visualization
Custom R scripts are used to form pie charts displaying proportions of sequences assigned to specific functional and taxonomic levels for up to 12 samples.Outputs are in .pdf format. For more than 12 samples this function is not performed, as the visual comparison for the user would be cumbersome.
\[1]Note: fasta and quality files can be retrieved from an sff file using the Roche/454 proprietary program sffinfo.