CloVR-ITS: Automated ITS amplicon sequence analysis pipeline for fungal microbiota analysis– standard operating procedure, version 1.0

doi:10.1038/protex.2013.017

Method Article

CloVR-ITS: Automated ITS amplicon sequence analysis pipeline for fungal microbiota analysis– standard operating procedure, version 1.0

https://doi.org/10.1038/protex.2013.017

This work is licensed under a CC BY-NC 3.0 License

This protocol has been posted on Protocol Exchange, an open repository of community-contributed protocols sponsored by Nature Portfolio. These protocols are posted directly on the Protocol Exchange by authors and are made freely available to the scientific community for use and comment.

Version 1

posted

You are reading this latest protocol version

Next-generation sequencing has been successfully used to characterize microbial communities based on the amplification and sequencing of phylogenetic marker genes, e.g. the 16S rRNA gene. In comparison to 16S rRNA amplicon sequencing-based procedures for bacterial and archaeal microbiota analysis, few comparable protocols have been made available to study fungal organisms. Here we describe the CloVR-ITS protocol for fungal microbiota analysis using internal transcript spacer (ITS) amplicon sequencing. CloVR-ITS includes well known bioinformatic tools for alpha and beta diversity analyses, suitable to process even large sequence datasets:

A) QIIME [1] for sequence processing and beta phylogenetic analysis using different methods including UCLUST [2];

B) UCHIME [3] for rapid identification of chimeric sequences;

C) Mothur [4] for alpha diversity and ecological parameter calculations;

D) BLASTN [5] for taxonomic sequence assignments using custom databases;

E) Metastats [6] and custom R scripts for statistical and graphical evaluations.

CloVR-ITS accepts as input either a single raw multiplex 454-pyrosequencer output file (i.e. pooled barcoded sequences from multiple samples), or alternatively, pre-processed sequences from multiple samples in separate files. CloVR-ITS is available as part of the CloVR package (http://clovr.org).

Computational biology and bioinformatics

Biotechnology

Microbiology

internal transcribed spacer

ITS

fungal microbiota

automated sequence analysis pipeline

cloud computing

Software

Step	Program	Version	URL	Ref.
Preprocessing, OTU clustering	QIIME	1.1.0	http://qiime.sourceforge.net/	\[1]
Chimera detection	UCHIME	4.0.87	http://www.drive5.com/ uchime/	\[3]
Alpha-diversity analysis	Mothur	1.12.0	http://www.mothur.org/	\[4]
Taxonomic classification of sequences	BLASTN	2.2.21	http://blast.ncbi.nlm.nih.gov	\[5]
Differential taxonomic prevalence calculation	Metastats	1.0	http://metastats.cbcb.umd.edu/	\[6]
Statistical and graphical evaluation	R	2.10.1-2	http://www.r-project.org/

Reference data

Database	Data	URL	Ref.
clovr-itsdb, version 1.0	ITS2-derived custom database, curated full ribosomal ITS sequences \(not only ITS2) with taxonomic information down to the species level	http://its2.bioapps.biozentrum.uni-wuerzburg.de/	\[7]

Pipeline input

Data	Suffix	Description
Single sequence input file	.fasta	Single sequence file containing pooled, barcoded, unprocessed 454 sequences
Multiple sequence input files	.fasta	Pre-processed, trimmed and binned sequences with 1 file per sample
Sequence quality file\(s) \(optional)	.qual	Sequence quality file\(s) corresponding to single or multiple sequence input files

Metadata mapping file	.txt	Tab-delimited table containing sample-associated metadata information

A. Requirements for pipeline Input

To run the full CloVR-ITS analysis pipeline, at least two different inputs have to be provided by the user: one or multiple sequence file\(s) and a samplemetadata mapping file. Sequence data may consist of a single fasta file that contains pooled and barcoded sequences from multiple samples, or multiplepre-processed fasta files with each sample being represented by a single fasta file. No two fasta headers within any submitted file may be identical. Themapping file provides sample-associated metadata information used for beta diversity analysis.

A.1 Mapping file requirements for single sequence file input

Example for tab-delimited table to be used as mapping file in combination with a single sequence input file \(pooled, barcoded, unprocessed 454 sequences),in QIIME format:

#SampleID BarcodeSequence LinkerPrimerSequence Treatment_p Description

Sample_1 AGCACGAGCCTA CATGCTGCCTCCCGTAGGAGT Control mouse_ID_300

Sample_2 AGCACGAGCCTA CATGCTGCCTCCCGTAGGAGT Diabetic mouse_ID_354

Sample_3 AACTCGTCGATG CATGCTGCCTCCCGTAGGAGT Control mouse_ID_355

Sample_4 ACAGACCACTCA CATGCTGCCTCCCGTAGGAGT Diabetic mouse_ID_356

The following rules apply:

1. All entries are tab-delimited.

2. All entries in every column are defined \(no empty fields).

3. The header line begins with the following fields:

#SampleID<tab>BarcodeSequence<tab>LinkerPrimerSequence

4. The header line must end with the field Description, i.e. the total number of columns is four or more.

5. The BarcodeSequence and LinkerPrimerSequence fields have valid IUPAC DNA characters.

6. There are no duplicate header fields and no duplicate entries in the #SampleID column.

7. No header fields or corresponding entries contain invalid characters \(only alphanumeric and underscore characters allowed).

8. There are no duplicates when the primer and barcodes are appended.

A.2 Mapping file requirements for multiple sequence file input

Example for tab-delimited table to be used as mapping file in combination with multiple sequence input files \(one FASTA file per sample):

#File SampleName ph_p Treatment Temperature_p Description

A.fasta sampleA high control mild patientA

B.fasta sampleB high sick medium patientB

C.fasta sampleC low treated high patientC

The following rules apply:

1. All entries are tab-delimited.

2. All entries in every column are defined \(no empty fields).

3. The header line begins with: #File<tab>SampleName.

4. The #File column contains the names of all input fasta files and does not contain duplicate entries.

5. There are no duplicate header fields.

6. No header fields or corresponding entries contain invalid characters \(only alphanumeric and underscores characters allowed).

A.3 Pairwise comparisons with Metastats

To utilize the Metastats statistical methodology, which detects differential abundances of taxa between two sample groups, the associated header field mustend with "_p", \(e.g. "Treatment_p", or "ph_p"). If a header with the "_p" ending exists, pairwise Metastats calculations will be carried out between allgroups specified in the corresponding column \(provided that a group contains at least three samples).

A.4 Providing quality scores with sequence data

To include quality scores as input, for each input fasta file <prefix>.fasta there must exist a separate quality score file <prefix>.qual. Forexample, if the input fasta files are A.fasta, B.fasta and C.fasta, then there must also exist A.qual, B.qual, and C.qual for quality filtering to beperformed\[1]. The quality score files are tagged similarly to the input fasta files before starting a pipeline.

B. Sequence preprocessing

Input data are initially assessed for quality and chimeric sequences. Problematic sequences are removed before subsequent processing.

B.1 File consistency check

All input fasta files are first checked for consistency with the input mapping file. If a fasta file listed within the mapping file does not exist, or ifan input fasta file is not listed in the mapping file, the pipeline will halt with an error. Likewise consistency is checked for any input quality scorefiles.

B.2 Splitting by samples and quality filtering

To check each read from the sequence pool for quality and to sort sequences based on the sample-specific barcodes, the QIIME script split_libraries.py isused with the following parameters:--min-seq-length 100\(sets the minimum sequence length to 100 bp)--max-seq-length 2000\(sets the minimum sequence length to 2000 bp)--barcode-type variable_length\(disables barcode corrections and allows for unique barcodes with varying lengths)--max-homopolymer 8\(sets the maximum homopolymer length to 8 bp)--min-qual-score 25\(sets the minimum average quality score to 25, applies only when quality scores are provided to the pipeline)--max-ambig 0\(sets the maximum number of ambiguous bases allowed to 0)The output of this component \(seqs.fna) is a single set of filtered reads identified by sample and meeting the above quality criteria.

B.3 Selection of high identity clusters

To assist in de novo chimera detection and downstream taxonomic analysis, sequences are clustered into high identity OTUs using a 99% identitythreshold and the QIIME command pick_otus.py. We allow for reverse complement searching by UCLUST here. The longest sequences in each stringent cluster areselected as OTU representatives using pick_rep_set.py. The relative abundance of each OTU is denoted with each representative sequence for UCHIME.

B.4 Chimera identification and removal

To detect putative chimeric sequences in the filtered data, representative sequences are input to UCHIME \(using de novo mode with defaultparameters). Representatives assigned as chimeras propagate the assignment across their clusters, and a single list of all putative chimeras is output. Allchimeric sequences are then removed from consideration before the next step in the pipeline.

C. Sequence processing

C.1 Sequence clustering

The QIIME script pick_otus.py is used to cluster all non-chimeric reads from all samples into genus-level operational taxonomic units \(OTUs) based on anucleotide sequence identity threshold. The clustering program for this step is UCLUST \[2] and the nucleotide sequence identity threshold for all reads within an OTU is 85%. UCLUST is set to examine both the forward and reverse complementsequences during clustering.

C.2 Alpha-diversity analysis.

Genus-level OTUs created by the QIIME commands above are reorganized and input to Mothur which uses the scripts read.otu, rarefaction.single, andsummary.single to generate rarefaction curves and estimators of species diversity for each sample. Finally a custom program called Leech is used to plotall rarefaction curves together defined by varying color schemes related to the input mapping file.

C3. Taxonomic annotation of high identity clusters

All non-chimeric representative sequences from the 99% clusters generated in step B.3 are searched against a custom database of ITSreference sequences from known species \(clovritsdb v1.0) using BLASTN with the following options: "-e 1.0e-5" \(e-value threshold), "-b 10" \(numberof hits to show) and "-m 8" \(tabular output). Each sequence is assigned to the taxonomic lineage of the best BLAST alignment covering at least 90% of thequery sequence length and matching with a minimum identity of 90%, 85%, 75%, 70%, and 60% identity for species, genus, family, order, and class-levelassignments, respectively. Representatives without alignments of sufficient coverage or identity at a specific taxonomic-level are denoted as"Unclassified." Hits are propagated across the corresponding clusters.

D. Additional analysis using Metastats and the R statistical package

The output from the taxonomic classification of each sequence from all samples by the BLAST-based classification step is further analyzed and graphicallyrepresented using the Metastats program \[6] and customized scripts in the R programming language.

D.1 Detection of differentially abundant features

Metastats uses count data from annotated sequences to compare two populations in order to detect differentially abundant features \[6]. BLASTN results are processed to detect different taxonomic groups at multiple levels \(class, order, family, genus, species). Metastats produces atab-delimited table displaying the mean relative abundance of a feature, variance and standard error together with a p value and q value to describesignificance of the detected variations \(see project website: http://metastats.cbcb.umd.edu/). Note Metastats can run analyses of 1 sample vs. 1 sample, orN samples vs. M samples, where N and M are greater than 1. It cannot perform a comparison of 1 sample vs. 2 samples.

D.2 Stacked histogram generation

Custom R scripts are used to normalize taxonomic group counts to relative abundances. Stacked histograms of the relative abundances are generated in the.pdf format, if there are at most 50 samples and at most 25 taxon groups. Beyond these limits a visualized histogram is not generated.

D.3 Unsupervised sample clustering

A custom R script called skiff is used to normalize taxon counts and to calculate distance matrices for samples and taxonomic groups, using a Euclideandistance metric. Complete-linkage \(furthest neighbor) clustering is employed to create dendrograms of samples and taxa in the .pdf format. The R packagesRColorBrewer and gplots are included in this task.

D.4 Pie chart visualization

Custom R scripts are used to form pie charts displaying proportions of sequences assigned to specific functional and taxonomic levels for up to 12 samples.Outputs are in .pdf format. For more than 12 samples this function is not performed, as the visual comparison for the user would be cumbersome.

\[1]Note: fasta and quality files can be retrieved from an sff file using the Roche/454 proprietary program sffinfo.

Pipeline output

Data	Suffix	Description
Filtered sequences	.fna	Sequences passing QIIME-based poor- quality filter \(filename: seqs.fna)
Detected chimeras	.txt	Sequence names from seqs.fna \(and OTU representatives) identified as putative chimeras \(filename: allchimeraids.txt)
OTU assignments	.txt	List of genus-level OTUs \(QIIME)
Alpha diversity	.rarefaction	Rarefaction numerical curves separated by sample \(Mothur)
	.pdf	Rarefaction plots separated by metadata type \(Leech/CloVR)
	.summary	Richness and diversity estimators \(Mothur)
BLAST hits	.raw	BLASTN to reference datasets results table \("-m 8" format)
Taxonomic assignments	.tsv	Table \(tab-delimited) displaying taxonomic assignment counts for each sample
Metastats output	.csv	Differentially abundant taxonomic groups \(as pre-defined in Metadata input)
Skiff clustering	.pdf	Heatmap and two-way clustering based on taxonomic assignment abundances
Pie charts	.pdf	Pie charts describing assignment abundances for up to 12 samples \(not performed if >12 samples are given)
Stacked histograms	.pdf	Stacked histograms displaying relative abundances for up to 50 samples and 25 features \(not performed if beyond these thresholds)

Caporaso J.G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 7:335-336 (2010).
Edgar R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26(19):2460-2461 (2010).
Edgar R.C. et al. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics 27:2194-2200 (2011).
Schloss P.D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 75:7537-7541 (2009).
Altschul S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402 (1997).
White J.R. et al. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Comput Biol. 5:e1000352 (2009).
Koetschan C. et al. ITS2 database IV: Interactive taxon sampling for internal transcribed spacer 2 based phylogenies. Mol Phyl Evol. 63(3):585-588 (2012).

This material is based upon work supported by the National Science Foundation under Grant No. 0949201 and by the National Human Genome Research Institute under Grant No. 5RC2HG005597-02.

The authors declare no competing financial interests.

supplement0.pdf
Table2_MultipleSequencesInput_MappingFile Table 2. Mapping file, example 2. Example for tab-delimited table to be used as mapping file in combination with multiple sequence input files (one FASTA file per sample).
supplement0.doc
Full text document CloVR-ITS Protocol
supplement0.pdf
Table1_SingleSequenceInput_MappingFile Table 1. Mapping file, example 1. Example for tab-delimited table to be used as mapping file in combination with a single sequence input file.

Download PDF

Version 1

posted

You are reading this latest protocol version

CloVR-ITS: Automated ITS amplicon sequence analysis pipeline for fungal microbiota analysis– standard operating procedure, version 1.0

Status:

Version 1

Abstract

Equipment

Software

Reference data

Pipeline input

Procedure

A. Requirements for pipeline Input

A.1 Mapping file requirements for single sequence file input

A.2 Mapping file requirements for multiple sequence file input

A.3 Pairwise comparisons with Metastats

A.4 Providing quality scores with sequence data

B. Sequence preprocessing

B.1 File consistency check

B.3 Selection of high identity clusters

B.4 Chimera identification and removal

C. Sequence processing

C.1 Sequence clustering

C.2 Alpha-diversity analysis.

C3. Taxonomic annotation of high identity clusters

D. Additional analysis using Metastats and the R statistical package

D.1 Detection of differentially abundant features

D.2 Stacked histogram generation

D.3 Unsupervised sample clustering

D.4 Pie chart visualization

Anticipated Results

Pipeline output

References

Acknowledgements

Additional Declarations

Supplementary Files

Associated Publications

Status:

Version 1

Privacy Policy

Terms of Service

CloVR-ITS: Automated ITS amplicon sequence analysis pipeline for fungal microbiota analysis– standard operating procedure, version 1.0

Status:

Version 1

Abstract

Equipment

Software

Reference data

Pipeline input

Procedure

A. Requirements for pipeline Input

A.1 Mapping file requirements for single sequence file input

A.2 Mapping file requirements for multiple sequence file input

A.3 Pairwise comparisons with Metastats

A.4 Providing quality scores with sequence data

B. Sequence preprocessing

B.1 File consistency check

B.3 Selection of high identity clusters

B.4 Chimera identification and removal

C. Sequence processing

C.1 Sequence clustering

C.2 Alpha-diversity analysis.

C3. Taxonomic annotation of high identity clusters

D. Additional analysis using Metastats and the R statistical package

D.1 Detection of differentially abundant features

D.2 Stacked histogram generation

D.3 Unsupervised sample clustering

D.4 Pie chart visualization

Anticipated Results

Pipeline output

References

Acknowledgements

Additional Declarations

Supplementary Files

Associated Publications

Status:

Version 1

Privacy Policy

Terms of Service

Manage Cookie Preferences