Build a Bioinformatics Analysis Platform and Apply it to Routine Analysis of Microbial Genomics and Comparative Genomics

doi:10.21203/rs.2.21224/v4

Method Article


This work is licensed under a CC BY 4.0 License

This protocol has been posted on Protocol Exchange, an open repository of community-contributed protocols sponsored by Nature Portfolio. These protocols are posted directly on the Protocol Exchange by authors and are made freely available to the scientific community for use and comment.

Version 4


Genomics and comparative genomics are increasingly used as routine methods in general microbiological research. However, even simple analyses usually require calling several tools or writing scripts, which is difficult for most biological researchers. To simplify this process for microbiologists, we developed PGCGAP, a comprehensive, malleable and easily installed prokaryotic genomics and comparative genomics analysis pipeline. It implements genome assembly, gene prediction and annotation, average nucleotide identity (ANI) calculation, phylogenetic analysis, COG annotation, pan-genome analysis, inference of orthologous gene groups, variant calling and annotation, and screening for antimicrobial and virulence genes. Although we have tried to simplify the installation and usage of PGCGAP, users who are not bioinformaticians may still find it difficult to master. We therefore created this protocol to help microbiologists without bioinformatics experience establish their own bioinformatics platform and perform routine analyses. The protocol shows how to choose equipment, install a Linux subsystem on a laptop running Windows 10, install PGCGAP, and perform all analyses with an example dataset. It requires a basic understanding of Linux, so an additional web page was written to help new users learn Linux and whole-genome sequencing (http://bcam.hzau.edu.cn/linuxwgs.php).

Computational biology and bioinformatics

Keywords: comparative genomics, COG annotation, phylogenetic orthology, phylogenetic analysis, variants calling, pan-genome, genome assembly, gene prediction, genome annotation, genome distance, Average Nucleotide Identity, PGCGAP

Genome sequencing has become a routine method in microbiological studies as its cost continues to decrease. Various tools have been developed for genome analysis, but general users need time to install them, learn to use them and prepare the related input files. Even simple tasks can require considerable effort to integrate several tools or write scripts. For example, a core-genome-SNP-based phylogenetic analysis of isolates from the same species requires, in succession, Bowtie2¹ or BWA² for read mapping, Samtools³ or GATK⁴ for SNP calling, and FastTree⁵ or RAxML⁶ for phylogenetic tree construction. A comprehensive, flexible and efficient pipeline for general analysis is therefore needed. We developed a prokaryotic genomics and comparative genomics analysis pipeline named PGCGAP, which coordinates several genomic analysis software packages and in-house scripts to meet the various needs of microbiologists.

Development of the protocol

PGCGAP was developed to facilitate genomics and comparative genomics analysis of microbes. Because basic bioinformatics plays an important role in microbial research and most microbiologists lack analysis skills, this protocol describes the installation of a Linux system in detail and demonstrates how to install the software. Finally, it demonstrates every function of PGCGAP step by step using the example datasets.

Applications of the protocol

PGCGAP can be used for (i) genome assembly, (ii) gene prediction and annotation, (iii) genome distance estimation, (iv) phylogenetic analysis, (v) COG annotation, (vi) pan-genome analysis, (vii) inference of orthologous gene groups, (viii) variants calling and annotation and (ix) screening for antimicrobial and virulence genes. It is worth noting that although the entire pipeline was developed for prokaryotes, some of the modules such as “Assemble”, “MASH”, “OrthoF”, “CoreTree” and “AntiRes” can also be used for the analysis of eukaryotic genomes. In addition, “VAR” is applicable to the analysis of any haploid genome.

Advantages and limitations of this pipeline

PGCGAP is versatile, feature-rich, easy to install and use, and friendly to microbiologists and bioinformatics beginners. New features will continue to be added. However, a graphical user interface (GUI) has not yet been developed.

Expertise required to implement the protocol

Users need to be skilled in using computers, and it will be easier to master this protocol if they have some Linux skills. A webpage introducing the basics of Linux, usage of common commands, software installation, and the whole-genome sequencing technology was developed to help users get started with bioinformatics. Please visit https://github.com/liaochenlanruo/pgcgap/wiki/Learning-bioinformatics or http://bcam.hzau.edu.cn/linuxwgs.php for more information.

Overview of the procedure

Ten frequently used prokaryotic genomics and comparative genomics analyses are integrated into PGCGAP as separate modules. Modules can be used individually or in combination for various purposes (Figure 1).

(i) “VAR” performs genome-wide variant calling by read mapping. First, paired-end reads are filtered by Sickle⁷ and mapped to a reference genome by BWA². Second, variant calling and annotation are performed by Freebayes⁸ and snpEff⁹, respectively. Then, the whole-genome SNP alignment and the core SNP alignment are obtained by snippy-core¹⁰. Finally, Gubbins¹¹ is used to remove SNPs affected by recombination events from the whole-genome SNP alignment.

(ii) “Assemble” performs genome assembly using ABySS¹² or Canu¹³.

(iii) “Annotate” performs gene prediction and genome annotation by Prokka¹⁴.

(iv) “ANI” computes the Average Nucleotide Identity (ANI) between each genome pair by fastANI¹⁵. Three scripts, “triangle2list.pl”, “get_ANImatrix.pl” and “Plot_ANIheatmap.R”, were developed to generate the ANI matrix and plot the correlation matrix heat map (Supplementary Figure S1).

(v) “MASH” estimates genome and metagenome distance and similarity using MinHash¹⁶. A heat map of genome similarity is generated by two scripts, “get_Mash_Matrix.pl” and “Plot_MashHeatmap.R”.

(vi) “Pan” calls Roary¹⁷ to calculate the pan-genome. Two scripts, “fmplot.py” and “plot_3Dpie.R”, were developed for result visualization (Supplementary Figure S2). A phylogenetic tree based on the single-copy core proteins called by Roary¹⁷ can also be constructed (Supplementary Figure S3).

(vii) “pCOG” performs COG (Clusters of Orthologous Groups) annotation. The amino acid sequences of each genome are searched against the COG database with BLAST, and all hits are then mapped to COG functional categories by in-house scripts. The R script “Plot_COG.R” was written for result visualization (Supplementary Figure S4). COG functional categories can be compared and visualized among genomes with the Perl script “get_flag_relative_abundances_table.pl” and the R script “Plot_COG_Abundance.R” (Supplementary Figure S5).

(viii) “OrthoF” uses OrthoFinder¹⁸ for phylogenetic orthology inference. Gene duplication events are also predicted (Supplementary Figure S6).

(ix) “CoreTree” performs genome-wide phylogenetic analysis based on the protein sequences or SNPs of single-copy core genes. First, CD-HIT¹⁹ is used to rapidly generate protein clusters, and the protein sequences of single-copy core genes are then extracted by Perl scripts and aligned by MAFFT²⁰. Next, two trees are built from these alignments. For the protein tree, the protein alignments are concatenated and the phylogenetic tree is constructed by FastTree⁵ (Supplementary Figure S7). For the SNP tree, the protein alignments are converted into the corresponding codon alignments by PAL2NAL v14²¹, the codon alignments are concatenated, SNP-sites²² is called to find SNP sites, and FastTree⁵ is used to construct the SNP phylogenetic tree (Supplementary Figure S8).

(x) “AntiRes” calls abricate²³ to screen contigs for antimicrobial and virulence genes.

Experimental design

Selection of reference genome format for variants calling

The reference genome can be supplied in either FASTA or GenBank format. If a GenBank file rather than a FASTA file is supplied as the reference, annotation information is generated for the variants to show which features they affect.

Adjust the k-mer size (“--kmmer”) for better assembly results

In our experience, better assembly quality is obtained when the Illumina read length is 150 bp and “--kmmer” is set between 81 and 89. Users should check the report file in the assembly results to ensure that the N50 value is greater than 50,000 bp; otherwise, they can try changing the value of “--kmmer” to improve the assembly quality.
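
The N50 check can also be done by hand on any assembled FASTA file. The following is a minimal sketch, not part of PGCGAP: the helper name “n50” is hypothetical, and the loop assumes that the scaffolds are in “Results/Assembles/Scaf/Illumina” with names ending in “.fa”, as in the examples of this protocol.

```shell
# Hypothetical helper, not part of PGCGAP: print the N50 of a FASTA file.
n50() {
    # Step 1: print the length of every sequence, one per line.
    awk '/^>/ { if (len) print len; len = 0; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: sort lengths from longest to shortest.
    sort -rn |
    # Step 3: report the length at which the running sum reaches half the total.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { print lens[i]; exit }
             }
         }'
}

# Report the N50 of every scaffold file, if any exist (path is an assumption).
for fa in Results/Assembles/Scaf/Illumina/*.fa; do
    [ -e "$fa" ] || continue
    printf '%s\t%s\n' "$fa" "$(n50 "$fa")"
done
```

Any genome whose value falls below 50,000 is a candidate for re-assembly with a different “--kmmer” value.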

Choice of a module to construct the phylogenetic tree of single-copy core proteins

Both “CoreTree” and “Pan” can be used to construct a phylogenetic tree of single-copy core proteins. Which module to use depends on the type of input files available. “CoreTree” takes only amino acid sequence files as inputs, while “Pan” needs both amino acid sequence files and GFF3 files.

Choice of a module to calculate pairwise genome distance

Both “ANI” and “MASH” can calculate pairwise genome distance. “MASH” is more suitable for dealing with thousands of genomes because of its faster running speed. In addition to nucleotide sequences and assembled genomes, “MASH” can also take amino acid sequences and raw sequencing reads as inputs and can be used to calculate distances between metagenomic samples.

A laptop, desktop PC or server can be used to build a bioinformatics analysis platform, and the suggested hardware requirements are listed in Table 1. Slightly lower specifications are also acceptable (the CPU must have at least four logical processors and the memory must be greater than 8 GB), but the computing speed may decrease. The capacity of the hard disk can be adjusted according to actual requirements.

Building a bioinformatics analysis platform on Windows 10

Windows Subsystem for Linux (WSL) allows users to install a Linux subsystem directly on Windows 10. Users can then run Linux commands and install Linux software without third-party virtual machine software. Compared with a virtual machine, WSL makes better use of computer memory and does not require copying files between the host and the guest system.

Configuration of WSL

Timing ~1 min

System requirements: Windows 10 Version 1709, Build 16299 or above, 64-bit systems.

1. Enable WSL: Open "Settings", click "Apps", then find and click "Programs and Features", click "Turn Windows features on or off", find "Windows Subsystem for Linux" and check the box, click OK and restart the computer (Supplementary Video 1).

Install Linux

Timing ~59 min

2. Open the Microsoft Store, search Ubuntu, and choose to install Ubuntu 18.04 LTS. Follow the prompts to set up your username and password. Here we create an account with the username “bio” (Supplementary Video 2). After the installation is finished, we need to do some configuration on the system (Supplementary Video 3).

3. Enter the following command in the terminal to update the package lists:

$sudo apt-get update

4. Set the password for root.

$sudo passwd root

5. Change the content of ptrace_scope from 1 to 0.

$su root

#echo 0 > /proc/sys/kernel/yama/ptrace_scope

6. Switch to the ordinary user “bio”.

#su bio

7. Installation of Miniconda

(A) Installation of Miniconda on Linux

(i) Go to the official Miniconda website and select the installation file suitable for your system and Python version. Here, Miniconda 3 is installed (Supplementary Video 4).

$wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

(ii) Start installation

$bash Miniconda3-latest-Linux-x86_64.sh

Press “Enter” when prompted to view the license, then type “yes” and press “Enter” to continue. Press “Enter” to confirm the installation location; Miniconda is installed in the miniconda3 directory under the user’s home directory. Type “yes” and press “Enter” to initialize Miniconda. Finally, run the “source ~/.bashrc” command in the terminal.

$source ~/.bashrc

(iii) Set up Bioconda channel. Add the channels by entering the following three commands in the terminal.

$conda config --add channels defaults

$conda config --add channels bioconda

$conda config --add channels conda-forge
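
Each “conda config --add channels” call places the new channel at the top of the list, so after the three commands above “conda-forge” should have the highest priority. A minimal sketch for confirming this follows; “top_channel” is a hypothetical helper that only parses the output of “conda config --show channels”, and it is not a conda command.

```shell
# Hypothetical helper: read the output of "conda config --show channels"
# on standard input and print the first (highest-priority) channel.
top_channel() {
    awk '$1 == "-" { print $2; exit }'
}

# Query conda only when it is actually installed.
if command -v conda >/dev/null 2>&1; then
    conda config --show channels | top_channel
fi
```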

(B) Installation of Miniconda on macOS

(i) Installation of Miniconda3

$wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh

$sh Miniconda3-latest-MacOSX-x86_64.sh

$source ~/.bash_profile

(ii) Add channels of Bioconda

$conda config --add channels defaults

$conda config --add channels bioconda

$conda config --add channels conda-forge

Installation of PGCGAP (Supplementary Video 5).

Timing ~34 min

8. Create a pgcgap environment for the installation of PGCGAP.

$conda create -n pgcgap python=3

9. Activate the environment.

$conda activate pgcgap

10. Installation of PGCGAP.

$conda install pgcgap

11. Check if the dependent software packages were installed.

$pgcgap --check-external-programs

12. Set up the COG database.

$pgcgap --setup-COGdb

13. Exit the environment.

$conda deactivate
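
In addition to “pgcgap --check-external-programs”, a quick manual check of the PATH can help when a single tool seems to be missing. The sketch below is an illustration only: “missing_tools” is a hypothetical helper, and the executable names passed to it are assumptions that may differ between versions. Run it while the pgcgap environment is active.

```shell
# Hypothetical helper: print every name in the argument list
# that cannot be found on the PATH.
missing_tools() (
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
)

# The executable names below are assumptions; adjust them as needed.
missing_tools pgcgap prokka roary mash abricate
```

No output means that every listed executable was found.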

Step-by-step examples

Timing ~2 d

14. Download the example dataset.

$wget http://bcam.hzau.edu.cn/PGCGAP/PGCGAP_Examples.rar

In this example, the working directory is located on the H drive. All Windows hard disks are mounted under the “/mnt” directory of Ubuntu Linux. The “PGCGAP_Examples/Reads/Illumina” directory contains Illumina HiSeq paired-end reads of six Escherichia coli strains, the “PGCGAP_Examples/Reads/Oxford” directory contains the Oxford Nanopore reads of Escherichia coli K12, and the “PGCGAP_Examples/Reads/PacBio” directory contains the P6-C4 chemistry reads of Escherichia coli K12 released by Pacific Biosciences. “PGCGAP_Examples/Reads/MG1655.gbff” is the GenBank format file of E. coli K-12 substr. MG1655, used as the reference genome.
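
Because Windows drives appear under “/mnt”, a Windows path can be translated to its Linux form by lowercasing the drive letter and switching the slashes. The sketch below illustrates the mapping; “win_to_wsl” is a hypothetical helper, and it assumes a path that starts with a drive letter followed by a colon.

```shell
# Hypothetical helper: convert a Windows path such as H:\data
# to the corresponding path under /mnt in the Linux subsystem.
win_to_wsl() (
    drive=$(printf '%s' "$1" | cut -c1 | tr 'A-Z' 'a-z')   # drive letter, lowercased
    rest=$(printf '%s' "$1" | cut -c3- | tr '\\' '/')      # path after "H:", with forward slashes
    printf '/mnt/%s%s\n' "$drive" "$rest"
)

win_to_wsl 'H:\PGCGAP_Examples\Reads\MG1655.gbff'   # prints /mnt/h/PGCGAP_Examples/Reads/MG1655.gbff
```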

15. Activate the pgcgap environment.

$conda activate pgcgap

16. Example 1: Genome assembly with Illumina reads.

Paired-end reads of six strains in the directory “Reads/Illumina/” are used as inputs. In the dataset, the read files are named “strain_1.fastq.gz” and “strain_2.fastq.gz”. The string after the strain name is “_1.fastq.gz”, which is 11 characters long, so “--suffix_len” is set to 11.

$pgcgap --Assemble --platform illumina --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --kmmer 81 --threads 4 --suffix_len 11

New directories and files are generated when the program finishes. The assembly results for each genome are in the “Results/Assembles/Illumina” directory, and all scaffolds of the strains are stored in “Results/Assembles/Scaf/Illumina”. Users are advised to check the assembly stats file of each genome (such as Results/Assembles/Illumina/SRR9620252_assembly/SRR9620252-stats.tab) to ensure that the N50 value is greater than 50,000 bp. The file “scaf.list” in the working directory contains the absolute paths of all genomes.
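
The value for “--suffix_len” is simply the number of characters in the read-file suffix, so it can be computed rather than counted by eye. The following is a minimal sketch; “suffix_len” and “strain_name” are hypothetical helpers and not part of PGCGAP.

```shell
# Hypothetical helpers, not part of PGCGAP.

# Print the number of characters in a suffix such as "_1.fastq.gz".
suffix_len() {
    printf '%s\n' "${#1}"
}

# Print the strain name: the file name with its directory and suffix removed.
strain_name() (
    base=${1##*/}
    printf '%s\n' "${base%"$2"}"
)

suffix_len _1.fastq.gz                                         # prints 11
suffix_len .fasta                                              # prints 6
strain_name Reads/Illumina/SRR9620252_1.fastq.gz _1.fastq.gz   # prints SRR9620252
```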

17. Example 2: Oxford reads assembly.

Oxford Nanopore sequencing produces only one reads file (“Reads/Oxford/oxford.fasta”), so only the “--reads1” parameter needs to be set; here the value is “.fasta”. “--genomeSize” is the estimated genome size, which users can base on the genome sizes of similar strains in the NCBI database. The parameter is set to “4.8m” here. The suffix of the reads file is “.fasta”, which is 6 characters long, so “--suffix_len” is set to 6.

$pgcgap --Assemble --platform oxford --ReadsPath Reads/Oxford --reads1 .fasta --genomeSize 4.8m --threads 4 --suffix_len 6

The results are stored in the “Results/Assembles/Oxford” directory and the “Results/Assembles/Scaf/Oxford” directory. The former contains all intermediate files and genome files; the latter contains only the assembled genome.

18. Example 3: PacBio reads assembly.

PacBio sequencing also produces only one reads file (“Reads/PacBio/pacbio.fastq”), so the parameter settings are similar to those for Oxford Nanopore. The strain name is “pacbio” and the suffix is “.fastq”, which is 6 characters long, so “--suffix_len” is set to 6.

$pgcgap --Assemble --platform pacbio --ReadsPath Reads/PacBio --reads1 .fastq --genomeSize 4.8m --threads 4 --suffix_len 6

The results are stored in the “Results/Assembles/PacBio” directory and the “Results/Assembles/Scaf/PacBio” directory. The former contains all intermediate files and genome files; the latter contains only the assembled genome.

19. Example 4: Gene prediction and annotation.

Here, the assembly results of the Illumina reads are taken as inputs (“Results/Assembles/Scaf/Illumina/*.fa”). The suffix of the genome files is “-8.fa”. When running the program, the value of the “--Scaf_suffix” parameter must not be quoted; here, -8.fa is given without quotation marks.

$pgcgap --Annotate --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa --genus Escherichia --species "Escherichia coli" --codon 11 --threads 4

The generated files are stored in the “Results/Annotations” directory, and files in the directories “Results/Annotations/AAs”, “Results/Annotations/CDs” and “Results/Annotations/GFF” will be used for subsequent analysis.

20. Example 5: Constructing the single-copy core proteins tree and core SNPs tree.

The phylogenetic trees of single-copy core proteins and of single-copy core gene SNPs are constructed using the six E. coli genomes sequenced by Illumina as the dataset. The input files are the amino acid sequence files (“Results/Annotations/AAs/*.faa”) and the nucleotide sequence files (“Results/Annotations/CDs/*.ffn”) obtained by genome annotation. Amino acid files and nucleotide files must be suffixed with “.faa” and “.ffn”, respectively. The “.faa” and “.ffn” files of the same strain should have the same prefix. The protein IDs and gene IDs in the amino acid and nucleotide files should start with the strain name. We suggest using Prokka¹⁴ to generate the input files.

$pgcgap --CoreTree --CDsPath Results/Annotations/CDs --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4

The result files are stored in the “Results/CoreTrees” directory. “ALL.core.protein.nwk” and “ALL.core.snp.nwk” are the phylogenetic tree files of the single-copy core proteins and the core SNPs, respectively. Users can import these two files into MEGA²⁴ or iTOL²⁵ to view the topology.
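
Before opening a tree in a viewer, it can be worth confirming that it contains one tip per strain (six in this example). The sketch below counts tips in a Newick file by counting commas; “newick_tips” is a hypothetical helper, and it assumes a single tree per file whose labels contain no commas.

```shell
# Hypothetical helper: print the number of tips in a Newick tree file.
# A Newick tree with n tips contains n - 1 commas, provided that no
# label or comment contains a comma (an assumption of this sketch).
newick_tips() {
    echo $(( $(tr -cd ',' < "$1" | wc -c) + 1 ))
}

# Check the protein tree if it exists (file name as given in this protocol).
tree=Results/CoreTrees/ALL.core.protein.nwk
if [ -e "$tree" ]; then
    newick_tips "$tree"
fi
```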

21. Example 6: Constructing the single-copy core protein tree only.

If “--CDsPath” is set to “NO”, the nucleotide files are not needed and the phylogenetic tree of core SNPs is not constructed.

$pgcgap --CoreTree --CDsPath NO --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4

22. Example 7: Pan-genome analysis and phylogenetic tree construction.

GFF3 files (with “.gff” as the suffix) of each strain should be placed in one directory (“Results/Annotations/GFF/*.gff”). They must contain the nucleotide sequence at the end of the file. If the parameter “--PanTree” is provided to construct a phylogenetic tree, protein sequence files in FASTA format (one per strain) are also needed in another directory (“Results/Annotations/AAs/*.faa”). Note that each “*.gff” file must have a corresponding “*.faa” file. We strongly recommend using Prokka¹⁴ to generate the files. If the “--Annotate” function is run first, the files are generated automatically.

$pgcgap --Pan --codon 11 --strain_num 6 --threads 4 --GffPath Results/Annotations/GFF --PanTree --AAsPath Results/Annotations/AAs

The results are stored in the “Results/PanGenome” directory. A spreadsheet named “gene_presence_absence.csv” lists each gene and the samples in which it is present. Some visual results (“*.pdf”) are also produced. The file “Results/PanGenome/Core/Roary.core.protein.nwk” is the phylogenetic tree constructed from the single-copy core proteins called by Roary. If the parameters “--PanTree” and “--AAsPath” are not provided, the phylogenetic tree is not constructed.
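
Because every “*.gff” file needs a matching “*.faa” file when “--PanTree” is used, a quick pairing check before running the module can save a failed run. The following is a minimal sketch; “unpaired_strains” is a hypothetical helper, and it assumes that the two files of a strain share the same prefix.

```shell
# Hypothetical helper: list strains that have a .gff file in the first
# directory but no .faa file with the same prefix in the second directory.
unpaired_strains() (
    gff_dir=$1
    faa_dir=$2
    for gff in "$gff_dir"/*.gff; do
        [ -e "$gff" ] || continue              # the directory holds no .gff files
        strain=$(basename "$gff" .gff)
        [ -e "$faa_dir/$strain.faa" ] || printf '%s\n' "$strain"
    done
)

# Directories as used in this protocol; no output means all files are paired.
unpaired_strains Results/Annotations/GFF Results/Annotations/AAs
```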

23. Example 8: Inference of orthologous gene groups.

The input files are also the amino acid sequence files suffixed with “.faa” (“Results/Annotations/AAs/*.faa”).

$pgcgap --OrthoF --threads 4 --AAsPath Results/Annotations/AAs

The resulting files are placed in the “Results/OrthoFinder/Results_orthoF” directory.

24. Example 9: Compute whole-genome Average Nucleotide Identity.

The input file named “scaf.list” contains the absolute path of each genome, one per line. If the “--Assemble” function is run first, the list file is generated automatically. The value of the “--Scaf_suffix” parameter depends on the actual file names; here it is “-8.fa”.

$pgcgap --ANI --threads 4 --queryL scaf.list --refL scaf.list --ANIO Results/ANI/ANIs --Scaf_suffix -8.fa

The results are stored in the “Results/ANI” directory. The file “ANI” contains comparison information for genome pairs. It is composed of five columns: query genome, reference genome, ANI value, count of bidirectional fragment mappings and total query fragments. A heat map file “ANI_matrix.pdf” is also generated.
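
The pairwise ANI table can be filtered on the command line, for example to list genome pairs above a chosen identity. The sketch below assumes the tab-separated fastANI layout (query, reference, ANI value, fragment mappings, total fragments); “ani_pairs_above” is a hypothetical helper, and the default cutoff of 95 is only a commonly used species-level threshold, not a PGCGAP setting.

```shell
# Hypothetical helper: print query, reference and ANI for every pair of
# different genomes whose ANI is at least the cutoff (default 95).
# Assumes tab-separated columns: query, reference, ANI, mappings, fragments.
ani_pairs_above() (
    file=$1
    cutoff=${2:-95}
    awk -F '\t' -v cutoff="$cutoff" \
        '$1 != $2 && $3 + 0 >= cutoff + 0 { print $1 "\t" $2 "\t" $3 }' "$file"
)

# Example call; the output path follows the --ANIO value used above.
if [ -e Results/ANI/ANIs ]; then
    ani_pairs_above Results/ANI/ANIs 95
fi
```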

25. Example 10: Genome and metagenome similarity estimation using MinHash

It takes genome files (complete or draft) in a directory as inputs (Default: Results/Assembles/Scaf/Illumina).

$pgcgap --MASH --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa

The results are stored in the “Results/MASH” directory. The file “MASH” shows the pairwise distance between genomes; its columns are Reference-ID, Query-ID, Mash-distance, P-value and Matching-hashes. A heat map file “MASH_matrix.pdf” is also generated.
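
The Matching-hashes column is a fraction such as “456/1000”, and dividing it out gives the share of sketch hashes that two genomes have in common. The sketch below is an illustration under the assumption that the “MASH” file is tab-separated with the five columns listed above; “mash_shared_fraction” is a hypothetical helper.

```shell
# Hypothetical helper: print Reference-ID, Query-ID and the fraction of
# matching hashes (column 5, given as "shared/total") for each genome pair.
# Assumes a tab-separated file with the five columns described above.
mash_shared_fraction() {
    awk -F '\t' '{
        split($5, h, "/")
        if (h[2] > 0) printf "%s\t%s\t%.3f\n", $1, $2, h[1] / h[2]
    }' "$1"
}

# Example call; the file location is an assumption based on the text above.
if [ -e Results/MASH/MASH ]; then
    mash_shared_fraction Results/MASH/MASH
fi
```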

26. Example 11: COG annotation.

The input files are also the amino acid sequence files suffixed with “.faa” (“Results/Annotations/AAs/*.faa”).

$pgcgap --pCOG --threads 4 --strain_num 6 --AAsPath Results/Annotations/AAs

The results are stored in the “Results/COG” directory. The super COG table of each strain (“*.2Scog.table”) and its plot (“*.2Scog.table.pdf”) will be generated. “All_flags_relative_abundances.table” is a table containing the relative abundance of each flag for all strains, and “All_flags_relative_abundances.pdf” is the corresponding visualization result.

27. Example 12: Variants calling, and phylogenetic tree construction based on a reference genome.

The six genomes sequenced by Illumina were chosen as the dataset (“Reads/Illumina/*.gz”). Escherichia coli K-12 substr. MG1655 is used as the reference genome, and the reference file “MG1655.gbff” in GenBank format is stored in the “Reads” directory. The absolute path of the reference genome (here “/mnt/h/PGCGAP_Examples/Reads/MG1655.gbff”) is required to run the program.

$pgcgap --VAR --threads 4 --refgbk /mnt/h/PGCGAP_Examples/Reads/MG1655.gbff --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --suffix_len 11 --strain_num 6 --qualtype sanger

The resulting files are stored in the “Results/Variants” directory, where the “Core” directory contains the core SNPs of all strains and their phylogenetic tree.
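
Since “--refgbk” needs an absolute path, it can be convenient to derive that path from a relative one instead of typing it. The following is a minimal sketch using only standard shell tools; “abs_path” is a hypothetical helper, and it assumes that the directory containing the file exists.

```shell
# Hypothetical helper: print the absolute path of a file.
# Runs in a subshell, so the caller's working directory is unchanged.
abs_path() (
    cd "$(dirname "$1")" || exit 1
    printf '%s/%s\n' "$(pwd -P)" "$(basename "$1")"
)

# Example: resolve the reference file relative to the working directory.
if [ -e Reads/MG1655.gbff ]; then
    abs_path Reads/MG1655.gbff
fi
```

The printed path can then be passed to “--refgbk”.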

28. Example 13: Screening of contigs for antimicrobial and virulence genes

It takes genome files (complete or draft) in a directory as inputs (Default: Results/Assembles/Scaf/Illumina).

$pgcgap --AntiRes --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa --threads 4 --db ncbi --identity 75 --coverage 50

The resulting files are stored in the “Results/AntiRes” directory. “*.tab” files are screening results of each strain, and the “summary.txt” file contains a matrix of gene presence/absence for all strains.

29. Example 14: Perform all functions for paired-end reads.

Only the reads files and the reference file need to be provided. For the sake of flexibility, the “VAR” function must be added explicitly.

$pgcgap --All --platform illumina --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --suffix_len 11 --kmmer 81 --genus Escherichia --species "Escherichia coli" --codon 11 --strain_num 6 --threads 4 --VAR --refgbk /mnt/h/PGCGAP_Examples/Reads/MG1655.gbff --qualtype sanger --PanTree

Troubleshooting advice can be found in Table 2.

The following times were measured in WSL on a laptop with these specifications: i7-4710MQ CPU (4 cores and 8 logical processors), 16 GB DDR3L RAM, 240 GB SSD and 1 TB HDD. All commands used 4 threads.

Step 1, configuration of WSL, 1 min.

Step 2, installation of Linux, 43 min.

Steps 3-6, configuration of Linux, 10 min.

Step 7, installation of Miniconda, 6 min.

Steps 8-13, installation of PGCGAP, 34 min.

Step 14, download the example dataset, 11 min.

Step 15, activate the pgcgap environment, 8 s.

Step 16, Illumina reads assembly, 43 min.

Step 17, Oxford reads assembly, 1.6 h.

Step 18, PacBio reads assembly, 54 min.

Step 19, gene prediction and annotation, 1 h.

Step 20, constructing the single-copy core protein tree and core SNPs tree, 3.3 h.

Step 21, constructing the single-copy core protein tree only, 3 h.

Step 22, pan-genome analysis and phylogenetic tree construction, 3.5 h.

Step 23, inference of orthologous gene groups, 51 min.

Step 24, compute whole-genome Average Nucleotide Identity, 17 s.

Step 25, genome similarity estimation using MinHash, 1 min.

Step 26, COG annotation, 20.3 h.

Step 27, variant calling, and phylogenetic tree construction based on reference genome, 11.6 h.

Step 28, screening of contigs for antimicrobial and virulence genes, 30 s.

Step 29, perform all functions for paired-end reads, 1.9 d.

1 Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357 (2012).

2 Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997v2 [q-bio.GN] (2013).

3 Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078-2079 (2009).

4 McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297-1303 (2010).

5 Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2--approximately maximum-likelihood trees for large alignments. PLoS One 5, e9490 (2010).

6 Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312-1313 (2014).

7 Joshi, N. A. & Fass, J. N. Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files (Version 1.33) [Software]. Available at https://github.com/najoshi/sickle (2011).

8 Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. arXiv:1207.3907 [q-bio.GN] (2012).

9 Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin) 6, 80-92 (2012).

10 Seemann, T. Snippy: Rapid haploid variant calling and core genome alignment. Available at https://github.com/tseemann/snippy (2014).

11 Croucher, N. J. et al. Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins. Nucleic Acids Res. 43, e15 (2015).

12 Jackman, S. D. et al. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res. 27, 768-777 (2017).

13 Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722-736 (2017).

14 Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068-2069 (2014).

15 Jain, C., Rodriguez-R, L. M., Phillippy, A. M., Konstantinidis, K. T. & Aluru, S. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9, 5114 (2018).

16 Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).

17 Page, A. J. et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691-3693 (2015).

18 Emms, D. M. & Kelly, S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).

19 Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658-1659 (2006).

20 Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059-3066 (2002).

21 Suyama, M., Torrents, D. & Bork, P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34, W609-612 (2006).

22 Page, A. J. et al. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb. Genom. 2, e000056 (2016).

23 Seemann, T. Abricate. Available at https://github.com/tseemann/abricate.

24 Kumar, S., Stecher, G., Li, M., Knyaz, C. & Tamura, K. MEGA X: Molecular Evolutionary Genetics Analysis across Computing Platforms. Mol. Biol. Evol. 35, 1547-1549 (2018).

25 Letunic, I. & Bork, P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 44, W242-W245 (2016).

This work was made possible through funding from the National Key R&D Program of China (2017YFD0201201), National Natural Science Foundation of China (31670085, 31970003 and 31770003), and China 948 Program of Ministry of Agriculture (2016-X21).

The authors declare no competing financial interests.
