Building a bioinformatics analysis platform on Windows 10
Windows Subsystem for Linux (WSL) allows users to install Linux subsystems directly on Windows 10 system. It can easily run Linux commands and install Linux software, avoiding the installation of third-party virtual machine software. The advantage of WSL is that it makes better use of computer memory and does not require copying files between the host and the virtual machine.
Configuration of WSL
Timing ~1 min
System requirements: Windows 10 Version 1709, Build 16299 or above, 64-bit systems.
1. Enable WSL: Open "Settings", click "Apps", then find and click "Programs and Features", click "Turn Windows features on or off", find "Windows Subsystem for Linux" and check the box, click OK and restart the computer (Supplementary Video 1).
Install Linux
Timing ~59 min
2. Open the Microsoft Store, search Ubuntu, and choose to install Ubuntu 18.04 LTS. Follow the prompts to set up your username and password. Here we create an account with the username “bio” (Supplementary Video 2). After the installation is finished, we need to do some configuration on the system (Supplementary Video 3).
3. Enter the following command in the terminal to update the source:
$sudo apt-get update
4. Set the password for root.
$sudo root
5. Change the content of ptrace_scope from 1 to 0.
$su root
#echo 0 > /proc/sys/kernel/yama/ptrace_scope
6. Switch to the ordinary user “bio”.
#su bio
7. Installation of Miniconda
(A) Installation of Miniconda on Linux
(i) Here, Miniconda will be installed, go to the official website, and select the installation file suitable for your system and python version. Here, Miniconda 3 will be installed (Supplementary Video 4).
$wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
(ii) Start installation
$bash Miniconda3-latest-Linux-x86_64.sh
Press Enter when prompted to view the license, enter “yes” and press “Enter” to continue. Press “Enter” to confirm the installation location. Miniconda was installed in the miniconda3 directory under the user’s home directory. Type “yes” and press “Enter” to initialize. Finally, run the “source ~/.bashrc” command in the terminal.
$source ~/.bashrc
(iii) Set up Bioconda channel. Add the channels by entering the following three commands in the terminal.
$conda config --add channels defaults
$conda config --add channels bioconda
$conda config --add channels conda-forge
(B) Install Miniconda on MacOS
(i) Installation of Miniconda3
$wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
$sh Miniconda3-latest-MacOSX-x86_64.sh
$source ~/.bash_profile
(ii) Add channels of Bioconda
$conda config --add channels defaults
$conda config --add channels bioconda
$conda config --add channels conda-forge
Installation of PGCGAP (Supplementary Video 5).
Timing ~34 min
8. Create a pgcgap environment for the installation of PGCGAP.
$conda create -n pgcgap python=3
9. Activate the environment.
$conda activate pgcgap
10. Installation of PGCGAP.
$conda install pgcgap
11. Check if the dependent software packages were installed.
$pgcgap --check-external-programs
12. Set up the COG database.
$pgcgap --setup-COGdb
13. Exit the environment.
$conda deactivate
Step by Step examples
Timing ~2 d
14. Download the example dataset.
$wget http://bcam.hzau.edu.cn/PGCGAP/PGCGAP_Examples.rar
In this example, the working directory locates at the H drive. All hard disks in Windows were mounted in the “/mnt” directory of Ubuntu Linux. The “PGCGAP_Examples/Reads/Illumia” directory contains six Illumina Hiseq paired-end reads of Escherichia coli, the “PGCGAP_Examples/Reads/Oxford” directory contains the Oxford Nanopore reads of Escherichia coli K12, and the “PGCGAP_Examples/Reads/PacBio” directory contains the Pacific Biosciences released P6-C4 chemistry reads of Escherichia coli K12. “PGCGAP_Examples/Reads/MG1655” is the GenBank format file of E.coli K-12 substr. MG1655, used as the reference genome.
15. Activate the pgcgap environment.
$conda activate pgcgap
16. Example 1: Genome assembly with Illumina reads.
Paired-end reads of six strains in the directory “Reads/Illumina/” were used as inputs. In the dataset, the naming format of the genome is “strain_1.fastq.gz” and “strain_2.fastq.gz”. The string after the strain name is “_1.fastq.gz”, and its length is 11, so “--suffix_len” was set to 11.
$pgcgap --Assemble --platform illumina --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --kmmer 81 --threads 4 --suffix_len 11
New directories and documents were generated after the program is finished. The assembly results for each genome are in the “Results/Assembles/Illumina” directory, and all scaffolds of the strains were stored in “Results/Assembles/Scaf/Illumina”. Users are advised to check the assembly stats file (such as Results/Assembles/Illumina/SRR9620252_assembly/SRR9620252-stats.tab) of each genome to ensure that the value of N50 is greater than 50,000 bp. The file “scaf.list” under the working directory contains the absolute path of all genomes.
17. Example 2: Oxford reads assembly.
Oxford nanopore only produces one reads file (“Reads/Oxford/oxford.fasta”), so only the parameter of “--reads1” needs to be set, here the value is “.fasta”. “--genomeSize” is the estimated genome size, and users can check the genome size of similar strains in the NCBI database for reference. The parameter was set to “4.8m” here. The suffix of the reads file here is “.fasta” and its length is 6, so “--suffix_len” was set to 6.
$pgcgap --Assemble --platform oxford --ReadsPath Reads/Oxford --reads1 .fasta --genomeSize 4.8m --threads 4 --suffix_len 6
The results are stored in the “Results/Assembles/Oxford” directory and the “Results/Assembles/Scaf/Oxford” directory. The former contains all intermediate files and genome files, the latter contains only the assembled genome.
18. Example 3: PacBio reads assembly.
PacBio also produces only one reads file too (“Reads/PacBio/pacbio.fastq”), the parameter settings are similar to Oxford. The strain name is “pacbio” with the suffix “.fastq” and the suffix length is 6, so “--suffix_len” was set to 6.
$pgcgap --Assemble --platform pacbio --ReadsPath Reads/PacBio --reads1 .fastq --genomeSize 4.8m --threads 4 --suffix_len 6
The results are stored in the “Results/Assembles/PacBio” directory and in the “Results/Assembles/Scaf/PacBio” directory. The former contains all intermediate files and genome files, the latter containing only the assembled genome.
19. Example 4: Gene prediction and annotation.
Here, the assembly results of Illumina reads were taken as inputs (“Results/Assembles/Scaf/Illumina/*.fa”). The suffix of the genome is “-8.fa”. When running the program, the value of the “--Scaf_suffix” parameter cannot be quoted. Here, -8.fa should not be quoted.
$pgcgap --Annotate --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa --genus Escherichia --species “Escherichia coli” --codon 11 --threads 4
The generated files are stored in the “Results/Annotations” directory, and files in the directories “Results/Annotations/AAs”, “Results/Annotations/CDs” and “Results/Annotations/GFF” will be used for subsequent analysis.
20. Example 5: Constructing the single-copy core proteins tree and core SNPs tree.
The phylogenetic trees of single-copy core proteins and single-copy core genes SNPs will be constructed using the six E. coli genomes sequenced by Illumina as datasets. The input files are the amino acid sequence files (“Results/Annotations/AAs/*.faa”) and the nucleotide sequence files (“Results/Annotations/CDs/*.ffn”) obtained by the genome annotation. Amino acid files and nucleotide files must be suffixed with “.faa” and “.ffn”, respectively. The “.faa” and “.ffn” files of the same strain should have the same prefix name. The name of protein IDs and gene IDs in the Amino acids file and nucleotide file should be started with the strain name. The Prokka14 software was suggested to generate the input files.
$pgcgap --CoreTree --CDsPath Results/Annotations/CDs --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4
The result files are stored in the “Results/CoreTrees” directory. “ALL.core.protein.nwk” and “ALL.core.snp.nwk” are the phylogenetic tree files of the single-copy core proteins and the core SNPs. Users can import these two files into MEGA23 or iTOL24 to view the topology.
21. Example 6: Constructing the single-copy core protein tree only.
If the “--CDsPath” was set to “NO”, the nucleotide files will not be needed, and the phylogenetic tree of core SNPs will not be constructed too.
$pgcgap --CoreTree --CDsPath NO --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4
22. Example 7: pan-genome analysis and phylogenetic tree constructing.
GFF3 files (With “.gff” as the suffix) of each strain placed into a directory (“Results/Annotations/GFF/*.gff”). They must contain the nucleotide sequence at the end of the file. Protein sequence files (one per species) in FASTA format under another directory were also needed (“Results/Annotations/AAs/*.faa”) if the parameter “--PanTree” was provided for constructing a phylogenetic tree. It should be noted that the “*.gff” file and the “*.faa” file must correspond. We strongly recommend using Prokka14 to generate the files. If the “--Annotate” function was run first, the files will be generated automatically.
$pgcgap --Pan --codon 11 --strain_num 6 --threads 4 --GffPath Results/Annotations/GFF --PanTree --AAsPath Results/Annotations/AAs
The results are stored in the “Results/PanGenome” directory. A spreadsheet named “gene_presence_absence.csv” lists each gene and which sample is presented in. At the same time, some visual results (“*.pdf”) are also outputted. The file “Results/PanGenome/Core/Roary.core.protein.nwk” is the phylogenetic tree constructed based on the single-copy core proteins called by Roary software. If the parameters “--PanTree” and “--AAsPath” were not provided, the phylogenetic tree will not be constructed.
23. Example 8: Inference of orthologous gene groups.
The input files are also the amino acid sequence files suffixed with “.faa” (“Results/Annotations/AAs/*.faa”).
$pgcgap --OrthoF --threads 4 --AAsPath Results/Annotations/AAs
The resulting files are placed in the “Results/OrthoFinder/Results_orthoF” directory.
24. Example 9: Compute whole-genome Average Nucleotide Identity.
The input file named “scaf.list” contains the absolute path of each genome, one per line. If the “--Assemble” function was run first, the list file will be generated automatically. The value of the parameter “--Scaf_suffix” depends on the actual situation, here is “-8.fa”.
$pgcgap --ANI --threads 4 --queryL scaf.list --refL scaf.list --ANIO Results/ANI/ANIs --Scaf_suffix -8.fa
The results are stored in the “Results/ANI” directory. The file “ANI” contains comparison information of genome pairs. The document is composed of five columns, each of which represents ANI value, count of bidirectional fragment mappings and total query fragments, respectively. A heat map file “ANI_matrix.pdf” was generated.
25. Example 10: Genome and metagenome similarity estimation using MinHash
It takes genome files (complete or draft) in a directory as inputs (Default: Results/Assembles/Scaf/Illumina).
$pgcgap --MASH --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa
The results are stored in the “Results/MASH” directory. The file “MASH” shows pairwise distance between pair genomes and each column represents Reference-ID, Query-ID, Mash-distance, P-value and Matching-hashes, respectively. A heat map file “MASH_matrix.pdf” was generated.
26. Example 11: COG annotation.
The input files are also the amino acid sequence files suffixed with “.faa” (“Results/Annotations/AAs/*.faa”).
$pgcgap --pCOG --threads 4 --strain_num 6 --AAsPath Results/Annotations/AAs
The results are stored in the “Results/COG” directory. The super COG table of each strain (“*.2Scog.table”) and its plot (“*.2Scog.table.pdf”) will be generated. “All_flags_relative_abundances.table” is a table containing the relative abundance of each flag for all strains, and “All_flags_relative_abundances.pdf” is the corresponding visualization result.
27. Example 12: Variants calling, and phylogenetic tree construction based on a reference genome.
The six genomes sequenced by Illumina were chosen as datasets (“Reads/Illumina/*.gz”). Taking Escherichia coli K-12 substr. MG1655 as the reference genome and the reference file “MG1655.gbff” in the GenBank format is stored in the “Reads” directory. The absolute path of the reference genome (here is “/mnt/h/PGCGAP_Examples/Reads/MG1655.gbff”) is required to run the program.
$pgcgap --VAR --threads 4 --refgbk /mnt/h/PGCGAP_Examples/Reads/MG1655.gbff --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --suffix_len 11 --strain_num 6 --qualtype sanger
The resulting files are stored in the “Results/Variants” directory, where the “Core” directory contains the core SNPs of all strains and their phylogenetic tree.
28. Example 13: Screening of contigs for antimicrobial and virulence genes
It takes genome files (complete or draft) in a directory as inputs (Default: Results/Assembles/Scaf/Illumina).
$pgcgap --AntiRes --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa --threads 4 --db ncbi --identity 75 --coverage 50
The resulting files are stored in the “Results/AntiRes” directory. “*.tab” files are screening results of each strain, and the “summary.txt” file contains a matrix of gene presence/absence for all strains.
29. Example 14: Perform all functions for paired-end reads.
Only the reads file and reference file should be provided. For the sake of flexibility, the "VAR" function needs to be added extra.
$pgcgap --All --platform illumina --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --suffix_len 11 --kmmer 81 --genus Escherichia --species “Escherichia coli” --codon 11 --strain_num 6 --threads 4 --VAR --refgbk /mnt/h/PGCGAP_Examples/Reads/MG1655.gbff --qualtype sanger --PanTree