Read Alignment
- Concatenate chromosome sequences of reference assembly of NCBI build v36 into a single reference sequence file in FASTA format (hereafter, referred to as “Reference.fasta”).
- Align Illumina GA reads from each Illumina GA lane to the reference sequence by software SOAP1. To take use of more data, we recommend to add command line parameter “-c 52” to trim low-quality bases in alignment process. Typically, the command lines are as follows:
For single-end (SE) data:
soap -a <Reads.fastq> -d <Reference.fasta> -o <Alignment.soap> -p <# of parallel processes> -c 52 -s 12
For paired-end (PE) data:
soap -a <Reads1.fastq> -b <Reads2.fastq> -d <Reference.fasta> -o <PEalignment.soap> -2 <SEalignment.soap> -p <# of parallel processes> -c 52 -s 12 -m <maximum insert size> -x <minimum insert size>
- Sort alignment result of each lane, first by chromosome names lexicographically, then by mapping coordinates on each chromosome numerically. This could be done by various kinds of sorting tools.
- Merge the sorted alignment results of all lanes into a single file, keeping the alignments sorted.
- Split the sorted and merged alignments, chromosome by chromosome, into different files, each file comprising alignments of a single chromosome.
Building Consensus Sequence
- Download flat ASN files of dbSNP2 database and allele frequency information if possible, such as HapMap3 data for human resequencing. Transform dbSNP information of each chromosome into a tab-delimited plain text file (hereafter, referred to as “chrN.dbSNP.txt”), which looks like:
chr1 201979756 1 1 0 0.161 0 0 0.839 rs568
The columns from left to right mean: name of chromosome, coordinate on the chromosome, whether the SNP has external allele frequency information (1 is true, 0 is false), whether the SNP is a validated dbSNP (1 is true, 0 is false), whether the SNP is actually an indel (1 is true, 0 is false), frequency of A, frequency of C, frequency of T, frequency of G, refSNP ID. For dbSNP sites that do not have allele frequency information, the frequencies can be arbitrarily determined as any positive values, which only imply what alleles have already been deposited in the database.
- Based on the alignment result, build consensus sequence for each chromosome by SoapSNP. Typically, the command line are as follows:
SoapSNP -i <Alignment.soap.sort.chrN> -d <chrN.fasta> -o <chrN.consensus> -r 0.00005 –e 0.0001 -t -u -L <Maximum Read Length> -M <chrN.mat> -s <chrN.dbSNP.txt> -2
The result of SoapSNP “chrN.consensus” has 17 columns: chromosome ID, coordinate on chromosome, reference genotype, consensus genotype, quality score of consensus genotype, best allele, average quality score of best allele, count of uniquely mapped best allele, count of all mapped best allele, second best allele, average quality score of second best allele, count of uniquely mapped second best allele, count of all mapped second best allele, sequencing depth of the site, rank sum test p_value, average copy number of nearby region, whether the site is a dbSNP (0 is false; 1 is true). These kinds of information would be used in SNP extraction process.
SNP extraction
- Extract potential SNP sites from each consensus files (“chrN.consensus”), where the called genotypes are different from reference one, and the reference genotypes are not “N”. For each potential SNP sites, record its distance to the neighboring potential SNP.
- Filter the raw SNP dataset to get the confident SNP’s that satisfy the following criteria: a) the quality score is larger than 20; b) each of the allele is supported by more than a certain number of reads (depending on the average sequencing depth, typically 2 for 20X data); c) the overall depth should be less than average plus 3 *standard deviation of sequencing depth; d) the average copy number of nearby region of the SNP is less than 2; e) each allele is supported by at least one paired-end reads; f) the site is at least 5bp away from another SNP.
Identification of indels
Align all paired-end reads to the reference genome using SOAP in gap alignment mode:
soap -a <Reads1.fastq> -b <Reads2.fastq> -d <Reference.fasta> -o <PEalignment.soap> -2 <SEalignment.soap> -p <# of parallel processes> -c 52 -s 12 -m <maximum insert size> -x <minimum insert size> -g 3 -e 5
Extract all gapped alignments from the SOAP result and merge them into a single file.
Sort all gapped alignments, first by chromosome names lexicographically, and then by coordinates on each chromosome.
Extract coordinates, sizes and numbers of supporting reads of potential indels from the alignments.
Select only one representative indel according to their numbers of supporting reads in each 10bp window. Thus indels are at least 10bp away from each other.
Filter all potential indels to get the confident ones that satisfy the following criteria: a) the indels are supported by at least 3 reads; b) number of ungapped alignment that cross the indels are no more than twice that of gapped reads.
Structural Variation Detection
- Extract all abnormally aligned paired-end reads, which have unexpected orientations and/or unexpected span size. Normally mapped read pairs are aligned in a forward-reverse pattern, i.e. the upstream read of a mapped pair is on the forward strand, and the downstream one is on the reverse strand.
- Cluster abnormally aligned paired-end reads that have same alignment orientation and similar coordinates (distance between coordinates of two read pairs is smaller than paired-end insert size) on both ends of read pairs.
- Fit all abnormal paired-end clusters into alignment models (Figure 1) and call potential structural variations (SVs).
- Merge redundant called SVs.
- Filter potential structural variations to get confident ones which satisfy the following criteria: a) each of the paired-end clusters that support the SV comprise at least 4 read pairs; b) the SV is not conflict with other SVs, otherwise they should be merged into a complex structural variation case.