Protocols for genomic and transcriptomic data production, assembly and quality control.
Genomic and transcriptomic data production for helminths
This work is licensed under a CC BY 4.0 License
This protocol has been posted on Protocol Exchange, an open repository of community-contributed protocols sponsored by Nature Portfolio. These protocols are posted directly on the Protocol Exchange by authors and are made freely available to the scientific community for use and comment.
posted 17 May, 2018
A multi-step protocol for preparation of draft genomes, specifically applicable to Nematoda. This includes initial library preparation, genome and transcriptome assembly, assembly QC/contamination screening and gene prediction. Both 454 and Illumina sequencing platforms are included.
A Genome sequencing library preparation
Paired end short insert libraries
454 Titanium fragment libraries are constructed with 5-10 µg of DNA according to the manufacturer's recommendations (Roche 454).
Illumina small-insert paired-end libraries are prepared according to the manufacturer's protocol with the exception that multiple library enrichment reactions and size selection are performed after amplification and multiple size fractions (300-400 and 400-500 bp) are collected.
454/Illumina 3 kb insert mate pair libraries
3 kb mate pair libraries are created as follows:
For 454 sequencing, FLX Titanium paired-end library adaptors are ligated onto the immobilized DNA fragments and processed as recommended by the manufacturer's 3 kb span paired-end library construction protocol (Roche 454). For Illumina sequencing, blunt-ended fragments are processed through an adenylation reaction. Illumina's TruSeq adaptors are ligated, the library is enriched with KAPA HiFi polymerase (KAPA Biosystems) and a final dual SPRI size selection is performed to isolate 300-500 bp library fragments.
454/Illumina 8 kb insert mate pair libraries
8 kb mate pair libraries are created as follows:
The final 300-500 bp library fragments are selected with a dual SPRI reaction.
Genomes sequenced on the Roche/454 platform are assembled from a combination of fragment reads, 3 kb paired-end reads and 8 kb paired-end reads generated to meet coverage criteria of 15x, 15x and 3x, respectively, with a target of 30x coverage for the final assembly. Genomes sequenced on the Illumina platform are assembled from overlapping fragment reads, 3 kb paired-end reads and 8 kb paired-end reads sequenced to depths of 45x, 45x and 10x, respectively.
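As a rough planning aid, the coverage targets above can be turned into read counts with the usual relation coverage = (number of reads x read length) / genome size. The sketch below is illustrative only: the genome size and read length are hypothetical inputs, the function names are not part of the protocol, and each mate of a pair is counted as one read.

```python
# Coverage targets per library type, taken from the protocol text.
TARGETS = {
    "454": {"fragment": 15, "3kb": 15, "8kb": 3},
    "illumina": {"fragment": 45, "3kb": 45, "8kb": 10},
}


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Smallest read count with reads * read_length / genome_size >= coverage."""
    total_bases = coverage * genome_size_bp
    # Integer ceiling division avoids floating-point rounding on large genomes.
    return (total_bases + read_length_bp - 1) // read_length_bp


def plan(platform, genome_size_bp, read_length_bp):
    """Read counts per library type for one platform's coverage targets."""
    return {
        library: reads_needed(genome_size_bp, coverage, read_length_bp)
        for library, coverage in TARGETS[platform].items()
    }


if __name__ == "__main__":
    # Hypothetical example: a 100 Mb genome sequenced with 100 bp reads.
    for library, count in plan("illumina", 100_000_000, 100).items():
        print(library, count)
```

In practice the read length differs between library types and platforms, so `plan` would normally be called once per library with its own read length.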
B Genome assembly
Assemblies are generated using the assembly workflows outlined in Fig. 1, with the specific method depending on the input material. Assemblies based on Roche 454 3 kb, 8 kb and fragment input follow the steps detailed in panel m1. Assemblies from Illumina 3 kb, 8 kb and fragment input use the workflow described in panel m2 and a reference-guided assembly method shown in panel m3.
Figure 1: The McDonnell Genome Institute Genome assembly pipelines
Assemblies built using a combination of Roche 454 3 kb, 8 kb and fragment input data are constructed as follows (Fig. 1 panel m1):
Finally, L_RNA_scaffolder [4] uses 454 cDNA data to further improve scaffolding.
Assemblies constructed from 3 kb, 8 kb and fragment Illumina sequences follow this methodology (Fig. 1 panel m2):
Finally, a reference-guided, assisted assembly approach is also available (Fig. 1 panel m3):
BLAT [8] is then used to compare the contigs created by Velvet to the contigs created by alignment to the reference. All Velvet contigs greater than 500 bp that map over less than 50% of their length (at >80% identity) to an existing contig are added to the assembly.
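The selection rule for Velvet contigs can be sketched as a small filter. This is a minimal illustration, not the pipeline's actual code: it assumes the BLAT hits have already been parsed into `(start, end, identity)` tuples on each Velvet contig (half-open coordinates, identity in percent), and it pools hits to all existing contigs when computing the mapped fraction, which is a simplification of "to an existing contig". All names are hypothetical.

```python
def mapped_fraction(length, hits, min_identity=80.0):
    """Fraction of a contig covered by hits with identity above the cut-off.

    hits: list of (start, end, identity) with half-open [start, end) coordinates.
    Overlapping hits are merged so that no base is counted twice.
    """
    spans = sorted((s, e) for s, e, identity in hits if identity > min_identity)
    covered = 0
    current_end = 0
    for start, end in spans:
        start = max(start, current_end)  # skip bases already counted
        if end > start:
            covered += end - start
            current_end = end
    return covered / length


def contigs_to_add(lengths, hits_by_contig, min_length=500, max_fraction=0.5):
    """Names of Velvet contigs > min_length that map over < max_fraction of their length."""
    selected = []
    for name, length in lengths.items():
        if length <= min_length:
            continue
        fraction = mapped_fraction(length, hits_by_contig.get(name, []))
        if fraction < max_fraction:
            selected.append(name)
    return selected


if __name__ == "__main__":
    lengths = {"velvet_1": 1000, "velvet_2": 1000, "velvet_3": 400}
    hits = {
        "velvet_1": [(0, 300, 95.0), (200, 400, 90.0), (400, 900, 70.0)],
        "velvet_2": [(0, 600, 99.0)],
    }
    print(contigs_to_add(lengths, hits))
```

Here `velvet_1` is selected because only 400 of its 1000 bases are covered at >80% identity, `velvet_2` is excluded because it maps over 60% of its length, and `velvet_3` is excluded because it is not longer than 500 bp.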
C Assembly QC / Contamination screening
All assemblies are screened for contamination, which is removed before annotation.
Longer contigs that fall on the border of the screening requirements are manually reviewed as an extra safeguard against removing true genome contigs.
D Transcriptome sequencing and assembly
Assembled RNA-seq data are used alongside EST data in the MAKER stage of gene prediction.
The assembled contigs are assessed for quality by alignment (with TopHat2 [10]) back to the reference assembly to establish the percentage of the reference covered by the reads and the percentage of reads that align to the reference.
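The two quality figures mentioned here can be computed as in the following sketch. It assumes the alignments have already been reduced to `(reference_name, start, end)` intervals with half-open coordinates and that the total and aligned read counts are known. The function names and input layout are hypothetical, and the per-base mask is chosen for clarity rather than memory efficiency.

```python
def percent_reads_aligned(total_reads, aligned_reads):
    """Percentage of reads that aligned to the reference."""
    return 100.0 * aligned_reads / total_reads


def percent_reference_covered(ref_lengths, alignments):
    """Percentage of reference bases covered by at least one alignment.

    ref_lengths: dict of reference sequence name -> length in bp.
    alignments: iterable of (name, start, end), half-open [start, end).
    """
    # One byte per reference base; 1 marks a base touched by any alignment.
    masks = {name: bytearray(length) for name, length in ref_lengths.items()}
    for name, start, end in alignments:
        start = max(start, 0)
        end = min(end, ref_lengths[name])  # clamp so the mask never grows
        if end > start:
            masks[name][start:end] = b"\x01" * (end - start)
    covered = sum(sum(mask) for mask in masks.values())
    return 100.0 * covered / sum(ref_lengths.values())


if __name__ == "__main__":
    refs = {"contig_1": 1000, "contig_2": 1000}
    alns = [("contig_1", 0, 100), ("contig_1", 50, 150), ("contig_2", 900, 1000)]
    print(percent_reference_covered(refs, alns))
    print(percent_reads_aligned(200, 150))
```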
E Gene prediction
Gene prediction is run on assemblies as follows:
SNAP and Augustus models are generated where possible using the MAKER pipeline and species-specific evidence. A consensus gene set from the above prediction algorithms is generated, using a logical, hierarchical approach developed at MGI.
Figure 2: The McDonnell Genome Institute Gene-finding pipeline.
High confidence gene selection
A high confidence gene set is created from MAKER [17] output:
b) If QI and QI are >0, or QI is >0, then the gene is kept.
c) Genes are retained if they match Swiss-Prot [18] using BLAST (E<1e-06).
d) Genes are retained if they match Pfam [19] using RPSBLAST (E<1e-03).
e) RPSBLAST is run against CDD [20] (E<1e-03 and coverage >40%). Genes that meet both cut-offs are kept.
f) If no hit is recorded, the gene is retained if it has ≥55% identity to the genes database from KEGG [21] and a bitscore of ≥35.
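Steps c) to f) amount to a cascade of evidence checks, which might be sketched as follows. This is an illustration under stated assumptions: the `Evidence` record and its field names are hypothetical, each field is assumed to hold the best hit for that database (or `None` when no hit exists), and coverage and identity are expressed in percent. The QI-based step b) is left out because it depends on MAKER's quality index fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Best-hit statistics for one gene; None means no hit was recorded."""
    swissprot_evalue: Optional[float] = None
    pfam_evalue: Optional[float] = None
    cdd_evalue: Optional[float] = None
    cdd_coverage: Optional[float] = None   # percent
    kegg_identity: Optional[float] = None  # percent
    kegg_bitscore: Optional[float] = None


def supporting_step(ev: Evidence) -> Optional[str]:
    """Return the first step (c-f) that retains the gene, or None if none does."""
    if ev.swissprot_evalue is not None and ev.swissprot_evalue < 1e-6:
        return "c"  # Swiss-Prot BLAST hit
    if ev.pfam_evalue is not None and ev.pfam_evalue < 1e-3:
        return "d"  # Pfam RPSBLAST hit
    if (ev.cdd_evalue is not None and ev.cdd_coverage is not None
            and ev.cdd_evalue < 1e-3 and ev.cdd_coverage > 40):
        return "e"  # CDD hit meeting both cut-offs
    if (ev.kegg_identity is not None and ev.kegg_bitscore is not None
            and ev.kegg_identity >= 55 and ev.kegg_bitscore >= 35):
        return "f"  # KEGG fallback when no earlier hit qualified
    return None


if __name__ == "__main__":
    gene = Evidence(cdd_evalue=1e-5, cdd_coverage=45.0)
    print(supporting_step(gene))
```

Returning the step label rather than a plain boolean keeps a record of which evidence supported each gene, which can help during the later manual review.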
Additional curation of gene sets
Depending on the nature of the final gene set in relation to the assembly quality, some gene sets undergo an additional manual review of short genes lacking definitive evidence. After the high confidence gene selection steps described above, shorter single- and double-exon genes and genes annotated as hypothetical (with neither KEGG nor InterPro homologies) are further scrutinized. The Annotation Edit Distance (AED) is reviewed manually in combination with the QI scores (both provided by MAKER), enabling analysts to make a more informed decision about whether to keep or discard each such gene.
The protocol yields raw sequence data, genomic and/or transcriptomic assemblies and a high confidence gene set.
References
1. Dodt, M., Roehr, J. T., Ahmed, R. & Dieterich, C. FLEXBAR-Flexible Barcode and Adapter Processing for Next-Generation Sequencing Platforms. Biology (Basel) 1, 895-905, doi:10.3390/biology1030895 (2012).
2. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114-2120, doi:10.1093/bioinformatics/btu170 (2014).
3. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-380 (2005).
4. Xue, W. et al. L_RNA_scaffolder: scaffolding genomes with transcripts. BMC Genomics 14, 604, doi:10.1186/1471-2164-14-604 (2013).
5. Butler, J. et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res 18, 810-820, doi:10.1101/gr.7337908 (2008).
6. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760, doi:10.1093/bioinformatics/btp324 (2009).
7. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18, 821-829 (2008).
8. Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res 12, 656-664, doi:10.1101/gr.229202 (2002).
9. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 357-359, doi:10.1038/nmeth.1923 (2012).
10. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 14, R36, doi:10.1186/gb-2013-14-4-r36 (2013).
11. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644-652, doi:10.1038/nbt.1883 (2011).
12. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152, doi:10.1093/bioinformatics/bts565 (2012).
13. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research 25, 955-964 (1997).
14. Nawrocki, E. P. et al. Rfam 12.0: updates to the RNA families database. Nucleic Acids Research 43, D130-137, doi:10.1093/nar/gku1063 (2015).
15. Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59, doi:10.1186/1471-2105-5-59 (2004).
16. Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Research 34, W435-439, doi:10.1093/nar/gkl200 (2006).
17. Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18, 188-196, doi:10.1101/gr.6743907 (2008).
18. Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Research 28, 45-48 (2000).
19. Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Research 42, D222-230, doi:10.1093/nar/gkt1223 (2014).
20. Marchler-Bauer, A. et al. CDD: NCBI's conserved domain database. Nucleic Acids Research 43, D222-226, doi:10.1093/nar/gku1221 (2015).