Protocols for genomic and transcriptomic data production, assembly and quality control.
Genomic and transcriptomic data production for helminths
This work is licensed under a CC BY 4.0 License
This protocol has been posted on Protocol Exchange, an open repository of community-contributed protocols sponsored by Nature Portfolio. These protocols are posted directly on the Protocol Exchange by authors and are made freely available to the scientific community for use and comment.
posted 17 May, 2018
A multi-step protocol for preparation of draft genomes, specifically applicable to Nematoda. This includes initial library preparation, genome and transcriptome assembly, assembly QC/contamination screening and gene prediction. Both 454 and Illumina sequencing platforms are included.
A Genome sequencing library preparation
Paired end short insert libraries
454 Titanium fragment libraries are constructed with 5-10 µg of DNA according to the manufacturer's recommendations (Roche 454).
Illumina small-insert paired-end libraries are prepared according to the manufacturer's protocol with the exception that multiple library enrichment reactions and size selection are performed after amplification and multiple size fractions (300-400 and 400-500 bp) are collected.
454/Illumina 3 kb insert mate pair libraries
3 kb mate pair libraries are created as follows:
454/Illumina 8 kb insert mate pair libraries
8 kb mate pair libraries are created as follows:
Genomes sequenced on the Roche/454 platform are assembled from a combination of fragment reads, 3 kb paired-end reads and 8 kb paired-end reads generated to meet coverage criteria of 15x, 15x and 3x, respectively, with a target of 30x coverage for the final assembly. Genomes sequenced on the Illumina platform use overlapping fragment reads and 3 kb and 8 kb paired-end reads, sequenced to depths of 45x, 45x and 10x, respectively.
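The coverage criteria above translate directly into sequencing yield per library. The following is a minimal sketch of that arithmetic, not part of the original protocol: the function names, the example genome size and the example read length are illustrative assumptions, and only the fold-coverage values come from the text.

```python
# Fold-coverage targets per library type, as stated in the protocol text.
COVERAGE_TARGETS = {
    "454": {"fragment": 15, "3kb": 15, "8kb": 3},
    "illumina": {"fragment": 45, "3kb": 45, "8kb": 10},
}


def bases_required(genome_size_bp, platform):
    """Return the number of bases to sequence for each library type."""
    targets = COVERAGE_TARGETS[platform]
    return {lib: fold * genome_size_bp for lib, fold in targets.items()}


def reads_required(genome_size_bp, platform, read_length_bp):
    """Return the number of reads per library type, rounded up."""
    per_library = bases_required(genome_size_bp, platform)
    # Integer ceiling division avoids floating-point rounding surprises.
    return {
        lib: (total + read_length_bp - 1) // read_length_bp
        for lib, total in per_library.items()
    }


if __name__ == "__main__":
    # Hypothetical inputs: a 100 Mb genome and 100 bp Illumina reads.
    for lib, n_reads in reads_required(100_000_000, "illumina", 100).items():
        print(lib, n_reads)
```

Note that the three 454 targets sum to 33x of raw coverage, slightly above the 30x target quoted for the final assembly.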
B Genome assembly
Assemblies are generated using the assembly workflows outlined in Fig. 1, with the specific method depending on the input material. Assemblies based on Roche 454 3 kb, 8 kb and fragment input follow the steps detailed in panel m1. Assemblies from Illumina 3 kb, 8 kb and fragment input use the workflow described in panel m2 and the reference-guided assembly method shown in panel m3.
Figure 1: The McDonnell Genome Institute Genome assembly pipelines
Assemblies built from a combination of Roche 454 3 kb, 8 kb and fragment input data are constructed as follows (Fig. 1, panel m1)
Assemblies constructed from 3 kb, 8 kb and fragment Illumina sequences follow this methodology (Fig. 1, panel m2)
Finally, a reference-guided, assisted assembly approach is also used (Fig. 1, panel m3)
C Assembly QC / Contamination screening
All assemblies are screened to remove contamination before annotation.
D Transcriptome sequencing and assembly
Assembled RNA-seq data are used alongside EST data in the MAKER stage of gene prediction.
E Gene prediction
Gene prediction is run on assemblies as follows:
Figure 2: The McDonnell Genome Institute Gene-finding pipeline.
High confidence gene selection
A high confidence gene set is created from MAKER (ref. 17) output:
b) If QI and QI are >0, or QI is >0, then the gene is kept.
c) Genes are retained if they match Swiss-Prot (ref. 18) using BLAST (E<1e-06).
d) Genes are retained if they match Pfam (ref. 19) using RPSBLAST (E<1e-03).
e) RPSBLAST is run against CDD (ref. 20) (E<1e-03 and coverage >40%). Genes that meet both cut-offs are kept.
f) If no hit is recorded, the gene is retained if it has ≥ 55% identity to a sequence in the KEGG (ref. 21) genes database and a bitscore of ≥ 35.
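The homology cut-offs in steps c) to f) can be sketched as a single filter function. This is an illustrative sketch under stated assumptions, not the pipeline's actual code: the `GeneEvidence` record and its field names are hypothetical, the steps are read as alternatives (a gene is kept as soon as any one criterion passes), and the QI rule of step b) is supplied as a precomputed flag because its exact form is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeneEvidence:
    """Hypothetical per-gene summary of the evidence used in steps b) to f)."""
    qi_pass: bool = False                      # outcome of the QI rule, computed elsewhere
    swissprot_evalue: Optional[float] = None   # best BLAST E-value, None if no hit
    pfam_evalue: Optional[float] = None        # best RPSBLAST E-value against Pfam
    cdd_evalue: Optional[float] = None         # best RPSBLAST E-value against CDD
    cdd_coverage_pct: Optional[float] = None   # coverage of that CDD hit, in percent
    kegg_identity_pct: Optional[float] = None  # percent identity of best KEGG hit
    kegg_bitscore: Optional[float] = None      # bitscore of best KEGG hit


def is_high_confidence(ev: GeneEvidence) -> bool:
    """Return True if the gene passes any retention criterion (assumed OR logic)."""
    if ev.qi_pass:
        return True
    # c) Swiss-Prot BLAST hit with E < 1e-06
    if ev.swissprot_evalue is not None and ev.swissprot_evalue < 1e-6:
        return True
    # d) Pfam RPSBLAST hit with E < 1e-03
    if ev.pfam_evalue is not None and ev.pfam_evalue < 1e-3:
        return True
    # e) CDD RPSBLAST hit with E < 1e-03 and coverage > 40%
    if (ev.cdd_evalue is not None and ev.cdd_coverage_pct is not None
            and ev.cdd_evalue < 1e-3 and ev.cdd_coverage_pct > 40):
        return True
    # f) KEGG hit with identity >= 55% and bitscore >= 35
    if (ev.kegg_identity_pct is not None and ev.kegg_bitscore is not None
            and ev.kegg_identity_pct >= 55 and ev.kegg_bitscore >= 35):
        return True
    return False
```

Keeping each cut-off in its own branch makes it easy to compare the code against the listed criteria and to adjust a single threshold without touching the others.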
Additional curation of gene sets
Depending on the nature of the final gene set in relation to the assembly quality, some gene sets undergo an additional manual review of short genes lacking definitive evidence. After the high confidence gene selection steps described above, shorter single- and double-exon genes and genes annotated as hypothetical (with neither KEGG nor InterPro homologies) are further scrutinized. The Annotation Edit Distance (AED) is reviewed manually in combination with the QI scores (both provided by MAKER), enabling analysts to make a more informed decision about whether to keep or discard each such gene.
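The selection of genes for manual review can be sketched as a simple flagging step. This is a hedged illustration only: the dictionary keys, the function names and the `max_short_length_bp` default are hypothetical, since the protocol does not state a length threshold for "shorter" genes. Flagged genes are sorted by AED so that those with the least evidence support (AED closest to 1) appear first; the keep-or-discard decision itself remains with the analyst.

```python
def needs_manual_review(gene, max_short_length_bp=300):
    """Flag short one- or two-exon genes and unsupported hypothetical genes.

    max_short_length_bp is an assumed, illustrative cut-off; the protocol
    does not specify one.
    """
    short_few_exons = (
        gene["exon_count"] <= 2 and gene["length_bp"] <= max_short_length_bp
    )
    unsupported_hypothetical = (
        gene["hypothetical"]
        and not gene["has_kegg"]
        and not gene["has_interpro"]
    )
    return short_few_exons or unsupported_hypothetical


def review_queue(genes, max_short_length_bp=300):
    """Return flagged genes, highest AED (weakest evidence support) first."""
    flagged = [g for g in genes if needs_manual_review(g, max_short_length_bp)]
    return sorted(flagged, key=lambda g: g["aed"], reverse=True)
```

Each gene record here is a plain dictionary with the keys `exon_count`, `length_bp`, `hypothetical`, `has_kegg`, `has_interpro` and `aed`; in practice these values would be parsed from the MAKER output and the homology searches.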
The expected outputs are raw sequence data, genomic and/or transcriptomic assemblies, and a high confidence gene set.
1. Dodt, M., Roehr, J. T., Ahmed, R. & Dieterich, C. FLEXBAR-Flexible Barcode and Adapter Processing for Next-Generation Sequencing Platforms. Biology (Basel) 1, 895-905, doi:10.3390/biology1030895 (2012).
2. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114-2120, doi:10.1093/bioinformatics/btu170 (2014).
3. Margulies, M. et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437, 376-380 (2005).
4. Xue, W. et al. L_RNA_scaffolder: scaffolding genomes with transcripts. BMC Genomics 14, 604, doi:10.1186/1471-2164-14-604 (2013).
5. Butler, J. et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res 18, 810-820, doi:10.1101/gr.7337908 (2008).
6. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760, doi:10.1093/bioinformatics/btp324 (2009).
7. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18, 821-829 (2008).
8. Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res 12, 656-664, doi:10.1101/gr.229202 (2002).
9. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature methods 9, 357-359, doi:10.1038/nmeth.1923 (2012).
10. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome biology 14, R36, doi:10.1186/gb-2013-14-4-r36 (2013).
11. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644-652, doi:10.1038/nbt.1883 (2011).
12. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150-3152, doi:10.1093/bioinformatics/bts565 (2012).
13. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic acids research 25, 955-964 (1997).
14. Nawrocki, E. P. et al. Rfam 12.0: updates to the RNA families database. Nucleic acids research 43, D130-137, doi:10.1093/nar/gku1063 (2015).
15. Korf, I. Gene finding in novel genomes. BMC bioinformatics 5, 59, doi:10.1186/1471-2105-5-59 (2004).
16. Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic acids research 34, W435-439, doi:10.1093/nar/gkl200 (2006).
17. Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18, 188-196, doi:10.1101/gr.6743907 (2008).
18. Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic acids research 28, 45-48 (2000).
19. Finn, R. D. et al. Pfam: the protein families database. Nucleic acids research 42, D222-230, doi:10.1093/nar/gkt1223 (2014).
20. Marchler-Bauer, A. et al. CDD: NCBI's conserved domain database. Nucleic acids research 43, D222-226, doi:10.1093/nar/gku1221 (2015).
21. Kanehisa, M. The KEGG database. Novartis Found Symp 247, 91-101; discussion 101-103, 119-128, 244-252 (2002).