A major problem in whole-genome sequencing of parasitic worm (nematode and flatworm) species is that the sequencing reads can be contaminated with reads from other species. These can arise from DNA of the host species (e.g. vertebrates, plants, etc.), from other species that are commensal in the host (e.g. bacteria), or from laboratory contamination. Here we describe a computational protocol to identify and remove likely contaminant DNA from the initial genome assembly for a parasitic worm.
See figure in Figures section.
To remove contaminant scaffolds from the initial genome assembly for a parasitic worm (nematode or flatworm), a multi-step approach is taken (Figure 1). In Step A, we take each scaffold (or 50 kb chunks of longer scaffolds) and run BLASTX (Altschul et al., 1997) against databases of invertebrate and non-invertebrate protein sequences. If a scaffold has far stronger BLASTX hits to non-invertebrate proteins (e.g. from vertebrates or bacteria) than to invertebrate proteins, it is considered a likely contaminant scaffold and is removed.
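As a rough illustration of Step A, the sketch below splits a scaffold into 50 kb chunks and flags a chunk when its best non-invertebrate bit score far exceeds its best invertebrate bit score. The function names, the use of bit scores, and the `min_ratio` and `min_score` cutoffs are illustrative assumptions: the protocol text says only that hits must be "far stronger", and does not give these values.

```python
CHUNK_SIZE = 50_000  # 50 kb chunks, as described for Step A


def chunk_scaffold(name, seq, chunk_size=CHUNK_SIZE):
    """Yield (chunk_id, subsequence) pairs covering one scaffold.

    Chunk ids use 1-based inclusive coordinates (a naming convention
    chosen here for illustration only).
    """
    for start in range(0, len(seq), chunk_size):
        end = min(start + chunk_size, len(seq))
        yield f"{name}:{start + 1}-{end}", seq[start:end]


def is_likely_contaminant(best_invert, best_noninvert,
                          min_ratio=2.0, min_score=100.0):
    """Compare the best BLASTX bit scores of one chunk (0.0 means no hit).

    min_ratio and min_score are hypothetical thresholds standing in for
    "far stronger"; they are not taken from the protocol.
    """
    if best_noninvert < min_score:
        return False  # no convincing non-invertebrate hit at all
    return best_noninvert >= min_ratio * best_invert
```

In practice the two best scores per chunk would be parsed from BLASTX output for the invertebrate and non-invertebrate databases; requiring a minimum score as well as a ratio is one way to keep false positives low.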
The second step, Step B, requires a gene set for the assembly, and runs BLASTP between the predicted proteins for this gene set and the same databases searched in Step A. As in Step A, scaffolds with far stronger BLASTP hits to non-invertebrate proteins are considered to be likely contaminants and are removed. Step B often detects additional contaminant scaffolds missed by Step A.
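Because Step B works on predicted proteins, the per-protein BLASTP results have to be mapped back to scaffolds before a scaffold can be judged. The sketch below shows one possible way to do this, by summing the best bit scores of all proteins on each scaffold. The summing rule, the function name `flag_scaffolds`, and both thresholds are assumptions made for illustration; the protocol text does not specify how protein-level hits are combined.

```python
from collections import defaultdict


def flag_scaffolds(protein_to_scaffold, best_scores,
                   min_ratio=2.0, min_score=100.0):
    """Return a sorted list of scaffolds flagged as likely contaminants.

    protein_to_scaffold: protein_id -> scaffold_id (from the gene set)
    best_scores: protein_id -> (best_invert_bits, best_noninvert_bits)
    min_ratio, min_score: hypothetical cutoffs for "far stronger".
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for protein, scaffold in protein_to_scaffold.items():
        # Proteins with no BLASTP hit contribute nothing to either total.
        invert, noninvert = best_scores.get(protein, (0.0, 0.0))
        totals[scaffold][0] += invert
        totals[scaffold][1] += noninvert
    return sorted(
        scaffold
        for scaffold, (invert, noninvert) in totals.items()
        if noninvert >= min_score and noninvert >= min_ratio * invert
    )
```

For example, a scaffold whose only protein scores 450 bits against a bacterial protein and 20 bits against any invertebrate protein would be flagged, whereas a scaffold with strong invertebrate hits across several proteins would not.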
Step C is designed to detect contamination of the parasitic worm’s assembly by other invertebrates (e.g. flatworm contamination in a nematode species’ assembly). It is similar to Step B, but carries out additional BLASTP searches against a database of nematode or flatworm protein sequences. If a scaffold has far stronger BLASTP hits to invertebrate proteins from another phylum (e.g. to flatworm proteins, if the assembly being de-contaminated is from a nematode) than to proteins from the worm’s own phylum, it is considered a likely contaminant and is removed.
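The cross-phylum comparison in Step C can be sketched in the same spirit. Here the "other" phylum is looked up from the phylum of the assembly, and a scaffold is flagged when its best hit to the other phylum far exceeds its best hit to its own phylum. The lookup table, the function name, and the thresholds are hypothetical and serve only to make the logic concrete.

```python
# Which phylum counts as the potential contaminant for each assembly type.
OTHER_PHYLUM = {"nematode": "flatworm", "flatworm": "nematode"}


def cross_phylum_contaminant(assembly_phylum, best_by_phylum,
                             min_ratio=2.0, min_score=100.0):
    """Decide whether a scaffold looks like it comes from the other phylum.

    best_by_phylum: phylum name -> best BLASTP bit score for the scaffold
    (missing keys are treated as "no hit"). min_ratio and min_score are
    assumed cutoffs, not values from the protocol.
    """
    other = OTHER_PHYLUM[assembly_phylum]
    own_score = best_by_phylum.get(assembly_phylum, 0.0)
    other_score = best_by_phylum.get(other, 0.0)
    return other_score >= min_score and other_score >= min_ratio * own_score
```

A scaffold in a nematode assembly with a best flatworm score of 600 bits and a best nematode score of 80 bits would be flagged under these assumed cutoffs.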
Helminth genomes can be very large (e.g. Fasciola hepatica, ~1.3 Gb; Cwiklinski et al., 2015), so this approach is designed to be easy to run and to scale, in terms of run-time, to many large parasitic worm genomes with little or no manual analysis. It is also designed to produce few false positives (non-contaminant scaffolds misclassified as contaminants).
Our contamination scan protocol is designed particularly for parasitic worm genome assemblies, and relies on a series of BLASTX and BLASTP searches against invertebrate and non-invertebrate sequence databases. In contrast, some other approaches to contamination scanning can be applied across a larger taxonomic breadth, and use additional data as well as BLAST searches. For example, Blobology (Kumar et al., 2013), although designed with nematode genomes as a test case, can be used for any eukaryotic genome assembly, and analyses not only top BLAST hits but also the proportion of GC bases and the read coverage to identify likely contaminant scaffolds.
Different contamination scan approaches are likely to disagree in their verdicts on some scaffolds. This may not matter much to the user in the case of very small scaffolds that lack any predicted genes. In contrast, missing a true contaminant scaffold that contains many protein-coding genes can have a large effect on downstream analyses (e.g. of orthology). We therefore suggest that users try both our protocol and others (e.g. Blobology) to check whether any additional large putative contaminant scaffolds are identified by one approach but not another. Such putative contaminants can then be subjected to manual scrutiny before deciding whether to discard them from a genome assembly.