Whole-exome sequencing (WES) and whole-genome sequencing (WGS) are increasingly used to identify disease-causing genetic variants of individuals affected with a genetically undiagnosed rare disorder1–3. Depending on the disease entity, analysis of WES or WGS achieve a diagnosis rate of up to 50%2–5. One of the main challenges in these DNA-based molecular diagnostics approaches is the interpretation of the non-coding variants. Many of these variants may have a regulatory role in controlling RNA abundance or splicing. However, computational predictions of their regulatory effects remain uncertain6. To advance diagnostics, RNA sequencing (RNA-seq) has emerged as a complementary tool because it directly probes gene expression, thereby providing functional data supporting the clinical interpretation of variants. Moreover, RNA-seq can reveal genes with a regulatory defect, even in the absence of candidate regulatory variants, which is particularly important in the situation where WES instead of WGS has been performed. Pilot studies systematically using RNA-seq have obtained increased diagnostic rates over DNA-sequencing alone for a variety of rare disorders ranging between 8% and 36%7–11.
As the first studies that systematically used RNA-seq for the genetic diagnosis of rare diseases were published only two years ago7,8, the field is new and lacks established workflows. Here, we provide a step-by-step protocol that instructs how to use RNA-seq data in this context by supporting the detection of aberrant events. Specifically, we provide detailed instructions to identify aberrant expression, aberrant splicing, and mono-allelic expression (MAE) of a rare allele by starting from Binary Sequence Alignment Map (BAM)12 and Variant Call Format (VCF) files13 (Table 1). The protocol covers a number of quality control steps and plots that evaluate the counting and fitting procedures of each pattern. In order to control sample mixing, the protocol includes a validation step comparing variants called from RNA and DNA. Lastly, we describe how to analyze individual samples using the results and objects derived from the pipeline.
Application of the protocol
The aim of this protocol is to provide a one-in-all computational workflow starting from VCF and BAM files to call and visualize all forms of aberrant gene expression that have been currently used for rare disease diagnosis purposes. It is based on DROP (Detection of RNA Outliers Pipeline), a computational workflow that automatizes and standardizes all the steps needed to obtain the results from raw input files. The workflow uses the Snakemake framework14 (Box 1) to guarantee reproducible results. Overall, our workflow can be easily implemented and provides a comprehensive collection of tools enabling the detection of outliers using RNA-seq.
To showcase the workflow, we have created a dataset based on 100 samples from the Geuvadis dataset15 (Table S1), which is accessible without restriction under https://www.ebi.ac.uk/Tools/geuvadis-das/. We refer to this dataset as the demo dataset. DROP, including a small test dataset, is publicly available at https://github.com/gagneurlab/drop.
Advantages and limitations
Currently, only one computational workflow for RNA-seq based diagnostics of rare disorder has been published9. That pipeline includes alignment and quality controls of RNA-seq reads, functions to filter and annotate variants, and prioritize candidates. The main advantages of the workflow proposed here is that, using DROP, we automatize most of the steps, integrate the specialized methods OUTRIDER16 and FRASER17, add a module to test for MAE and VCF-BAM matching, include new quality control plots to evaluate the different steps, and render all results and plots in HTML pages. DROP is designed with a modular architecture, such that the aberrant expression, aberrant splicing, and MAE modules can be executed without depending on each other. Moreover, this architecture allows new modules with different functionalities to be added. Another advantage of DROP is that it uses the workflow management framework Snakemake14 to run and monitor the execution of the pipeline (Box 1). Every time the pipeline is executed, Snakemake computes the dependencies among all scripts and data files. Then, it checks if either a data file or a script were modified or added. If a script is added or changed, Snakemake will execute it, together with the downstream steps. If a data file is added or changed, Snakemake will execute any script using it, and all the downstream scripts in cascade. Hence, this workflow management framework ensures that the results reflect the latest scripts and input data files while avoiding unnecessary executions of scripts lying upstream or parallel to the modifications.
Another advantage of DROP is the way in which it manages analysis groups. Users might not want to merge all the available samples into one analysis, but instead create subsets with, for example, samples extracted from the same tissue or belonging to the same disease entity. A sample can belong to one or multiple groups described in the sample annotation. The groups are recognized by DROP and a separate analysis is created for each one of them.
RNA-seq is not the first tier diagnostic tool. Its power is in the detection of functional relevant consequences of DNA variants affecting gene expression levels and splicing. In this way it complements DNA sequencing by supporting interpretation of variants of uncertain significance and reprioritizing variants overlooked by DNA sequencing. RNA-seq has some inherent limitations in diagnostics (the highest reported diagnosis rate increases after WES using RNA-seq are 35%8 and 36%10). One possible reason is that the gene is not expressed in the available tissue. Researchers can investigate a priori whether genes are expressed in a certain tissue by using public datasets like GTEx18. Another reason is that the expression or splicing are at normal levels in the available tissue, unlike in the affected tissue10. A final reason is that the disease-causing variant(s) (e.g. missense) might not have any effect on the transcript.
In order to confidently detect outliers, a large sample size is needed. The power analysis done in OUTRIDER concluded that the cohort should contain at least 50-60 samples16. This limitation can be overcome by integrating control samples from publicly available cohorts like GEUVADIS15 or GTEx18. We advise to select case and control samples from the same tissues, and to align reads using the same aligner and parameters, before processing them jointly with DROP.
Assuming that local pipelines are established for alignment and variant calling, DROP takes as input BAM and VCF files (Table 1) and does not provide a built-in solution for those steps. However, this protocol provides in Materials instructions for those steps as well. Another limitation is that we describe only one method to compute the results of each type of aberrant expression pattern. Nevertheless, the user can use the count tables and modify the pipeline to apply any other method to them. Furthermore, this protocol does not include variant calling from RNA-seq, which has limited value when WGS is available, but is useful in the situation where only WES is available8,10.
Our workflow is composed of three main steps: i) preparing the input data, ii) fitting the models and extracting the results from aberrant expression, aberrant splicing, and MAE, and iii) analyzing the individual results (Fig. 1). The input data are raw BAM files, VCF files, a sample annotation table, a configuration file containing the pipeline’s parameters, a human reference genome (FASTA) file, and a gene annotation (gtf) file (Table 1). DROP then generates intermediate files (e.g., read count matrices), final results, and produces HTML pages for convenient visualization using the framework wBuild (Box 1). The generated files are stored as either tab-separated text (tsv) files or binary R data (Rds) files, depending on their content. See Table 1 for file format overview. Finally, users can access the objects and results in order to plot and analyze the samples and genes of interest.
Aberrantly expressed genes, or expression outliers, refer to genes with an expression level aberrantly higher or lower in a sample with respect to other samples. Calling expression outliers in individuals affected with a rare disorder has been successfully used to identify disease-causing genes and to identify or confirm potential disease-causing variants7,9,10.
To detect aberrant expression, we first compute an RNA-seq read count matrix by counting for each sample the sequencing reads that fully overlap each gene (Methods). Then, we apply OUTRIDER16, a statistical method that computes significance level for extreme read count values while controlling for hidden confounders16. We extract P values, z scores and fold changes for each sample–gene combination from the OUTRIDER returned objects. After correcting for multiple testing, there is typically a handful of outlier genes per sample at a false discovery rate of 0.05. For example there is a median of 1 outlier per sample in the demo dataset, consistent with observations on other datasets7,16. If a sample has more than 0.1% of expressed genes as outliers, we refer to it as an aberrant sample. Pinpointing a candidate disease-causing genes among the tens to hundreds of expression outliers of an aberrant sample is difficult. Aberrant samples should be carefully analyzed to find a possible explanation for their high number of outliers, e.g., in case of contamination, or if the sample belongs to a different population (tissue, ethnicity). A biological cause is not to be excluded. For instance, disruption of the function of a transcription factor may cause aberrant expression of many downstream genes.
Other methods to detect aberrant expression consist of applying cutoffs on z scores9,10 computed after regressing out contributions of hidden confounders estimated by PEER19 or SVA9. However, simulations and enrichment analysis of rare variants among expresion outliers have shown that OUTRIDER outperformed these methods16.
Aberrant splicing is a major cause of Mendelian disorders20,21. Despite progresses in predicting aberrant splicing from sequence22,23, the task remains difficult because splicing involves a complex set of cis-regulatory elements that are not yet fully understood. Moreover, some variants affecting splicing cannot be detected by WES for being located in intronic sequences not covered by WES kit probes. Aberrant splicing events can instead be advantageously detected from RNA-seq data7–11,24–27.
The aberrant splicing module considers split reads that are reads aligning to separated location of a same chromosome and strand and therefore supportive of an exon-exon junction. This is done independently of exon annotation, allowing for calling splicing events de novo. In addition, reads spanning the exon-intron boundary as well as the exon-intron boundary are counted to enable intron detection. The counts are converted into two intron-centric metrics, percent-spliced-in (PSI) and splicing efficiency (theta), which are calculated for both the donor D (5’ splice site) and acceptor A (3’ splice site) sites of every intron28 (Methods). The theta_5 and theta_3 values are fitted together, hence we jointly refer to them as theta. Junctions expressed at low levels are filtered out (Methods). Then, for each of the metrics, the statistical method FRASER uses a denoising autoencoder to control for latent confounders and fits a beta-binomial distribution on every intron or splice site17. P values, z scores and Delta Psi 5, Delta Psi 3, and Delta theta values are then computed from the beta-binomial fits, where the Delta values are defined as the difference between the observed and the expected values. FRASER then reports gene-level P values as well as splice site-level P values. Similar to expression outliers, there is typically a handful of aberrant splicing events per sample that are significant at a FDR <0.05. For example, the median number of outlier events in the demo dataset are 2 for psi 5, 2 for psi 3, and 5 for theta.
Three other methods have been used to detect aberrant splicing. Specifically, (i) an adaptation of the differential splicing test LeafCutter29 used in the Kremer et al. study7, (ii) a cutoff based approach used in the studies of Cummings et al.8 and Gonorazky et al.10, and (iii) a z score based method used in the Frésard et al. study9. The first one constructs intron clusters and tests for differential usage between one sample and all others, instead of aberrant splicing events. Moreover, it does not control for sample covariation. The second one defines aberrant splicing events as novel introns in genes with enough reads in the affected individual but not (or almost not) appearing in a control cohort, after applying local normalization. The two main caveats are that it depends on arbitrary cut-offs and may fail to recognize aberrant splicing events in weak splice sites30. The third one, corrects for covariation, but uses a z score approach which does not offer any control for false discovery rate and can be inaccurate in splice sites with low reads. Having in mind these limitations, we have opted for FRASER17, which addresses all those issues.
MAE refers to expression of a single allele out of the two alleles of a gene. When assuming a recessive mode of inheritance, single heterozygous rare variants are not prioritized after DNA sequencing. However, mono-allelic expression of such a single heterozygous rare variant in an affected individual, which could be due to genetic or epigenetic silencing of the other allele, is consistent with a recessive mode of inheritance. Therefore, detecting MAE of a rare variant has helped diagnosing rare disorders7,9,10,26,31,32.
Detecting mono-allelically expressed genes relies on counting the reads aligned to each allele at genomic positions of heterozygous variants.
Several methods have been developed to detect MAE in the context of rare diseases, among which are the one described by Kremer et al.7, and more recently ANEVA-DOT31. On the one hand Kremer et al. used a negative binomial test with a fixed dispersion for all genes. On the other hand, ANEVA-DOT implements a binomial-logit-normal test with gene-specific variance VG, with the caveat that due to insufficient training data, estimates of VG have been computed so far for only 4,962 genes (in median), depending on the tissue of interest31. As using ANEVA-DOT would result in losing more than half of the tested genes, we opted for the training data-independent negative binomial test (Methods). In the demo dataset, we found 364 heterozygous SNVs to be mono-allelically expressed, 79 of which where toward the alternative allele, and 3 of which were rare (MAX AF ≤ 0.001).
A crucial step when performing multi-omics is to ascertain that all assays performed on samples obtained from the same individual correspond. Therefore, we designed an algorithm to match the variants derived from DNA and RNA sequencing based on the ideas proposed by t’ Hoen et al.33 and Lee et al.34 (Methods). We achieved this by comparing the BAM files from RNA-Seq with the VCF files from DNA sequencing at predefined genomic positions of variants that are not in linkage disequilibrium. The proportion of variants derived from the same individual that match in the DNA and RNA has to be significantly higher than the one from different individuals, but will not reach 100% due to MAE33. The algorithm is applied not only to the annotated VCF-BAM samples, but to all combinations in order to find other possible unannotated matches.
In the demo dataset, the median of the proportion of matching genotypes of samples from the same individual is 0.98 compared to 0.58 in case of non matching samples. In the case of family members, we expect a value in-between these. All of the 100 RNA samples from the demo dataset match their corresponding DNA, as expected since it is a quality-controlled public dataset. This procedure is part of the MAE module, but can also be run independently. We provide a VCF file containing the set of genomic positions to be tested in our resource web server.
Detection of RNA Outliers Pipeline
To automatize and standardize the generation of the results, we developed the computational workflow DROP. DROP consists of three independent modules (Fig. 1). It is built using R and Python on top of the workflow manager frameworks Snakemake14 and wBuild (Box 1). DROP offers a parallelized backend through Snakemake to execute the same processes on different samples at the same time, thus making effective use of computer clusters. It runs on the command line. Detailed instructions for installation, editing the input files and executing the pipeline can be found in the DROP documentation. The functions used throughout the workflow are explained in detail the Methods section.
Here we used a subset of 100 samples from the Geuvadis dataset to test our pipeline and generate the plots. We expect many groups to implement and benefit from our workflow, as using RNA-seq for rare disease diagnosis is growing in popularity. Finally, as the source code is open source and the design of our pipeline is flexible, we encourage the community to expand the pipeline with new modules, e.g. by adding other multi-omics modules, prediction models, or calling variants on RNA-seq data.
The installation, creation of a Python environment and set up of the config file should be done preferably by either a bioinformatician or a system administrator. Afterwards, the protocol can be followed by any scientist.
Box 1. Workflow Management Systems used in DROP
Snakemake (https://snakemake.readthedocs.io/en/stable) is a workflow management system to create data analyses guaranteeing reproducibility, automation, and scalability36. Snakemake workflows consist of rules that describe how to create output files from the respective input files. These output files are created by defining instructions in shell, Python or R code. Rule dependencies are determined bottom-up by matching file names with the support of automatically inferred named wildcards in the rules. Snakemake allows the usage of multiple scheduling systems or parallel backends (e.g., SGE, SLURM, multi core) to efficiently use HPC systems and run executions in parallel by automatically determining parallel parts in the workflow.
wBuild (https://wbuild.readthedocs.io) is a framework that automatically creates Snakemake dependencies, pipeline rules based on R markdown scripts and compiles the analysis results into a navigable HTML page. All information needed such as input, output, number of threads, and even Python code, is specified in a YAML header inside the R script file, thereby keeping code and dependencies together.
Fig. 1 | Workflow overview. As input, DROP requires a configuration file, a sample annotation file, BAM files from RNA-seq, and VCF files. DROP processes then for each module the input data and generates count tables, overview plots (e.g. sample covariation heatmap), quality control plots, and result tables. Lastly, users can perform case-by-case analysis with the help of different visualizations.
Fig. 2 | The Aberrant Expression Module. a, Histogram of raw read count distribution colored by different filtering strategies: all, minimum 1 count in 5% of the samples, minimum 10 counts in 5% of the samples, and minimum 1 (or user-defined) FPKM in 5% of the samples. b, Number of expressed genes cumulative across all samples. Colors represent the union of all detected genes (blue), genes that passed the FPKM filter as a group (violet), genes that are expressed on each sample (red), and the intersection of expressed genes (green). c, Heatmap of the correlation of row-centered log-transformed read counts between samples before and after (d) the autoencoder correction including the samples’ sexes and the laboratories where they were sequenced. Before correction, samples cluster by laboratory.
Fig. 3 | The aberrant splicing module. a, Histogram showing the junctions that passed the filter (Methods). b, Number of aberrant splicing events per sample by gene colored by metric. c, Heatmap of the correlation of row-centered logit-transformed read count ratios between samples before and after (d) the correction.
Fig. 4 | Mono-allelic Expression Module. a, Cascade plot showing the number of heterozygous SNVs with at least 10 reads (median=11,532), that are mono-allelically expressed (median=364), mono-allelically expressed toward the alternative allele (median=79) and rare (median=3). The lower and upper limits of the boxes correspond to the first and third quartiles, respectively. Whiskers extend to the most extreme value, no further than 1.5x the interquartile range. b, Histogram (y-axis in log scale) of the proportion of matching genotypes from DNA and RNA when comparing all sample combinations. The median of matching samples is 0.98 compared to 0.58 of non-matching samples. One sample had a very low rate of matching with the rest of samples (median=0.26).
Fig. 5 | Analysis plots. a, Negative log-transformed nominal P values (y-axis) versus z scores (x-axis) derived from the expression of all genes of sample NA20804 showing one over- and one underexpression outlier (red). b, Normalized counts (y-axis) of gene MUTYH of all samples (x-axis) reflecting one underexpression outlier (red). c, Representative sashimi plot showing the creation of a new exon for Sample 1 which is skipped by the other two samples. The height represents the coverage in log10 RPKM and the number of the arcs the junction’s coverage. On the bottom, the corresponding gene model is depicted (in red the cryptic exon). d, Fold change between the counts of the alternative and reference alleles (y-axis) compared against the total coverage of each heterozygous SNV (with at least 10 reads) of sample HG00096, highlighted by significance (orange), and significance and rarity (red).
Table 1 | File formats
Table 2 | Files created by DROP and their locations. The <root>, <htmlOutputPath>, and <geneAnnotation> variables are defined in the config file. The <group> variable corresponds to the DROP_GROUP column defined in the sample annotation.