A clear definition of the scientific question driving genome-wide transcriptional profiling remains the key factor to determine the optimal experimental design.
A crucial point is the specification of the number of biological replicates to ensure sufficient statistical power. The optimal number of replicates depends mainly on the magnitude of differential expression (effect size) and the complexity of experimental design (number of covariates). However, as described in Alpern et al. article9, shifting from 20 to 5 replicates greatly reduces the ability to detect true differentially expressed genes (DEGs) (decrease statistical power). The low cost per sample is a great advantage for increasing the number of samples in an analysis and allows a high performance to detect DEGs.
In order to avoid batch effects, RNA extraction series have to be randomized according to biological conditions. A randomization of the samples on micro-well plates is also necessary to avoid rows/columns batch effects; we have developed an algorithm to randomize samples.
Libraries are prepared from 10ng of total RNA per sample, diluted in 4µl of RNase free water. RNA concentration and purity are measured on NanoDrop station (ThermoFisher scientific); OD 260/230 ratios should be around 1.8 - 2.0 to ensure that concentration is properly assessed. Quality of RNA samples is controlled on 2100 Bioanalyzer system (Agilent) with RNA 6000 Nano Kit (ref 5067-1511, Agilent).
Library construction (Fig.2) for 3’SRP is performed according to Soumillon et al.1
The mRNA poly(A) tails are tagged with universal adapters, well-specific barcodes and unique molecular identifiers (UMIs) during template-switching reverse transcription: 96 RNA samples are distributed in a MicroAmp Optical 96-well Reaction Plate (ref N8010560, Applied Biosystems) with 1µl of well-specific barcoded adapter E3V6NEXT (10µM, Integrated DNA Technologies) .
E3V6NEXT = 5'-/5Biosg/ACACTCTTTCCCTACACGACGCTCTTCCGATCT[BC6]N10T30VN-3'
5Biosg = 5’ biotin,
[BC6] = 6bp barcode specific to each cell/well,
N10 = Unique Molecular Identifiers
Template-switching reverse transcription was performed using Maxima H Minus Reverse Transcriptase (ref EP0753, Life technologies), Deoxynucleotide (dNTP) Solution Mix (ref N0447L, New England Biolabs) and with universal adapter E5V6NEXT (100µM, Integrated DNA Technologies). No RNA Spike-in was used in our approach.
E5V6NEXT = 5’-/5Me-isodC//iisodG//iMe-isodC/ACACTCTTTCCCTACACGACGCrGrGrG-3’
All 96 cDNAs obtained after template-switching reaction are pooled together, purified and concentrated with single DNA & Concentrator-5 column (ref ZD4013, Ozyme). Pooled and concentrated cDNAs are treated with Exonuclease I (M0293S/L, New England Biolabs) and then amplified by single primer PCR using Advantage® 2 PCR Kit (ref 639206, Ozyme) and SINGV6 primer (10µM, Integrated DNA Technologies) :
SINGV6 = 5’-/5Biosg/ACACTCTTTCCCTACACGACG*C-3’
12 cycles of PCR amplification are applied. PCR products are purified with Agencourt AMPure XP magnetics beads [0.6x] (ref A63881, Beckman Coulter) and quantified spefically for dbDNA on Varioskan™ LUX (ThermoFisher scientific) with QUANT-IT dsDNA assay kit (ref Q33130, ThermoFisher scientific).
Tagmentation has been adapted from the original protocol. Amplified barcoded cDNAs are tagmented using a bead-linked transposome (Illumina). Between 100 and 200ng of full-length cDNAs was used as input to the Nextera™ DNA Flex Library Prep kit (ref #20018704, Illumina) and Nextera™ DNA CD Indexes (24 Indexes, 24 Samples) (ref #20018707, Illumina) according to the manufacturer’s protocol (Nextera DNA Flex Library Document, ref #1000000025416 v04, Illumina) with exception that :
- the i5 primer was replaced by P5NEXTPT5 primer (5µM, Integrated DNA Technologies)
P5NEXTPT5 = 5’-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCC*G*A*T*C*T*-3’
where * = phosphorothioate bonds
- number of PCR cycles was defined at 8
- final elution of purified library was realized in 15µl of RSB (Resuspension Buffer - included Nextera™ DNA Flex Library Prep kit, Illumina)
The size of the resulting sequencing library was controlled on a 2200 Tape Station (Agilent) with High Sensitivity D1000 ScreenTape (ref 5067- 5584, Agilent) and High Sensitivity D1000 Reagents (ref 5067- 5585, Agilent). Expected size of library is between 400 and 800pb.
Library is specifically P5-P7 quantified by qPCR on Light Cycler 480 (Roche) according to standards Lib Quant Standards (Illumina) (ref 07960387001, Roche). P5 and P7 primers (100µM, Eurofins) are added to Master Mix of kit Kapa Sybr Fast LC480 (ref KK4611, Sigma-Aldrich).
P5 = 5'-AATGATACGGCGACCACCGAGAT-3'
P7 = 5'-CAAGCAGAAGACGGCATACGA-3'
The final concentration must be greater than 2nM in 10µl (for HiSeq sequencing) or greater than 15nM in 10µl (for NovaSeq sequencing) to be sequenced.
Libraries are sequenced either on an Illumina HiSeq 2500 using a Hiseq Rapid SBS Kit v2-50 cycles (ref FC-402-4022) and a Hiseq Rapid PE Cluster Kit v2 (ref PE-402-4002) with 16-58 cycles reads according to manufacturer’s protocol (Denaturing and Diluting Libraries for the HiSeq® and GAIIx, Part # 15050107v03, Illumina), or on a NovaSeq 6000 using NovaSeq 6000 SP Reagent Kit 100 cycles (ref #20027464, Illumina) with 17-8-105 cycles reads (Illumina recommends to add one cycle for reads 1 and 3 on NovaSeq sequencing) according to the manufacturer’s protocol (NovaSeq 6000 Sequencing System Guide Document #1000000019358v11 Material #20023471, Illumina). i7 indexes are read on NovaSeq sequencing due to injection of two plates of 96 libraries on a SP Flow Cell.
Raw fastq pairs used for analysis matched the following criteria:
- the 16 bases of the first read (forward reads) correspond to 6 bases for a designed well-specific barcode and 10 bases for a unique molecular identifier (UMI).
- the second read (reverse reads), 104 bases for NovaSeq (58 bases for HiSeq runs), corresponds to the captured poly(A) RNAs sequence.
Bioinformatics analysis (Fig.1) can be divided in three parts: 1) primary analysis which includes demultiplexing, alignment and counting steps; 2) secondary analysis that performs the differential analysis; 3) tertiary analysis which is focused on the functional annotation and the representation of differentially expressed genes.
The primary analysis (Fig.3) consist in generating the expression matrix containing the raw counts for each sample and for each gene, from raw sequencing data files.
- Illumina basecall files are transformed into paired-end fastq files with Illumina bcl2fastq converter.
- Samples are demultiplexed according to their respective barcodes described in a sample sheet to create a single-end fastq file per sample.
- A step of read trimming is performed in order to remove any poly A tail, which can occur when the transposase cuts a DNA fragment in close proximity to the poly A tail. By using cutadapt on these fragments, all the bases following the poly A tails (reverse sequences of the UMI, sample barcode and illumina adapter) are also removed.
- The cDNA sequences from each sample are then aligned on the reference transcriptome and the mitochondrial genomic reference sequence with BWA5 aligner. Since a reference transcriptome is used, there is no need for RNA aligner (such as STAR or bowtie).
- Expression profiles are generated by parsing the alignment files (bam) then counting the number of unique UMIs associated with each gene for each sample.
- Reads are removed if one of the following cases occur:
* the UMI contains 'N's in its sequence.
* there are more than 3 mismatches in the alignment.
* the sequence aligns with the same score on transcripts of more that one gene.
- Finally, the raw expression matrix is produced, which contains the values of abundance for every gene in all samples.
The secondary analysis consists in identifying the differentially expressed genes between conditions from the raw expression matrix.
The pipeline is mainly written in the R programming language.
- Samples and genes are filtered out of the raw expression matrix according to the parameters specified in the configuration file:
* Samples are removed if they have less than --minReads reads assigned (default 200k) or if they have less than --minGenes genes detected (i.e. number of genes with at least 1 count) (default 5k).
* Genes are removed if not detected (counts = 0) in at least the number of samples forming the smallest condition.
- Counts are then normalized by DESeq26 method.
- Differentially expressed genes are found by DESeq2. By default, genes with an adjusted p-value < 0.05 and a log2 fold-change > 0.58 (i.e. fold-change > 1.5) are defined as differentially expressed.
Using ClusterProfiler10 and StringDB8 tools provides some hints on the interpretation of the differentially expressed genes.
- Enrichment tests are performed on Gene Ontology annotations and KEGG pathways.
- GSEA analysis
- StringDB networks are shown if the number of differentially expressed genes is below 50.
Running the 3’SRP pipeline
The pipeline is entirely implemented as a Snakemake workflow. Scripts used in the pipeline are mainly written in the python and R programming language.
Installing git and conda are mandatory before running the pipeline.
All necessary tools are then listed within a conda environment recipe. This environment can be created with a basic conda command.
$ conda env create -n srp -f CONDA/srp.yml
Testing the installation
A minimal sample size test data as well as a sample sheet are available in the "TESTDATA" folder. In order to check the installation, run the pipeline on this data.
This test data is made up of six samples classified in three conditions (2 samples per condition). The fastq files are real reads taken on a small portion of the human chromosome 22. The reference is a fasta file of the human chromosome 22.
$ conda activate srp
# Under the srp-pipeline directory:
$ python SCRIPTS/make_srp_config.py -s TESTDATA/samplesheet.tsv -r TESTDATA/REFERENCES/ -w RESULTS -f TESTDATA/fastqFiles.txt -c TESTDATA/conditions.tsv --minGenes 0 --minReads 0 > config.json
$ snakemake --config conf="config.json" -rp -j 1
The pipeline takes as input a samplesheet describing samples, one or multiple pairs of fastq files and eventually a file listing the comparisons to test.
--The samplesheet :
A samplesheet describing the samples is necessary to run the pipeline. This file has to be tab delimited (without header) containing six columns: well, index, name, project, condition, species.
--The fastq files :
The fastq files have to be in paired-end mode. The first file should contain the sequences of the sample indexes as well as the unique molecule identifiers (UMI) (6+10 bases). The second file should contain the DNA sequences of the captured polyA RNAs (N bases). There are two ways of specifying the fastq files to the program:
- Creating a file listing the fastq paths
- Specifying a Illumina directory (bcl2fastq output folder)
--The comparisons file :
In order to perform secondary analysis for all or some projects, a file listing the project name and comparisons has to be create.
Running the pipeline
1- Clone this repository and move to it
$ git clone "https://gitlab.univ-nantes.fr/bird_pipeline_registry/srp-pipeline.git"
$ cd srp-pipeline
2- Activate the conda environment
$ conda activate srp
3- Create the configuration file necessary for the pipeline.
The script used to create the configuration file is make_srp_config.py in the SCRIPTS folder. The help can be visualized with:
$ python SCRIPTS/make_srp_config.py -h
The mandatory options are -s for the samplesheet, either -f for a file listing the fastq paths or -i for an illumina directory and -r for the directory containing the genome files for the different assemblies.
It is recommended to specify a working directory where the files will be output with option -w to keep the srp-pipeline git clone clean. The program outputs a json object on stdout which can be redirected to a file.
$ python SCRIPTS/make_srp_config.py -s <my_samplesheet> -r <path_to_reference_folder> -w <path_to_workdir> -i <path_to_fastqs> > config.json
For secondary analysis, use option -c to specify the comparisons to perform.
$ python SCRIPTS/make_srp_config.py -s <my_samplesheet> -r <path_to_reference_folder> -w <path_to_workdir> -i <path_to_fastqs> -c <comparisons_file> > config.json
The generated configuration file should be checked to see if everything seems alright.
$ cat config.json
4- Launch the snakemake pipeline.
Test the launch with a dry run:
$ snakemake --config conf="config.json" -rpn
All rules and commands that will be run should be displayed. Otherwise errors are returned.
Launch the run:
$ snakemake --config conf="config.json" -rp -j 1
In order to launch the pipeline on a cluster, specify a script to encapsulate the jobs to snakemake. example for SGE:
$ snakemake --config conf="config.json" --cluster "qsub -e ./logs/ -o ./logs/" -j 33 --jobscript SCRIPTS/sge.sh --latency-wait 100 -rp