1. Prepare two input files
a. A "read counts" file, with 1 row of headers with unique sample names and 1 column with gene names, and the rest of the table filled with read counts data (integer values). This file should be saved as a tab-delimited text file format, which can be specified in the "Save As" dialog in MS Excel and other programs. Avoid special characters in sample IDs (spaces, dashes, periods, etc), except for underscores, and do not name samples using only a number. An example input file is provided in "DESeq2examplereadcounts.txt".
b. A "metadata file" containing sample names and groupings. Scripts are provided for three comparison scenarios:
i. Simple A vs B comparisons (e.g. treated vs control)
Example input file: "DESEq2examplemetadatasimple.txt"
The "metadata" file should have sample names corresponding to the "read counts" file down the first column, sample categories (defining the comparison you want to perform) down the second column, with 1 row of headers ("Sample" and "Group" in the example input file). This file should be saved as a tab-delimited text file. Sample names must be in the same order as the read counts file. Avoid special characters in sample names or comparison groups.
ii. Paired comparisons (e.g. the same individuals before and after treatment)
Example input file: "DESEq2examplemetadatapaired.txt"
Prepare the file the same way as for the simple A vs B comparison described above in (a), except add a third column with the header "Pairs" containing unique IDs to match the same individuals across the two comparison groups. Do not identify individuals using only numbers. Ensure that all individuals are represented in order in each of the two groups being compared.
iii. Comparisons controlling for batch effects (e.g. both treated and control samples were collected in two different experiments).
Example input file: "DESEq2examplemetadatabatcheffects.txt"
Prepare the file the same way as for the simple A vs B comparison described above in (a), except add a third column with the header "Batch" containing unique IDs per sample batch. This will only work if each side of the comparison contains at least one sample from each batch specified.
The file names in quotes in the script can be changed to match your input files.
2. Load the R script in R Studio
a. Please be sure to have these versions (or newer) installed: R v4.3.2, RStudio 2023.12.0+369. Both are downloadable for free (currently at https://posit.co/download/rstudio-desktop/)
b. Load the provided R script in RStudio (DESeq2script.R)
c. Set the working directory to the location where you saved the R script and example files (or your own custom files). You can do this after loading the script by using the top menu in RStudio, selecting "Session" then "Set Working Directory" then "To Source File Location".
3. Run the R script
Read the comments (starting with #) and run each command individually by highlighting the line and clicking "Run" at the top right of the script window (or by pressing CTRL+Enter in Windows or CMD+Enter in MacOS).
4. Description of output files
a. The DEseq2 output file (starting with the filename "DESEQoutput" when using the script without modifications) is a plain tab-delimited text file that can be opened in MS Excel or text editors, and contains the following headers:
baseMean: The average of the normalized count values for a given gene across all samples, which represents the mean expression level of the gene.
log2FoldChange: This is the estimated fold change in gene expression between the two conditions, expressed in log2 scale. A positive value indicates upregulation while a negative value indicates downregulation (for the first of the two conditions specified in the "contrast" in the script). Users editing the script should double-check that the direction of differential regulation matches the raw data to ensure that the direction of the fold change matches the raw data. Genes with values greater than 1 or less than -1 are two-fold upregulated or downregulated, which can be used as a filter, but it is not strictly necessary.
lfcSE (log2FoldChange Standard Error): This represents the standard error of the log2FoldChange estimate. A lower value indicates greater precision in the estimate of the fold change.
stat: This is the Wald statistic for the hypothesis test that the log2FoldChange is different from zero.
pvalue: This is the unadjusted p-value associated with the Wald test statistic. A lower p-value suggests that the observed difference in gene expression is less likely to be due to random chance.
padj (adjusted p-value): This is the FDR-adjusted p-value adjusted for multiple testing. This is the primary output statistic that should be used to determine significant differential expression, typically at a threshold of, at most, ≤ 0.05.
b. The PCA output file (DESeq_sample_PCA_coordinates.txt) is a plain tab-delimited text file containing the sample IDs in the first column, and the coordinates for the PC1 values (X-axis) and the PC2 values (Y-axis), and well as the sample grouping data. The variance values for PC1 and PC2 can be recorded from the PCA plot generated within R studio, and the PCA values can be used to replot data in other programs (such as MS Excel or Prism).