Assigning cells to known or de novo cell types is an important step in the analysis of single-cell RNA-sequencing (scRNA-seq) data. This protocol outlines how to use the CellAssign R package to accomplish this.
Method Article
Assigning scRNA-seq data to known and de novo cell types using CellAssign
https://doi.org/10.21203/rs.2.10442/v1
This work is licensed under a CC BY 4.0 License
This protocol has been posted on Protocol Exchange, an open repository of community-contributed protocols sponsored by Nature Portfolio. These protocols are posted directly on the Protocol Exchange by authors and are made freely available to the scientific community for use and comment.
scRNA-seq
cell types
cell type assignment
RNA-seq
microenvironment
cell type composition
Assigning cells to known or de novo cell types is an important step in the analysis of single-cell RNA-sequencing (scRNA-seq) data. CellAssign is a recently published statistical model of the over-expression of a set of marker genes in each pre-specified cell type. For each cell, CellAssign computes the probability that it belongs to each specified cell type, or to an “unknown” cell type (one whose expression does not match that expected of any specified cell type). These assignments can then be used to (i) study the cell type composition of each sample, (ii) focus on a given cell type for further analysis (e.g. unsupervised clustering), or (iii) remove nuisance cell types.
Software:
R computing environment (version > 3.5)
The devtools R package
The cellassign R package (https://github.com/Irrationone/cellassign)
1. Install TensorFlow within R:
install.packages("tensorflow")
tensorflow::install_tensorflow()
2. Install cellassign by running
devtools::install_github("Irrationone/cellassign")
and load it by calling
library(cellassign)
3. Prepare single-cell expression data in the form of a SingleCellExperiment object
https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html
We will assume this object is named “sce”. rowData(sce) should contain the fields “ID”, holding the Ensembl gene ID, and “Symbol”, holding the HGNC symbol.
4. Compute size factors using the scran package
sce <- scran::computeSumFactors(sce)
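Size factors are generally expected to be strictly positive, and pooling-based estimates can occasionally come out zero or negative for low-quality cells, so a quick sanity check before model fitting may help. The sketch below uses a hypothetical named vector `sf` in place of sizeFactors(sce); the values and cell names are made up for illustration.

```r
# Hypothetical size factors standing in for sizeFactors(sce)
sf <- c(cell1 = 1.2, cell2 = 0.8, cell3 = -0.1, cell4 = 1.0)

# Flag cells whose size factor is missing, infinite, zero or negative
bad_cells <- names(sf)[!is.finite(sf) | sf <= 0]

if (length(bad_cells) > 0) {
  message("Cells with unusable size factors: ", paste(bad_cells, collapse = ", "))
}
```

Cells flagged this way are usually candidates for removal during quality control, although the right action depends on the dataset.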
5. Specify marker gene data
This is in the form of a list, where the names of the list are the cell types and the contents are marker genes for the cell types. An example can be found in the CellAssign package by calling data(example_TME_markers). As a simple example, we can create one for T cells and epithelial cells:
marker_list <- list(t_cells = c("PTPRC", "CD2"), epithelial = "EPCAM")
Note that marker genes are not required to be mutually exclusive between cell types, nor to be unexpressed in other cell types.
6. Turn marker list into binary matrix using marker_list_to_mat
marker_mat <- marker_list_to_mat(marker_list)
Optional: an “unknown” cell type may be included by passing include_other = TRUE to marker_list_to_mat
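To make the shape of the binary marker matrix concrete, the following base-R sketch builds a gene-by-cell-type indicator matrix by hand from the marker list in step 5. It illustrates the data structure only and is not the package's implementation; the names `genes` and `manual_mat` are introduced here for illustration.

```r
# Marker list from step 5
marker_list <- list(t_cells = c("PTPRC", "CD2"), epithelial = "EPCAM")

# All distinct marker genes, sorted for a stable row order
genes <- sort(unique(unlist(marker_list)))

# One column per cell type: 1 if the gene is a marker for that type, else 0
# (hand-built illustration, not the code used by marker_list_to_mat)
manual_mat <- sapply(marker_list, function(markers) as.integer(genes %in% markers))
rownames(manual_mat) <- genes

print(manual_mat)
```

Each row is a marker gene and each column a cell type, so a gene shared by several cell types would simply carry a 1 in more than one column.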
7. Match IDs to rows of the SingleCellExperiment
mm <- match(rownames(marker_mat), rowData(sce)$Symbol)
8. Subset SingleCellExperiment to markers only
sce_marker <- sce[mm,]
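match() returns NA for any marker symbol that is absent from the data, and subsetting with NA indices is likely to fail or misalign rows. The sketch below shows one way to detect and drop unmatched markers, using hypothetical character vectors `symbols` and `markers` in place of rowData(sce)$Symbol and rownames(marker_mat).

```r
# Hypothetical gene symbols standing in for rowData(sce)$Symbol
symbols <- c("ACTB", "PTPRC", "EPCAM", "GAPDH")
# Hypothetical marker names standing in for rownames(marker_mat)
markers <- c("CD2", "EPCAM", "PTPRC")

mm <- match(markers, symbols)

# Report markers that were not found in the data
missing_markers <- markers[is.na(mm)]
if (length(missing_markers) > 0) {
  message("Markers not found in the data: ", paste(missing_markers, collapse = ", "))
}

# Keep only the matched markers and their row indices
keep <- !is.na(mm)
mm_found <- mm[keep]
markers_found <- markers[keep]
```

In the protocol, the same logical vector could be used to drop the unmatched rows of marker_mat so that the marker matrix and the subsetted expression data stay in the same gene order.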
9. Run CellAssign
fit <- cellassign(exprs_obj = sce_marker,
                  marker_gene_info = marker_mat,
                  s = sizeFactors(sce_marker))
Note that covariates can be included at this point by passing an argument named “x” to cellassign. For more information see the vignette below.
10. View assigned cell types
print(fit$cell_type)
11. View maximum likelihood parameter estimates
print(fit$mle_params)
This includes the cell assignment probabilities in fit$mle_params$gamma
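As an illustration of working with the assignment probabilities, the sketch below takes a small made-up matrix in place of fit$mle_params$gamma, assumed here to have one row per cell and one column per cell type, and extracts the most probable type and its probability for each cell. The 0.5 cutoff is an arbitrary choice for illustration, not a recommendation from the package.

```r
# Hypothetical probabilities standing in for fit$mle_params$gamma
# (assumed layout: cells in rows, cell types in columns)
gamma <- matrix(
  c(0.90, 0.05, 0.05,
    0.10, 0.85, 0.05,
    0.40, 0.35, 0.25),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("cell", 1:3), c("t_cells", "epithelial", "other"))
)

# Most probable cell type per cell, and the probability attached to it
best_type <- colnames(gamma)[max.col(gamma, ties.method = "first")]
best_prob <- apply(gamma, 1, max)

# Flag low-confidence calls (the 0.5 threshold is illustrative only)
confident <- best_prob >= 0.5

print(data.frame(cell = rownames(gamma), best_type, best_prob, confident))
```

Cells with no clearly dominant probability may deserve closer inspection before being used in downstream analysis.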
For a more detailed example, see the package vignette at https://irrationone.github.io/cellassign/introduction-to-cellassign.html
Common errors include:
Passing “cellassign” a SingleCellExperiment or gene expression matrix that includes cells with no counts remaining after subsetting to marker genes only.
Not subsetting to marker genes only, i.e. passing a full SingleCellExperiment with all genes to “cellassign”. Both the marker matrix and the expression data passed to “cellassign” should cover the marker genes only.
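The first error can be avoided by removing cells whose marker-gene counts sum to zero before fitting. The sketch below demonstrates the idea on a small made-up genes-by-cells count matrix, `marker_counts`, which stands in for the counts of the marker-only object.

```r
# Hypothetical marker-gene counts (genes in rows, cells in columns)
marker_counts <- matrix(
  c(3, 0, 1,
    0, 0, 2,
    5, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("CD2", "EPCAM", "PTPRC"), c("cell1", "cell2", "cell3"))
)

# Total marker counts per cell
total_per_cell <- colSums(marker_counts)

# Keep only cells with at least one marker count
keep_cells <- total_per_cell > 0
filtered_counts <- marker_counts[, keep_cells, drop = FALSE]

print(colnames(filtered_counts))
```

For a SingleCellExperiment, the same logical vector could be computed from its count assay and used to subset the columns of sce_marker, with the size factors taken from the filtered object.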
Time end-to-end for a beginner user should be around 2 hours.
“cellassign” returns a “cellassign_fit” object that includes the cell type assignments and the maximum likelihood parameter estimates. This allows users to perform downstream analyses such as correlating cell type composition with phenotypes or running further unsupervised analysis on cell subsets.
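As a minimal sketch of a composition analysis, the following computes the proportion of each cell type within each sample. The vectors `cell_type` and `sample_id` are made up; in practice they would come from fit$cell_type and from a sample annotation of the user's own data.

```r
# Hypothetical per-cell labels: cell_type stands in for fit$cell_type,
# sample_id for a per-cell sample annotation
cell_type <- c("t_cells", "t_cells", "epithelial", "epithelial", "epithelial", "t_cells")
sample_id <- c("A", "A", "A", "B", "B", "B")

# Cells per sample and cell type
counts_tab <- table(sample_id, cell_type)

# Row-wise proportions: each sample's cell type composition sums to 1
composition <- prop.table(counts_tab, margin = 1)

print(round(composition, 2))
```

The resulting proportions can then be compared between samples or related to sample-level phenotypes.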
S.P.S. is a founder, shareholder, and consultant of Contextual Genomics Inc.