Mapping cells to gene programs

doi:10.21203/rs.3.pex-2092/v1

Method Article

Mapping cells to gene programs

https://doi.org/10.21203/rs.3.pex-2092/v1

This work is licensed under a CC BY 4.0 License

This protocol has been posted on Protocol Exchange, an open repository of community-contributed protocols sponsored by Nature Portfolio. These protocols are posted directly on the Protocol Exchange by authors and are made freely available to the scientific community for use and comment.

Version 1

posted

You are reading this latest protocol version

Single-cell data integration involves removing unwanted technical variations across multiple datasets. We introduce expiMap, an intepretable generative model that maps single cells to reference atlas¹ and gene programs. Here, we describe a protocol for building an initial reference atlas, mapping new datasets as queries, and identifying gene programs that split between populations. This protocol is associated with our Nature Cell Biology paper describing this model: Biologically informed deep learning to query gene programs in single-cell atlases.

Computational biology and bioinformatics

machine learning

unsupervised learning

single cell

Hardware:

A desktop or laptop computer with at least 16 GB RAM and 4 GB of GPU RAM.

Software:

Python version 3.8 or newer.

pytorch (Python package) version 1.8.0 or newer with CUDA support.

scanpy (Python package) version 1.6.0 or newer.

scarches (Python package) version 0.5.4 or newer.

See the section Equipment for the required versions of software and packages.

1. Install Python. We recommend using miniconda for this.

2. Install pytorch. The installation process is described in https://pytorch.org/get-started/locally/.

3. Install scarches with pip install scarches. scarches automatically installs scanpy and the other dependencies.

4. Prepare your scRNA-seq reference and/or query data in AnnData format. anndata package is automatically installed with scanpy. The details of the API and tutorials can be found in https://anndata.readthedocs.io/en/latest/index.html.

5. Obtain gene programs in a .gmt file. They can be obtained, for example, here http://www.gsea-msigdb.org/gsea/msigdb/index.jsp.

6. Load gene programs from the .gmt file into the AnnData object.

import scarches as sca

import scanpy as sc

sca.utils.add_annotations(adata)

You can also filter gene programs based on the number of genes of a gene program present in the current AnnData object adata. This can be done with min_genes and max_genes arguments.

Names of the loaded gene programs can be found in adata.uns["terms"].

Membership matrix for the loaded gene programs can be found at adata.varm["I"].

Full description of the function can be found at https://scarches.readthedocs.io/en/latest/api/utils.html#scarches.utils.add_annotations

7. Subset the AnnData object to higly variable genes.

It is important to use highly variable genes for training. We recommend to use at least 2000 HVGs and if you have more complicated datasets, conditions then try to increase it to 5000 or so to include enough information for the model.

You can subset HVG in adata with the scanpy function highly_variable_genes.

sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", batch_key="batch", subset=True)

See scanpy docs for the details https://scanpy.readthedocs.io/en/latest/generated/scanpy.pl.highly_variable_genes.html

8. Initialize the model for the reference data.

intr_cvae = sca.models.EXPIMAP(
   adata=adata,
   condition_key="study",
   hidden_layer_sizes=[256, 256, 256],
   recon_loss="nb"
)

recon_loss sets the loss function used for reconstruction of the data from the latent space.

For recon_loss="nb" (negative binomial likelihoood reconstruction loss) adata.X should contain raw counts.

For recon_loss="mse" adata.X should contain log-normalized counts. You can get log-normalized counts by calling scanpy the following scanpy function on adata before passing it to sca.models.EXPIMAP:

sc.pp.normalize_total(adata)

sc.pp.log1p(adata)

condition_key sets the condition (a key in adata.obs containing condition labels) used for integration. The effect of this condition in the data will be "integrated out" during the training.

The description of all arguments can be found at https://scarches.readthedocs.io/en/latest/api/models.html#expimap

The initialized model has the latent space with dimension equal to the number of gene programs loaded during the step 6. Each latent dimension corresponds to a gene program. Names of the gene programs can be found at adata.uns["terms"].

9. Train the model with the reference data.

intr_cvae.train(
   n_epochs=400,
   alpha_epoch_anneal=100,
   alpha=0.7,
   alpha_kl=0.5
)

For the description of hyperparameters see https://scarches.readthedocs.io/en/latest/api/models.html#scarches.models.EXPIMAP.train

The main hyperparameter which affects the quality of integration for the reference training is alpha_kl, the value of which is multiplied by the kl divergence term in the total loss. If the visualized latent space looks like a single blob after the reference training, we recommend to decrease the value of alpha_kl. If the visualized latent space shows bad integration quality, we recommend to increase the value of alpha_kl. The good default value in most cases is alpha_kl=0.5. The required strength of group lasso regularization alpha depends on the number of used gene programs and the size of the dataset. For 300-500 gene programs we recommend to use alpha=0.7 and increase for larger numbers of gene programs.

10. Initialize the trained model for query to reference projection¹.

q_intr_cvae = sca.models.EXPIMAP.load_query_data(query, intr_cvae)

Here query is an AnnData object with the query data. It needs to have condition labels under the same key which was used during the reference model initialization with condition_key.

The description of the arguments is provided here https://scarches.readthedocs.io/en/latest/api/models.html#scarches.models.EXPIMAP.load_query_data

11. Train the model to project the query onto the reference.

q_intr_cvae.train(n_epochs=200, alpha_epoch_anneal=100, weight_decay=0., alpha_kl=0.1)

We recommend to use 200 epochs for the query to reference mapping. Smaller datasets tend to require more epochs of training to map the query into the reference well. If you observe that the query is not integrated into the reference, we recommend to try longer training for the query mapping.

12. Get the joint latent space for the query and reference and visualize it.

ref_query = sc.AnnData.concatenate(adata, query)

ref_query.obsm['X_cvae'] = q_intr_cvae.get_latent(ref_query.X, ref_query.obs["study"], only_active=True)

only_active excluded the deactivated gene programs. The description of the arguments can be found at https://scarches.readthedocs.io/en/latest/api/models.html#scarches.models.EXPIMAP.get_latent

sc.pp.neighbors(ref_query, use_rep='X_cvae')

sc.tl.umap(ref_query)

sc.pl.umap(ref_query, color=["study", "cell_type"])

13. Do differential gene programs test for some condition in reference + query using built-in bayesian test and visualize the results.

q_intr_cvae.latent_enrich(groups='condition', comparison='control', adata=ref_query)

fig = sca.plotting.plot_abs_bfs(ref_query, yt_step=0.8, scale_y=2.5, fontsize=7)

For the details see https://scarches.readthedocs.io/en/latest/api/models.html#scarches.models.EXPIMAP.latent_enrich

14. Save the model for future use.

intr_cvae.save("models/integrated_model")

15. Load the previously saved model.

q_intr_cvae = sca.models.EXPIMAP.load("models/integrated_model", ref_query)

The code for the step by step procedure for training and analysis of the results is also provided in the tutorials:

Basic tutorial for query to reference mapping with expiMap - https://scarches.readthedocs.io/en/latest/expimap_surgery_pipeline_basic.html

Advanced expiMap tutorial for query to reference mapping with de novo learned gene programs - https://scarches.readthedocs.io/en/latest/expimap_surgery_pipeline_advanced.html

1. Lotfollahi, M. et al. Mapping single-cell data to reference atlases by transfer learning. Nature Biotechnology 1–10 (2021).

Download PDF

Version 1

posted

You are reading this latest protocol version

Mapping cells to gene programs

Status:

Version 1

Abstract

Equipment

Procedure

References

Associated Publications

Status:

Version 1

Privacy Policy

Terms of Service

Cookie Settings