See the section Equipment for the required versions of software and packages.
1. Install Python. We recommend using miniconda for this.
2. Install pytorch. The installation process is described in https://pytorch.org/get-started/locally/.
3. Install scarches with pip install scarches. scarches automatically installs scanpy and the other dependencies.
4. Prepare your scRNA-seq reference and/or query data in AnnData format. anndata package is automatically installed with scanpy. The details of the API and tutorials can be found in https://anndata.readthedocs.io/en/latest/index.html.
5. Obtain gene programs in a .gmt file. They can be obtained, for example, here http://www.gsea-msigdb.org/gsea/msigdb/index.jsp.
6. Load gene programs from the .gmt file into the AnnData object.
import scarches as sca
import scanpy as sc
sca.utils.add_annotations(adata)
You can also filter gene programs based on the number of genes of a gene program present in the current AnnData object adata. This can be done with min_genes and max_genes arguments.
Names of the loaded gene programs can be found in adata.uns["terms"].
Membership matrix for the loaded gene programs can be found at adata.varm["I"].
Full description of the function can be found at https://scarches.readthedocs.io/en/latest/api/utils.html#scarches.utils.add_annotations
7. Subset the AnnData object to higly variable genes.
It is important to use highly variable genes for training. We recommend to use at least 2000 HVGs and if you have more complicated datasets, conditions then try to increase it to 5000 or so to include enough information for the model.
You can subset HVG in adata with the scanpy function highly_variable_genes.
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", batch_key="batch", subset=True)
See scanpy docs for the details https://scanpy.readthedocs.io/en/latest/generated/scanpy.pl.highly_variable_genes.html
8. Initialize the model for the reference data.
intr_cvae = sca.models.EXPIMAP(
adata=adata,
condition_key="study",
hidden_layer_sizes=[256, 256, 256],
recon_loss="nb"
)
recon_loss sets the loss function used for reconstruction of the data from the latent space.
For recon_loss="nb" (negative binomial likelihoood reconstruction loss) adata.X should contain raw counts.
For recon_loss="mse" adata.X should contain log-normalized counts. You can get log-normalized counts by calling scanpy the following scanpy function on adata before passing it to sca.models.EXPIMAP:
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
condition_key sets the condition (a key in adata.obs containing condition labels) used for integration. The effect of this condition in the data will be "integrated out" during the training.
The description of all arguments can be found at https://scarches.readthedocs.io/en/latest/api/models.html#expimap
The initialized model has the latent space with dimension equal to the number of gene programs loaded during the step 6. Each latent dimension corresponds to a gene program. Names of the gene programs can be found at adata.uns["terms"].
9. Train the model with the reference data.
intr_cvae.train(
n_epochs=400,
alpha_epoch_anneal=100,
alpha=0.7,
alpha_kl=0.5
)
For the description of hyperparameters see https://scarches.readthedocs.io/en/latest/api/models.html#scarches.models.EXPIMAP.train
The main hyperparameter which affects the quality of integration for the reference training is alpha_kl, the value of which is multiplied by the kl divergence term in the total loss. If the visualized latent space looks like a single blob after the reference training, we recommend to decrease the value of alpha_kl. If the visualized latent space shows bad integration quality, we recommend to increase the value of alpha_kl. The good default value in most cases is alpha_kl=0.5. The required strength of group lasso regularization alpha depends on the number of used gene programs and the size of the dataset. For 300-500 gene programs we recommend to use alpha=0.7 and increase for larger numbers of gene programs.
10. Initialize the trained model for query to reference projection1.
q_intr_cvae = sca.models.EXPIMAP.load_query_data(query, intr_cvae)
Here query is an AnnData object with the query data. It needs to have condition labels under the same key which was used during the reference model initialization with condition_key.
The description of the arguments is provided here https://scarches.readthedocs.io/en/latest/api/models.html#scarches.models.EXPIMAP.load_query_data
11. Train the model to project the query onto the reference.
q_intr_cvae.train(n_epochs=200, alpha_epoch_anneal=100, weight_decay=0., alpha_kl=0.1)
We recommend to use 200 epochs for the query to reference mapping. Smaller datasets tend to require more epochs of training to map the query into the reference well. If you observe that the query is not integrated into the reference, we recommend to try longer training for the query mapping.
12. Get the joint latent space for the query and reference and visualize it.
ref_query = sc.AnnData.concatenate(adata, query)
ref_query.obsm['X_cvae'] = q_intr_cvae.get_latent(ref_query.X, ref_query.obs["study"], only_active=True)
only_active excluded the deactivated gene programs. The description of the arguments can be found at https://scarches.readthedocs.io/en/latest/api/models.html#scarches.models.EXPIMAP.get_latent
sc.pp.neighbors(ref_query, use_rep='X_cvae')
sc.tl.umap(ref_query)
sc.pl.umap(ref_query, color=["study", "cell_type"])
13. Do differential gene programs test for some condition in reference + query using built-in bayesian test and visualize the results.
q_intr_cvae.latent_enrich(groups='condition', comparison='control', adata=ref_query)
fig = sca.plotting.plot_abs_bfs(ref_query, yt_step=0.8, scale_y=2.5, fontsize=7)
For the details see https://scarches.readthedocs.io/en/latest/api/models.html#scarches.models.EXPIMAP.latent_enrich
14. Save the model for future use.
intr_cvae.save("models/integrated_model")
15. Load the previously saved model.
q_intr_cvae = sca.models.EXPIMAP.load("models/integrated_model", ref_query)
The code for the step by step procedure for training and analysis of the results is also provided in the tutorials:
Basic tutorial for query to reference mapping with expiMap - https://scarches.readthedocs.io/en/latest/expimap_surgery_pipeline_basic.html
Advanced expiMap tutorial for query to reference mapping with de novo learned gene programs - https://scarches.readthedocs.io/en/latest/expimap_surgery_pipeline_advanced.html