The following procedure details how to use the pySM library (https://github.com/alexandrovteam/pySM) to perform FDR-controlled annotation.
See figure in Figures section: http://www.nature.com/protocolexchange/system/uploads/4803/original/thefigure_workflow_detailed.png?1474008322
The pipeline has two core parts: 1. calculation of Metabolite Signal Match (MSM) scores for every molecular formula in a metabolite database; 2. reporting of molecular formulas at a specified FDR.
Installation
- Obtaining the code
a. Create a convenient directory, for example spatial_metabolomics, and clone the repository into it:
mkdir spatial_metabolomics
cd spatial_metabolomics
git clone https://github.com/alexandrovteam/pySM
- We recommend installing pySM and its dependencies inside a virtual environment as follows.
a. If you have an Anaconda installation of Python, follow the instructions under "Setting up a virtual environment using conda". Otherwise, follow the instructions under "Setting up a virtual environment using virtualenv".
b. Setting up a virtual environment using conda
i. Initialize and activate a 'pySM' environment with all the dependencies:
cd pySM
conda env create
source activate pySM
ii. Install the pySM package with pip:
pip install . -r requirements.txt
c. Setting up a virtual environment using virtualenv
i. Set up and activate a new virtual environment:
pip install virtualenv
virtualenv venv
source venv/bin/activate
ii. Install pySM and its dependencies with pip:
cd pySM
pip install . -r requirements.txt
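After either installation route, it can be useful to confirm that the package is visible to the active Python interpreter before processing any data. The following is a minimal sketch that uses only the Python standard library (Python 3 is assumed); the helper name is_installed is hypothetical and is not part of pySM.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "pySM" is the package name used by the import statements in this protocol.
    print("pySM importable:", is_installed("pySM"))
```

If this reports False, the most likely cause is that the virtual environment created above is not activated in the current shell.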
Annotating a dataset
- Inputs
a. To process a dataset, three things are needed: 1. a high-resolution imaging MS dataset; 2. a metabolite database; 3. a configuration file.
- Dataset
Data should be in the .imzML format. The pipeline is designed for and was tested on centroided data.
- Database
The database is a CSV file with columns for id, name, exact_mass and formula.
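As an illustration of the database layout, the sketch below writes and re-reads a two-entry CSV with the four columns named above. It assumes a comma-delimited file with a header row, which this protocol does not state explicitly; the ids "demo_1" and "demo_2" and the helper names are hypothetical, and the masses are the monoisotopic masses of the neutral molecules.

```python
import csv
import io

# Column names as listed in this protocol.
COLUMNS = ["id", "name", "exact_mass", "formula"]


def write_database(rows, handle):
    """Write database rows (dicts keyed by COLUMNS) with a header row."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def read_database(handle):
    """Read rows back and fail early if an expected column is missing."""
    reader = csv.DictReader(handle)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("database is missing columns: " + ", ".join(missing))
    return list(reader)


# Hypothetical example entries (ids are placeholders).
example_rows = [
    {"id": "demo_1", "name": "Glucose", "exact_mass": 180.06339, "formula": "C6H12O6"},
    {"id": "demo_2", "name": "Adenine", "exact_mass": 135.05450, "formula": "C5H5N5"},
]

if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for a file such as /path/to/database.csv
    write_database(example_rows, buffer)
    print(buffer.getvalue())
```

Checking the columns before running the pipeline catches a malformed database file early, rather than part-way through a long annotation run.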
- Configuration file
A complete example configuration can be found here (https://github.com/alexandrovteam/pySM/blob/master/pySM/example/example_config.json). The following parameters should be set individually for every dataset; other parameters can generally be left at their default values.
"name":"dataset_short_name",
"image_generation":{
"ppm":
},
"file_inputs":{
"data_file":"/path/to/imaging_ms_dataset.imzML",
"database_load_folder":"/path/to/tmp_folder_for_storing_isotope_patterns",
"results_folder":"/path/to/folder_for_storing_results",
"database_file":"/path/to/database.csv"
},
"fdr":{
"pl_adducts":[
{"adduct":"+H"},
{"adduct":"+Na"},
{"adduct":"+K"}
],
},
"isotope_generation":{
"charge":[
{"polarity":"+", "n_charges":1}
],
"isocalc_sig":0.01,
a. name: a short name for the dataset; if "name":"" is given, the imzML filename will be used
b. ppm: the m/z window for ion images, in parts per million
c. file_inputs: paths for loading data and storing results
d. fdr: false discovery rate properties
e. pl_adducts: real adducts to search for
f. isotope_generation: settings for predicting isotope patterns
g. charge: polarity and charge state to search for (the pipeline currently only supports one charge state at a time). For example, for singly charged negative mode use "charge":[ {"polarity":"-", "n_charges":1} ],
h. isocalc_sig: peaks are predicted with a Gaussian shape, and this parameter is its sigma, where sigma = FWHM/2.3548.
i. isocalc_resolution: this is not the mass spectral resolution; it is the digitisation rate of the isotope patterns
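The relation sigma = FWHM/2.3548 and the meaning of a ppm window can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the values 500.0 and 3.0 are arbitrary examples rather than recommended settings, and it assumes the ppm tolerance is applied symmetrically around each m/z value, which this protocol does not specify.

```python
import math

# For a Gaussian peak, FWHM = 2 * sqrt(2 * ln 2) * sigma, which is about 2.3548 * sigma.
FWHM_PER_SIGMA = 2.0 * math.sqrt(2.0 * math.log(2.0))


def sigma_from_fwhm(fwhm):
    """Convert a peak full width at half maximum to a Gaussian sigma."""
    return fwhm / FWHM_PER_SIGMA


def fwhm_from_sigma(sigma):
    """Convert a Gaussian sigma (e.g. isocalc_sig) to a full width at half maximum."""
    return sigma * FWHM_PER_SIGMA


def ppm_half_width(mz, ppm):
    """Absolute m/z tolerance on one side of mz (symmetric window assumed)."""
    return mz * ppm * 1e-6


if __name__ == "__main__":
    # isocalc_sig of 0.01 is the value shown in the configuration excerpt.
    print("FWHM for sigma 0.01:", fwhm_from_sigma(0.01))
    # Arbitrary example values, not recommendations.
    print("half-width at m/z 500, 3 ppm:", ppm_half_width(500.0, 3.0))
```

In practice the FWHM would be measured from a representative peak in the dataset and converted to sigma before being entered as isocalc_sig.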
- Calculating MSM Scores
a. The spatial_metabolomics module runs the pipeline for calculating MSM scores. To calculate MSM scores for a whole dataset and database simply pass the configuration file to the run_pipeline function:
from pySM import spatial_metabolomics
json_filename = '/path/to/config.json'
spatial_metabolomics.run_pipeline(json_filename)
This writes the MSM score for every combination of molecular formula in the metabolite database and target adduct to a text file in the "results_folder" specified in the config file. Additionally, a randomly selected set of decoy adducts will be chosen and their MSM scores calculated. (The number of decoy adducts is controlled by the config parameter n_im under fdr.)
Reporting FDR
a. The main use of FDR control is to report which molecular formulas are annotated at a fixed FDR. This uses the results file generated by spatial_metabolomics.run_pipeline and the target and decoy adducts specified in the configuration file.
from pySM import spatial_metabolomics, fdr_measures
json_filename = '/path/to/config.json'
results_fname = spatial_metabolomics.generate_output_filename(spatial_metabolomics.get_variables(json_filename),[],'spatial_all_adducts')
target_adducts,decoy_adducts = fdr_measures.get_adducts(json_filename)
fdr = fdr_measures.decoy_adducts(results_fname,target_adducts,decoy_adducts)
b. To print, for each target adduct, the list of molecular formulas whose MSM score results in an FDR of less than fdr_target:
fdr_target=0.1
print(fdr.decoy_adducts_get_pass_list(fdr_target,n_reps=20,col='msm'))
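To make the idea of a decoy-based pass list concrete, the sketch below implements a generic target-decoy FDR estimate. It is not pySM's implementation: the function names and scores are hypothetical, it assumes equally many target and decoy candidates so that the FDR at a threshold can be estimated as the number of decoys divided by the number of targets scoring at or above it, and it does not reproduce the repeated decoy sampling suggested by the n_reps argument above.

```python
def fdr_at_threshold(target_scores, decoy_scores, threshold):
    """Estimate FDR as (#decoys >= threshold) / (#targets >= threshold)."""
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0
    return min(1.0, float(n_decoy) / n_target)


def passing_targets(target_scores, decoy_scores, fdr_target):
    """Return the target names at the lowest score threshold with FDR < fdr_target."""
    scores = list(target_scores.values())
    best = None
    # Walk thresholds from strictest to most permissive, keeping the last that passes.
    for threshold in sorted(set(scores), reverse=True):
        if fdr_at_threshold(scores, decoy_scores, threshold) < fdr_target:
            best = threshold
    if best is None:
        return []
    return sorted(name for name, s in target_scores.items() if s >= best)


# Hypothetical scores for illustration only.
targets = {"C6H12O6": 0.9, "C5H5N5": 0.8, "C3H7NO2": 0.4, "C2H6O": 0.1}
decoys = [0.5, 0.3, 0.05, 0.02]

if __name__ == "__main__":
    print(passing_targets(targets, decoys, fdr_target=0.1))
    # prints ['C5H5N5', 'C6H12O6']
```

With these example scores, lowering the threshold to 0.4 would admit one decoy against three targets (an estimated FDR of about 0.33), so only the two highest-scoring formulas pass at an FDR target of 0.1.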