Isolation of native proteome, SEC fractionation and preparation for MS analysis
Cell culture and harvest TIMING: ~7 days
1. Culture cells as applicable to the respective cell type. If using HEK293 cells, culture cells in DMEM medium supplemented with 10% FBS and 50 μg/mL penicillin/streptomycin in 15 cm cell culture dishes, incubating at 37˚C, 5% CO2. To establish a log-linearly growing cell population, split the cells twice at a ratio of 1:2 using 1× Trypsin-EDTA for 5 min at 37˚C.
2. Harvest the cells at ~80% confluency, as determined by visual inspection under the microscope. Harvest cells on ice in ice-cold PBS buffer containing 5 mM EDTA using pipette flow (sufficient in the case of HEK293 cells) or a plate scraper into a 15 ml Falcon tube. Spin at 4°C, 500×g for 5 min, remove the supernatant using a serological pipette, and snap-freeze the cell pellet in liquid nitrogen.
PAUSE POINT: Cell pellets can be stored at -80˚C for several weeks prior to SEC-SWATH-MS analysis.
Native lysis and fractionation by size exclusion chromatography
3. Lyse an amount of cells or tissue sufficient to extract at least 1 mg of total protein (in the case of the HEK293 cell line, 7e7 cells). Lyse cell pellets snap-frozen in step 2 by freeze-thawing into 1 ml of HNN lysis buffer. Thaw and dissolve the frozen pellet by pipetting up and down 20 times. Incubate on ice for 5 min. Other cell or tissue types may be used, but input amounts need to be adapted based on cell size or protein yield. A minimal protein amount of 600 µg is required as input to SEC fractionation, with concentration determined colorimetrically (e.g. using the Pierce BCA protein assay kit). This corresponds to ~2 mg when protein concentration is estimated by OD280 measurements, which are confounded by other molecules in the sample but are used here for the sake of processing speed.
4. Fill the lysate to a volume of 2 ml with HNN lysis buffer and distribute it into two ultracentrifuge tubes. Balance the tubes by weight on a fine balance using HNN lysis buffer.
5. Transfer to the pre-cooled centrifuge rotor and clarify by 15 minutes of ultracentrifugation (100,000×g, 4°C, 55,000rpm on TLA120.2 rotor).
6. Pre-cool two Amicon Ultra-4 Centrifugal Filter Units on ice. Transfer 300 μl of the cleared lysate to each Amicon device and exchange buffer to HNN buffer as follows.
CAUTION: Avoid transfer of lipids from the top layer of the supernatant by aspirating the cleared lysate from 1 cm below the liquid surface.
7. Exchange buffer to HNN buffer (50 mM HEPES pH 7.5, 150 mM NaCl, 50 mM NaF) at a final ratio of 1:50 in three dilution and re-concentration steps to avoid large dilution steps in the interest of complex integrity. Centrifugation is performed at 3220×g, 4°C.
CRITICAL: Local precipitation occurs at, and blocks, the filtration membrane. It is therefore important to flush the membrane with the dilution buffer, using a 200 µl pipette tip to rinse the membrane thoroughly.
7.1. Centrifuge for 5 min (final volume above filter ca. 200 µl)
7.2. Dilute 1:5 in HNN (add 800 µl), flush membrane
7.3. Centrifuge for 10 min (final vol. ~250 µl)
7.4. Dilute 1:5 in HNN (add 1000 µl), flush membrane
7.5. Centrifuge for 10 min (final vol. ~250 µl)
7.6. Dilute 1:2 in HNN (add 250 µl), flush membrane
7.7. Centrifuge for 5 min (vol. ~150 µl), flush membrane
7.8. Centrifuge for 5 min
7.9. Final volume per tube: ca. 50-80 µl.
7.10. Remove precipitates by centrifugation at 16,900×g, 4°C, for 5 min, then transfer the supernatant, leaving 10 µl behind, to a pre-cooled injection vial.
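As a quick check of the dilution scheme in step 7, the cumulative buffer exchange ratio can be computed from the retained and added volumes. The following Python sketch is an illustration only and not part of the published workflow; it uses the approximate volumes listed in the sub-steps, and the variable names are our own.

```python
# Cumulative buffer exchange ratio for the three dilution steps (volumes in µl).
# Each tuple is (retentate volume before dilution, volume of HNN buffer added).
dilution_steps = [(200, 800), (250, 1000), (250, 250)]

total_factor = 1.0
for retained_ul, added_ul in dilution_steps:
    # Dilution factor of a single step, e.g. (200 + 800) / 200 = 5
    total_factor *= (retained_ul + added_ul) / retained_ul

print(f"Cumulative exchange ratio: 1:{total_factor:.0f}")
```

With the listed volumes, the three steps (1:5, 1:5 and 1:2) multiply to the stated final ratio of 1:50.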
8. Measure the concentration of the lysate by UV/Vis spectrophotometry (Nanodrop spectrophotometer) against a reference sample of HNN lysis buffer in HNN buffer (1:50), approximating 1 OD280 = 1 μg/μl protein concentration. The measured concentration should typically be between 20 and 30 mg/ml.
CAUTION: The concentration read by UV/Vis spectrophotometry is confounded by other compounds with absorbance at 280 nm. Based on colorimetric methods (BCA assay), the protein loading is ca. 3-4-fold lower than approximated by UV/Vis (Figure 2A). We consider the fast UV/Vis reading sufficient to align sample loading amounts and preferable over BCA or similar quantitative assays, whose significant incubation times may affect complex stability.
9. Subject 1000 μg of the concentrated lysate to SEC fractionation at 500 μl/min. Ensure that the chromatographic system and column show reproducible and expected performance in the fractionation of the protein standard mix prior to and after the analysis. Collect fractions in the expected elution range from 10-28 min at 0.19 min per fraction into a cooled 1 ml 96-DeepWell plate.
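As a rough consistency check of the collection scheme in step 9, the number of fractions and the volume per fraction follow from the flow rate, the collection window and the time per fraction. The Python sketch below is illustrative only; the variable names are our own, and counting only complete fractions within the window is an assumption.

```python
import math

flow_rate_ul_per_min = 500.0  # SEC flow rate (step 9)
window_start_min = 10.0       # start of fraction collection
window_end_min = 28.0         # end of fraction collection
min_per_fraction = 0.19       # collection time per fraction

# Number of complete fractions that fit into the collection window
n_fractions = math.floor((window_end_min - window_start_min) / min_per_fraction)

# Volume collected per fraction in a single SEC run
volume_per_fraction_ul = flow_rate_ul_per_min * min_per_fraction

# Volume per well after pooling two replicate injections
pooled_volume_ul = 2 * volume_per_fraction_ul

print(n_fractions, round(volume_per_fraction_ul, 1), round(pooled_volume_ul, 1))
```

Under these assumptions the window yields 94 complete fractions of about 95 µl each, i.e. about 190 µl per well once two replicate runs are pooled.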
10. Repeat step 9 while collecting fractions in a new 96-well plate.
11. Interrogate the UV/Vis profiles of the two SEC runs of the same lysate and if in agreement, pool the collected fractions across the two replicate injections to obtain one set of fractions.
CRITICAL: It is important to also sample chromatographic fractions of the void volume peak, even if they carry less information on analyte size. This is especially important for quality control of the global proteome assembly state of the investigated cell system (observed total MS signal in the assembled vs. monomeric SEC range). Additionally, the peak detection algorithms employed in downstream protein and protein complex detection benefit from complete elution profiles, including shoulder regions of detectable peaks. The right boundary of the relevant protein elution range can be established empirically by SDS-PAGE analysis of the late fractions (> F70). We suggest using the elution volume of the small molecule uridine contained in the SEC standard sample. We recommend sampling up to and including the uridine peak, as a subset of proteins and complexes may display secondary interactions with the stationary phase and thus elute with delay in this fraction range.
12. To monitor SEC stability and to calibrate the apparent molecular weight per SEC fraction, analyze 5μl of the SEC column performance check standard after the SEC experiment.
13. Transfer an aliquot of the unfractionated sample to the collection plate. Pipette 1/40th of the volume injected for SEC (25μg by OD280) into wells H11&H12 and fill to 200μl with SEC buffer to align digest conditions with the individual SEC fractions.
CRITICAL STEP: Include an aliquot of the unfractionated sample in the proteomic analysis to ensure comparable digest conditions as for the chromatographic fractions. The data acquired from the unfractionated mild proteome is used in the PyProphet machine learning step in peptide-centric analysis, generating one scoring function applied across all chromatographic fractions to ensure aligned scoring and consistent quantification of peptides across all chromatographic fractions.
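The loading volumes in steps 9 and 13 depend on the OD280-based concentration measured in step 8. The following Python sketch shows the arithmetic; the function name and the example concentration of 25 µg/µl (within the typical range given in step 8) are illustrative assumptions.

```python
def injection_volumes(conc_ug_per_ul, load_ug=1000.0, aliquot_fraction=1 / 40):
    """Return (SEC injection volume in µl, aliquot volume in µl, aliquot mass in µg).

    conc_ug_per_ul: lysate concentration approximated by OD280 (1 OD280 = 1 µg/µl).
    load_ug: protein amount subjected to SEC (1000 µg in step 9).
    aliquot_fraction: share of the injected volume kept as unfractionated sample.
    """
    inject_ul = load_ug / conc_ug_per_ul
    aliquot_ul = inject_ul * aliquot_fraction
    aliquot_ug = load_ug * aliquot_fraction
    return inject_ul, aliquot_ul, aliquot_ug

# Hypothetical example: lysate measured at 25 µg/µl by OD280
inject_ul, aliquot_ul, aliquot_ug = injection_volumes(25.0)
print(inject_ul, aliquot_ul, aliquot_ug)
```

In this example, 40 µl are injected per SEC run and 1 µl (25 µg by OD280) is pipetted into each of the wells reserved for the unfractionated sample.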
PAUSE POINT: Undigested SEC fractions can be stored at -80˚C for several weeks.
Optionally, if extended storage is desired, it is recommended to denature proteins by boiling in sodium deoxy-cholate (next step) before freezing for storage.
Tryptic digest and C-18 cleanup of chromatographic fractions for MS analysis. TIMING: 4+12h
14. Denature proteins by adding sodium deoxy-cholate to 1 % v/v (20μl from 10% stock solution) and incubate 5min in a hot water-bath (95˚C).
CAUTION: Ensure that the plate is properly sealed before incubation in the water bath to avoid sample loss or contamination.
15. Let plate cool to room temperature and centrifuge at 500×g to collect liquid at the base of the plate.
16. Reduce proteins by adding TCEP to 5 mM (22 μl from 50 mM solution, 1:10 dilution of 500 mM stock in 50 mM ammonium bicarbonate, pH 8.8). Incubate for 30 min at room temperature.
CRITICAL STEP: Ensure that the TCEP stock solution is titrated to pH 8.8 to avoid acidification of the samples and premature precipitation of sodium deoxy-cholate.
17. Alkylate proteins by adding iodo-acetamide (IAA) to 10 mM (24 μl from 100 mM stock). Incubate for 20 min at room temperature, in the dark.
CAUTION: Work in reduced light conditions and incubate in the dark due to IAA light sensitivity.
CRITICAL STEP: Ensure that the pH is ≥ 8.0 to avoid gel formation or partial precipitation of deoxy-cholate during the digest. Test the samples for gel formation using a pipette tip. If very high viscosity or formation of a gel is observed, adjust the pH by adding NaOH (in steps of 5 μl of 2 M stock solution until the samples display low viscosity and pH 8.0-8.5).
18. Add 0.2 µg trypsin (Promega) per fraction (2 µl of 0.1 µg/µl stock in trypsin buffer). Re-seal the plate, shake, spin down for 1 min at 2,000×g and incubate overnight at 37 °C.
19. Stop the digest and precipitate deoxycholic acid by adding TFA to 1 %, ACN to 1 % (26μl of 10% TFA / 10% ACN stock). Close and mix the plate thoroughly using a new plate seal and 10 inversions. Spin down for 1 min at 2,000×g.
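The nominal reagent concentrations in steps 14-19 can be checked by tracking the well volume through the successive additions. The Python sketch below is illustrative; it assumes a starting volume of 200 µl per well (as for the unfractionated sample in step 13) and reports each concentration at the time of addition, before later additions dilute it further.

```python
# (reagent, volume added in µl, stock concentration in the unit given in the name)
additions = [
    ("sodium deoxycholate (%)", 20.0, 10.0),   # step 14
    ("TCEP (mM)", 22.0, 50.0),                 # step 16
    ("IAA (mM)", 24.0, 100.0),                 # step 17
    ("trypsin (ug/ul)", 2.0, 0.1),             # step 18
    ("TFA and ACN (% each)", 26.0, 10.0),      # step 19
]

volume_ul = 200.0  # assumed starting volume per well
conc_at_addition = {}
for reagent, added_ul, stock_conc in additions:
    volume_ul += added_ul
    # Concentration right after this addition: stock amount / new total volume
    conc_at_addition[reagent] = stock_conc * added_ul / volume_ul

for reagent, conc in conc_at_addition.items():
    print(f"{reagent}: {conc:.3g}")
print(f"final volume: {volume_ul:.0f} µl")
```

Under this assumption the effective concentrations are slightly below the nominal targets (about 0.9% deoxycholate, 4.5 mM TCEP, 9 mM IAA and 0.9% TFA/ACN), and the final volume per well is about 294 µl.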
20. Prepare the MacroSpin plate for C-18 cleanup. Tap the plate to loosen the resin material and spin down for 1 min at 1,000×g. Activate the resin by adding 200 μl ACN per well and centrifuging at 1,000×g for 1 min. Equilibrate the resin by 3 washes with 150 μl 5% ACN/0.1% FA, spinning at 1,000×g for 2 min. Discard the washing solution from the collection plate.
21. Directly before loading the samples for C-18 cleanup, pellet the precipitated deoxycholic acid for 10 min at 3,220×g. Transfer 80% (220 μl) of the cleared supernatant onto the equilibrated C-18 resin.
CAUTION: Ensure minimal transfer of precipitate onto the C18 resin to avoid sample contamination.
22. Load samples at 1000×g for 2 min. To maximize recovery, re-load the flow-through onto the C-18 resin a second time. Keep the flow-through for potential trouble-shooting.
23. Wash the C-18 resin by 3×200μl 5% ACN/0.1% FA, spinning at 1,000×g for 1 min each.
24. Elute the samples into a fresh collection plate with 2×150μl 50%ACN/0.1% FA.
25. Dry samples in a speed-vac equipped with a plate rotor and an adequate tare (counterbalance) plate filled with the same volume of C-18 elution buffer (45˚C, 0.2 atm, ca. 4 h).
PAUSE POINT: Dried peptide samples can be stored for several weeks at -20 or -80˚C.
MS analysis: TIMING: 12h (QC) + 14 days (DIA only) OR 28 days (DIA + DDA)
26. Re-suspend dried peptide samples in 18μl 2% ACN/0.1% FA, supplemented with internal retention time calibration peptides (iRT kit, Biognosys, CH, 1:20 dilution as opposed to manufacturer's instruction of 1:10 to accommodate larger injection volumes). The spiked in iRT peptides allow the normalization of retention times across different LCMS runs and enable the streamlined generation of spectral libraries and queries of peptides from repository-scale spectral libraries in the DIA/SWATH data maps 26,31. Re-suspend the samples by 5 min sonication in an ice-cooled water bath to avoid sample heating and evaporation.
27. Collect liquid and remove potential residual deoxy-cholate by centrifugation at 3,220×g for 5min. Transfer 16μl of the sample to MS injection vials.
CAUTION: Transfer the peptide samples pipetting at an angle and leaving ca. 2μl in order to avoid transfer of potential residual deoxy-cholate precipitate from the lowest points of the wells.
28. Before analyzing the full set of fractions, test sample set quality by analyzing 2 μl of the unfractionated sample and of the two fractions with the highest absorbance at 280 nm as monitored during SEC fractionation (in our chromatographic setup, fractions 5 and 50).
Judge sample quality based on the following criteria:
· no increase of chromatographic backpressure
· TIC signal intensity in SWATH64vw mode is ≥ 2e7 (120min gradient) (Figure 2B)
· The m/z map is well-populated with isotopic envelopes
To acquire the full dataset, maximize sample injection volumes to target 1e8 in the highest-abundant SEC fraction (In the HEK293 case, fraction 50 and an injection volume of 4 μl).
TROUBLESHOOTING
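The TIC criterion and the injection volume adjustment in step 28 can be expressed as a small helper. The Python sketch below is illustrative only: it assumes that TIC scales approximately linearly with the injected volume, which is our simplification and should be verified on the instrument. The function names, the volume cap and the example reading are hypothetical.

```python
def passes_tic_threshold(observed_tic, threshold=2e7):
    """QC criterion from step 28: TIC signal intensity of at least 2e7."""
    return observed_tic >= threshold

def scaled_injection_volume(test_volume_ul, observed_tic,
                            target_tic=1e8, max_volume_ul=16.0):
    """Estimate the injection volume needed to reach the target TIC.

    Assumes a linear relation between injected volume and TIC (simplification).
    max_volume_ul is a hypothetical cap, here the volume transferred in step 27.
    """
    needed_ul = test_volume_ul * target_tic / observed_tic
    return min(needed_ul, max_volume_ul)

# Hypothetical reading: a 2 µl test injection of the most abundant fraction gives 5e7
print(passes_tic_threshold(5e7))          # True
print(scaled_injection_volume(2.0, 5e7))  # 4.0
```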
29. If a project-specific spectral library should be generated, each fraction should be analyzed in both data-independent SWATH and data-dependent acquisition mode.
CAUTION: Datasets acquired exclusively in SWATH acquisition mode can typically be interpreted using spectral libraries from public repositories. Note that the library employed for interpretation needs to be representative for the tissue type that is being analyzed. Depending on the availability of such libraries and the research question at hand it might further be preferable to generate a project-specific spectral library by DDA acquisition of a subset of or the full sample set analyzed by SEC-SWATH-MS.
Peptide-centric SWATH-MS analysis: TIMING: 3 days
Here, we employ a docker container (see installation and initialization in Equipment Setup) that provides a stable solution for running peptide-centric scoring by OpenSWATH, PyProphet and TRIC on any computing system. Example files and a script including all processing steps are provided in our GitHub repository (https://github.com/CCprofiler/SECSWATH_PeptideCentricAnalysis.git).
30. Create a data analysis folder
i. Open a command line interpreter
ii. Clone and enter our analysis folder template from GitHub:
git clone https://github.com/CCprofiler/SECSWATH_PeptideCentricAnalysis.git
cd SECSWATH_PeptideCentricAnalysis
31. Prepare all required input data for peptide-centric analysis
i. MS file conversion and centroiding
On the conversion computer, use MSconvert to convert and centroid .wiff raw files into .mzML or .mzXML format 45.
i. Open MSconvertGUI
ii. Under Files/browse, select the .wiff files.
iii. Under Options, leave the defaults and activate in addition 'Package in gzip'.
iv. Under 'Filters', select 'Peak Picking'.
v. Under 'Algorithm', select 'Vendor'.
vi. Under 'MS Levels', enter '1-2'.
vii. Hit 'Add'.
viii. Start the conversion (Button in the lower right).
ix. Once the conversion is finished, move the .gz file(s) to the peptide-centric analysis computer and into the folder
SECSWATH_PeptideCentricAnalysis/data_dia/
Then, move the .gz files generated from the unfractionated sample into the subfolder
SECSWATH_PeptideCentricAnalysis/data_dia/unfractionated_secinput/
NOTE: The centroiding significantly reduces file size and processing time and is highly recommended, in particular if peptide-centric analysis is to be performed on a personal or laptop computer.
ii. Information on retention time calibration peptides (iRT spike-in or ciRT peptide set)
Example iRT and ciRT libraries are provided in the data_library folder in the cloned GitHub repository.
iii. Prepare a file specifying the SWATH window settings
An example file with SWATH window settings is provided in the data_library folder in the cloned GitHub repository (also see Table 3).
iv. Prepare a spectral library
a. Create a sample-specific spectral library according to the previously published protocol by Schubert et al. 26
b. Download a public library such as the combined human assay library that we used for our analysis here 31
wget -O data_library/spectrast2tsv.tsv "https://db.systemsbiology.net/sbeams/cgi/downloadFile.cgi?name=phl004_canonical_s64_osw.csv;format=tsv;tmp_file=8becf7ae782dd305c0eade59f282bcd1;raw_download=1"
32. Initialize the OpenSWATH docker container (see installation in Equipment Setup or follow instructions in https://github.com/CCprofiler/SECSWATH_PeptideCentricAnalysis/blob/master/SECSWATH_PeptideCentricAnalysis.sh)
docker attach openswath
33. Prepare the spectral library for OpenSWATH and PyProphet analysis
i. Convert library to .pqp file format recommended for OpenSWATH
TargetedFileConverter -in /data/data_library/spectrast2tsv.tsv \
-out /data/data_library/spectrast2tsv.pqp
ii. Generate decoys for scoring and FDR estimation in PyProphet
OpenSwathDecoyGenerator -in /data/data_library/spectrast2tsv.pqp \
-out /data/data_library/spectrast2tsv_td.pqp
34. Peptide-centric signal detection with OpenSWATH
i. Run OpenSwath on unfractionated input sample(s)
for file in /data/data_dia/unfractionated_secinput/*ML.gz; do \
bname=$(echo ${file##*/} | cut -f 1 -d '.'); \
OpenSwathWorkflow \
-in /data/data_dia/unfractionated_secinput/$bname.*ML.gz \
-tr /data/data_library/spectrast2tsv_td.pqp \
-tr_irt /data/data_library/irtkit.TraML \
-min_upper_edge_dist 1 \
-batchSize 1000 \
-out_osw /data/results/unfractionated_secinput/$bname.osw \
-Scoring:stop_report_after_feature 5 \
-rt_extraction_window 600 \
-mz_extraction_window 30 \
-ppm \
-threads 6 \
-use_ms1_traces \
-Scoring:Scores:use_ms1_mi \
-Scoring:Scores:use_mi_score ; done
ii. Run OpenSWATH on fractionated samples
for file in /data/data_dia/*ML.gz; do \
bname=$(echo ${file##*/} | cut -f 1 -d '.'); \
OpenSwathWorkflow \
-in /data/data_dia/$bname.*ML.gz \
-tr /data/data_library/spectrast2tsv_td.pqp \
-tr_irt /data/data_library/irtkit.TraML \
-min_upper_edge_dist 1 \
-batchSize 1000 \
-out_osw /data/results/$bname.osw \
-Scoring:stop_report_after_feature 5 \
-rt_extraction_window 600 \
-mz_extraction_window 30 \
-ppm \
-threads 6 \
-use_ms1_traces \
-Scoring:Scores:use_ms1_mi \
-Scoring:Scores:use_mi_score ; done
NOTE: OpenSWATH creates several warnings and errors that can be ignored when analyzing SEC-SWATH-MS datasets, including:
· Warning “windows were sparce” and/or “empty chromatogram”: Sparsity of certain windows is expected for some fractions, especially in the beginning and end of the SEC.
· Error “Transition does not have a corresponding chromatogram”
35. Peptide-centric scoring with PyProphet
a. Train Model: pyProphet analysis of unfractionated sample
pyprophet score \
--threads 6 \
--in=/data/results/unfractionated_secinput/unfractionated_secinput.osw \
--out=/data/results/unfractionated_secinput/model.osw \
--level=ms1ms2
b. Apply global model to score peak groups in all runs evenly
i. Scoring and plotting
for file in /data/results/*.osw; do \
bname=$(echo ${file##*/} | cut -f 1 -d '.'); \
pyprophet score --in=/data/results/$bname.osw \
--apply_weights=/data/results/unfractionated_secinput/model.osw \
--level=ms1ms2; done
ii. Exporting of output files
for file in /data/results/*.osw; do \
bname=$(echo ${file##*/} | cut -f 1 -d '.'); \
pyprophet export --in=/data/results/$bname.osw \
--out=/data/results/$bname.tsv \
--max_rs_peakgroup_qvalue=0.1 \
--no-transition_quantification \
--format=legacy_merged; done
NOTE: We advise checking manually that .tsv output files have actually been written for all runs.
iii. Plotting of all score distributions
for file in /data/results/*.osw; do \
bname=$(echo ${file##*/} | cut -f 1 -d '.'); \
pyprophet export --in=/data/results/$bname.osw \
--format=score_plots; done
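The manual check of the exported .tsv files recommended in step 35 can be scripted. The Python sketch below is a hypothetical helper and not part of the provided analysis scripts. It mirrors the basename rule of the shell loops (file name up to the first dot) and reports every .osw file without a non-empty .tsv export. The demo runs on a temporary directory with made-up file names; in practice the function would be pointed at the results folder.

```python
import tempfile
from pathlib import Path

def missing_tsv_exports(results_dir):
    """Return names of .osw files lacking a matching, non-empty .tsv export."""
    results = Path(results_dir)
    missing = []
    for osw in sorted(results.glob("*.osw")):
        # Same basename rule as the shell loops: cut at the first dot
        bname = osw.name.split(".", 1)[0]
        tsv = results / f"{bname}.tsv"
        if not tsv.is_file() or tsv.stat().st_size == 0:
            missing.append(osw.name)
    return missing

# Demo on a throwaway directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "frac01.osw").write_text("placeholder")
    (tmp_path / "frac01.tsv").write_text("header\n")
    (tmp_path / "frac02.osw").write_text("placeholder")  # export missing
    demo_missing = missing_tsv_exports(tmp_path)

print(demo_missing)
```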
36. TRIC based feature alignment across all SEC fractions
feature_alignment.py \
--in /data/results/*.tsv \
--out /data/results/feature_alignment.tsv \
--out_matrix /data/results/feature_alignment_matrix.tsv \
--method LocalMST \
--realign_method lowess \
--max_rt_diff 60 \
--mst:useRTCorrection True \
--mst:Stdev_multiplier 3.0 \
--target_fdr -1 \
--fdr_cutoff 0.05 \
--max_fdr_quality 0.1 \
--alignment_score 0.05
SEC-SWATH-MS data processing and complex-centric analysis in CCprofiler: TIMING 2 days
CRITICAL STEP: Part 3 of the PROCEDURE describes how to use the open-source CCprofiler R-package to extract information about the global proteome assembly state and specific protein complexes from co-fractionation MS experiments, here generated by SEC-SWATH-MS. The analysis includes: data import and pre-processing (Steps 34-37), automated parameter selection (Step 38), protein-centric analysis (Step 39) and complex-centric analysis (Step 40). All CCprofiler analysis steps are also provided as a supplementary R-script that performs the presented analysis based on the exemplary HEK293 SEC-SWATH-MS dataset. The R-script can easily be adapted to other datasets by changing the input files (Step 34-35). All exemplary data and the script are available on GitHub: https://github.com/CCprofiler/SECSWATH_ComplexCentricAnalysis (also see the Supplementary Manual).
To set up your work environment you can clone the GitHub repository by:
git clone https://github.com/CCprofiler/SECSWATH_ComplexCentricAnalysis.git
cd SECSWATH_ComplexCentricAnalysis
NOTE: Due to parallelization of some of the CCprofiler processing steps and involved random number generation that is beyond our control, the results of the workflow are subject to minor variation despite setting a seed value. If fully reproducible results are desired, only a single processing core should be selected. This is however connected to much longer processing times.
PAUSE POINT: The following computational analysis can essentially be paused at any point when a certain function is completed. Before closing R it is important to save the environment in order to resume the analysis at a later stage. For this, use the following commands:
save.image(file='CCprofiler_analysis.RData')
To resume your analysis, you can load the previous status of your R environment with the following command:
load(file='CCprofiler_analysis.RData')
34. Prepare data for CCprofiler import
Prepare all necessary data that needs to be loaded into R for the CCprofiler analysis. For convenience we recommend saving all input data in the same directory where you want to perform the analysis. All data necessary and used for this protocol are provided in the GitHub repository and will be available in the SECSWATH_ComplexCentricAnalysis folder after you cloned it (see above).
i. Prepare quantitative peptide-level data
A. Quantitative peptide matrix generated by OpenSWATH (as described in Part 2)
i. The output table from TRIC can directly be imported into CCprofiler (see ‘feature_alignment.tsv’ or ‘quantData_OpenSWATH.rds’ (already in R data format))
B. Quantitative peptide matrix generated by any software tool
i. Remove decoys
CAUTION: Decoys might be valuable for certain processing steps downstream (e.g. selecting a sibling peptide correlation based FDR cutoff). We have specifically tested the propagation of decoys for datasets processed by an OpenSWATH-based workflow. If other data processing tools have been used, the decoys should be treated with caution. To be on the conservative side, we would generally recommend removing the decoys.
ii. Remove non-proteotypic peptides
iii. Bring data in either long or wide format:
a. Required column names for long format: protein_id, peptide_id, filename and intensity (see ‘examplePCPdataLong.tsv’)
b. Required column names for wide format: protein_id, peptide_id, <filename1>, <filename2>, …, <filenameX>
(see ‘examplePCPdataWide.tsv’)
ii. In addition to the quantitative peptide matrices, CCprofiler requires a fraction annotation table that maps each filename to a given chromatographic fraction number. The required column names are: filename and fraction_number (see ‘exampleFractionAnnotation.tsv‘).
CAUTION: The filenames used in the fraction annotation table need to match the filenames in the quantitative matrix exactly. Further, the fraction_number entries need to start with 1 and continuously increase in integer steps of 1 until the last sampled fraction.
iii. For native complex separation via SEC, a molecular weight (MW) calibration table can be generated by measuring the apex fractions of an external standard set of reference proteins fractionated on the same SEC setup. By providing such a MW calibration table, CCprofiler can establish a transformation function based on the log-linear relationship between elution fractions and apparent MWs inherent to SEC, thus enabling the annotation of all sampled fractions with an apparent MW. The required column names in the calibration table are: std_weights_kDa and std_elu_fractions (see ‘exampleCalibrationTable.tsv’).
iv. CCprofiler can further annotate protein traces with additional information provided in a trace annotation table, e.g. adding the gene names or monomeric MW from UniProt (https://www.uniprot.org/) (see ‘exampleTraceAnnotation.tsv’). Adding information on monomeric MWs of the analyzed proteins is critical for the assignment of proteins to monomeric or complex-assembled state from SEC datasets with calibrated apparent MW and is required for the assessment of global proteome assembly states.
CAUTION: The protein_id column in the quantitative matrix needs to match one of the columns in the annotation table. Typically, the common entries are the UniProt identifiers.
v. Finally, a necessary component for downstream detection of protein complexes by complex-centric analysis (Step 40), is the selection of prior protein connectivity information which can be provided either in the form of defined protein complexes, e.g. as annotated in CORUM 41,46, or binary interaction networks generated by various approaches, as for example the BioPlex 1,2 or StringDB 42,43 networks.
A. Defined complex hypotheses
A table with defined complexes should contain the following columns: complex_id, complex_name and protein_id (see ‘corumComplexHypothesesRedundant.csv’).
B. Binary protein-protein interaction network
The format for a binary interaction network is a table with two columns: a and b. Both columns contain protein identifiers and each row represents a binary connection (an ‘edge’) in the interaction network (see ‘BioPlexPPIs.tsv’).
CAUTION: The protein_id / a & b entries need to correspond to the protein_id in the quantitative matrix, e.g. UniProt identifiers.
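The requirements on the fraction annotation table in step 34.ii (exact filename match and fraction numbers running from 1 to N in steps of 1) can be checked before import. The Python sketch below is a hypothetical stand-alone check, not a CCprofiler function. It takes the table as a list of (filename, fraction_number) pairs, and the example file names are made up.

```python
def check_fraction_annotation(annotation_rows, quant_filenames):
    """Return a list of problems found in a fraction annotation table.

    annotation_rows: list of (filename, fraction_number) tuples.
    quant_filenames: filenames present in the quantitative peptide matrix.
    """
    problems = []
    annotated = [filename for filename, _ in annotation_rows]
    numbers = sorted(number for _, number in annotation_rows)

    # fraction_number must be 1, 2, ..., N without gaps or duplicates
    if numbers != list(range(1, len(numbers) + 1)):
        problems.append("fraction_number must run from 1 to N in steps of 1")

    # Filenames must match exactly in both directions
    not_annotated = sorted(set(quant_filenames) - set(annotated))
    if not_annotated:
        problems.append(f"not annotated: {not_annotated}")
    not_quantified = sorted(set(annotated) - set(quant_filenames))
    if not_quantified:
        problems.append(f"not in quantitative matrix: {not_quantified}")
    return problems

# Hypothetical example with a gap in the fraction numbering
rows = [("f1.mzXML", 1), ("f2.mzXML", 2), ("f3.mzXML", 4)]
files = ["f1.mzXML", "f2.mzXML", "f3.mzXML"]
print(check_fraction_annotation(rows, files))
```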
35. Load input tables into R and inspect
i. Load libraries in R
library(data.table)
library('CCprofiler')
ii. Set working directory to the location where all files are stored.
setwd("SECSWATH_ComplexCentricAnalysis")
iii. Load and inspect the quantitative peptide matrix
i. Quantitative peptide-level data generated by OpenSWATH (as described in Part 2)
quantData_OpenSWATH <- readRDS("quantData_OpenSWATH.rds")
ii. Quantitative peptide matrix generated by any software tool
a. Long format
quantData_long <- fread("examplePCPdataLong.tsv")
head(quantData_long)
b. Wide format
quantData_wide <- fread("examplePCPdataWide.tsv")
head(quantData_wide[,1:5])
iv. Load and inspect fraction annotation table
fractionAnnotation <- fread("exampleFractionAnnotation.tsv")
head(fractionAnnotation)
v. Load and inspect calibration table
calibrationTable <- fread("exampleCalibrationTable.tsv")
calibrationTable
vi. Load and inspect trace annotation table
uniprotAnnotation <- fread("exampleTraceAnnotation.tsv")
head(uniprotAnnotation)
vii. Load and inspect protein connectivity information
i. Defined complex hypotheses from the Corum database
corumComplexes <- fread("corumComplexHypothesesRedundant.csv")
head(corumComplexes)
ii. Binary protein-protein interaction network from BioPlex (v1.0 1, http://bioplex.hms.harvard.edu )
BioPlexPPIs <- fread("BioPlexPPIs.tsv")
head(BioPlexPPIs)
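For users preparing their own peptide matrices, the relation between the long and wide input formats from step 34 can be illustrated outside of R. The Python sketch below is a hypothetical conversion and not part of CCprofiler. Filling absent filename entries with 0 is our assumption, and the example values are made up.

```python
def long_to_wide(rows):
    """Pivot long-format records into a wide peptide-by-filename table.

    rows: list of dicts with keys protein_id, peptide_id, filename, intensity.
    Returns (sorted filenames, {(protein_id, peptide_id): {filename: intensity}}).
    """
    filenames = sorted({row["filename"] for row in rows})
    wide = {}
    for row in rows:
        key = (row["protein_id"], row["peptide_id"])
        # Assumption: a peptide without a record for a file gets intensity 0
        wide.setdefault(key, dict.fromkeys(filenames, 0.0))
        wide[key][row["filename"]] = row["intensity"]
    return filenames, wide

# Hypothetical example records
records = [
    {"protein_id": "P25786", "peptide_id": "PEPA", "filename": "f1", "intensity": 10.0},
    {"protein_id": "P25786", "peptide_id": "PEPA", "filename": "f2", "intensity": 20.0},
    {"protein_id": "P25786", "peptide_id": "PEPB", "filename": "f2", "intensity": 5.0},
]
filenames, wide = long_to_wide(records)
print(filenames)
print(wide[("P25786", "PEPB")])
```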
36. Import peptide level data into CCprofiler traces format and annotate
i. Import quantitative peptide matrix as traces object
The traces object is the main data class used in the CCprofiler package. It stores the quantitative profiles (‘traces’) of peptide or protein intensities across the analyzed chromatographic fractions. Additionally, a traces object can store specific information about each of the peptides, proteins and chromatographic fractions. As the analysis proceeds more information will be added to the traces object.
i. Quantitative peptide level data generated by OpenSWATH
pepTraces <- importFromOpenSWATH(data = quantData_OpenSWATH,
annotation_table = fractionAnnotation,
verbose = FALSE)
ii. Quantitative peptide matrix generated by any software tool
NOTE: CCprofiler will automatically detect if peptide tables are in long or wide format.
a) Long format
pepTraces_exampleSubset_long <- importPCPdata(input_data = quantData_long,
fraction_annotation = fractionAnnotation,
rm_decoys = FALSE)
b) Wide format
pepTraces_exampleSubset_wide <- importPCPdata (input_data = quantData_wide,
fraction_annotation = fractionAnnotation,
rm_decoys = FALSE )
ii. Perform molecular weight calibration and annotation
i. Perform molecular weight calibration based on a provided calibration_table (Figure 3A):
calibration = calibrateMW(calibration_table = calibrationTable,
PDF = plotPDF)
ii. Annotate traces with the apparent molecular weight associated with each SEC fraction as extrapolated from the standard protein molecular weights and associated elution fraction numbers:
pepTraces <- annotateMolecularWeight(
traces = pepTraces,
calibration = calibration)
CAUTION: Apparent molecular weight calibration is of limited accuracy. Inherent to the analytical procedure, analyte shape and propensity for unintended secondary interactions with the stationary phase affect the elution volume/fraction number and thus the inferred apparent molecular weight. Predictions, especially those outside the elution range of the standard proteins, should be interpreted with caution.
iii. Annotate traces with information from UniProt
pepTraces <- annotateTraces(traces = pepTraces,
trace_annotation = uniprotAnnotation,
traces_id_column = "protein_id",
trace_annotation_id_column = "Entry")
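The log-linear relationship between elution fraction and apparent MW that underlies the calibration in step 36.ii can be illustrated with a least-squares fit of log10(MW) against the apex fraction of each standard. The Python sketch below is a conceptual illustration and not the implementation of calibrateMW, whose internals may differ. The standard weights and fractions are made-up values chosen to lie exactly on a line.

```python
import math

def fit_log_linear(std_weights_kda, std_elu_fractions):
    """Least-squares fit of log10(MW) = intercept + slope * fraction."""
    x = list(std_elu_fractions)
    y = [math.log10(weight) for weight in std_weights_kda]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def fraction_to_mw(fraction, slope, intercept):
    """Apparent MW (kDa) predicted for a given fraction number."""
    return 10 ** (intercept + slope * fraction)

# Hypothetical standards: MW in kDa and apex fraction on the same SEC setup
std_weights_kDa = [1000.0, 100.0, 10.0]
std_elu_fractions = [10, 30, 50]

slope, intercept = fit_log_linear(std_weights_kDa, std_elu_fractions)
print(round(fraction_to_mw(20, slope, intercept), 1))
```

Because larger analytes elute earlier, the fitted slope is negative. Fractions outside the range covered by the standards are extrapolated and are therefore less reliable.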
37. Pre-process traces object to increase data quality
i. Optional: Detect and impute missing values
In most proteomics pipelines, zero intensity values indicate either that the signal is missing at random (no detection due to technical reasons such as interferences from other peptides) or missing not at random (no detection due to cellular concentrations below the detection limit). We suggest that a zero value is likely missing at random in case a quantitative (non-zero) signal has been detected in both the previous and following fraction. The detected missing at random values are subsequently imputed by a spline fit across the fractionation dimension.
1. Convert zeros in missing at random value locations to NA:
pepTracesMV <- findMissingValues(traces = pepTraces,
bound_left = 2,
bound_right = 2,
consider_borders = FALSE)
2. Impute NA values by fitting a spline:
pepTracesImp <- imputeMissingVals(
traces = pepTracesMV,
method = "spline")
3. Plot imputation summary:
plotImputationSummary(
traces = pepTracesMV,
tracesImp = pepTracesImp,
max_n_traces = 5,
PDF = plotPDF)
NOTE: In the original complex-centric study of the HEK293 proteome 15 no missing values were imputed. Generally, quantitative matrices from SWATH-MS, particularly with TRIC alignment, display only few missing values and imputation thus has little influence in such datasets. However, imputation improves overall workflow robustness and flexibility for different input data types. For example, loss of data from an entire SEC fraction due to failed MS acquisition can robustly be compensated by imputation rather than re-analysis of the fraction or repeat of the entire experiment. Further, missing value imputation should improve the interpretability of datasets affected by more missing values, e.g. when acquired via classical data-dependent mass spectrometry.
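The rule for detecting values missing at random in step 37.i can be sketched on a single trace. The Python code below is a simplified illustration and not the CCprofiler implementation: a zero is flagged when the required number of directly neighbouring fractions on each side are non-zero, and flagged positions are filled by averaging the two direct neighbours (linear interpolation, used here in place of the spline fit). Function names and the example trace are hypothetical.

```python
def find_missing(trace, bound_left=1, bound_right=1):
    """Return indices of zeros flanked by non-zero values on both sides."""
    flagged = []
    for i, value in enumerate(trace):
        if value != 0:
            continue
        left = trace[max(0, i - bound_left):i]
        right = trace[i + 1:i + 1 + bound_right]
        # Require complete, fully non-zero flanks (zeros at the borders are skipped)
        if (len(left) == bound_left and len(right) == bound_right
                and all(v > 0 for v in left) and all(v > 0 for v in right)):
            flagged.append(i)
    return flagged

def impute_linear(trace, flagged):
    """Replace flagged positions by the mean of the two direct neighbours."""
    imputed = list(trace)
    for i in flagged:
        imputed[i] = (trace[i - 1] + trace[i + 1]) / 2
    return imputed

# Hypothetical peptide trace across eight fractions
trace = [0, 10, 0, 14, 20, 0, 0, 5]
flagged = find_missing(trace)
print(flagged)
print(impute_linear(trace, flagged))
```

In this example only the isolated zero at index 2 is flagged. The two consecutive zeros and the zero at the border are left untouched, as they are more likely to reflect true absence of signal.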
ii. Filter peptides by consecutive peptide detection
Peptides that have never been detected in more than N consecutive fractions, here N=2, are removed from the traces object. This effectively removes false positive peptide detections from the dataset.
pepTracesConsIds <- filterConsecutiveIdStretches(
traces = pepTracesImp,
min_stretch_length = 3,
remove_empty = TRUE)
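The consecutive detection filter in step 37.ii keeps a peptide only if it was detected in a sufficiently long uninterrupted run of fractions. The Python sketch below illustrates the criterion on plain lists; it is not the CCprofiler implementation, and the names and example traces are hypothetical.

```python
def longest_stretch(trace):
    """Length of the longest run of consecutive non-zero intensities."""
    best = current = 0
    for value in trace:
        current = current + 1 if value > 0 else 0
        best = max(best, current)
    return best

def filter_consecutive(traces, min_stretch_length=3):
    """Keep peptides detected in at least min_stretch_length consecutive fractions."""
    return {peptide: trace for peptide, trace in traces.items()
            if longest_stretch(trace) >= min_stretch_length}

# Hypothetical peptide traces across five fractions
traces = {
    "pep1": [0, 5, 6, 7, 0],  # three consecutive detections: kept
    "pep2": [3, 0, 4, 0, 2],  # only isolated detections: removed
}
print(sorted(filter_consecutive(traces)))
```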
iii. Select high-quality proteins based on their average sibling peptide correlation
i. Calculate the average sibling peptide correlation (SPC) for each peptide
For each peptide, the average pairwise correlation with the quantitative traces of its sibling peptides, i.e. peptides derived from the same protein, is calculated (Figure 3B).
pepTracesSibPepCorr <- calculateSibPepCorr(
traces = pepTracesConsIds,
PDF = plotPDF)
ii. Filter by SPC
Peptides below a minimum average SPC cutoff are removed. The rationale is that outlier peptides, as well as proteins with very heterogeneous quantitative peptide traces, are excluded from further analysis. The filtering cutoff can either be determined automatically by a target-decoy based FDR estimation approach (a), or a fixed cutoff can be applied (b):
a. SPC based FDR cutoff (Figure 3C)
A conservative fraction of false targets (FFT) can be estimated from the unfractionated SEC input sample that was also used to train the PyProphet model for peptide-centric analysis. This is conservative because we expect to see cumulatively more proteins in the SEC fractions than in the single unfractionated input sample. The estimated pi0 ~ FFT is reported in the protein-level pdf report. For this dataset the FFT was estimated to be 0.491.
estimatedFFT <- 0.491
Filter by FDR cutoff using the estimated FFT:
pepTraces_filtered_FDR <- filterBySibPepCorr(
traces = pepTracesSibPepCorr,
fdr_cutoff = 0.01,
FFT = estimatedFFT,
rm_decoys = TRUE,
PDF = plotPDF)
CAUTION: This option is only valid if you have continuously kept decoys in your analysis. The most conservative strategy is then to apply an FFT of 1. However, if you have an FFT estimate available, using it will significantly boost your sensitivity and result in a higher number of remaining proteins for the downstream analysis. We have specifically tested this option for datasets processed by an OpenSWATH-based workflow. If other data processing tools have been used, the decoy based FDR estimation on SEC level should be treated with caution.
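The effect of the FFT on a decoy-based FDR estimate can be sketched as follows. This assumes one common formulation, in which the decoy pass rate at a cutoff is scaled by the FFT and divided by the target pass rate; whether filterBySibPepCorr computes exactly this quantity is not confirmed here. The helper estimate_fdr and all SPC values are hypothetical.

```r
# Hypothetical helper (not part of CCprofiler): decoy-based FDR estimate at a
# score cutoff, scaled by the fraction of false targets (FFT).
estimate_fdr <- function(target_scores, decoy_scores, cutoff, FFT = 1) {
  target_rate <- mean(target_scores >= cutoff)
  decoy_rate <- mean(decoy_scores >= cutoff)
  if (target_rate == 0) return(NA_real_)
  min(1, FFT * decoy_rate / target_rate)
}

# Toy SPC values, invented for illustration only.
target_spc <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.2, 0.1, 0.0)
decoy_spc <- c(0.6, 0.3, 0.2, 0.1, 0.1, 0.0, 0.0, -0.1)

fdr_conservative <- estimate_fdr(target_spc, decoy_spc, cutoff = 0.5, FFT = 1)
fdr_with_fft <- estimate_fdr(target_spc, decoy_spc, cutoff = 0.5, FFT = 0.491)
```

At the same cutoff, the estimate obtained with FFT = 0.491 is roughly half of the conservative estimate obtained with FFT = 1, which is why a given FDR threshold is reached at a more permissive SPC cutoff.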
b. Absolute sibling peptide correlation cutoff
pepTraces_filtered_absoluteCutoff <- filterBySibPepCorr(
traces = pepTracesSibPepCorr,
fdr_cutoff = NULL,
absolute_spcCutoff = 0.25,
rm_decoys = TRUE,
PDF = plotPDF)
iv. Inspect resulting peptide-level traces object
i. Summary statistics
summary(pepTraces_filtered_FDR)
ii. Plot some example traces, here the exemplary visualization of the Proteasome subunit alpha type-1 (UniProt ID = P25786, Figure 3E)
test_protein <- c("P25786")
test_peptide_traces <- subset(
traces = pepTraces_filtered_FDR,
trace_subset_ids = test_protein,
trace_subset_type = "protein_id")
plot(test_peptide_traces,
PDF = plotPDF,
name = paste0("pepTraces_",test_protein))
v. Protein quantification
i. Perform protein quantification by selecting the top N, here N=2, peptides based on their global intensity across all fractions.
protTraces <- proteinQuantification(pepTraces_filtered_FDR,
topN = 2,
keep_less = FALSE)
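The top-N principle can be illustrated in base R. The sketch assumes that the selected peptide traces are summed per fraction and that, with keep_less = FALSE, proteins with fewer than N peptides are dropped; both points are assumptions about the behaviour of proteinQuantification, and the helper quantify_top_n is hypothetical.

```r
# Hypothetical helper (not part of CCprofiler): sum the traces of the top N
# peptides, ranked by total intensity across all fractions.
quantify_top_n <- function(peptide_matrix, topN = 2, keep_less = FALSE) {
  n_peptides <- nrow(peptide_matrix)
  if (n_peptides < topN && !keep_less) return(NULL)  # protein is dropped
  totals <- rowSums(peptide_matrix)
  top <- order(totals, decreasing = TRUE)[seq_len(min(topN, n_peptides))]
  colSums(peptide_matrix[top, , drop = FALSE])
}

# Toy intensities across four fractions, invented for illustration only.
toy_protein <- rbind(
  pep1 = c(0, 10, 30, 10),  # total 50
  pep2 = c(0, 4, 12, 4),    # total 20
  pep3 = c(1, 2, 6, 2)      # total 11
)

protein_trace <- quantify_top_n(toy_protein, topN = 2)
single_peptide <- quantify_top_n(toy_protein["pep1", , drop = FALSE], topN = 2)
```

Here the protein trace is built from pep1 and pep2 only, and a protein represented by a single peptide yields no trace under these settings.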
ii. Inspect summary statistics of the resulting protein traces
summary(protTraces)
CRITICAL STEP: Compare the number of remaining proteins to the number of proteins on the peptide level traces. If the number of proteins is dramatically reduced during the protein quantification step, many proteins might have been detected by a single peptide only. Careful consideration is necessary to decide whether you want to trust such single peptide hits and include them in your downstream analysis by reducing the quantification criteria.
iii. Visualize and inspect example protein traces
Exemplary visualization of the Proteasome subunit alpha type-1 (UniProt ID = P25786)
test_protein_traces <- subset(
traces = protTraces,
trace_subset_ids = test_protein,
trace_subset_type = "protein_id")
plot(test_protein_traces,
colour_by = "Entry_name",
PDF = plotPDF,
name = paste0("protTraces_",test_protein))
vi. Overall workflow QC to evaluate the global proteome assembly state
The protein-level profiles can then be used to estimate the overall complex assembly state observed in the sample as a quality control to ensure the successful extraction and profiling of largely intact complexes. Here, we evaluate the total MS signal in assembled vs. monomeric range (Figure 3D).
summarizeMassDistribution(protTraces,
PDF = plotPDF)
38. Automatically identify optimal processing parameters based on a protein-level parameter grid search
A grid search can be performed to determine an optimal set of parameters for the protein- and/or complex-centric proteome profiling workflow. This optimal parameter set depends mostly on the co-fractionation characteristics and MS setup.
i. Randomly select a subset of proteins for the grid search
The selected subset of proteins should be representative of the proteome, thereby providing a trade-off between coverage and computational run-time. In our experience, selecting fewer than 100 proteins compromises robustness, while more than 500 proteins require a lot of processing time. We therefore propose a random selection of ~500 proteins.
# Draw the random subset from the same traces object that is subset below
all_proteins <- unique(pepTraces_filtered_FDR$trace_annotation$protein_id)
# Guard against datasets with fewer than 500 proteins
testProteins <- sample(all_proteins, min(500, length(all_proteins)))
peptideTracesSubset <- subset(
traces = pepTraces_filtered_FDR,
trace_subset_ids = testProteins,
trace_subset_type = "protein_id")
ii. Perform parameter grid search
The grid search performs peptide co-elution peak group finding for each selected combination of parameters, with the goal of determining a good parameter set for the following analyses. Please note that the selection of suitable parameter values for the grid search is critical.
gridFeatures <- performProteinGridSearch(
traces = peptideTracesSubset,
corrs = c(0.9,0.95),
windows = c(8,10),
smoothing = c(7,9),
rt_heights = c(1,3),
n_cores = 3)
CRITICAL: The parameter values included in the grid search strongly influence its outcome. Guidelines for the selection of reasonable parameters are discussed in Box 2.
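To judge the run-time of a planned grid before starting it, the number of parameter combinations can be counted in base R. The sketch assumes that every combination of the supplied values is evaluated (a full factorial grid); the column names are chosen for illustration only.

```r
# All combinations of the parameter values used in the call above,
# assuming a full factorial grid.
parameter_grid <- expand.grid(
  corr = c(0.9, 0.95),
  window = c(8, 10),
  smoothing = c(7, 9),
  rt_height = c(1, 3)
)

n_combinations <- nrow(parameter_grid)  # 2 * 2 * 2 * 2 = 16
```

Each additional value for one parameter multiplies the number of feature-finding runs, so the grid should be extended one parameter at a time.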
iii. Score protein features across all grid search parameters and select the best parameter set
gridFeatures_scored <- lapply(gridFeatures,
calculateCoelutionScore)
gridFeatures_qvalues <- lapply(gridFeatures_scored,
calculateQvalue,
plot = FALSE)
gridFeatures_stats <- qvaluePositivesPlotGrid(
featuresGrid = gridFeatures_qvalues,
colour_parameter = "corr",
PDF = plotPDF)
bestParameters <- getBestQvalueParameters(
stats = gridFeatures_stats,
FDR_cutoff = 0.05)
bestParameters
write.table(bestParameters,
"bestParameters.tsv",
sep = "\t",
quote = FALSE,
row.names = FALSE)
CRITICAL: Inspect the pseudo-ROC curves generated by the grid search (Figure 4A). Optimal parameters lie in the upper left corner of the observed distribution. Parameters that are consistently in the upper left corner are especially important.
39. Perform protein-centric analysis
Protein-centric analysis detects peptide co-elution peak groups along the chromatographic dimension. Each detected peak (‘protein feature’) represents the protein in a specific assembly state, i.e. monomeric or bound to different protein complexes.
i. Perform protein feature finding
proteinFeatures <- findProteinFeatures(
traces = pepTraces_filtered_FDR,
corr_cutoff = bestParameters$corr,
window_size = bestParameters$window,
rt_height = bestParameters$rt_height,
smoothing_length = bestParameters$smoothing_length,
collapse_method = "apex_only",
perturb_cutoff = "5%",
parallelized = TRUE,
useRandomDecoyModel = TRUE)
ii. Score detected protein features and estimate FDR
proteinFeatures_scored <- scoreFeatures(
features = proteinFeatures,
FDR = 0.05,
PDF = plotPDF)
write.table(proteinFeatures_scored,
"proteinFeatures_scored.tsv",
sep = "\t",
quote = FALSE,
row.names = FALSE)
CRITICAL STEP: Inspect the p-value density histogram (Figure 4B/C). There should be a high peak close to zero and a uniform distribution across all other p-values.
TROUBLESHOOTING
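For orientation, the expected histogram shape can be reproduced with simulated data. The sketch below uses invented p-values only (a uniform null component plus a component concentrated near zero) and is unrelated to the scoring performed by scoreFeatures; the mixture proportions and the beta distribution parameters are arbitrary assumptions.

```r
set.seed(1)

# Simulated p-values, for illustration only
p_null <- runif(800)             # false features: uniform on [0, 1]
p_true <- rbeta(200, 0.2, 5)     # true features: concentrated near zero
p_values <- c(p_null, p_true)

# Counts per 0.05-wide bin, as shown in a p-value density histogram
counts <- hist(p_values, breaks = seq(0, 1, length.out = 21),
               plot = FALSE)$counts
```

In a well-behaved dataset the first bin clearly exceeds the roughly flat level of the remaining bins. A histogram without an enrichment near zero, or with a slope across the whole range, indicates a problem with the scoring or the decoy model.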
iii. Inspect summary statistics on resulting protein features
The resulting figures provide information about the number of unique assembly states detected for all the proteins as well as about the number of proteins with at least one assembled protein signal (MW ≥ 2x monomeric MW in SEC) (Figure 4D).
summarizeFeatures(feature_table = proteinFeatures_scored,
PDF = plotPDF,
name = "proteinFeatures_summary")
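The definition of an assembled protein signal given above (MW ≥ 2x monomeric MW) can be expressed as a simple comparison. The helper is_assembled and the toy feature table, including its column names, are hypothetical and do not reflect the column layout of the CCprofiler feature table.

```r
# Hypothetical helper (not part of CCprofiler): flag features whose apparent
# MW is at least `factor` times the monomeric MW of the protein.
is_assembled <- function(apparent_mw, monomer_mw, factor = 2) {
  apparent_mw >= factor * monomer_mw
}

# Toy features (MW in kDa), invented for illustration only.
toy_features <- data.frame(
  protein_id = c("protA", "protA", "protB"),
  apparent_mw = c(30, 700, 55),
  monomer_mw = c(29.5, 29.5, 50),
  stringsAsFactors = FALSE
)
toy_features$assembled <- is_assembled(toy_features$apparent_mw,
                                       toy_features$monomer_mw)

# Proteins with at least one assembled signal
assembled_proteins <- unique(toy_features$protein_id[toy_features$assembled])
```

In this example protA has two assembly states, one monomeric and one assembled, whereas protB is observed in the monomeric range only.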
iv. Visualize and inspect protein features (Figure 4E)
plotFeatures(feature_table = proteinFeatures_scored,
traces = pepTraces_filtered_FDR,
calibration = calibration,
feature_id = test_protein,
annotation_label = "Entry_name",
onlyBest = FALSE,
peak_area = TRUE,
monomer_MW = TRUE,
PDF = plotPDF,
name = paste0("protFeatures_",test_protein))
Plot all detected proteins
allDetectedProteins <- unique(proteinFeatures_scored$protein_id)
pdf("allDetectedProteins.pdf", height = 6, width = 8)
for (protein in allDetectedProteins) {
plotFeatures(feature_table = proteinFeatures_scored,
traces = pepTraces_filtered_FDR,
calibration = calibration,
feature_id = protein,
annotation_label = "Entry_name",
onlyBest = FALSE,
peak_area = TRUE,
monomer_MW = TRUE,
PDF = FALSE)
}
dev.off()
CRITICAL STEP: Inspect some detected protein features and evaluate if the detected peak groups correspond to what you would have also selected as peak groups during manual inspection.
TROUBLESHOOTING
40. Complex-centric analysis
Complex feature finding represents the central step of complex-centric analysis using CCprofiler. Based on prior protein interaction data and quantitative fractionation profiles, CCprofiler detects groups or subgroups of locally co-eluting proteins, indicating the presence of protein-protein complexes in the biological sample. Target complex queries are supplemented with decoy complex queries to support error control of the reported results. The result is a table summarizing the presence and composition of protein-protein complexes in the biological sample analyzed.
i. Prepare target complex queries
There are two options for protein complex target generation in CCprofiler: (a) use defined protein complex models for direct use as queries (2 or more subunits, e.g. from CORUM) or (b) use a protein-protein interaction network from which target complex queries can be extracted.
a) Inspect the coverage of pre-defined protein complex queries from the previously loaded CORUM database (Figure 5A)
plotSummarizedMScoverage(hypotheses = corumComplexes,
protTraces = protTraces,
PDF = plotPDF,
name_suffix = "CORUM")
b) Generate and inspect protein complex queries from binary PPI networks, here based on BioPlex
i. Calculate pairwise distances between any two proteins in the interaction network
pathLengthBioPlexPPIs <- calculatePathlength(BioPlexPPIs)
ii. Generate protein complex targets by grouping proteins based on a user-defined distance cutoff. Here we consider only direct neighbours of each protein.
networkTargetsBioPlexPPIs <- generateComplexTargets(dist_info = pathLengthBioPlexPPIs,
max_distance = 1,
redundancy_cutoff = 0)
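The idea of grouping each protein with its direct neighbours can be sketched in base R on a toy edge list. The helper neighbour_queries and the network are hypothetical; the sketch does not reproduce generateComplexTargets or its redundancy handling.

```r
# Hypothetical helper (not part of CCprofiler): one complex query per protein,
# consisting of the protein and its direct interaction partners.
neighbour_queries <- function(edges) {
  proteins <- sort(unique(c(edges$a, edges$b)))
  queries <- lapply(proteins, function(p) {
    partners <- c(edges$b[edges$a == p], edges$a[edges$b == p])
    sort(unique(c(p, partners)))
  })
  names(queries) <- proteins
  queries
}

# Toy binary PPI network, invented for illustration only.
toy_edges <- data.frame(
  a = c("P1", "P1", "P2", "P4"),
  b = c("P2", "P3", "P3", "P5"),
  stringsAsFactors = FALSE
)

toy_queries <- neighbour_queries(toy_edges)
```

The fully connected triangle P1, P2, P3 yields the same query three times, which illustrates why redundancy between queries has to be considered when targets are derived from a network.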
iii. Inspect newly generated protein complex queries
head(networkTargetsBioPlexPPIs)
plotSummarizedMScoverage(
hypotheses = networkTargetsBioPlexPPIs,
protTraces = protTraces,
PDF = plotPDF,
name_suffix = "BioPlex")
CRITICAL:
· It is essential that the chosen protein complex queries match the experimental dataset. Therefore, inspect the protein and complex coverage pie charts (Figure 5A). We recommend that at least half of the proteins and protein complexes represented in the complex query set should be (partially) detected in the experiment.
· One critical question during complex query generation is how to handle redundancies, i.e. protein complex queries that partially or fully overlap. Due to the complex-centric scoring functions in CCprofiler, we recommend also keeping protein complex subsets in the target queries. Instead of merging or removing overlapping queries at this stage, we recommend collapsing detected complex signals at Step 40vi.
· If you are especially interested in some protein complexes that are not present in any available database, you can manually append these complexes to a generated target query list. It is important to keep in mind that the target query list should always contain at least around 1000 complexes in order to ensure robust decoy based FDR estimation and sensitive detection rates. If fewer complex queries are selected, feature finding can still be performed, but decoy generation and FDR estimation are not applicable.
TROUBLESHOOTING
ii. Prepare decoy complex queries
binaryCorumComplexes <- generateBinaryNetwork(corumComplexes)
pathLengthCorumComplexes <- calculatePathlength(binaryCorumComplexes)
corumComplexesPlusDecoys <- generateComplexDecoys(
target_hypotheses = corumComplexes,
dist_info = pathLengthCorumComplexes,
min_distance = 2,
append = TRUE)
TROUBLESHOOTING
CRITICAL: Decoy complex queries are generated based on the target complex query set and its underlying network structure. The minimum distance specifies the minimal number of edges between any two proteins within any generated decoy complex query. It is important that the interaction network based on the targets is large enough to generate a random decoy set that does not overlap with the target complex queries. We recommend complex query sets of at least 1000 targets for the decoy based approach.
iii. Perform complex feature finding
Protein complex features are determined similarly to the protein features described above. First, a sliding window strategy is applied, where all proteins of a protein complex hypothesis are tested for local profile correlation. If a subset of the proteins within a protein complex hypothesis correlates better than the specified cutoff, a protein complex feature is initiated, followed by peak detection within the regions of high correlation.
complexFeatures <- findComplexFeatures(
traces = protTraces,
complex_hypothesis = corumComplexesPlusDecoys,
corr_cutoff = bestParameters$corr,
window_size = bestParameters$window,
rt_height = bestParameters$rt_height,
smoothing_length = bestParameters$smoothing_length,
collapse_method = "apex_network",
perturb_cutoff = "5%",
parallelized = TRUE,
n_cores = 3)
CRITICAL: If no parameter selection was performed on the protein-centric level you can also do a complex level grid search 15.
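The sliding window principle can be illustrated for a single pair of protein traces. The base R sketch below is a pairwise simplification with a hypothetical helper (sliding_window_correlation) and toy traces; findComplexFeatures evaluates subsets of all proteins in a hypothesis and additionally performs peak detection, which is not shown here.

```r
# Hypothetical helper (not part of CCprofiler): Pearson correlation of two
# traces within every window of `window_size` consecutive fractions.
sliding_window_correlation <- function(x, y, window_size) {
  starts <- seq_len(length(x) - window_size + 1)
  vapply(starts, function(s) {
    idx <- s:(s + window_size - 1)
    # correlation is undefined for constant windows; report NA instead
    if (sd(x[idx]) == 0 || sd(y[idx]) == 0) return(NA_real_)
    cor(x[idx], y[idx])
  }, numeric(1))
}

# Toy protein traces: both share a peak at fraction 4; protB has a second,
# independent peak at fraction 9.
protA <- c(0, 1, 4, 9, 4, 1, 0, 0, 0, 0, 0, 0)
protB <- c(0, 2, 8, 18, 8, 2, 0, 3, 7, 3, 0, 0)

local_corr <- sliding_window_correlation(protA, protB, window_size = 5)
coeluting_windows <- which(local_corr >= 0.9)  # analogous to corr_cutoff
```

Only the windows covering the shared peak pass the cutoff, although the two traces differ over the full elution range. This local evaluation is what allows a protein to be assigned to a complex in one region and to remain unassigned in another.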
i. Filter complex features according to their apparent molecular weight, removing protein complex features that elute at an apparent molecular weight lower than any of the monomeric molecular weights of its subunits.
complexFeaturesFilteredMW <- filterFeatures(
feature_table = complexFeatures,
min_monomer_distance_factor = 2)
ii. Select only the best complex feature, i.e. the complex signal with most subunits and highest correlation. This step is necessary prior to the statistical scoring, because individual elution peaks are not independent.
complexFeaturesBest <- getBestFeatures(
feature_table = complexFeaturesFilteredMW)
complexFeaturesBest_scored <- scoreFeatures(
features = complexFeaturesBest,
FDR = 0.05,
PDF = plotPDF,
name = "complex_qvalueStats")
summarizeFeatures(complexFeaturesBest_scored,
PDF = plotPDF,
name = "complexFeaturesBest_feature_summary")
CRITICAL STEP: Inspect the p-value density histogram (Figure 5B/C). There should be a high peak close to zero and a uniform distribution across all other p-values.
TROUBLESHOOTING
iii. Append secondary features based on a user defined local subunit correlation cutoff, here 0.5.
complexFeaturesAll <- appendSecondaryComplexFeatures(
scoredPrimaryFeatures = complexFeaturesBest_scored,
allFeatures = complexFeaturesFilteredMW,
peakCorr_cutoff = 0.5)
write.table(complexFeaturesAll,
"complexFeaturesAll.tsv",
sep = "\t",
quote = FALSE,
row.names = FALSE)
iv. Inspect summary statistics on resulting complex features (Figure 5D)
summarizeFeatures(complexFeaturesAll,
PDF = plotPDF,
name = "complexFeaturesAll_feature_summary")
plotSummarizedComplexes(
complexFeatures = complexFeaturesAll,
hypotheses = corumComplexes,
protTraces = protTraces,
PDF = plotPDF)
v. Visualize and inspect detected complex features (Figure 5E)
testComplex <- "181"
plotFeatures(feature_table = complexFeaturesAll,
traces = protTraces,
calibration = calibration,
feature_id = testComplex,
annotation_label = "Entry_name",
onlyBest = FALSE,
peak_area = TRUE,
monomer_MW = TRUE,
PDF = plotPDF,
name = paste0("complexFeatures_",testComplex))
Plot all detected complexes
allDetectedComplexes <- unique(complexFeaturesAll$complex_id)
pdf("allDetectedComplexes.pdf", height = 6, width = 8)
for (complex in allDetectedComplexes) {
plotFeatures(feature_table = complexFeaturesAll,
traces = protTraces,
calibration = calibration,
feature_id = complex,
annotation_label = "Entry_name",
onlyBest = FALSE,
peak_area = TRUE,
monomer_MW = TRUE,
PDF = FALSE)
}
dev.off()
CRITICAL STEP: Inspect some detected complex features and evaluate if the detected peak groups correspond to what you would have also selected as peak groups during manual inspection.
TROUBLESHOOTING
vi. Collapse overlapping and redundant co-elution evidence to delineate complexes and complex families with defined co-elution of subunits in SEC
complexFeaturesUnique <- getUniqueFeatureGroups(
feature_table = complexFeaturesBest_scored,
rt_height = 0,
distance_cutoff = 1.25)
complexFeaturesCollapsed <- callapseByUniqueFeatureGroups(
feature_table = complexFeaturesUnique,
rm_decoys = TRUE)
write.table(complexFeaturesCollapsed,
"complexFeaturesCollapsed.tsv",
sep = "\t",
quote = FALSE,
row.names = FALSE)
CRITICAL STEP: To retrieve unique, non-redundant protein complex signals, the reported complex signals need to be collapsed based on a strategy that considers (i) subunit composition and (ii) resolution in the chromatographic dimension.
vii. Visualize and inspect all collapsed complex features
allCollapsedComplexes <- unique(complexFeaturesCollapsed$complex_id)
pdf("allCollapsedComplexes.pdf", height = 6, width = 8)
for (complex in allCollapsedComplexes) {
plotFeatures(feature_table = complexFeaturesCollapsed,
traces = protTraces,
calibration = calibration,
feature_id = complex,
annotation_label = "Entry_name",
onlyBest = FALSE,
peak_area = TRUE,
monomer_MW = TRUE,
PDF = FALSE)
}
dev.off()