Quantitative mass spectrometry (MS)-based proteomics aims to quantify all proteins in a sample [1]. Quantitative approaches fall into two main groups: labelled and label-free. In labelled approaches, quantification relies on labelling the peptides with an isotopic or isobaric mass tag. Label-free approaches do not incur these additional sample preparation costs and can be applied to an unlimited number of samples. The most accurate label-free quantification methods are based on MS1 signals: peptide intensities are extracted by finding the best peak in the three relevant dimensions (m/z, retention time (RT), intensity). The associated workflow consists of feature detection and feature alignment [2].
A feature is a triplet composed of the mass-over-charge (m/z), RT and intensity found in the raw data. In the feature combination step, features that belong to the same peptide are grouped into a cluster in which the m/z values correspond to the isotopic masses of the peptide and the RT interval corresponds to its elution profile. The intensity of a candidate peptide (a cluster of features) is the sum of all the peaks in the identified retention time interval. Feature alignment (also called “match-between-runs”) is intended to match features across runs when they lack identified fragment spectra in some of the runs.
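The summation described above can be illustrated with a minimal Python sketch. The function name `cluster_intensity`, the absolute m/z tolerance `mz_tol` and the example values are hypothetical and chosen only for illustration; they are not taken from any specific tool.

```python
from typing import List, Tuple

# A feature is a triplet: (m/z, retention time, intensity).
Feature = Tuple[float, float, float]


def cluster_intensity(
    features: List[Feature],
    isotope_mzs: List[float],
    rt_start: float,
    rt_end: float,
    mz_tol: float = 0.01,  # hypothetical absolute m/z tolerance
) -> float:
    """Sum the intensities of all features that fall on one of the
    isotopic m/z values of a peptide within its elution interval."""
    total = 0.0
    for mz, rt, intensity in features:
        in_rt = rt_start <= rt <= rt_end
        on_isotope = any(abs(mz - iso) <= mz_tol for iso in isotope_mzs)
        if in_rt and on_isotope:
            total += intensity
    return total


# Illustrative data: three features belong to the cluster, two do not.
features = [
    (500.250, 20.1, 1000.0),  # monoisotopic peak
    (500.752, 20.2, 600.0),   # first isotope of a doubly charged peptide
    (500.251, 20.4, 1500.0),  # monoisotopic peak, later scan
    (500.250, 25.0, 800.0),   # outside the RT interval
    (612.300, 20.3, 900.0),   # unrelated m/z
]
print(cluster_intensity(features, [500.250, 500.752], 20.0, 20.5))  # 3100.0
```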
MaxQuant [3] is the most popular software for protein quantification. It detects features by fitting a Gaussian peak shape in the three relevant dimensions (intensity, RT, and m/z) and then estimates the peptide intensity as the volume of this 3D feature. Although the computed intensities are precise, MaxQuant suffers from speed penalties as the size of the dataset increases, and it is difficult to integrate into a user's own pipeline.
The size and complexity of the proteomics data in public repositories (ProteomeXchange [4]) keep increasing, and re-analysis of these data has been shown to be promising for novel discoveries [5]. To face this new challenge, there is a need for a quantification tool that is fast, reliable and cluster-friendly, and that can scale with the increasing size of the complex quantitative data sets in public proteomics repositories.
moFF Overview
moFF (modest Feature Finder) is a simple, fast and operating-system-independent MS1-based relative quantification algorithm. moFF is written in Python and works directly on Thermo raw files as well as on mzML files.
Access to Thermo raw files relies on the unthermo raw file library [6], which allows moFF to work on both Linux and Windows systems. Access to mzML files relies on the Python library pymzML [7].
moFF consists of two modules: the match-between-runs module and the apex extraction module. The complete workflow is shown in Figure 1.
See figure in Figures section.
As input, moFF needs a list of identified features (e.g. the result of Mascot or X!Tandem), where each feature should be characterized by a minimum set of information.
The match-between-runs (mbr) module performs an RT alignment across the runs in order to match unidentified features to features that are identified in other runs. This process increases the number of quantified features across the replicates and reduces the number of missing values in the MS1 intensity matrix used in further analysis.
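The idea of matching features through an RT alignment can be sketched as follows. This is a minimal illustration that assumes a simple linear RT model between two runs, fitted on the peptides identified in both; the form of the model, the function names and the example data are assumptions made for illustration and are not a description of the moFF implementation.

```python
from typing import Dict, List, Tuple


def fit_linear(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def match_between_runs(
    run_a: Dict[str, float], run_b: Dict[str, float]
) -> Dict[str, float]:
    """Predict the RT in run B of peptides identified only in run A.

    run_a and run_b map a peptide sequence to its observed RT.
    Requires at least two shared peptides with distinct RTs.
    """
    shared = sorted(set(run_a) & set(run_b))
    slope, intercept = fit_linear(
        [run_a[p] for p in shared], [run_b[p] for p in shared]
    )
    return {
        pep: slope * rt + intercept
        for pep, rt in run_a.items()
        if pep not in run_b
    }


# Illustrative data: PEPD is identified in run A only.
run_a = {"PEPA": 10.0, "PEPB": 20.0, "PEPC": 30.0, "PEPD": 25.0}
run_b = {"PEPA": 11.0, "PEPB": 21.0, "PEPC": 31.0}
print(match_between_runs(run_a, run_b))  # PEPD is predicted at RT 26.0 in run B
```

The predicted RT then defines where to look in run B for the MS1 signal of the peptide that was not identified there.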
Both matched and identified features are then processed by the apex module, where the apex peaks are extracted directly from their extracted ion chromatograms (XICs) retrieved from the raw files (see Figure 2).
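A minimal sketch of apex extraction is given below. It assumes that the XIC is available as a list of (RT, intensity) points and that the apex is the most intense point within an RT window around the expected RT; the function name `find_apex` and the example values are hypothetical.

```python
from typing import List, Optional, Tuple

XicPoint = Tuple[float, float]  # (retention time, intensity)


def find_apex(
    xic: List[XicPoint], rt_expected: float, rt_window: float
) -> Optional[XicPoint]:
    """Return the most intense XIC point within rt_expected +/- rt_window,
    or None if the window contains no points."""
    candidates = [
        (rt, intensity)
        for rt, intensity in xic
        if abs(rt - rt_expected) <= rt_window
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda point: point[1])


# Illustrative XIC: the intense point at RT 21.5 lies outside the window.
xic = [
    (19.6, 50.0),
    (19.8, 200.0),
    (20.0, 900.0),
    (20.2, 1400.0),
    (20.4, 700.0),
    (21.5, 5000.0),
]
print(find_apex(xic, rt_expected=20.0, rt_window=0.5))  # (20.2, 1400.0)
```

Restricting the search to a window keeps a stronger but unrelated peak elsewhere in the chromatogram from being reported as the apex.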
See figure in Figures section.
moFF provides two quality measures for the extracted peak:
- Shape of the peak (log_L_R): if the peak has a symmetrical shape the value will be around 0; for a left- or right-skewed shape the value is greater or less than 0, respectively.
- Signal-to-noise (SNR): this measures how much higher the apex intensity is than the noise level present in the extracted XIC.
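The text above does not give the formulas behind these two measures, so the following sketch is only one plausible way of computing quantities with the described behaviour. It assumes that log_L_R is the log2 ratio of the left and right half-widths of the peak at half of the apex intensity, and that the noise level is the median intensity of the XIC; both definitions, as well as the function name `peak_quality`, are assumptions made for illustration.

```python
import math
from statistics import median
from typing import List, Optional, Tuple

XicPoint = Tuple[float, float]  # (retention time, intensity)


def peak_quality(
    xic: List[XicPoint], apex_index: int
) -> Tuple[Optional[float], Optional[float]]:
    """Return (log_L_R, SNR) for the peak whose apex is xic[apex_index].

    Assumed definitions (not taken from the source text):
      log_L_R = log2(left half-width / right half-width) at half height
      SNR     = apex intensity / median XIC intensity
    """
    apex_rt, apex_int = xic[apex_index]
    half = apex_int / 2.0

    # Walk outwards from the apex while the signal stays above half height.
    left = apex_index
    while left > 0 and xic[left - 1][1] >= half:
        left -= 1
    right = apex_index
    while right < len(xic) - 1 and xic[right + 1][1] >= half:
        right += 1

    left_width = apex_rt - xic[left][0]
    right_width = xic[right][0] - apex_rt
    log_l_r = None
    if left_width > 0 and right_width > 0:
        log_l_r = math.log2(left_width / right_width)

    noise = median(intensity for _, intensity in xic)
    snr = apex_int / noise if noise > 0 else None
    return log_l_r, snr


# Symmetric peak: log_L_R is close to 0.
symmetric = [(19.0, 10.0), (19.5, 10.0), (19.8, 600.0), (20.0, 1000.0),
             (20.2, 600.0), (20.5, 10.0), (21.0, 10.0)]
# Peak with a long left side: log_L_R is positive.
left_skewed = [(19.0, 10.0), (19.4, 600.0), (19.7, 800.0), (20.0, 1000.0),
               (20.2, 600.0), (20.5, 10.0)]
print(peak_quality(symmetric, 3))
print(peak_quality(left_skewed, 3))
```

With these assumed definitions, a peak that is wider on the left of the apex gives a positive value and a peak that is wider on the right gives a negative one, which matches the sign convention stated in the list.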
The parameters of moFF are the following:
- The size of the XIC window retrieved for each feature.
- The retention time (RT) window used to search for the apex.
- The precursor mass tolerance.
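How the XIC window and the precursor mass tolerance interact can be sketched as follows. The sketch assumes that the tolerance is expressed in parts per million (ppm) and that each scan is given as an RT with a list of (m/z, intensity) peaks; the units, the function names and the example values are assumptions made for illustration.

```python
from typing import List, Tuple

Peak = Tuple[float, float]        # (m/z, intensity)
Scan = Tuple[float, List[Peak]]   # (retention time, peaks)


def mz_window(mz: float, tol_ppm: float) -> Tuple[float, float]:
    """Lower and upper m/z bounds for a tolerance given in ppm (assumed unit)."""
    delta = mz * tol_ppm / 1e6
    return mz - delta, mz + delta


def extract_xic(
    scans: List[Scan],
    mz: float,
    tol_ppm: float,
    rt_center: float,
    xic_half_width: float,
) -> List[Tuple[float, float]]:
    """Build an XIC: for every scan within rt_center +/- xic_half_width,
    sum the intensities of the peaks inside the m/z tolerance window."""
    low, high = mz_window(mz, tol_ppm)
    xic = []
    for rt, peaks in scans:
        if abs(rt - rt_center) > xic_half_width:
            continue
        intensity = sum(i for m, i in peaks if low <= m <= high)
        xic.append((rt, intensity))
    return xic


# Illustrative scans: a 10 ppm tolerance at m/z 500 is +/- 0.005.
scans = [
    (19.0, [(500.001, 100.0), (500.020, 50.0)]),  # second peak is out of tolerance
    (20.0, [(499.998, 300.0)]),
    (30.0, [(500.000, 999.0)]),                   # outside the XIC window
]
print(extract_xic(scans, mz=500.0, tol_ppm=10.0, rt_center=20.0, xic_half_width=3.0))
# [(19.0, 100.0), (20.0, 300.0)]
```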
The match-between-runs module has additional parameters:
- The retention time (RT) window used to search for the apex intensity of the matched peak.
- An outlier filter and its width value. This filter works on the training set used to train the RT prediction models.
- A weighted or unweighted combination of the RT prediction models when a feature is matched in several runs.
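The difference between the weighted and unweighted combination can be illustrated with a short sketch. The weighting scheme is not specified in the text above; the example assumes, purely for illustration, that each model is weighted by the inverse of its training error. The function name `combine_predictions` and the values are hypothetical.

```python
from typing import List, Optional


def combine_predictions(
    predicted_rts: List[float], weights: Optional[List[float]] = None
) -> float:
    """Combine the RTs predicted by several run-to-run models.

    Without weights this is the plain mean; with weights it is the
    weighted mean of the predictions.
    """
    if weights is None:
        return sum(predicted_rts) / len(predicted_rts)
    total = sum(weights)
    return sum(rt * w for rt, w in zip(predicted_rts, weights)) / total


# Two models predict the RT of the same feature.
predictions = [20.0, 22.0]
model_errors = [0.5, 1.5]  # hypothetical training errors of the two models
# Assumed weighting: inverse of the model error.
weights = [1.0 / err for err in model_errors]

print(combine_predictions(predictions))           # 21.0
print(combine_predictions(predictions, weights))  # about 20.5
```

In this example the weighted estimate moves towards the prediction of the model with the smaller error, whereas the unweighted estimate treats both models equally.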