1. Load GC-MS data (Data tab)
1.1 First-time use
Firstly, it is recommended you define a unique PARADISe session name[2].
In PARADISe, GC-MS data can be imported as:
a. CDF (Network Common Data Form, netCDF) files: Data tab > Add CDF Files. It is advisable to load all files from the same local directory in order to avoid import failures.
Generally, raw data files can be exported as CDF files from the GC-MS software the user is working with, e.g., by exporting data to AIA format from ChemStation. Open-source software for chromatographic data, such as OpenChrom from Lablicate, is also available to convert any type of raw data to CDF files.
b. Data array built in Matlab (*.mat): Data tab > Import/Export > Import chromatographic data.
For the *.exe installation, the installation folder (default is C:\Program Files\University of Copenhagen\paradise\application\) contains a sub-folder named “ExampleFiles”, which includes an example file, example-data-array.mat. The format (v6.0.1) requires the following variables:
- Data which is an array with size: samples × retention time points (scans) × mz channels.
- rt which is a column vector specifying the time (in minutes) for each retention time point.
- mz which is a row vector specifying the mz values.
- PathsAndFilenames which is a cell array (samples × 2), where the first column contains the paths to the sample and the second column contains the filename of each sample.
The *.mat file should be saved in -v7.3 format (or later) so that the compressed file can be read efficiently.
The imported files can be removed from PARADISe by selecting them in the Data tab and then clicking the Remove Selected Files button. Furthermore, if the user presses the Reset PARADISe Session button, all elements belonging to that specific session (data, intervals, and models) will be removed. This is equivalent to starting a new session.
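The layout described above can be checked before import. The following is a minimal Python sketch, not part of PARADISe, that verifies that the dimensions of the four variables agree. It uses plain nested lists in place of a MATLAB array, and the function name check_paradise_layout and the toy file paths are hypothetical.

```python
def check_paradise_layout(data, rt, mz, paths_and_filenames):
    """Return a list of problems found; an empty list means the shapes agree.

    data: nested list indexed as data[sample][scan][mz_channel]
    rt: list of retention times in minutes, one per scan
    mz: list of m/z values, one per channel
    paths_and_filenames: list of [path, filename] pairs, one per sample
    """
    problems = []
    n_samples = len(data)
    if len(paths_and_filenames) != n_samples:
        problems.append("PathsAndFilenames must have one row per sample")
    if any(len(row) != 2 for row in paths_and_filenames):
        problems.append("PathsAndFilenames rows must be [path, filename]")
    for i, sample in enumerate(data):
        if len(sample) != len(rt):
            problems.append(f"sample {i}: scans do not match length of rt")
        if any(len(scan) != len(mz) for scan in sample):
            problems.append(f"sample {i}: channels do not match length of mz")
    return problems


# Toy example: 2 samples x 3 scans x 2 m/z channels.
data = [[[1.0, 0.0], [5.0, 2.0], [1.0, 0.5]],
        [[0.8, 0.1], [4.0, 1.5], [0.9, 0.4]]]
rt = [5.00, 5.01, 5.02]
mz = [57.0, 71.0]
paths = [["C:/data/", "sample1.cdf"], ["C:/data/", "sample2.cdf"]]
print(check_paradise_layout(data, rt, mz, paths))  # []
```

An empty list means the toy array is consistent; each mismatch adds one message to the list.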
Note that if the imported dataset contains blanks (sample or column blanks), PARADISe will include them for modelling. Thus, it is the user’s choice to decide whether to keep or remove blanks from the data.
1.2 Continued use
The user can import previous PARADISe sessions (Data tab > Import/Export > Import Paradise session) or intervals that have been previously defined for any session (Data tab > Import/Export > Import Paradise session). Intervals can also be imported as an .xlsx file with a header row containing Start and End, followed by the start and end times of each interval (in minutes), as shown in Figure 7.
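When intervals are prepared outside PARADISe, it can help to sort and sanity-check them before typing them into the spreadsheet. The sketch below is an illustration only: the function name interval_rows is hypothetical, and it produces the rows of the table (header Start and End, times in minutes) rather than the .xlsx file itself, which would still need to be saved from a spreadsheet program or a third-party library.

```python
def interval_rows(intervals):
    """Return rows [["Start", "End"], [s1, e1], ...] sorted by start time.

    intervals: iterable of (start, end) retention times in minutes.
    Raises ValueError if any interval has start >= end.
    """
    rows = [["Start", "End"]]
    for start, end in sorted(intervals):
        if start >= end:
            raise ValueError(f"interval start {start} is not before end {end}")
        rows.append([start, end])
    return rows


rows = interval_rows([(7.80, 8.05), (5.10, 5.35)])
for row in rows:
    print(row)
```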
2. Visual assessment of GC-MS chromatograms
The Editor tab shows the overlaid chromatograms of all imported samples (Figure 8a), but it is also possible to visualize a single or a few overlaid chromatograms by selecting specific samples in the left panel list called Signals (Figure 8b)[3].
It is possible to navigate through the overlaid chromatograms by left-clicking and dragging the cursor over a selected area to zoom in and by right clicking and selecting Reset Zoom to zoom back out. The small lower panel shows that selected region highlighted in purple (Figure 9).
Tip: The user is also able to pan through the chromatogram while zoomed in by sliding the purple viewed section left and right or using the mouse scroll.
By default, the Total Ion Chromatogram (TIC) is shown, but the user can switch to the Base Peak Chromatogram (BPC). Furthermore, specific masses can be investigated by typing them into the EIC Mass(es) field; hover over the field to see a tooltip describing how to specify masses. These visualization tools are found at the bottom of the Editor tab (Figure 9).
In large datasets, it is quite usual to observe retention time shifts across samples. Even though PARAFAC2 models handle shifted data, raw GC-MS data sometimes needs to be aligned in advance to ease the interval selection when shifts are too large. This step can be carried out automatically in PARADISe by pressing the Data Alignment (Auto) button at the bottom-right side of the Editor tab [4]. Note that the alignment is fully automated and cannot currently be undone, so if the result is not satisfactory, the user will need to reload the data. The automated alignment consists of an initial coshifting (Larsen, Van den Berg & Engelsen, 2006) to handle single samples that are shifted very differently from the bulk. This is followed by correlation optimized warping (Tomasi, Van den Berg & Andersson, 2004), where the parameters are estimated by an optimization routine (Skov, Van den Berg, Tomasi & Bro, 2006). Usually, the results are simply assessed visually.
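The three views differ only in how the mz channels are combined at each scan. The following Python sketch illustrates this for a single sample stored as a scans × mz nested list. The function names and the toy numbers are illustrative, and the sketch makes no claim about how PARADISe computes these traces internally.

```python
def tic(sample):
    """Total ion chromatogram: sum of all m/z channels at each scan."""
    return [sum(scan) for scan in sample]


def bpc(sample):
    """Base peak chromatogram: most intense m/z channel at each scan."""
    return [max(scan) for scan in sample]


def eic(sample, mz, masses):
    """Extracted ion chromatogram: sum of the selected m/z channels."""
    wanted = [j for j, value in enumerate(mz) if value in masses]
    return [sum(scan[j] for j in wanted) for scan in sample]


# One sample with 3 scans and 3 m/z channels.
sample = [[1.0, 0.0, 2.0], [5.0, 2.0, 1.0], [1.0, 0.5, 0.5]]
mz = [57, 71, 85]
print(tic(sample))                # [3.0, 8.0, 2.0]
print(bpc(sample))                # [2.0, 5.0, 1.0]
print(eic(sample, mz, {57, 85}))  # [3.0, 6.0, 1.5]
```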
Tip: If the alignment did not distort peaks and the intervals seem more easily identified, then the alignment is kept.
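To give an intuition for the coshifting step mentioned above, the sketch below moves one chromatogram by a whole number of scans so that it overlaps best with a reference. This is a deliberately simplified rigid shift and an assumption made for illustration. It is not the published coshifting or correlation optimized warping algorithm, nor PARADISe's implementation, and the function names are hypothetical.

```python
def best_shift(reference, signal, max_shift):
    """Return the integer shift (in scans) that best matches signal to reference.

    A positive result moves the signal to the right, a negative one to the left.
    """
    def overlap_score(shift):
        total = 0.0
        for i, ref_value in enumerate(reference):
            j = i - shift
            if 0 <= j < len(signal):
                total += ref_value * signal[j]
        return total

    return max(range(-max_shift, max_shift + 1), key=overlap_score)


def apply_shift(signal, shift):
    """Move signal by shift scans, padding the vacated end with zeros."""
    n = len(signal)
    return [signal[i - shift] if 0 <= i - shift < n else 0.0 for i in range(n)]


reference = [0, 0, 1, 5, 1, 0, 0, 0]
late = [0, 0, 0, 0, 1, 5, 1, 0]  # same peak, eluting two scans later
shift = best_shift(reference, late, max_shift=3)
aligned = apply_shift(late, shift)
print(shift, aligned)
```

In this toy case the late peak is moved two scans to the left and then coincides with the reference. Real chromatograms often need non-rigid corrections, which is what the warping step addresses.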
3. Interval selection
The user is in charge of performing the manual selection of intervals through the Editor tab as follows:
1. Hold down shift and then left click and drag from left to right to form the boundaries of the peak. The interval will appear shaded in grey and an interval entry for the peak will appear in the bottom left-hand “Intervals” panel.
2. The user is able to adjust interval boundaries by hovering the cursor against the boundary (a red line will appear at the boundary, along with a sideways arrow). Click and drag the boundary where desired. Alternatively, it is also possible to manually enter the numerical retention time for the boundary in the Intervals panel (bottom left-hand corner) (Figure 10).
3. Intervals can also be removed by right clicking either on the interval (grey area) or in the Intervals panel and selecting Delete/Remove Selected Intervals.
4. Repeat the interval selection across all peaks.
GC-MS data is usually very complex, which can leave the user unsure how to set intervals. Therefore, the Troubleshooting section (part A) of this protocol provides tips for handling the most common issues in this step.
4. Fitting PARAFAC2 models
Once the intervals have been selected, one model for each interval will be fitted by pressing Fit models on the Paradise Workflow in the upper-left part of the Analysis tab [5]. When the model calculations are complete, an informative pop-up window will appear.
Tip: The general recommendation is to not change the default settings. However, the user can go through the model settings (Settings tab > ParadiseSettings) to change the default modelling parameters (Figure 11), if desired.
· Max. Number of Components: it is set to eight by default. You may wish to reduce or increase this number, depending on the complexity of the selected intervals e.g., co-eluted compounds within a single interval. It is also possible to add additional components to selected intervals and recalculate models later, if needed.
· Max. Iterations (per model) and Convergence Threshold: The data is decomposed using PARAFAC2-based models, which require an iterative fitting approach. The fitting continues until the least squares error is below the convergence threshold or the maximum number of iterations is reached. The default convergence threshold is 10^-8, which is our recommendation. The default maximum number of iterations is 5000, which is enough for most datasets. If particularly difficult intervals are selected, such as those with many overlapping peaks, then more iterations can be beneficial. Decreasing the maximum number of iterations will speed up the modelling, but too few iterations can result in poor models.
· Method (ID): while method 4 is selected by default, other methods are also available. It is beyond the scope of this tutorial to go into the details of these algorithms and we note that there is little reason to change the default settings.
1. Nway PARAFAC2
2. Nway PARAFAC2 with non-negativity
3. Nway PARAFAC2 with fast non-negativity
4. Flexible coupling PARAFAC2 with non-negativity
5. Flexible coupling PARAFAC2 with fast non-negativity
The user can also switch the computation mode to parallel when there are many intervals (>10) and many cores (>2); PARADISe will then use one core to fit each interval. Note that MATLAB has an upper limit of 512 cores.
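The interplay between Max. Iterations and Convergence Threshold described above can be sketched as a generic stopping rule. This sketch assumes that convergence is judged on the relative change of the least squares error between iterations, which is a common convention; the exact criterion used by PARADISe is not spelled out here. The function fit_until_converged and the toy update are hypothetical.

```python
def fit_until_converged(update, initial_loss, tol=1e-8, max_iter=5000):
    """Run update() until the relative loss change drops below tol.

    update: callable returning the least-squares loss after one more iteration.
    Returns (final_loss, iterations_used, converged).
    """
    previous = initial_loss
    for iteration in range(1, max_iter + 1):
        current = update()
        change = abs(previous - current) / max(abs(previous), 1e-300)
        if change < tol:
            return current, iteration, True
        previous = current
    return previous, max_iter, False


# Toy "model": the loss halves its distance to 1.0 at every iteration.
state = {"loss": 100.0}


def toy_update():
    state["loss"] = 1.0 + (state["loss"] - 1.0) / 2.0
    return state["loss"]


loss, n_iter, converged = fit_until_converged(toy_update, state["loss"])
print(round(loss, 6), n_iter, converged)
```

With the default limits the toy loss settles long before 5000 iterations. Lowering max_iter to a handful of iterations makes the loop stop early with converged set to False, which mirrors the warning that too few iterations can give poor models.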
5. Assessment of fitted models
The user is able to inspect the fitted PARAFAC2 models in order to choose the optimal number of components.
Tip: Note that PARAFAC2 components are not nested, as in PCA, so every model should be assessed independently according to the features described in this section.
PARADISe has a supporting tool, based on deep learning, to identify actual chemical compounds (Risum & Bro, 2019). It should have run automatically after fitting the models, but if the model fitting was cancelled by the user, just press the Analyse Peaks (DL) button on the Paradise Workflow in the upper-left panel of the Analysis tab. This process takes much less time than the model-fitting step. As a result, the bottom-left panel in the Analysis tab (Figure 12), where the models fitted with different numbers of components are displayed, will be coloured according to the number of chemical compounds found. The darker the background, the greater the number of true peaks present in that model. Conversely, the background remains white where the deep learning tool finds no peaks [6]. That same panel also shows:
- Fit (%): indicator of how well the model describes the information contained within the GC-MS chromatograms. A higher fit % value means that a better fit has been achieved for a model including a given number of components; therefore, we aim to maximize the fit and expect values of at least 95%. Generally, increasing the number of model components will improve the fit, but once it reaches a plateau, adding further components may result in an over-fitted model.
- Core consistency: the internal colour of the circle gives an indication of the “Core consistency” of the model, which is a measure of how adequate the model is. The higher the core consistency (i.e., darker shading), the better. As the number of components in the model increases, the core consistency will typically decrease towards zero (black ~100%, white ~0%). Ideally, we aim to maximize both core consistency and fit, but in practice, the fit (%) will take precedence.
- The thick red circle indicates which component model is currently being displayed.
- The arrow at the top of the panel allows the user to increase the number of components in case the maximum number of components is not enough to describe the complexity of a given interval. After this, the new models have to be fitted by pressing the Fit models button again (Paradise Workflow, upper-left panel), which will skip already calculated models.
Panel A in Figure 12 shows the Total Ion Chromatogram (TIC) at the given interval used for modelling (raw or aligned). Here, one can make a visual assessment of the number of peaks to expect within the current interval. By clicking on the TIC, it displays the Model residuals, which are essentially a measure of the TIC (i.e., the whole signal) minus the individual elution profiles for the model components (i.e., the structured signals (peaks) we are trying to model).
Tip: The residuals are the parts of the chromatograms that are not described by the model components. Residuals showing peak structures, as shown in Figure 13 (above - model with two components, fit ~96%), mean that important information has been left out and that the model needs more components (Figure 13, below – model with 7 components, fit of ~100%).
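The relation between residuals, number of components, and fit can be illustrated with a few numbers. The sketch below works on a single TIC trace and assumes the common explained-variation definition of fit, 100 × (1 − residual sum of squares / total sum of squares). Whether PARADISe uses exactly this formula is an assumption, and in practice the fit is evaluated on the full data rather than on the TIC. All names and values are illustrative.

```python
def residual(tic_signal, component_profiles):
    """TIC minus the summed elution profiles of all model components."""
    modelled = [sum(values) for values in zip(*component_profiles)]
    return [t - m for t, m in zip(tic_signal, modelled)]


def fit_percent(signal, resid):
    """Explained sum of squares as a percentage of the total."""
    ss_total = sum(x * x for x in signal)
    ss_resid = sum(r * r for r in resid)
    return 100.0 * (1.0 - ss_resid / ss_total)


tic_signal = [1.0, 4.0, 9.0, 6.0, 2.0]
one_component = [[1.0, 4.0, 8.0, 3.0, 1.0]]
two_components = one_component + [[0.0, 0.0, 1.0, 3.0, 1.0]]

resid_1 = residual(tic_signal, one_component)
resid_2 = residual(tic_signal, two_components)
print(resid_1)  # [0.0, 0.0, 1.0, 3.0, 1.0]
print(resid_2)  # [0.0, 0.0, 0.0, 0.0, 0.0]
print(round(fit_percent(tic_signal, resid_1), 2))
print(round(fit_percent(tic_signal, resid_2), 2))
```

With one component the residual still has a peak shape, so a second component is needed; with two components the residual is flat and the fit reaches 100%.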
Panel B in Figure 12 shows the individual elution profiles for the model components (i.e., the structured signals). Elution profiles can be coloured according to the number of components or the type of peak defined by the deep learning tool outcome (true peak=red, baseline=blue, offset peak=green, and unknown=grey). The user can switch the colouring type in the upper-right panel (Figure 14, green arrow).
It is possible to change the view in panel B to see the normalized elution profiles by clicking directly on the panel. The normalization scales each component such that its Euclidean norm is one across samples. This can sometimes be helpful to assess whether components have a good peak shape (i.e., approximately Gaussian). However, it can also scale up noise, which often gives rise to a saw-tooth pattern (Figure 15). This happens especially for small peaks and peaks that are only present in a few samples. If this occurs, it is advisable to view the non-normalized profiles and iteratively remove the largest peak from view (untick the View box for that component).
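The effect of the normalization can be reproduced with a short sketch. It scales a single profile to unit Euclidean norm, which is one reading of the description above and is not taken from the PARADISe source code. The function name and the numbers are illustrative.

```python
import math


def normalize(profile):
    """Scale a profile so that its Euclidean norm equals one."""
    norm = math.sqrt(sum(x * x for x in profile))
    if norm == 0.0:
        return list(profile)  # an all-zero profile cannot be scaled
    return [x / norm for x in profile]


large_peak = [0.0, 30.0, 40.0, 0.0]
small_noisy = [0.01, 0.03, 0.04, 0.02]
print(normalize(large_peak))  # [0.0, 0.6, 0.8, 0.0]
print(normalize(small_noisy))
```

After scaling, both profiles have the same overall size, so the baseline noise of the small component becomes as visible as the real peak of the large one.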
In panel D (Figure 12), the user is able to select/deselect the compounds to be displayed on panel B (View column), selected as true peaks (Peak column), or exported to the peak table (Export column). In some cases, it may be necessary to deselect a large peak from view, in order to better visualize smaller peaks and baseline components. Selected components to be exported to the peak table are plotted as a bold line instead of a thin line.
Finally, panel C in Figure 12 shows the resolved mass spectra for each model component (only if they have been selected in panel D to be visualized). This information is crucial to decide whether two peaks (model components) are indeed different chemical compounds, in which case they present different mass spectra. If, conversely, these components have the same or very similar mass spectra, then it is very likely that the model is over-fitted and too many components have been selected. Additionally, the user can choose how the mass spectra are plotted (Bars, Lines, Stairs, or Stems) to the right of the plot.
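In PARADISe this comparison is made visually. For readers who export the resolved spectra and want a number to go with the visual impression, cosine similarity is one widely used measure. The sketch below is an illustration under that assumption, not a feature of PARADISe, and any cut-off for "very similar" would have to be chosen by the user.

```python
import math


def cosine_similarity(spectrum_a, spectrum_b):
    """Cosine of the angle between two spectra on the same m/z axis."""
    dot = sum(a * b for a, b in zip(spectrum_a, spectrum_b))
    norm_a = math.sqrt(sum(a * a for a in spectrum_a))
    norm_b = math.sqrt(sum(b * b for b in spectrum_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


component_1 = [100.0, 40.0, 0.0, 10.0]
component_2 = [50.0, 20.0, 0.0, 5.0]    # same pattern, half the intensity
component_3 = [0.0, 10.0, 100.0, 30.0]  # different fragmentation pattern

print(round(cosine_similarity(component_1, component_2), 3))  # 1.0
print(round(cosine_similarity(component_1, component_3), 3))
```

A value close to one, as for components 1 and 2, suggests that the two components describe the same compound and that the model may have too many components.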
Sometimes, it is also useful to check the distribution of relative concentrations (scores as estimated area) for each component between samples. This information can also be displayed in panel C by clicking on it once (Figure 16).
If the user is interested in checking whether a given mass spectrum matches a specific chemical compound, this can be done by pressing the Lookup Components in NISTMS button at the middle-bottom part of the Analysis tab, where Interval Specific Options are shown (Figure 12). This requires that the NIST MS Search Program has been separately obtained and installed [7].
The location of the MSSEARCH folder (containing the nistms.exe and nistms$.exe files) has to be specified at: Settings tab > SpectralDatabases > NIST > Location [8]. In that same tab, the user is also able to specify reporting parameters, such as how compounds are sorted in the peak list (by retention time within or across intervals), among others. The number of NISTMS hits and other search parameters needs to be specified through the nistms.exe program.
Back in the Interval Specific Options, it is also possible to Clear Selected Model (which can then be refitted by pressing Fit models) and Export Interval Data. Exporting an interval may prove useful for further processing of problematic or illustrative intervals. The exported data is in .mat (MATLAB) format.
6. Create a PARADISe report
The user must go through all models fitted for the selected intervals (left-middle Intervals panel in the Analysis tab, Figure 12) to decide the optimal number of components for each model and select which components should be exported to the peak table. After this, the final step is to create the peak table by pressing the Export Selected Peaks (PARADISe Report) button in the bottom right-hand corner of the Analysis tab (Figure 12). The user will be prompted to select the filename and where to save the report.
The report (Excel file) contains several sheets:
Overview – Gives a few details on the data, the steps performed in PARADISe, and the software version used.
Peak Area – Gives the peak integration (peak areas) of all exported peaks for each sample, together with the interval and model information, and the most suitable chemical identification according to the NIST match factor (highest match factor).
Resolved Mass Spectra – Gives the estimated mz-spectra (as also shown in PARADISe) for each exported peak.
Top NIST hits – Lists the top NIST hits for each component that has been exported, where the maximum number of hits is defined through nistms.exe.
Interval Details – Details on the PARAFAC2 model performance, which is nice to have along with the report in case of troubleshooting.
The peak table (Peak Area sheet) can now be used for further analysis outside of PARADISe. Note that if the NISTMS search is not available, the report will still be generated, but the sheet Top NIST hits will be empty and the Compound Name and Match Quality will not be available. The Resolved Mass Spectra can then be used as input for spectral matching in other software.
[2] Note: if the chosen session name already exists, all previous session details will be overwritten. The session name and PARADISe directory can also be defined in the Settings tab.
[3] Note: be aware that the y-axis scale does not automatically adjust; it is fixed to the highest signal out of all of the imported chromatograms. This can hinder the quick assessment of individual files.
[4] Note: the auto-alignment can take a very long time (>24 h) depending on the number of samples imported and the computer used.
[5] Note: depending on the number of samples and intervals set, this step can take a while. For this reason, we recommend the use of a (big) auxiliary dedicated computer to carry out the model fitting.
[6] Note: PARADISe does not always guess correctly which components in the models are peaks, but it provides a good guide.
[7] Note: this is often included with vendor software – otherwise see https://chemdata.nist.gov/
[8] Note: an error warning that the NISTMS is not available can appear. If so, check that the NIST directory has been correctly specified in the Settings tab > SpectralDatabases. Try to check the NIST functionality by locating and running the “nistms.exe” application file independently.