A protocol for reconstructing the dynamics of real-world systems from observational data: Application for establishing a digital proxy of a bioreactor (DIYBOT)

doi:10.21203/rs.3.pex-1052/v1

Method Article


This work is licensed under a CC BY 4.0 License

This protocol has been posted on Protocol Exchange, an open repository of community-contributed protocols sponsored by Nature Portfolio. These protocols are posted directly on the Protocol Exchange by authors and are made freely available to the scientific community for use and comment.

Version 1


This protocol sets out a broadly-applicable framework for empirically reconstructing the dynamics of biosystems from observational data, modeling reconstructed dynamics phenomenologically with systems of ordinary differential equations extracted from the data, and detecting and quantifying causal interactions among system components. The protocol draws from well-established procedures in Nonlinear Time Series Analysis (NLTS) that can be run with existing R packages. The protocol was applied in McLamore et al. (2020)¹ to develop a phenomenological digital proxy of a bioreactor (DIYBOT) from real-time sensor data. DIYBOT was used to investigate whether bioreactor dynamics self-corrected in response to contamination spikes of increasing degrees; and consequently, the extent to which corrective human-in-the-loop management might be required.

Biotechnology

Computational biology and bioinformatics

Chemical biology

Environmental sciences

Complex networks

Information theory and computation

Engineering

digital proxy

bioreactor

Nonlinear Time Series Analysis

signal processing

phase space reconstruction

surrogate data testing

convergent cross mapping

phenomenological modeling

Introduction: With increasing connectivity, the use of real-time sensor data to inform biosystem management is evolving toward the development of digital twins¹. One of the first steps toward a digital twin in the water reuse industry is the development of a digital proxy using real-time sensor data combined with cross-sectional data. Observed time series records from real-time sensors provide an essential evidentiary portal to understanding the systematic behavior of real-world environmental systems, which are complex, open (ever changing), and beyond anyone's capacity to model closely. Phenomenological models are used to determine the ordinary differential equations that govern biosystem behavior, and are then combined with other models describing form and function to establish a digital proxy of a bioreactor (DIYBOT). McLamore et al.¹ recently established the DIYBOT platform for development of smart water treatment systems as a first step toward development of a digital twin.

This protocol outlines Nonlinear Time Series Analysis (NLTS) as a collection of methods designed to empirically reconstruct and model the dynamics of bioreactors from the observational data that they generate, and is modified from previous versions in Huffaker and Fearne (2019)² and Huffaker et al. (2018)³. Reconstruction is possible because even a single observed variable in an interdependent dynamic system encodes the history of its interactions with co-variates. The famous naturalist John Muir expressed this succinctly: “When we try to pick something up by itself, we find it hitched to everything else in the universe.”⁴ Empirically reconstructed dynamics can be used to diagnose whether volatility in observational data is generated by a self-correcting dynamic system responding to exogenous shocks, or an endogenously unstable system that is not self-correcting. This distinction, which is pivotal in managing and regulating real-world dynamic systems, is imperceptible to casual inspection and to empirical approaches presuming stochastic forcing². Reconstructed dynamics also can be used to detect and quantify causal interactions among system variables, which helps to guide subsequent formulation of mechanistic models corresponding to reality. The seminal book on NLTS is Kantz and Schreiber (1997)⁵, and a more practical guide targeted to practitioners is Huffaker, Bittelli, and Rosa (2017)⁶.

Overview of the procedure

A sequence of NLTS methods is used to reconstruct deterministic system dynamics from observed output without prior knowledge of model equations. We present a brief synopsis of the methods outlined in Fig. 1, which include detection, reconstruction, and modeling of bioreactor dynamics.

(A) Signal Processing: We begin with signal processing to prepare the record for NLTS diagnostics. We compute the Fourier Spectrum to determine whether there are recurrent oscillations in the observational data, and to ascertain the sampling interval required to adequately sample dominant oscillations. For example, there might be computational advantages to working with substantially fewer observations if hourly averages taken from 15-minute sampling intervals adequately capture dominant diurnal cycles in a record. We continue with Singular Spectrum Analysis (SSA)¹ signal processing to separate the signal in the data (structured variation composed of trend and oscillatory components) from the noise (unstructured random variation). SSA also gives us a measure of signal strength as the fraction of the total variation that the signal accounts for in the observational record.

(B) Phase Space Reconstruction: We next apply Time-Delay Embedding⁷ to empirically reconstruct the phase space dynamics of the real-world system(s) that generated strong output signals. Phase space is the graphical portrayal of deterministic system dynamics. Phase space coordinates are provided by the system variables and their time-delayed copies, and each multidimensional point records the levels (states) of system variables at a point in time⁸. Phase space trajectories connecting these points depict the co-evolution of system variables from given initial states. If system dynamics are ‘dissipative’, these trajectories are bounded within a low-dimensional subset of phase space, and forever evolve along an ‘attractor’ in this subspace, a geometric structure with noticeable regularity². NLTS generates plots of phase space attractors, and applies concepts from multidimensional geometry to analyze attractor characteristics. Surrogate data testing⁹,¹⁰ tests the null hypothesis (H₀) that apparent geometric regularity in a reconstructed shadow attractor is most likely generated by linear-stochastic dynamics as opposed to nonlinear deterministic dynamics. The chosen confidence level (95%, i.e., a significance level of 0.05) determines the number of surrogates generated for a one-sided test. Popular test statistics include nonlinear prediction¹¹ and permutation entropy¹². With nonlinear prediction, we run an upper-tailed test that rejects H₀ if an empirical attractor predicts more skillfully in-sample than attractors reconstructed from randomized surrogate data. Using rank-order statistics, H₀ is rejected if prediction skill (as measured by Nash-Sutcliffe Model Efficiency, NSE) for an empirical attractor is among the k largest NSE values measured for the ensemble of attractors reconstructed from the surrogates. With permutation entropy, we run a lower-tailed test that rejects H₀ if permutation entropy for an empirical attractor is among the k smallest entropy values measured for the ensemble of attractors reconstructed from the surrogates³.
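The rank-order logic of the surrogate test can be sketched in a few lines. This is an illustrative Python/NumPy sketch, not the R code used in the protocol; the function names are hypothetical, and the surrogate count follows the common one-sided convention K = k/α − 1 from the surrogate data literature⁹,¹⁰, which is an assumption about how the ensemble size was chosen here.

```python
import numpy as np

def nse(observed, predicted):
    """Nash-Sutcliffe Model Efficiency: 1 is a perfect fit, 0 matches the mean."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 1.0 - np.sum((observed - predicted) ** 2) / np.sum((observed - observed.mean()) ** 2)

def surrogates_needed(alpha=0.05, k=1):
    """Ensemble size for a one-sided rank-order test (assumed convention K = k/alpha - 1)."""
    return int(round(k / alpha)) - 1

def rejects_null(empirical, surrogate_values, k=1, tail="upper"):
    """Reject H0 if the empirical statistic ranks among the k most extreme values."""
    values = np.sort(np.asarray(surrogate_values, dtype=float))
    if tail == "upper":                    # e.g., prediction skill (NSE)
        return bool(empirical > values[-k])
    return bool(empirical < values[k - 1])  # e.g., permutation entropy

n_needed = surrogates_needed(alpha=0.05, k=1)
```

With α = 0.05 and k = 1 this convention calls for 19 surrogates, and H₀ is rejected in the upper-tailed test only when the empirical value exceeds every surrogate value.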

(C) Empirical Causality Detection: Convergent Cross Mapping¹³ (CCM) uses reconstructed attractors to map out a network of pairwise interactions in multivariate records. In short, CCM tests whether covariates reconstruct the dynamics of the same real-world system—if so, then they are causally interactive. In particular, Y is found to drive X if an attractor reconstructed from lagged copies of X can be used to skillfully predict values of Y with nonlinear prediction methods. An attractor skillfully cross predicts another variable when the Pearson correlation coefficient converges to a high value as the number (library) of points on the attractor used to cross predict increases². To distinguish causal interaction from synchronized behavior, cross mappings run at backward and forward delays must perform best at non-positive delays¹¹.
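A minimal sketch of the cross-mapping idea follows, written in Python/NumPy for illustration only. It is not the multispatialCCM implementation used in the protocol. The nearest-neighbor weighting is a simplex-style scheme, the function names are hypothetical, and the coupled logistic maps and their coefficients are illustrative stand-ins for sensor data.

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Rows are [x(t), x(t+tau), ..., x(t+(dim-1)*tau)]."""
    n = x.size - (dim - 1) * tau
    return np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])

def cross_map_skill(x, y, dim=2, tau=1, lib_size=None):
    """Correlation between y and its estimate from the shadow attractor of x.

    High, convergent skill is read as evidence that y drives x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pts = delay_embed(x, dim, tau)
    target = y[(dim - 1) * tau:]           # align y with the latest coordinate
    n = pts.shape[0]
    lib = n if lib_size is None else min(lib_size, n)
    est = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(pts[:lib] - pts[i], axis=1)
        if i < lib:
            d[i] = np.inf                  # leave the point itself out
        nn = np.argsort(d)[:dim + 1]       # dim + 1 nearest library points
        w = np.exp(-d[nn] / max(d[nn[0]], 1e-12))
        est[i] = np.sum(w * target[:lib][nn]) / np.sum(w)
    return float(np.corrcoef(target, est)[0, 1])

# Illustrative coupled logistic maps in which x influences y and vice versa.
x = np.empty(1000)
y = np.empty(1000)
x[0], y[0] = 0.4, 0.2
for t in range(999):
    x[t + 1] = x[t] * (3.8 - 3.8 * x[t] - 0.02 * y[t])
    y[t + 1] = y[t] * (3.5 - 3.5 * y[t] - 0.1 * x[t])

# Skill of estimating x from the attractor of y at increasing library sizes.
skills = [cross_map_skill(y, x, lib_size=size) for size in (25, 100, 400)]
```

Plotting `skills` against library size gives the convergence curve described above; the delay analysis used to rule out synchrony would repeat the calculation with the target shifted backward and forward in time.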

(D) Phenomenological Modeling: Phenomenological modeling⁶,¹⁴ extracts systems of ordinary differential equations (ODEs) from interacting output signals that mathematically reproduce real-world attractors empirically reconstructed directly from the signals. These models can be used to extrapolate system dynamics beyond the empirically reconstructed attractor, and to quantify interactions detected with CCM by computing numerical partial derivatives measuring the marginal response of a response variable to an incremental change in a driving variable. The variables in these models are the time-delay coordinates from the corresponding reconstructed attractors. The time derivative of each variable is approximated with a fourth-order centered finite difference. Each ODE is specified as a polynomial whose order is selected so that the solution of the ODE system faithfully reproduces the empirically reconstructed attractor. Since the ODE system is linear in the parameters, parameters can be estimated with a variety of linear regression techniques.
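The two computational ingredients named above, a fourth-order centered finite difference and a regression that is linear in the polynomial coefficients, can be sketched as follows. This is an illustrative Python/NumPy sketch using ordinary least squares, not Code 9.8 from Huffaker, Bittelli, and Rosa⁶; the function names are hypothetical and the test trajectory is a synthetic spiral sink with known equations.

```python
import numpy as np
from itertools import combinations_with_replacement

def centered_derivative(states, dt):
    """Fourth-order centered difference; drops two samples at each end."""
    s = np.asarray(states, dtype=float)
    return (s[:-4] - 8 * s[1:-3] + 8 * s[3:-1] - s[4:]) / (12 * dt)

def polynomial_library(states, order):
    """Columns are all monomials of the state variables up to the given order."""
    n, m = states.shape
    cols, names = [np.ones(n)], ["1"]
    for degree in range(1, order + 1):
        for combo in combinations_with_replacement(range(m), degree):
            cols.append(np.prod(states[:, list(combo)], axis=1))
            names.append("*".join(f"x{j}" for j in combo))
    return np.column_stack(cols), names

def fit_odes(states, dt, order=2):
    """Least-squares fit; column j of the result holds the coefficients of d(x_j)/dt."""
    states = np.asarray(states, dtype=float)
    derivs = centered_derivative(states, dt)
    library, names = polynomial_library(states[2:-2], order)
    coefs, *_ = np.linalg.lstsq(library, derivs, rcond=None)
    return coefs, names

# Synthetic spiral sink: dx0/dt = -0.1*x0 - x1, dx1/dt = x0 - 0.1*x1.
dt = 0.05
t = np.arange(0, 20, dt)
states = np.column_stack([np.exp(-0.1 * t) * np.cos(t), np.exp(-0.1 * t) * np.sin(t)])
coefs, names = fit_odes(states, dt, order=2)
```

On this noise-free example the fitted linear coefficients recover the generating equations and the quadratic terms are close to zero. With real sensor signals, the polynomial order and the regression technique would need to be chosen as described in the text.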

The protocol in Fig. 1 was applied in McLamore et al. (2020)¹ to develop a digital proxy of a bioreactor (DIYBOT) for wastewater treatment. DIYBOT was used there to investigate whether bioreactor dynamics self-corrected in response to contamination spikes of increasing degrees and, consequently, the extent to which corrective human-in-the-loop management might be required.

Demonstration of protocol

Continuous sensor data (pH and DO) were collected at 15-minute intervals for a 60-day period. During the first 40 days of operation, the bioreactor transitioned toward steady state operation (less than 5% deviation in treatment) characterized by dampened oscillatory behavior. Starting on day 40, a pulse of 1.0 mg/L AgNP was added, and then at subsequent 30-day intervals two more pulses incremented by 1.0 mg/L were added (1.0 mg/L on day 40; 2.0 mg/L on day 70; 3.0 mg/L on day 100). Throughout this period the reactor was operating continuously. The protocol below describes the computational packages and steps required to repeat the analysis, and the approach can be used for any system with the appropriate amount of data. Time series data were stored each week (and backed up on an external hard drive) for subsequent processing via the protocol below. Excel files with representative data from the experiments (pH and DO data after three pulse additions of nanoparticles) can be found in the supplemental section of this protocol.

Computational packages: The steps in the protocol were run with R 3.6.1 and the following packages: forecast v8.12 (Fourier spectrum)¹⁵; Rssa 0.13-1 (singular spectrum analysis)¹⁶; tseriesChaos 0.1-13 (phase space reconstruction and surrogate data analysis)¹⁷; fractal 2.0-1 (AAFT surrogates)¹⁸; multispatialCCM (convergent cross mapping)¹⁹; and Code 9.8 (phenomenological modeling), detailed in Huffaker, Bittelli, and Rosa⁶ and available at http://www.dista.unibo.it/~bittelli/.

Step 1: Signal processing of the three AgNP pulses for the O₂ and pH records as described in McLamore et al. (2020)¹.

Critical Step: See troubleshooting section for a summary of the failure of signal processing to detect strong signals in sensor data.

A. Fourier Frequency Spectrum

The raw data and results of Fourier analysis are shown in Fig. 2. Fig 2A shows the raw O₂ data (15 min time scale) and Fig 2B shows the raw pH data (60 sec timescale) for three consecutive pulse additions of AgNP (see Excel file in supplemental section for sample data). Fourier spectra (Fig 2C-D) are developed using the forecast v8.12 package and plotted (the spectrum is generated by the R package). As demonstrated by the peaks in Fig 2C-D, the three-pulse series for both O₂ and pH each display a dominant cycle length of 200 (15-minute blocks) = 50 hours. This 50-hour dominant cycle length was easily identifiable in the plots.
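As a rough illustration of how a dominant cycle length is read from a Fourier spectrum, the Python/NumPy sketch below computes a raw periodogram and converts the largest peak to a period. It is not the forecast package routine used in the protocol; the function name is hypothetical and the series is a synthetic 50-hour cycle sampled every 15 minutes, not the bioreactor record.

```python
import numpy as np

def dominant_period(x, dt_hours):
    """Period (in hours) of the largest periodogram peak."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                              # keep frequency zero from dominating
    power = np.abs(np.fft.rfft(x)) ** 2           # raw periodogram
    freqs = np.fft.rfftfreq(x.size, d=dt_hours)   # cycles per hour
    peak = 1 + np.argmax(power[1:])               # skip the zero-frequency bin
    return 1.0 / freqs[peak]

# Synthetic stand-in: 600 hours of a noisy 50-hour cycle at 15-minute sampling.
rng = np.random.default_rng(0)
t_hours = np.arange(2400) * 0.25
series = np.sin(2 * np.pi * t_hours / 50.0) + 0.3 * rng.standard_normal(t_hours.size)

period_hours = dominant_period(series, dt_hours=0.25)
period_blocks = period_hours / 0.25               # length in 15-minute blocks
```

For this synthetic series the peak corresponds to 200 fifteen-minute blocks, which is the same conversion used in the text: 200 × 15 minutes = 50 hours.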

B. Singular Spectrum Analysis

Singular Spectrum Analysis (SSA) involves analysis of waveform(s), and then determination of the strength of the signals. To prepare for SSA, each sensor time series is standardized by removing the mean and dividing by the standard deviation.
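The standardization and the basic SSA decomposition can be sketched as below. This is a bare-bones Python/NumPy illustration, not the Rssa routine used in the protocol. The window length, the number of retained components, and the use of the squared singular value share as the "signal strength" are assumptions made for the example, and the input is synthetic.

```python
import numpy as np

def standardize(x):
    """Remove the mean and divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def ssa_signal(x, window, n_components):
    """Basic SSA: embed, decompose with the SVD, and reconstruct the leading components."""
    x = np.asarray(x, dtype=float)
    n = x.size
    k = n - window + 1
    # Trajectory matrix: column j holds x[j : j + window].
    traj = np.column_stack([x[j:j + window] for j in range(k)])
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    approx = (u[:, :n_components] * s[:n_components]) @ vt[:n_components]
    # Diagonal averaging maps the low-rank matrix back to a time series.
    signal = np.zeros(n)
    counts = np.zeros(n)
    for j in range(k):
        signal[j:j + window] += approx[:, j]
        counts[j:j + window] += 1
    signal /= counts
    strength = float(np.sum(s[:n_components] ** 2) / np.sum(s ** 2))
    return signal, strength

# Synthetic stand-in: a 25-block cycle plus unstructured noise.
rng = np.random.default_rng(1)
t = np.arange(300)
clean = np.sin(2 * np.pi * t / 25)
z = standardize(clean + 0.3 * rng.standard_normal(t.size))
signal, strength = ssa_signal(z, window=50, n_components=2)
```

A single oscillation typically occupies a pair of SSA components, which is why two components are retained here. Records with several oscillations would need additional pairs, grouped by inspection as in the protocol.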

The standardized data (black curves) and isolated signals (red curves) resulting from SSA are plotted in Fig. 3. Each time course refers to an AgNP concentration following the convention in Fig 2. The dominant 50-hour cycle can be adequately captured by resampling the data in 2-hour blocks (i.e., averaging every eight 15-minute blocks). Benefits of this reduction include working with substantially fewer observations, and reducing noise by averaging higher-frequency volatility. Fig 3 shows the records replotted on a 2-hour time scale, where damped oscillations are apparent for both O₂ and pH.

Critical Step: Rerun the Fourier Spectrum on the resampled 2-hour blocks to ensure that it reproduces the dominant 50-hour oscillation in 25 two-hour blocks.
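The resampling and the follow-up check can be illustrated with a short Python/NumPy sketch. It uses a synthetic 50-hour cycle rather than the sensor record, and the function name is hypothetical.

```python
import numpy as np

def block_average(x, block):
    """Average consecutive blocks of samples, dropping any incomplete final block."""
    x = np.asarray(x, dtype=float)
    n = (x.size // block) * block
    return x[:n].reshape(-1, block).mean(axis=1)

# Synthetic stand-in: 600 hours of a noisy 50-hour cycle at 15-minute sampling.
rng = np.random.default_rng(2)
t_hours = np.arange(2400) * 0.25
raw = np.sin(2 * np.pi * t_hours / 50.0) + 0.3 * rng.standard_normal(t_hours.size)

two_hour = block_average(raw, block=8)            # eight 15-minute samples per block

# Rerun the spectrum on the resampled record and express the peak in blocks.
power = np.abs(np.fft.rfft(two_hour - two_hour.mean())) ** 2
peak = 1 + np.argmax(power[1:])
period_blocks = two_hour.size / peak
```

Here the resampled record has 300 two-hour blocks and the spectral peak falls at 25 blocks, i.e., the same 50-hour cycle seen at the original sampling interval.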

Critical Step: When the raw data (black curves in Fig 3) are plotted with the isolated signals (red curves in Fig 3), the plots will be aligned if there is no noise in the system; that is, if the signal accounts for all of the variation in the sensor record. Since we expect measurements to have some noise (i.e., unstructured variation), we do not expect the curves to be coincident.

Critical Step: Some signals contain only one dominant oscillation (see for example Table 1 and Fig 3). For example, pulse 1 for the O₂ sensor data contained one 50-hour dominant oscillation (signal strength of 0.82 in Table 1 and also black/red curves in Fig 3A). Some signals are composed of multiple oscillations, for example, pulse 2 for the O₂ sensor data contained three oscillations (signal strengths in Table 1 and also black/red/blue curves in Fig 3C).

The package Rssa 0.13-1 is used for singular spectrum analysis. Table 1 reports the strength of these signals and their constituent oscillations. We note that: (1) Signals are relatively strong, explaining over half the total variation in the corresponding records (with the exception of pH Pulse 3); (2) Signals are stronger for the O₂ record than for the pH record; (3) Signal strengths tend to decrease as AgNP contamination increases from pulse 1 to 3; and (4) As expected from the Fourier spectra, 50-hour cycles account for most of the signal strength.

Critical Step: Determination of whether signal strengths classify as “strong” is done on a case-by-case basis, and requires some expert judgement. In this case, signals were considered strong because they explained over half the total variation in the corresponding records, as noted above. Care must be taken to establish a benchmark for evaluating signal strengths when conducting singular spectrum analysis. This is similar to interpreting the coefficient of determination in linear regression, where users judge what counts as a “good” R².

Step 2: Phase Space Reconstruction

Time delay embedding and surrogate testing were carried out using the package tseriesChaos 0.1-13; fractal 2.0-1 was used to compute AAFT surrogates.
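For readers unfamiliar with AAFT (amplitude-adjusted Fourier transform) surrogates⁹, the Python/NumPy sketch below outlines the standard three-step construction: rescale to a Gaussian series with the same rank order, randomize the Fourier phases, and map the original values back onto the new rank order. It is an illustration only, not the fractal package routine used in the protocol, and the function names and input series are hypothetical.

```python
import numpy as np

def phase_randomize(y, rng):
    """Keep the Fourier amplitudes of y but add random phases."""
    n = y.size
    spectrum = np.fft.rfft(y)
    phases = rng.uniform(0.0, 2.0 * np.pi, spectrum.size)
    phases[0] = 0.0                      # leave the mean untouched
    if n % 2 == 0:
        phases[-1] = 0.0                 # the Nyquist bin must stay real
    return np.fft.irfft(spectrum * np.exp(1j * phases), n=n)

def aaft_surrogate(x, rng):
    """One AAFT surrogate: same values as x, reordered to mimic its linear correlations."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x))
    gaussian = np.sort(rng.standard_normal(x.size))[ranks]   # Gaussian series with x's rank order
    randomized = phase_randomize(gaussian, rng)
    new_ranks = np.argsort(np.argsort(randomized))
    return np.sort(x)[new_ranks]                             # original values in the new order

rng = np.random.default_rng(3)
t = np.arange(300)
series = np.exp(np.sin(2 * np.pi * t / 25)) + 0.1 * rng.standard_normal(t.size)
surrogate = aaft_surrogate(series, rng)
```

Each surrogate is a reshuffling of the original values, so it preserves the amplitude distribution exactly while destroying any nonlinear deterministic structure; an ensemble is built by calling `aaft_surrogate` repeatedly.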

A. Time-Delay Embedding

Empirical attractors reconstructed from O₂ and pH signals for the three AgNP pulse series are shown in Fig. 4A. These attractors exhibit geometric regularity in the form of three-dimensional spiral-sink dynamics for each AgNP addition using coordinates of: (1) real-time (current) values of dissolved oxygen, denoted as O₂(t); (2) dissolved oxygen values with a delay of five two-hour blocks, denoted as O₂(t+5); and (3) current values of pH, denoted as pH(t).
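Building the three attractor coordinates from two records amounts to pairing each value with a shifted copy. The Python/NumPy sketch below shows the bookkeeping with small placeholder arrays; it is not the tseriesChaos routine used in the protocol, and the variable names and values are hypothetical.

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Rows are [x(t), x(t+tau), ..., x(t+(dim-1)*tau)]."""
    x = np.asarray(x, dtype=float)
    n = x.size - (dim - 1) * tau
    if n <= 0:
        raise ValueError("series too short for this dimension and delay")
    return np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])

# Placeholder records on a 2-hour grid (not real sensor values).
o2 = np.arange(10.0)
ph = np.arange(10.0) * 10.0

o2_coords = delay_embed(o2, dim=2, tau=5)            # O2(t) and O2(t+5)
attractor = np.column_stack([o2_coords, ph[:o2_coords.shape[0]]])   # add pH(t)
```

Each row of `attractor` is one point in the reconstructed phase space, and plotting the rows in sequence traces the trajectory.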

B. Surrogate Data Testing

The results of testing the null hypothesis that apparent geometric regularity shown in the empirical attractors was most likely generated by a mimicking linear stochastic process are found in Table 2. We see that all O₂ AgNP pulse series pass surrogate tests for: (1) predictive skill, since NSE values taken from empirical attractors (first column) are among the k largest values taken from surrogate attractors (the k largest values rest above the NSE values reported in the second column); and (2) permutation entropy, since values taken from empirical attractors (third column) are among the k smallest values taken from surrogate attractors (the k smallest values rest below those reported in the fourth column). The surrogate results for the pH signals are less conclusive. Two of the AgNP pulse series (1 and 3) fail the prediction test, while all three are borderline passes on the entropy test.

Critical Step: Compare empirical attractor characteristics to corresponding surrogate attractor characteristics. Test the null hypothesis that there is no significant difference in the measured characteristics. If the null hypothesis cannot be rejected, apparent regularity in the empirical attractor is most likely due to linear stochastic dynamics and not deterministic nonlinear dynamics.

Critical Step: Summary tables of surrogate data testing should be prepared for rapid analysis of predictive skill and permutation entropy tests. See for example summary Table 2 and also Overview section B.
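Permutation entropy¹², the second test statistic in Table 2, can be sketched as follows. This Python/NumPy version is illustrative only; the pattern order, the delay, and the normalization to the interval from 0 to 1 are assumptions for the example and may differ from the settings used to produce Table 2.

```python
import math
import numpy as np

def permutation_entropy(x, order=3, delay=1):
    """Normalized Shannon entropy of ordinal patterns (0 = fully regular, 1 = random)."""
    x = np.asarray(x, dtype=float)
    n = x.size - (order - 1) * delay
    windows = np.column_stack([x[i * delay: i * delay + n] for i in range(order)])
    patterns = np.argsort(windows, axis=1)          # ordinal pattern of each window
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)) / math.log(math.factorial(order)))

rng = np.random.default_rng(4)
t = np.arange(2000)
pe_cycle = permutation_entropy(np.sin(2 * np.pi * t / 25))
pe_noise = permutation_entropy(rng.standard_normal(t.size))
```

A structured signal produces few distinct ordinal patterns and therefore low entropy, which is why the surrogate test on this statistic is lower-tailed.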

Step 3: Convergent Cross Mapping (CCM)

Convergent cross mapping was run with package multispatialCCM¹⁹. CCM results indicate that causal interactions between O₂ and pH shift with increased pulses of AgNP contamination (Fig. 4B). In this example dataset we see that pH remains a relatively strong driver of O₂ for all concentrations of AgNP contamination, since CCM curves converge to correlation coefficients close to one (black curves). In the other direction, O₂ becomes an increasingly weak driver of pH as injected contamination increases (from left to right in the figure), and by AgNP pulse 3 the interaction is indistinguishable from non-causal synchronized behavior, since cross mappings run at backward and forward delays peak at a positive delay (Fig. 4C, right-most plot).

Critical Step: If CCM curves converge to correlation coefficients close to 1.0, this indicates that the process X is a strong driver of Y.

Critical Step: The convergence level for the CCM curve is used to infer the relative strength of the causal interaction.

Critical Step: When conducting extended CCM, a cross-mapping peak at a positive delay (i.e., a delay greater than zero) indicates that the interaction noted in CCM is indistinguishable from non-causal synchronized behavior.

These shifting interactions are summarized in the causal diagram shown in Fig. 5. In this simple example the causal maps are intuitive since there are only two variables and three test conditions, but in more complex data sets causal maps are a highly useful visualization tool for understanding interactions among more than three nodes with multiple interactions.

Critical Step: Causal diagrams are generated in external software based on exported output from package multispatialCCM¹⁹.

Step 4: Phenomenological Modeling

Code 9.8⁶ was used for phenomenological modeling. The phenomenological digital proxies extracted from the O₂ and pH signals are shown in Fig. 6A for each AgNP pulse, along with stability information (equilibria and eigenvalues) in Fig. 6B. The stability information indicates that the models successfully reproduce the spiral-sink dynamics characterizing empirically reconstructed bioreactor dynamics. Another indication of success is that the attractors reconstructed from solutions of the phenomenological models bear striking visual resemblance to the empirical attractors (Fig. 6C). Finally, maximum Lyapunov Exponents computed from the empirical and phenomenological attractors were similar, as described in McLamore et al.¹.

Critical Step: Eigenvalues can be used to determine whether the digital proxy successfully reproduces the empirically reconstructed spiral-sink dynamics, as described in Huffaker, Bittelli, and Rosa⁶ (code available at http://www.dista.unibo.it/~bittelli/). In the example data set shown here, this is the case since the computed eigenvalues are a triplet of a negative real eigenvalue and a complex conjugate pair with negative real parts (as in Fig 6).
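The eigenvalue check can be expressed in a few lines. The Python/NumPy sketch below classifies an equilibrium from the Jacobian of an ODE system evaluated at that equilibrium. The function name, the tolerance, and the example Jacobian are hypothetical and are not taken from the fitted DIYBOT models.

```python
import numpy as np

def classify_equilibrium(jacobian, tol=1e-9):
    """Label an equilibrium from the eigenvalues of the Jacobian evaluated there."""
    eig = np.linalg.eigvals(np.asarray(jacobian, dtype=float))
    stable = bool(np.all(eig.real < -tol))           # every real part negative
    oscillatory = bool(np.any(np.abs(eig.imag) > tol))  # at least one complex pair
    if stable and oscillatory:
        return "spiral sink", eig
    if stable:
        return "sink (node)", eig
    return "not a sink", eig

# Hypothetical 3x3 Jacobian with eigenvalues -0.5 and -0.1 +/- 1i.
example = [[-0.1, -1.0, 0.0],
           [1.0, -0.1, 0.0],
           [0.0, 0.0, -0.5]]
label, eigenvalues = classify_equilibrium(example)
```

The example reproduces the pattern described above: one negative real eigenvalue plus a complex conjugate pair with negative real parts, which is labeled a spiral sink.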

Critical Step: To confirm reproduction of time series dynamics, compute the maximum Lyapunov Exponents from the empirical and phenomenological attractors as described in McLamore et al.¹.

A. Causality Quantification

We used numerical partial derivatives computed from the phenomenological models—measuring the marginal response of a response variable to an incremental change in a driving variable—to quantify the interactions detected with CCM (Fig. 7). We observe that when AgNP concentrations are at or below 2.0 mg/L, pH (the system driver) has a negative marginal impact on O₂ (black curves in Fig. 7A-B), while O₂ (as the driver) has a positive marginal impact on the pH (red curves in Fig. 7A-B). However, the impact of pH on O₂ decays to zero as time progresses. When AgNP concentration is 3.0 mg/L, pH has a detrimental impact on O₂ and this relationship decays with time (Fig. 7C).

Critical Step: If the curve of X driver on Y output is positive, this indicates a positive marginal impact. Negative values indicate a negative marginal impact.

Critical Step: If the curve decays to zero (as in this case for some of the curves), this indicates that over time the impact decays (i.e., is non-persistent).
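One way to compute such marginal-response curves is a central finite difference of the model's right-hand side along a trajectory, sketched below in Python/NumPy. This is an assumption about the mechanics rather than the authors' code: the function name, the step size, the toy two-variable model, and the state values are all hypothetical.

```python
import numpy as np

def marginal_response(rhs, states, response, driver, h=1e-5):
    """Central-difference partial derivative of one ODE right-hand side
    with respect to a driving variable, evaluated at each state."""
    states = np.asarray(states, dtype=float)
    bump = np.zeros(states.shape[1])
    bump[driver] = h
    up = np.array([rhs(s + bump)[response] for s in states])
    down = np.array([rhs(s - bump)[response] for s in states])
    return (up - down) / (2.0 * h)

def toy_rhs(state):
    """Hypothetical model: index 0 plays the role of O2, index 1 the role of pH."""
    o2, ph = state
    return np.array([-0.1 * o2 - 0.5 * ph ** 2,    # d(O2)/dt
                     0.2 * o2 - 0.1 * ph])          # d(pH)/dt

# States visited as the toy system settles toward its equilibrium at the origin.
trajectory = np.array([[1.0, 2.0], [0.5, 1.0], [0.0, 0.0]])
ph_on_o2 = marginal_response(toy_rhs, trajectory, response=0, driver=1)
o2_on_ph = marginal_response(toy_rhs, trajectory, response=1, driver=0)
```

In this toy model the effect of the pH-like variable on the O₂-like variable is negative and shrinks to zero along the trajectory, while the reverse effect stays positive, which mirrors how the sign and decay of the curves are read in the Critical Steps above.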

Troubleshooting

NLTS methods can fall short of successfully formulating a digital proxy of system output for several reasons, including: (1) Signal processing may detect signals that are too weak to model (i.e., signals that account for only a small portion of total variance in the sensor data); (2) Sensor data may be insufficiently informative to reconstruct the real-world attractor, for example, data might only sample transitory dynamics heading toward the attractor; or (3) A low-dimensional nonlinear attractor may not exist. However, we do not know any of this until we have tested the output data for it². When NLTS fails to detect strong signals, or to diagnose dimension-reducing nonlinear real-world dynamics in sensor signals, the underlying dynamics may well be high dimensional, and the information-extraction problem cannot be shrunk to a low-dimensional digital proxy without losing essential dynamic information. In these events, machine learning techniques may offer a better alternative. Details beyond the protocol here may be found in Huffaker, Bittelli, and Rosa (2017)⁶. Troubleshooting tips are summarized in Table 3.

Due to technical limitations, Tables 1-3 can be found in the Supplementary files section.

References

1 McLamore, E. et al. Digital proxy of a bio-reactor (DIYBOT) combines sensor data and data analysis to improve greywater treatment and wastewater management systems. Nat. Sci. Rep. (accepted, DOI: 10.1038/s41598-020-64789-5) (2020).

2 Huffaker, R. & Fearne, A. Reconstructing systematic persistent impacts of promotional marketing with empirical nonlinear dynamics. PLOS ONE 14, e0221167 (https://doi.org/10.1371/journal.pone.0221167) (2019).

3 Huffaker, R., Canavari, M. & Munoz-Carpena, R. Distinguishing between endogenous and exogenous price volatility in food security assessment: An empirical nonlinear dynamics approach. Agricultural Systems 160, 98-109 (2018).

4 Muir, J. My First Summer in the Sierra. (Houghton Mifflin, 1911).

5 Kantz, H. & Schreiber, T. Nonlinear Time Series Analysis. (Cambridge University Press, 1997).

6 Huffaker, R., Bittelli, M. & Rosa, R. Nonlinear Time Series Analysis with R. (Oxford University Press, 2017).

7 Takens, F. in Dynamical Systems and Turbulence (eds Rand, D. & Young, L.) 366-381 (Springer, 1980).

8 Deyle, E. & Sugihara, G. Generalized Theorems for Nonlinear State Space Reconstruction. PLoS One 6, 1-8 (2011).

9 Theiler, J., Eubank, S., Longtin, A., Galdrikian, B. & Farmer, J. Testing for nonlinearity in time series: The method of surrogate data. Physica D 58, 77-94 (1992).

10 Schreiber, T. & Schmitz, A. Surrogate time series. Physica D 142, 346-382 (2000).

11 Kaplan, D. & Glass, L. Understanding Nonlinear Dynamics. (Springer, 1995).

12 Bandt, C. & Pompe, B. Permutation entropy: a natural complexity measure for time series. Phys. Rev. Lett. 88, 174102 (2002).

13 Sugihara, G. et al. Detecting causality in complex ecosystems. Science 338, 496-500 (2012).

14 Brunton, S., Proctor, J. & Kutz, J. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. PNAS 113, 3932-3937 (2016).

15 Hyndman, R. Forecasting functions for time series and linear models. Retrieved from https://cran.r-project.org/package=forecast (2020).

16 Golyandina, N. & Korobeynikov, A. Basic singular spectrum analysis and forecasting with R. Computational Statistics and Data Analysis 71, 934-954 (2014).

17 Di Narzo, A. & Di Narzo, F. tseriesChaos: analysis of nonlinear time series. Retrieved from https://cran.r-project.org/package=tseriesChaos (2013).

18 Constantine, W. & Percival, D. fractal: Fractal time series modeling and analysis. Retrieved from https://cran.r-project.org/package=fractal. (2014).

19 Clark, A. multispatialCCM: multispatial convergent cross mapping. Retrieved from https://cran.r-project.org/package=multispatialCCM (2014).

Acknowledgements

This project was supported by Agriculture and Food Research Initiative Competitive Grant no. 2018-67016-27578 awarded as a Center of Excellence from the USDA National Institute of Food and Agriculture.

Table 1 (Table1.png): Singular Spectrum Analysis.
Table 2 (Table2.png): Surrogate Data Testing.
Table 3 (Table3.png): Troubleshooting common problems related to the protocol for establishment of a digital proxy based on sensor data.
