Human and machine learning pipelines for responsible clinical prediction using high-dimensional data

doi:10.21203/rs.3.pex-1655/v1

Method Article

Human and machine learning pipelines for responsible clinical prediction using high-dimensional data

https://doi.org/10.21203/rs.3.pex-1655/v1

This work is licensed under a CC BY 4.0 License

This protocol has been posted on Protocol Exchange, an open repository of community-contributed protocols sponsored by Nature Portfolio. These protocols are posted directly on the Protocol Exchange by authors and are made freely available to the scientific community for use and comment.

Version 1

posted

You are reading this older protocol version

Read the latest protocol version →

This protocol aims to develop, validate, and deploy a prediction model using high dimensional data by both human and machine learning. The applicability is intended for clinical prediction in healthcare providers, including but not limited to those using medical histories from electronic health records. This protocol applies diverse approaches to improve both predictive performance and interpretability while maintaining the generalizability of model evaluation. However, some steps require expensive computational capacity; otherwise, these will take longer time. The key stages consist of designs of data collection and analysis, feature discovery and quality control, and model development, validation, and deployment.

Information theory and computation

Computational biology and bioinformatics

Risk factors

Biomarkers

causal diagram

machine learning

deep neural network

medical history

electronic health record

dimensional reduction

Several health outcomes have been predicted by machine learning algorithms with satisfying performances for both categorical and numerical outcomes.^1-4 Recently, the most of well-known successes are deep-learning models that surpass human-level performance for diagnostic tasks.^5-7 However, prognostication with causal reasoning is warranted for prevention purpose. Machine learning can learn data distribution for prognostication but cannot infer causality yet, which is what may happen if the distribution differs from those existing in the dataset. But, human learning can learn causality, although our comprehension on big data is limited.⁸ Therefore, application of both human and machine learning are inevitable to achieve responsible clinical prediction. We proposed an analysis pipeline using several protocols previously described elsewhere.^9-12

The first building block of this pipeline is a systematic human learning algorithm.⁹ Based on hypothetico-deductive reasoning, human learning involves collection of prior knowledge to construct a causal diagram as a central assumption for hypothesis testing. Subsequently, statistical methods are used to verify the assumption. Solely using statistical analysis on available data without contextual knowledge can introduce data-driven bias. However, humans need machines to deal with enormous amounts of data. Causal diagram is expected to systematically mitigate this type of bias when a human interprets pattern from machine learning results.¹³

The second building block of this pipeline is optional for those using electronic medical records. We provide data extraction protocol for medical histories using Kaplan-Meier estimators, so called historical rate.¹⁰ These are sensible quantities to be differential through time for affecting a future health state. This protocol also allows comparable medical history across healthcare providers, since this type of data is often isolated within each healthcare provider. By population-level historical rate, it is possible to utilize the isolated data across healthcare providers without accessing the respective databases.

The third building block of this pipeline is to deal with high-dimensional data. We proposed a resampling protocol for dimensional reduction resulting a few latent variables.¹¹ Most prediction models that used latent variables, if not all, conducted a dimensional reduction without either resampling or data partition, which exposed to a risk of optimistic bias, and is not robust for samples beyond the training set. This is because resampling or data partition are more well-known in either predictive modeling or supervised machine learning, compared to a dimensional reduction that is typically used for statistical inference and unsupervised machine learning.

The fourth building block of this pipeline is predictive modeling by machine learning. For machine learning, we applied competition winning, state-of-the-art algorithms for tabular data, which were random forest (RF) and gradient boosting machine (GBM). For prognostication, these algorithms were found to outperform others for similar outcomes.¹ We also developed a deep-insight visible neural network (DI-VNN) protocol,¹² according to recent studies.^14,15 But, unlike any of those studies, our protocol allows a human to deeply explore the ‘subconscious mind’ of the machine learning prediction model to gain insights and to identify bias exploited for predictions.

We applied this pipeline to several studies which were parts of a DI-VNN project. These studies applied our DI-VNN algorithm to a variety of predicted outcomes with comparison to those applying human learning with statistical methods and predictive modeling with other machine learning algorithms. Ethical clearance was waived by the Taipei Medical University Joint Institutional Review Board (TMU-JIRB number: N202106025). We followed guidelines for developing and reporting machine learning predictive models in biomedical research.¹⁶ We also conducted a self-assessment following the PROBAST guidelines.¹⁷ Those guidelines are specific to multivariable prediction models for making individualized, prognostic, or diagnostic predictions, instead of those applying multivariable modeling to identify risk or prognostic factors.¹⁷ To ensure the clinical suitability of our models, we also fulfilled a clinician checklist for assessing the suitability of machine learning applications in healthcare.¹⁸ We also followed other guidelines to find comparable models to evaluate success criteria.¹⁹ This protocol aimed to provide several building blocks that can be applied either completely or partially to develop and validate clinical prediction using high dimensional data by both human and machine learning.

We used R 4.0.2 programming language (R Foundation, Vienna, Austria) to conduct most steps of the data analysis. For steps related to the DI-VNN, we also used Python 3.6.3 programming language (Anaconda Inc., Austin, TX, USA). The integrated development environment software was RStudio 1.3.959 (RStudio PBC, Boston, MA, USA). To ensure reproducibility, we used Bioconductor 3.11;²⁰ thus, versions of the included R packages were all in sync according to versions in this Bioconductor version. For all models except the DI-VNN, we used an R package of caret 6.0.86 that wraps R packages for the modeling algorithms, which were glmnet 4.1, Rborist 0.2.3, and gbm 2.1.8. For the DI-VNN, we used keras 2.3.0 and tensorflow 2.0.0 Python libraries via R packages of reticulate 1.16, keras 2.3.0.0, and tensorflow 2.0.0. We also created R packages and a Python library for many steps in the data analysis, including the DI-VNN, medhist 0.1.0, gmethods 0.1.0, rsdr 0.1.0, clixo 0.1.1, and divnn 0.1.3 (both an R package and Python library). All of these packages/libraries are available for download from this repository https://github.com/herdiantrisufriyana. For model deployment, we used Shiny Server 1.4.16.958 and Node.js 12.20.0. Details on other R package versions and all of the source codes (vignette) for the data analysis are available (Table 6 in Supplementary Information).

To reproduce our work, a set of hardware requirements may be needed. We used a single machine for all models, except the DI-VNN, with 16 logical processors for the 2.10 GHz central processing unit (CPU) (Xeon® E5-2620 v4, Intel®, Santa Clara, CA, USA), 128 GB RAM, and 11 GB graphics processing unit (GPU) memory (GeForce GTX 1080 Ti, NVIDIA, Santa Clara, CA, USA). Parallelism was applied for CPU computing. Meanwhile, the DI-VNN required a higher GPU capability than that provided by the previous specification. For hyperparameter tuning and training, we used multiple machines in a cloud with 90 GB RAM and 32 GB GPU memory (Tesla V100-SXM2, NVIDIA, Santa Clara, CA, USA). For predictions, the DI-VNN only needed a CPU in a local machine, or that in a cloud machine for the web application.

1. Choose the study design.

Either prospective or retrospective cohort paradigm was recommended to prevent temporal bias for prognostic prediction, and case-control should be avoided.²¹ The latter study design may introduce collider-stratification or selection bias.¹³ For diagnostic prediction, a cross-sectional design was recommended by PROBAST.¹⁷ These guidelines also warned utilizing randomized-controlled trial for prediction study. Study design should reflect similar situation in population level according to whether we develop either prognostic or diagnostic prediction. In our example, a retrospective design was applied to select subjects from database.

2. Define target population by selection criteria.

Selection criteria of subjects with any outcome should be defined. These are not defined for each of outcome, i.e. case and control group; otherwise, we implicitly apply case-control design. In our example, all 12~55-year-old females were included, particularly whose visits were recorded within the dataset period. This represented population of visitors of the healthcare providers, since we intended to use our models for this population. Each period of pregnancy of the same subject should be treated as different subjects. The medical history before or during the first pregnancy was retained for the second instance. All visits after delivery should be excluded within the dataset period. Codes of diagnosis and procedure for determining delivery or immediately after delivery care should be determined and listed in supplementary information.

3. Determine the type of prediction task.

Machine learning prediction tasks may be either classifying a categorical outcome or estimating a numerical outcome, and either prognostic or diagnostic prediction. The estimation task was also commonly known as regression task. We avoid this word to prevent confusing it with regression algorithms which can also serve for both classification and estimation tasks. In our example, we determined our tasks were prognostic classification of prelabor rupture of membranes (PROM) and prognostic estimation of the time of delivery.

4. Determine outcome definition.

4. a. Define outcome for the classification task.

In the context of electronic health records, as shown in our example, we assigned one or more codes of diagnosis and/or procedure for an outcome based on International Classification of Disease version 10 (ICD-10) coding system. A subject was otherwise assigned as a nonevent. In the context of pregnancy, we may utilize the same codes for determining the end of the pregnancy period.

4. b. Define outcome for the estimation task.

This would be an infinite set of numbers. In the context of pregnancy, as shown in our example, we did not have information about gestational age. But, we may infer the number of days from a visit, at which the time of prediction, to the last visit at which the time of true outcome. We used the same codes for the classification task to determine such visits encountering those of events and nonevents.

5. Preserve censoring information in the dataset.

Censoring outcome should not be excluded at first. For example, we may not know whether the subjects would be pregnant or not, and those who pregnant would be delivered or not up to the end of the study period. Instead of removing instances with censored outcomes, we assigned censoring labels. These were taken into account for causal inferences, and weighting of uncensored outcomes over both censored and uncensored outcomes when training the models later. Indeed, none of the censored outcomes would be included for developing prediction models. This would preserve similar distributions of any outcomes with those in the target population and resolve the class imbalance problem by inverse probability weighting.

6. Set priority based on assessment of practical costs of under- and over-prediction.

To evaluate a prediction model, practical costs of prediction errors should be considered.¹⁶ We need to relate potential consequences of under- and over-prediction. Then, we need to choose which one would be likely more frequent or larger magnitude. A priority on dealing with under-prediction would need well calibration and higher true positive rate, while that of over-prediction would also need well calibration but higher specificity. For estimation task, we need to set limit of maximum error which is considerably safe to use a prediction model. In our example, we set priority on dealing with under-prognosis and limit of error within 2 to 4 weeks of the true time of delivery.

7. Determine candidate predictors.

We need to collect data for variables classified into two groups. The first group were data for baseline variables, i.e. demographics. Meanwhile, the other variables were candidate predictors. We avoid using baseline variables for candidate predictors, particularly those which need to use private data and reflect social and economic background. However, we need to understand the characteristics of our population based on the baseline variables. Using these variables, we may find whether future, unobserved data have similar characteristics, since this would describe how likely the predictive performance of our models for those new instances. Baseline variables were indeed still included for causal inferences. We also avoided maternal age, although this variable is often a strong predictor. The reason why we avoid the variable was that machine learning models often memorize age non-linearly. In turn, a large weight is often assigned to maternal age followed by weight shrinking of other predictors. All of these demographic variables were respectively assigned either 0 or 1 for no or yes in each category of each variable. Meanwhile, we extracted ICD-10 codes for diagnoses or procedures from medical histories.

8. Define and verify causal factors as parts of candidate predictors.

We made gmethods 0.1.0, an R package, that allows future investigators to conduct statistical tests for causal inference. Please kindly follow the protocol.⁹ After verifying causal factors, we only included those in a prediction model that applied a logistic regression with a shrinkage method, as recommended by PROBAST, instead of using a stepwise selection method.¹⁷ We chose an RR, which applies L₂-norm or beta regularization, because this method retains all causal factors within the model after weights are updated by training.²² We understood that this model would not necessarily be the best model, because of the use of causal factors. Predictive modeling normally exploits confounders to achieve better performances, while causal models cannot explain all variations among individuals, to which confounding factors contribute.¹³ However, by comparing a predictive model to one that uses only causal factors, we could imagine how much confounder effects were exploited to improve the predictive performance by machine learning algorithms. This, in turn, can warn a human user of machine learning algorithms about conducting a critical appraisal of internal properties of a machine learning model. We followed the same procedures for hyperparameter tuning and parameter fitting (training) as those for machine learning, as described in the following sections. Nonetheless, we viewed tuning and training of this prediction model as already a part of machine learning since fewer interventions are required by human users.

9. Remove candidate predictors with perfect separation problems.

We identified candidate predictors that took a single value or had zero variance. We removed all candidate predictors which were positive (value of 1) in only one of the outcomes in a training set. This is a perfect separation problem.¹⁶ A predictor may be exclusive for one of the outcomes by chance due to sampling error. To prevent such a bias, we removed perfect-separation candidate predictors. Details of the candidate predictors and selection should be described in supplementary information.

10. Remove candidate predictors that may cause outcome leakage.

We need to identify these kinds of candidate predictors. In the context of electronic health records, these would be any variable indicated by codes of diagnosis and procedure, which are explicitly or implicitly following the outcome definition. In our example, these referred to maternal or baby diagnosis/procedure codes that typically occur only during the delivery or post-delivery period. Otherwise, these codes would unexpectedly leak outcome information. The excluded codes should be listed and reported in supplementary information.

11. Filter irredundant candidate predictors.

To conduct this step, we need to compute pair-wise Pearson correlation coefficients. A pair of candidate predictors which has a high correlation (i.e., >~0.70) should be removed. This may be done by using one of the paired of candidate predictors, or unifying both under a single definition. In our example, we had pairs of candidate predictors that were highly correlated, but, the coefficients were borderline (i.e. ~0.7), while those were causal factors and some codes defining the factors themselves. Yet, we interpreted the meaning was not considerably the same; thus, we passed those candidate predictors.

12. Construct provider-wise datasets for model development and validation.

We used medical histories and causal factors as candidate features/predictors. Instead of using nationwide medical histories, we need to extract the provider-wise ones by estimation. We only considered medical history of a subject recorded in a single healthcare provider and treated that of the other providers as separated medical histories, like in real-world setting.

13. Quantify medical histories with the Kaplan-Meier (KM) estimator.

Electronic health records across healthcare provider are unlikely connected. We need to have nationwide historical (KM) rates for each code, derived from the training set only. All medical histories in days were transformed into historical rates. This technique allowed generalization of individual data based on nationwide, population-level data, without the need to access data from other providers. We made medhist 0.1.0, an R package, that allows future investigators to implement this historical rate. Please kindly follow the protocol.¹⁰

14. Conduct dimensional reduction by resampling for candidate features of the machine-learning prediction models.

We made rsdr 0.1.0, an R package, that allows future investigators to apply resampled dimensional reduction. Please kindly follow the protocol.¹¹ Resampling is important to prevent overfitting in machine learning.⁸ But, this is commonly applied in supervised instead of unsupervised machine learning; thus, this procedure is expected to deal with this problem.

15. Consider the number of events per variable when choosing the number of candidate predictors for each model.

The minimum number is 20 events per variable (EPV), which was recommended for logistic regression, as recommended by the prediction model risk of bias assessment tools (PROBAST) guidelines.¹⁷ Dimensional reduction can be conducted to get latent variables fewer than the original candidate predictors. Furthermore, pre-selection may be applied by a regression model, which needs the least EPV. Then, we picked fewer latent variables with the highest absolute, non-zero weights as candidate predictors for the models which have higher EPV requirement, e.g. 200 EPV for random forest and gradient boosting machine.²³

16. Choose modeling approaches to be compared.

Five modeling approaches were applied for supervised machine learning, consisting of a set of procedures covering feature representation, feature selection, hyperparameter tuning, and training strategies. For all models, we applied a grid search method for hyperparameter tuning of a minimum of 10 alternatives for each modeling approach. The best hyperparameters or tuning parameters were used for training the model or fitting the parameters.

17. Compute outcome weights to overcome class imbalance problem.

For classification task, the outcome (Y) was weighted by half of the inverse probability/prevalence (w_i), including the censored outcome (∅). For example, if the prevalence of Y = 1 is 0.2, then w_i is 1 ÷ 0.2 × 0.5 for the outcome of Y = 1.The sum of the three probabilities is equal to 1. The weight formula is shown as Figure 1. Weights were plugged into a general equation of the loss function in this study (Figure 2), in which training was generally conducted to estimate parameter θ_j in a model f(x_ij,θ_j) that minimizes L, where n is the number of visits, p is the number of candidate predictors, x_ij is the value for each i^th instance and j^th candidate predictor, and α and λ are regularization factors. Meanwhile, no weighting was applied in the estimation task (w = 1 for any i). In addition, a specific training strategy was applied for the last modeling approach, which was the DI-VNN, a model using the pipeline we developed in a previous protocol.¹²

18. Determine hyperparameters for the human-learning prediction model.

The first model was the causal RR. We used all causal factors except ones that included demographics. This model applied a filter method for feature selection by verifying assumptions based on domain knowledge with a statistical test for the causal inference, as described in the previous section. The RR was applied as the parameter-fitting algorithm. The tuning parameter was λ = {10^-9,10^-8,⋯,100} while keeping α = 0. These

values were plugged into the loss function (Figure 2).

19. Determine hyperparameters for the machine-learning prediction model using regression algorithm.

The second model was PC-ENR (elastic net regression). We used all PCs for the second model since the EPV was already >20 using the training set. This number, instead of 10, is recommended when applying a regression algorithm.¹⁷ This model also applied a shrinkage method for feature selection. Tuning parameters were combinations of α = [0,0.25,⋯,1] and λ = [10^-9+g(10⁰-10^-9)/G,⋯,10⁰] for g = [1,2,⋯,G] and G = 5; thus, the best tuning parameters were searched over 5 × 5 alternatives. These values were plugged into the loss

function (Figure 2), where α = 0 means this model becomes an RR but using PCs instead of the original candidate predictors. But if α > 0, some candidate predictors x_(j) might be zero after being multiplied by θ_j which minimizes the training loss. Each time, this was applied to different α and λ values to find a couple of values that minimized the validation loss.

20. Conduct further pre-selection of candidate features for machine-learning prediction with larger number of events per variable.

The 3^rd and 4^th models respectively applied the PC-RF (random forest) and PC-GBM (gradient boosting machine). These models also applied the wrapper method for feature selection. This means that candidate predictors were pre-selected by a model before being used for these models. The wrapper model was the PC-ENR. We expected a smaller number of PCs to be pre-selected; thus, the EPV was ≥200 using the training set. This is allowed by either using the wrapper method only or subsequently applying a filter method. The filter method was conducted by ranking the PCs in descending order based on the corresponding θ_j. We selected l of PCs where l is a number such that the EPV is 200. Unlike regression algorithms, modern machine algorithms are data hungry, of which the tested algorithms were the classification and regression tree (CART), RF, support vector machine, and shallow neural network.²³

21. Determine hyperparameters for the machine-learning prediction model using ensemble algorithms.

RF and GBM are ensemble algorithms. This means that both use prediction results of multiple models that apply the same algorithm, i.e., CART. However, the RF ensembles models in parallel, while the GBM sequentially ensembles the models.^24,25 Parallel ensemble means predictions are respectively the majority and average of CART predictions for classification and estimation tasks. Meanwhile, sequential ensemble means a simple CART is made to predict the classification or estimation error of an earlier CART model. Tuning parameters for the RF were combinations from a number of random candidate predictors used to build a CART each time [5+g(45-5)/G,⋯,45] and a number of minimum unique instances per node for G = 5, while 500 CARTs were built for each of the candidate models. For the GBM, we set tuning parameters as combinations from a number of CARTs [100,200,⋯,2500] and a shrinkage factor or L2-norm regularizer [0.0005+g(0.05-0.0005)/G,⋯,0.05] for G = 25, while the minimum sample number per node (tree) was 20 and only 1 random predictor was used to build a CART each time. Since exhaustively comparing all possible combinations is time consuming, if not impossible, all numbers for both ensemble models were determined based on common practices for simplicity. However, we considered the sample size, the number of candidate predictors, and the diversity of approaches between the two algorithms to heuristically optimize the hypothesis search. The prediction results of these models were plugged into the loss function (Figure 2) with α = 0 and λ = 0 to compute the errors that were minimized by training.

22. Develop and validate DI-VNN model.

This model works like neurons in the brain. Predictors are fed as inputs into neurons, and the outputs follow an all-or-none activation function. These inputs are arrayed as a minimum of two-dimensional imaging such as an object projected onto retina. We called this an ontology array. A predictor that is correlated to another predictor would have a closer position than another less-correlated predictor. From the receptive field on the retina, the signals propagate to the primary visual cortex and then the visual association cortex. The signal pathway of the neural network is dedicated for specific parts of the array. We called this an ontology network. By seeing similar objects with uncertain variations, the neural network is maintained at specific activation thresholds and weights. This creates visual memory in the association cortex to recognize a particular object by segmenting it into several parts. We made divnn 0.1.3, an R package and Python library, that allows future investigators to develop and validate this model. Please kindly follow the protocol.¹²

23. Evaluate all prediction models.

23. a. Calculate 95% confidence interval from all resampling subsets.

We needed to evaluate our models in terms of both classification and estimation tasks. The 95% confidence interval (CI) was used to express all evaluation metrics as estimates. That interval was calculated from the metrics of multiple resampling subsets for model validation.

23. b. Assess whether a model is well-calibrated for classification task.

Calibration measures were evaluated by fitting the predicted probabilities to true probabilities using a univariable linear regression; thus, the calibration measures were the intercept and slope of this linear regression. We also demonstrated a calibration plot. A model is well-calibrated if the interval of the calibration intercept and slope respectively fall closer onto 0 and 1, and the calibration plot approximately hugs the reference line.

23. c. Compare discriminative ability among well-calibrated models for classification task.

Meanwhile, the discriminative ability of a model was quantified by an AUROC interval estimate. A non-overlapping, higher interval of the AUROC determined the best model among the most calibrated ones.

23. d. Computer prediction error for estimation task.

For estimation, we applied the root mean squared error (RMSE) to train the models. However, since a longer interval is reasonably more difficult to predict, a single evaluation by RMSE might be insufficient to evaluate the estimation performance. In our example, we also evaluated the estimation plot between the predicted and true times of delivery, binned per week.

23. e. Consider to differentiate prediction error of estimation based on predicted classification.

Different lines in the estimation plot were provided for positive and negative predictions by the best classification model. This would be reasonable for clinical applications by giving an interval estimate of days for the true time of delivery conditional on the predicted one. In our example, we also limited this evaluation to an estimated time of 42 weeks or less, which is the maximum duration of pregnancy.

23. f. Compare predictive performance for estimation task considering clinical relevance.

To determine the best estimation model, we computed a proportion of weeks in which each predicted time, in weeks, was included within an interval estimate of the true one. The interval had to be the maximum ± x weeks when predicting > x weeks. In our example, if a model predicted that a woman would deliver in 6 weeks, then the interval estimate of the true time of delivery should be the maximum ± 6 weeks. For any women predicted to deliver in 6 weeks, this number should fall into that interval.

23. g. Limit minimum and maximum values that can be estimated precisely.

In our example, for the best estimation model, we determined the minimum and maximum predicted times of delivery with acceptable precision for each predicted outcome of PROM based on a visual assessment using internal validation. We also computed the RMSE within this acceptable range.

24. Assess fulfillment of success criteria of the model development.

Success criteria of the modeling should be at least by a threshold-agnostic evaluation metric which is better than those of recent models of the similar outcome using similar predictors. Consider to add a criterion or more to prevent overfitting, as recommended by PROBAST guidelines.¹⁷ If there is no previous studies to be compared with, AUROC 0.8 or more can be considered for success criterion.¹⁸

25. Compare to models from previous studies.

25. a. Determine the context of model comparison across studies.

Model comparison should be within the context of its clinical application. This is represented in the eligibility criteria to select studies to be compared with.

25. b. Set up comparison guidelines.

Evaluation of success criteria needs comparator models. To find these models, one can apply PRISMA 2020 guidelines; thus, the models are comparable and the comparison is also fair. Since a systematic review and meta-analysis were not the main purposes of a study in our example, we only applied all items in section methods in the guidelines, except item numbers 11 and 14 regarding the risk of bias assessment and item numbers 13 and 15 regarding synthesis method and certainty. This was because we applied the guidelines only to facilitate fair comparisons to previous studies, and did not consider conclusions as to how valid and accurate PROM could be predicted.

25. c. Define study selection criteria.

The eligibility criteria should follow the PICOTS framework: (1) population, targeted subjects that may or may not have the outcome; (2) index, a prediction model of interest; (3) comparator, previous prediction models to be compared with; (4) outcome, definition and data scale for the outcome with or without additional criterion on the sample size (e.g. events per variable); (5) time, prognostic or diagnostic prediction with specified time interval; and (6) setting, healthcare setting such as primary care, hospital, inpatient, outpatient, et cetera. In our example, only original articles and conference abstracts full papers were included. No grouping of the studies was needed because we would not conduct a meta-analysis for data synthesis.

25. d. Develop the search strategy.

A search date interval and literature databases should be defined. We also need to describe the keywords. Some configuration for searching studies should also be described, e.g., limiting publication date and language.

25. e. Develop the review strategy.

We need to describe which author does what step of review. If there is a conflicting decision among reviewers, then how to achieve final decision should be described.

25. f. Choose the extracted data.

In our example, we extracted evaluation metrics of the best model with the most similar outcome definition from each study to that of our study. Any data that are needed to briefly assess potential risk of bias were also extracted. These included the study design, population, setting, outcome definition, sample size, details on events and nonevents, number of candidate predictors, EPV, predictors in the best final model, and the most recommended validation techniques (external over internal validation; bootstrapping over cross-validation, or cross-validation over test split). If no AUROC was reported by a previous study, but there were sensitivity and specificity, then we computed AUROC from both of the metrics using trapezoidal rule (Figure 3).

25. g. Define model evaluation method.

Plots of evaluation should be overlaid those of our models. In our example, we plotted a point at a sensitivity and a specificity for each of our models at the optimum threshold on each ROC curve. AUROCs were also compared with the models in our example (with or without the 95% CI).

26. Calibrate the predicted probability of any model for classification task.

A model can be calibrated by training an additional model to estimate the true probability of an outcome for classification task given the predicted probability by the model. In our example, we applied a general additive model using locally weighted scatterplot smoothing (GAM-LOESS). Instead of using pre-calibrated models, the final comparison was made for models calibrated using predicted probabilities from each of the developed models. This calibration was intended to obtain more-linear predicted probabilities. No calibration was conducted for the estimation models.

27. Validate the predictive performances using several data partition methods.

27. a. Conduct data partition for non-randomized, external validation.

We split the dataset into two subsets which were intended for internal and external validation. For all external validation sets, evaluation metrics were resampled by bootstrapping 30 times to obtain interval estimates. As recommended, we split out a dataset for external validation by geographical, temporal, and geotemporal splitting. Geographical splitting was conducted by stratified random sampling of cities with as many as

proportion of cities in each state. Temporal splitting was also conducted by stratified random sampling of days with as many as p proportion of days in each subtropical season. Visits of subjects from either the sampled cities or days were respectively excluded as either a geographical or temporal subset. Then, we further excluded as many as p proportion of cities in the geographical subset or that number of days in the temporal subset. Visits of subjects from both of the excluded subsets were overlapped as a new geotemporal subset. Either a geographical or geotemporal subset did not include visits or subjects in this geotemporal subset. We randomly tried different p proportions for geographical, temporal, and geotemporal splitting such that approximately ~20% of visits belonged to any of the three subsets. These external validation sets were used for stress tests of our models since the distributions of predictors and outcome were conceivably uncertain among excluded cities, days, or overlaps. This reflects the situation in some real-world settings, but this might not reflect common situations nationwide.

27. b. Conduct data partition for randomized, external validation.

To estimate the predictive performance nationwide, ~20% of the remaining set was randomly split out thus leaving ~64% of the original sample size for the training set. This random subset was more representative for external validation to estimate the predictive performances of our models nationwide. After splitting out this random subset, the remaining samples were intended for internal validation.

27. c. Conduct data partition for calibration.

We split out ~20% of the internal validation set to calibrate each model. The final predictive performance of internal validation came from this calibration subset. We resampled it by bootstrapping 30 times to compute interval estimates. To compute the EPV, we only used ~80%, which was, the pre-calibrated subset. This was used to train the models.

27. d. Choose resampling techniques for any validation. For hyperparameter tuning, we applied 5-fold cross-validation, while the final training was conducted using the best tuned parameters by bootstrapping 30 times. Both tuning and training used the same pre-calibrated set by applying cross-validation for all models except the DI-VNN. Please kindly follow the protocol of DI-VNN.¹²

28. Consider to assess the best time window for the best model of prognostic prediction.

The best time window should be evaluated using the best model for prognostic prediction. This provided additional insights when interpreting the best model. In our example, using the internal validation set, we grouped samples by binning the days to the end of pregnancy every 4 weeks. An interval estimate of the AUROCs was computed for each bin from all resampled subsets. By observing the plot, the best time window should mostly cover interval estimates that were greater than an AUROC of 0.5, which represents prediction by simple guessing.

29. Deploy the best model for each type of tasks.

29. a. Determine input for the deployed models.

The best model can be deployed for both classification and estimation tasks. A model can be quite complex to apply without a software/application; thus, a web application need to be provided for using the best models. This can be made using R Shiny. In our example, a user needs to upload a deidentified, two-column comma-separated value (CSV) file of admission dates and ICD-10 codes of the medical history of a subject. A use case should be described with an example dataset for immediate demonstration of how to use the web application.

29. b. Design user interface.

For responsible clinical prediction, a threshold is expected to be flexible, set by a user. For the classification performance, a threshold should be able to be chose under expected population-level performances of evaluation metrics of interest. For the estimation performance, a true value should also be estimated based on a subpopulation with the same estimated value. This may also be conditioned on the same classification result. All population- and subpopulation-level metrics are expressed as interval estimates. In our example, the population data for the classification task were those of the calibration split, while those for the estimation task were data of the internal validation set which consisted of both pre-calibrated and calibrated subsets. In addition, for estimation task, a timeline should also be shown to visualize each predicted value with the true interval estimate at the subpopulation level. The timeline should show the time of prediction and those code entries that were used as features; thus, a medical history of a subject is visualized using this timeline. All results can be saved online and/or downloaded as a report. An example of the report should be shown. Inference durations 10 times should also be measured for a use case and reported as interval estimates.

Step 1 to 6

Problem

Failure to assign outcome

Possible reason

Absence of codes for outcome, or none in the target population

Solution

Consider to use other codes of outcome or get another dataset, or to change selection criteria for target population.

Step 7

Problem 1

Premature stop of computation

Possible reason 1

Dataset consists of a large sample size or number of variables

Solution 1

Consider to use computer with larger memory size or RAM. Alternatively, split table into ones consisting unique sets of variables.

Problem 2

No candidate predictors

Possible reason 2

Absence of codes for predictors

Solution 2

Consider to use other codes of predictors or get another dataset.

Step 8

Problem

No causal factor

Possible reason

Absence of codes for causal factor

Solution

Consider to use other codes of causal factor or get another dataset.

Step 9 to 12

Problem 1

No eligible candidate predictors

Possible reason 1

Perfect separation problem on all candidate predictors due to extremely small sample size of any class of outcome

Solution 1

Consider to use other codes of outcome or get another dataset.

Problem 2

Almost perfect prediction

Possible reason 2

Codes for outcome being those for predictors

Solution 2

Ensure these codes are not used for predictors.

Problem 3

Redundant candidate predictors

Possible reason 3

Codes for causal factor being those for other predictors

Solution 3

Consider to remove either causal factor or the other predictors, or assess if the context or meaning is different between those predictors.

Step 13

Problem

Unable to infer historical rate

Possible reason

A predictor is only available in an instance.

Solution

Remove this predictor.

Step 14

Problem

Unable to run dimensional reduction procedures

Possible reason

None or only 1 candidate predictor with non-zero variance

Solution

Consider to use other candidate predictors.

Step 15 to 17

Problem

<20 events per variable (EPV)

Possible reason

Too many candidate predictors or too small sample size of outcome

Solution

Apply pre-selection more strict, use other codes of outcome or get another dataset, or ensure models generalized in all the external validation sets.

Step 18 to 22

Problem

Premature stop of computation

Possible reason

Large dataset or complex model

Solution

Consider to use computer with larger GPU (DI-VNN) or CPU (other models) memory size. Alternatively, reduce batch size and adjust learning rate accordingly for DI-VNN, or reduce number of parallel processes for other models.

Step 23

Problem 1

Unable to compute calibration measures

Possible reason 1

Narrow range of predicted probabilities

Solution 1

Increase digits for decimal rounding.

Problem 2

Unable to compute receiver operating characteristic

Possible reason 2

Narrow range of predicted probabilities

Solution 2

Increase digits for decimal of thresholds.

Step 18 to 19

Problem

Slow processing

Possible reason

Dataset consists of a large sample size or complex architecture.

Solution

Consider to use computer with larger GPU memory size. Alternatively, increase batch size and adjust learning rate accordingly, stop training earlier, or reduce number of bootstrapping.

Step 24 to 25

Problem

No previous model to be compared with

Possible reason 1

New prediction problem

Solution 1

Consider to use AUROC 0.8 or more to be considered for success criterion.

Possible reason 2

No study found using the keywords or literature databases

Solution 2

Consider to add more relevant the keywords or literature databases.

Step 26

Problem

Unable to calibrate the models

Possible reason

Narrow range of predicted probabilities

Solution

Consider to calibrate using another algorithm, or not to calibrate the models.

Step 27

Problem

Incomplete outcome in a subset

Possible reason

Small sample size

Solution

Consider to increase proportion for the subset, or get another dataset.

Step 28

Problem

Unable to compute a metric for a period

Possible reason

Small sample size

Solution

Consider to widen the range of each period, or get another dataset.

Step 29

Problem

Slow prediction

Possible reason

Complex model

Solution

Reduce model complexity, or design interface that shows progress of prediction in multiple, short steps.

All steps

Approximate time: 1 hour to 8 days (pre-computed)

Step 1 to 6

Approximate time: 1 to 5 minutes (pre-computed)

Step 7

Approximate time: 5 to 15 minutes (pre-computed)

Step 8

Approximate time: 25 minutes to 3 hours per causal factor

Step 9 to 12

Approximate time: 1 to 5 minutes

Step 13

Approximate time: 1 to 2 minutes

Step 14

Approximate time: 5 to 30 minutes

Step 15 to 17

Approximate time: 1 to 2 minutes

Step 18 to 23, and 26 to 28

Approximate time: 1 to 20 hours per classification/estimation model

Step 24 to 25

Approximate time: 1 to 7 days

Step 29

Approximate time: 10 to 20 minutes

We may expect better generalization with data partition technique in this protocol. The predictive performance would be reasonably higher in random split by chance. Compared to non-random splits, that split includes more variation across healthcare facilities in a country. Some models apply overfitting strategy (i.e., random forest and gradient boosting machine) to outperform others but having lower generalization. This will expose to false decision in model selection. This may be prevented by evaluating models based on calibration set which is a different subset to that for training the models by the algorithm of interest.

Although causal RR uses causal factors, this does not mean this model would have the best predictive performance. A prediction model normally exploits confounders to predict all variations in the dataset. But, this model would be a good reference model for the others. We also may use this model to interpret important predictors in other models.

We may or may not find any well-calibrated model. But, the most calibrated one(s) should be chosen among the prediction models, preferably those with distributions of predicted probabilities from 0 to 1. A wide range of predicted probabilities will help clinicians to widely adjust threshold based on local data distribution. The optimal threshold for the best model(s) should be given as initial threshold. In addition, distributions of predicted probabilities may or may not be visually differentiated between events and nonevents.

A well-calibrated prediction model may be either more sensitive or more specific based on receiver operating characteristic (ROC) curves. This should be assessed by internal validation (calibration split). A more-sensitive model (i.e. high sensitivity) at 95% specificity is considerably better for screening a disease from a population-level standpoint. But, sensitivity and specificity at the optimal threshold also need to be considered.

Non-random splitting may not describe generalization but this can confirm the robustness of the best model. The best model, which is chosen by internal validation, is expected to be the best in most, if not all, of the non-random splits of external validation. Yet, the random split may reflect common situations nationwide. We also need to ensure that the best model has interval estimates of the predictive performance better than simple guessing, i.e. the AUROC interval should be higher than 0.5 for all the non-random splits. This can be a safety factor to implement the best model before re-calibration using local data.

Comparison to prediction models from previous studies is important to assess success model development. But, we may or may not find comparable models following methods in the preferred reporting items for systematic reviews and meta-analyses (PRISMA) 2020 expanded checklist.¹⁹ In this situation, we may need to consider more literature databases or other keyword strategy. Furthermore, we may also consider modification of eligibility criteria. If we find the previous models, we need to compare using a threshold-agnostic evaluation metric (e.g. AUROC) along with additional metrics requiring a threshold and the sample size for each class of outcome. In addition, it is also important to point out if the predictors are those in either low- or high-resource settings.

For estimation task, we need to consider clinical acceptance of the prediction error for each range of estimated values. Models may or may not fulfill the clinical acceptance for each of the ranges. DI-VNN may not estimate the true time of delivery in our example mainly due to the differential analysis. It only filters predictors based on categorical outcomes only. If classification and estimation tasks are related in a circumstance, as demonstrated by our example, we need to stratify the estimated value for each of the classes predicted by the best classification model. As for classification task, the best model for estimation task should also be confirmed to consistently outperform the other models using external validation sets.

Since an estimation model may not fulfill the clinical acceptance for each of the ranges, we need to determine the precise estimation window of the best estimation model using an internal validation set. This estimation window should be determined for each of the predicted classes. In addition, different estimated values for each of the class may gain insights to the outcome disease.

1. Sufriyana, H., et al. Comparison of Multivariable Logistic Regression and Other Machine Learning Algorithms for Prognostic Prediction Studies in Pregnancy Care: Systematic Review and Meta-Analysis. JMIR Med Inform 8, e16503 (2020).

2. Fleuren, L.M., et al. Machine learning for the prediction of sepsis: a systematic review and meta-analysis of diagnostic test accuracy. Intensive Care Med 46, 383-400 (2020).

3. Lee, Y., et al. Applications of machine learning algorithms to predict therapeutic outcomes in depression: A meta-analysis and systematic review. J Affect Disord 241, 519-532 (2018).

4. Gonem, S., Janssens, W., Das, N. & Topalovic, M. Applications of artificial intelligence and machine learning in respiratory medicine. Thorax 75, 695-701 (2020).

5. Bien, N., et al. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet. PLoS Med 15, e1002699 (2018).

6. Hannun, A.Y., et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat Med 25, 65-69 (2019).

7. Rajpurkar, P., et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med 15, e1002686 (2018).

8. Wilkinson, J., et al. Time to reality check the promises of machine learning-powered precision medicine. Lancet Digit Health 2, e677-e680 (2020).

9. Sufriyana, H., Wu, Y.W. & Su, E.C. Systematic human learning by literature and data mining for feature selection in machine learning. Protocol Exchange (2021).

10. Sufriyana, H., Wu, Y.W. & Su, E.C. Quantifying medical histories with the Kaplan-Meier (KM) estimator for feature extraction of electronic health records in machine learning. Protocol Exchange (2021).

11. Sufriyana, H., Wu, Y.W. & Su, E.C. Resampled dimensional reduction for feature representation in machine learning. Protocol Exchange (2021).

12. Sufriyana, H., Wu, Y.W. & Su, E.C. Deep-insight visible neural network (DI-VNN) for improving interpretability of a non-image deep learning model by data-driven ontology. Protocol Exchange (2021).

13. Hernán, M.A. & Robins, J.M. Causal Inference: What If. , (Chapman & Hall/CRC, Boca Raton, 2020).

14. Sharma, A., Vans, E., Shigemizu, D., Boroevich, K.A. & Tsunoda, T. DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture. Sci Rep 9, 11399 (2019).

15. Ma, J., et al. Using deep learning to model the hierarchical structure and function of a cell. Nat Methods 15, 290-298 (2018).

16. Luo, W., et al. Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View. J Med Internet Res 18, e323 (2016).

17. Moons, K.G.M., et al. PROBAST: A Tool to Assess Risk of Bias and Applicability of Prediction Model Studies: Explanation and Elaboration. Ann Intern Med 170, W1-w33 (2019).

18. Scott, I., Carter, S. & Coiera, E. Clinician checklist for assessing suitability of machine learning applications in healthcare. BMJ Health Care Inform 28(2021).

19. Page, M.J., et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Bmj 372, n71 (2021).

20. Huber, W., et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 12, 115-121 (2015).

21. Yuan, W., et al. Temporal bias in case-control design: preventing reliable predictions of the future. Nat Commun 12, 1107 (2021).

22. Van Calster, B., van Smeden, M., De Cock, B. & Steyerberg, E.W. Regression shrinkage methods for clinical prediction models do not guarantee improved performance: Simulation study. Stat Methods Med Res 29, 3166-3178 (2020).

23. van der Ploeg, T., Austin, P.C. & Steyerberg, E.W. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Med Res Methodol 14, 137 (2014).

24. Friedman, J. Greedy function approximation: A gradient boosting machine. Annals of Statistics 29, 1189-1232 (2001).

25. Breiman, L. Random Forests. Machine Learning 45, 5-32 (2001).

The social security administrator for health or badan penyelenggara jaminan sosial (BPJS) kesehatan in Indonesia gave permission to access the sample dataset in this protocol (dataset request approval number: 5064/I.2/0421). This protocol was funded by the Ministry of Science and Technology (MOST) in Taiwan (grant number MOST109-2221-E-038-018 and MOST110-2628-E-038-001) and the Higher Education Sprout Project from the Ministry of Education (MOE) in Taiwan (grant number DP2-110-21121-01-A-13) to Emily Chia-Yu Su.

suppprotocolhmlp.pdf
Supplementary Information

Download PDF

Version 1

posted

You are reading this older protocol version

Read the latest protocol version →

Human and machine learning pipelines for responsible clinical prediction using high-dimensional data

Status:

Version 1

Abstract

Figures

Introduction

Equipment

Procedure

Troubleshooting

Time Taken

Anticipated Results

References

Acknowledgements

Supplementary Files

Status:

Version 1

Privacy Policy

Terms of Service

Human and machine learning pipelines for responsible clinical prediction using high-dimensional data

Status:

Version 1

Abstract

Figures

Introduction

Equipment

Procedure

Troubleshooting

Time Taken

Anticipated Results

References

Acknowledgements

Supplementary Files

Status:

Version 1

Privacy Policy

Terms of Service

Manage Cookie Preferences