**1. Choose the study design.**

Either prospective or retrospective cohort paradigm was recommended to prevent temporal bias for prognostic prediction, and case-control should be avoided.21 The latter study design may introduce collider-stratification or selection bias.13 For diagnostic prediction, a cross-sectional design was recommended by PROBAST.17 These guidelines also warned utilizing randomized-controlled trial for prediction study. Study design should reflect similar situation in population level according to whether we develop either prognostic or diagnostic prediction. In our example, a retrospective design was applied to select subjects from database.

**2. Define target population by selection criteria.**

Selection criteria of subjects with any outcome should be defined. These are not defined for each of outcome, i.e. case and control group; otherwise, we implicitly apply case-control design. In our example, all 12~55-year-old females were included, particularly whose visits were recorded within the dataset period. This represented population of visitors of the healthcare providers, since we intended to use our models for this population. Each period of pregnancy of the same subject should be treated as different subjects. The medical history before or during the first pregnancy was retained for the second instance. All visits after delivery should be excluded within the dataset period. Codes of diagnosis and procedure for determining delivery or immediately after delivery care should be determined and listed in supplementary information.

**3. Determine the type of prediction task.**

Machine learning prediction tasks may be either classifying a categorical outcome or estimating a numerical outcome, and either prognostic or diagnostic prediction. The estimation task was also commonly known as regression task. We avoid this word to prevent confusing it with regression algorithms which can also serve for both classification and estimation tasks. In our example, we determined our tasks were prognostic classification of prelabor rupture of membranes (PROM) and prognostic estimation of the time of delivery.

**4. Determine outcome definition.**

**4. a. Define outcome for the classification task.**

In the context of electronic health records, as shown in our example, we assigned one or more codes of diagnosis and/or procedure for an outcome based on International Classification of Disease version 10 (ICD-10) coding system. A subject was otherwise assigned as a nonevent. In the context of pregnancy, we may utilize the same codes for determining the end of the pregnancy period.

**4. b. Define outcome for the estimation task.**

This would be an infinite set of numbers. In the context of pregnancy, as shown in our example, we did not have information about gestational age. But, we may infer the number of days from a visit, at which the time of prediction, to the last visit at which the time of true outcome. We used the same codes for the classification task to determine such visits encountering those of events and nonevents.

**5. Preserve censoring information in the dataset.**

Censoring outcome should not be excluded at first. For example, we may not know whether the subjects would be pregnant or not, and those who pregnant would be delivered or not up to the end of the study period. Instead of removing instances with censored outcomes, we assigned censoring labels. These were taken into account for causal inferences, and weighting of uncensored outcomes over both censored and uncensored outcomes when training the models later. Indeed, none of the censored outcomes would be included for developing prediction models. This would preserve similar distributions of any outcomes with those in the target population and resolve the class imbalance problem by inverse probability weighting.

**6. Set priority based on assessment of practical costs of under- and over-prediction.**

To evaluate a prediction model, practical costs of prediction errors should be considered.16 We need to relate potential consequences of under- and over-prediction. Then, we need to choose which one would be likely more frequent or larger magnitude. A priority on dealing with under-prediction would need well calibration and higher true positive rate, while that of over-prediction would also need well calibration but higher specificity. For estimation task, we need to set limit of maximum error which is considerably safe to use a prediction model. In our example, we set priority on dealing with under-prognosis and limit of error within 2 to 4 weeks of the true time of delivery.

**7. Determine candidate predictors.**

We need to collect data for variables classified into two groups. The first group were data for baseline variables, i.e. demographics. Meanwhile, the other variables were candidate predictors. We avoid using baseline variables for candidate predictors, particularly those which need to use private data and reflect social and economic background. However, we need to understand the characteristics of our population based on the baseline variables. Using these variables, we may find whether future, unobserved data have similar characteristics, since this would describe how likely the predictive performance of our models for those new instances. Baseline variables were indeed still included for causal inferences. We also avoided maternal age, although this variable is often a strong predictor. The reason why we avoid the variable was that machine learning models often memorize age non-linearly. In turn, a large weight is often assigned to maternal age followed by weight shrinking of other predictors. All of these demographic variables were respectively assigned either 0 or 1 for no or yes in each category of each variable. Meanwhile, we extracted ICD-10 codes for diagnoses or procedures from medical histories.

**8. Define and verify causal factors as parts of candidate predictors.**

We made gmethods 0.1.0, an R package, that allows future investigators to conduct statistical tests for causal inference. Please kindly follow the protocol.9 After verifying causal factors, we only included those in a prediction model that applied a logistic regression with a shrinkage method, as recommended by PROBAST, instead of using a stepwise selection method.17 We chose an RR, which applies L2-norm or beta regularization, because this method retains all causal factors within the model after weights are updated by training.22 We understood that this model would not necessarily be the best model, because of the use of causal factors. Predictive modeling normally exploits confounders to achieve better performances, while causal models cannot explain all variations among individuals, to which confounding factors contribute.13 However, by comparing a predictive model to one that uses only causal factors, we could imagine how much confounder effects were exploited to improve the predictive performance by machine learning algorithms. This, in turn, can warn a human user of machine learning algorithms about conducting a critical appraisal of internal properties of a machine learning model. We followed the same procedures for hyperparameter tuning and parameter fitting (training) as those for machine learning, as described in the following sections. Nonetheless, we viewed tuning and training of this prediction model as already a part of machine learning since fewer interventions are required by human users.

**9. Remove candidate predictors with perfect separation problems.**

We identified candidate predictors that took a single value or had zero variance. We removed all candidate predictors which were positive (value of 1) in only one of the outcomes in a training set. This is a perfect separation problem.16 A predictor may be exclusive for one of the outcomes by chance due to sampling error. To prevent such a bias, we removed perfect-separation candidate predictors. Details of the candidate predictors and selection should be described in supplementary information.

**10. Remove candidate predictors that may cause outcome leakage.**

We need to identify these kinds of candidate predictors. In the context of electronic health records, these would be any variable indicated by codes of diagnosis and procedure, which are explicitly or implicitly following the outcome definition. In our example, these referred to maternal or baby diagnosis/procedure codes that typically occur only during the delivery or post-delivery period. Otherwise, these codes would unexpectedly leak outcome information. The excluded codes should be listed and reported in supplementary information.

**11. Filter irredundant candidate predictors.**

To conduct this step, we need to compute pair-wise Pearson correlation coefficients. A pair of candidate predictors which has a high correlation (i.e., >~0.70) should be removed. This may be done by using one of the paired of candidate predictors, or unifying both under a single definition. In our example, we had pairs of candidate predictors that were highly correlated, but, the coefficients were borderline (i.e. ~0.7), while those were causal factors and some codes defining the factors themselves. Yet, we interpreted the meaning was not considerably the same; thus, we passed those candidate predictors.

**12. Construct provider-wise datasets for model development and validation.**

We used medical histories and causal factors as candidate features/predictors. Instead of using nationwide medical histories, we need to extract the provider-wise ones by estimation. We only considered medical history of a subject recorded in a single healthcare provider and treated that of the other providers as separated medical histories, like in real-world setting.

**13. Quantify medical histories with the Kaplan-Meier (KM) estimator.**

Electronic health records across healthcare provider are unlikely connected. We need to have nationwide historical (KM) rates for each code, derived from the training set only. All medical histories in days were transformed into historical rates. This technique allowed generalization of individual data based on nationwide, population-level data, without the need to access data from other providers. We made medhist 0.1.0, an R package, that allows future investigators to implement this historical rate. Please kindly follow the protocol.10

**14. Conduct dimensional reduction by resampling for candidate features of the machine-learning prediction models.**

We made rsdr 0.1.0, an R package, that allows future investigators to apply resampled dimensional reduction. Please kindly follow the protocol.11 Resampling is important to prevent overfitting in machine learning.8 But, this is commonly applied in supervised instead of unsupervised machine learning; thus, this procedure is expected to deal with this problem.

**15. Consider the number of events per variable when choosing the number of candidate predictors for each model.**

The minimum number is 20 events per variable (EPV), which was recommended for logistic regression, as recommended by the prediction model risk of bias assessment tools (PROBAST) guidelines.17 Dimensional reduction can be conducted to get latent variables fewer than the original candidate predictors. Furthermore, pre-selection may be applied by a regression model, which needs the least EPV. Then, we picked fewer latent variables with the highest absolute, non-zero weights as candidate predictors for the models which have higher EPV requirement, e.g. 200 EPV for random forest and gradient boosting machine.23

**16. Choose modeling approaches to be compared.**

Five modeling approaches were applied for supervised machine learning, consisting of a set of procedures covering feature representation, feature selection, hyperparameter tuning, and training strategies. For all models, we applied a grid search method for hyperparameter tuning of a minimum of 10 alternatives for each modeling approach. The best hyperparameters or tuning parameters were used for training the model or fitting the parameters.

**17. Compute outcome weights to overcome class imbalance problem.**

For classification task, the outcome (*Y*) was weighted by half of the inverse probability/prevalence (*w**i*), including the censored outcome (∅). For example, if the prevalence of *Y = 1* is 0.2, then *w**i* is 1 ÷ 0.2 × 0.5 for the outcome of *Y = 1*.The sum of the three probabilities is equal to 1. The weight formula is shown as Figure 1. Weights were plugged into a general equation of the loss function in this study (Figure 2), in which training was generally conducted to estimate parameter *θ**j* in a model *f(x**ij**,θ**j**)* that minimizes *L*, where *n* is the number of visits, *p* is the number of candidate predictors, *x**ij* is the value for each *i**th* instance and *j**th* candidate predictor, and *α* and *λ* are regularization factors. Meanwhile, no weighting was applied in the estimation task (*w = 1* for any *i*). In addition, a specific training strategy was applied for the last modeling approach, which was the DI-VNN, a model using the pipeline we developed in a previous protocol.12

**18. Determine hyperparameters for the human-learning prediction model.**

The first model was the causal RR. We used all causal factors except ones that included demographics. This model applied a filter method for feature selection by verifying assumptions based on domain knowledge with a statistical test for the causal inference, as described in the previous section. The RR was applied as the parameter-fitting algorithm. The tuning parameter was *λ = {10**-9**,10**-8**,⋯,100}* while keeping *α = 0*. These

values were plugged into the loss function (Figure 2).

**19. Determine hyperparameters for the machine-learning prediction model using regression algorithm.**

The second model was PC-ENR (elastic net regression). We used all PCs for the second model since the EPV was already >20 using the training set. This number, instead of 10, is recommended when applying a regression algorithm.17 This model also applied a shrinkage method for feature selection. Tuning parameters were combinations of *α = [0,0.25,⋯,1]* and *λ = [10**-9**+g(10**0**-10**-9**)/G,⋯,10**0**] *for *g = [1,2,⋯,G] *and *G = 5*; thus, the best tuning parameters were searched over *5 × 5* alternatives. These values were plugged into the loss

function (Figure 2), where *α = 0* means this model becomes an RR but using PCs instead of the original candidate predictors. But if *α > 0*, some candidate predictors *x**(j)* might be zero after being multiplied by *θ**j* which minimizes the training loss. Each time, this was applied to different *α* and *λ* values to find a couple of values that minimized the validation loss.

**20. Conduct further pre-selection of candidate features for machine-learning prediction with larger number of events per variable.**

The 3rd and 4th models respectively applied the PC-RF (random forest) and PC-GBM (gradient boosting machine). These models also applied the wrapper method for feature selection. This means that candidate predictors were pre-selected by a model before being used for these models. The wrapper model was the PC-ENR. We expected a smaller number of PCs to be pre-selected; thus, the EPV was ≥200 using the training set. This is allowed by either using the wrapper method only or subsequently applying a filter method. The filter method was conducted by ranking the PCs in descending order based on the corresponding *θ**j*. We selected *l* of PCs where *l* is a number such that the EPV is 200. Unlike regression algorithms, modern machine algorithms are data hungry, of which the tested algorithms were the classification and regression tree (CART), RF, support vector machine, and shallow neural network.23

**21. Determine hyperparameters for the machine-learning prediction model using ensemble algorithms.**

RF and GBM are ensemble algorithms. This means that both use prediction results of multiple models that apply the same algorithm, i.e., CART. However, the RF ensembles models in parallel, while the GBM sequentially ensembles the models.24,25 Parallel ensemble means predictions are respectively the majority and average of CART predictions for classification and estimation tasks. Meanwhile, sequential ensemble means a simple CART is made to predict the classification or estimation error of an earlier CART model. Tuning parameters for the RF were combinations from a number of random candidate predictors used to build a CART each time *[5+g(45-5)/G,⋯,45]* and a number of minimum unique instances per node for *G = 5*, while 500 CARTs were built for each of the candidate models. For the GBM, we set tuning parameters as combinations from a number of CARTs *[100,200,⋯,2500]* and a shrinkage factor or L2-norm regularizer *[0.0005+g(0.05-0.0005)/G,⋯,0.05]* for *G = 25*, while the minimum sample number per node (tree) was 20 and only 1 random predictor was used to build a CART each time. Since exhaustively comparing all possible combinations is time consuming, if not impossible, all numbers for both ensemble models were determined based on common practices for simplicity. However, we considered the sample size, the number of candidate predictors, and the diversity of approaches between the two algorithms to heuristically optimize the hypothesis search. The prediction results of these models were plugged into the loss function (Figure 2) with *α = 0* and *λ = 0* to compute the errors that were minimized by training.

******22. Develop and validate DI-VNN model.**

This model works like neurons in the brain. Predictors are fed as inputs into neurons, and the outputs follow an all-or-none activation function. These inputs are arrayed as a minimum of two-dimensional imaging such as an object projected onto retina. We called this an ontology array. A predictor that is correlated to another predictor would have a closer position than another less-correlated predictor. From the receptive field on the retina, the signals propagate to the primary visual cortex and then the visual association cortex. The signal pathway of the neural network is dedicated for specific parts of the array. We called this an ontology network. By seeing similar objects with uncertain variations, the neural network is maintained at specific activation thresholds and weights. This creates visual memory in the association cortex to recognize a particular object by segmenting it into several parts. We made divnn 0.1.3, an R package and Python library, that allows future investigators to develop and validate this model. Please kindly follow the protocol.12

**23. Evaluate all prediction models.**

**23. a. Calculate 95% confidence interval from all resampling subsets.**

We needed to evaluate our models in terms of both classification and estimation tasks. The 95% confidence interval (CI) was used to express all evaluation metrics as estimates. That interval was calculated from the metrics of multiple resampling subsets for model validation.

**23. b. Assess whether a model is well-calibrated for classification task.**

Calibration measures were evaluated by fitting the predicted probabilities to true probabilities using a univariable linear regression; thus, the calibration measures were the intercept and slope of this linear regression. We also demonstrated a calibration plot. A model is well-calibrated if the interval of the calibration intercept and slope respectively fall closer onto 0 and 1, and the calibration plot approximately hugs the reference line.

**23. c. Compare discriminative ability among well-calibrated models for classification task.**

Meanwhile, the discriminative ability of a model was quantified by an AUROC interval estimate. A non-overlapping, higher interval of the AUROC determined the best model among the most calibrated ones.

**23. d. Computer prediction error for estimation task.**

For estimation, we applied the root mean squared error (RMSE) to train the models. However, since a longer interval is reasonably more difficult to predict, a single evaluation by RMSE might be insufficient to evaluate the estimation performance. In our example, we also evaluated the estimation plot between the predicted and true times of delivery, binned per week.

**23. e. Consider to differentiate prediction error of estimation based on predicted classification.**

Different lines in the estimation plot were provided for positive and negative predictions by the best classification model. This would be reasonable for clinical applications by giving an interval estimate of days for the true time of delivery conditional on the predicted one. In our example, we also limited this evaluation to an estimated time of 42 weeks or less, which is the maximum duration of pregnancy.

**23. f. Compare predictive performance for estimation task considering clinical relevance.**

To determine the best estimation model, we computed a proportion of weeks in which each predicted time, in weeks, was included within an interval estimate of the true one. The interval had to be the maximum ± *x* weeks when predicting > *x* weeks. In our example, if a model predicted that a woman would deliver in 6 weeks, then the interval estimate of the true time of delivery should be the maximum ± 6 weeks. For any women predicted to deliver in 6 weeks, this number should fall into that interval.

**23. g. Limit minimum and maximum values that can be estimated precisely.**

In our example,** f**or the best estimation model, we determined the minimum and maximum predicted times of delivery with acceptable precision for each predicted outcome of PROM based on a visual assessment using internal validation. We also computed the RMSE within this acceptable range.

**24. Assess fulfillment of success criteria of the model development.**

Success criteria of the modeling should be at least by a threshold-agnostic evaluation metric which is better than those of recent models of the similar outcome using similar predictors. Consider to add a criterion or more to prevent overfitting, as recommended by PROBAST guidelines.17 If there is no previous studies to be compared with, AUROC 0.8 or more can be considered for success criterion.18

**25. Compare to models from previous studies.**

**25. a. Determine the context of model comparison across studies.**

Model comparison should be within the context of its clinical application. This is represented in the eligibility criteria to select studies to be compared with.

**25. b. Set up comparison guidelines.**

Evaluation of success criteria needs comparator models. To find these models, one can apply PRISMA 2020 guidelines; thus, the models are comparable and the comparison is also fair. Since a systematic review and meta-analysis were not the main purposes of a study in our example, we only applied all items in section methods in the guidelines, except item numbers 11 and 14 regarding the risk of bias assessment and item numbers 13 and 15 regarding synthesis method and certainty. This was because we applied the guidelines only to facilitate fair comparisons to previous studies, and did not consider conclusions as to how valid and accurate PROM could be predicted.

**25. c. Define study selection criteria.**

The eligibility criteria should follow the PICOTS framework: (1) population, targeted subjects that may or may not have the outcome; (2) index, a prediction model of interest; (3) comparator, previous prediction models to be compared with; (4) outcome, definition and data scale for the outcome with or without additional criterion on the sample size (e.g. events per variable); (5) time, prognostic or diagnostic prediction with specified time interval; and (6) setting, healthcare setting such as primary care, hospital, inpatient, outpatient, *et cetera*. In our example, only original articles and conference abstracts full papers were included. No grouping of the studies was needed because we would not conduct a meta-analysis for data synthesis.

**25. d. Develop the search strategy.**

A search date interval and literature databases should be defined. We also need to describe the keywords. Some configuration for searching studies should also be described, e.g., limiting publication date and language.

**25. e. Develop the review strategy.**

We need to describe which author does what step of review. If there is a conflicting decision among reviewers, then how to achieve final decision should be described.

**25. f. Choose the extracted data.**

In our example, we extracted evaluation metrics of the best model with the most similar outcome definition from each study to that of our study. Any data that are needed to briefly assess potential risk of bias were also extracted. These included the study design, population, setting, outcome definition, sample size, details on events and nonevents, number of candidate predictors, EPV, predictors in the best final model, and the most recommended validation techniques (external over internal validation; bootstrapping over cross-validation, or cross-validation over test split). If no AUROC was reported by a previous study, but there were sensitivity and specificity, then we computed AUROC from both of the metrics using trapezoidal rule (Figure 3).

**25. g. Define model evaluation method.**

Plots of evaluation should be overlaid those of our models. In our example, we plotted a point at a sensitivity and a specificity for each of our models at the optimum threshold on each ROC curve. AUROCs were also compared with the models in our example (with or without the 95% CI).

**26. Calibrate the predicted probability of any model for classification task.**

A model can be calibrated by training an additional model to estimate the true probability of an outcome for classification task given the predicted probability by the model. In our example, we applied a general additive model using locally weighted scatterplot smoothing (GAM-LOESS). Instead of using pre-calibrated models, the final comparison was made for models calibrated using predicted probabilities from each of the developed models. This calibration was intended to obtain more-linear predicted probabilities. No calibration was conducted for the estimation models.

**27. Validate the predictive performances using several data partition methods.**

**27. a. Conduct data partition for non-randomized, external validation.**

We split the dataset into two subsets which were intended for internal and external validation. For all external validation sets, evaluation metrics were resampled by bootstrapping 30 times to obtain interval estimates. As recommended, we split out a dataset for external validation by geographical, temporal, and geotemporal splitting. Geographical splitting was conducted by stratified random sampling of cities with as many as

proportion of cities in each state. Temporal splitting was also conducted by stratified random sampling of days with as many as *p* proportion of days in each subtropical season. Visits of subjects from either the sampled cities or days were respectively excluded as either a geographical or temporal subset. Then, we further excluded as many as *p* proportion of cities in the geographical subset or that number of days in the temporal subset. Visits of subjects from both of the excluded subsets were overlapped as a new geotemporal subset. Either a geographical or geotemporal subset did not include visits or subjects in this geotemporal subset. We randomly tried different *p* proportions for geographical, temporal, and geotemporal splitting such that approximately ~20% of visits belonged to any of the three subsets. These external validation sets were used for stress tests of our models since the distributions of predictors and outcome were conceivably uncertain among excluded cities, days, or overlaps. This reflects the situation in some real-world settings, but this might not reflect common situations nationwide.

**27. b. Conduct data partition for randomized, external validation.**

To estimate the predictive performance nationwide, ~20% of the remaining set was randomly split out thus leaving ~64% of the original sample size for the training set. This random subset was more representative for external validation to estimate the predictive performances of our models nationwide. After splitting out this random subset, the remaining samples were intended for internal validation.

**27. c. Conduct data partition for calibration.**

We split out ~20% of the internal validation set to calibrate each model. The final predictive performance of internal validation came from this calibration subset. We resampled it by bootstrapping 30 times to compute interval estimates. To compute the EPV, we only used ~80%, which was, the pre-calibrated subset. This was used to train the models.

**27. d. Choose resampling techniques for any validation.** For hyperparameter tuning, we applied 5-fold cross-validation, while the final training was conducted using the best tuned parameters by bootstrapping 30 times. Both tuning and training used the same pre-calibrated set by applying cross-validation for all models except the DI-VNN. Please kindly follow the protocol of DI-VNN.12

**28. Consider to assess the best time window for the best model of prognostic prediction.**

The best time window should be evaluated using the best model for prognostic prediction. This provided additional insights when interpreting the best model. In our example, using the internal validation set, we grouped samples by binning the days to the end of pregnancy every 4 weeks. An interval estimate of the AUROCs was computed for each bin from all resampled subsets. By observing the plot, the best time window should mostly cover interval estimates that were greater than an AUROC of 0.5, which represents prediction by simple guessing.

**29. Deploy the best model for each type of tasks.**

**29. a. Determine input for the deployed models.**

The best model can be deployed for both classification and estimation tasks. A model can be quite complex to apply without a software/application; thus, a web application need to be provided for using the best models. This can be made using R Shiny. In our example,** **a user needs to upload a deidentified, two-column comma-separated value (CSV) file of admission dates and ICD-10 codes of the medical history of a subject. A use case should be described with an example dataset for immediate demonstration of how to use the web application.

**29. b. Design user interface.**

For** **responsible clinical prediction,** **a** **threshold is expected to be flexible,** **set by a user. For the classification performance, a threshold should be able to be chose under expected population-level performances of evaluation metrics of interest. For the estimation performance, a true value should also be estimated based on a subpopulation with the same estimated value. This may also be conditioned on the same classification result. All population- and subpopulation-level metrics are expressed as interval estimates. In our example, the population data for the classification task were those of the calibration split, while those for the estimation task were data of the internal validation set which consisted of both pre-calibrated and calibrated subsets. In addition, for estimation task, a timeline should also be shown to visualize each predicted value with the true interval estimate at the subpopulation level. The timeline should show the time of prediction and those code entries that were used as features; thus, a medical history of a subject is visualized using this timeline. All results can be saved online and/or downloaded as a report. An example of the report should be shown. Inference durations 10 times should also be measured for a use case and reported as interval estimates.