The main purpose of this study is to develop disease prediction models that quickly and accurately turn laboratory data into a diagnosis. To this end, we developed machine learning, deep learning, and ensemble models for the classification of 39 diseases (Supplement Table S9) in patients visiting the emergency room, using 88 laboratory test parameters including blood and urine tests (Supplement Table S1). The overall workflow of the disease prediction model based on laboratory tests (DPMLT) is shown schematically in Figure 1. The protocol comprises five parts; the third part covers the machine learning and deep learning models.
1.0 Data collection and preprocessing
We collected anonymized laboratory test datasets, including blood and urine test results, along with each patient's final diagnosis at discharge. We curated the datasets and selected 86 attributes (individual laboratory tests) based on value counts, clinical importance, and missing values. For deep learning (DL), missing values were replaced with the median value of each feature within each disease, as sketched below.
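A minimal sketch of this per-disease median imputation, assuming the curated data live in a pandas DataFrame with a "diagnosis" column (the column names here are illustrative, not the study's actual schema):

```python
import pandas as pd

# Illustrative frame: lab test columns plus the discharge diagnosis.
df = pd.read_csv("lab_tests.csv")  # hypothetical file name
lab_cols = [c for c in df.columns if c not in ("diagnosis", "sex")]

# Replace each missing lab value with the median of that test
# computed within the patient's diagnosis group.
df[lab_cols] = df.groupby("diagnosis")[lab_cols].transform(
    lambda s: s.fillna(s.median())
)
```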
2.0 Feature extraction
Feature extraction plays a major role in the creation of machine learning (ML) models.
3.0 Model selection and training
3.1 DL selection
This study used a deep neural network (DNN) designed for structured (tabular) data.
3.2 MLP (multi-layer perceptron)
All features used in this study are numeric except for the 'sex' feature. Because the MLP accepts only numerical input, we converted the categorical 'sex' feature to a number with the LabelEncoder of the scikit-learn library. The MLP also does not allow null values, so we replaced null values with the median value of each feature; a sketch of both steps follows.
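A minimal sketch of the encoding and imputation, using a toy DataFrame whose columns stand in for the real data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the curated dataset (columns are illustrative).
df = pd.DataFrame({"sex": ["M", "F", "F"],
                   "glucose": [5.1, None, 6.0]})

# The MLP accepts only numeric input: encode 'sex' as integers.
df["sex"] = LabelEncoder().fit_transform(df["sex"])

# The MLP does not allow nulls: fill each feature with its median.
df = df.fillna(df.median(numeric_only=True))
```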
3.3 Feature normalization
Each feature had a different range. We applied standard scaling, normalizing each feature to zero mean and unit standard deviation by subtracting the feature's mean value and dividing by its standard deviation.
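scikit-learn's StandardScaler implements exactly this transform; a sketch with illustrative values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[90.0, 4.2], [110.0, 5.0], [100.0, 4.6]])

# Fit on the training set only, then reuse the fitted scaler on
# validation/test data to avoid information leakage.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # mean 0, std 1 per feature
```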
3.4 Hidden layer composition
In our study, the network comprised two hidden layers. We employed the ReLU (rectified linear unit) activation function in each layer and applied dropout to each hidden layer, a simple technique to prevent overfitting in neural networks.
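A minimal Keras sketch of such a network; the layer widths and dropout rate are illustrative assumptions, not the study's tuned values:

```python
from tensorflow.keras import layers, models

n_features, n_classes = 86, 39  # from the data description above

# Two hidden layers with ReLU and dropout; 128/64 units and a 0.3
# dropout rate are assumptions for illustration only.
model = models.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```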
3.5 XGBoost
XGBoost is an algorithm that overcomes the shortcomings of the gradient boosting machine (GBM), notably long training times and overfitting, chiefly through parallelization and regularization. Our dataset contained null values; whereas the MLP required median imputation, XGBoost has a built-in procedure for handling null values, so we utilized that procedure. The max_depth argument in XGBoost is one factor determining the depth of the decision tree; setting max_depth to a large number increases model complexity and can lead to overfitting. In this study, max_depth was optimally set to 2.
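A sketch with the xgboost scikit-learn wrapper; max_depth=2 follows the text, while the other hyperparameters are illustrative assumptions:

```python
from xgboost import XGBClassifier

# max_depth=2 as reported; n_estimators and learning_rate are
# illustrative defaults, not the study's tuned values.
clf = XGBClassifier(max_depth=2, n_estimators=300, learning_rate=0.1)

# XGBoost routes missing values (NaN) down a learned default branch,
# so no explicit imputation is required:
# clf.fit(X_train, y_train)
```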
3.6 LightGBM
LightGBM and XGBoost differ in how the tree grows. XGBoost deepens the tree one level at a time (level-wise/depth-wise), whereas LightGBM grows it leaf-wise: it repeatedly splits the leaf node with the maximum loss reduction, creating an asymmetric tree. To avoid overfitting in LightGBM, an experiment was conducted by adjusting num_leaves and min_child_samples.
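A sketch of the corresponding LightGBM setup; the num_leaves and min_child_samples values below are starting points for such an experiment, not the study's final settings:

```python
from lightgbm import LGBMClassifier

# num_leaves caps the number of leaves per tree (controls leaf-wise
# growth); min_child_samples sets the minimum records per leaf.
clf = LGBMClassifier(num_leaves=31,
                     min_child_samples=20)
# clf.fit(X_train, y_train)
```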
3.7 Ensemble model results (DNN, ML)
We developed a new ensemble model that combines our DL model with our two ML models to improve predictive performance, again using the validation loss for model optimization.
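The protocol does not spell out the combination rule here; one common choice, shown purely as an assumption, is soft voting, i.e. averaging the class probabilities of the three models:

```python
import numpy as np

def ensemble_predict(models, X):
    """Soft voting: average class probabilities, then take the argmax.

    Assumes each model exposes predict_proba returning an array of
    shape (n_samples, n_classes); a Keras model can be wrapped so its
    predict() output fills this role.
    """
    probs = [m.predict_proba(X) for m in models]
    return np.mean(probs, axis=0).argmax(axis=1)
```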
4.0 K-Fold Cross-validation
In our study, we split the 5145 records at a ratio of 8:2 into a training set and a test set. Within the training set, we reserved 20% as validation data and used the validation loss for model optimization. Increasing the amount of validation data decreases the amount of training data, leading to a problem of high bias; we therefore used k-fold cross-validation so that no training data were sacrificed.
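A sketch of the split and cross-validation with scikit-learn; k=5 and the stratified splitting are assumptions, since the protocol does not state them:

```python
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold

# Stand-in data matching the stated dimensions (5145 records,
# 86 features, 39 diagnosis classes).
X = np.random.rand(5145, 86)
y = np.arange(5145) % 39

# 8:2 train/test split, then k-fold cross-validation on the
# training portion only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X_tr, y_tr):
    # Fit on X_tr[train_idx]; monitor validation loss on X_tr[val_idx].
    pass
```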
5.0 SHAP (SHapley Additive exPlanations)
SHAP is an acronym for SHapley Additive exPlanations and, as the name suggests, is based on the Shapley value. For our MLP, SHAP values can be computed with a DeepLIFT-based explainer, as sketched below.
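A sketch using the shap library's DeepExplainer, which implements a SHAP variant of DeepLIFT for neural networks; the trained model and the background sample are assumed to come from the earlier steps:

```python
import shap

# 'model' is the trained Keras MLP; 'X_background' is a small reference
# sample of (scaled) training rows used to estimate expected values.
explainer = shap.DeepExplainer(model, X_background)

# Per-class feature attributions for the (scaled) test set.
shap_values = explainer.shap_values(X_test_std)
shap.summary_plot(shap_values, X_test_std)
```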