This study will follow a retrospective, observational, patient-level prediction design (https://ohdsi.github.io/TheBookOfOhdsi/PatientLevelPrediction.html). We defined the 'patient-level prediction' as a modeling process wherein an outcome is predicted within a time at risk relative to the target cohort start and/or end date. Prediction will be performed using a set of covariates derived using data prior to the start of the target cohort.
Figure 2 (from (https://ohdsi.github.io/TheBookOfOhdsi/PatientLevelPrediction.html) illustrates the prediction problem we will address. Among a population at risk, we aim to predict which patients at a defined moment in time (t = 0) will experience some outcome during a time-at-risk. Prediction is done using only information about the patients in an observation window prior to that moment in time.
We follow the PROGRESS best practice recommendations for model development and the TRIPOD guidance for transparent reporting of the model results. (8, 9). For all data sources, we refer to the appendices.
In all models we estimated the risk after 1, 2, and 5 years after diagnosis. Our population setting comprises patients with a time-at-risk window between 0 and 365 days, 0 and 730 days, and 0 and 1826 days. In all settings, the minimum lookback period applied to the target cohort is 365 days, without removing patients without time at risk or removing patients with an outcome prior to diagnosis. We included only the first exposure per patient.
Statistical Analysis Method(s)
Lasso logistic regression belongs to the family of generalized linear models, where a linear combination of the variables is learned and finally a logistic function maps the linear combination to a value between 0 and 1. The lasso regularization adds a cost based on model complexity to the objective function when training the model (10). This cost is the sum of the absolute values of the linear combination of the coefficients. The model automatically performs feature selection by minimizing this cost. We use the Cyclic coordinate descent for logistic, Poisson and survival analysis (Cyclops) package to perform large-scale regularized logistic regression: https://github.com/OHDSI/Cyclops
Model evaluation will be based on the calibration plot and the discrimination of the internal and external validation.
The PatientLevelPrediction package itself, as well as other OHDSI packages on which PatientLevelPrediction depends, use unit tests for validation.
This study will be designed using OHDSI tools and run with R.
Reviewing the incidence rates of the outcomes in the target population prior to performing the analysis will allow us to assess its feasibility. The full Shiny app can be observed here: PIONEER watchful waiting.
Data Analysis Plan
For the time at risk analyses we use lasso regression, we use a fixed set seed and a starting lambda value of 0.01.
A covariate included in the model needs to contain at least 0.001 times. In all models we specified medium term as 180 days and long term as 365 days.
In the second model, we included the predictors age and all concept based clinical measurements 6 months before diagnosis defined in the OMOP Common data model.
In the final model, we extended the second model by including all concept based clinical condition one year before diagnosis and the Charlson comorbidity index.
Clinical model development
A total of six clinicians made a selection of their top 10 covariates. Consensus was reached after discussion. The included variables are:
- Grade group (levels: 1, 2, 3, 4, 5)
- PSA (levels: <10, 10-20, >20)
- Total cardiovascular disease event
- Age group (per 5 year)
- Charlson comorbidity index (levels: 0, 1, ≥2)
- cT-stage (levels: T1, T2, ≥T3)
- Family history of PCa
- Stroke (1 year before diagnosis)
- Type 2 diabetes
- Metastatic disease extent
Model Development & Evaluation
In the model development we will split the data in a train set (75%) and a test set (25%) for internal validation. The optimal lambda for the lasso regression will be assessed by 3-fold cross validation on the train set. Discriminative ability between models will be assessed by the area under the receiver operating characteristic curve (AUC). The discrimination of the clinical model will be compared against the concept-based model.
Strengths & Limitations
A strength of the study is the inclusion of multiple data sources such as clinical data and claims data, all adapted with OMOP standards, allowing more generalized results. The analysis of big data may identify predictors that are currently not used in daily clinical practice. This provides a limitation but also a chance for the study. Newly identified significant predictors might not be included in clinical procedures, and therefore this study can be irrelevant for clinical questions. On the other hand, it may provide the chance to adapt current PCa treatment for the future.
A clear limitation of this study is, that in claims data the occurrence of death is not accurately presented and might be biased.
Protection of Human Subjects
Local analyses were run to take into account the sensitive nature of the data. Confidentiality of patient records will be maintained always. All study reports will contain aggregate data only and will not identify individual patients or physicians. At no time during the study will the sponsor receive patient identifying information except when it is required by regulations in case of reporting adverse events.
Tables & Figures
For the incidence rate and characterization, we refer to PIONEER watchful waiting.