# Methodology
To better understand how study design is associated with patient enrollment rate, we build a random forest model that relates study protocol features and, optionally, social determinants of health (SDOH) factors to the enrollment rate. For a given study protocol, the model predicts the likelihood of a high, medium, or low enrollment rate. Predictions are made only for ongoing studies whose enrollment status is **Estimated**.

# Input Data
**Features**: Study protocols (52 factors) and SDOH (3 optional factors)
| Feature | Type | Values |Description|
|-----------|--------|----------|----------------|
|OrgClass|Categorical|INDUSTRY, NIH, FED, OTHER|The type of organization conducting the study|
|StudyType|Categorical|INTERVENTIONAL, OBSERVATION|The type of the study|
|Phase|Categorical|PHASE1, PHASE2, PHASE1/2, PHASE3, PHASE4, NA|The phase of the study|
|NumberArmGroup|Integer||The number of arm groups in the study|
|DesignAllocation|Categorical|RANDOMIZED, NON_RANDOMIZED, NA|The allocation type used in the study|
|DesignModel|Categorical|PARALLEL, SINGLE_GROUP, SEQUENTIAL, OTHER, COHORT, CASE_ONLY, CASE_CONTROL...|The design model used by the study|
|DesignPurposeTime|Categorical|TREATMENT, SUPPORTIVE_CARE, DIAGNOSTIC, PREVENTION, OTHER, RETROSPECTIVE, PROSPECTIVE...|The design primary purpose or time perspective of the study|
|DesignMasking|Categorical|SINGLE, DOUBLE, TRIPLE, QUADRUPLE, NONE|Masking of the study|
|HealthyVolunteers|Boolean||Eligibility criterion: whether healthy volunteers are accepted|
|Sex|Categorical|ALL, FEMALE, MALE|Eligibility criterion: sex|
|MinimumAge|Numerical||Eligibility criterion: minimum age|
|MaximumAge|Numerical||Eligibility criterion: maximum age|
|CHILD|Boolean||Eligibility criterion: age group|
|ADULT|Boolean||Eligibility criterion: age group|
|OLDER_ADULT|Boolean||Eligibility criterion: age group|
|PI affiliation count|Integer||The number of affiliations involved in the study|
|PrimaryInvestigator count|Integer||The number of primary investigators involved in the study|
|Site country count|Integer||The number of countries in which the study's clinical sites are located|
|Site count|Integer||The number of clinical sites used by the study|
|BEHAVIORAL intervention|Boolean||Whether the study uses this type of intervention|
|BIOLOGICAL intervention|Boolean||Whether the study uses this type of intervention|
|COMBINATION_PRODUCT intervention|Boolean||Whether the study uses this type of intervention|
|DEVICE intervention|Boolean||Whether the study uses this type of intervention|
|DIAGNOSTIC_TEST intervention|Boolean||Whether the study uses this type of intervention|
|DIETARY_SUPPLEMENT intervention|Boolean||Whether the study uses this type of intervention|
|DRUG intervention|Boolean||Whether the study uses this type of intervention|
|GENETIC intervention|Boolean||Whether the study uses this type of intervention|
|OTHER intervention|Boolean||Whether the study uses this type of intervention|
|RADIATION intervention|Boolean||Whether the study uses this type of intervention|
|Bacterial_Infections_and Mycoses conditions|Boolean||Whether the study targets this disease domain|
|Cardiovascular conditions|Boolean||Whether the study targets this disease domain|
|Congenital_Hereditary and Neonatal conditions|Boolean||Whether the study targets this disease domain|
|Digestive_System conditions|Boolean||Whether the study targets this disease domain|
|Endocrine_System conditions|Boolean||Whether the study targets this disease domain|
|Female_Urogenital conditions|Boolean||Whether the study targets this disease domain|
|Hemic_Lymphatic conditions|Boolean||Whether the study targets this disease domain|
|Immune_System conditions|Boolean||Whether the study targets this disease domain|
|Male_Urogenital conditions|Boolean||Whether the study targets this disease domain|
|Mental_Disorders conditions|Boolean||Whether the study targets this disease domain|
|Musculoskeletal conditions|Boolean||Whether the study targets this disease domain|
|Neoplasms conditions|Boolean||Whether the study targets this disease domain|
|Nervous_System conditions|Boolean||Whether the study targets this disease domain|
|Nutritional_Metabolic conditions|Boolean||Whether the study targets this disease domain|
|Signs_Symptoms conditions|Boolean||Whether the study targets this disease domain|
|Respiratory_Tract conditions|Boolean||Whether the study targets this disease domain|
|Skin_and_Connective_Tissue conditions|Boolean||Whether the study targets this disease domain|
|Virus_Diseases conditions|Boolean||Whether the study targets this disease domain|
|Wounds_and_Injuries conditions|Boolean||Whether the study targets this disease domain|
|PrimaryOutcome count|Integer||The number of primary outcomes in the study|
|SecondaryOutcome count|Integer||The number of secondary outcomes in the study|
|Social_Vulnerability_Index county weighted sum|Numerical|Between 0 and 1|The mean Social Vulnerability Index (SVI) score across counties, weighted by the population of each county|
|US_counties_population sum|Integer||The total population of the counties|
|has_nonus_sites|Boolean||Whether the study includes non-US sites|
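
The population-weighted SVI feature can be illustrated with a short sketch. It assumes the score is each county's SVI multiplied by its population, summed, and divided by the total population. The function name and the `(svi, population)` input shape are hypothetical and are not taken from the project code.

```python
from typing import Iterable, Optional, Tuple


def population_weighted_svi(
    counties: Iterable[Tuple[float, int]],
) -> Tuple[Optional[float], int]:
    """Return (weighted mean SVI, total population) for (svi, population) pairs.

    Assumption: the weighted mean is sum(svi * population) / sum(population).
    Returns None for the mean when the total population is zero.
    """
    rows = list(counties)  # materialize so the input can be traversed twice
    total_population = sum(population for _, population in rows)
    if total_population == 0:
        return None, 0
    weighted_sum = sum(svi * population for svi, population in rows)
    return weighted_sum / total_population, total_population


# Two hypothetical counties: the larger county pulls the mean towards its score.
mean_svi, total_pop = population_weighted_svi([(0.2, 1000), (0.8, 3000)])
```

Because SVI scores lie between 0 and 1, the weighted mean also stays within that range.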





**Target variable**: Enrollment rate class
 - The **enrollment rate** is the enrollment count divided by the study period in years (from the start date to the primary completion date). The **enrollment rate class** splits this rate into three buckets: **High** (more than 100 patients per year), **Medium** (25 to 100 patients per year), and **Low** (fewer than 25 patients per year).
 <div class="alert">
- Enrollment rate equals **NA** if the enrollment count or the enrollment period equals zero.
- Enrollment rate equals the **enrollment count** if the enrollment period is shorter than a year.
</div>
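
The rate and bucket rules above can be sketched as follows. This is a minimal illustration rather than the project's implementation: the function names are hypothetical, a year is assumed to be 365.25 days, and a negative period is assumed to be treated like a zero-length one.

```python
from datetime import date
from typing import Optional

DAYS_PER_YEAR = 365.25  # assumption: the source does not say how a year is measured


def enrollment_rate(
    count: int, start: date, primary_completion: date
) -> Optional[float]:
    """Patients per year, or None (NA) when the count or the period is zero."""
    period_days = (primary_completion - start).days
    # Assumption: negative periods are handled like zero-length periods.
    if count == 0 or period_days <= 0:
        return None
    years = period_days / DAYS_PER_YEAR
    if years < 1:
        # Periods shorter than a year: the rate is the enrollment count itself.
        return float(count)
    return count / years


def enrollment_rate_class(rate: Optional[float]) -> Optional[str]:
    """Map a rate to High (> 100), Medium (25 to 100), or Low (< 25)."""
    if rate is None:
        return None
    if rate > 100:
        return "High"
    if rate >= 25:
        return "Medium"
    return "Low"


# Hypothetical study: 300 patients enrolled over two years.
example_rate = enrollment_rate(300, date(2020, 1, 1), date(2022, 1, 1))
example_class = enrollment_rate_class(example_rate)
```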

# Model
We deploy a random forest classification model whose hyperparameters are optimized by grid search.
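
As an illustration only, a grid search over random forest hyperparameters might look like the following scikit-learn sketch. The library choice, the synthetic three-class data, and the grid values are all assumptions. The source does not state which hyperparameters or ranges were searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the study features and the three enrollment rate classes.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical grid: every combination is evaluated with cross-validation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# The refitted best model returns one probability per class for each record,
# which corresponds to the likelihood of a high, medium, or low class.
class_probabilities = search.predict_proba(X[:5])
```

After fitting, `search.best_params_` holds the selected combination and `search.best_score_` its mean cross-validated accuracy.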

# Explainability Metrics
 - _Absolute feature importance_: the average of the absolute Shapley values computed for each feature, where the Shapley value of a feature represents its average contribution, across all possible combinations of features, to explaining a particular prediction.
 - _Shapley (SHAP) value_: the average marginal contribution of a feature value across all possible coalitions of features. As a generalization, the explanation is the difference between the prediction value and the average of the prediction values obtained by replacing the feature value with values drawn from the test dataset ([see Dataiku documentation](https://doc.dataiku.com/dss/latest/machine-learning/supervised/explanations.html)).
![Screenshot 2023-10-05 at 11.05.57.png](SzDU59Zo8XGj)
Figure 1: The 20 most important features account for more than 90% of the total feature importance.
![Screenshot 2023-10-05 at 11.30.34.png](aRrfDnaT9az8)
Figure 2: Feature effects display multiple Shapley values computed per feature. The Shapley values (x-axis) indicate the relative impact of the feature value (color) on the record's prediction.
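
For intuition, the two metrics in this section can be computed exactly on a toy model. The sketch below follows the textbook definition by enumerating every coalition, so it is only practical for a handful of features, and it is not how Dataiku computes its explanations. The function names, the linear toy model, and the background rows are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, record, background):
    """Exact Shapley values of each feature for one record."""
    n = len(record)

    def value(coalition):
        # Average prediction when features outside the coalition are
        # replaced by the values found in each background row.
        total = 0.0
        for row in background:
            mixed = [record[i] if i in coalition else row[i] for i in range(n)]
            total += predict(mixed)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Shapley weight of a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                contribution += weight * (value(members | {i}) - value(members))
        phi.append(contribution)
    return phi


def absolute_feature_importance(predict, records, background):
    """Mean absolute Shapley value of each feature over all records."""
    per_record = [shapley_values(predict, r, background) for r in records]
    n = len(records[0])
    return [sum(abs(p[i]) for p in per_record) / len(per_record) for i in range(n)]


def toy_model(x):
    # Hypothetical model: the third feature has no effect on the prediction.
    return 2 * x[0] + x[1]


background_rows = [[0, 0, 0], [2, 2, 2]]
records = [[3, 1, 5], [-1, 3, 0]]
single_explanation = shapley_values(toy_model, records[0], background_rows)
importance = absolute_feature_importance(toy_model, records, background_rows)
```

For each record, the Shapley values sum to the difference between that record's prediction and the average prediction over the background rows.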