## Methodology
To better understand how community-level social factors are associated with health measures (including chronic diseases and other health behaviour data), we use ridge regression to examine the relationship between the tract-level Social Vulnerability Index (SVI) and each tract-level health measure indicator. The regression models identify tracts (and their counties and states) with high or low health measure prevalence, and model explainability measures show how social factors contribute to that prevalence.

## Input Data
Features: Social vulnerability factor percentages (16 factors)
Target Variable: Health measure prevalence percentage

## Model
We fit Ridge Regression linear models, which apply L2 regularization to the weights. A grid search is used to optimize the regularization term (alpha).
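The model setup described above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual configuration: the data is synthetic, and the alpha grid, the 5-fold cross-validation, and the R² scoring are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 500 tracts x 16 SVI factor percentages.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(500, 16))
true_coef = rng.normal(0, 0.1, size=16)
# Synthetic health measure prevalence (%), linear in the factors plus noise.
y = 20 + X @ true_coef + rng.normal(0, 1.0, size=500)

# Hypothetical alpha grid; the values searched in the project are not stated.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Grid search over alpha with 5-fold cross-validation (assumed settings).
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_model = search.best_estimator_  # Ridge refit on all data with best alpha
print("best alpha:", best_alpha)
print("coefficients:", np.round(best_model.coef_, 3))
```

The features are left unscaled here so that each coefficient stays in units of "percentage points of the health measure per percentage point of the factor".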

## Note
The purpose of this analysis is not to predict the health measure percentage, and it is not a causal analysis. Instead, we aim to understand how each social vulnerability factor relates to, and contributes to, health measure prevalence across all the tracts in the data.

## Explainability Metrics
 -  _Regression coefficients_ reflect how much the predicted health measure percentage is expected to change when the percentage of a social vulnerability variable changes by one percentage point while the other variables are held constant. A positive coefficient implies a direct relationship: when the independent variable increases (or decreases), the dependent variable also increases (or decreases). A negative coefficient implies an inverse relationship: when the independent variable increases, the dependent variable decreases, and vice versa. ([see scikit-learn documentation](https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html))
 
 -  _Shapley values (SHAP)_ in linear models are the average marginal contribution of a feature value across all possible coalitions. More generally, the explanation is the difference between the prediction and the average of the predictions obtained by replacing the feature value with values drawn from the test dataset. ([see Dataiku documentation](https://doc.dataiku.com/dss/latest/machine-learning/supervised/explanations.html))

In linear models, regression coefficients tell us how much the model's target prediction changes, across all our data, when we change one input feature and hold the rest constant. SHAP values additionally show the positive or negative relationship between each variable and the target at the record level, rather than summarized over all tracts. SHAP also increases the transparency of local interpretation, because each observation gets its own set of SHAP values. The absolute value of a SHAP value shows how much a single feature affected the prediction.
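For a linear model with features treated as independent, the SHAP value of feature j for tract i has a closed form: the coefficient times the feature's deviation from its average, `coef[j] * (X[i, j] - mean(X[:, j]))`. The sketch below illustrates that relationship with NumPy on toy numbers. The function name, coefficients, and data are hypothetical, and the sampling-based computation used by Dataiku may give different values.

```python
import numpy as np

def linear_shap_values(coef, X, background):
    """SHAP values for a linear model, assuming independent features.

    phi[i, j] = coef[j] * (X[i, j] - mean(background[:, j]))
    """
    coef = np.asarray(coef, dtype=float)
    X = np.asarray(X, dtype=float)
    baseline = np.asarray(background, dtype=float).mean(axis=0)
    return coef * (X - baseline)

# Toy example (hypothetical): 3 tracts, 2 factors expressed as percentages.
coef = np.array([0.30, -0.10])
intercept = 5.0
X = np.array([[10.0, 40.0],
              [30.0, 20.0],
              [20.0, 30.0]])

phi = linear_shap_values(coef, X, background=X)
predictions = intercept + X @ coef
base_value = predictions.mean()  # average prediction over the background data

# Additivity: base value + a tract's SHAP values sum to that tract's prediction.
assert np.allclose(base_value + phi.sum(axis=1), predictions)
print(phi)
```

Each row of `phi` is one tract's local explanation: the sign gives the direction of a factor's contribution for that tract, and the magnitude gives its size.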


Graph interpretation:
![shaplot.png](7xpF28C38Zje)

The x-axis shows whether each social vulnerability factor value pushes the prediction in a “positive” or “negative” direction. The colour bar shows the original percentage range of each social vulnerability factor, indicating how “higher” and “lower” values of the feature affect the result. In terms of correlation, a high percentage of "No High School Diploma" in an area has a high and positive impact on health measure prevalence: the “high” comes from the purple colour, and the “positive” impact is shown on the x-axis. Similarly, a high percentage of “Group Quarters” is negatively correlated with the health measure percentage across that region.


## Analysis
The Dataiku Machine Learning (ML) Lab provides an engine that can be fitted to publicly available (or private) social and health data to identify the impact of social determinants of health (SDOH) on individual health measures (including chronic diseases and other health behaviours) in specific areas. Several "white-box" models are easily interpretable through visual tools, such as the regression coefficients published in the project dashboard. Individual explanations exported from the ML Lab show individual tracts with high or low predicted health measure percentages, together with the contribution of each SVI factor. Moreover, by exporting the SHAP values of each factor for each tract, we list the top factor driving the health measure value predicted by the trained model. This can be used further to detect tracts that deviate from the rest of their county or a larger area.
We apply this method to the multiple health measures examined in this work, and users can follow a similar process for other health outcomes.
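As an illustration of the "top factor per tract" step, the sketch below picks, for each tract, the factor with the largest absolute SHAP value. The helper name and the SHAP numbers are hypothetical, and the layout of the actual export from the ML Lab is assumed to be one row per tract and one column per factor.

```python
import numpy as np

def top_factor_per_tract(shap_values, factor_names):
    """Return (factor name, SHAP value) of the strongest driver for each tract."""
    shap_values = np.asarray(shap_values, dtype=float)
    # Index of the largest absolute contribution in each row (tract).
    top_idx = np.abs(shap_values).argmax(axis=1)
    return [
        (factor_names[j], float(shap_values[i, j]))
        for i, j in enumerate(top_idx)
    ]

# Hypothetical SHAP values: 2 tracts x 2 factors.
factor_names = ["No High School Diploma", "Group Quarters"]
shap_values = np.array([[2.5, -0.4],
                        [0.3, -1.8]])

top = top_factor_per_tract(shap_values, factor_names)
for tract, (name, value) in enumerate(top):
    print(f"tract {tract}: {name} ({value:+.1f})")
```

Keeping the signed value alongside the factor name preserves whether the top driver raises or lowers the predicted prevalence for that tract.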