## Problem Statement
The Centers for Disease Control and Prevention (CDC) highlights the increase in chronic diseases across the US as a major concern for the healthcare system. The CDC developed the Social Vulnerability Index (SVI) from US census data as a tool for public health officials to identify communities in need of support during disease outbreaks and hazardous events. Factors such as socioeconomic status, household characteristics, racial and ethnic minority status, and housing type and transportation may be strongly linked to the alarming increases in individual diseases. Any underlying relationship can point to factors that healthcare programs and social support should address among the most vulnerable segments of the population. Moreover, census data are a good source of information for exploring the relationships between SVI factors and chronic disease prevalence. However, we cannot assume that all participants underwent clinical testing for each disease score recorded in the data. Hence, in some tracts certain diseases might go undetected, while other tracts might benefit from better health support. Such a situation may create stress and inequalities in the healthcare system.

## Methodology
 1. [Regression Analysis](article:16)
 2. [Clustering Analysis](article:17)

## Data
Social vulnerability factors are used as input features in both analyses. For the regression analysis, we filter [svi_vulnerability_cdc_by_disease](dataset:svi_vulnerability_cdc_by_disease) by health measure and fit one model per measure, with the health measure prevalence (percentage of the community) as the target variable and the social vulnerability factor percentages as the predictors. For the clustering analysis, we use the social vulnerability factor percentile values, which express each tract's position relative to all other tracts when ranked by their social vulnerability percentages.
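
The two data preparations described above can be illustrated with a minimal Python sketch. It is not the project's actual recipe code: the column names (`measure`, `prevalence`, `pct_poverty`, `pct_no_vehicle`) are hypothetical placeholders, and the percentile formula `(rank - 1) / (n - 1)` is one common rank-based convention, used here only to show how a percentile relates to the underlying percentages. The dataset presumably already contains precomputed percentile columns.

```python
from collections import defaultdict


def split_by_measure(rows, measure_key="measure", target_key="prevalence",
                     feature_keys=("pct_poverty", "pct_no_vehicle")):
    """Group long-format rows into one (X, y) pair per health measure.

    All key names are hypothetical; substitute the real column names.
    """
    datasets = defaultdict(lambda: ([], []))
    for row in rows:
        X, y = datasets[row[measure_key]]
        X.append([row[k] for k in feature_keys])  # predictors: SVI percentages
        y.append(row[target_key])                 # target: prevalence (%)
    return dict(datasets)


def percentile_ranks(values):
    """Rank-based percentile in [0, 1]; tied values share the lowest rank."""
    n = len(values)
    if n < 2:
        return [0.0] * n
    first_rank = {}
    for i, v in enumerate(sorted(values)):
        first_rank.setdefault(v, i)  # zero-based rank of first occurrence
    return [first_rank[v] / (n - 1) for v in values]


# Toy rows, invented purely for illustration.
rows = [
    {"measure": "Diabetes", "prevalence": 11.2, "pct_poverty": 20.0, "pct_no_vehicle": 5.0},
    {"measure": "Asthma", "prevalence": 9.1, "pct_poverty": 20.0, "pct_no_vehicle": 5.0},
    {"measure": "Diabetes", "prevalence": 8.4, "pct_poverty": 12.0, "pct_no_vehicle": 3.0},
]
per_measure = split_by_measure(rows)
poverty_percentiles = percentile_ranks([20.0, 12.0, 16.0])
```

Each entry of `per_measure` can then be passed to a separate regression model, while the percentile lists serve as clustering inputs that are comparable across factors.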

## Data Preprocessing
A set of data transformations is applied with Dataiku recipes, including pivot, join, and group, to process the data in both long and wide formats for data analysis and modeling. Data cleaning techniques handle extreme and missing values.
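
As a rough illustration of what a pivot from long to wide format does, the following plain-Python sketch reshapes one row per (tract, health measure) into one row per tract. It does not reproduce the Dataiku recipes themselves. The key names (`tract_id`, `measure`, `prevalence`) are hypothetical, and marking missing combinations as `None` is an assumption, since the specific cleaning rules are not described here.

```python
def pivot_long_to_wide(rows, index_key="tract_id", column_key="measure",
                       value_key="prevalence"):
    """Reshape long-format rows into one wide record per index value.

    Key names are hypothetical. Missing (index, column) pairs stay None
    so that a later cleaning step can decide how to handle them.
    """
    columns = sorted({row[column_key] for row in rows})
    wide = {}
    for row in rows:
        record = wide.setdefault(row[index_key], {c: None for c in columns})
        record[row[column_key]] = row[value_key]
    return [{index_key: idx, **record} for idx, record in wide.items()]


# Toy long-format rows, invented purely for illustration.
long_rows = [
    {"tract_id": "A", "measure": "Diabetes", "prevalence": 11.2},
    {"tract_id": "A", "measure": "Asthma", "prevalence": 9.1},
    {"tract_id": "B", "measure": "Diabetes", "prevalence": 8.4},
]
wide_rows = pivot_long_to_wide(long_rows)
```

The long format suits fitting one model per health measure, whereas the wide format puts all measures for a tract side by side.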




