Initial datasets often contain hundreds or even thousands of features. The goal of this initial filtering is to significantly reduce that number by removing the obviously uninformative ones. To do so, statistical measures are computed that look for relationships between each feature and the target variable, or between the features themselves. These measures do not depend on the type of model that will be used; they only try to detect whether a feature carries information that helps predict the target.

# Chi Square Independence Test

The [Chi Square Test](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test) aims to check whether a feature and the target are independent. To do this, each feature is converted to a categorical variable: features that are already categorical are kept as is, and numeric features are split into quantiles. Tables are then built to compare the observed number of observations for each (category, target) pair with the expected number; the first table below displays the observed values, also called the contingency table.

|        | Observed Good             | Observed Bad         | Total    |
|------------|-------------------|-------------|-------------|
| A     | 4500           | 350       | 4850   |
| B    |  2400        | 280      | 2680   |
| C      | 1800      | 110   | 1910  |
| Total     | 8700        | 740 | 9440 |
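
As a rough illustration of how a numeric feature could be split into quantiles and cross-tabulated against the target, here is a minimal Python sketch using only the standard library. The function names `quantile_bins` and `contingency_table`, the number of bins, and the sample data are hypothetical and not taken from the project.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles


def quantile_bins(values, n_bins=4):
    """Assign each value a quantile bin index between 0 and n_bins - 1."""
    # quantiles() returns the n_bins - 1 cut points separating the bins.
    cuts = quantiles(values, n=n_bins, method="inclusive")
    return [bisect_right(cuts, v) for v in values]


def contingency_table(categories, targets):
    """Count observations for every (category, target) pair."""
    counts = Counter(zip(categories, targets))
    rows = sorted(set(categories))
    cols = sorted(set(targets))
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}


# Hypothetical numeric feature and target labels.
values = [1, 2, 3, 4, 5, 6, 7, 8]
targets = ["good", "good", "bad", "good", "good", "bad", "bad", "bad"]

bins = quantile_bins(values, n_bins=4)
table = contingency_table(bins, targets)
```

Each row of `table` then plays the role of one feature category (A, B, C in the example above).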

The expected values are computed as the number of observations in the given feature category times the proportion of goods or bads computed on the overall sample. So for the top left cell, it would be:

```math
{4850 * \frac{8700}{9440} = 4469.81}
```

|        | Expected Good             | Expected Bad         |
|------------|-------------------|-------------|
| A     | 4469.81           | 380.19       |
| B    |  2469.92        | 210.08      |
| C      | 1760.28      | 149.72   |
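
The expected counts above can be reproduced with a short Python sketch. The dictionary layout and the function name `expected_counts` are illustrative assumptions; only the observed counts come from the example.

```python
# Observed (goods, bads) per category, from the contingency table above.
observed = {"A": (4500, 350), "B": (2400, 280), "C": (1800, 110)}


def expected_counts(observed):
    """Expected count per cell = row total * column total / grand total."""
    col_totals = [sum(col) for col in zip(*observed.values())]
    grand_total = sum(col_totals)
    return {
        cat: tuple(sum(row) * col_total / grand_total for col_total in col_totals)
        for cat, row in observed.items()
    }


expected = expected_counts(observed)
print(round(expected["A"][0], 2))  # top left cell: 4469.81
```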

Then within each cell, the following computation is applied, illustrated for the top left cell:

```math
{\frac{(observed - expected)^2}{expected} = \frac{(4500 - 4469.81)^2}{4469.81} = 0.20}
```
The values of all cells are summed to obtain the chi-square statistic. The number of degrees of freedom is also computed as:

```math
{degrees\_of\_freedom = (nb\_rows - 1) * (nb\_cols - 1) = (3 - 1) * (2 - 1) = 2}
```

Then, using a chi-square table for the given number of degrees of freedom, we can determine the p-value and compare it with our predefined significance level:
- if **the p-value is less than the significance level**, the difference between observed and expected is statistically significant; therefore, the null hypothesis that the variables are independent is rejected. The feature and the target are likely dependent.
- if **the p-value is greater than the significance level**, the difference between observed and expected is not statistically significant; therefore, the null hypothesis of independence is not rejected. There is no evidence of a relationship between the variables.

In this project, we will filter out all features whose p-values are above a given threshold selected by the user in the Dataiku Application.
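
The full test on the example table can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names and the 0.05 threshold are hypothetical, and the p-value uses the closed form `exp(-x / 2)`, which is only valid for 2 degrees of freedom. For other table shapes, a library routine such as `scipy.stats.chi2_contingency` would be a more general option.

```python
import math

# Observed counts (goods, bads) for categories A, B, C.
observed = [[4500, 350], [2400, 280], [1800, 110]]


def chi_square(observed):
    """Return the chi-square statistic and the degrees of freedom."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / total
            stat += (obs - exp) ** 2 / exp
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


def p_value_two_dof(stat):
    """Chi-square survival function, valid ONLY for 2 degrees of freedom."""
    return math.exp(-stat / 2)


stat, dof = chi_square(observed)
p_value = p_value_two_dof(stat)  # dof is 2 for this 3 x 2 table

threshold = 0.05  # hypothetical user-selected threshold
keep_feature = p_value <= threshold  # features above the threshold are dropped
```

For this example the statistic is roughly 39.3, so the p-value is far below any usual threshold and the feature would be kept.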

# Information Value

Information value is another statistical measure of how features are related to a target. Like the chi-square test, it requires the features to be categorical, so numeric features are again split into quantiles. Then, within each category of a given feature, the weight of evidence (WOE) is computed as follows, using the natural logarithm:

```math
{WOE = \log(\frac{goods\_distribution}{bads\_distribution})}
```

with 

```math
{goods\_distribution = \frac{goods}{total\_goods}, bads\_distribution = \frac{bads}{total\_bads}}
```

Hence, using our example above, we would have the following:

|        | Goods Distribution             | Bads Distribution         | Weight of Evidence         |
|------------|-------------------|-------------|-------------|
| A     | 0.52           | 0.47       | 0.089       |
| B    |  0.28        | 0.38      | -0.316       |
| C      | 0.21      | 0.15   | 0.331       |

A positive weight of evidence means that the category contains a higher proportion of goods than bads, and a negative one means the opposite. The closer the value is to zero, the less discriminating the category is.

Finally, the information value is computed as the sum over all categories of a feature:

```math
{IV = \sum{(goods\_distribution - bads\_distribution) * WOE}}
```

In our case, it yields an information value of about 0.056. Features with information values under 0.02 are generally considered uninformative. The user can modify the information value threshold through the Dataiku Application.
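
The weights of evidence and the information value for the example can be checked with a few lines of Python. The function name `woe_and_iv` and the data layout are illustrative assumptions; the sketch also assumes every category has at least one good and one bad, since a zero count would make the logarithm undefined.

```python
import math

# Observed (goods, bads) per category, from the chi-square example.
counts = {"A": (4500, 350), "B": (2400, 280), "C": (1800, 110)}


def woe_and_iv(counts):
    """Return the weight of evidence per category and the information value."""
    total_goods = sum(goods for goods, _ in counts.values())
    total_bads = sum(bads for _, bads in counts.values())
    woe = {}
    iv = 0.0
    for category, (goods, bads) in counts.items():
        goods_dist = goods / total_goods
        bads_dist = bads / total_bads
        woe[category] = math.log(goods_dist / bads_dist)  # natural logarithm
        iv += (goods_dist - bads_dist) * woe[category]
    return woe, iv


woe, iv = woe_and_iv(counts)

iv_threshold = 0.02  # commonly used cut-off mentioned above
is_informative = iv >= iv_threshold
```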

# Correlation

Another way to filter features is to look at the relationships between the features themselves, to avoid keeping groups of features that are too strongly correlated with each other, especially since our model will be a linear model. Correlation computations only apply to numeric variables because the variable needs a notion of order. Here, the user can choose between two correlation methods: Pearson and Spearman.

- The Pearson correlation coefficient is the standard way of computing correlation. It is defined as the covariance of the two variables divided by the product of their standard deviations.

```math
{\rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}}
```

- The Spearman correlation coefficient looks at the ranks of the variables instead of their raw values. It is particularly useful when the relationship between the variables is monotonic but strongly non-linear.

```math
{r = \frac{\operatorname{cov}(R(X), R(Y))}{\sigma_{R(X)} \sigma_{R(Y)}}}
```
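
To make the difference between the two coefficients concrete, here is a minimal pure-Python sketch. The helper names `pearson`, `ranks`, and `spearman` and the sample data are hypothetical. Ties are given average ranks, which is a common convention but an assumption here.

```python
import math
from statistics import mean


def pearson(x, y):
    """Covariance divided by the product of standard deviations."""
    mx, my = mean(x), mean(y)
    # The 1/n factors of the covariance and standard deviations cancel out.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation applied to the ranks of x and y."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic but non-linear relationship

rho = pearson(x, y)   # below 1 because the relationship is not linear
r = spearman(x, y)    # equal to 1 (up to rounding) because ranks agree
```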

Both metrics lie between -1 and 1; the closer they are to zero, the weaker the relationship between the two variables. We set a threshold that applies to the absolute value of the chosen metric. If the absolute correlation between two variables exceeds the threshold, we keep only the one with the highest information value.
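
One possible way to apply this rule is a greedy pass over the features sorted by decreasing information value. The sketch below is an assumption about how the rule could be implemented, not the project's actual code. The feature names, information values, correlations, and the 0.8 threshold are all made up for illustration.

```python
def filter_correlated(iv, corr, threshold=0.8):
    """Keep a feature only if it is not too correlated with a kept one.

    iv:   dict mapping feature name -> information value
    corr: dict mapping frozenset({feature_a, feature_b}) -> correlation
    """
    kept = []
    # Visit features from highest to lowest information value, so that
    # within a correlated pair the more informative feature is seen first.
    for feature in sorted(iv, key=iv.get, reverse=True):
        if all(abs(corr[frozenset((feature, k))]) <= threshold for k in kept):
            kept.append(feature)
    return kept


# Hypothetical inputs.
iv = {"income": 0.30, "salary": 0.25, "age": 0.10}
corr = {
    frozenset(("income", "salary")): 0.95,
    frozenset(("income", "age")): 0.20,
    frozenset(("salary", "age")): 0.15,
}

kept = filter_correlated(iv, corr, threshold=0.8)
print(kept)  # ['income', 'age']
```

Here `salary` is dropped because it is strongly correlated with `income`, which has the higher information value.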


