Traditionally, a credit scoring analyst would go through each variable to decide whether to include it in the final model. Several iterations may also be needed to fit models on different sets of variables and compare them before choosing the best one according to predefined criteria. In this solution, feature selection is automated. For each of the feature selection methods described below, the user defines the number of features they want to obtain. In practice, feature selection can be done in many other ways, depending on the constraints imposed on the structure of the final credit risk model.

# Forward Selection

Forward selection is a wrapper method: it trains the model multiple times and assesses its performance on different sets of variables. It is called forward because we start with zero variables and add them iteratively, as opposed to backward selection, which starts with all the variables and removes them one by one. The algorithm runs as follows:

```
selected_features = empty_list
n = 0
while n < nb_features:
    for each unselected feature:
        train model with all selected features plus this unselected feature and save performance metric
    keep the feature that optimizes performance metric and append it to selected_features
    n = n + 1
```

This greedy method yields a local optimum rather than a global one, since it does not evaluate all possible sets of variables. However, it has the nice property of first including the variables that contribute most to the performance metric, which can help interpretability.
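
The greedy loop above can be sketched in Python as follows. This is a minimal illustration, not the solution's actual implementation: `forward_selection`, `score_fn`, the feature names, and the toy gains are all hypothetical, and the scorer is assumed to return a metric where higher is better.

```python
from typing import Callable, List, Sequence


def forward_selection(
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
    nb_features: int,
) -> List[str]:
    """Greedily pick nb_features names, adding the best-scoring one each round."""
    selected: List[str] = []
    remaining = list(candidates)
    while len(selected) < nb_features and remaining:
        # Score each candidate when added to the features already selected.
        scores = {f: score_fn(selected + [f]) for f in remaining}
        best = max(scores, key=scores.get)  # assumption: higher is better
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy scorer (purely illustrative): each feature adds a fixed gain, and
# "income" and "salary" are redundant, so holding both adds nothing extra.
GAINS = {"income": 0.30, "salary": 0.28, "age": 0.10, "tenure": 0.05}


def toy_score(features: List[str]) -> float:
    total = sum(GAINS[f] for f in features)
    if "income" in features and "salary" in features:
        total -= GAINS["salary"]
    return total


selected = forward_selection(list(GAINS), toy_score, nb_features=3)
print(selected)  # ['income', 'age', 'tenure']
```

In a real setting, `score_fn` would typically train the model on the given features and return a validation metric. The toy run shows why the order matters: `salary` scores well on its own but is never picked once `income` is in the set.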

# Lasso Selection

Lasso selection is an embedded method that adds a [lasso penalty](https://en.wikipedia.org/wiki/Lasso_(statistics)) to the model to reduce the number of variables. Lasso regression adds to the loss a term proportional to the L1 norm of the regression coefficients, which shrinks the coefficients and sets some of them exactly to zero; the larger the penalty, the stronger the shrinkage. The number of selected variables generally decreases as the lasso penalty increases. Here, we target a given number of variables in the final model and find the corresponding lasso penalty by bisection:

```
Set lasso penalty to initial value c
Fit model for lasso penalty c
Compute n_c = number of selected features
a = undefined  # largest penalty known to keep too many features
b = undefined  # smallest penalty known to keep too few features
while n_c != nb_features:
    if n_c > nb_features:
        # too many features: the penalty must increase
        a = c
        if b is undefined:
            c = c * 10
        else:
            c = (a + b) / 2
    else:
        # too few features: the penalty must decrease
        b = c
        if a is undefined:
            c = c / 10
        else:
            c = (a + b) / 2
    Fit model for lasso penalty c
    Compute n_c = number of selected features
```

The algorithm iteratively searches for the lasso penalty that yields the desired number of variables. Because the lasso fits all variables together, the solution is a global optimum of the penalized objective for that penalty, in contrast to the greedy forward selection.
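
The penalty search can be sketched in Python as below. This is a hedged illustration under stated assumptions: `find_penalty` and `count_selected` are hypothetical names, the number of selected features is assumed to be non-increasing in the penalty, and an iteration cap is added because no penalty may yield exactly the requested count (for example when two coefficients vanish at the same penalty).

```python
from typing import Callable, Optional


def find_penalty(
    count_selected: Callable[[float], int],
    nb_features: int,
    c: float = 1.0,
    max_iter: int = 100,
) -> float:
    """Search for a penalty c such that count_selected(c) == nb_features."""
    lower: Optional[float] = None  # largest penalty seen with too many features
    upper: Optional[float] = None  # smallest penalty seen with too few features
    for _ in range(max_iter):
        n_c = count_selected(c)
        if n_c == nb_features:
            return c
        if n_c > nb_features:
            # Too many features: raise the penalty (x10 until bracketed).
            lower = c
            c = c * 10 if upper is None else (lower + upper) / 2
        else:
            # Too few features: lower the penalty (/10 until bracketed).
            upper = c
            c = c / 10 if lower is None else (lower + upper) / 2
    raise RuntimeError("no penalty found for the requested number of features")


# Toy stand-in for "fit the model and count non-zero coefficients":
# feature j is dropped once the penalty reaches THRESHOLDS[j] (made-up values).
THRESHOLDS = [0.02, 0.3, 0.5, 4.0, 25.0, 70.0]


def toy_count(penalty: float) -> int:
    return sum(1 for t in THRESHOLDS if t > penalty)


penalty = find_penalty(toy_count, nb_features=4)
print(penalty, toy_count(penalty))
```

In practice `count_selected` would fit the penalized model and count its non-zero coefficients. Note that some libraries parameterize regularization by its inverse, in which case the direction of the search flips.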

# Tree-Based Selection

Also an embedded method, tree-based selection fits a tree-based model and infers which variables to keep from the variable importances in this model. In our case, a random forest was trained on all the variables to predict the target variable. The importance of each variable is based on impurity, as described by [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.feature_importances_). One caveat of this technique is that a random forest can extract non-linear patterns from the data, whereas the final model, being a logistic regression, cannot. Therefore, some variables may be deemed important by this method while not yielding meaningful linear coefficients afterward.
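
Once importances are available, keeping the requested number of variables reduces to a ranking step. The sketch below assumes the importances have already been collected into a dictionary, for instance by pairing feature names with a fitted forest's `feature_importances_` attribute; the helper name, feature names, and values are hypothetical.

```python
from typing import Dict, List


def select_top_features(importances: Dict[str, float], nb_features: int) -> List[str]:
    """Return the nb_features names with the highest importance, best first."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:nb_features]


# Made-up importances for illustration; in practice they could come from
# dict(zip(feature_names, forest.feature_importances_)) on a fitted forest.
importances = {"income": 0.41, "age": 0.07, "utilization": 0.33, "tenure": 0.19}

kept = select_top_features(importances, nb_features=2)
print(kept)  # ['income', 'utilization']
```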


