A standard way of handling variables to create scorecards is to encode the features with their weight of evidence. To achieve this encoding, features must be categorical. Therefore, numeric variables must be converted to categorical variables. This transformation is done by optimizing the information value, we are also running an optimization to reduce the number of bins for categorical variables. The weight of evidence and information value are defined in the [previous article](article:7).

# Optimal Binning for Numeric Variables

The numeric variables optimal binning consists of finding the cuts that maximize the information value of the resulting binning while satisfying a few constraints. The resources for the implementation are listed [here](article:4). The implemented constraints are:

- Monotonicity: As the values of the variable increase from one bin to the following, the weight of evidence should either always increase or always decrease.
- Representativeness: Each bin should contain at least a minimum percentage of the total number of observations and a minimum percentage of observations that are bad credits.

Under those constraints, the algorithm will start with a maximum number of bins and merge them progressively while ensuring that bins are different enough. Missing values will also be assigned to a bin.

# Optimal Binning for Categorical Variables

Categorical variables should not be directly encoded into their weight of evidence to avoid having some bins with a very low number of observations, thus not representative. The goal of this step of binning is to merge bins that have similar weights of evidence to maximize the information value while satisfying the constraint of having at least a given percentage of observations within each bin. Information value will decrease as the number of bins decreases, so the stronger the constraints, the least informative will be the resulting feature. A balance must be found between robustness and predictive power.

# Encoding

The weight of evidence is used to compute the information value and encode the variables. Hence the model trained will use numeric variables as input, which are the weights of evidence of the bins defined here. This method allows to compute of only one coefficient per variable and easily convert the logistic regression model into a scorecard as described [later](article:10). The target variable is indirectly encoded in the input so there might be concerns about data leakage when using such encoding, the constraints on the size and representativeness of each bin are meant to guard against such risks.


