This article goes through all the steps, from data ingestion to obtaining the final scorecard. We start from an already prepared dataset that contains 30,000 rows and 181 columns. To configure and execute the process, we will follow the steps listed in the [Dataiku Application](article:15).

# Data Ingestion

The data we are analyzing for this walkthrough is provided with the project in filesystem format. Therefore, we do not need to go through the connection reconfiguration step and can directly jump to Feature Identification. If using your own data, you can connect it directly prior to the next step.

![Feature Identification.png](isYQAZdXbcng)

The credit event variable is named target, and the identifier variable is named id. We then input three regulatory-sensitive demographic variables:

- age: the age of the applicant
- code_gender: the gender of the applicant
- name_family_status: the marital status of the applicant.

Depending on local regulation and internal policies, one can adjust these variables to evaluate if the model and analyses are compliant. These three variables will not be used in the model itself, but a specific analysis will be run using them to evaluate potential issues with the features that are directly used. We click the button to run this initial step.

# Feature Filtering

![Feature Filtering.png](Ri4udHIDLzxj)

Starting with 176 features (the number of columns minus the mandatory columns and the sensitive variables), we reduce this set using the statistical metrics: Information Value, Chi-Square, and Correlation. In the Dataiku Application, we set the following parameters:

- Information Value Threshold: 0.02, a usual threshold under which variables are deemed useless for our prediction.
- Chi-Square p-value: 0.05, the standard significance level. Variables with a p-value above this threshold show no statistically significant dependence on the target at the 95% confidence level.
- Correlation method: Spearman, or rank-based, correlation. It correlates the ranks of the variables instead of their actual values, so it handles monotonic non-linear relationships better.
- Correlation threshold: 0.7; we process pairs of variables whose correlation is above 0.7 or below -0.7.
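
To make the Information Value filter concrete, here is a minimal sketch using only the Python standard library. It assumes a feature that is already split into bins and the common convention where the weight of evidence of a bin is the log of its share of non-events over its share of events. The function name, feature names, and counts are hypothetical and are not taken from the Dataiku Application.

```python
import math

IV_THRESHOLD = 0.02  # threshold set in the Dataiku Application


def information_value(bins):
    """Compute the Information Value of a binned feature.

    `bins` is a list of (non_events, events) counts, one pair per bin.
    Assumes every bin holds at least one event and one non-event.
    """
    total_non_events = sum(non_events for non_events, _ in bins)
    total_events = sum(events for _, events in bins)
    iv = 0.0
    for non_events, events in bins:
        # Share of all non-events and of all events that fall in this bin.
        non_event_share = non_events / total_non_events
        event_share = events / total_events
        # Weight of evidence of the bin under the assumed convention.
        woe = math.log(non_event_share / event_share)
        iv += (non_event_share - event_share) * woe
    return iv


# Hypothetical counts: (non-events, events) per bin.
candidates = {
    "informative_feature": [(400, 100), (300, 50), (300, 20)],
    "flat_feature": [(500, 85), (500, 85)],
}

# Keep only the features whose Information Value reaches the threshold.
kept = [name for name, bins in candidates.items()
        if information_value(bins) >= IV_THRESHOLD]
print(kept)
```

In this toy example the second feature has the same event rate in both bins, so its Information Value is zero and it is filtered out.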

With this process, 142 variables were filtered out, mostly by the information value filter. Using these metrics, it is also possible to see which variables might be the most interesting in our model:

![Top Information Value.png](1yA8oq0ulgeL)

Variables are ordered by descending Information Value; according to this measure, the top variables are the external scores, then occupation_type and organization_type.

![Top Chi Square.png](b07ETEBuQV8W)

Variables can also be ranked by their Chi-Square p-value, this time in ascending order, since a lower p-value indicates a stronger dependence on the target. The first three variables also belong to the top five Information Value variables.

![Correlation.png](oflM9w9oGDv4)

The correlation matrix shows a large block of correlated variables that seem related to the characteristics of the applicant's home. The correlation filter removes 25 variables from the analysis.
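
The rank-based correlation filter can be sketched as follows, using only the standard library. This is an illustration, not the project's implementation: the column names and values are hypothetical, and the sketch only lists the pairs above the threshold, since the article does not say which variable of a pair is kept.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # average 1-based position
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, rho) for pairs with |rho| above the threshold."""
    names = list(columns)
    pairs = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            rho = spearman(columns[name_a], columns[name_b])
            if abs(rho) > threshold:
                pairs.append((name_a, name_b, rho))
    return pairs


# Hypothetical columns: feature_b is a non-linear but monotonic function of feature_a.
columns = {
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [1, 4, 9, 16, 25],
    "feature_c": [2, 5, 1, 4, 3],
}
pairs = correlated_pairs(columns, threshold=0.7)
print(pairs)
```

Because the correlation is computed on ranks, the squared relationship between the first two columns is flagged as a perfect correlation, which a value-based correlation would understate.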

# Feature Binning

![Feature Binning.png](9gyLJSSJ32ZO)

Next, we are going to bin the remaining features using the weight of evidence as described in this [article](article:8). We set the parameters to the following values:

- Categorization Threshold: 20; numeric variables with fewer than 20 unique values are treated as categorical variables.
- Minimum Share: 0.1; each bin will contain at least 10% of the observations.
- Minimum Minority Share: 0.01; each bin will contain at least 1% of the minority-class observations (only for numeric bins).
- Maximum p-value: 0.05; if the test for mean equality between two neighboring numeric bins does not reject equality at this p-value, the two bins are merged.
- IV filter threshold: 0.02, the same value as before, now applied to the binned variables. Variables with an Information Value under that threshold are dropped.
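
The size constraints above can be illustrated with a short check. This is a sketch under assumptions: the function names and counts are hypothetical, and the minimum minority share is read here as a share of all minority-class observations, which is one plausible interpretation of the parameter rather than a documented definition.

```python
CATEGORIZATION_THRESHOLD = 20
MIN_SHARE = 0.1
MIN_MINORITY_SHARE = 0.01


def is_treated_as_categorical(values, threshold=CATEGORIZATION_THRESHOLD):
    """A numeric variable with fewer than `threshold` unique values is categorical."""
    return len(set(values)) < threshold


def bins_violating_constraints(bins, min_share=MIN_SHARE,
                               min_minority_share=MIN_MINORITY_SHARE):
    """Return the indices of bins that are too small.

    `bins` is a list of (total_count, minority_count) pairs, one per bin.
    """
    total = sum(count for count, _ in bins)
    minority_total = sum(minority for _, minority in bins)
    return [
        index
        for index, (count, minority) in enumerate(bins)
        if count / total < min_share
        or minority / minority_total < min_minority_share
    ]


# Hypothetical numeric variable with only four distinct values.
print(is_treated_as_categorical([0, 1, 2, 3] * 10))

# Hypothetical bins: the last one holds 5% of rows and under 1% of the minority class.
too_small = bins_violating_constraints([(5000, 400), (4500, 395), (500, 5)])
print(too_small)
```

A bin flagged by such a check would typically be merged with a neighbor until both constraints hold.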

11 additional variables are dropped after this step to reach 20 variables. Numeric variables are automatically labeled with their min and max values, while categorical variables get default labels that list all included categories, separated by ```%,%```.

![Editable.png](tRdPUN4O2C50)

We use the link to edit the labels and give them more meaningful names. For instance:

- name_education_type: Academic Degree and Higher Education are grouped in the same bin, which we call High Education. All the other categories are grouped in another bin, which we call Low Education.
- occupation_type: 19 categories have been merged into 3 bins. We name them according to our evaluation of the skill level they imply: High Skill, Medium Skill, and Low Skill.

We can check in the graphs from the dashboard if the relationship between the bins and the weight of evidence fits our intuition.

![Woe Categorical.png](lqqTVcgmIhMf)

In the graph above, one can see a clear relationship between how skilled the occupation is and the weight of evidence. We can directly evaluate whether this matches our expectation and intuition that higher-skill employment is connected to a lower probability of having a credit event.
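
The following sketch shows how renamed bins and their weight of evidence relate. The occupation categories, their grouping, and the counts are hypothetical. It also assumes the convention in which the weight of evidence is the log of the share of non-events over the share of events, so that a higher value means a lower risk.

```python
import math
from collections import Counter

SEPARATOR = "%,%"  # separator used in the default categorical labels

# Hypothetical default labels mapped to the names chosen by the user.
renamed_bins = {
    "Managers%,%Accountants": "High Skill",
    "Sales staff%,%Core staff": "Medium Skill",
    "Laborers%,%Drivers": "Low Skill",
}

# Look-up table from each raw category to the name of its bin.
category_to_bin = {
    category: name
    for label, name in renamed_bins.items()
    for category in label.split(SEPARATOR)
}


def woe_by_bin(rows):
    """Compute the weight of evidence of each bin.

    `rows` is an iterable of (category, target) pairs, target being 1 for a
    credit event. Assumes every bin has at least one event and one non-event.
    """
    events, non_events = Counter(), Counter()
    for category, target in rows:
        bin_name = category_to_bin[category]
        if target == 1:
            events[bin_name] += 1
        else:
            non_events[bin_name] += 1
    total_events = sum(events.values())
    total_non_events = sum(non_events.values())
    return {
        name: math.log((non_events[name] / total_non_events)
                       / (events[name] / total_events))
        for name in renamed_bins.values()
    }


# Hypothetical observations with event rates of 5%, 10%, and 15%.
rows = (
    [("Managers", 0)] * 95 + [("Managers", 1)] * 5
    + [("Sales staff", 0)] * 90 + [("Sales staff", 1)] * 10
    + [("Laborers", 0)] * 85 + [("Laborers", 1)] * 15
)
woe = woe_by_bin(rows)
print(woe)
```

With these counts the weight of evidence decreases from High Skill to Low Skill, which is the kind of monotonic pattern the dashboard graph lets us verify.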

# Feature Selection

20 variables might still be too many for the final model. Therefore, we further reduce this number using an automated feature selection algorithm. These algorithms are explained in more detail in this [article](article:9).

![Feature Selection.png](jrXmA2L0Ya0K)

We configure the feature selection process:

- Feature Selection method: Lasso. Because the Lasso objective is convex, it reaches a global optimum for the set of selected variables.
- Number of Features: 8.
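
To illustrate why Lasso performs feature selection, here is a minimal coordinate descent sketch in pure Python. It is a deliberate simplification: it uses a squared-error loss rather than a model of the credit event, and the data, penalty value, and function names are hypothetical. The point is only to show how the L1 penalty sets the coefficients of uninformative variables to exactly zero.

```python
def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty`, clipping at zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1 / 2n) * ||y - Xw||^2 + alpha * ||w||_1.

    `X` is a list of rows; every column is assumed to be non-zero.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlate feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                prediction = sum(X[i][k] * w[k] for k in range(p))
                partial_residual = y[i] - prediction + X[i][j] * w[j]
                rho += X[i][j] * partial_residual
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            # The L1 penalty zeroes the coefficient when |rho| <= alpha.
            w[j] = soft_threshold(rho, alpha) / scale
    return w


# Hypothetical data: y depends on the first column only.
X = [[-2, 1], [-1, -1], [0, 0], [1, -1], [2, 1]]
y = [-4, -2, 0, 2, 4]

weights = lasso_coordinate_descent(X, y, alpha=0.5)
selected = [j for j, coefficient in enumerate(weights) if coefficient != 0.0]
print(weights, selected)
```

Increasing the penalty removes more variables, so one way to reach a target number of features is to adjust the penalty until only that many coefficients remain non-zero. Whether the Dataiku Application proceeds this way is not stated in this article.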

We run the algorithm and get the results in the dashboard.

![Feature Selection Slide.png](u2BSyDpzywVV)

The bar plot shows the absolute coefficient of each selected variable, which gives a sense of its impact on the model. ```ext_source_2``` appears to be the most important variable, while ```days_id_publish``` contributes so little that it could arguably be left out of the model. The user can iterate on other feature selection methods to check whether the results align or diverge. The analysis linked on the left-hand side also provides information on the models that were trained iteratively to reach this result.

# Score Card Building

Finally, we build the scorecard on the 8 selected variables. 

![Score Card Building.png](BrZPXSr8a9zm)

Clicking the first button triggers the model training. We can then adjust the scale and level of the scorecard by modifying the following parameters:

- Base Score: 600, the tipping point of our scorecard; we will consider applicants with scores above that threshold to be good credit risks.
- Base Odds: 9, the expected odds for that base score. We want accepted applicants to have better odds than this one.
- Points to Double Odds: 50, each time the score increases by 50 points, the odds will double.
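
These three parameters define a standard log-odds scaling, sketched below. The sketch assumes that odds are expressed as non-events to events, which the article does not state explicitly, and the function names are hypothetical.

```python
import math

BASE_SCORE = 600
BASE_ODDS = 9
POINTS_TO_DOUBLE_ODDS = 50

# Points added per unit of log-odds, and the score at odds of 1.
FACTOR = POINTS_TO_DOUBLE_ODDS / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)


def score_from_odds(odds):
    """Map odds (assumed non-events to events) to a scorecard score."""
    return OFFSET + FACTOR * math.log(odds)


def score_from_event_probability(probability):
    """Map a predicted credit event probability to a scorecard score."""
    return score_from_odds((1 - probability) / probability)


print(round(score_from_odds(9)))    # base odds give the base score
print(round(score_from_odds(18)))   # doubling the odds adds 50 points
print(round(score_from_event_probability(0.1)))
```

Under this scaling, an applicant with a 10% predicted probability of a credit event has odds of 9 and therefore lands exactly on the base score of 600.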

![Score Card Slide.png](sYvdubidIaYh)

After launching the scenario, we can access the scorecard in the dashboard. The dataset on the left shows, for each variable, the score allocated to each bin, so the user can see which bins raise or lower the score. The chart on the right shows the performance of applicants depending on their scores, which were computed on the test dataset held out in the analysis. Here we can check that there is a clear dependency between the credit event rate and the score. We then trigger the following scenario to update the webapp and interact with the features to understand how the scorecard reacts.


