The Claim Frequency Modeling zone consists of a model that is trained on [claim_train](dataset:claim_train) and scored on [claim_test](dataset:claim_test). For a technical explanation of the GLM plugin, please refer to the [specific documentation](https://www.dataiku.com/product/plugins/glm/).

![Claim Frequency Modeling.png](5YJqT1ZVX0Ja)

# Preprocessing

In the [analysis](analysis:hu0snSHQ), we first add specific steps to the Visual Script within the Model. These script steps are similar to those of a Prepare recipe. Once the model is deployed, these initial preparation steps are computed each time the model is used for scoring or called through an API.

Steps from the [feature processing](article:12) are added to the script, along with the following:

- Vehicle Age is binned in a VehAgeBin variable, in three bins: age 0, age > 0 and < 10, age > 10.
- Driver Age is binned in a DrivAgeBin variable, in custom bins, smaller for younger ages which display more disparate behaviours.
- Vehicle Power is clipped at 9, because higher values are not significant enough.
- Area is transformed into an ordinal variable (A becomes 1, B becomes 2, etc.), and its logarithm is then taken.
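
The four steps above are configured in the Visual Script, not in code. As a rough illustration of what they compute, here is a minimal NumPy sketch. The function names, the bin labels, the driver age edges and the choice to put an age of exactly 10 in the upper bin are all assumptions made for the example; they are not taken from the project.

```python
import numpy as np

def bin_vehicle_age(veh_age):
    """Three bins: age 0, age between 0 and 10, and the rest."""
    veh_age = np.asarray(veh_age)
    # Assumption: an age of exactly 10 is assigned to the upper bin.
    return np.where(veh_age == 0, "0", np.where(veh_age < 10, "1-9", "10+"))

def bin_driver_age(driv_age, edges=(21, 26, 31, 41, 51, 71)):
    """Custom bins, narrower for younger drivers. The edges are hypothetical."""
    return np.digitize(np.asarray(driv_age), edges)

def clip_vehicle_power(veh_power, cap=9):
    """Replace every power value above the cap by the cap."""
    return np.minimum(np.asarray(veh_power), cap)

def log_ordinal_area(area):
    """Map 'A' to 1, 'B' to 2, and so on, then take the natural log."""
    ranks = np.array([ord(a) - ord("A") + 1 for a in area], dtype=float)
    return np.log(ranks)
```

For instance, `clip_vehicle_power([4, 9, 12])` leaves 4 and 9 unchanged and maps 12 to 9, and `log_ordinal_area(["A", "C"])` returns log(1) and log(3).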

# Target and Objective

The aim of this model is to predict ClaimNb, the number of claims made by a policyholder. It is thus a regression problem, but as ClaimNb is distributed like a Poisson variable, the default evaluation metrics are not the most relevant. Most of them assume that the target is normally distributed and might not discriminate well across models that try to capture a variable with a different distribution. Therefore, we create a custom metric and save it as a code sample so that it can be reused elsewhere. The metric of interest in this particular problem is the average Poisson deviance.

[Deviance](https://en.wikipedia.org/wiki/Deviance_(statistics)) is a goodness-of-fit metric equal to twice the difference between the saturated log-likelihood and the fitted log-likelihood. The saturated log-likelihood is the log-likelihood of the optimal Poisson fit for each observation, where the predicted mean equals the observed value. The smaller the deviance, the better.
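
As an illustration of the metric, the sketch below computes the average Poisson deviance, 2 * mean(y * log(y / mu) - (y - mu)), with the convention that y * log(y / mu) is 0 when y is 0. The function name is an assumption, and the way the project's code sample plugs into the custom metric interface is not shown here.

```python
import numpy as np

def mean_poisson_deviance(y_true, y_pred):
    """Average Poisson deviance between observed counts and predicted means."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_pred <= 0):
        raise ValueError("predicted means must be strictly positive")
    # y * log(y / mu), using the convention 0 * log(0) = 0
    ratio = np.where(y_true > 0, y_true / y_pred, 1.0)
    log_term = y_true * np.log(ratio)
    return float(np.mean(2.0 * (log_term - (y_true - y_pred))))
```

A prediction that matches every observation exactly gives a deviance of 0, and the value grows as predictions move away from the observed counts.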

# Feature Handling

In the feature handling, we experiment iteratively, for example: taking Driver Age as a single numerical variable, or fitting the Driver Age bins that were created in the Script, or even using regression splines through a custom preprocessor.

Overall, we follow certain principles:
- we never include a variable twice, for example both as bins and as a numerical variable
- we do not rescale variables, so that coefficients can be interpreted in a straightforward way
- we handle categorical variables with a [one-hot encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) and drop one dummy, to avoid collinearity between dummy variables
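
The reason for dropping one dummy can be checked numerically. In the minimal sketch below, a full set of dummies sums to the intercept column, so the design matrix loses a rank; dropping one dummy restores full column rank. The helper `one_hot`, the toy values and the choice of dropping the first category in sorted order are assumptions for the example.

```python
import numpy as np

def one_hot(values, drop_first=False):
    """Return (matrix, kept_categories) with one 0/1 column per kept category."""
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    matrix = np.array([[1.0 if v == c else 0.0 for c in kept] for v in values])
    return matrix, kept

veh_age_bin = ["0", "1-9", "10+", "0", "1-9", "10+"]  # toy values
intercept = np.ones((len(veh_age_bin), 1))

full, _ = one_hot(veh_age_bin)
reduced, kept = one_hot(veh_age_bin, drop_first=True)

# 4 columns (intercept + 3 dummies) but the dummies sum to the intercept
rank_full = np.linalg.matrix_rank(np.hstack([intercept, full]))
# 3 columns (intercept + 2 dummies), all linearly independent
rank_reduced = np.linalg.matrix_rank(np.hstack([intercept, reduced]))
```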

We have not experimented with creating combinations of variables; this is a potential improvement of the model.

To create regression splines, we have implemented a custom preprocessor which is saved in code samples. With this code sample, the user can define:
- the degree of the polynomial: 0 for constant, 1 for linear, 2 for quadratic, 3 for cubic.
- the knots: points at which the piecewise polynomials meet
- the prefix of the output variables
- the keep original column flag: whether the original column should be kept alongside the spline variables. It should be left at False, so that the variable is not included twice.
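
To make the four parameters concrete, here is a minimal sketch of a spline expansion using a truncated power basis. This is one common way to build regression splines and is an assumption: the project's code sample may use a different basis, and the function name and output column names are hypothetical.

```python
import numpy as np

def spline_basis(x, degree, knots, prefix, keep_original=False):
    """Truncated power basis: x^1..x^degree plus (x - k)_+^degree for each knot k."""
    x = np.asarray(x, dtype=float)
    columns = {}
    if keep_original:
        columns[prefix + "_orig"] = x
    # global polynomial terms (none when degree is 0)
    for p in range(1, degree + 1):
        columns[f"{prefix}_pow{p}"] = x ** p
    # one extra term per knot, which lets the polynomial change at that knot
    for i, k in enumerate(knots):
        if degree == 0:
            # piecewise constant: a step that switches on after the knot
            columns[f"{prefix}_knot{i}"] = (x > k).astype(float)
        else:
            columns[f"{prefix}_knot{i}"] = np.maximum(x - k, 0.0) ** degree
    return columns

# Linear spline on a toy driver age column with two illustrative knots
cols = spline_basis([20, 30, 50], degree=1, knots=[25, 40], prefix="DrivAge")
```

In this sketch, setting `keep_original=True` with a degree of 1 or more would add a column identical to the first power term, which illustrates why keeping the original column would include the variable twice.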

# Model

In the list of algorithms, we select the Generalized Linear Model Regression and select the following options:
- Elastic Net Penalty: 0, meaning we do not regularize the regression. With many variables, some of which are collinear, a non-zero penalty would shrink some coefficients to zero and reduce the risk of overfitting.
- Distribution: Poisson
- Link function: Log, since ClaimNb is non-negative and the log link keeps the predicted number of claims positive. It also enables us to define an exposure that normalizes our observations by Exposure.
- Offset mode: Offsets/Exposures to add an exposure column
- Training dataset: claim_train, which is the dataset used in the analysis
- Offset columns: empty
- Exposure columns: Exposure, to normalize ClaimNb on the period the contract was held
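
With these options, the fitted model has the standard form log E[ClaimNb] = log(Exposure) + X·beta, with no penalty. The sketch below fits such a model by iteratively reweighted least squares in plain NumPy. It is only meant to show how the exposure enters the model; it is not the plugin's implementation, and the function names and toy data are assumptions.

```python
import numpy as np

def fit_poisson_glm(X, y, exposure, n_iter=50, tol=1e-10):
    """Unpenalized Poisson GLM, log link, log(exposure) as offset, fitted by IRLS."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    offset = np.log(np.asarray(exposure, dtype=float))
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta + offset
        mu = np.exp(eta)
        # working response (offset removed); the IRLS weights equal mu for the log link
        z = eta - offset + (y - mu) / mu
        XtW = X.T * mu
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

def predict_claims(X, beta, exposure):
    """Expected number of claims: exposure times the predicted frequency."""
    return np.asarray(exposure, dtype=float) * np.exp(np.asarray(X, dtype=float) @ beta)

# Toy data: intercept plus one binary feature (illustrative values only)
X = np.array([[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [1, 1]])
y = np.array([0, 1, 0, 2, 1, 0])
exposure = np.array([0.5, 1.0, 0.5, 1.0, 1.0, 0.5])
beta = fit_poisson_glm(X, y, exposure)
```

On this toy data, `np.exp(beta[0])` is the claim frequency per unit of exposure in the first group (total claims divided by total exposure), which is the normalization that the Exposure column provides.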

# Results

Results are analyzed mainly through the GLM Summary View available in the Views screen.

The GLM metrics provide three essential indicators to evaluate the fit of the model. Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) both combine goodness-of-fit with a penalty on the number of variables included in the model. Deviance, on the other hand, is only a goodness-of-fit metric. Deviance on the training data should decrease each time a new variable is included in the model. AIC and BIC, by contrast, start to increase at some point if the added variables do not improve the fit significantly enough.
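
For reference, the textbook definitions are AIC = 2k - 2 log L and BIC = k ln(n) - 2 log L, where k is the number of fitted parameters, n the number of observations and log L the fitted log-likelihood. The sketch below applies them to a Poisson log-likelihood. How the plugin counts parameters is not documented here, so the function names and the value of k are assumptions.

```python
import math

def poisson_log_likelihood(y_true, mu):
    """Sum of log Poisson probabilities; every mu must be strictly positive."""
    return sum(y * math.log(m) - m - math.lgamma(y + 1) for y, m in zip(y_true, mu))

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: fixed penalty of 2 per parameter."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: penalty of ln(n_obs) per parameter."""
    return n_params * math.log(n_obs) - 2 * log_likelihood
```

Adding a variable raises the penalty term by a fixed amount, so AIC and BIC only go down if the log-likelihood improves by more than that amount.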

![GLM metrics.png](eCSYa1dofKT9)

GLM Actual versus Expected graphs help the analyst visualize the dependency between each variable and the response. For each of the variables available (either included or not in the model), the view displays three graphs:
- The Base Graph represents the pure effect of each variable in the model, by setting all other variables to their base values, against the actual values. The base value is defined as the modal value (the most frequent value for categorical variables, and the most frequent bin for numeric variables, which are split into 20 uniformly sized bins).
- The Predicted Graph represents the average predicted value in each bin of the selected variable against the actual values.
- The Ratio Graph takes the ratio of predicted and actual.
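
The Predicted and Ratio graphs can be approximated outside the view with a few lines of NumPy. The sketch below is an assumption-laden illustration, not the plugin's code: it takes "uniformly sized" to mean equal-width bins, uses plain unweighted averages (no exposure weighting), and the function name is hypothetical.

```python
import numpy as np

def actual_vs_predicted_by_bin(x, actual, predicted, n_bins=20):
    """Per-bin mean of actual and predicted values, and their ratio."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # use inner edges only, so bin indices run from 0 to n_bins - 1
    idx = np.digitize(x, edges[1:-1])
    counts = np.bincount(idx, minlength=n_bins)
    sum_actual = np.bincount(idx, weights=actual, minlength=n_bins)
    sum_pred = np.bincount(idx, weights=predicted, minlength=n_bins)
    # empty bins or bins with zero actual claims yield nan or inf
    with np.errstate(divide="ignore", invalid="ignore"):
        mean_actual = sum_actual / counts
        mean_pred = sum_pred / counts
        ratio = mean_pred / mean_actual
    return mean_actual, mean_pred, ratio
```

A ratio close to 1 in every bin suggests that the model captures the effect of the selected variable well, while a systematic drift away from 1 points to a missing or poorly encoded effect.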

![GLM graphs.png](utaoN5vDXCDZ)