As described in the [Generalized Linear Models](article:3) article, fitting GLMs requires extensive feature engineering: understanding the dependency between each variable and the response, and also between the variables themselves to check for collinearity.

# Univariate Analysis

This simple analysis looks at each candidate variable one by one to check its distribution. The target variable is also plotted against each variable to reveal any dependency between them. For numerical variables, the dependency can appear linear, polynomial, or more complex, and it can change from one part of the variable's range to another. For categorical variables, this analysis highlights categories that behave similarly and could be regrouped, as well as categories that do not contain enough observations and should therefore be filtered out or merged with others to avoid overfitting.
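Outside of the charting interface, the same checks can be sketched in a few lines of pandas. The snippet below is only an illustration on synthetic data: the column names (`driver_age`, `vehicle_brand`, `claim_count`), the bin edges, and the `min_obs` threshold are all assumptions made for the example, not values taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a real portfolio (hypothetical column names).
rng = np.random.default_rng(0)
n = 5_000
df = pd.DataFrame({
    "driver_age": rng.integers(18, 80, size=n),
    "vehicle_brand": rng.choice(["A", "B", "C", "D"], size=n, p=[0.5, 0.3, 0.19, 0.01]),
})
# Claim counts are simulated with a higher frequency for drivers under 30.
df["claim_count"] = rng.poisson(0.1 + 0.2 * (df["driver_age"] < 30))

# Numerical variable: bin it, then look at the volume and mean response per bin.
age_bins = pd.cut(df["driver_age"], bins=[17, 25, 30, 40, 52, 65, 80])
by_age = df.groupby(age_bins, observed=True)["claim_count"].agg(["count", "mean"])

# Categorical variable: same summary, plus a flag for sparse categories.
by_brand = df.groupby("vehicle_brand")["claim_count"].agg(["count", "mean"])
min_obs = 100  # arbitrary threshold for the example, to be tuned per dataset
by_brand["too_sparse"] = by_brand["count"] < min_obs

print(by_age)
print(by_brand)
```

Categories flagged as `too_sparse` are candidates for merging or filtering, and the per-bin means give a first idea of the shape of the dependency.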

# Cross Analysis and Correlation

More complex dependencies are uncovered when looking at variables taken together. For numerical variables, the [correlation matrix](insight:aMcJ7PU) is built. When high positive or negative correlations are observed, typically above 0.5 in absolute value, caution should be taken if both variables are included in the model. A high correlation can mean that the information carried by one variable is already contained in the other, and fitting a linear model against both will result in unstable coefficients and overfitting.
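As a rough sketch of this check, the correlation matrix can be computed with pandas and scanned for pairs above the threshold. The data is synthetic and the column names are hypothetical; only the 0.5 cutoff comes from the text above.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic numerical features: vehicle_value is built from vehicle_power,
# driver_age is independent of both (hypothetical column names).
rng = np.random.default_rng(1)
n = 2_000
vehicle_power = rng.normal(100, 20, size=n)
num = pd.DataFrame({
    "vehicle_power": vehicle_power,
    "vehicle_value": 200 * vehicle_power + rng.normal(0, 1_000, size=n),
    "driver_age": rng.normal(45, 12, size=n),
})

corr = num.corr()  # Pearson correlation matrix

# List each pair of distinct variables whose absolute correlation exceeds 0.5.
threshold = 0.5
flagged = [
    (a, b, corr.loc[a, b])
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
]

for a, b, value in flagged:
    print(f"{a} / {b}: correlation = {value:.2f}")
```

Each flagged pair is a prompt for a modeling decision (keep one variable, combine them, or regularize), not an automatic exclusion.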

Correlation matrices are not straightforward to build for categorical variables, especially when the categories have no obvious order. Variables of any kind can nevertheless be cross-analyzed by checking their joint distribution. To avoid collinearity, we want to avoid including two variables when the distribution of one differs sharply conditional on the value of the other.
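One simple way to look at a joint distribution is a normalized cross-tabulation, compared against the marginal distribution. The sketch below uses synthetic data in which `fuel` is deliberately tied to `region`; both column names and the "largest gap" summary are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic categorical data where fuel type depends strongly on region.
rng = np.random.default_rng(2)
n = 4_000
region = rng.choice(["north", "south"], size=n)
p_diesel = np.where(region == "north", 0.8, 0.2)
fuel = np.where(rng.random(n) < p_diesel, "diesel", "petrol")
cat = pd.DataFrame({"region": region, "fuel": fuel})

# Distribution of fuel conditional on region (each row sums to 1).
conditional = pd.crosstab(cat["region"], cat["fuel"], normalize="index")

# Distribution of fuel ignoring region.
marginal = cat["fuel"].value_counts(normalize=True)

# Largest absolute gap between a conditional share and the marginal share.
deviation = conditional.sub(marginal, axis="columns").abs()
max_gap = deviation.to_numpy().max()

print(conditional)
print(f"largest gap vs. marginal distribution: {max_gap:.2f}")
```

If the two variables were independent, every row of `conditional` would be close to `marginal` and `max_gap` would be near zero.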

# Maps

Geographic variables are better visualized on maps, which an analyst with prior business knowledge can quickly scan to check whether the data matches intuition. Dataiku provides powerful map charting tools embedded in the charting interface. They make it possible to create visualizations like [this one](insight:LQcLDnQ) to quickly get a feel for the data.

# Regression Splines

As mentioned above in the Univariate Analysis section, variables sometimes exhibit different identifiable patterns depending on the range of the variable. In our [example](insight:dMuVSic), claim frequency decreases steeply with driver age before age 30, then increases slightly, and above 52 the pattern is less clear. [Regression splines](https://patsy.readthedocs.io/en/latest/spline-regression.html) let the user define piecewise polynomial dependencies, with the pieces joined at user-chosen knots between the lower and upper bounds. For example, splines of degree zero are piecewise-constant functions, those of degree one are piecewise-linear, and those of degree three are cubic splines.
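To make the idea concrete, here is a minimal sketch of a degree-one spline with knots at 30 and 52, matching the driver age example. It is not the patsy implementation linked above: it builds a truncated power basis by hand with NumPy (which spans the same piecewise-linear functions as a B-spline basis with the same knots) and fits it by ordinary least squares on synthetic data rather than with a GLM. The helper name `linear_spline_basis` and the simulated frequencies are assumptions.

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Truncated power basis of degree one: [1, x, (x - k1)+, (x - k2)+, ...]."""
    x = np.asarray(x, dtype=float)
    columns = [np.ones_like(x), x]
    columns += [np.clip(x - k, 0.0, None) for k in knots]
    return np.column_stack(columns)

# Synthetic frequency: steep decrease before 30, mild increase up to 52, flat after.
rng = np.random.default_rng(3)
age = rng.uniform(18, 80, size=3_000)
true_freq = (
    0.30
    - 0.015 * (np.minimum(age, 30) - 18)
    + 0.002 * np.clip(np.minimum(age, 52) - 30, 0.0, None)
)
observed = true_freq + rng.normal(0, 0.02, size=age.size)

# Least-squares fit of the piecewise-linear model.
X = linear_spline_basis(age, knots=[30, 52])
coef, *_ = np.linalg.lstsq(X, observed, rcond=None)

# Each hinge coefficient is the change in slope at its knot.
slope_before_30 = coef[1]
slope_30_to_52 = coef[1] + coef[2]
slope_after_52 = coef[1] + coef[2] + coef[3]
print(slope_before_30, slope_30_to_52, slope_after_52)
```

The appeal of this parameterization is that the model stays linear in its coefficients, so the spline columns can be fed to a GLM like any other feature.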

# Feature Selection

Feature selection is an important part of modeling to avoid overfitting. When fitting a GLM, in-sample performance keeps improving as more variables are added. But at some point, the in-sample improvement is simply due to overfitting and leads to a deterioration of out-of-sample performance. A few strategies can be put in place to make sure we select the right number of variables:

- Elastic Net Regularization: L1 and L2 penalties are applied to the coefficients of the regression. The L1 penalty in particular shrinks the coefficients of the least useful variables to zero. Features whose coefficients have been shrunk to zero can be removed from the model.
- Forward or Backward Selection using AIC or BIC metrics: The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are two metrics that combine goodness-of-fit with a penalty on the number of variables used in the model. From an initial starting point, a variable can be removed from the model. This worsens the goodness-of-fit but reduces the penalty term. If the AIC or BIC ends up decreasing with the removal, the model is deemed better. The analyst can carry on removing variables until the metric stops decreasing. This approach is called backward selection. Forward selection consists of starting with the single best variable and iteratively adding one variable at a time until the AIC or BIC stops decreasing.
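
The backward selection loop described above can be sketched as follows. To stay self-contained, the example assumes a Gaussian model fitted by least squares, for which the AIC reduces (up to an additive constant) to `n * log(RSS / n) + 2 * k`; with another GLM family the same loop would use that model's own AIC. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def gaussian_aic(X, y):
    """AIC, up to an additive constant, of a least-squares fit of y on X."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * X.shape[1]

def backward_selection(X, y, names):
    """Remove one feature at a time for as long as the AIC keeps decreasing."""
    intercept = np.ones((len(y), 1))

    def aic_for(cols):
        return gaussian_aic(np.hstack([intercept, X[:, cols]]), y)

    kept = list(range(X.shape[1]))
    best = aic_for(kept)
    while kept:
        # AIC of every model obtained by dropping exactly one remaining feature.
        candidates = [(aic_for([c for c in kept if c != d]), d) for d in kept]
        aic, drop = min(candidates)
        if aic >= best:
            break  # no removal improves the criterion
        best = aic
        kept.remove(drop)
    return [names[c] for c in kept], best

# Synthetic data: x0 and x1 drive the response, x2 to x4 are pure noise.
rng = np.random.default_rng(4)
n = 1_000
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)
names = [f"x{i}" for i in range(5)]

full_aic = gaussian_aic(np.hstack([np.ones((n, 1)), X]), y)
selected, final_aic = backward_selection(X, y, names)
print(selected, final_aic)
```

Replacing the `2 * X.shape[1]` penalty with `np.log(n) * X.shape[1]` turns the criterion into BIC, which penalizes extra variables more heavily on large datasets.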

Many other feature selection strategies exist, along with feature reduction techniques that reduce the number of dimensions of the initial feature space and keep only the most impactful ones.