Scoring and Complexity





Kerry Back

Model score

  • The default scoring method for regression models is R-squared.
  • Equals 1 - MSE / (mean squared deviation of the target from its mean)
    • MSE = mean squared error
    • So we compare the model's MSE to the MSE from using the mean as the prediction
  • The maximum R-squared is 1, and it can be negative if the model predicts worse than the mean (see the sketch below).
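
The same number can be computed directly from the definition. A minimal sketch, assuming a fitted regression model and data X, y (hypothetical names; sklearn's score method does the same calculation):

import numpy as np

yhat = model.predict(X)                        # model predictions
mse_model = np.mean((y - yhat) ** 2)           # mean squared error of the model
mse_mean = np.mean((y - np.mean(y)) ** 2)      # mean squared deviation from the mean
r2 = 1 - mse_model / mse_mean                  # equals model.score(X, y)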

Tree example

  • Same data as in 3a-trees
    • agr, bm, idiovol, mom12m, roeq
    • training data = 2021-12
    • test data = 2022-01
  • Quantile transform the features and ret within each cross-section (month)
  • Fit a tree with max_depth=2 (see the sketch below)
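
One way to implement these steps is sketched below, assuming a DataFrame df with columns date, ret, and the five features. The rank-based transform here is one version of a quantile transform; the course code may use sklearn's QuantileTransformer instead.

from sklearn.tree import DecisionTreeRegressor

features = ["agr", "bm", "idiovol", "mom12m", "roeq"]

# quantile (rank) transform features and ret within each month
cols = features + ["ret"]
df[cols] = df.groupby("date")[cols].transform(lambda x: x.rank(pct=True))

# training cross-section
Xtrain = df[df.date == "2021-12"][features]
ytrain = df[df.date == "2021-12"]["ret"]

# fit a shallow regression tree
model = DecisionTreeRegressor(max_depth=2)
model.fit(Xtrain, ytrain)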

View the regression tree

from sklearn.tree import plot_tree

# plot the fitted regression tree: splits, sample counts, and the mean of ret in each node
_ = plot_tree(model)

Scores on training and test data

  • R-squared on training data
model.score(Xtrain, ytrain)
0.18287695496398504
  • R-squared on test data
Xtest = df[df.date=='2022-01'][features]
ytest = df[df.date=='2022-01']["ret"]
model.score(Xtest, ytest)
0.03330573059873343

Depth and overfitting

  • If we make a model more complex (more parameters), it will fit the training data better but may make worse predictions on new data.
  • This is called overfitting.
  • We can make a tree more complex by increasing its depth (see the sketch below).
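
A minimal sketch of how to see this, assuming the transformed training and test data defined above; it refits the tree at increasing depths and compares training and test R-squared:

from sklearn.tree import DecisionTreeRegressor

for depth in range(1, 9):
    tree = DecisionTreeRegressor(max_depth=depth)
    tree.fit(Xtrain, ytrain)
    # training R-squared keeps rising with depth; test R-squared typically peaks and then falls
    print(depth, tree.score(Xtrain, ytrain), tree.score(Xtest, ytest))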

Complexity and scores for the tree model