Scoring and Complexity





Kerry Back

Model score

  • The default scoring method for regression models is R-squared.
  • Equals 1 - MSE / (mean squared deviation of the target from its mean)
    • MSE = mean squared error
    • So we compare the model's MSE to the MSE from using the mean as the prediction
  • The maximum R-squared is 1, and it can be negative if the model predicts worse than the mean (see the sketch below).
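
The same number can be computed directly from the definition. A minimal sketch, assuming a fitted regression model and data X, y (hypothetical names; sklearn's score method does the same calculation):

import numpy as np

yhat = model.predict(X)                        # model predictions
mse_model = np.mean((y - yhat) ** 2)           # mean squared error of the model
mse_mean = np.mean((y - np.mean(y)) ** 2)      # mean squared deviation from the mean
r2 = 1 - mse_model / mse_mean                  # equals model.score(X, y)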

Tree example

  • Same data as in 3a-trees
    • agr, bm, idiovol, mom12m, roeq
    • training data = 2021-12
    • test data = 2022-01
  • Quantile transform the features and ret within each cross-section (month)
  • Fit a tree with max_depth=2 (see the sketch below)
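
One way to implement these steps is sketched below, assuming a DataFrame df with columns date, ret, and the five features. The rank-based transform here is one version of a quantile transform; the course code may use sklearn's QuantileTransformer instead.

from sklearn.tree import DecisionTreeRegressor

features = ["agr", "bm", "idiovol", "mom12m", "roeq"]

# quantile (rank) transform features and ret within each month
cols = features + ["ret"]
df[cols] = df.groupby("date")[cols].transform(lambda x: x.rank(pct=True))

# training cross-section
Xtrain = df[df.date == "2021-12"][features]
ytrain = df[df.date == "2021-12"]["ret"]

# fit a shallow regression tree
model = DecisionTreeRegressor(max_depth=2)
model.fit(Xtrain, ytrain)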

View the regression tree

from sklearn.tree import plot_tree

# plot the fitted regression tree: splits, sample counts, and the mean of ret in each node
_ = plot_tree(model)

Scores on training and test data

  • R-squared on training data
model.score(Xtrain, ytrain)
0.18287695496398504
  • R-squared on test data
Xtest = df[df.date=='2022-01'][features]
ytest = df[df.date=='2022-01']["ret"]
model.score(Xtest, ytest)
0.03330573059873343

Depth and overfitting

  • If we make a model more complex (more parameters), it will fit the training data better but may make worse predictions on new data.
  • This is called overfitting.
  • We can make a tree more complex by increasing its depth (see the sketch below).
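
A minimal sketch of how to see this, assuming the transformed training and test data defined above; it refits the tree at increasing depths and compares training and test R-squared:

from sklearn.tree import DecisionTreeRegressor

for depth in range(1, 9):
    tree = DecisionTreeRegressor(max_depth=depth)
    tree.fit(Xtrain, ytrain)
    # training R-squared keeps rising with depth; test R-squared typically peaks and then falls
    print(depth, tree.score(Xtrain, ytrain), tree.score(Xtest, ytest))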

Complexity and scores for the tree model