Forests

Kerry Back

Random Forests

Create multiple random sets of training data
Train a tree on each set and average the predictions
To create a random set of training data:
- Draw observations (rows) randomly from the original data with replacement
- Draw as many observations as in the original data
scikit-learn’s RandomForestRegressor and RandomForestClassifier create the random data sets, train the trees, and combine their predictions
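The steps above can be sketched by hand. This is a minimal illustration rather than scikit-learn's actual implementation, which has further options such as max_features. The data is synthetic and the names X, y, n_trees and forest_pred are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 200 observations, 5 features
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * rng.normal(size=200)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Draw row indices with replacement, as many as in the original data
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# The ensemble prediction is the average of the individual tree predictions
forest_pred = np.mean([t.predict(X) for t in trees], axis=0)
```

Because each tree sees a different random sample, the trees differ, and averaging them smooths out the noise in any single tree.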

Gradient Boosting

Fit a tree to the training data
Compute the errors from the tree, and fit a second tree to the errors
- The predicted values are now the predictions from the first tree plus a learning rate times the predictions from the second tree
Compute the errors and fit a third tree to the errors, etc.
A learning rate < 1 shrinks each tree's contribution, which avoids overshooting
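The sequence of fits described above can be written as a short loop. This is a sketch under the slide's description (first tree fit to the data, later trees fit to the errors), not scikit-learn's GradientBoostingRegressor itself. The synthetic data and the names learning_rate, n_trees, mse_start and mse_end are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * rng.normal(size=200)

learning_rate = 0.05
n_trees = 50

# The first tree is fit to the training data itself
first = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
pred = first.predict(X)
mse_start = np.mean((y - pred) ** 2)

for _ in range(n_trees - 1):
    # Errors of the current predictions
    errors = y - pred
    # Fit the next tree to the errors
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, errors)
    # Add only a fraction (the learning rate) of the new tree's predictions
    pred = pred + learning_rate * tree.predict(X)

mse_end = np.mean((y - pred) ** 2)
```

Comparing mse_start with mse_end shows how the training error changes as trees are added. A smaller learning rate makes each step more cautious, so more trees are needed.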

Adaptive Boosting

Trees are fit sequentially
Weights of observations are adjusted according to the errors from the current estimator
Later trees focus on more difficult observations
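One round of reweighting can be sketched as follows. The update rule is written in the style of AdaBoost.R2 with a linear loss and is simplified to a single round. Treat it as an illustration of the idea, not as the exact procedure inside scikit-learn's AdaBoostRegressor. The data and all variable names are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.5 * rng.normal(size=200)

n = len(X)
weights = np.full(n, 1.0 / n)  # start with equal weights

# Fit the current tree using the current observation weights
tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y, sample_weight=weights)

# Scale absolute errors to [0, 1] and compute the weighted average loss
abs_err = np.abs(y - tree.predict(X))
loss = abs_err / abs_err.max()
avg_loss = np.sum(weights * loss)

# beta < 1 when avg_loss < 0.5 (the full algorithm stops otherwise)
beta = avg_loss / (1.0 - avg_loss)

# Observations with small errors are shrunk the most, so observations
# with large errors carry relatively more weight in the next round
weights = weights * beta ** (1.0 - loss)
weights = weights / weights.sum()

hardest = np.argmax(abs_err)
easiest = np.argmin(abs_err)
```

After the update, weights[hardest] exceeds weights[easiest], which is what lets the next tree concentrate on the observations the current tree fit poorly.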

Example

Same data as in 3a-trees
- agr, bm, idiovol, mom12m, roeq
- training data = 2021-12
- test data = 2022-01
Quantile-transform the features and ret within each cross-section (each date)
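A cross-sectional quantile transform might look like the sketch below. The slides do not show this step, so the uniform output distribution, the loop over dates and the synthetic panel are all assumptions. Only the column names come from the list above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)

features = ["agr", "bm", "idiovol", "mom12m", "roeq"]
cols = features + ["ret"]

# Hypothetical panel: two months with 100 stocks each
df = pd.DataFrame(rng.normal(size=(200, len(cols))), columns=cols)
df["date"] = ["2021-12"] * 100 + ["2022-01"] * 100

# Transform each date separately so that values are ranked
# only against other stocks in the same cross-section
for date, idx in df.groupby("date").groups.items():
    qt = QuantileTransformer(
        n_quantiles=len(idx),           # at most the number of rows in the group
        output_distribution="uniform",  # assumption: map to [0, 1]
    )
    df.loc[idx, cols] = qt.fit_transform(df.loc[idx, cols])
```

Transforming within each date removes market-wide shifts in levels, so the model learns from the relative position of each stock in its month.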

Random Forest

Fitting a random forest

from sklearn.ensemble import RandomForestRegressor

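# df and features (agr, bm, idiovol, mom12m, roeq) are defined as in 3a-trees
# Training data: the 2021-12 cross-section
Xtrain = df[df.date=='2021-12'][features]
ytrain = df[df.date=='2021-12']["ret"]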

model = RandomForestRegressor(
  max_depth=3,
  random_state=0
)
model.fit(Xtrain, ytrain)

Complexity and Scores
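A comparison of complexity and scores could be produced along the following lines. This sketch uses synthetic data in place of the 2021-12 and 2022-01 cross-sections, and the depth grid, n_estimators=50 and the make_data helper are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data generator standing in for one cross-section
    X = rng.normal(size=(n, 5))
    y = X[:, 0] + 0.5 * rng.normal(size=n)
    return X, y

Xtrain, ytrain = make_data(300)
Xtest, ytest = make_data(300)

# Vary max_depth (model complexity) and record R-squared on both sets
scores = {}
for depth in [1, 2, 3, 4, 5]:
    model = RandomForestRegressor(
        max_depth=depth, n_estimators=50, random_state=0
    )
    model.fit(Xtrain, ytrain)
    scores[depth] = (
        model.score(Xtrain, ytrain),  # training R-squared
        model.score(Xtest, ytest),    # test R-squared
    )
```

Deeper trees can fit the training data more closely, so the gap between the two scores in each pair is a rough indicator of overfitting.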

Gradient Boosting

Fitting gradient boosting

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
  max_depth=3,
  learning_rate=0.05,  # fraction of each new tree's predictions that is added
  random_state=0
)
model.fit(Xtrain, ytrain)

Scores on Test Data
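Test scores for gradient boosting can be tracked as trees are added by using staged_predict, which yields the predictions after each boosting stage. The sketch below uses synthetic data in place of the actual train and test months. The make_data helper and the name test_scores are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data generator standing in for one cross-section
    X = rng.normal(size=(n, 5))
    y = X[:, 0] + 0.5 * rng.normal(size=n)
    return X, y

Xtrain, ytrain = make_data(300)
Xtest, ytest = make_data(300)

model = GradientBoostingRegressor(
    max_depth=3,
    learning_rate=0.05,
    random_state=0,
)
model.fit(Xtrain, ytrain)

# R-squared on the test data after 1, 2, ..., n_estimators trees
test_scores = [
    r2_score(ytest, pred) for pred in model.staged_predict(Xtest)
]
```

The last entry of test_scores matches model.score(Xtest, ytest). Plotting the list against the number of trees shows where additional trees stop helping out of sample.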

Adaptive Boosting

Fitting adaptive boosting

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor(
    DecisionTreeRegressor(  # base estimator that is refit at each boosting round
        max_depth=3,
        random_state=0
    ),
    learning_rate=0.5,
)
model.fit(Xtrain, ytrain)

Scores on Test Data