Forests





Kerry Back

Random Forests

  • Create multiple random sets of training data
  • Train a tree on each set and average the predictions
  • Each random set is a bootstrap sample of the original data:
    • Draw observations (rows) randomly from the original data with replacement
    • Draw as many observations as in the original data
  • scikit-learn’s RandomForestRegressor and RandomForestClassifier handle the sampling, the tree fitting, and the averaging
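The steps above can be sketched by hand. This is a minimal illustration on synthetic data, not scikit-learn's implementation: the helper names `bagged_trees` and `predict_average`, the data, and the choice of 50 trees are all assumptions made for the example, and the sketch leaves out the extra options the library estimators offer (such as `max_features`).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, y, n_trees=50, max_depth=3, seed=0):
    """Fit one tree per bootstrap sample and return the fitted trees."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        # Draw n row indices with replacement (a bootstrap sample)
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_average(trees, X):
    """Average the predictions of the individual trees."""
    return np.mean([t.predict(X) for t in trees], axis=0)

# Synthetic data, for illustration only (hypothetical, not the course data)
rng = np.random.default_rng(1)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=300)

trees = bagged_trees(X_demo, y_demo)
pred = predict_average(trees, X_demo)
```

Because each tree sees a different sample, the individual trees differ, and averaging them smooths out some of the noise any single tree picks up.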

Gradient Boosting

  • Fit a tree to the training data
  • Compute the errors (actual minus predicted values) from the first tree, and fit a second tree to those errors
    • The predicted values are now the predictions from the first tree plus a learning rate times the predictions from the second tree
  • Compute the errors of the combined prediction and fit a third tree to them, etc.
  • A learning rate < 1 shrinks each tree's contribution, which helps avoid overshooting
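The procedure above can be written out as a short loop. This is a sketch under stated assumptions: the helper names `boost` and `boost_predict`, the synthetic data, and the choice of 100 trees are hypothetical. It follows the description on this slide (the first tree enters at full weight); scikit-learn's GradientBoostingRegressor instead starts from a constant prediction and scales every tree by the learning rate, so its numbers would differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, learning_rate=0.05, max_depth=3):
    """Fit trees sequentially, each one to the current errors."""
    # The first tree is fit to the target itself
    first = DecisionTreeRegressor(max_depth=max_depth, random_state=0).fit(X, y)
    pred = first.predict(X)
    trees = [first]
    for _ in range(n_trees - 1):
        errors = y - pred  # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, errors)
        # Take only a small step in the direction of the new tree
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return trees

def boost_predict(trees, X, learning_rate=0.05):
    """First tree's prediction plus learning_rate times each later tree."""
    pred = trees[0].predict(X)
    for tree in trees[1:]:
        pred = pred + learning_rate * tree.predict(X)
    return pred

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + 0.1 * rng.normal(size=300)

trees = boost(X_demo, y_demo)
mse_first = np.mean((y_demo - trees[0].predict(X_demo)) ** 2)
mse_boost = np.mean((y_demo - boost_predict(trees, X_demo)) ** 2)
```

Comparing `mse_first` with `mse_boost` shows how much the later trees reduce the training error beyond the first tree alone.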

Adaptive Boosting

  • Trees are fit sequentially
  • After each tree is fit, the weights of the observations are adjusted according to the errors from the current estimator
  • Observations with larger errors get more weight, so later trees focus on the more difficult observations
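A single reweighting step might look like the following. This sketch uses the AdaBoost.R2 update with a linear loss, which is the algorithm the scikit-learn documentation names for AdaBoostRegressor; the synthetic data and variable names are assumptions for the example, and details of the library's implementation may differ.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.3 * rng.normal(size=500)

n = len(y_demo)
weights = np.full(n, 1.0 / n)  # start with equal weights

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X_demo, y_demo, sample_weight=weights)

# Linear loss: absolute error scaled to lie in [0, 1]
abs_err = np.abs(y_demo - tree.predict(X_demo))
loss = abs_err / abs_err.max()

# Weighted average loss and the resulting shrink factor
avg_loss = np.sum(weights * loss)
beta = avg_loss / (1.0 - avg_loss)

# Well-predicted observations (small loss) are scaled down the most;
# the worst-predicted observation keeps its weight before renormalizing
new_weights = weights * beta ** (1.0 - loss)
new_weights /= new_weights.sum()

hardest = np.argmax(abs_err)  # index of the worst-predicted observation
```

After renormalizing, the observation at `hardest` carries more than its original share of the weight, so the next tree pays more attention to it.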

Example

  • Same data as in 3a-trees
    • features = agr, bm, idiovol, mom12m, roeq
    • training data = 2021-12
    • test data = 2022-01
  • Quantile-transform the features and ret within each cross-section (each date)
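One way to apply a transform within each cross-section is a grouped percentile rank in pandas. The slides do not say which transform was used, so the rank-based version below is an assumption, as are the toy data frame `demo` and its values; scikit-learn's QuantileTransformer applied date by date would be an alternative.

```python
import numpy as np
import pandas as pd

# Toy data with two dates, for illustration only
rng = np.random.default_rng(0)
cols = ["agr", "bm", "idiovol", "mom12m", "roeq", "ret"]
demo = pd.DataFrame(rng.normal(size=(8, len(cols))), columns=cols)
demo["date"] = ["2021-12"] * 4 + ["2022-01"] * 4

# Percentile rank of each column within each date, so values lie in (0, 1]
ranked = demo.groupby("date")[cols].rank(pct=True)
transformed = demo[["date"]].join(ranked)
```

Ranking within each date puts every variable on the same scale in every month, regardless of how the raw distribution shifts over time.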

Random Forest

Fitting a random forest

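from sklearn.ensemble import RandomForestRegressor

# df is assumed to hold the data described in the Example section
features = ["agr", "bm", "idiovol", "mom12m", "roeq"]

# Train on the 2021-12 cross-section
Xtrain = df[df.date=='2021-12'][features]
ytrain = df[df.date=='2021-12']["ret"]

# Shallow trees (max_depth=3) limit the complexity of each tree
model = RandomForestRegressor(
    max_depth=3,
    random_state=0
)
model.fit(Xtrain, ytrain)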

Complexity and Scores
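The figure for this slide is not reproduced here. As a rough sketch of how complexity can be related to scores, the loop below varies `max_depth` and records the R² on training and test data. The synthetic data, the `make_data` helper, the depth grid, and `n_estimators=50` are assumptions for the example, not the course data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy synthetic data: only the first feature matters."""
    X = rng.uniform(-3, 3, size=(n, 2))
    y = np.sin(X[:, 0]) + 0.5 * rng.normal(size=n)
    return X, y

X_tr, y_tr = make_data(400)
X_te, y_te = make_data(400)

scores = {}
for depth in [1, 2, 3, 5, 8, 12]:
    rf = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    rf.fit(X_tr, y_tr)
    # .score returns R-squared for regressors
    scores[depth] = (rf.score(X_tr, y_tr), rf.score(X_te, y_te))

for depth, (train_r2, test_r2) in scores.items():
    print(f"max_depth={depth:2d}  train R2={train_r2:.3f}  test R2={test_r2:.3f}")
```

Deeper trees fit the training data more closely, so the training score keeps rising with depth, while the test score shows whether the extra complexity generalizes.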

Gradient Boosting

Fitting gradient boosting

from sklearn.ensemble import GradientBoostingRegressor

# Xtrain and ytrain are the 2021-12 training data defined for the random forest
model = GradientBoostingRegressor(
    max_depth=3,
    learning_rate=0.05,
    random_state=0
)
model.fit(Xtrain, ytrain)

Scores on Test Data

Adaptive Boosting

Fitting adaptive boosting

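from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor

# The base learner (a depth-3 tree) is passed positionally because its keyword
# name differs across scikit-learn versions (base_estimator, later estimator)
model = AdaBoostRegressor(
    DecisionTreeRegressor(
        max_depth=3,
        random_state=0
    ),
    learning_rate=0.5,
)
model.fit(Xtrain, ytrain)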

Scores on Test Data