Trees

Kerry Back

Decision Trees

Split the data sequentially into subsets, each time based on the value of a single feature:
- Above a threshold into one group
- Below the threshold into the other
Prediction in each subset is the plurality class (for classification) or the mean of the target in that subset (for regression).
Each split is chosen to minimize impurity in classification and (usually) mean squared error in regression.
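
To make the split criterion concrete, here is a minimal sketch of how one split on one feature could be chosen for regression. It tries every midpoint between adjacent sorted values and keeps the one with the smallest total squared error around the two group means. The function name best_split and the toy arrays are made up for illustration; scikit-learn's own search is more elaborate.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the split on x that minimizes total squared error."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate equal feature values
        left, right = y_sorted[:i], y_sorted[i:]
        # Each group is predicted by its own mean, so the error is the sum of squared deviations
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse

# Toy data (illustrative only): the target jumps when x goes from 3 to 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.1, -0.1, 1.0, 1.1, 0.9])
threshold, sse = best_split(x, y)
print(threshold, sse)
```

A tree repeats this search over all features at each node, then recurses into the two groups until a stopping rule such as max_depth is reached.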

Example

Example: train on 2021-12 data, predict for 2022-01

Get data from the SQL database

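import pandas as pd

# conn is an open connection to the SQL database, created beforehand
df = pd.read_sql(
    """
    select ticker, date, ag, bm, idiovol, mom12m, roeq, ret
    from data
    where date in ('2021-12', '2022-01')
    """,
    conn
)
# predictors used in the tree; ret is the target
features = ["ag", "bm", "idiovol", "mom12m", "roeq"]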

Transform each cross-section

from sklearn.preprocessing import QuantileTransformer
qt = QuantileTransformer(output_distribution="normal")

def qtxs(d):
    # map each column of one cross-section to a standard normal via its quantiles
    x = qt.fit_transform(d)
    return pd.DataFrame(x, columns=d.columns, index=d.index)

# transform features and returns separately within each date
df[features + ["ret"]] = df.groupby(
    "date",
    group_keys=False
)[features + ["ret"]].apply(qtxs)
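
As a rough check of what the per-date transform does, the sketch below applies the same qtxs function to synthetic data in which the two months have very different scales. The column name x and the lognormal draws are made up for illustration. After the transform, each month's cross-section should have mean close to 0 and standard deviation close to 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
n = 1000  # synthetic stocks per month (illustrative)
df = pd.DataFrame({
    "date": ["2021-12"] * n + ["2022-01"] * n,
    # second month is on a scale 100 times larger than the first
    "x": np.concatenate([rng.lognormal(size=n), 100 * rng.lognormal(size=n)]),
})

qt = QuantileTransformer(output_distribution="normal")

def qtxs(d):
    # refit the transformer on this cross-section only
    x = qt.fit_transform(d)
    return pd.DataFrame(x, columns=d.columns, index=d.index)

out = df.copy()
out[["x"]] = df.groupby("date", group_keys=False)[["x"]].apply(qtxs)

# mean and standard deviation of the transformed variable, month by month
summary = out.groupby("date")["x"].agg(["mean", "std"])
print(summary)
```

Because the transformer is refit within each date, differences in level or scale across months are removed and only the cross-sectional ranking of each stock is kept.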

Fit a regression tree

from sklearn.tree import DecisionTreeRegressor

# training sample: the 2021-12 cross-section
train = df[df.date == "2021-12"]
Xtrain = train[features]
ytrain = train["ret"]

model = DecisionTreeRegressor(
    max_depth=3,      # at most 2**3 = 8 leaves
    random_state=0
)
model.fit(Xtrain, ytrain)
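
The prediction step for 2022-01 could look like the sketch below. Since the database is not available here, it uses a synthetic stand-in for the transformed data with the same column names; the data-generating rule and the names test and pred are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
features = ["ag", "bm", "idiovol", "mom12m", "roeq"]

# Synthetic stand-in for the transformed data: 200 stocks in each of two months
n = 200
df = pd.DataFrame(rng.standard_normal((2 * n, len(features))), columns=features)
df["date"] = ["2021-12"] * n + ["2022-01"] * n
df["ret"] = -0.5 * df["idiovol"] + rng.standard_normal(2 * n)  # made-up relationship

train = df[df.date == "2021-12"]
test = df[df.date == "2022-01"]

model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(train[features], train["ret"])

# Each prediction is the mean training return of the leaf the stock falls into
pred = pd.Series(model.predict(test[features]), index=test.index, name="predicted_ret")
n_leaves = model.get_n_leaves()
print(pred.head())
print(n_leaves, pred.nunique())
```

With max_depth=3 there are at most 8 leaves, so the predictions take at most 8 distinct values.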

View the regression tree

from sklearn.tree import plot_tree
# label the nodes with feature names instead of column positions
_ = plot_tree(model, feature_names=features)
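
When a plot is inconvenient, the same tree can be printed as indented text with scikit-learn's export_text. The sketch below fits a small tree on synthetic data (the data-generating rule is made up for illustration) and prints its splits and leaf values.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
features = ["ag", "bm", "idiovol", "mom12m", "roeq"]

# Synthetic data in which the target depends almost entirely on idiovol
X = pd.DataFrame(rng.standard_normal((300, len(features))), columns=features)
y = -2.0 * X["idiovol"] + 0.1 * rng.standard_normal(300)

model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# One line per node: split rules for internal nodes, predicted value for leaves
text = export_text(model, feature_names=features)
print(text)
```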

Feature importance

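What fraction of the splitting is each feature responsible for? In scikit-learn, a feature's importance is the total reduction in the splitting criterion (here, squared error) from splits on that feature, normalized so the importances sum to 1.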

pd.Series(model.feature_importances_, index=Xtrain.columns)

ag         0.035596
bm         0.000000
idiovol    0.858119
mom12m     0.106285
roeq       0.000000
dtype: float64
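
To see where these numbers come from, the sketch below recomputes the importances by hand from the fitted tree's node arrays and compares them with feature_importances_. It uses synthetic data (the data-generating rule is made up for illustration) and relies on the tree_ attributes children_left, children_right, feature, impurity, and weighted_n_node_samples.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(500)  # made-up relationship

model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
t = model.tree_

importance = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:
        continue  # leaf node: no split, so no contribution
    # Sample-weighted drop in impurity (MSE) produced by this split
    decrease = (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )
    importance[t.feature[node]] += decrease

importance /= importance.sum()  # normalize so the importances sum to 1
print(importance)
print(model.feature_importances_)
```

Features that the tree never splits on receive no contribution in the loop, so their importance is exactly 0.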