Trees

Kerry Back

Decision Trees

  • Split data sequentially into subsets based on the value of a single feature
    • Above a threshold into one group
    • Below the threshold into the other
  • Prediction in each subset is the plurality class (for classification) or the cell mean (for regression).
  • Each split is chosen to minimize impurity (for classification) or, usually, mean squared error (for regression).
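
To make the splitting rule concrete, here is a minimal sketch of how one regression split on a single feature could be chosen: scan the candidate thresholds and keep the one with the lowest total squared error around the two cell means (equivalent to minimizing the sample-weighted mean squared error). The function name `best_split` and the toy data are hypothetical and only for illustration; scikit-learn's own implementation is more elaborate.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the single split on x with the lowest squared error."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate equal feature values
        left, right = y[:i], y[i:]
        # Squared error when each cell is predicted by its own mean
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold = (x[i - 1] + x[i]) / 2  # midpoint between neighbors
            best_sse = sse
    return best_threshold, best_sse

# Toy data: the target jumps from 0 to 1 between x = 3 and x = 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
threshold, sse = best_split(x, y)
print(threshold, sse)
```

A full tree repeats this search over every feature, then recurses into each of the two resulting subsets until a stopping rule such as a maximum depth is reached.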

Example: train from 2021-12, predict for 2022-01

  • Get data from the SQL database
import pandas as pd

# conn is assumed to be an open database connection created earlier
df = pd.read_sql(
    """
    select ticker, date, ag, bm, idiovol, mom12m, roeq, ret
    from data
    where date in ('2021-12', '2022-01')
    """,
    conn
)
features = ["ag", "bm", "idiovol", "mom12m", "roeq"]

Transform each cross-section

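from sklearn.preprocessing import QuantileTransformer

# Map each column to a standard normal distribution via its empirical quantiles
qt = QuantileTransformer(output_distribution="normal")

def qtxs(d):
    # Transform one cross-section (one date), keeping column names and index
    x = qt.fit_transform(d)
    return pd.DataFrame(x, columns=d.columns, index=d.index)

# Apply the transformation separately within each date
df[features + ["ret"]] = df.groupby(
    "date",
    group_keys=False
)[features + ["ret"]].apply(qtxs)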
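
The effect of transforming within each date can be seen on toy data. The sketch below is an illustration under assumptions: the frame `toy`, its column `x`, the date labels, and the choice `n_quantiles=100` are all hypothetical. The two dates have very different scales, yet after the per-date transform each cross-section is centered near zero.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "date": ["A"] * 200 + ["B"] * 200,
    # Skewed values; date B is 100 times larger in scale than date A
    "x": np.concatenate([rng.lognormal(size=200), 100 * rng.lognormal(size=200)]),
})

def to_normal(d):
    # Fit a fresh transformer on one cross-section only
    qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
    return pd.DataFrame(qt.fit_transform(d), columns=d.columns, index=d.index)

transformed = toy.groupby("date", group_keys=False)[["x"]].apply(to_normal)

# Median of the transformed values within each date
medians = transformed.groupby(toy["date"]).median()
print(medians)
```

Because the transform only uses ranks within a date, a tree fit to the transformed data learns relationships between relative positions in the cross-section rather than raw levels.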

Fit a regression tree

from sklearn.tree import DecisionTreeRegressor

Xtrain = df[df.date=='2021-12'][features]
ytrain = df[df.date=='2021-12']["ret"]

model = DecisionTreeRegressor(
  max_depth=3,
  random_state=0
)
model.fit(Xtrain, ytrain)
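
Once fitted, the tree predicts by assigning each observation the mean of the training cell it falls into. The sketch below shows this on synthetic stand-in data (the arrays, their sizes, and the name `tree` are assumptions, not the database used here): a tree of depth 3 has at most 2^3 = 8 leaves, so it can produce at most 8 distinct predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic stand-ins for a training month and a prediction month
X_train = rng.normal(size=(500, 5))
y_train = 0.5 * X_train[:, 2] + rng.normal(scale=0.5, size=500)
X_test = rng.normal(size=(200, 5))

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

pred = tree.predict(X_test)
n_leaves = tree.get_n_leaves()
n_distinct = len(np.unique(pred))
print(n_leaves, n_distinct)
```

With the data in this example, the analogous call would be `model.predict(df[df.date=='2022-01'][features])`.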

View the regression tree

from sklearn.tree import plot_tree

# Label each split with the feature name instead of its column position
_ = plot_tree(model, feature_names=features)

Feature importance

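  • What fraction of the total reduction in mean squared error comes from splits on each feature? The fractions sum to 1, and a feature that is never used in a split gets 0.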
pd.Series(model.feature_importances_, index=Xtrain.columns)
ag         0.035596
bm         0.000000
idiovol    0.858119
mom12m     0.106285
roeq       0.000000
dtype: float64
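
One way to see what these numbers measure is to fit a tree to synthetic data in which only one feature drives the target. The sketch below is an illustration under assumptions: the column names `f0`, `f1`, `f2` and the data-generating process are hypothetical. The importances are nonnegative, sum to one, and concentrate on the informative feature.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f0", "f1", "f2"])
# Only f2 matters for the target; f0 and f1 are pure noise
y = 2.0 * X["f2"] + rng.normal(scale=0.1, size=1000)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances)
```

A zero importance, as for bm and roeq above, means only that the fitted tree never split on that feature; it does not by itself show that the feature is unrelated to returns.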