Backtest: Preprocessing





Kerry Back

Filter

  • We might want to drop stocks with prices below some threshold (“drop penny stocks”).
  • We might want to drop stocks of certain sizes.
    • Maybe fewer opportunities in large caps?
    • Maybe drop microcaps because they’re harder to trade?
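The filters above can be sketched with pandas boolean indexing. This is a minimal sketch on a toy cross-section; the column names price and marketcap and the cutoff values are assumptions for illustration, not fixed conventions.

```python
import pandas as pd

# toy cross-section; "price" and "marketcap" are hypothetical column names
df = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "price": [0.80, 12.50, 45.00, 3.20],
    "marketcap": [5e6, 2e9, 8e11, 3e8],   # in dollars
})

PRICE_MIN = 5.0      # drop penny stocks below this price
MCAP_MIN = 50e6      # drop microcaps below this market cap
MCAP_MAX = 1e11      # optionally drop the largest caps too

filtered = df[
    (df.price >= PRICE_MIN)
    & df.marketcap.between(MCAP_MIN, MCAP_MAX)
]
```

Only BBB survives all three filters in this toy example.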

Deal with missing data

  • We need to have valid data for all features in each observation.
  • Can drop NaNs.
  • Or can fill NaNs.
    • Maybe use median or mean for the feature in that time period.
    • Or median or mean for the feature for that industry in that time period.
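Filling with the cross-sectional median can be sketched with groupby/transform. The column names date and bm are assumptions for illustration; filling with the industry median within a date works the same way with groupby(["date", "industry"]).

```python
import numpy as np
import pandas as pd

# toy panel with a missing value of a hypothetical "bm" feature in each month
df = pd.DataFrame({
    "date": ["2020-01", "2020-01", "2020-01", "2020-02", "2020-02"],
    "bm":   [0.5, np.nan, 1.5, 2.0, np.nan],
})

# fill each NaN with the median of that feature in the same time period
df["bm"] = df.groupby("date")["bm"].transform(lambda s: s.fillna(s.median()))
```

The 2020-01 NaN becomes the median of 0.5 and 1.5, and the 2020-02 NaN becomes 2.0 (the only valid value that month).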

Transform cross-sections

  • Rather than pooling across dates and scaling/transforming, it is probably better in our application to scale/transform each cross-section independently.
  • We can group by date and apply transformations to the groupby object.

  • Let’s look at an example: QuantileTransformer.
  • Quantile transformer maps quantiles of the sample distribution to quantiles of a target distribution (uniform or normal).
  • So, the transformed data has the target distribution.

from sklearn.preprocessing import QuantileTransformer

# map each cross-section to a standard normal distribution
qt = QuantileTransformer(output_distribution="normal")

grouped = df.groupby("date", group_keys=False)

# refit within each date so each cross-section is transformed independently
# (features is the list of feature column names)
df[features + ["ret"]] = grouped[features + ["ret"]].apply(
    lambda d: pd.DataFrame(
        qt.fit_transform(d),
        columns=d.columns,
        index=d.index,
    )
)

Industries

  • Industry membership can be used in various ways.
  • One example is adding industry dummy variables as features.
  • There are various ways to define industries, most of which are based on SIC codes.
  • We’ll look at the example of the Fama-French 12 industry classification.

  • The SIC code ranges for each industry are in siccodes12.csv, which was obtained from French’s data library.
inds = pd.read_csv(
  "files/siccodes12.csv", 
  index_col="industry"
)
inds
industry                start   end
Consumer Nondurables 100 999
Consumer Nondurables 2000 2399
Consumer Nondurables 2700 2749
Consumer Nondurables 2770 2799
Consumer Nondurables 3100 3199
Consumer Nondurables 3940 3989
Consumer Durables 2500 2519
Consumer Durables 2590 2599
Consumer Durables 3630 3659
Consumer Durables 3710 3711
Consumer Durables 3714 3714
Consumer Durables 3716 3716
Consumer Durables 3750 3751
Consumer Durables 3792 3792
Consumer Durables 3900 3939
Consumer Durables 3990 3999
Manufacturing 2520 2589
Manufacturing 2600 2699
Manufacturing 2750 2769
Manufacturing 3000 3099
Manufacturing 3200 3569
Manufacturing 3580 3629
Manufacturing 3700 3709
Manufacturing 3712 3713
Manufacturing 3715 3715
Manufacturing 3717 3749
Manufacturing 3752 3791
Manufacturing 3793 3799
Manufacturing 3830 3839
Manufacturing 3860 3899
Energy 1200 1399
Energy 2900 2999
Chemicals 2800 2829
Chemicals 2840 2899
Business Equipment 3570 3579
Business Equipment 3660 3692
Business Equipment 3694 3699
Business Equipment 3810 3829
Business Equipment 7370 7379
Telecommunications 4800 4899
Utilities 4900 4949
Shops 5000 5999
Shops 7200 7299
Shops 7600 7699
Healthcare 2830 2839
Healthcare 3693 3693
Healthcare 3840 3859
Healthcare 8000 8099
Finance 6000 6999

  • Find the range in which an SIC code lies to find its industry.
  • If it is not in any of the ranges, then the industry is “Other.”
def industry(sic):
  # return the first industry whose [start, end] range contains the SIC code
  try:
    return inds[(inds.start<=sic)&(sic<=inds.end)].index[0]
  except IndexError:
    return "Other"

  • We could loop over all observations and define the industry for each observation.
  • But it’s faster to pull the unique SIC codes, define the industry for each SIC code, and then do a one-to-many merge into the dataframe of all observations.
# map each unique SIC code to its industry
codes = pd.Series({code: industry(code) for code in df.siccd.unique()})
codes.name = "industry"
codes.index.name = "siccd"
# one-to-many merge of industries into the dataframe of all observations
df = df.merge(codes, on="siccd")

Polynomial features

  • PolynomialFeatures with degree=2 adds squares and pairwise products of the features.
  • degree=3 additionally adds terms like a*b*c and a**3.
  • Adding products facilitates including interactions between variables.
  • We can use polynomial features in a pipeline.
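A small sketch of what PolynomialFeatures produces, using a single row with two made-up features a=2 and b=3 (include_bias=False drops the constant column):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])                        # two features a and b
poly = PolynomialFeatures(degree=2, include_bias=False)
# columns are a, b, a**2, a*b, b**2
expanded = poly.fit_transform(X)
```

Here expanded is [[2, 3, 4, 6, 9]]: the original features followed by the squares and the interaction term.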

Pipeline

  • The advantage of putting transformations in a pipeline is that the exact same transformations will be automatically applied to new data when predictions are made.
  • sklearn code is
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(transform1, transform2, ..., model)
pipe.fit(...)
pipe.predict(...)

  • We will illustrate with OneHotEncoder (to create dummies) and PolynomialFeatures.
  • OneHotEncoder is only applied to the “industry” column, so we use make_column_transformer to apply different transformations to different columns.

from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# one-hot encode the industry column; pass the other columns through unchanged
transform1 = make_column_transformer(
    (OneHotEncoder(), ["industry"]),
    remainder="passthrough"
)
transform2 = PolynomialFeatures(degree=2)
pipe = make_pipeline(
    transform1,
    transform2,
    model     # any sklearn estimator
)