Backtest: Preprocessing





Kerry Back

Filter

  • We might want to drop stocks with prices below some threshold (“drop penny stocks”).
  • We might want to drop stocks of certain sizes.
    • Maybe fewer opportunities in large caps?
    • Maybe drop microcaps because they’re harder to trade?
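The filters above can be sketched with pandas boolean indexing. This is a minimal sketch on a toy cross-section; the column names price and marketcap and the cutoff values are assumptions for illustration, not fixed conventions.

```python
import pandas as pd

# toy cross-section; "price" and "marketcap" are hypothetical column names
df = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "price": [0.80, 12.50, 45.00, 3.20],
    "marketcap": [5e6, 2e9, 8e11, 3e8],   # in dollars
})

PRICE_MIN = 5.0      # drop penny stocks below this price
MCAP_MIN = 50e6      # drop microcaps below this market cap
MCAP_MAX = 1e11      # optionally drop the largest caps too

filtered = df[
    (df.price >= PRICE_MIN)
    & df.marketcap.between(MCAP_MIN, MCAP_MAX)
]
```

Only BBB survives all three filters in this toy example.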

Deal with missing data

  • We need to have valid data for all features in each observation.
  • Can drop NaNs.
  • Or can fill NaNs.
    • Maybe use median or mean for the feature in that time period.
    • Or median or mean for the feature for that industry in that time period.
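Filling with the cross-sectional median can be sketched with groupby/transform. The column names date and bm are assumptions for illustration; filling with the industry median within a date works the same way with groupby(["date", "industry"]).

```python
import numpy as np
import pandas as pd

# toy panel with a missing value of a hypothetical "bm" feature in each month
df = pd.DataFrame({
    "date": ["2020-01", "2020-01", "2020-01", "2020-02", "2020-02"],
    "bm":   [0.5, np.nan, 1.5, 2.0, np.nan],
})

# fill each NaN with the median of that feature in the same time period
df["bm"] = df.groupby("date")["bm"].transform(lambda s: s.fillna(s.median()))
```

The 2020-01 NaN becomes the median of 0.5 and 1.5, and the 2020-02 NaN becomes 2.0 (the only valid value that month).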

Transform cross-sections

  • Rather than pooling across dates and scaling/transforming, it is probably better in our application to scale/transform each cross-section independently.
  • We can group by date and apply transformations to the groupby object.

  • Let’s look at an example: QuantileTransformer.
  • Quantile transformer maps quantiles of the sample distribution to quantiles of a target distribution (uniform or normal).
  • So, the transformed data has the target distribution.

from sklearn.preprocessing import QuantileTransformer

# map each cross-section to a standard normal distribution
qt = QuantileTransformer(output_distribution="normal")

grouped = df.groupby("date", group_keys=False)

# refit within each date so each cross-section is transformed independently
# (features is the list of feature column names)
df[features + ["ret"]] = grouped[features + ["ret"]].apply(
    lambda d: pd.DataFrame(
        qt.fit_transform(d),
        columns=d.columns,
        index=d.index,
    )
)

Industries

  • Industry membership can be used in various ways.
  • One example is adding industry dummy variables as features.
  • There are various ways to define industries, most of which are based on SIC codes.
  • We’ll look at the example of the Fama-French 12 industry classification.

  • The SIC code ranges for each industry are in siccodes12.csv, which was obtained from French’s data library.
inds = pd.read_csv(
  "files/siccodes12.csv", 
  index_col="industry"
)
inds
industry                start   end
Consumer Nondurables 100 999
Consumer Nondurables 2000 2399
Consumer Nondurables 2700 2749
Consumer Nondurables 2770 2799
Consumer Nondurables 3100 3199
Consumer Nondurables 3940 3989
Consumer Durables 2500 2519
Consumer Durables 2590 2599
Consumer Durables 3630 3659
Consumer Durables 3710 3711
Consumer Durables 3714 3714
Consumer Durables 3716 3716
Consumer Durables 3750 3751
Consumer Durables 3792 3792
Consumer Durables 3900 3939
Consumer Durables 3990 3999
Manufacturing 2520 2589
Manufacturing 2600 2699
Manufacturing 2750 2769
Manufacturing 3000 3099
Manufacturing 3200 3569
Manufacturing 3580 3629
Manufacturing 3700 3709
Manufacturing 3712 3713
Manufacturing 3715 3715
Manufacturing 3717 3749
Manufacturing 3752 3791
Manufacturing 3793 3799
Manufacturing 3830 3839
Manufacturing 3860 3899
Energy 1200 1399
Energy 2900 2999
Chemicals 2800 2829
Chemicals 2840 2899
Business Equipment 3570 3579
Business Equipment 3660 3692
Business Equipment 3694 3699
Business Equipment 3810 3829
Business Equipment 7370 7379
Telecommunications 4800 4899
Utilities 4900 4949
Shops 5000 5999
Shops 7200 7299
Shops 7600 7699
Healthcare 2830 2839
Healthcare 3693 3693
Healthcare 3840 3859
Healthcare 8000 8099
Finance 6000 6999

  • Find the range in which an SIC code lies to find its industry.
  • If it is not in any of the ranges, then the industry is “Other.”
def industry(sic):
  # return the first industry whose [start, end] range contains the SIC code
  try:
    return inds[(inds.start<=sic)&(sic<=inds.end)].index[0]
  except IndexError:
    return "Other"

  • We could loop over all observations and define the industry for each observation.
  • But it’s faster to pull the unique SIC codes, define the industry for each SIC code, and then do a one-to-many merge into the dataframe of all observations.
# map each unique SIC code to its industry
codes = pd.Series({code: industry(code) for code in df.siccd.unique()})
codes.name = "industry"
codes.index.name = "siccd"
# one-to-many merge of industries into the dataframe of all observations
df = df.merge(codes, on="siccd")

Polynomial features

  • PolynomialFeatures with degree=2 adds squares and pairwise products of the features.
  • degree=3 additionally adds terms like a*b*c and a**3.
  • Adding products facilitates including interactions between variables.
  • We can use polynomial features in a pipeline.
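A small sketch of what PolynomialFeatures produces, using a single row with two made-up features a=2 and b=3 (include_bias=False drops the constant column):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])                        # two features a and b
poly = PolynomialFeatures(degree=2, include_bias=False)
# columns are a, b, a**2, a*b, b**2
expanded = poly.fit_transform(X)
```

Here expanded is [[2, 3, 4, 6, 9]]: the original features followed by the squares and the interaction term.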

Pipeline

  • The advantage of putting transformations in a pipeline is that the exact same transformations will be automatically applied to new data when predictions are made.
  • sklearn code is
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(transform1, transform2, ..., model)
pipe.fit(...)
pipe.predict(...)

  • We will illustrate with OneHotEncoder (to create dummies) and PolynomialFeatures.
  • OneHotEncoder is only applied to the “industry” column, so we use make_column_transformer to apply different transformations to different columns.

from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# one-hot encode the industry column; pass the other columns through unchanged
transform1 = make_column_transformer(
    (OneHotEncoder(), ["industry"]),
    remainder="passthrough"
)
transform2 = PolynomialFeatures(degree=2)
pipe = make_pipeline(
    transform1,
    transform2,
    model     # any sklearn estimator
)