Industries





Kerry Back

Definition of Industries

  • Industry groupings are usually built on Standard Industrial Classification (SIC) codes
  • We’ll use a couple of examples: Fama-French 12 industries and Fama-French 48 industries
  • The mappings SIC code \(\mapsto\) industry can be found on French’s website
  • Here are the SIC ranges for the 12 industries (11 are listed; the 12th, "Other", covers every code outside these ranges):
industry              start   end
Consumer Nondurables    100   999
Consumer Nondurables   2000  2399
Consumer Nondurables   2700  2749
Consumer Nondurables   2770  2799
Consumer Nondurables   3100  3199
Consumer Nondurables   3940  3989
Consumer Durables      2500  2519
Consumer Durables      2590  2599
Consumer Durables      3630  3659
Consumer Durables      3710  3711
Consumer Durables      3714  3714
Consumer Durables      3716  3716
Consumer Durables      3750  3751
Consumer Durables      3792  3792
Consumer Durables      3900  3939
Consumer Durables      3990  3999
Manufacturing          2520  2589
Manufacturing          2600  2699
Manufacturing          2750  2769
Manufacturing          3000  3099
Manufacturing          3200  3569
Manufacturing          3580  3629
Manufacturing          3700  3709
Manufacturing          3712  3713
Manufacturing          3715  3715
Manufacturing          3717  3749
Manufacturing          3752  3791
Manufacturing          3793  3799
Manufacturing          3830  3839
Manufacturing          3860  3899
Energy                 1200  1399
Energy                 2900  2999
Chemicals              2800  2829
Chemicals              2840  2899
Business Equipment     3570  3579
Business Equipment     3660  3692
Business Equipment     3694  3699
Business Equipment     3810  3829
Business Equipment     7370  7379
Telecommunications     4800  4899
Utilities              4900  4949
Shops                  5000  5999
Shops                  7200  7299
Shops                  7600  7699
Healthcare             2830  2839
Healthcare             3693  3693
Healthcare             3840  3859
Healthcare             8000  8099
Finance                6000  6999

Dummy variables

  • We can use categorical variables in numerical models by creating dummy variables.
  • For each industry, we create a dummy variable that equals 1 if the firm is in the industry and 0 otherwise.
  • By including dummy variables in a linear regression, we allow each industry to have a different intercept.
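
To make "a different intercept for each industry" concrete, here is a minimal sketch on made-up data (the toy dataframe and its numbers are illustrative assumptions, not the dataset used in these slides). With only the dummies as regressors and no common intercept, the least-squares coefficient on each dummy should equal that industry's mean return.

```python
import numpy as np
import pandas as pd

# Toy data: made-up returns for three industries (illustration only).
toy = pd.DataFrame({
    "industry": ["Energy", "Energy", "Finance", "Finance", "Shops"],
    "ret": [-0.10, -0.20, 0.02, 0.04, 0.01],
})

# One column per industry: 1 if the firm is in that industry, 0 otherwise.
dummies = pd.get_dummies(toy["industry"], dtype=float)

# Least squares with no common intercept: ret = sum_j a_j * dummy_j + error.
coefs, *_ = np.linalg.lstsq(
    dummies.to_numpy(), toy["ret"].to_numpy(), rcond=None
)
intercepts = pd.Series(coefs, index=dummies.columns)

# Each estimated intercept a_j matches the mean return of industry j.
group_means = toy.groupby("industry")["ret"].mean()
print(intercepts)
print(group_means)
```

The common intercept is left out on purpose: the dummies sum to 1 for every firm, so a full set of dummies plus a constant would be perfectly collinear.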

Example

  • Get data as in the simple backtest, except
    • include siccd in the select statement
    • include "where date='2020-01'" to make a smaller example
  • Define a function get_industry(siccd) that selects the right industry for each SIC code using the Fama-French classification.
  • Define the industry as “Other” if an SIC code doesn’t fit any of the ranges.
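
One way such a function could be written is sketched below. It assumes the ranges are typed in from the table of 12 industries above and that `siccd` is an integer. The name `FF12_RANGES` is a hypothetical helper introduced here, not part of any library.

```python
# Inclusive SIC ranges (start, end) for the named Fama-French 12 industries,
# copied from the table above. FF12_RANGES is a hypothetical helper name.
FF12_RANGES = {
    "Consumer Nondurables": [(100, 999), (2000, 2399), (2700, 2749),
                             (2770, 2799), (3100, 3199), (3940, 3989)],
    "Consumer Durables": [(2500, 2519), (2590, 2599), (3630, 3659),
                          (3710, 3711), (3714, 3714), (3716, 3716),
                          (3750, 3751), (3792, 3792), (3900, 3939),
                          (3990, 3999)],
    "Manufacturing": [(2520, 2589), (2600, 2699), (2750, 2769),
                      (3000, 3099), (3200, 3569), (3580, 3629),
                      (3700, 3709), (3712, 3713), (3715, 3715),
                      (3717, 3749), (3752, 3791), (3793, 3799),
                      (3830, 3839), (3860, 3899)],
    "Energy": [(1200, 1399), (2900, 2999)],
    "Chemicals": [(2800, 2829), (2840, 2899)],
    "Business Equipment": [(3570, 3579), (3660, 3692), (3694, 3699),
                           (3810, 3829), (7370, 7379)],
    "Telecommunications": [(4800, 4899)],
    "Utilities": [(4900, 4949)],
    "Shops": [(5000, 5999), (7200, 7299), (7600, 7699)],
    "Healthcare": [(2830, 2839), (3693, 3693), (3840, 3859), (8000, 8099)],
    "Finance": [(6000, 6999)],
}


def get_industry(siccd):
    """Return the Fama-French 12 industry name for an integer SIC code."""
    for industry, ranges in FF12_RANGES.items():
        for start, end in ranges:
            if start <= siccd <= end:
                return industry
    # Codes outside every listed range fall into the residual category.
    return "Other"


# SIC codes taken from the head of the dataframe shown in these slides.
for code in [3826, 4512, 6211, 6320, 7359]:
    print(code, get_industry(code))
```

Assuming `df` has an integer `siccd` column, the industry column could then be created with `df["industry"] = df["siccd"].map(get_industry)`.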

Head of the dataframe

date     ticker         bm       roeq        ret  siccd  industry
2020-01  A        0.221863   0.037268  -0.032235   3826  Business Equipment
         AAL     -0.011426  -1.040881  -0.064156   4512  Other
         AAMC    -4.432845  -0.013082   0.093927   6211  Finance
         AAME     2.085421  -0.039700   0.121827   6320  Finance
         AAN      0.623240   0.023518   0.039398   7359  Other

Creating dummies

  • pandas has a get_dummies function.
  • scikit-learn has OneHotEncoder.
  • We’ll use scikit-learn’s LinearRegression for both.
from sklearn.linear_model import LinearRegression

get_dummies

  • Create dummies, add to dataframe and include in features.
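import pandas as pd

# One 0/1 column per industry; columns are sorted alphabetically by name.
d = pd.get_dummies(df.industry)
ind_names = d.columns.to_list()
features = ["bm", "roeq"] + ind_names
df2 = df.join(d)

# No common intercept: the full set of dummies already spans a constant.
model = LinearRegression(fit_intercept=False)
model.fit(df2[features], df2["ret"])
pd.Series(model.coef_, index=features)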
bm                     -0.005865
roeq                   -0.007340
Business Equipment      0.004721
Chemicals              -0.064114
Consumer Durables      -0.042278
Consumer Nondurables   -0.036132
Energy                 -0.163661
Finance                -0.032657
Healthcare              0.023304
Manufacturing          -0.037615
Other                   0.014954
Shops                  -0.043528
Telecommunications      0.020724
Utilities               0.027766
dtype: float64

One Hot Encoder

  • Add to a pipeline with LinearRegression, then fit the pipeline.
  • Dummies are created as a preprocessing step within the pipeline but not added to the dataframe.
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

# One-hot encode "industry"; pass "bm" and "roeq" through unchanged.
transform = make_column_transformer(
    (OneHotEncoder(), ["industry"]),
    remainder="passthrough"
)

model = LinearRegression(fit_intercept=False)
pipe = make_pipeline(transform, model)
pipe.fit(df[["bm", "roeq", "industry"]], df["ret"])
# Coefficients: industry dummies first (alphabetical), then bm and roeq.
model.coef_
array([ 0.00472134, -0.06411248, -0.04227921, -0.03613172, -0.16366124,
       -0.03265685,  0.023304  , -0.03761542,  0.0149536 , -0.04352785,
        0.02072872,  0.02776291, -0.005865  , -0.00733964])