Industries

Definition of Industries

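Industry groupings are usually built on Standard Industrial Classification (SIC) codes.
We’ll use a couple of examples: the Fama-French 12 industries and the Fama-French 48 industries.
The mappings SIC code \(\mapsto\) industry can be found on Kenneth French’s website.
Here are the SIC code ranges for the 12-industry classification. The table lists the 11 named industries; any code outside these ranges falls into the twelfth industry, “Other”.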

industry	start	end
Consumer Nondurables	100	999
Consumer Nondurables	2000	2399
Consumer Nondurables	2700	2749
Consumer Nondurables	2770	2799
Consumer Nondurables	3100	3199
Consumer Nondurables	3940	3989
Consumer Durables	2500	2519
Consumer Durables	2590	2599
Consumer Durables	3630	3659
Consumer Durables	3710	3711
Consumer Durables	3714	3714
Consumer Durables	3716	3716
Consumer Durables	3750	3751
Consumer Durables	3792	3792
Consumer Durables	3900	3939
Consumer Durables	3990	3999
Manufacturing	2520	2589
Manufacturing	2600	2699
Manufacturing	2750	2769
Manufacturing	3000	3099
Manufacturing	3200	3569
Manufacturing	3580	3629
Manufacturing	3700	3709
Manufacturing	3712	3713
Manufacturing	3715	3715
Manufacturing	3717	3749
Manufacturing	3752	3791
Manufacturing	3793	3799
Manufacturing	3830	3839
Manufacturing	3860	3899
Energy	1200	1399
Energy	2900	2999
Chemicals	2800	2829
Chemicals	2840	2899
Business Equipment	3570	3579
Business Equipment	3660	3692
Business Equipment	3694	3699
Business Equipment	3810	3829
Business Equipment	7370	7379
Telecommunications	4800	4899
Utilities	4900	4949
Shops	5000	5999
Shops	7200	7299
Shops	7600	7699
Healthcare	2830	2839
Healthcare	3693	3693
Healthcare	3840	3859
Healthcare	8000	8099
Finance	6000	6999

Dummy variables

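We can use categorical variables in numerical models by creating dummy variables.
For each industry, we create a dummy variable that equals 1 if the firm is in the industry and 0 otherwise.
By including dummy variables in a linear regression, we allow each industry to have a different intercept.
With a dummy for every industry, the regression must omit the common intercept (or drop one dummy); otherwise the dummies sum to the constant and are perfectly collinear with it.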
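
Written out, the regression with industry dummies looks like the following. The notation is an assumption made for illustration: \(y_i\) is the outcome for firm \(i\), \(x_i\) collects the other features, \(J\) is the number of industries, and \(D_{ij}\) is the dummy for industry \(j\).

```latex
\[
y_i = \sum_{j=1}^{J} \alpha_j D_{ij} + \beta^\top x_i + \varepsilon_i,
\qquad
D_{ij} =
\begin{cases}
1 & \text{if firm } i \text{ is in industry } j \\
0 & \text{otherwise}
\end{cases}
\]
```

Exactly one dummy equals 1 for each firm, so a firm in industry \(j\) has intercept \(\alpha_j\), while the slopes \(\beta\) are shared across industries.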

Example

Get data as in the simple backtest, except:
- include siccd in the select statement
- include "where date='2020-01'" to make a smaller example
Define a function get_industry(siccd) that selects the right industry for each SIC code using the Fama-French classification.
Define the industry as “Other” if an SIC code doesn’t fit any of the ranges.
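
One possible version of get_industry is sketched below. The ranges are copied from the 12-industry table above and are assumed to be inclusive at both ends. The name FF12_RANGES is made up for this sketch, and the original course code may be organized differently.

```python
# SIC ranges (start, end) for each Fama-French 12 industry, copied from the
# table above. Both endpoints are assumed to be inclusive.
FF12_RANGES = {
    "Consumer Nondurables": [(100, 999), (2000, 2399), (2700, 2749),
                             (2770, 2799), (3100, 3199), (3940, 3989)],
    "Consumer Durables": [(2500, 2519), (2590, 2599), (3630, 3659),
                          (3710, 3711), (3714, 3714), (3716, 3716),
                          (3750, 3751), (3792, 3792), (3900, 3939),
                          (3990, 3999)],
    "Manufacturing": [(2520, 2589), (2600, 2699), (2750, 2769),
                      (3000, 3099), (3200, 3569), (3580, 3629),
                      (3700, 3709), (3712, 3713), (3715, 3715),
                      (3717, 3749), (3752, 3791), (3793, 3799),
                      (3830, 3839), (3860, 3899)],
    "Energy": [(1200, 1399), (2900, 2999)],
    "Chemicals": [(2800, 2829), (2840, 2899)],
    "Business Equipment": [(3570, 3579), (3660, 3692), (3694, 3699),
                           (3810, 3829), (7370, 7379)],
    "Telecommunications": [(4800, 4899)],
    "Utilities": [(4900, 4949)],
    "Shops": [(5000, 5999), (7200, 7299), (7600, 7699)],
    "Healthcare": [(2830, 2839), (3693, 3693), (3840, 3859), (8000, 8099)],
    "Finance": [(6000, 6999)],
}


def get_industry(siccd):
    """Return the industry whose range contains siccd, or "Other" if none does."""
    for industry, ranges in FF12_RANGES.items():
        for start, end in ranges:
            if start <= siccd <= end:
                return industry
    return "Other"
```

Assuming the dataframe is named df, as in the code later in this section, the function could be applied with df["industry"] = df["siccd"].map(get_industry).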

Head of the dataframe

date	ticker	bm	roeq	ret	siccd	industry
2020-01	A	0.221863	0.037268	-0.032235	3826	Business Equipment
	AAL	-0.011426	-1.040881	-0.064156	4512	Other
	AAMC	-4.432845	-0.013082	0.093927	6211	Finance
	AAME	2.085421	-0.039700	0.121827	6320	Finance
	AAN	0.623240	0.023518	0.039398	7359	Other

Creating dummies

pandas has a get_dummies function.
scikit-learn has OneHotEncoder.
We’ll use scikit-learn’s LinearRegression for both.

from sklearn.linear_model import LinearRegression

get_dummies

Create dummies, add to dataframe and include in features.

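# One 0/1 column per industry, indexed like df
d = pd.get_dummies(df["industry"])
ind_names = d.columns.to_list()
features = ["bm", "roeq"] + ind_names
df2 = df.join(d)

# No common intercept: each industry dummy's coefficient is that industry's intercept
model = LinearRegression(fit_intercept=False)
model.fit(df2[features], df2["ret"])
pd.Series(model.coef_, index=features)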

bm                     -0.005865
roeq                   -0.007340
Business Equipment      0.004721
Chemicals              -0.064114
Consumer Durables      -0.042278
Consumer Nondurables   -0.036132
Energy                 -0.163661
Finance                -0.032657
Healthcare              0.023304
Manufacturing          -0.037615
Other                   0.014954
Shops                  -0.043528
Telecommunications      0.020724
Utilities               0.027766
dtype: float64
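
The regression above drops the common intercept and keeps every dummy. An equivalent alternative is to keep the intercept and drop one dummy. The check below is a minimal sketch on made-up toy data (the toy dataframe and its values are hypothetical, not the data used above). It compares the two specifications.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "industry": ["Energy", "Energy", "Finance", "Finance", "Shops", "Shops"],
    "bm": [0.5, 1.0, 0.8, 1.5, 0.3, 0.6],
    "ret": [-0.10, -0.20, 0.01, -0.03, 0.04, 0.02],
})

# Specification A: a dummy for every industry, no common intercept
full = pd.get_dummies(toy["industry"], dtype=float)
Xa = pd.concat([toy[["bm"]], full], axis=1)
a = LinearRegression(fit_intercept=False).fit(Xa, toy["ret"])

# Specification B: drop the first dummy ("Energy") and keep the intercept
dropped = pd.get_dummies(toy["industry"], drop_first=True, dtype=float)
Xb = pd.concat([toy[["bm"]], dropped], axis=1)
b = LinearRegression(fit_intercept=True).fit(Xb, toy["ret"])

# Both specifications produce the same fitted values
same_fit = np.allclose(a.predict(Xa), b.predict(Xb))

# B's intercept equals A's coefficient on the omitted industry
coef_a = pd.Series(a.coef_, index=Xa.columns)
same_base = np.isclose(b.intercept_, coef_a["Energy"])
```

In specification B, each remaining dummy coefficient measures the difference from the omitted industry. In specification A, each dummy coefficient is that industry's own intercept, which is easier to read in a table like the one above.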

One Hot Encoder

Add to a pipeline with LinearRegression, then fit the pipeline.
Dummies are created as a preprocessing step within the pipeline but not added to the dataframe.

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

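# One-hot encode industry; pass bm and roeq through unchanged
transform = make_column_transformer(
    (OneHotEncoder(), ["industry"]),
    remainder="passthrough",
)

model = LinearRegression(fit_intercept=False)
pipe = make_pipeline(transform, model)
pipe.fit(df[["bm", "roeq", "industry"]], df["ret"])
# Coefficient order: industry dummies first, then the passthrough columns (bm, roeq)
model.coef_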

array([ 0.00472134, -0.06411248, -0.04227921, -0.03613172, -0.16366124,
       -0.03265685,  0.023304  , -0.03761542,  0.0149536 , -0.04352785,
        0.02072872,  0.02776291, -0.005865  , -0.00733964])