Workflow that incorporates various encoding methods
Encoding Methods
In this section, we describe the seven encoding methods that we compare in our experiments. We also use a tiny dataset to show how each encoding method works.
Below is the data that we are going to encode. It contains one categorical variable and a boolean target variable:
| ID | Type | Failure |
|---|---|---|
| 1 | A | True (1) |
| 2 | A | False (0) |
| 3 | A | True (1) |
| 4 | B | False (0) |
| 5 | B | True (1) |
| 6 | C | False (0) |
| 7 | C | False (0) |
| 8 | D | True (1) |
We will use each encoding method to encode the Type variable. If a target variable is necessary for the encoding method, then we will use the 1 and 0 values of the Failure target variable.
Dummy (One-Hot) Encoding
For a categorical variable with n levels, this method creates n–1 new boolean variables. Each variable takes the value 0 or 1, where 1 means that the row belongs to the corresponding level and 0 means that it does not. The one remaining level (D in this example) is represented by all zeros. This method does not require a target variable.
| Type | Variable1 (Is Type==A?) | Variable2 (Is Type==B?) | Variable3 (Is Type==C?) |
|---|---|---|---|
| A | 1 | 0 | 0 |
| B | 0 | 1 | 0 |
| C | 0 | 0 | 1 |
| D | 0 | 0 | 0 |
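As a minimal sketch of the idea (not the implementation used in the experiments), dummy encoding can be written in a few lines of plain Python. The function name `dummy_encode` and the choice of the last level as the all-zero reference level are assumptions made for illustration.

```python
# Toy data from the example above: the Type column.
types = ["A", "A", "A", "B", "B", "C", "C", "D"]


def dummy_encode(values, levels=None):
    """Return one row of n-1 indicator values (0 or 1) per input value.

    The last level is treated as the reference level and is encoded
    as all zeros. (Hypothetical helper, for illustration only.)
    """
    if levels is None:
        levels = sorted(set(values))
    kept = levels[:-1]  # drop the reference level
    return [[1 if v == level else 0 for level in kept] for v in values]


encoded = dummy_encode(types)
for t, row in zip(types, encoded):
    print(t, row)
```

Which level is dropped is a convention. Dropping one level avoids a redundant column, since its value is implied by the others.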
Helmert Encoding
This method creates n–1 new variables. The main difference from one-hot encoding is that each new variable can take many values, not just 0 or 1. Within each new variable, the values sum to 0 across the levels. This method does not require a target variable. It can be useful for categorical variables that are ordered.
| Type | Variable1 | Variable2 | Variable3 |
|---|---|---|---|
| A | 3/4 | 0 | 0 |
| B | -1/4 | 2/3 | 0 |
| C | -1/4 | -1/3 | 1/2 |
| D | -1/4 | -1/3 | -1/2 |
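The following sketch builds a zero-sum contrast matrix of this shape for any number of levels. It assumes one particular Helmert convention: in column j, level j is compared against the levels that come after it, so the entries below the diagonal are negative. The function name `helmert_matrix` is hypothetical.

```python
from fractions import Fraction


def helmert_matrix(n):
    """Contrast matrix with n rows (levels) and n-1 columns (new variables).

    Assumed convention: column j contrasts level j with the levels after it.
    """
    rows = []
    for i in range(n):
        row = []
        for j in range(n - 1):
            remaining = n - j  # number of levels still involved in column j
            if i < j:
                row.append(Fraction(0))
            elif i == j:
                row.append(Fraction(remaining - 1, remaining))
            else:
                row.append(Fraction(-1, remaining))
        rows.append(row)
    return rows


matrix = helmert_matrix(4)
for level, row in zip("ABCD", matrix):
    print(level, [str(x) for x in row])

# Each column sums to zero across the levels.
column_sums = [sum(row[j] for row in matrix) for j in range(3)]
print(column_sums)
```

Exact fractions are used so that the zero-sum property can be checked without floating-point error.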
Binary Encoding
This method creates log2(n) new variables, rounded up to a whole number. Each new variable is either 0 or 1. Each level is converted to a binary number, and each digit of that number becomes a new variable. This method does not require a target variable.
| Type | Variable1 | Variable2 |
|---|---|---|
| A | 0 | 0 |
| B | 0 | 1 |
| C | 1 | 0 |
| D | 1 | 1 |
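A small sketch of this scheme, assuming the levels are numbered from 0 in sorted order (some libraries start numbering at 1, which changes the codes but not the idea). The function name `binary_encode` is hypothetical.

```python
def binary_encode(values, levels=None):
    """Encode each value as the binary digits of its level index.

    Levels are indexed from 0 in sorted order (an assumption of this sketch).
    """
    if levels is None:
        levels = sorted(set(values))
    # Number of bits needed to represent indices 0..n-1, i.e. ceil(log2(n)).
    width = max(1, (len(levels) - 1).bit_length())
    index = {level: i for i, level in enumerate(levels)}
    return [
        [(index[v] >> shift) & 1 for shift in range(width - 1, -1, -1)]
        for v in values
    ]


for level, bits in zip("ABCD", binary_encode(["A", "B", "C", "D"])):
    print(level, bits)
```

With four levels this needs only two variables, compared with three for dummy encoding. The saving grows quickly as the number of levels increases.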
Frequency Encoding
This method uses the frequency of the level as the encoded value. This method does not require a target variable. Different levels of the categorical variable may have the same encoded value.
| Type | EncodedValue |
|---|---|
| A | 3/8 |
| B | 2/8 |
| C | 2/8 |
| D | 1/8 |
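As a rough illustration, the frequencies for the toy data can be computed as follows. The function name `frequency_encoding` is hypothetical, and `Fraction` is used only so that the output can be compared with the fractions shown for the toy data (note that 2/8 is printed in reduced form as 1/4).

```python
from collections import Counter
from fractions import Fraction

types = ["A", "A", "A", "B", "B", "C", "C", "D"]


def frequency_encoding(values):
    """Map each level to its relative frequency in the data."""
    counts = Counter(values)
    total = len(values)
    return {level: Fraction(count, total) for level, count in counts.items()}


mapping = frequency_encoding(types)
for level in sorted(mapping):
    print(level, mapping[level])
```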
Hash Encoding
This encoding method hashes the level to some value by using a standard hashing algorithm (e.g. MD5), and then computes the remainder modulo a specified number of levels. The advantage of this method is that if some categories are missing from the training set, new levels found in the test set can still be encoded using the same hash function. We do not need to manually create a default value for unseen categories, and different unseen categories can be encoded to different values. Because the number of output values is limited, different levels of the categorical variable may have the same encoded value.
In this example, suppose the number of levels that we specify is 3. The hashed values below are fabricated for illustration.
| Type | HashValue | EncodedValue |
|---|---|---|
| A | 17 | 2 |
| B | 42 | 0 |
| C | 25 | 1 |
| D | 4 | 1 |
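A sketch of the same two steps with a real hash function is shown below. It uses MD5 from Python's `hashlib`; the function name `hash_encode` and the parameter `n_buckets` are assumptions for illustration. Because the hash values in the example above are fabricated, the real MD5 output will generally give different encoded values.

```python
import hashlib


def hash_encode(level, n_buckets=3):
    """Hash the level name with MD5 and reduce it modulo n_buckets."""
    digest = hashlib.md5(level.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


# The modulo step applied to the fabricated hash values from the example.
fabricated = {"A": 17, "B": 42, "C": 25, "D": 4}
print({level: h % 3 for level, h in fabricated.items()})

# Real MD5-based encoding, including a level ("E") never seen in training.
for level in ["A", "B", "C", "D", "E"]:
    print(level, hash_encode(level))
```

No lookup table is stored, so an unseen level such as E is handled by the same function as every other level.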
Impact Encoding
Impact encoding takes the target variable into consideration by computing the mean of the target variable for each level. A level that is typically associated with larger target values therefore receives a larger encoded value. Different levels of the categorical variable may have the same encoded value.
| Type | Encoded Value |
|---|---|
| A | 2/3 |
| B | 1/2 |
| C | 0 |
| D | 1 |
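The per-level mean can be sketched as follows for the toy data. The function name `impact_encoding` is hypothetical, and this plain version applies no smoothing toward the overall mean, which practical implementations often add.

```python
from collections import defaultdict
from fractions import Fraction

types = ["A", "A", "A", "B", "B", "C", "C", "D"]
failure = [1, 0, 1, 0, 1, 0, 0, 1]


def impact_encoding(levels, targets):
    """Map each level to the mean of the target over rows with that level."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    return {level: Fraction(sums[level], counts[level]) for level in counts}


mapping = impact_encoding(types, failure)
for level in sorted(mapping):
    print(level, mapping[level])
```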
Weight-of-Evidence (WoE) Encoding
This encoding method is only used for binary classification problems. The encoded value is WoE = ln(# of True / # of False), computed for each level. Since the number of True or False values could be 0, we need some smoothing to handle those cases. A naive approach is to manually set the number of True or False values to a tiny number if it is 0. Different levels of the categorical variable may have the same encoded value.
| Type | Encoded Value |
|---|---|
| A | 0.693147 |
| B | 0 |
| C | -13.815511 |
| D | 13.815511 |
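Here is a minimal sketch of the naive smoothing described above. The function name `woe_encoding` and the constant `eps = 1e-6` are assumptions. The exact value for a level with a zero count depends on the smoothing choice, so only the levels with nonzero counts are fully determined by the formula.

```python
import math
from collections import Counter

types = ["A", "A", "A", "B", "B", "C", "C", "D"]
failure = [1, 0, 1, 0, 1, 0, 0, 1]


def woe_encoding(levels, targets, eps=1e-6):
    """WoE = ln(# of True / # of False) per level, with naive smoothing.

    A zero count is replaced by eps (an assumed tiny constant).
    """
    n_true = Counter(l for l, t in zip(levels, targets) if t == 1)
    n_false = Counter(l for l, t in zip(levels, targets) if t == 0)
    result = {}
    for level in sorted(set(levels)):
        pos = n_true[level] or eps   # Counter returns 0 for missing keys
        neg = n_false[level] or eps
        result[level] = math.log(pos / neg)
    return result


woe = woe_encoding(types, failure)
for level, value in woe.items():
    print(level, round(value, 6))
```

A level with no True rows gets a large negative value and a level with no False rows gets a large positive value, which is why the choice of the tiny constant matters.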
Datasets Overview
Below are the datasets that we experiment on. Each of them has at least one categorical variable to be encoded.

Diamonds

3 categorical, 6 continuous, 53940 rows in total

Categorical variables: cut (5 distinct levels), color (7), clarity (8)

Response: price


Adult Income

8 categorical, 5 continuous, 48842 rows

Categorical: workclass (8 distinct levels), education (16), marital status (7), occupation (14), relationship (6), race (5), sex (2), native country (41)

Response: Income (Categorical, >50K or not)

Factors: age, workclass, education, marital status, occupation, relationship, race, sex, capital gain, capital loss, hours per week, native country

20% of data are positive (>50K)


San Francisco Crime

1 categorical, 0 continuous, 1M rows

Categorical: Address (more than 20K levels)

Response: whether the crime is violent or not


Vancouver Crime

3 categorical, 1 continuous, 500K rows

Categorical: month, day of week, Address (more than 20K levels)

Response: whether the crime is violent or not

Summary
| | Diamonds | Adult Income | SF Crime | Vancouver Crime |
|---|---|---|---|---|
| Task | Regression | Classification | Classification | Classification |
| # of Categorical Variables | 3 | 8 | 1 | 3 |
| # of levels | few | varies | many | many |
| Balanced (True:False) | N/A | 2:8 | 1:9 | 4:6 |
Testing the Methods
Below is a sample workflow, the one used for the diamonds dataset. We also build a baseline model, which only takes in the continuous variables, for comparison. A model that outperforms the baseline can be said to gain information from the encoded categorical variables.
Here is a summary of the steps shown in the above workflow:

Split the dataset into Training set and Test set.

Compute encoded value based on the Training set.

Apply the encoding rule to Test set.

Rebalance Training set if necessary. (For unbalanced dataset, we will compare the result with and without rebalancing.)

Apply Linear Regression / Logistic Regression / Random Forest to Training set.

Measure performance by MSE / R-squared / Variance Explained / F-score.
Note: When encoding, we must calculate encoded values based only on the training set. This is because the encoding rules are established and applied before the model is trained, when we only have access to the training data. That means we first split the dataset, then compute encoded values on the training set, and then apply the encoding rule to the test set. Therefore, we also need to handle the case where a level appears only in the test set. For example, we may manually create a level called other, whose encoded value is the average of the encoded values of all other levels. In the sample workflow above, we split the dataset inside a Python Notebook (which is called by the Python Execute operator), append a label to each row, and finally encode the training dataset. The operators called Train and Test are just filters based on that label.
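The split-then-encode order can be sketched as follows, using impact encoding as the example rule. This is an illustration under stated assumptions, not the notebook from the workflow: the helper names `fit_impact` and `apply_encoding`, the 75/25 split, and the random seed are all hypothetical.

```python
import random
from collections import defaultdict


def fit_impact(levels, targets):
    """Learn the encoding on the training set only.

    Returns the per-level mapping and a fallback value for unseen levels,
    taken as the average of the encoded values of all known levels.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for level, y in zip(levels, targets):
        sums[level] += y
        counts[level] += 1
    mapping = {level: sums[level] / counts[level] for level in counts}
    other = sum(mapping.values()) / len(mapping)
    return mapping, other


def apply_encoding(levels, mapping, other):
    """Apply a learned encoding; levels not seen in training get `other`."""
    return [mapping.get(level, other) for level in levels]


rows = [("A", 1), ("A", 0), ("A", 1), ("B", 0),
        ("B", 1), ("C", 0), ("C", 0), ("D", 1)]

# 1. Split first (assumed 75/25 split with a fixed seed).
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
cut = int(0.75 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

# 2. Compute encoded values from the training set only.
mapping, other = fit_impact([r[0] for r in train], [r[1] for r in train])

# 3. Apply the same rule to the test set.
test_encoded = apply_encoding([r[0] for r in test], mapping, other)
print(mapping, other, test_encoded)
```

Computing the mapping before the split would let test-set target values leak into the features, which is what this ordering avoids.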
Encoding Comparison Results
Diamonds Dataset
We try to predict the price of each diamond based on its attributes, such as color, cut, clarity, etc. We use Linear Regression and Random Forest for this task.
| Encoding | MSE (Linear Regression) | MSE (Random Forest) | R-squared (Random Forest) | R-squared (Linear Regression) |
|---|---|---|---|---|
| Impact | 1.65E+06 | 5.80E+05 | 0.9639 | 0.8953 |
| Binary | 1.83E+06 | 5.53E+05 | 0.9655 | 0.8843 |
| Freq | 2.11E+06 | 4.05E+05 | 0.9748 | 0.8664 |
| Baseline | 2.34E+06 | 2.03E+06 | 0.8738 | 0.8515 |
| Hash | 1.98E+06 | 6.14E+05 | 0.9617 | 0.8749 |
| Dummy | 1.28E+06 | 3.60E+05 | 0.9775 | 0.9191 |
| Helmert | 1.30E+06 | 3.55E+05 | 0.9779 | 0.9178 |
The chart on the top left is the MSE of our prediction on the test set; lower is better. All encoding methods outperform the baseline with either model, so we can say that all encoding methods provide some information. The best encodings here are Dummy and Helmert. The R-squared values are less useful for comparison because the models have very similar R-squared values.
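For reference, the two regression metrics can be computed as in the sketch below. It uses the standard definitions of MSE and R-squared with made-up numbers, not the diamonds data. The function names are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values purely for illustration.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mse(y_true, y_pred))
print(r_squared(y_true, y_pred))
```

MSE is expressed in squared units of the target, which is why it is so large for prices, while R-squared is scaled by the variance of the target.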
We believe that Dummy and Helmert encoding win here because they create many variables, using a large number of degrees of freedom. Binary encoding, which creates a few new variables, also shows some advantage. Impact encoding, which incorporates the target value into the encoded value, also does reasonably well in this regression task.
Adult Income Dataset
This dataset has many categorical variables, and the number of levels varies. We try to predict whether a person's income is greater or less than $50,000 based on his/her education, age, gender, job, etc. Since the dataset is not very balanced, we also compare the result before and after rebalancing.
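| Encoding | Logistic Regression (Balanced) | Logistic Regression (Unbalanced) | Random Forest (Balanced) | Random Forest (Unbalanced) |
|---|---|---|---|---|
| WOE | 0.684 | 0.633 | 0.692 | 0.673 |
| Impact | 0.676 | 0.629 | 0.694 | 0.670 |
| Binary | 0.660 | 0.613 | 0.683 | 0.652 |
| Freq | 0.625 | 0.536 | 0.680 | 0.655 |
| Hash | 0.604 | 0.450 | 0.671 | 0.628 |
| Helmert | 0.686 | 0.642 | 0.686 | 0.661 |
| Dummy | 0.682 | 0.652 | 0.698 | 0.670 |
| Baseline | 0.518 | 0.407 | 0.550 | 0.425 |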
For logistic regression, Hash encoding and Frequency encoding perform poorly. In contrast, WoE and Impact encoding do comparatively well, performing on par with Dummy and Helmert encoding for both models.
Why are Impact and WoE encoding doing so well? In this case, we can consider the Random Forest as using the encoded values to create a number of "if else" statements, so that the WoE and Impact Encoding, with only one variable, can result in effectively the same tree as Dummy or Helmert encoding.
San Francisco Crime Dataset
Recall that this dataset is unbalanced, with only 10% positive responses, so all models can benefit from rebalancing.
In addition, since we have more than 20,000 levels for the Address variable, directly applying Dummy Encoding and Helmert Encoding would result in an extremely wide dataset which might lead to performance issues, or possibly running out of memory. Therefore, we do not include these two methods in our comparison.
| Encoding | Fraction of Variance Explained (Balanced) | Fraction of Variance Explained (Unbalanced) | F (Balanced) | F (Unbalanced) |
|---|---|---|---|---|
| WOE | 0.116 | 0.086 | 0.249 | 0.004 |
| Impact | 0.059 | 0.045 | 0.252 | 0.005 |
| Binary | 0.002 | 0.002 | 0.205 | 0.000 |
| Freq | 0.000 | 0.000 | 0.202 | 0.000 |
| Hash | 0.000 | 0.000 | 0.185 | 0.000 |
Even though the models perform poorly on this dataset, we can still compare their performance. WoE and Impact encoding clearly win. They are the only encodings that produce nontrivial models on the unbalanced data. If the dataset is not rebalanced, then the models built on Binary, Frequency, and Hash Encoding always predict negative, which leads to a zero F-score.
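To see why a model that always predicts negative gets a zero F-score even though its accuracy looks high, consider the sketch below. It assumes the F-score is the usual F1 (harmonic mean of precision and recall) and uses synthetic labels with 10% positives rather than the real crime data. The function name `f_score` is hypothetical.

```python
def f_score(y_true, y_pred):
    """F1 score for the positive class; defined as 0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic labels: 10 positives out of 100, and a model that never predicts positive.
y_true = [1] * 10 + [0] * 90
y_all_negative = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_all_negative)) / len(y_true)
print(accuracy)
print(f_score(y_true, y_all_negative))
```

The always-negative model is right 90% of the time on these labels, yet it finds none of the positive cases, so its F-score is zero.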
The implication here is that WoE and Impact are particularly useful when there is a very high number of categorical levels. However, the dataset is rather unbalanced, and there is only one independent variable, so the models are weak and the results are therefore inconclusive.
We decided to try another dataset that is similar, but less unbalanced and with more variables.
Vancouver Crime Dataset
This dataset is less unbalanced. We again try to predict whether the crime is violent or not. In this dataset, we include the continuous variable Year and the categorical variables month, day of week, and address. We still consider Dummy and Helmert Encoding to be impractical, so we leave them out of the comparison.
| Encoding | Fraction of Variance Explained | Accuracy |
|---|---|---|
| WOE | 0.194 | 0.675 |
| Impact | 0.123 | 0.661 |
| Binary | 0.032 | 0.619 |
| Freq | 0.058 | 0.647 |
| Hash | 0.001 | 0.605 |
| Baseline | 0.000 | 0.605 |
Unfortunately, the models built on Hash encoding and the Baseline model still always predict negative.
What is surprising in this dataset is that Frequency Encoding does a better job. One reason for this phenomenon is illustrated by the figure below.
The more cases reported at an address, the larger the encoded value, but also the larger the probability of a violent crime. So Frequency Encoding has an advantage similar to that of WoE or Impact Encoding here. However, this does not explain the poor performance of Frequency Encoding on the previous datasets.
Still, the clear winners are Impact and WoE Encoding, as we suspected.
Summary

WoE and Impact Encoding win by taking the target variable into consideration.

Helmert and Dummy Encoding win by having lots of variables at the expense of degrees of freedom. However, when the number of levels is extremely large, we may face performance issues, and implementing them in a production setting may be impractical. (We can consider grouping some levels together before encoding.)

Binary, Frequency, and Hash Encoding in general do not provide as much information about the value to be predicted.
Note: This experiment of comparing different categorical encoding methods was inspired by the article Modeling Trick: Impact Coding of Categorical Variables with Many Levels, where Impact Encoding was applied to the SF Crime dataset. That article uses data from June 2012 and only uses the Address variable. We also only take the Address variable into account, but we apply multiple encoding methods and use data from 2003 to 2018.
Authors
This blog was written by Zongyuan Chen, who worked as an intern at TIBCO in the summer of 2020, focusing on encoding methods for categorical variables, modeling trends of COVID-19, and data visualization. He studied Computer Science and Statistics at the University of Washington, and is now pursuing a master's degree in Financial Engineering at Columbia University.
Steven Hillion works on large-scale machine learning. He was the co-founder of Alpine Data, acquired by TIBCO, and before that he built a global team of data scientists at Pivotal and developed a suite of open-source and enterprise software in machine learning. Earlier, he led engineering at a series of startups. Steven is originally from Guernsey, in the British Isles. He received his Ph.D. in mathematics from the University of California, Berkeley, and before that read mathematics at Oxford University.