(Figure: workflow that incorporates various encoding methods)
Encoding Methods
In this section, we describe the seven encoding methods that we compare in our experiments, and we use a tiny dataset to show how each of them works.
Below is the data we are going to encode. It contains one categorical variable, Type, and a boolean target variable, Failure:
| ID | Type | Failure |
|---|---|---|
| 1 | A | True (1) |
| 2 | A | False (0) |
| 3 | A | True (1) |
| 4 | B | False (0) |
| 5 | B | True (1) |
| 6 | C | False (0) |
| 7 | C | False (0) |
| 8 | D | True (1) |
We will use each encoding method to encode the Type variable. If a target variable is necessary for the encoding method, then we will use the 1 and 0 values of the Failure target variable.
Dummy (One-Hot) Encoding
For a categorical variable with n levels, this method creates n-1 new boolean variables. Each new variable is 0 or 1, where 1 indicates that the observation is of that type and 0 that it is not. This method does not require a target variable.
| Type | Variable1 (Is Type==A?) | Variable2 (Is Type==B?) | Variable3 (Is Type==C?) |
|---|---|---|---|
| A | 1 | 0 | 0 |
| B | 0 | 1 | 0 |
| C | 0 | 0 | 1 |
| D | 0 | 0 | 0 |
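As a minimal sketch of dummy encoding on the toy table above, assuming pandas (the Is_ column prefix is our own choice, not anything required by the library):

```python
import pandas as pd

# Toy data from the table at the top of this section
df = pd.DataFrame({"Type": list("AAABBCCD"),
                   "Failure": [1, 0, 1, 0, 1, 0, 0, 1]})

# One indicator column per level, then drop one of them (here D) so that
# only n-1 = 3 boolean variables remain; level D becomes the all-zero row.
dummies = pd.get_dummies(df["Type"], prefix="Is").drop(columns="Is_D")
print(dummies)
```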
Helmert Encoding
This method also creates n-1 new variables. The main difference from one-hot encoding is that each new variable can take more than two distinct values, not just 0 or 1, and the values of each new variable sum to 0 across the levels. This method does not require a target variable, and it can be useful for categorical variables that are ordered.
| Type | Variable1 | Variable2 | Variable3 |
|---|---|---|---|
| A | 3/4 | 0 | 0 |
| B | -1/4 | 2/3 | 0 |
| C | -1/4 | -1/3 | 1/2 |
| D | -1/4 | -1/3 | -1/2 |
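A minimal sketch of how these Helmert values could be attached to the toy data, assuming pandas and simply hard-coding the contrast matrix from the table above (libraries can generate Helmert contrasts automatically, sometimes with different scaling; this sketch just reuses the table):

```python
import pandas as pd

df = pd.DataFrame({"Type": list("AAABBCCD"),
                   "Failure": [1, 0, 1, 0, 1, 0, 0, 1]})

# Helmert contrasts for the four levels, copied from the table above;
# each new variable sums to zero across the levels.
helmert = pd.DataFrame(
    {"Variable1": [3/4, -1/4, -1/4, -1/4],
     "Variable2": [0.0,  2/3, -1/3, -1/3],
     "Variable3": [0.0,  0.0,  1/2, -1/2]},
    index=["A", "B", "C", "D"])

encoded = helmert.loc[df["Type"]].reset_index(drop=True)
print(pd.concat([df, encoded], axis=1))
```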
Binary Encoding
This method creates ⌈log2(n)⌉ new variables, each of which can be either 0 or 1. Each level is converted to a binary number, and each binary digit becomes a new variable. This method does not require a target variable.
| Type | Variable1 | Variable2 |
|---|---|---|
| A | 0 | 0 |
| B | 0 | 1 |
| C | 1 | 0 |
| D | 1 | 1 |
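A minimal sketch of binary encoding on the same toy data, with the integer codes 0-3 assigned in alphabetical order (the assignment of codes to levels is arbitrary and is our own choice here):

```python
import pandas as pd

df = pd.DataFrame({"Type": list("AAABBCCD"),
                   "Failure": [1, 0, 1, 0, 1, 0, 0, 1]})

# Give each level an integer code, then spell out its binary digits.
codes = {"A": 0, "B": 1, "C": 2, "D": 3}
n_bits = 2  # ceil(log2(4)) bits are enough for 4 levels

ints = df["Type"].map(codes)
binary = pd.DataFrame({f"Variable{i + 1}": (ints // 2 ** (n_bits - 1 - i)) % 2
                       for i in range(n_bits)})
print(binary)
```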
Frequency Encoding
This method uses the frequency of the level as the encoded value. This method does not require a target variable. Different levels of the categorical variable may have the same encoded value.
| Type | EncodedValue |
|---|---|
| A | 3/8 |
| B | 2/8 |
| C | 2/8 |
| D | 1/8 |
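A minimal pandas sketch of frequency encoding on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({"Type": list("AAABBCCD"),
                   "Failure": [1, 0, 1, 0, 1, 0, 0, 1]})

# Relative frequency of each level: A -> 3/8, B -> 2/8, C -> 2/8, D -> 1/8
freq = df["Type"].value_counts(normalize=True)
df["Type_freq"] = df["Type"].map(freq)
print(df)
```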
Hash Encoding
This encoding method hashes each level to some value using a standard hashing algorithm (e.g. MD5), and then takes the remainder modulo a specified number of buckets. The advantage of such a method is that if a category is missing from the training set, new levels found in the test set can still be encoded using the same hash function. We do not need to manually create a default value for unseen categories, and different unseen categories will generally be encoded to different values. Different levels of the categorical variable may have the same encoded value.
In this example, suppose the number of buckets we specify is 3; the hash values shown are fabricated for illustration.
| Type | HashValue | EncodedValue |
|---|---|---|
| A | 17 | 2 |
| B | 42 | 0 |
| C | 25 | 1 |
| D | 4 | 1 |
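A minimal sketch of hash encoding with MD5 and 3 buckets, assuming hashlib and pandas. Since the hash values in the table above are fabricated, this code will not reproduce them exactly; the point is only the mechanism (hash the level, then take the remainder):

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"Type": list("AAABBCCD"),
                   "Failure": [1, 0, 1, 0, 1, 0, 0, 1]})

def hash_encode(level, n_buckets=3):
    # MD5 digest of the level name, read as an integer, reduced modulo n_buckets.
    digest = hashlib.md5(str(level).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

df["Type_hash"] = df["Type"].apply(hash_encode)
print(df)
```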
Impact Encoding
Impact encoding takes the value of the target variable into consideration by computing the mean of the target variable for each level. A level associated with larger target values therefore receives a larger encoded value. Different levels of the categorical variable may have the same encoded value.
| Type | Encoded Value |
|---|---|
| A | 2/3 |
| B | 1/2 |
| C | 0 |
| D | 1 |
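A minimal pandas sketch of impact encoding on the toy data; the per-level means match the table above:

```python
import pandas as pd

df = pd.DataFrame({"Type": list("AAABBCCD"),
                   "Failure": [1, 0, 1, 0, 1, 0, 0, 1]})

# Mean of the target for each level: A -> 2/3, B -> 1/2, C -> 0, D -> 1
impact = df.groupby("Type")["Failure"].mean()
df["Type_impact"] = df["Type"].map(impact)
print(df)
```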
Weight-of-Evidence Encoding
This method of encoding is only used for binary classification problems. The encoded value is WoE = ln(# of True / # of False). Since the number of True or False values could be 0, we need some smoothing to handle those cases. A naive approach is to manually set the True/False count to a tiny number whenever it is 0. Different levels of the categorical variable may have the same encoded value.
| Type | Encoded Value |
|---|---|
| A | 0.693147 |
| B | 0 |
| C | -13.815511 |
| D | 13.815511 |
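A minimal sketch of WoE encoding on the toy data, assuming pandas and numpy. The smoothing constant eps below is our own illustrative choice, so the extreme values for C and D will differ slightly from the ±13.8 shown above, which correspond to a different tiny constant:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Type": list("AAABBCCD"),
                   "Failure": [1, 0, 1, 0, 1, 0, 0, 1]})

eps = 1e-6  # tiny count used in place of zero, as described above
true_counts = df.groupby("Type")["Failure"].sum()
false_counts = df.groupby("Type")["Failure"].count() - true_counts

# WoE = ln(# of True / # of False), with zero counts replaced by eps
woe = np.log(true_counts.clip(lower=eps) / false_counts.clip(lower=eps))
df["Type_woe"] = df["Type"].map(woe)
print(df)
```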
Datasets Overview
Below are the datasets that we experiment on. Each of them has at least one categorical variable to be encoded.
- Diamonds
  - 3 categorical, 6 continuous, 53940 rows in total
  - Categorical variables: cut (5 distinct levels), color (7), clarity (8)
  - Response: price
- Adult Income
  - 8 categorical, 5 continuous, 48842 rows
  - Categorical: workclass (8 distinct levels), education (16), marital status (7), occupation (14), relationship (6), race (5), sex (2), native country (41)
  - Response: income (categorical, >50K or not)
  - Factors: age, workclass, education, marital-status, occupation, relationship, race, sex, capital gain, capital loss, hours per week, native country
  - 20% of the data are positive (>50K)
- San Francisco Crime
  - 1 categorical, 0 continuous, 1M rows
  - Categorical: Address (more than 20K levels)
  - Response: whether the crime is violent or not
- Vancouver Crime
  - 3 categorical, 1 continuous, 500K rows
  - Categorical: Address (more than 20K levels)
  - Response: whether the crime is violent or not
Summary
| | Diamonds | Adult Income | SF Crime | Vancouver Crime |
|---|---|---|---|---|
| Task | Regression | Classification | Classification | Classification |
| # of Categorical Variables | 3 | 8 | 1 | many |
| # of levels | few | varies | many | many |
| Balance (True : False) | N/A | 2 : 8 | 1 : 9 | 4 : 6 |
Testing the Methods
Below is a sample workflow, in this case for the Diamonds dataset. We also have a baseline model, which only takes in the continuous variables, to compare against. A model that outperforms the baseline can be said to gain information from encoding the categorical variables.
Here is a summary of the steps shown in the above workflow:
- Split the dataset into a Training set and a Test set.
- Compute encoded values based on the Training set.
- Apply the encoding rule to the Test set.
- Rebalance the Training set if necessary. (For unbalanced datasets, we compare the results with and without rebalancing.)
- Apply Linear Regression / Logistic Regression / Random Forest to the Training set.
- Measure performance by MSE / R-squared / Variance Explained / F-score.
Note: When encoding, we must calculate encoded values based only on the training set. This is because the encoding rules are established and applied before model training, when we only have access to the training data. That means we need to first split the dataset, then compute the encoded values, and then apply the encoding rule to the test set. We therefore also need to handle the case where a level appears only in the test set. For example, we may manually create a level called other, whose encoded value is the average of the encoded values of all the other levels. In the sample workflow above, we split the dataset inside a Python Notebook (which is called by the Python Execute operator), append a train/test label to each row, and finally encode the training dataset. The operators called Train and Test are simply filters based on that label.
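Here is a minimal sketch of this split-then-encode step, assuming pandas and scikit-learn and reusing the toy data and impact encoding from earlier. The fallback value for levels that appear only in the test set follows the "other" idea described above; the actual workflow ran inside the Python Execute operator, not this exact code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"Type": list("AAABBCCD"),
                   "Failure": [1, 0, 1, 0, 1, 0, 0, 1]})

# 1. Split first, so the encoding rule never sees the test rows.
train, test = train_test_split(df, test_size=0.25, random_state=42)
train, test = train.copy(), test.copy()

# 2. Derive the encoding rule (impact encoding here) from the training set only.
impact = train.groupby("Type")["Failure"].mean()
default = impact.mean()  # "other": average encoded value, used for unseen levels

# 3. Apply the rule to both sets; unseen test levels fall back to the default.
train["Type_encoded"] = train["Type"].map(impact)
test["Type_encoded"] = test["Type"].map(impact).fillna(default)
```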
Encoding Comparison Results
Diamonds Dataset
We try to predict the price of each diamond based on parameters such as color, cut, and clarity. We use Linear Regression and Random Forest for this task.
| Model | MSE (Linear Regression) | MSE (Random Forest) | R-Squared (Random Forest) | R-Squared (Linear Regression) |
|---|---|---|---|---|
| Impact | 1.65E+06 | 5.80E+05 | 0.9639 | 0.8953 |
| Binary | 1.83E+06 | 5.53E+05 | 0.9655 | 0.8843 |
| Freq | 2.11E+06 | 4.05E+05 | 0.9748 | 0.8664 |
| Baseline | 2.34E+06 | 2.03E+06 | 0.8738 | 0.8515 |
| Hash | 1.98E+06 | 6.14E+05 | 0.9617 | 0.8749 |
| Dummy | 1.28E+06 | 3.60E+05 | 0.9775 | 0.9191 |
| Helmert | 1.30E+06 | 3.55E+05 | 0.9779 | 0.9178 |
The MSE values measure the error of our predictions on the test set; the lower, the better. All encoding methods outperform the baseline with either model, so we can say that all encoding methods provide some information. The best results here come from Dummy and Helmert encoding. The R-squared values are not very useful here, as the models have very similar R-squared values.
We believe that Dummy and Helmert encoding win here because they create many variables, at the cost of a large number of degrees of freedom. Binary encoding, which creates only a few new variables, also shows some advantage. Impact encoding, which incorporates the target value into the encoded value, also performs reasonably well on this regression task.
Adult Income Dataset
This dataset has many categorical variables, and the number of levels varies. We try to predict whether a person's income is greater than $50,000 based on their education, age, gender, job, and so on. Since the dataset is not very balanced, we also compare the results before and after rebalancing.
| Encoding | Logistic Regression (Balanced) | Logistic Regression (Unbalanced) | Random Forest (Balanced) | Random Forest (Unbalanced) |
|---|---|---|---|---|
| WOE | 0.684 | 0.633 | 0.692 | 0.673 |
| Impact | 0.676 | 0.629 | 0.694 | 0.670 |
| Binary | 0.660 | 0.613 | 0.683 | 0.652 |
| Freq | 0.625 | 0.536 | 0.680 | 0.655 |
| Hash | 0.604 | 0.450 | 0.671 | 0.628 |
| Helmert | 0.686 | 0.642 | 0.686 | 0.661 |
| Dummy | 0.682 | 0.652 | 0.698 | 0.670 |
| Baseline | 0.518 | 0.407 | 0.550 | 0.425 |
For logistic regression, we see that Hash encoding and Frequency encoding do a poor job. In contrast, WoE and Impact encoding do comparatively well, performing nearly as well as Dummy and Helmert encoding with both models despite producing only a single variable per categorical feature.
Why do Impact and WoE encoding do so well? We can think of the Random Forest as using the encoded values to build a series of "if-else" splits, so that WoE and Impact encoding, with only one variable, can produce effectively the same tree as Dummy or Helmert encoding.
San Francisco Crime Dataset
Recall that this dataset is rather unbalanced, with only about 10% positive responses, so all models can benefit from rebalancing.
In addition, since we have more than 20,000 levels for the Address variable, directly applying Dummy Encoding and Helmert Encoding would result in an extremely wide dataset which might lead to performance issues, or possibly running out of memory. Therefore, we do not include these two methods in our comparison.
| Encoding | Fraction of Variance Explained (Balanced) | Fraction of Variance Explained (Unbalanced) | F-score (Balanced) | F-score (Unbalanced) |
|---|---|---|---|---|
| WOE | 0.116 | 0.086 | 0.249 | 0.004 |
| Impact | 0.059 | 0.045 | 0.252 | 0.005 |
| Binary | 0.002 | 0.002 | 0.205 | 0.000 |
| Freq | 0.000 | 0.000 | 0.202 | 0.000 |
| Hash | 0.000 | 0.000 | 0.185 | 0.000 |
Even though all models perform poorly on this dataset, we can still compare their relative performance. WoE and Impact encoding clearly win: they are the only methods that produce non-trivial models on the unbalanced data. Without rebalancing, Binary, Frequency, and Hash encoding always predict the negative class, which leads to a zero F-score.
The implication here is that WoE and Impact are particularly useful when there is a very high number of categorical levels. However, the dataset is rather unbalanced, and there is only one independent variable, so the models are weak and the results are therefore inconclusive.
We decided to try another dataset — similar, but less unbalanced and with more variables.
Vancouver Crime Dataset
This dataset is less unbalanced. We again try to predict whether the crime is violent or not. In this dataset, we include the continuous variable Year and the categorical variables month, day of week, and address. We still consider Dummy and Helmert encoding to be impractical here, so we leave them out of the comparison.
| Encoding | Fraction of Variance Explained | Accuracy |
|---|---|---|
| WOE | 0.194 | 0.675 |
| Impact | 0.123 | 0.661 |
| Binary | 0.032 | 0.619 |
| Freq | 0.058 | 0.647 |
| Hash | 0.001 | 0.605 |
| Baseline | 0.000 | 0.605 |
Unfortunately, the Hash encoding and Baseline models still always predict the negative class.
What is surprising in this dataset is that Frequency encoding does a better job than before. One reason for this phenomenon: the more cases that are reported at an address, the larger the encoded value, but also the larger the probability of a violent crime, so Frequency encoding gains a similar advantage to WoE or Impact encoding. However, this explanation does not account for the poor performance of Frequency encoding on the previous datasets.
However, the clear winners are Impact and WoE Encoding, as we suspected.
Summary
- WoE and Impact Encoding win by taking the target variable into consideration.
- Helmert and Dummy Encoding win by creating lots of variables, at the expense of degrees of freedom. However, when the number of levels is extremely large, we may face performance issues, and implementing them in a production setting may be impractical. (We can consider grouping some levels together before encoding.)
- Binary, Frequency, and Hash Encoding in general do not provide as much information about the values to be predicted.
Note: This experiment comparing different categorical encoding methods was inspired by the article Modeling Trick: Impact Coding of Categorical Variables with Many Levels, where Impact Encoding was applied to the SF Crime dataset. In that article, the authors use data from June 2012 and only the Address variable. We also take only the Address variable into account, but apply multiple encoding methods and use the data from 2003 to 2018.
Authors
This blog was written by Zongyuan Chen, who worked as an intern at TIBCO in the summer of 2020, focusing on encoding methods for categorical variables, modeling trends of COVID-19, and data visualization. He studied Computer Science and Statistics at the University of Washington, and is now pursuing a master's degree in Financial Engineering at Columbia University.
Steven Hillion works on large-scale machine learning. He was the co-founder of Alpine Data, acquired by TIBCO, and before that he built a global team of data scientists at Pivotal and developed a suite of open-source and enterprise software in machine learning. Earlier, he led engineering at a series of start-ups. Steven is originally from Guernsey, in the British Isles. He received his Ph.D. in mathematics from the University of California, Berkeley, and before that read mathematics at Oxford University.