Encoding

Encoding, or continuization, is the transformation of categorical variables
into binary or numerical counterparts. An example is representing male and
female for gender as 1 and 0. Many modeling methods (e.g., linear regression,
SVM, neural networks) require categorical variables to be encoded first. The
two main types of encoding are binary and target-based.
Binary Encoding

Binary encoding is the numerization of a categorical variable using the values
0 and 1 to indicate the absence or presence of each category. If the
categorical variable has k categories, we create k binary variables
(technically, k-1 would suffice, because the omitted category is implied when
all the others are 0). In the following example, the categorical variable
"Trend", which has three values, is transformed into three separate binary
numerical variables. The main drawback of this method appears when the
categorical variable has many values (e.g., city), which can greatly increase
the dimension of the data.
| Categorical | Encoded  |            |            |
| Trend       | Trend_Up | Trend_Down | Trend_Flat |
|-------------|----------|------------|------------|
| Up          | 1        | 0          | 0          |
| Up          | 1        | 0          | 0          |
| Down        | 0        | 1          | 0          |
| Flat        | 0        | 0          | 1          |
| Down        | 0        | 1          | 0          |
| Up          | 1        | 0          | 0          |
| Down        | 0        | 1          | 0          |
| Flat        | 0        | 0          | 1          |
| Flat        | 0        | 0          | 1          |
| Flat        | 0        | 0          | 1          |
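The binary encoding in the table above can be reproduced with a few lines of plain Python. This is a minimal sketch, not a reference implementation: the function name `one_hot_encode` and its `prefix` and `drop_first` parameters are hypothetical names chosen for illustration. The `drop_first` option shows the k-1 variant, where the omitted category is represented by a row of zeros.

```python
def one_hot_encode(values, prefix="Trend", drop_first=False):
    """Return (column_names, rows) of 0/1 indicators for a list of categories.

    Hypothetical helper for illustration only.
    """
    # Keep categories in order of first appearance (Up, Down, Flat here).
    categories = list(dict.fromkeys(values))
    if drop_first:
        # k-1 columns suffice: the dropped category is implied by all zeros.
        categories = categories[1:]
    columns = [f"{prefix}_{c}" for c in categories]
    # One row per observation, one 0/1 indicator per remaining category.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


trend = ["Up", "Up", "Down", "Flat", "Down", "Up", "Down", "Flat", "Flat", "Flat"]

columns, rows = one_hot_encode(trend)
print(columns)
for value, row in zip(trend, rows):
    print(value, row)

# The k-1 variant: "Up" is no longer a column and appears as [0, 0].
reduced_columns, reduced_rows = one_hot_encode(trend, drop_first=True)
print(reduced_columns)
```

With a high-cardinality variable such as city, `columns` would contain one entry per distinct value, which is the dimensionality drawback described above.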
Target-based Encoding

Target-based encoding is the numerization of a categorical variable via the
target. In this method, we replace the categorical variable with just one new
numerical variable: each category is replaced with its corresponding
probability of the target (if the target is categorical) or average of the
target (if the target is numerical). The main drawbacks of this method are its
dependence on the distribution of the target and its lower predictive power
compared to the binary encoding method.
Example 1:

An example of target-based encoding with a categorical target. Each value of
Trend is replaced by the probability that Target = 1 within that category.
| Trend | Target | Trend_Encoded |
|-------|--------|---------------|
| Up    | 1      | 0.66          |
| Up    | 1      | 0.66          |
| Down  | 0      | 0.33          |
| Flat  | 0      | 0.5           |
| Down  | 1      | 0.33          |
| Up    | 0      | 0.66          |
| Down  | 0      | 0.33          |
| Flat  | 0      | 0.5           |
| Flat  | 1      | 0.5           |
| Flat  | 1      | 0.5           |

| Trend | Target = 0 | Target = 1 | Probability (1) |
|-------|------------|------------|-----------------|
| Up    | 1          | 2          | 0.66            |
| Down  | 2          | 1          | 0.33            |
| Flat  | 2          | 2          | 0.5             |
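The counts and probabilities in Example 1 can be checked with a short Python sketch. The function name `target_encode_binary` is a hypothetical name used only for illustration. The code computes the exact fractions (2/3 for Up and 1/3 for Down), which the tables show as 0.66 and 0.33.

```python
from collections import Counter


def target_encode_binary(categories, targets):
    """Map each category to P(target = 1) within that category.

    Hypothetical helper for illustration only.
    """
    totals = Counter(categories)  # rows per category
    # Rows per category where the target is 1.
    ones = Counter(c for c, t in zip(categories, targets) if t == 1)
    probabilities = {c: ones[c] / totals[c] for c in totals}
    # The single new numerical variable that replaces the categorical one.
    encoded = [probabilities[c] for c in categories]
    return probabilities, encoded


trend = ["Up", "Up", "Down", "Flat", "Down", "Up", "Down", "Flat", "Flat", "Flat"]
target = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

probabilities, trend_encoded = target_encode_binary(trend, target)
for category, p in probabilities.items():
    print(category, round(p, 3))
```

Because the encoded values are computed from the target itself, they change whenever the distribution of the target changes, which is the dependence noted as a drawback of this method.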
Example 2:

An example of target-based encoding with a numerical target. Each value of
Trend is replaced by the average of Target within that category.
| Trend | Target | Trend_Encoded |
|-------|--------|---------------|
| Up    | 21     | 23.7          |
| Up    | 24     | 23.7          |
| Down  | 8      | 10.3          |
| Flat  | 15     | 14.5          |
| Down  | 11     | 10.3          |
| Up    | 26     | 23.7          |
| Down  | 12     | 10.3          |
| Flat  | 16     | 14.5          |
| Flat  | 14     | 14.5          |
| Flat  | 13     | 14.5          |

| Trend | Target - Average |
|-------|------------------|
| Up    | 23.7             |
| Down  | 10.3             |
| Flat  | 14.5             |
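For a numerical target, the same idea reduces to a per-category average. The sketch below, which uses only the Python standard library, groups the Target values from Example 2 by Trend and rounds the averages to one decimal place to match the tables. The variable names are illustrative assumptions.

```python
from statistics import mean

trend = ["Up", "Up", "Down", "Flat", "Down", "Up", "Down", "Flat", "Flat", "Flat"]
target = [21, 24, 8, 15, 11, 26, 12, 16, 14, 13]

# Collect the target values observed for each category.
groups = {}
for category, value in zip(trend, target):
    groups.setdefault(category, []).append(value)

# Average of the target per category, rounded to one decimal place.
averages = {category: round(mean(values), 1) for category, values in groups.items()}

# Replace each category with its average to obtain the encoded variable.
trend_encoded = [averages[category] for category in trend]

print(averages)
print(trend_encoded)
```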
Exercise