
Unsupervised Binning

Unsupervised binning methods transform numerical variables
into categorical
counterparts but do not use the target (class) information. Equal Width
and Equal Frequency are two unsupervised binning methods. 

1 Equal Width Binning

The algorithm divides the data into k
intervals of equal size. The
width of intervals is:

w = (max - min)/k

And the interval boundaries are:

min+w, min+2w, ... , min+(k-1)w
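
The width and boundary formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name equal_width_bins, the sample values, and the convention that a value lying exactly on a boundary goes to the upper interval are all assumptions made for the example.

```python
from bisect import bisect_right

def equal_width_bins(values, k):
    """Return (bin index per value, interior boundaries) for k equal-width bins."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k                      # w = (max - min)/k
    if w == 0:
        # All values are identical: put everything in a single bin.
        return [0] * len(values), []
    # Interior boundaries: min+w, min+2w, ..., min+(k-1)w
    boundaries = [lo + i * w for i in range(1, k)]
    # Assumed convention: a value equal to a boundary falls in the upper bin.
    bins = [bisect_right(boundaries, v) for v in values]
    return bins, boundaries

# Hypothetical sample data, chosen only for illustration.
values = [0, 4, 12, 16, 16, 18, 24, 26, 28]
bins, boundaries = equal_width_bins(values, 3)
print(boundaries)
print(bins)  # [0, 0, 1, 1, 1, 1, 2, 2, 2]
```

With k = 3 the width is 28/3, so the middle interval ends up holding four of the nine values. Equal width does not imply equal counts.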


2 Equal Frequency Binning

The algorithm divides the data into k groups,
each containing approximately the same number of
values. For both methods, the best way to determine
k
is to look at the histogram and try different numbers of intervals or groups.
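
One possible way to form equal-frequency groups is to sort the values and cut the sorted order into k runs of nearly equal length. The sketch below assumes that approach; the function name equal_frequency_bins and the sample values are hypothetical. Note that in this simple version, tied values can land in different groups when a cut falls between them.

```python
def equal_frequency_bins(values, k):
    """Return a bin index (0..k-1) per value so that bins hold about n/k values each."""
    n = len(values)
    # Indices of the values in ascending order of value.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for pos, i in enumerate(order):
        # The value at sorted position pos goes to group floor(pos * k / n).
        bins[i] = pos * k // n
    return bins

# Hypothetical sample data, chosen only for illustration.
values = [0, 4, 12, 16, 16, 18, 24, 26, 28]
freq_bins = equal_frequency_bins(values, 3)
print(freq_bins)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Each of the three groups receives exactly three values here because 9 is divisible by 3. When n is not a multiple of k, group sizes differ by at most one.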


Example:




3 Other Methods

 Rank: The rank of a number is its size relative to the other values
of a numerical variable. First, we sort the list of values, then we
assign the position of a value as its rank. Equal values receive the same
rank, but duplicates affect the ranks of subsequent
values (e.g., 1,2,3,3,5). Rank is a solid binning method with one major
drawback: the same value can have different ranks in
different lists.
 Quantiles
(median, quartiles, percentiles, ...): Quantiles are also very useful
binning methods but, like Rank, a value can fall into a different quantile if
the list of values changes.
 Math functions: For example, FLOOR(LOG(X)) is
an effective binning method for numerical variables with a highly skewed distribution (e.g., income).
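
Two of the methods above, Rank and the math-function approach, can be illustrated with a short sketch. The names competition_rank and log_floor_bin and the income figures are invented for the example. The text does not state which logarithm base LOG refers to, so base 10 is an assumption here.

```python
import math

def competition_rank(values):
    """Rank values so that ties share a rank and later ranks skip (1,2,3,3,5)."""
    first_pos = {}
    for pos, v in enumerate(sorted(values), start=1):
        # Remember only the first sorted position at which each value appears.
        first_pos.setdefault(v, pos)
    return [first_pos[v] for v in values]

def log_floor_bin(x):
    """FLOOR(LOG(X)) with an assumed base of 10; x must be positive."""
    return math.floor(math.log10(x))

print(competition_rank([10, 20, 30, 30, 50]))  # [1, 2, 3, 3, 5]

# The drawback of Rank: the same value gets a different rank in a different list.
print(competition_rank([10, 20, 30])[2])       # 30 has rank 3 here
print(competition_rank([30, 40, 50])[0])       # 30 has rank 1 here

# Hypothetical incomes; the log compresses the long right tail into a few bins.
incomes = [12000, 35000, 48000, 150000, 2500000]
income_bins = [log_floor_bin(x) for x in incomes]
print(income_bins)  # [4, 4, 4, 5, 6]
```

Unlike Rank and Quantiles, the log-floor bin of a value depends only on the value itself, so it stays the same when the rest of the list changes.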




Try
to invent a real-time unsupervised binning method. Components of a real-time
method are updated on the fly.

