




Unsupervised Binning



Unsupervised binning methods transform numerical variables
into categorical
counterparts without using the target (class) information. Equal Width
and Equal Frequency are two unsupervised binning methods.





1 Equal Width Binning



The algorithm divides the range of the data into k
intervals of equal size. The
width of each interval is:



w = (max - min) / k



And the interval boundaries are:



min + w, min + 2w, ... , min + (k - 1)w
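
The width and boundary formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `equal_width_bins` and the sample `ages` list are hypothetical, and the choice to treat each interval as closed on the left (with the maximum value clamped into the last bin) is an assumption, since the text does not say which side of a boundary is inclusive.

```python
def equal_width_bins(values, k):
    """Return (interior boundaries, bin index per value) for k equal-width bins."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k  # w = (max - min) / k

    # Interior boundaries: min + w, min + 2w, ..., min + (k - 1)w
    boundaries = [lo + i * w for i in range(1, k)]

    if w == 0:
        # All values are identical, so everything lands in the first bin.
        return boundaries, [0] * len(values)

    # Assumption: intervals are closed on the left; the maximum is clamped
    # into the last bin instead of opening a (k + 1)-th one.
    indices = [min(int((x - lo) / w), k - 1) for x in values]
    return boundaries, indices


# Hypothetical sample data, used only for illustration.
ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
boundaries, bins = equal_width_bins(ages, 3)
print(boundaries)
print(bins)
```

Because the bin edges depend only on the minimum and maximum, a single outlier can stretch the range and leave most bins nearly empty, which is one reason to check the histogram before settling on k.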






2 Equal Frequency Binning



The algorithm divides the data into k groups,
each of which contains approximately the same number of
values. For both methods, the best way of determining
k
is to look at the histogram and try different numbers of intervals or groups.
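
One simple way to form equal-frequency groups is to sort the values and cut the sorted order into k chunks of nearly equal length. The sketch below takes that approach; the function name `equal_frequency_bins` and the sample `ages` list are hypothetical, and splitting by sorted position is an assumption about how "approximately the same number" is achieved.

```python
def equal_frequency_bins(values, k):
    """Assign each value to one of k groups holding roughly len(values) / k items."""
    n = len(values)

    # Indices of the values in ascending order of value (stable for ties).
    order = sorted(range(n), key=lambda i: values[i])

    indices = [0] * n
    for position, i in enumerate(order):
        # Sorted positions 0 .. n-1 are mapped evenly onto groups 0 .. k-1.
        indices[i] = position * k // n
    return indices


# Hypothetical sample data, used only for illustration.
ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
groups = equal_frequency_bins(ages, 3)
print(groups)
```

Note that cutting purely by position can place equal values in different groups when a tie straddles a cut point; a production version would need a rule for that case.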


















3 Other Methods



 Rank: The rank of a number is its size relative to the other values
of a numerical variable. First, we sort the list of values, then we
assign the position of each value as its rank. Equal values receive the same
rank, but the presence of duplicate values affects the ranks of subsequent
values (e.g., 1, 2, 3, 3, 5). Rank is a solid binning method with one major
drawback: the same value can have different ranks in
different lists.
 Quantiles
(median, quartiles, percentiles, ...): Quantiles are also very useful
binning methods but, like Rank, a value can fall in a different quantile if
the list of values changes.
 Math functions: For example, FLOOR(LOG(X)) is
an effective binning method for numerical variables with a highly skewed distribution (e.g., income).
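
The rank and math-function ideas above can be illustrated with a short sketch. The names `competition_rank` and `floor_log_bin`, and the sample `scores` and `incomes` lists, are hypothetical. Two assumptions are made: ties share the position of their first occurrence, which reproduces the 1, 2, 3, 3, 5 pattern, and the logarithm is taken in base 10, since the text does not specify a base.

```python
import math


def competition_rank(values):
    """Rank values so that ties share the position of their first occurrence."""
    first_position = {}
    for position, v in enumerate(sorted(values), start=1):
        # setdefault keeps the earliest (smallest) position for duplicates.
        first_position.setdefault(v, position)
    return [first_position[v] for v in values]


def floor_log_bin(x):
    """Bin a positive number by FLOOR(LOG10(x)), its order of magnitude."""
    # Assumption: base 10. math.log10 is used because it is more accurate
    # than math.log(x, 10) at exact powers of ten.
    return math.floor(math.log10(x))


# Hypothetical sample data, used only for illustration.
scores = [10, 20, 30, 30, 50]
ranks = competition_rank(scores)
print(ranks)

# The drawback noted above: the value 30 gets a different rank in another list.
print(competition_rank([5, 30, 50]))

incomes = [900, 1000, 25000, 48000, 120000, 3500000]
income_bins = [floor_log_bin(x) for x in incomes]
print(income_bins)
```

Unlike rank or quantiles, the logarithmic bin of a value depends only on the value itself, so it stays the same when the rest of the list changes.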












Try
to invent a real-time unsupervised binning method. The components of a
real-time method are updated on the fly.





