What is Binning in Data Mining?

Data binning, also called discrete binning or bucketing, is a data pre-processing technique used to reduce the effects of minor observation errors. It is a form of quantization. The original data values are divided into small intervals known as bins, and each value is then replaced by a representative value calculated for its bin, such as the bin mean or median. This has a smoothing effect on the input data and may also reduce the chance of overfitting on small datasets.
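As a rough illustration of replacing values by a per-bin representative, the sketch below sorts the data, cuts it into consecutive bins, and substitutes each value with its bin mean. The function name smooth_by_bin_means, the bin size and the sample numbers are assumptions made for this example only.

```python
def smooth_by_bin_means(values, bin_size):
    """Replace each value with the mean of its bin.

    Bins are consecutive runs of `bin_size` values taken from the sorted data;
    the last bin may be shorter if the data does not divide evenly.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# illustrative data, not taken from a real dataset
print(smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26], 3))
# [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]
```

Small differences inside a bin disappear, which is the smoothing effect described above. Using the median or the bin boundaries instead of the mean would be an equally valid choice.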

Statistical data binning groups a large number of more or less continuous values into a smaller number of "bins". It can also be used in multivariate statistics, binning in several dimensions simultaneously. For example, if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals, such as grouping every five years together.
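The age example can be sketched in a few lines. The helper name age_bin, the label format and the sample ages are hypothetical; the only idea taken from the text is grouping every five years together.

```python
def age_bin(age, width=5):
    """Return the label of the `width`-year interval that contains `age`."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


ages = [23, 27, 31, 34, 35, 42]  # made-up sample
counts = {}
for age in ages:
    label = age_bin(age)
    counts[label] = counts.get(label, 0) + 1

print(counts)
# {'20-24': 1, '25-29': 1, '30-34': 2, '35-39': 1, '40-44': 1}
```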

Because a binned attribute has far fewer distinct values, binning can reduce resource usage and model build time, often without significant loss in model quality. Binning can also improve model quality by strengthening the relationship between attributes.

Supervised binning is a form of intelligent binning in which important characteristics of the data are used to determine the bin boundaries. In supervised binning, the bin boundaries are identified by a single-predictor decision tree that considers the joint distribution with the target. Supervised binning can be used for both numerical and categorical attributes.
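To make the idea of target-aware boundaries concrete, here is a minimal sketch that finds a single cut point the way a depth-one decision tree on one predictor would, by minimising weighted Gini impurity. It is an assumption-laden illustration, not the algorithm of any particular product: the function names, the impurity measure and the toy data are all chosen for this example, and a real supervised binner would normally split recursively to obtain more than two bins.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, targets):
    """Return the cut point that most reduces weighted Gini impurity, or None."""
    pairs = sorted(zip(values, targets))
    best_threshold, best_score = None, gini(targets)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can be placed between identical values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_threshold


# toy data: age of a customer and whether they bought (1) or not (0)
ages = [22, 25, 31, 38, 45, 52, 60]
bought = [0, 0, 0, 1, 1, 1, 1]
print(best_split(ages, bought))  # 34.5
```

The boundary falls where the target changes, not at an evenly spaced position, which is the difference from the unsupervised methods described later.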

Image Data Processing

In the context of image processing, binning is the procedure of combining a cluster of pixels into a single pixel. For example, in 2x2 binning, a block of 4 pixels becomes a single larger pixel, reducing the overall number of pixels.

Although associated with loss of information, this aggregation reduces the amount of data to be processed, which makes analysis easier. Binning may also reduce the impact of read noise on the processed image, at the cost of a lower resolution.
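A small sketch of 2x2 pixel binning on a plain list-of-rows image follows. The function name bin_pixels is hypothetical, and summing the block is just one possible choice (averaging is another); rows or columns that do not fill a complete block are dropped in this version.

```python
def bin_pixels(image, factor=2):
    """Sum each factor x factor block of `image` (a list of equal-length rows)."""
    rows, cols = len(image), len(image[0])
    binned = []
    for r in range(0, rows - rows % factor, factor):
        row = []
        for c in range(0, cols - cols % factor, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block))
        binned.append(row)
    return binned


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(bin_pixels(image))  # [[14, 22], [46, 54]]
```

The 4x4 image becomes 2x2, so a quarter of the pixels remain.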

Why is Binning Used?

Binning or discretization is used to transform a continuous or numerical variable into a categorical feature. Binning of continuous variables introduces non-linearity and can improve the performance of some models. It can also help to handle missing values or outliers, for example by placing them in a bin of their own.
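As a sketch of turning a numerical variable into a categorical one, the snippet below maps each number to a label using fixed cut points and gives missing values their own category. The cut points, labels and function name are assumptions made for illustration.

```python
from bisect import bisect_right


def to_category(value, edges, labels):
    """Map a number to a label; None is placed in a separate 'missing' category."""
    if value is None:
        return "missing"
    # bisect_right counts how many edges are <= value, which is the bin index
    return labels[bisect_right(edges, value)]


edges = [18, 65]                       # hypothetical cut points
labels = ["minor", "adult", "senior"]  # always one more label than edges
ages = [12, 18, 40, None, 70]
print([to_category(a, edges, labels) for a in ages])
# ['minor', 'adult', 'adult', 'missing', 'senior']
```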

What is the Purpose of Binning Data?

Binning, also called discretization, is a technique for reducing the cardinality of continuous and discrete data. Binning groups related values together in bins to reduce the number of distinct values.

Example of Binning

Histograms are an example of data binning used to observe underlying distributions. They are typically drawn over a single dimension with equal-width intervals for ease of visualization.
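A text-only histogram shows the idea without any plotting library. This is a minimal sketch: the function name and sample values are made up, and it assumes the data contains at least two different values so that the bin width is not zero.

```python
def histogram(values, num_bins):
    """Count how many values fall into each of `num_bins` equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins
    counts = [0] * num_bins
    for v in values:
        # the maximum would compute index num_bins, so clamp it to the last bin
        index = min(int((v - low) / width), num_bins - 1)
        counts[index] += 1
    return counts


values = [1, 2, 2, 3, 5, 6, 8, 9, 9, 10]  # illustrative sample
for i, count in enumerate(histogram(values, 3)):
    print(f"bin {i}: {'#' * count}")
# bin 0: ####
# bin 1: ##
# bin 2: ####
```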

In mass spectrometry (MS) or nuclear magnetic resonance (NMR) experiments, small instrumental shifts in the spectral dimension can be falsely interpreted as different components when a collection of data profiles is subjected to pattern recognition analysis. A straightforward way to cope with this problem is binning: the spectrum is reduced in resolution to a sufficient degree to ensure that a given peak remains in its bin despite small spectral shifts between analyses.

For example, in NMR, the chemical shift axis may be discretized and coarsely binned, and in MS, the mass values may be rounded to integer atomic mass unit values. Also, several digital camera systems incorporate an automatic pixel binning function to improve image contrast.
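The MS case can be sketched as rounding each peak position to the nearest integer mass unit and summing the intensities that share a bin. The peak lists below are invented numbers, and bin_spectrum is a hypothetical helper, not part of any spectrometry library.

```python
from collections import defaultdict


def bin_spectrum(peaks):
    """Sum the intensities of (mass, intensity) peaks per integer mass unit."""
    binned = defaultdict(float)
    for mass, intensity in peaks:
        binned[round(mass)] += intensity
    return dict(binned)


# two made-up runs of the same sample with slightly shifted peak positions
run_a = [(100.02, 5.0), (150.98, 2.0)]
run_b = [(99.97, 5.5), (151.03, 1.8)]

print(bin_spectrum(run_a))  # {100: 5.0, 151: 2.0}
print(sorted(bin_spectrum(run_a)) == sorted(bin_spectrum(run_b)))  # True
```

After binning, both runs report peaks in the same bins, so the small shifts no longer look like different components. Peaks that sit close to a bin edge can still fall on different sides, which is a known limitation of fixed bins.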

Binning is also used in machine learning to speed up the decision-tree boosting method for supervised classification and regression in algorithms such as Microsoft's LightGBM and scikit-learn's Histogram-based Gradient Boosting Classification Tree.
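The speed-up comes from replacing each continuous feature value by a small integer bin index, so the tree learner only has to evaluate a handful of candidate split points. The sketch below shows that mapping in plain Python under simplifying assumptions; it is not the code used by LightGBM or scikit-learn, and the function names and the way cut points are chosen are illustrative only.

```python
from bisect import bisect_right


def fit_thresholds(values, max_bins):
    """Choose at most max_bins - 1 cut points from the distinct sorted values."""
    ordered = sorted(set(values))
    if len(ordered) <= max_bins:
        # few distinct values: cut halfway between each neighbouring pair
        return [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    step = len(ordered) / max_bins
    return [ordered[int(step * k)] for k in range(1, max_bins)]


def to_bin_indices(values, thresholds):
    """Replace every value by the index of the bin it falls into."""
    return [bisect_right(thresholds, v) for v in values]


feature = [0.5, 2.1, 2.1, 3.7, 8.0, 9.4]  # made-up feature column
thresholds = fit_thresholds(feature, max_bins=3)
print(thresholds)                           # [2.1, 8.0]
print(to_bin_indices(feature, thresholds))  # [0, 1, 1, 1, 2, 2]
```

With three bins there are only two possible split points to test instead of one between every pair of distinct values.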

How do you Bin Data?

There are two common methods of dividing data into bins:

1. Equal Frequency Binning: Each bin contains the same number of values (or nearly the same, when the data does not divide evenly).

For example, equal frequency:

Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

Output:

[5, 10, 11, 13]

[15, 35, 50, 55]

[72, 92, 204, 215]

2. Equal Width Binning: Each bin covers a range of the same width. With n bins, the bin boundaries are min, min + w, min + 2w, …, min + nw, where w = (max - min) / n.

For example, equal width:

Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

Output:

[5, 10, 11, 13, 15, 35, 50, 55, 72]

[92]

[204, 215]
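The boundaries behind the equal-width result can be checked with a few lines. This is only a worked calculation on the example input; the variable names are chosen for illustration.

```python
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
num_bins = 3

w = (max(data) - min(data)) / num_bins                    # (215 - 5) / 3
edges = [min(data) + i * w for i in range(num_bins + 1)]  # min, min + w, ...

print(w)      # 70.0
print(edges)  # [5.0, 75.0, 145.0, 215.0]
```

Nine of the twelve values lie between 5 and 75, which is why the first bin is so crowded: the two large values 204 and 215 stretch the range, and equal-width bins do not adapt to where the data is dense.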

Implementation of Binning Technique

The code below shows a Python implementation of both binning techniques.

# equal frequency: every bin receives (almost) the same number of values
def equifreq(arr1, m):
    values = sorted(arr1)  # bins are cut from the sorted data
    n, extra = divmod(len(values), m)
    start = 0
    for i in range(m):
        # the first `extra` bins take one more value, so no value is dropped
        end = start + n + (1 if i < extra else 0)
        print(values[start:end])
        start = end

# equal width: every bin spans the same range of values
def equiwidth(arr1, m):
    min1 = min(arr1)
    w = (max(arr1) - min1) / m  # true division keeps the maximum inside the last bin
    arri = [[] for _ in range(m)]
    for j in arr1:
        # bins are half-open [low, high), so a boundary value lands in one bin only;
        # the maximum (or every value, if w == 0) is clamped into a valid bin
        i = min(int((j - min1) / w), m - 1) if w > 0 else 0
        arri[i].append(j)
    print(arri)

# data to be binned
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]

# number of bins
m = 3

print("equal frequency binning")
equifreq(data, m)

print("\n\nequal width binning")
equiwidth(data, m)

Output

The above code gives the following output.

equal frequency binning
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]


equal width binning
[[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
