Pandas cut()

The cut() function segments and sorts data values into bins. It is typically used to convert a continuous variable into a categorical variable, and it can also split an array of elements into separate bins. It is a top-level pandas function, called as pandas.cut(), and it only works on one-dimensional array-like objects.

When we have a large set of scalar data and want to perform statistical analysis on it, cut() lets us group the values into bins first.
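As a minimal sketch of that idea, the snippet below bins a handful of values and then counts how many fall into each bin. The ages, the bin edges, and the label names are made up for illustration.

```python
import pandas as pd

# Hypothetical ages, chosen only for illustration.
ages = pd.Series([3, 17, 25, 41, 52, 70])

# Three bins: (0, 18], (18, 65], (65, 100], each with a readable label.
age_groups = pd.cut(ages, bins=[0, 18, 65, 100],
                    labels=["child", "adult", "senior"])

# Count how many values fall into each category.
counts = age_groups.value_counts(sort=False)
print(counts)
```

Once the values are categorical, the usual grouping and counting tools apply to the bins rather than to the raw numbers.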

Syntax:

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')

Parameters:

x: The input array to be binned. It must be one-dimensional.

bins: An int, a sequence of scalars, or an IntervalIndex that defines the bin edges for the segmentation. Numerical data often comes on a very large scale, so grouping the values into bins makes it easier to compute descriptive statistics and to generalize patterns in the data. The criteria for binning the data into groups are as follows:

- int: Defines the number of equal-width bins in the range of x. The range of x is extended by 0.1% on each side to include the minimum and maximum values of x.
- sequence of scalars: Defines the bin edges, which allows for non-uniform bin widths. The range of x is not extended in this case.
- IntervalIndex: Defines the exact bins to be used. Note that the IntervalIndex for bins must be non-overlapping.
right: A boolean that indicates whether the bins include the rightmost edge. Its default value is True, so bins=[1, 2, 3, 4] produces the intervals (1, 2], (2, 3], (3, 4]. It is ignored when bins is an IntervalIndex.
labels: An optional array, or the boolean value False. It specifies the labels for the returned bins, and its length must match the number of resulting bins. If it is set to False, only integer indicators of the bins are returned. This argument is ignored if bins is an IntervalIndex.
retbins: A boolean that indicates whether to return the bins. It is useful when bins is given as an int, because the computed edges are not known in advance. The default value of retbins is False.
precision: An integer giving the precision at which the bin labels are stored and displayed. The default value is 3.
include_lowest: A boolean that indicates whether the first interval should be left-inclusive. The default value is False.
duplicates: An optional parameter that is either 'raise' (the default) or 'drop'. If the bin edges are not unique, it decides whether to raise a ValueError or to drop the duplicate edges.
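The different forms of bins, together with the right parameter, can be sketched as follows. The data and the edge values are illustrative assumptions, not taken from a real dataset.

```python
import pandas as pd

values = pd.Series([1, 5, 10, 15, 20])  # illustrative data

# int: three equal-width bins spanning the range of the data.
by_count = pd.cut(values, bins=3)

# Sequence of scalars: explicit, possibly non-uniform edges.
by_edges = pd.cut(values, bins=[0, 5, 20])

# right=False closes the bins on the left instead: [0, 5), [5, 20).
left_closed = pd.cut(values, bins=[0, 5, 20], right=False)

# IntervalIndex: the exact intervals to use; they must not overlap.
intervals = pd.IntervalIndex.from_tuples([(0, 10), (10, 20)])
by_intervals = pd.cut(values, bins=intervals)

print(by_count.cat.categories)
print(by_edges.tolist())
print(left_closed.tolist())
print(by_intervals.tolist())
```

Note that with right=False the value 20 lies outside the last bin [5, 20), so it is reported as NaN.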

Returns:

The function returns out, and additionally bins when retbins=True:

out: A Categorical, Series, or ndarray. It is an array-like object that holds the respective bin for each value of x. Its type depends on the value of labels. The possible cases are as follows:
- None: The default value. It returns a Series if x is a Series, and a Categorical for all other inputs. The values stored in these objects are of the Interval data type.
- sequence of scalars: It also returns a Series or a Categorical. The values stored in these objects are of the type of the sequence.
- False: It returns an ndarray of integers.
bins: An ndarray or IntervalIndex that holds the computed or specified bins. It is only returned when retbins=True.
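The return types described above can be checked with a short sketch. The array and the edges below are illustrative assumptions; the input is a NumPy array, so the default result is a Categorical rather than a Series.

```python
import numpy as np
import pandas as pd

data = np.array([2, 4, 6, 8])  # illustrative data
edges = [0, 5, 10]

# labels=None (default): a Categorical of Interval values for array input.
as_intervals = pd.cut(data, bins=edges)

# labels given as a sequence: the categories take the labels' type.
as_labels = pd.cut(data, bins=edges, labels=["low", "high"])

# labels=False: a plain ndarray of integer bin indicators.
as_codes = pd.cut(data, bins=edges, labels=False)

# retbins=True: an (out, bins) tuple; handy when bins was given as an int.
out, computed_bins = pd.cut(data, bins=2, retbins=True)

print(type(as_intervals).__name__)
print(list(as_labels))
print(as_codes)
print(computed_bins)
```

The first element of computed_bins is slightly below the minimum of the data, which reflects the small range extension applied when bins is an int.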

Example 1: The example below segments the numbers into bins:

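import pandas as pd
import numpy as np
# Build a DataFrame with 11 random integers from 1 to 49.
info_nums = pd.DataFrame({'num': np.random.randint(1, 50, 11)})
print(info_nums)
# Assign each number to one of the bins (1, 25] and (25, 50].
info_nums['num_bins'] = pd.cut(x=info_nums['num'], bins=[1, 25, 50])
print(info_nums)
# Show the distinct bins that actually occur in the data.
print(info_nums['num_bins'].unique())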

Output:

    num
0    48
1    36
2     7
3     2
4    25
5     2
6    13
7     5
8     7
9    25
10   10
    num  num_bins
0    48  (25, 50]
1    36  (25, 50]
2     7   (1, 25]
3     2   (1, 25]
4    25   (1, 25]
5     2   (1, 25]
6    13   (1, 25]
7     5   (1, 25]
8     7   (1, 25]
9    25   (1, 25]
10   10   (1, 25]
[(25, 50], (1, 25]]
Categories (2, interval[int64]): [(1, 25] < (25, 50]]
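One detail worth checking in this example: np.random.randint(1, 50, 11) can return 1, which sits exactly on the left edge of the first bin (1, 25] and is therefore not assigned to any bin. The sketch below uses hand-picked edge values (an assumption for illustration) to show the effect and how include_lowest=True changes it.

```python
import pandas as pd

nums = pd.Series([1, 25, 26, 50])  # hand-picked edge values

# Default: the first bin is (1, 25], so the value 1 falls outside it.
default_bins = pd.cut(nums, bins=[1, 25, 50])

# include_lowest=True makes the first interval left-inclusive.
inclusive_bins = pd.cut(nums, bins=[1, 25, 50], include_lowest=True)

print(default_bins.isna().tolist())    # [True, False, False, False]
print(inclusive_bins.isna().tolist())  # [False, False, False, False]
```

Choosing a lowest edge below the data minimum, such as 0, is another way to avoid the missing value.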

Example 2: The example below shows how to add labels to bins:

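import pandas as pd
import numpy as np
# Build a DataFrame with 7 random integers from 1 to 9.
info_nums = pd.DataFrame({'num': np.random.randint(1, 10, 7)})
print(info_nums)
# right=False gives the left-closed bins [1, 7) and [7, 10), labeled 'Lows' and 'Highs'.
info_nums['nums_labels'] = pd.cut(x=info_nums['num'], bins=[1, 7, 10], labels=['Lows', 'Highs'], right=False)
print(info_nums)
# Show the distinct labels that actually occur in the data.
print(info_nums['nums_labels'].unique())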

Output:

   num
0    9
1    9
2    4
3    9
4    4
5    7
6    2
   num  nums_labels
0    9        Highs
1    9        Highs
2    4        Lows
3    9        Highs
4    4        Lows
5    7        Highs
6    2        Lows
[Highs, Lows]
Categories (2, object): [Lows < Highs]
