Pipelines in Pandas

In pandas, pipelines are useful when we need to apply a series of transformations to a whole dataframe. In general terms, a pipeline is a sequence of operations performed in order, where the output of each step becomes the input of the next, until we reach the final result. We can build a pipeline of our own by defining a few functions and passing the dataframe through them in order. The .pipe() method of the pandas dataframe simplifies this chaining.

The pipe() method lets us chain multiple functions, each applied to the result of the previous one, and process our data in a single expression. To understand how pipe() works, let us first look at what a pipeline of operations means. We will see an example of a pipeline and then simplify it using the .pipe() method.

Below is the Python code for the pipeline of operations on the dataframe.

Code
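The original listing is not reproduced here. A minimal sketch of such a pipeline could look like the following. The data is taken from the output shown below, while the two step functions, mean_age and uppercase_columns, are hypothetical names chosen for illustration.

```python
import pandas as pd

# Data taken from the output shown in this article.
df = pd.DataFrame({
    "Artists": ["Harry", "Naill", "Louis", "Zayn", "Liam", "Peter", "Andrew"],
    "Role": ["Singer", "Musician", "Lyricist", "Singer", "Composer", "Actor", "Actor"],
    "Age": [31, 33, 32, 33, 32, 34, 34],
})

print("Original Dataframe: ")
print(df)


def mean_age(frame):
    """Hypothetical step: replace every Age value with the column mean."""
    out = frame.copy()
    out["Age"] = out["Age"].mean()
    return out


def uppercase_columns(frame):
    """Hypothetical step: convert all column names to uppercase."""
    out = frame.copy()
    out.columns = [name.upper() for name in out.columns]
    return out


# A hand-written pipeline: each function receives the result of the previous one.
# Note that the calls are nested, so they read in the reverse of the order they run.
result = uppercase_columns(mean_age(df))
```

Both functions work on a copy, so the original dataframe stays unchanged and can be reused.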

Output

Original Dataframe: 
   Artists      Role  Age
0   Harry    Singer   31
1   Naill  Musician   33
2   Louis  Lyricist   32
3    Zayn    Singer   33
4    Liam  Composer   32
5   Peter     Actor   34
6  Andrew     Actor   34

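We will now implement this pipeline using the .pipe() method. As the output shows, the pipeline replaces every value in the Age column with the column mean and converts the column names to uppercase.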

Code
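The original listing is missing at this point. The sketch below is one way the same pipeline might be written with pipe(), again assuming the hypothetical step functions mean_age and uppercase_columns. It is meant to reproduce the output that follows.

```python
import pandas as pd

df = pd.DataFrame({
    "Artists": ["Harry", "Naill", "Louis", "Zayn", "Liam", "Peter", "Andrew"],
    "Role": ["Singer", "Musician", "Lyricist", "Singer", "Composer", "Actor", "Actor"],
    "Age": [31, 33, 32, 33, 32, 34, 34],
})


def mean_age(frame):
    """Hypothetical step: replace every Age value with the column mean."""
    out = frame.copy()
    out["Age"] = out["Age"].mean()
    return out


def uppercase_columns(frame):
    """Hypothetical step: convert all column names to uppercase."""
    out = frame.copy()
    out.columns = [name.upper() for name in out.columns]
    return out


# pipe() passes the dataframe as the first argument of each function,
# so the steps read left to right in the order they run.
result = df.pipe(mean_age).pipe(uppercase_columns)
print(result)
```

Any extra positional or keyword arguments given to pipe() are forwarded to the function, which helps when a step needs parameters.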

Output

  ARTISTS      ROLE        AGE
0   Harry    Singer  32.714286
1   Naill  Musician  32.714286
2   Louis  Lyricist  32.714286
3    Zayn    Singer  32.714286
4    Liam  Composer  32.714286
5   Peter     Actor  32.714286
6  Andrew     Actor  32.714286

Now, we will use the pdpipe package to implement a pipeline on a pandas dataframe. pdpipe is easy to use and offers a clear interface for building pre-processing pipelines for pandas dataframes. It provides ready-made pipeline stages, so complex pipelines can be built in a few lines of code.

Before using the pdpipe package, we need to install it in our Python environment. We can do so with pip by running pip install pdpipe.

Once the package is installed, we can use it as shown in the examples below.

Below is the Python code that creates the dataframe we will use with the pdpipe package.

Code
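The listing for this step is not included. The sketch below only builds the dataframe shown in the output that follows, using the values printed there. It needs pandas alone, since no pipeline is applied yet.

```python
import pandas as pd

# Values copied from the output shown in this article.
df = pd.DataFrame({
    "Artists": ["Harry", "Naill", "Louis", "Zayn", "Liam", "Peter", "Andrew"],
    "Role": ["Singer", "Musician", "Lyricist", "Singer", "Composer", "Actor", "Actor"],
    "Age": [31, 33, 32, 33, 32, 34, 34],
    "State": ["NY", "Cal", "NL", "BP", "CL", "NY", "Cal"],
    "idx": [1, 2, 3, 4, 5, 6, 7],
})

print("Original Dataframe: ")
print(df)
```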

Output

Original Dataframe: 
   Artists      Role  Age State  idx
0   Harry    Singer   31    NY    1
1   Naill  Musician   33   Cal    2
2   Louis  Lyricist   32    NL    3
3    Zayn    Singer   33    BP    4
4    Liam  Composer   32    CL    5
5   Peter     Actor   34    NY    6
6  Andrew     Actor   34   Cal    7

Now, we will use the pdpipe package to create a pipeline that drops an unwanted column, idx, from the dataframe.

Here is the Python code to show how it can be done.

Code

Output

New dataframe: 
   Artists      Role  Age State
0   Harry    Singer   31    NY
1   Naill  Musician   33   Cal
2   Louis  Lyricist   32    NL
3    Zayn    Singer   33    BP
4    Liam  Composer   32    CL
5   Peter     Actor   34    NY
6  Andrew     Actor   34   Cal

The pdpipe package offers a second way to apply a pipeline to the dataframe. Let us see how it works.

Code

Output

New dataframe: 
   Artists      Role  Age State
0   Harry    Singer   31    NY
1   Naill  Musician   33   Cal
2   Louis  Lyricist   32    NL
3    Zayn    Singer   33    BP
4    Liam  Composer   32    CL
5   Peter     Actor   34    NY
6  Andrew     Actor   34   Cal

Both ways of applying the pipeline to the dataframe took two steps: first we created the pipeline, and then we applied it to our dataframe.

We have seen how to drop a column, but what if we have to add a column? Let us see how to add a column to the dataframe using the pdpipe package.

Adding a Column to the Dataframe Using the Pdpipe Package

Below is the Python code for adding a column to the dataframe using the pdpipe package.

Code

Output

Original Dataframe: 
   Artists      Role  Age State  idx
0   Harry    Singer   31    NY    1
1   Naill  Musician   33   Cal    2
2   Louis  Lyricist   32    NL    3
3    Zayn    Singer   33    BP    4
4    Liam  Composer   32    CL    5
5   Peter     Actor   34    NY    6
6  Andrew     Actor   34   Cal    7
New dataframe: 
   Artists      Role  Age State  idx
0   Harry    Singer   31    NY    1
1   Naill  Musician   33   Cal    2
2   Louis  Lyricist   32    NL    3
3    Zayn    Singer   33    BP    4
4    Liam  Composer   32    CL    5

We have seen two different ways to implement a pipeline on a pandas dataframe. The first is the built-in pipe() method of the pandas dataframe, which reduces a user-defined pipeline to one or two lines of code. The second is the pdpipe package, which provides ready-made pipeline stages for pandas dataframes, so we need not create a pipeline from scratch.