Indexing and Selecting a Pandas DataFrame

Pandas is one of the most prominent Python libraries for data analysis. A DataFrame in Pandas is like a table or a two-dimensional array with rows and columns. It is a mutable, heterogeneous data structure, and its rows and columns are referred to as axes.

Pandas provides many functions for manipulating DataFrames for analysis. We can create a DataFrame in several ways, but the constructor used is pandas.DataFrame().

To use this or any other function in the library, we first have to import the library, conventionally with import pandas as pd.

For this tutorial, we created a table in an Excel sheet, "painters.xlsx", with information about the 20 greatest painters in the world.

The following Python code loads this table into a pandas DataFrame:
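The original listing did not survive in this copy. As a minimal sketch, the block below builds a five-row stand-in with pandas.DataFrame(). The "Painter" column name and the choice of rows are assumptions made for illustration; the other column names are the ones this tutorial refers to. With the spreadsheet on disk, pd.read_excel("painters.xlsx") would presumably be the equivalent call.

```python
import pandas as pd

# Hypothetical stand-in for painters.xlsx (the real sheet has 20 rows).
# "Painter" is an assumed column name; the artworks are illustrative picks.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Rembrandt", "Claude Monet",
                    "Vincent van Gogh", "Pablo Picasso"],
        "Birth": [1452, 1606, 1840, 1853, 1881],
        "Death": [1519, 1669, 1926, 1890, 1973],
        "Greatest Artpiece": ["Mona Lisa", "The Night Watch", "Water Lilies",
                              "The Starry Night", "Guernica"],
        "Nationality": ["Italian", "Dutch", "French", "Dutch", "Spanish"],
    }
)

# With the file available, this would load it instead (needs an Excel
# engine such as openpyxl): df = pd.read_excel("painters.xlsx")
print(df)
```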

Just as we slice a string using indexes from 0 to length - 1, we can access, copy, and create new DataFrames from an existing DataFrame by indexing and selecting. This tutorial explains the following ways:

  1. DataFrame[] and DataFrame.column
  2. DataFrame.iloc[]
  3. DataFrame.loc[]
  4. head() and tail()

1. [] and .

[] is called the index operator, and . is called the attribute operator in Pandas. These operators provide the basic form of indexing and let us view different subsets of a DataFrame.

Using the Attribute operator(.):

Selecting columns:

  • We can select only a single column from the DataFrame using this operator. It is restricted to columns whose names can be written directly as a Python attribute. If the column name contains white space, Python cannot parse the expression:

In our Painters table:
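As a hedged illustration on a three-row stand-in for the painters table (the rows are assumptions, not the original data), attribute access works for "Nationality" but cannot even be parsed for "Greatest Artpiece". The sketch uses compile() to show the parse failure without stopping the script.

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Birth": [1452, 1853, 1881],
        "Greatest Artpiece": ["Mona Lisa", "The Starry Night", "Guernica"],
        "Nationality": ["Italian", "Dutch", "Spanish"],
    }
)

# A column whose name is a valid identifier can be read as an attribute.
nationality = df.Nationality
print(nationality)

# "df.Greatest Artpiece" is not valid Python. compile() only parses the
# text (it never runs it), so the SyntaxError can be caught here.
try:
    compile("df.Greatest Artpiece", "<example>", "eval")
    raised = False
except SyntaxError:
    raised = True
print("SyntaxError raised:", raised)
```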

A syntax error is raised when we try to access the column "Greatest Artpiece" with the attribute operator, because of the space in its name. To access such a column as an attribute, we can use the built-in getattr(DataFrame, column_name) function.
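A minimal sketch of the getattr() route, again on an assumed two-row stand-in table:

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Greatest Artpiece": ["Mona Lisa", "Guernica"],
        "Nationality": ["Italian", "Spanish"],
    }
)

# getattr() takes the name as a string, so the space is no longer a problem.
artpieces = getattr(df, "Greatest Artpiece")
print(artpieces)
```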

Using the Index operator:

Selecting columns:

We pass the name of the column to the operator as a quoted string. Unlike the attribute operator, there is no restriction on spaces in the column name.
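For instance, on an assumed two-row stand-in table, selecting one column by its quoted name returns a Series:

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Greatest Artpiece": ["Mona Lisa", "Guernica"],
        "Nationality": ["Italian", "Spanish"],
    }
)

# The quoted name may contain spaces; the result is a pandas Series.
artpieces = df["Greatest Artpiece"]
print(artpieces)
```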

  • Another feature of this operator is that we can select multiple columns by passing a list of the required column names to it:
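A short sketch of multi-column selection; the table contents and the "Painter" column name are assumptions for illustration:

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Pablo Picasso"],
        "Birth": [1452, 1881],
        "Nationality": ["Italian", "Spanish"],
    }
)

# Note the double brackets: the outer pair is the index operator,
# the inner pair is a Python list of column names.
subset = df[["Painter", "Nationality"]]
print(subset)
```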

Selecting Rows:

Using slicing, we can select rows of the DataFrame with the same index operator. The syntax is the same as for any other Python sequence, DataFrame[start:stop:step], where:

start: Starting index/row_position(inclusive)

stop: Position to stop slicing(exclusive)

step: The interval between selecting rows
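The three slice parts can be sketched as follows on an assumed five-row table with the default 0 to 4 index:

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Rembrandt", "Claude Monet",
                    "Vincent van Gogh", "Pablo Picasso"],
        "Birth": [1452, 1606, 1840, 1853, 1881],
    }
)

# Rows at positions 1 and 2 (stop position 3 is excluded).
middle = df[1:3]

# Every second row from position 0 up to, but not including, position 4.
every_other = df[0:4:2]

print(middle)
print(every_other)
```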

We can also slice by row labels if the DataFrame was created with them. Here is an example:
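The original example is not preserved. A minimal sketch, assuming a small table whose rows are labeled "Member 1" to "Member 4" (both the labels and the contents are hypothetical):

```python
import pandas as pd

# Hypothetical table with string row labels set at creation time.
df = pd.DataFrame(
    {"Painter": ["Leonardo da Vinci", "Rembrandt",
                 "Claude Monet", "Vincent van Gogh"]},
    index=["Member 1", "Member 2", "Member 3", "Member 4"],
)

# Slicing by label: the end label "Member 3" is included.
by_label = df["Member 1":"Member 3"]

# Slicing by position: the end position 2 is excluded.
by_position = df[0:2]

print(by_label)
print(by_position)
```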

  • Observe that the row with Member 3 is also printed. When we slice by position, the end position is exclusive, but when we slice by row label, the end label is inclusive.

Here are the points to conclude about the index operator:

  1. We can select both rows and columns from the DataFrame using [].
  2. When selecting columns, we can select a single column or multiple columns.
  3. When we use the slicing operator, it selects rows.
  4. We can slice rows using positions or row labels. With positions, the end row isn't selected; with row labels, it is.

So far, we couldn't select both rows and columns of a DataFrame simultaneously. Pandas has two indexers built specifically for selecting and subsetting DataFrames, iloc[] and loc[], each with clearly defined behavior. We'll learn about them now.

2. DataFrame.iloc

Syntax: DataFrame.iloc[rows, columns]

Both rows and columns must be given as positions, not labels, and these positions can be given as follows:

  • A single position
  • List of multiple positions
  • Slice of positions

The examples below use the painters table loaded earlier.

Note that positions are zero-based: position 0 refers to the 1st row or the 1st column.

  • Single position:

Syntax: DataFrame.iloc[row_position, column_position]

Code:
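The original listing is not preserved; a minimal sketch on a three-row stand-in table (the "Painter" column name and the rows are assumptions):

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Vincent van Gogh", "Pablo Picasso"],
        "Birth": [1452, 1853, 1881],
        "Nationality": ["Italian", "Dutch", "Spanish"],
    }
)

# One row position and one column position give a single value.
cell = df.iloc[1, 0]

# A row position on its own gives that whole row as a Series.
row = df.iloc[1]

print(cell)
print(row)
```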

  • List of positions:

Syntax: DataFrame.iloc[[row_positions], [column_positions]]

Code:
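A hedged sketch with lists of positions, on the same kind of assumed stand-in table:

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Vincent van Gogh", "Pablo Picasso"],
        "Birth": [1452, 1853, 1881],
        "Nationality": ["Italian", "Dutch", "Spanish"],
    }
)

# Rows at positions 0 and 2, columns at positions 0 and 1.
subset = df.iloc[[0, 2], [0, 1]]
print(subset)
```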

  • Combinations of single position and list of positions:

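Syntax: DataFrame.iloc[row_position, [column_positions]] or DataFrame.iloc[[row_positions], column_position]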

Code:
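Mixing a single position with a list can be sketched like this (stand-in data assumed):

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Vincent van Gogh", "Pablo Picasso"],
        "Birth": [1452, 1853, 1881],
        "Nationality": ["Italian", "Dutch", "Spanish"],
    }
)

# One row, several columns: a Series indexed by the chosen column names.
row_part = df.iloc[1, [0, 2]]

# Several rows, one column: a Series indexed by the chosen row labels.
col_part = df.iloc[[0, 2], 1]

print(row_part)
print(col_part)
```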

  • Slices

Syntax: DataFrame.iloc[row_start:row_stop, column_start:column_stop]

Code:
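A minimal sketch of positional slices, where the stop position is excluded on both axes (stand-in data assumed):

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Vincent van Gogh", "Pablo Picasso"],
        "Birth": [1452, 1853, 1881],
        "Nationality": ["Italian", "Dutch", "Spanish"],
    }
)

# Row positions 0 and 1, column positions 1 and 2.
subset = df.iloc[0:2, 1:3]
print(subset)
```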

3. DataFrame.loc[rows, columns]

As we saw above, iloc[] works on positions, not labels. loc[], in contrast, works on labels, not positions. Apart from how slices are handled, it is used in the same way.

Both rows and columns must be labels, and these labels can be given as follows:

  • A single row or column label
  • List of multiple labels
  • Slice of labels

Note: When slicing row or column labels, the end label is included along with the start label, just as when we sliced by row label using the index operator [].

The examples below use the same painters table.
  • Single label:

Syntax: DataFrame.loc[row_label, column_label]

Code:
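The original listing is not preserved; a minimal sketch on an assumed three-row table with the default integer row labels 0, 1, and 2:

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Vincent van Gogh", "Pablo Picasso"],
        "Birth": [1452, 1853, 1881],
        "Nationality": ["Italian", "Dutch", "Spanish"],
    }
)

# Row label 1 and column label "Nationality" give a single value.
cell = df.loc[1, "Nationality"]
print(cell)
```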

  • List of labels:

Syntax: DataFrame.loc[[row_labels], [column_labels]]

Code:
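A hedged sketch with lists of labels (stand-in data assumed):

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Vincent van Gogh", "Pablo Picasso"],
        "Birth": [1452, 1853, 1881],
        "Nationality": ["Italian", "Dutch", "Spanish"],
    }
)

# Rows labeled 0 and 2, columns labeled "Painter" and "Nationality".
subset = df.loc[[0, 2], ["Painter", "Nationality"]]
print(subset)
```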

  • Combinations of a single label and a list of labels:

Syntax: DataFrame.loc[row_label, [column_labels]] or DataFrame.loc[[row_labels], column_label]

Code:
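Mixing a single label with a list of labels can be sketched like this (stand-in data assumed):

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Vincent van Gogh", "Pablo Picasso"],
        "Birth": [1452, 1853, 1881],
        "Nationality": ["Italian", "Dutch", "Spanish"],
    }
)

# One row label, several column labels.
row_part = df.loc[1, ["Painter", "Birth"]]

# Several row labels, one column label.
col_part = df.loc[[0, 2], "Birth"]

print(row_part)
print(col_part)
```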

  • Slices

Syntax: DataFrame.loc[start_row_label:end_row_label, start_column_label:end_column_label]

Code:
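A minimal sketch of label slices on an assumed five-row stand-in with default integer row labels; both end labels are included:

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Birth": [1452, 1606, 1840, 1853, 1881],
        "Death": [1519, 1669, 1926, 1890, 1973],
        "Greatest Artpiece": ["Mona Lisa", "The Night Watch", "Water Lilies",
                              "The Starry Night", "Guernica"],
        "Nationality": ["Italian", "Dutch", "French", "Dutch", "Spanish"],
    }
)

# Row labels 1 through 3 and columns "Death" through "Nationality",
# with the end label of each slice included.
subset = df.loc[1:3, "Death":"Nationality"]
print(subset)
```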

Observe that the slice returns the 1st, 2nd, and 3rd rows and the columns "Death", "Greatest Artpiece", and "Nationality", which means the end row label and the end column label of each slice are included.

With conditions:

Until now, we selected data from the DataFrame using either positions or labels. We can also select data based on conditions in two ways: with loc[] and with the index operator [].

Here are some important points to keep in mind:

1. Python's Boolean keywords and, or, and not cannot be applied to whole columns. Instead, we must use the element-wise operators: & for and, | for or, and ~ for not.

2. We can use any number of conditions, but every condition must be enclosed using parentheses.
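As a hedged example on an assumed five-row stand-in table, two conditions are combined with &, each wrapped in its own parentheses:

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Rembrandt", "Claude Monet",
                    "Vincent van Gogh", "Pablo Picasso"],
        "Birth": [1452, 1606, 1840, 1853, 1881],
        "Nationality": ["Italian", "Dutch", "French", "Dutch", "Spanish"],
    }
)

# Without the parentheses, & would bind more tightly than > and ==.
result = df[(df["Birth"] > 1800) & (df["Nationality"] == "Dutch")]
print(result)
```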

3. Suppose we want to print all the painters born after 1800. We need to check the column "Birth":

df["Birth"] > 1800

This is the condition. If we print it, we get a column of True and False values, one per row. To print the matching rows, we pass the condition to df[]:
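A minimal sketch of both steps, on an assumed five-row stand-in table:

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Painter": ["Leonardo da Vinci", "Rembrandt", "Claude Monet",
                    "Vincent van Gogh", "Pablo Picasso"],
        "Birth": [1452, 1606, 1840, 1853, 1881],
    }
)

# Step 1: the condition alone is a Boolean Series, one value per row.
mask = df["Birth"] > 1800
print(mask)

# Step 2: passing it to df[] keeps only the rows where it is True.
born_after_1800 = df[mask]
print(born_after_1800)
```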

4. Using loc[], we can pass the condition directly to the indexer, just as we pass it to df[]. The additional advantage of loc[] is that we can select columns in the same expression, for example by slicing.

Here is an example:
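The original example is not preserved. A sketch under assumed data, combining a row condition with a column label slice:

```python
import pandas as pd

# Small hypothetical stand-in for the painters table.
df = pd.DataFrame(
    {
        "Birth": [1452, 1606, 1840, 1853, 1881],
        "Death": [1519, 1669, 1926, 1890, 1973],
        "Greatest Artpiece": ["Mona Lisa", "The Night Watch", "Water Lilies",
                              "The Starry Night", "Guernica"],
        "Nationality": ["Italian", "Dutch", "French", "Dutch", "Spanish"],
    }
)

# Rows where Birth > 1800, columns from "Death" through "Nationality".
subset = df.loc[df["Birth"] > 1800, "Death":"Nationality"]
print(subset)
```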

4. head() and tail()

These two methods are mostly used to view samples from a large amount of data. head() returns rows from the start and tail() from the end.

If we don't pass any argument, head() returns the first five rows of the DataFrame, and tail() returns the last five rows. We can pass one argument giving the number of rows we need.

Syntax: DataFrame.head(n) and DataFrame.tail(n)

Code:
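A minimal sketch, using a hypothetical ten-row table with a single "Rank" column so the row counts are easy to check:

```python
import pandas as pd

# Hypothetical ten-row table; "Rank" is an assumed column name.
df = pd.DataFrame({"Rank": range(1, 11)})

# No argument: five rows by default.
first_five = df.head()
last_five = df.tail()

# With an argument: that many rows from the start or the end.
first_two = df.head(2)
last_three = df.tail(3)

print(first_two)
print(last_three)
```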
