Indexing and Selecting a Pandas DataFrame

Pandas is one of the most widely used Python libraries for data analysis. A DataFrame in Pandas is like a table or a two-dimensional array with rows and columns. It is a mutable, heterogeneous data structure: its contents can be changed in place, and different columns can hold different data types. The rows and columns are referred to as its axes.

Pandas provides many functions for manipulating DataFrames for analysis. We can create a DataFrame in several ways, but the basic constructor is pd.DataFrame(). To use this function, or any other function in the library, we first have to import the library:

import pandas as pd
pd.DataFrame()
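
As a minimal sketch of the constructor in use, the snippet below builds a DataFrame from a dictionary. The variable names and the three rows are illustrative stand-ins, not the contents of painters.xlsx.

```python
import pandas as pd

# Illustrative stand-in data; not the contents of painters.xlsx.
data = {
    "Name": ["Leonardo da Vinci", "Vincent van Gogh", "Pablo Picasso"],
    "Birth": [1452, 1853, 1881],
    "Nationality": ["Italian", "Dutch", "Spanish"],
}

# Each dictionary key becomes a column; rows get the default labels 0, 1, 2.
painters = pd.DataFrame(data)
print(painters)
print(painters.shape)  # (number of rows, number of columns)
```

Passing a dictionary of equal-length lists is only one of several accepted inputs; it is used here because it keeps the column names visible.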

For this tutorial, we created a table in an Excel sheet, "painters.xlsx", with information about the 20 greatest painters in the world.

The following Python code loads this table into a Pandas DataFrame:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print(df)

Output:

As the title says, this tutorial is about indexing and selecting a DataFrame. Just as we slice a string using indexes from 0 to length - 1, we can access, copy, and create new DataFrames from parts of an existing DataFrame. This tutorial explains the following ways of doing so:

DataFrame[] and DataFrame.column
DataFrame.loc[]
DataFrame.iloc[]
head() and tail()

1. [] and .

[] is called the index operator, and . is called the attribute operator in Pandas. These operators provide the most basic form of indexing and let us view different subsets of a DataFrame.

Using the Attribute operator(.):

Selecting columns:

We can select only a single column from the DataFrame using this operator. It is restricted to column names that are valid Python identifiers. This means that if the name of the column contains white space, Python cannot parse the expression:

In our Painters table:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print(df.Birth)
print(df.Greatest Artpiece)  # SyntaxError: the space is not allowed in an attribute name

Output:

Observe that a syntax error is raised when we try to access the column "Greatest Artpiece" because of the space. If we still want attribute-style access to such a column, we can use the built-in getattr(DataFrame, column_name) function.

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print(getattr(df, "Greatest Artpiece"))

Output:
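
The same behaviour can be checked without the spreadsheet. This is a minimal sketch, assuming a small stand-in table; the name painters and its two rows are illustrative, not the contents of painters.xlsx.

```python
import pandas as pd

# Illustrative stand-in for the painters table.
painters = pd.DataFrame({
    "Name": ["Leonardo da Vinci", "Vincent van Gogh"],
    "Birth": [1452, 1853],
    "Greatest Artpiece": ["Mona Lisa", "The Starry Night"],
})

# "Birth" is a valid Python identifier, so attribute access works.
birth = painters.Birth

# "Greatest Artpiece" contains a space, so the name is passed as a string.
artpiece = getattr(painters, "Greatest Artpiece")

# Both routes return the same column as the index operator.
same = artpiece.equals(painters["Greatest Artpiece"])
print(birth)
print(same)
```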

Using the Index operator:

Selecting columns:

We pass the name of the column to the operator. Here, there is no restriction on spaces in the column name:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print(df["Greatest Artpiece"])

Note that the name of the column has to be passed in quotes.

Output:

Another feature of this operator is that we can select multiple columns by passing a list of the required columns to it:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print(df[["Name", "Nationality"]])

Output:
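
One detail worth seeing in code is the type of the result: a single name gives back a Series, while a list of names gives back a DataFrame. The sketch below assumes a small illustrative table rather than painters.xlsx.

```python
import pandas as pd

# Illustrative stand-in for the painters table.
painters = pd.DataFrame({
    "Name": ["Leonardo da Vinci", "Vincent van Gogh", "Pablo Picasso"],
    "Birth": [1452, 1853, 1881],
    "Nationality": ["Italian", "Dutch", "Spanish"],
})

single = painters["Name"]                  # one name: a Series
wrapped = painters[["Name"]]               # list with one name: a DataFrame
multi = painters[["Name", "Nationality"]]  # list of names: a DataFrame

print(type(single).__name__)
print(type(wrapped).__name__)
print(multi)
```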

Selecting Rows:

Using the slicing operator, we can select rows of the DataFrame with the same index operator. The syntax for slicing is the same as for any other sequence, DataFrame[start:stop:step]:

start: Starting index/row position (inclusive)

stop: Position to stop slicing (exclusive)

step: The interval between selected rows

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print(df[1: 4])  # rows at positions 1, 2, and 3

Output:

We can also slice by row labels if the DataFrame was created with them. Here is an example:

import pandas as pd
dictionary = {"Name": ["Harry", "Zayn", "Niall"], "Age": [28, 28, 29]}
df1 = pd.DataFrame(dictionary, index = ["Member 1", "Member 2", "Member 3"])
print(df1)
print()
print(df1[0: 2])
print()
print(df1["Member 1": "Member 3"])

Output:

Observe that the row labelled Member 3 is also printed. When we slice by position, the end position is exclusive, but when we slice by row labels, the end label is inclusive.

Here are the points to conclude about the index operator:

We can select both rows and columns from the DataFrame using [].
When selecting columns, we can select a single column or multiple columns.
When we use the slicing operator, it selects rows.
We can slice rows using positions or row labels. When we use positions, the end row isn't selected, but when we use row labels, the end row is selected.
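
The points above can be sketched with the small member table from the earlier example. The variable names by_position and by_label are illustrative.

```python
import pandas as pd

members = pd.DataFrame(
    {"Name": ["Harry", "Zayn", "Niall"], "Age": [28, 28, 29]},
    index=["Member 1", "Member 2", "Member 3"],
)

# Slicing by position: the stop position 2 is excluded.
by_position = members[0:2]

# Slicing by label: the stop label "Member 3" is included.
by_label = members["Member 1":"Member 3"]

print(by_position)
print()
print(by_label)
```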

So far, we couldn't select both rows and columns of a DataFrame simultaneously. Pandas has two indexers built specifically for selecting and subsetting DataFrames: iloc[] and loc[]. Each has a clearly defined behaviour. We'll learn about them now.

2. DataFrame.iloc

Syntax:

DataFrame.iloc[rows, columns]

Both rows and columns must be positions and not labels, and these positions can be given as follows:

A single position
List of multiple positions
Slice of positions

Here is the table we'll be selecting from:

Note that positions are zero-based: position 0 refers to the 1st row or column, position 1 to the 2nd, and so on.
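
To make the zero-based counting concrete, here is a minimal sketch on a small stand-in table. The name painters and its rows are illustrative, not the contents of painters.xlsx.

```python
import pandas as pd

# Illustrative stand-in for the painters table.
painters = pd.DataFrame({
    "Name": ["Leonardo da Vinci", "Vincent van Gogh", "Pablo Picasso"],
    "Birth": [1452, 1853, 1881],
    "Death": [1519, 1890, 1973],
})

first_cell = painters.iloc[0, 0]          # 1st row, 1st column
second_birth = painters.iloc[1, 1]        # 2nd row, 2nd column
corner = painters.iloc[[0, 1], [0, 1]]    # first two rows and columns

print(first_cell)
print(second_birth)
print(corner)
```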

Single position:

Syntax:

DataFrame.iloc[row_position, column_position]

Code:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("The value in 2nd row and 2nd column:")
print(df.iloc[1, 1])  # position 0 is the 1st row/column, position 1 is the 2nd

Output:

List of positions:

Syntax:

DataFrame.iloc[[r1, r2...], [c1, c2...]]

Code:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("First three rows and columns:")
print(df.iloc[[0, 1, 2], [0, 1, 2]])

Output:

Combinations of single position and list of positions:

Syntax:

DataFrame.iloc[row_position, [c1, c2...]]  #Single row, multiple columns
DataFrame.iloc[[r1, r2...], column_position] #Multiple rows and single column

Code:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("Values in first two columns in 2nd row:")
print(df.iloc[1, [0, 1]])
print()
print("Values in first two rows in 2nd column:")
print(df.iloc[[0, 1], 1])

Output:

Slices

Syntax:

DataFrame.iloc[row_start:row_stop:row_step, column_start:column_stop:column_step]

Code:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("Values in first row:")
print(df.iloc[0, ::])
print()
print("Values in first column:")
print(df.iloc[::, 0])
print()
print("Values in 2, 3, 4 rows and 3, 4, 5 columns:")
print(df.iloc[1: 4, 2: 5])  # the stop positions 4 and 5 are exclusive
print()
print("Values in even rows and even columns:")
print(df.iloc[1::2, 1::2])

Output:
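
The slice forms can be tried without the spreadsheet. This is a minimal sketch, assuming a five-row stand-in table with the same column names; the rows are illustrative, not the contents of painters.xlsx.

```python
import pandas as pd

# Illustrative stand-in with the same five column names as the tutorial table.
painters = pd.DataFrame({
    "Name": ["Leonardo da Vinci", "Rembrandt", "Claude Monet",
             "Vincent van Gogh", "Pablo Picasso"],
    "Birth": [1452, 1606, 1840, 1853, 1881],
    "Death": [1519, 1669, 1926, 1890, 1973],
    "Greatest Artpiece": ["Mona Lisa", "The Night Watch", "Water Lilies",
                          "The Starry Night", "Guernica"],
    "Nationality": ["Italian", "Dutch", "French", "Dutch", "Spanish"],
})

# Rows at positions 1, 2, 3 and columns at positions 2, 3, 4.
# The stop positions (4 and 5) are excluded.
middle = painters.iloc[1:4, 2:5]

# Every second row and column, starting from position 1.
even = painters.iloc[1::2, 1::2]

print(middle)
print()
print(even)
```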

3. DataFrame.loc[rows, columns]

As we saw above, iloc[] works on positions, not labels. loc[], on the contrary, works on labels, not positions. The rest of the functionality is the same, except for how slices treat the end point.

Both rows and columns must be labels, and these labels can be given as follows:

A single row or column label
List of multiple labels
Slice of labels

Note: When slicing row or column labels, the end label is included along with the start label, just as when we sliced by labels using the index operator [].
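
The difference in end-point handling is easiest to see side by side. Below is a minimal sketch on a stand-in table with the default integer row labels; the name painters and its rows are illustrative.

```python
import pandas as pd

# Illustrative stand-in; row labels are the default integers 0 to 3.
painters = pd.DataFrame({
    "Name": ["Leonardo da Vinci", "Rembrandt", "Claude Monet", "Pablo Picasso"],
    "Birth": [1452, 1606, 1840, 1881],
    "Death": [1519, 1669, 1926, 1973],
})

# loc slices by label: the end labels 3 and "Death" are included.
by_label = painters.loc[1:3, "Name":"Death"]

# iloc slices by position: the stop positions 3 and 2 are excluded.
by_position = painters.iloc[1:3, 0:2]

print(by_label.shape)
print(by_position.shape)
```

Because the row labels here happen to be integers, 1:3 looks identical in both calls, yet loc returns three rows and iloc returns two.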

Here is the table we'll be selecting from:

Single label:

Syntax:

DataFrame.loc[row_label, column_label]

Code:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("The value in 2nd row and 2nd column:")
print(df.loc[1, "Birth"])

Output:

List of labels:

Syntax:

DataFrame.loc[[r1, r2...], [c1, c2...]]

Code:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("First three rows and columns:")
print(df.loc[[0, 1, 2], ["Name", "Birth", "Death"]])

Output:

Combinations of single label and list of labels:

Syntax:

DataFrame.loc[row_label, [c1, c2...]]  #Single row, multiple columns
DataFrame.loc[[r1, r2...], column_label] #Multiple rows and single column

Code:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("Values in first two columns in 2nd row:")
print(df.loc[1, ["Name", "Birth"]])
print()
print("Values in first two rows in 2nd column:")
print(df.loc[[0, 1], "Birth"])

Output:

Slices

Syntax:

DataFrame.loc[row_start:row_end:row_step, column_start:column_end:column_step]

Code:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("Values in first row:")
print(df.loc[0, ::])
print()
print("Values in first column:")
print(df.loc[::, "Name"])
print()
print("Values in 2, 3, 4 rows and 3, 4, 5 columns:")
print(df.loc[1: 3, "Death": "Nationality"])
print()
print("Values in even rows and even columns:")
print(df.loc[1::2, "Birth"::2])

Output:

Observe that for df.loc[1: 3, "Death": "Nationality"], the output contains:

Rows: the rows labelled 1, 2, and 3

Columns: "Death", "Greatest Artpiece", and "Nationality"

This means the last row and the last column of the slice are also included.

With conditions:

Until now, we selected data from the DataFrame using either position numbers or labels. We can also select data based on conditions, in two ways: with loc[] and with the index operator [].

Here are some important points to keep in mind:

1. Python's keywords and, or, and not do not work element-wise on columns, so to combine conditions we must use:

& for and operation
| for or operation
~ for not operation

2. We can use any number of conditions, but every condition must be enclosed in parentheses, because &, |, and ~ bind more tightly than comparison operators such as > and ==.
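
These two rules can be sketched on a small stand-in table. The name painters and its four rows are illustrative, not the contents of painters.xlsx.

```python
import pandas as pd

# Illustrative stand-in for the painters table.
painters = pd.DataFrame({
    "Name": ["Leonardo da Vinci", "Claude Monet",
             "Vincent van Gogh", "Jackson Pollock"],
    "Birth": [1452, 1840, 1853, 1912],
    "Nationality": ["Italian", "French", "Dutch", "American"],
})

# Each comparison is wrapped in parentheses before being combined.
born_after_1800 = painters["Birth"] > 1800
american_or_french = (
    (painters["Nationality"] == "American")
    | (painters["Nationality"] == "French")
)
selected = painters[born_after_1800 & american_or_french]

# ~ negates a condition element-wise.
not_dutch = painters[~(painters["Nationality"] == "Dutch")]

# The keyword "and" tries to reduce each Series to one True/False value,
# which pandas refuses to do.
try:
    painters[(painters["Birth"] > 1800) and (painters["Nationality"] == "French")]
    keyword_failed = False
except ValueError as error:
    keyword_failed = True
    print("Using 'and' raised ValueError:", error)

print(selected)
print(not_dutch)
```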

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("American or French painter born after 1800:")
print(df[(df["Birth"]>1800) & ((df["Nationality"]=="American") | (df["Nationality"]=="French"))])

Output:

3. Suppose we want to print all the painters born after 1800. We need to check the column Birth:

df["Birth"]>1800

This is the condition. If we print it, we get a column of True and False values, one per row. To print the matching rows instead, we pass the condition to df[]:

df[df["Birth"]>1800]

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("Painters born after 1800:")
print(df["Birth"]>1800)
print()
print(df[df["Birth"]>1800])

Output:

4. Using loc[], we can pass the condition in the same way as we pass it to df[]. The additional advantage of loc[] is that we can select columns in the same expression, for example by slicing.

Here is an example:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("Using loc[]:")
print("Painters born after 1800:")
print(df.loc[(df["Birth"]>1800)])
print("Selecting a few columns:")
print(df.loc[(df["Birth"]>1800), "Name": "Birth"])

Output:
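
As a self-contained sketch of the same idea, the snippet below filters rows by a condition and picks columns in one loc[] call. The table and variable names are illustrative stand-ins for painters.xlsx.

```python
import pandas as pd

# Illustrative stand-in for the painters table.
painters = pd.DataFrame({
    "Name": ["Leonardo da Vinci", "Claude Monet", "Vincent van Gogh"],
    "Birth": [1452, 1840, 1853],
    "Nationality": ["Italian", "French", "Dutch"],
})

condition = painters["Birth"] > 1800

# Rows where the condition is True, columns from "Name" to "Birth" inclusive.
modern = painters.loc[condition, "Name":"Birth"]

# The same rows, but only one column, which comes back as a Series.
names = painters.loc[condition, "Name"]

print(modern)
print(names)
```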

4. head() and tail()

These two methods are mostly used to view samples from a large amount of data. head() returns rows from the start and tail() from the end.

If we don't pass any argument, head() returns the first five rows of the DataFrame, and tail() returns the last five rows. We can pass one argument specifying the number of rows we need.

Syntax:

DataFrame.head(number of rows)
DataFrame.tail(number of rows)

Code:

import pandas as pd
df = pd.read_excel("painters.xlsx", index_col = 0)
print("First five rows: ")
df1 = df.head()
print(df1)
print("\nFirst 3 rows: ")
print(df.head(3))
print("\nLast five rows: ")
df2 = df.tail()
print(df2)
print("\nLast 3 rows: ")
print(df.tail(3))

Output:
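
The defaults can be checked on any table with more than five rows. This is a minimal sketch using a hypothetical ten-row table named table with a single column named value.

```python
import pandas as pd

# Hypothetical table with ten rows holding the values 10 to 19.
table = pd.DataFrame({"value": range(10, 20)})

first_five = table.head()    # no argument: the first five rows
first_three = table.head(3)
last_five = table.tail()     # no argument: the last five rows
last_three = table.tail(3)

print(first_five)
print(last_three)
```

Both methods return a new DataFrame rather than printing anything themselves, which is why the results are passed to print().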