Pandas DataFrame.join()

When we want to concatenate our DataFrames, we can add them with each other by stacking them either vertically or side by side. Another method to combine these DataFrames is to use columns in each dataset that contain common values. The method of combining the DataFrame using common fields is called "joining". The method that we use for combining the DataFrame is a join() method. The columns that contain common values are called "join key".

The join() method is often useful when one DataFrame is a lookup table that contains additional data added into the other DataFrame. It is a convenient method that can combine the columns of two differently-indexed DataFrames into a single DataFrame.

Identifying join keys

To determine the appropriate join keys, first, we have to define required fields that are shared between the DataFrames. Both the DataFrames consist of the columns that have the same name and also contain the same data.

Inner joins

Inner join can be defined as the most commonly used join. Basically, its main task is to combine the two DataFrames based on a join key and returns a new DataFrame. The returned DataFrame consists of only selected rows that have matching values in both of the original DataFrame.

Left joins

If we want to add some information into the DataFrame without losing any of the data, we can simply do it through a different type of join called a "left outer join" or "left join".

Like an inner join, left join also uses the join keys to combine two DataFrames, but unlike inner join, it returns all of the rows from the left DataFrame, even those rows whose join keys do not include the values in the right DataFrame.

Syntax:

Parameters:

other: It refers to the DataFrame or Series.

In this case, the index should be similar to one of the columns. If we pass a Series, the named attribute has to be set for using it as the column name in the resulting joined DataFrame.

on: It is an optional parameter that refers to array-like or str values.

It refers to a column or index level name in the caller to join on the index. Otherwise, it joins index-on-index. If multiple values are present, then the other DataFrame must have MultiIndex. It is like an Excel VLOOKUP operation that can pass an array as the join key if it is not already contained within the calling DataFrame.

how: It refers to 'left', 'right', 'outer', 'inner' values that mainly work on how to handle the operation of the two objects. The default value of how is left.

left: It uses a calling frame's index or column if the parameter on is specified.
right: It uses the other index.
outer: It is used to form a union of calling frame's index or column if parameter on is specified with other's index, and also sort it lexicographically.
inner: It is used to form an intersection of calling frame's index or column if parameter on is specified with other's index. So, due to this, it preserves the order of the calling object.

lsuffix: It refers to a string object that has the default value ''. It uses the Suffix from the left frame's overlapping columns.

rsuffix: It refers to a string value, that has the default value ''. It uses the Suffix from the right frame's overlapping columns.

sort: It consists of a boolean value that sorts the resulting DataFrame lexicographically by the join key. If we pass False value, then the order of the join key mainly depends on the join type, i.e., how.

Example: The below example shows the working of join() function.

import pandas as pd
info = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
x = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
'B': ['B0', 'B1', 'B2']})
info.join(x, lsuffix='_caller', rsuffix='_x') 
info.set_index('key').join(x.set_index('key'))
info.join(x.set_index('key'), on='key')

Output:

    key  A   B
0   K0   A0  B0
1   K1   A1  B1
2   K2   A2  B2
3   K3   A3  NaN
4   K4   A4  NaN
5   K5   A5  NaN

Example2: The below example joins the two MultiIndexes:

import pandas as pd
leftindex = pd.MultiIndex.from_product([list('xyz'), list('pq'), [1, 2]],
                                      names=['xyz', 'pq', 'num'])
left = pd.DataFrame({'value': range(12)}, index=leftindex)
left

Output:

			value
xyz	pq	num	
x	p	1	0
2	1
q	1	2
2	3
y	p	1	4
2	5
q	1	6
2	7
z	p	1	8
2	9
q	1	10
2	11

Next TopicDataFrame.mean()

← prev next →

For Videos Join Our Youtube Channel: Join Now

Feedback

Send your Feedback to [email protected]

Help Others, Please Share

Learn Latest Tutorials

Splunk

SPSS

Swagger

Transact-SQL

Tumblr

ReactJS

Regex

Reinforcement Learning

R Programming

RxJS

React Native

Python Design Patterns

Python Pillow

Python Turtle

Keras

Preparation

Aptitude

Reasoning

Verbal Ability

Interview Questions

Company Questions

Trending Technologies

Artificial Intelligence

AWS

Selenium

Cloud Computing

Hadoop

ReactJS

Data Science

Angular 7

Blockchain

Git

Machine Learning

DevOps

B.Tech / MCA

DBMS

Data Structures

DAA

Operating System

Computer Network

Compiler Design

Computer Organization

Discrete Mathematics

Ethical Hacking

Computer Graphics

Software Engineering

Web Technology

Cyber Security

Automata

C Programming

C++

Java

.Net

Python

Programs

Control System

Data Mining

Data Warehouse

^{Like/Subscribe us for latest updates or newsletter}

Pandas Tutorial

Pandas Series