Pandas DataFrame.merge()

Pandas merge() is defined as the process of bringing the two datasets together into one and aligning the rows based on the common attributes or columns. It is an entry point for all standard database join operations between DataFrame objects:

Syntax:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True)

Parameters:

right: DataFrame or named Series
It is an object which merges with the DataFrame.
how: {'left', 'right', 'outer', 'inner'}, default 'inner'
Type of merge to be performed.
- left: It use only keys from the left frame, similar to a SQL left outer join; preserve key order.
- right: It use only keys from the right frame, similar to a SQL right outer join; preserve key order.
- outer: It used the union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: It use the intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
on: label or list
It is a column or index level names to join on. It must be found in both the left and right DataFrames. If on is None and not merging on indexes, then this defaults to the intersection of the columns in both DataFrames.
left_on: label or list, or array-like
It is a column or index level names from the left DataFrame to use as a key. It can be an array with length equal to the length of the DataFrame.
right_on: label or list, or array-like
It is a column or index level names from the right DataFrame to use as keys. It can be an array with length equal to the length of the DataFrame.
left_index : bool, default False
It uses the index from the left DataFrame as the join key(s), If true. In the case of MultiIndex (hierarchical), many keys in the other DataFrame (either the index or some columns) should match the number of levels.
right_index : bool, default False
It uses the index from the right DataFrame as the join key. It has the same usage as the left_index.
sort: bool, default False
If True, it sorts the join keys in lexicographical order in the result DataFrame. Otherwise, the order of the join keys depends on the join type (how keyword).
suffixes: tuple of the (str, str), default ('_x', '_y')
It suffixes to apply to overlap the column names in the left and right DataFrame, respectively. The columns use (False, False) values to raise an exception on overlapping.
copy: bool, default True
If True, it returns a copy of the DataFrame.
Otherwise, It can avoid the copy.
indicator: bool or str, default False
If True, It adds a column to output DataFrame "_merge" with information on the source of each row. If it is a string, a column with information on the source of each row will be added to output DataFrame, and the column will be named value of a string. The information column is defined as a categorical-type and it takes value of:
- "left_only" for the observations whose merge key appears only in 'left' of the DataFrame, whereas,
- "right_only" is defined for observations in which merge key appears only in 'right' of the DataFrame,
- "both" if the observation's merge key is found in both of them.
validate: str, optional
If it is specified, it checks the merge type that is given below:
- "one_to_one" or "1:1": It checks if merge keys are unique in both the left and right datasets.
- "one_to_many" or "1:m": It checks if merge keys are unique in only the left dataset.
- "many_to_one" or "m:1": It checks if merge keys are unique in only the right dataset.
- "many_to_many" or "m:m": It is allowed, but does not result in checks.

Example1: Merge two DataFrames on a key

# import the pandas library
import pandas as pd
left = pd.DataFrame({
   'id':[1,2,3,4],
   'Name': ['John', 'Parker', 'Smith', 'Parker'],
   'subject_id':['sub1','sub2','sub4','sub6']})
right = pd.DataFrame({
    'id':[1,2,3,4],
   'Name': ['William', 'Albert', 'Tony', 'Allen'],
   'subject_id':['sub2','sub4','sub3','sub6']})
print (left)
print (right)

Output

    id      Name     subject_id

0   1       John        sub1
1   2       Parker      sub2
2   3       Smith       sub4
3   4       Parker      sub6

    id      Name     subject_id

0   1       William     sub2
1   2       Albert      sub4
2   3       Tony        sub3
3   4       Allen       sub6

Example2: Merge two DataFrames on multiple keys:

import pandas as pd
left = pd.DataFrame({
   'id':[1,2,3,4,5],
   'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
   'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
	'id':[1,2,3,4,5],
   'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
   'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print pd.merge(left,right,on='id')

Output

    id   Name_x    subject_id_x   Name_y     subject_id_y
0   1     John      sub1          William     sub2
1   2     Parker    sub2          Albert      sub4
2   3     Smith     sub4          Tony        sub3
3   4     Parker    sub6          Allen       sub6

Next TopicDataFrame.pivot_table()

← prev next →