Python pandas merge(join) 通过单列或多列合并连接两个DataFrame

Python pandas中处理两个DataFrame时,有些情况我们可能需要将两个DataFrame合并成一个DataFrame,本文主要介绍Python pandas 中通过单列或多列合并连接两个DataFrame的方法,以及相关的示例代码。

1、内连接(inner join)

内连接是满足条件时,左边的和右边的DataFrame都存在的数据。

1)单列条件

import numpy as np
import pandas as pd
 
 
left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],
                    'Name': ['Levi', 'John', 'Lynn',
                             'Mickey', 'Rose'],
                    'order': ['1', '2', '3', '4', '5']})
 
right = pd.DataFrame({'Id': ['2', '4', '6', '7', '8'],
                    'Gender': ['F', 'M', 'M', 'F', 'F'],
                    'Like': ['C', 'Java',
                                 'JS', 'Linux','Python']})
#合并        
df = pd.merge(left, right, how ='inner', on ='Id')
print(df)
#两个DataFrame列名不同,使用left_on和right_on指定列名
#df = pd.merge(left,right,how ='inner',left_on = "XId",right_on = "YId")

输出:

  Id    Name order Gender  Like
0  2    John     2      F     C
1  4  Mickey     4      M  Java

2)多列条件

import numpy as np
import pandas as pd
 
 
left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],
                     'UserId':['1', '2', '3', '4', '5'],
                    'Name': ['Levi', 'John', 'Lynn',
                             'Mickey', 'Rose'],
                    'order': ['1', '2', '3', '4', '5']})
 
right = pd.DataFrame({'Id': ['2', '9', '3', '4', '8'],
                    'UserId': ['1', '2', '3', '4', '5'],
                    'Gender': ['F', 'M', 'M', 'F', 'F'],
                    'Like': ['C', 'Java',
                                 'JS', 'Linux','Python']})
#合并        
df = pd.merge(left, right, how ='inner', on =['Id','UserId'])
print(df)
#两个DataFrame列名不同,使用left_on和right_on指定列名
#df = pd.merge(left,right,how ='inner',left_on = ["XId","UserId"],right_on = ["YId","UserId"])

输出:

  Id UserId    Name order Gender   Like
0  3      3    Lynn     3      M     JS
1  4      4  Mickey     4      F  Linux

2、外连接(outer join)

外连接返回左侧DataFrame中的所有行,右侧DataFrame中的所有行,并在可能的情况下匹配行,在其他地方使用NaN

1)单列条件

import numpy as np
import pandas as pd
 
 
left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],
                    'Name': ['Levi', 'John', 'Lynn',
                             'Mickey', 'Rose'],
                    'order': ['1', '2', '3', '4', '5']})
 
right = pd.DataFrame({'Id': ['2', '4', '6', '7', '8'],
                    'Gender': ['F', 'M', 'M', 'F', 'F'],
                    'Like': ['C', 'Java',
                                 'JS', 'Linux','Python']})
#合并        
df = pd.merge(left, right, how ='outer', on ='Id')
print(df)
#两个DataFrame列名不同,使用left_on和right_on指定列名
#df = pd.merge(left,right,how ='outer',left_on = "XId",right_on = "YId")

输出:

  Id    Name order Gender    Like
0  1    Levi     1    NaN     NaN
1  2    John     2      F       C
2  3    Lynn     3    NaN     NaN
3  4  Mickey     4      M    Java
4  5    Rose     5    NaN     NaN
5  6     NaN   NaN      M      JS
6  7     NaN   NaN      F   Linux
7  8     NaN   NaN      F  Python

2)多列条件

import numpy as np
import pandas as pd
 
 
left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],
                     'UserId':['1', '2', '3', '4', '5'],
                    'Name': ['Levi', 'John', 'Lynn',
                             'Mickey', 'Rose'],
                    'order': ['1', '2', '3', '4', '5']})
 
right = pd.DataFrame({'Id': ['2', '9', '3', '4', '8'],
                    'UserId': ['1', '2', '3', '4', '5'],
                    'Gender': ['F', 'M', 'M', 'F', 'F'],
                    'Like': ['C', 'Java',
                                 'JS', 'Linux','Python']})
#合并        
df = pd.merge(left, right, how ='outer', on =['Id','UserId'])
print(df)
#两个DataFrame列名不同,使用left_on和right_on指定列名
#df = pd.merge(left,right,how ='outer',left_on = ["XId","UserId"],right_on = ["YId","UserId"])

输出:

  Id UserId    Name order Gender    Like
0  1      1    Levi     1    NaN     NaN
1  2      2    John     2    NaN     NaN
2  3      3    Lynn     3      M      JS
3  4      4  Mickey     4      F   Linux
4  5      5    Rose     5    NaN     NaN
5  2      1     NaN   NaN      F       C
6  9      2     NaN   NaN      M    Java
7  8      5     NaN   NaN      F  Python

3、左连接(left join)

使用左连接,将显示第一个DataFrame中的所有记录,而不管第一个DataFrame中的键是否可以在第二个DataFrame中找到。而对于第二个DataFrame,只会显示在第一个DataFrame中可以找到的具有第二个DataFrame中的key的记录。

1)单列条件

import numpy as np
import pandas as pd
 
 
left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],
                    'Name': ['Levi', 'John', 'Lynn',
                             'Mickey', 'Rose'],
                    'order': ['1', '2', '3', '4', '5']})
 
right = pd.DataFrame({'Id': ['2', '4', '6', '7', '8'],
                    'Gender': ['F', 'M', 'M', 'F', 'F'],
                    'Like': ['C', 'Java',
                                 'JS', 'Linux','Python']})
#合并        
df = pd.merge(left, right, how ='left', on ='Id')
print(df)
#两个DataFrame列名不同,使用left_on和right_on指定列名
#df = pd.merge(left,right,how ='left',left_on = "XId",right_on = "YId")

输出:

  Id    Name order Gender  Like
0  1    Levi     1    NaN   NaN
1  2    John     2      F     C
2  3    Lynn     3    NaN   NaN
3  4  Mickey     4      M  Java
4  5    Rose     5    NaN   NaN

2)多列条件

import numpy as np
import pandas as pd
 
 
left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],
                     'UserId':['1', '2', '3', '4', '5'],
                    'Name': ['Levi', 'John', 'Lynn',
                             'Mickey', 'Rose'],
                    'order': ['1', '2', '3', '4', '5']})
 
right = pd.DataFrame({'Id': ['2', '9', '3', '4', '8'],
                    'UserId': ['1', '2', '3', '4', '5'],
                    'Gender': ['F', 'M', 'M', 'F', 'F'],
                    'Like': ['C', 'Java',
                                 'JS', 'Linux','Python']})
#合并        
df = pd.merge(left, right, how ='left', on =['Id','UserId'])
print(df)
#两个DataFrame列名不同,使用left_on和right_on指定列名
#df = pd.merge(left,right,how ='left',left_on = ["XId","UserId"],right_on = ["YId","UserId"])

输出:

  Id UserId    Name order Gender   Like
0  1      1    Levi     1    NaN    NaN
1  2      2    John     2    NaN    NaN
2  3      3    Lynn     3      M     JS
3  4      4  Mickey     4      F  Linux
4  5      5    Rose     5    NaN    NaN

4、右连接(right join)

对于右连接,将显示来自第二个DataFrame的所有记录。但是,只会显示在第二个DataFrame中可以找到的第一个DataFrame中具有Key的记录。

1)单列条件

import numpy as np
import pandas as pd
 
 
left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],
                    'Name': ['Levi', 'John', 'Lynn',
                             'Mickey', 'Rose'],
                    'order': ['1', '2', '3', '4', '5']})
 
right = pd.DataFrame({'Id': ['2', '4', '6', '7', '8'],
                    'Gender': ['F', 'M', 'M', 'F', 'F'],
                    'Like': ['C', 'Java',
                                 'JS', 'Linux','Python']})
#合并        
df = pd.merge(left, right, how ='right', on ='Id')
print(df)
#两个DataFrame列名不同,使用left_on和right_on指定列名
#df = pd.merge(left,right,how ='right',left_on = "XId",right_on = "YId")

输出:

  Id    Name order Gender    Like
0  2    John     2      F       C
1  4  Mickey     4      M    Java
2  6     NaN   NaN      M      JS
3  7     NaN   NaN      F   Linux
4  8     NaN   NaN      F  Python

2)多列条件

import numpy as np
import pandas as pd
 
 
left = pd.DataFrame({'Id': ['1', '2', '3', '4', '5'],
                     'UserId':['1', '2', '3', '4', '5'],
                    'Name': ['Levi', 'John', 'Lynn',
                             'Mickey', 'Rose'],
                    'order': ['1', '2', '3', '4', '5']})
 
right = pd.DataFrame({'Id': ['2', '9', '3', '4', '8'],
                    'UserId': ['1', '2', '3', '4', '5'],
                    'Gender': ['F', 'M', 'M', 'F', 'F'],
                    'Like': ['C', 'Java',
                                 'JS', 'Linux','Python']})
#合并        
df = pd.merge(left, right, how ='right', on =['Id','UserId'])
print(df)
#两个DataFrame列名不同,使用left_on和right_on指定列名
#df = pd.merge(left,right,how ='right',left_on = ["XId","UserId"],right_on = ["YId","UserId"])

输出:

  Id UserId    Name order Gender    Like
0  2      1     NaN   NaN      F       C
1  9      2     NaN   NaN      M    Java
2  3      3    Lynn     3      M      JS
3  4      4  Mickey     4      F   Linux
4  8      5     NaN   NaN      F  Python

推荐阅读
cjavapy编程之路首页