示例数据:
np.random.seed(0)
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': np.random.randn(4)})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': np.random.randn(4)})
left
key value 0 A 1.764052 1 B 0.400157 2 C 0.978738 3 D 2.240893
right
key value 0 B 1.867558 1 D -0.977278 2 E 0.950088 3 F -0.151357
1、实现内连接(INNER JOIN)
left.merge(right, on='key')
# 或者
# left.merge(right, on='key', how='inner')
key value_x value_y 0 B 0.400157 1.867558 1 D 2.240893 -0.977278
2、实现左连接( LEFT JOIN)
left.merge(right, on='key', how='left')
key value_x value_y 0 A 1.764052 NaN 1 B 0.400157 1.867558 2 C 0.978738 NaN 3 D 2.240893 -0.977278
3、实现右连接(RIGHT JOIN)
left.merge(right, on='key', how='right')
key value_x value_y 0 B 0.400157 1.867558 1 D 2.240893 -0.977278 2 E NaN 0.950088 3 F NaN -0.151357
4、实现全连接(FULL OUTER JOIN)
left.merge(right, on='key', how='outer')
key value_x value_y 0 A 1.764052 NaN 1 B 0.400157 1.867558 2 C 0.978738 NaN 3 D 2.240893 -0.977278 4 E NaN 0.950088 5 F NaN -0.151357
5、排除左连接
先执行左外连接,然后只过滤(不包括)来自左的行。
(left.merge(right, on='key', how='left', indicator=True)
.query('_merge == "left_only"')
.drop('_merge', 1))
key value_x value_y 0 A 1.764052 NaN 2 C 0.978738 NaN
或
left.merge(right, on='key', how='left', indicator=True)
key value_x value_y _merge 0 A 1.764052 NaN left_only 1 B 0.400157 1.867558 both 2 C 0.978738 NaN left_only 3 D 2.240893 -0.977278 both
6、排除右连接
执行右外连接,然后只过滤(不包括)来自右的行。
(left.merge(right, on='key', how='right', indicator=True)
.query('_merge == "right_only"')
.drop('_merge', 1))
key value_x value_y 2 E NaN 0.950088 3 F NaN -0.151357
7、排除全连接
(left.merge(right, on='key', how='outer', indicator=True)
.query('_merge != "both"')
.drop('_merge', 1))
key value_x value_y 0 A 1.764052 NaN 2 C 0.978738 NaN 4 E NaN 0.950088 5 F NaN -0.151357
8、键列的不同名称
如果键列的名称不同(例如,left用keyLeft和right用keyRight代替),key
那么将必须指定left_on和right_on作为参数,而不是on:
示例数据:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value 0 A 1.764052 1 B 0.400157 2 C 0.978738 3 D 2.240893
right2
keyRight value 0 B 1.867558 1 D -0.977278 2 E 0.950088 3 F -0.151357
例如,
left2.merge(right2, left_on='keyLeft', right_on='keyRight', how='inner')
keyLeft value_x keyRight value_y 0 B 0.400157 B 1.867558 1 D 2.240893 D -0.977278
9、避免在输出中重复键列
在keyLeftfromleft和keyRightfrom上进行合并时right,如果只希望在输出中使用keyLeft或keyRight(但不同时使用)中的任何一个,则可以通过将索引设置,
示例数据:
left2 = left.rename({'key':'keyLeft'}, axis=1)
right2 = right.rename({'key':'keyRight'}, axis=1)
left2
keyLeft value 0 A 1.764052 1 B 0.400157 2 C 0.978738 3 D 2.240893
right2
keyRight value 0 B 1.867558 1 D -0.977278 2 E 0.950088 3 F -0.151357
例如,
left3 = left2.set_index('keyLeft')
left3.merge(right2, left_index=True, right_on='keyRight')
value_x keyRight value_y 0 0.400157 B 1.867558 1 2.240893 D -0.977278
10、仅合并其中一个的单个列 DataFrames
right3 = right.assign(newcol=np.arange(len(right)))
right3
key value newcol 0 B 1.867558 0 1 D -0.977278 1 2 E 0.950088 2 3 F -0.151357 3
如果只需要合并“new_val”(不包含任何其他列),通常可以在合并之前仅将其子集作为子集:
left.merge(right3[['key', 'newcol']], on='key')
key value newcol 0 B 0.400157 0 1 D 2.240893 1
如果要进行左外部联接,则性能更高的解决方案将涉及map:
# left['newcol'] = left['key'].map(right3.set_index('key')['newcol']))
left.assign(newcol=left['key'].map(right3.set_index('key')['newcol']))
key value newcol 0 A 1.764052 NaN 1 B 0.400157 0.0 2 C 0.978738 NaN 3 D 2.240893 1.0
11、合并多列
要加入对多列,指定列表on(或left_on和right_on)。
left.merge(right, on=['key1', 'key2'] ...)
列名称不同
left.merge(right, left_on=['lkey1', 'lkey2'], right_on=['rkey1', 'rkey2'])