示例数据:
df_first = pd.DataFrame([[1, 'A',1000], [2, 'B',np.NaN],[3,np.NaN,3000],[4, 'D',8000],[5, 'E',6000]], columns=['EmpID', 'Name','Salary'])
df_second = pd.DataFrame([[1, 'A','HR','Delhi'], [8, 'B','Admin','Mumbai'],[3,'C','Finance',np.NaN],[9, 'D','Ops','Banglore'],[5, 'E','Programming',np.NaN],[10, 'K','Analytics','Mumbai']], columns=['EmpID', 'Name','Department','Location'])
print df_first
print df_second
EmpID Name Salary 0 1 A 1000.0 1 2 B NaN 2 3 NaN 3000.0 3 4 D 8000.0 4 5 E 6000.0 EmpID Name Department Location 0 1 A HR Delhi 1 8 B Admin Mumbai 2 3 C Finance NaN 3 9 D Ops Banglore 4 5 E Programming NaN 5 10 K Analytics Mumbai
需要combine_first和set_index来匹配由EmpID列创建:
df = df_first.set_index('EmpID').combine_first(df_second.set_index('EmpID')).reset_index()
print (df)
EmpID Department Location Name Salary 0 1 HR Delhi A 1000.0 1 2 NaN NaN B NaN 2 3 Finance NaN C 3000.0 3 4 NaN NaN D 8000.0 4 5 Programming NaN E 6000.0 5 8 Admin Mumbai B NaN 6 9 Ops Banglore D NaN 7 10 Analytics Mumbai K NaN
对于某些顺序的列需要reindex():
ColNames = pd.Index(np.concatenate([df_second.columns, df_first.columns])).drop_duplicates()
print (ColNames)
Index(['EmpID', 'Name', 'Department', 'Location', 'Salary'], dtype='object')
df = (df_first.set_index('EmpID')
.combine_first(df_second.set_index('EmpID'))
.reset_index()
.reindex(columns=ColNames))
print (df)
EmpID Name Department Location Salary 0 1 A HR Delhi 1000.0 1 2 B NaN NaN NaN 2 3 C Finance NaN 3000.0 3 4 D NaN NaN 8000.0 4 5 E Programming NaN 6000.0 5 8 B Admin Mumbai NaN 6 9 D Ops Banglore NaN 7 10 K Analytics Mumbai NaN