本文主要介绍Python中,使用pandas通过字段的部分匹配,来连接合并两个 DataFrame的方法,以及相关的示例代码。

两个DataFrame:

import pandas as pd

df1 = pd.DataFrame({
                    'Langs': ['C','Py','Java'], 
                    'Values': [1,2,3]
                  })

df2 = pd.DataFrame({
                    'Langs': ['C语言','Python','CJavaPy'], 
                    'Names': ['coder1','coder2','coder3']
                  })

预期结果:

  Langs  Values   Names
0     C       1  coder1
1    Py       2  coder2
2  Java       3  coder3

  Langs  Values   Names
0     C       1  coder1
1     C       1  coder3
2  Java       3  coder3
3    Py       2  coder2
4    Py       2  coder3

1、使用thefuzz包和extractOne函数

安装thefuzz:

pip install thefuzz
pip install python-Levenshtein

实现代码:

import pandas as pd
from thefuzz import process

df1 = pd.DataFrame({
                    'Langs': ['C','Py','Java'], 
                    'Values': [1,2,3]
                  })

df2 = pd.DataFrame({
                    'Langs': ['C语言','Python','CJavaPy'], 
                    'Names': ['coder1','coder2','coder3']
                  })

langs = lambda x: process.extractOne(x, df2["Langs"])[2] 
df1['Names'] = df2.loc[df1["Langs"].map(langs).values, 'Names'].values
print(df1)

2、使用merge()和groupby()

import pandas as pd
from thefuzz import process

df1 = pd.DataFrame({
                    'Langs': ['C','Py','Java'], 
                    'Values': [1,2,3]
                  })

df2 = pd.DataFrame({
                    'Langs_y': ['C语言','Python','CJavaPy'], 
                    'Names': ['coder1','coder2','coder3']
                  })

df1['join'] = 1
df2['join'] = 1
dfFull = df1.merge(df2, on='join',).drop('join', axis=1)
df2.drop('join', axis=1, inplace=True)
dfFull['match'] = dfFull.apply(lambda x: x.Langs_y.find(x.Langs), axis=1).ge(0)
dfFull = dfFull[dfFull['match']==True]
dfResult = dfFull.groupby(["Langs", "Names"]).max().reset_index()[['Langs', 'Values' , 'Names' ]]
print(dfResult)

推荐文档