1. Sample txt file contents
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]
2. Creating a DataFrame with pd.DataFrame.from_records()
import pandas as pd
from collections import namedtuple

Item = namedtuple('Item', 'state area')

items = []
with open('unis.txt') as f:
    for line in f:
        l = line.rstrip('\n')
        if l.endswith('[edit]'):
            # Slice the marker off. rstrip('[edit]') would strip a set of
            # characters rather than the suffix, and can eat into the name.
            state = l[:-len('[edit]')]
        else:
            # The area name is everything before the first ' ('
            i = l.index(' (')
            area = l[:i]
            items.append(Item(state, area))

df = pd.DataFrame.from_records(items, columns=['State', 'Area'])
print(df)
Output:
      State          Area
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson
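One detail deserves care when removing the [edit] marker from a state line: str.rstrip() takes a set of characters, not a suffix. The sample states above happen to survive it, but a name ending in one of the letters e, d, i or t would be cut short. The following is a minimal sketch of a suffix-only removal; the helper name strip_suffix is hypothetical and not part of the original code. On Python 3.9 and later, the built-in str.removesuffix() does the same job.

```python
def strip_suffix(text, suffix='[edit]'):
    """Return text without a trailing suffix, leaving everything else intact."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text


# rstrip treats '[edit]' as the character set {'[', 'e', 'd', 'i', 't', ']'},
# so it keeps stripping after the marker is gone.
print('Connecticut[edit]'.rstrip('[edit]'))  # Connecticu
print(strip_suffix('Connecticut[edit]'))     # Connecticut
print(strip_suffix('Alabama[edit]'))         # Alabama
```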
3. Creating a DataFrame by reading the file with pd.read_csv()
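import pandas as pd

# The sample contains no ';', so each line is read whole into a single column
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
# Lines ending in [edit] name a state; forward-fill that name onto the rows below
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
# Drop everything from ' (' onward; regex=True is required in pandas 2.0 and later
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)
# Remove the state header rows themselves
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)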
Output:
      State   Region Name
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson
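To see what the two regular expressions contribute, the same steps can be sketched with the standard re module alone. This is an illustration under assumptions, not the pandas implementation: the function name parse_regions and the shortened SAMPLE text are hypothetical, and a running "current state" variable stands in for ffill().

```python
import re

# Same patterns as in the pandas version above
STATE_RE = re.compile(r'(.*)\[edit\]')  # captures the state name before [edit]
TRAILING_RE = re.compile(r' \(.+$')     # matches from the first ' (' to end of line

# Shortened copy of the sample file, for illustration only
SAMPLE = """Alabama[edit]
Auburn (Auburn University)[1]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arkansas[edit]"""


def parse_regions(lines):
    """Return (state, region) pairs, carrying the last seen state forward."""
    records = []
    state = None
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            # A state header row: remember the state and emit nothing
            state = match.group(1)
            continue
        records.append((state, TRAILING_RE.sub('', line)))
    return records


records = parse_regions(SAMPLE.splitlines())
for state, region in records:
    print(state, region)
```

The resulting list of tuples has the same shape as the records passed to pd.DataFrame.from_records() in section 2, so it could be handed to that function with columns=['State', 'Region Name'].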
Alternatively, keep the full region text by skipping the replace step:
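import pandas as pd

# The sample contains no ';', so each line is read whole into a single column
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
# Lines ending in [edit] name a state; forward-fill that name onto the rows below
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
# Remove the state header rows themselves
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print(df)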
Output:
      State                                        Region Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Jacksonville (Jacksonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)