示例数据:
Student_id actvity_timestamp
1001 2019-09-05 08:26:12
1001 2019-09-06 09:26:12
1001 2019-09-21 10:11:01
1001 2019-10-24 11:44:01
1001 2019-10-25 11:31:01
1001 2019-10-26 12:13:01
1002 2019-09-11 12:21:01
1002 2019-09-12 13:11:01
1002 2019-11-23 16:22:01
实现输出:
Student_id | days_in_Sept | days_in_Oct | days_in_Nov |
1001 | 3 | 3 | 0 |
1002 | 2 | 0 | 1 |
1、使用pd.crosstab()实现
crossTab()函数是用于统计分组频率的特殊透视表。计算两个(或多个)因子的简单交叉表。默认情况下,计算因子的频率表,除非传递值数组和聚合函数。
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
"Student_id": [1001,1001,1001,1001,1001,1001,1002,1002,1002],
"activity_timestamp": ["2019-09-05 08:26:12", "2019-09-06 09:26:12", "2019-09-21 10:11:01", "2019-10-24 11:44:01", "2019-10-25 11:31:01", "2019-10-26 12:13:01", "2019-09-11 12:21:01", "2019-09-12 13:11:01", "2019-11-23 16:22:01"]
})
months = pd.to_datetime(df.activity_timestamp).dt.strftime("%B")
print()
result = pd.crosstab(
df.Student_id,
months,
values=pd.to_datetime(df.activity_timestamp).dt.date,
aggfunc=pd.Series.nunique
print(result)
相关文档:
2、使用df.pivot_table()实现
pivot_table()是创建一个电子表格样式的数据透视表作为DataFrame。透视表中的级别将存储在结果DataFrame的索引和列上的MultiIndex对象(分层索引)中。
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({
"Student_id": [1001,1001,1001,1001,1001,1001,1002,1002,1002],
"actvity_timestamp": ["2019-09-05 08:26:12", "2019-09-06 09:26:12", "2019-09-21 10:11:01", "2019-10-24 11:44:01", "2019-10-25 11:31:01", "2019-10-26 12:13:01", "2019-09-11 12:21:01", "2019-09-12 13:11:01", "2019-11-23 16:22:01"]
})
df['actvity_timestamp'] = pd.to_datetime(df['actvity_timestamp']) # to datetime format
df['activity_month'] = df['actvity_timestamp'].dt.strftime('%b') # get month short name
df['activity_date'] = df['actvity_timestamp'].dt.date # get activity dates
df_out = (df.pivot_table(index='Student_id', # group under each student id
columns='activity_month', # month short name as new columns
values='activity_date', # aggregate on dates
aggfunc='nunique', #activities on the same date counted once
fill_value=0)
.rename_axis(columns=None)
)
print(df_out)