示例数据:
Student_id actvity_timestamp
1001 2019-09-05 08:26:12
1001 2019-09-06 09:26:12
1001 2019-09-21 10:11:01
1001 2019-10-24 11:44:01
1001 2019-10-25 11:31:01
1001 2019-10-26 12:13:01
1002 2019-09-11 12:21:01
1002 2019-09-12 13:11:01
1002 2019-11-23 16:22:01
实现输出:
Student_id | days_in_Sept | days_in_Oct | days_in_Nov |
1001 | 3 | 3 | 0 |
1002 | 2 | 0 | 1 |
1、使用pd.crosstab()实现
crossTab()函数是用于统计分组频率的特殊透视表。计算两个(或多个)因子的简单交叉表。默认情况下,计算因子的频率表,除非传递值数组和聚合函数。
import pandas as pd import numpy as np df = pd.DataFrame.from_dict({ "Student_id": [1001,1001,1001,1001,1001,1001,1002,1002,1002], "activity_timestamp": ["2019-09-05 08:26:12", "2019-09-06 09:26:12", "2019-09-21 10:11:01", "2019-10-24 11:44:01", "2019-10-25 11:31:01", "2019-10-26 12:13:01", "2019-09-11 12:21:01", "2019-09-12 13:11:01", "2019-11-23 16:22:01"] }) months = pd.to_datetime(df.activity_timestamp).dt.strftime("%B") print() result = pd.crosstab( df.Student_id, months, values=pd.to_datetime(df.activity_timestamp).dt.date, aggfunc=pd.Series.nunique print(result)
相关文档:
2、使用df.pivot_table()实现
pivot_table()是创建一个电子表格样式的数据透视表作为DataFrame。透视表中的级别将存储在结果DataFrame的索引和列上的MultiIndex对象(分层索引)中。
import pandas as pd import numpy as np df = pd.DataFrame.from_dict({ "Student_id": [1001,1001,1001,1001,1001,1001,1002,1002,1002], "actvity_timestamp": ["2019-09-05 08:26:12", "2019-09-06 09:26:12", "2019-09-21 10:11:01", "2019-10-24 11:44:01", "2019-10-25 11:31:01", "2019-10-26 12:13:01", "2019-09-11 12:21:01", "2019-09-12 13:11:01", "2019-11-23 16:22:01"] }) df['actvity_timestamp'] = pd.to_datetime(df['actvity_timestamp']) # to datetime format df['activity_month'] = df['actvity_timestamp'].dt.strftime('%b') # get month short name df['activity_date'] = df['actvity_timestamp'].dt.date # get activity dates df_out = (df.pivot_table(index='Student_id', # group under each student id columns='activity_month', # month short name as new columns values='activity_date', # aggregate on dates aggfunc='nunique', #activities on the same date counted once fill_value=0) .rename_axis(columns=None) ) print(df_out)