1. Goal
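The DataFrame currently contains the data shown below. Each row gives the average price index for the roughly one-week period before that date. For example, the first row, data_time='2023-04-06', price_index=132, means that the average price index from 2023-03-29 through 2023-04-05 is 132 (8 calendar days in this case, because the previous record is dated 2023-03-29). The goal is to fill in every date inside these gaps, giving each date the average price index of the record that covers it.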
data_time price_index
0 2023-04-06 132
1 2023-03-29 689
2 2023-03-22 450
3 2023-04-12 765
2. Approach
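- Sort data_time. The first data_time, 2023-03-22, is handled as a special case: take the 7 days before it (2023-03-15 through 2023-03-21) and set price_index to 450 for each of them.
- From the second data_time onward, take every date from the previous data_time up to, but not including, the current data_time, and set price_index for each to the price_index of the current data_time.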
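The two rules above can be sketched as a small helper that only computes the date window each record covers, without touching pandas yet. The name coverage_windows and the dict input are assumptions made for this illustration; they are not part of the original code.

```python
from datetime import datetime, timedelta


def coverage_windows(records):
    """Return (first_day, last_day, price_index) per record, oldest first.

    `records` maps a 'YYYY-MM-DD' string to its price index (hypothetical input shape).
    """
    dates = sorted(records)  # ISO-formatted date strings sort chronologically
    windows = []
    for i, day_str in enumerate(dates):
        end = datetime.strptime(day_str, '%Y-%m-%d')
        if i == 0:
            # First record: there is no previous record, so use the 7 days before it
            start = end - timedelta(days=7)
        else:
            # Later records: the window starts at the previous record's date
            start = datetime.strptime(dates[i - 1], '%Y-%m-%d')
        last = end - timedelta(days=1)  # the record's own date is not included
        windows.append((start.strftime('%Y-%m-%d'),
                        last.strftime('%Y-%m-%d'),
                        records[day_str]))
    return windows


records = {'2023-04-06': 132, '2023-03-29': 689,
           '2023-03-22': 450, '2023-04-12': 765}
windows = coverage_windows(records)
for window in windows:
    print(window)
```

For the sample data this gives four windows, starting with 2023-03-15 to 2023-03-21 at 450 and ending with 2023-04-06 to 2023-04-11 at 765.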
3. Python APIs used
(1) Extract a DataFrame column as a list
data_time_list = sorted(df.loc[:, 'data_time'].tolist())
(2) Parse a date string into a datetime
datetime.strptime(data_time_str, '%Y-%m-%d')
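As a small illustration of the parsing step, the snippet below round-trips a date string through datetime, shifts it by one day with timedelta, and shows that strptime raises ValueError when the string does not match the format. The variable names are chosen for this example only.

```python
from datetime import datetime, timedelta

data_time_str = '2023-04-06'
date = datetime.strptime(data_time_str, '%Y-%m-%d')

# Date arithmetic works on the parsed object, not on the string
previous_day = date - timedelta(days=1)
previous_day_str = previous_day.strftime('%Y-%m-%d')
print(previous_day_str)  # 2023-04-05

# A string in a different layout does not match '%Y-%m-%d'
try:
    datetime.strptime('2023/04/06', '%Y-%m-%d')
    parsed_ok = True
except ValueError:
    parsed_ok = False
print(parsed_ok)  # False
```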
(3) Look up the value of one column from the value of another column
df.loc[df['data_time'] == data_time_list[i], 'price_index'].values[0]
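Note that .values[0] raises IndexError when no row matches the condition. A minimal sketch of a more defensive lookup follows; lookup_price_index is a hypothetical helper introduced for this example, and returning None for a missing date is an assumption about the desired behavior.

```python
import pandas as pd

df = pd.DataFrame({'data_time': ['2023-04-06', '2023-03-29'],
                   'price_index': [132, 689]})


def lookup_price_index(df, data_time):
    """Return the price_index of the first row matching data_time, or None."""
    matches = df.loc[df['data_time'] == data_time, 'price_index']
    if matches.empty:
        return None  # no row for this date
    return matches.iloc[0]


found = lookup_price_index(df, '2023-03-29')
missing = lookup_price_index(df, '2023-01-01')
print(found, missing)
```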
(4) Append a row to a DataFrame
new_row = {'data_time': date_str, 'price_index': price_index}
new_row_df = pd.DataFrame(new_row, index=[0])
result = pd.concat([result, new_row_df])
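Calling pd.concat once per row copies the accumulated DataFrame every time, which gets slow as the result grows. A common alternative, shown here as a sketch with made-up sample rows, is to collect plain dicts in a list and build the DataFrame once at the end.

```python
import pandas as pd

rows = []
for date_str, price_index in [('2023-03-15', 450),
                              ('2023-03-16', 450),
                              ('2023-03-17', 450)]:
    # Appending to a list is cheap; no DataFrame is copied here
    rows.append({'data_time': date_str, 'price_index': price_index})

# Build the DataFrame in one step; the index is numbered 0, 1, 2
result = pd.DataFrame(rows)
print(result)
```

Built this way, the index is a normal running number rather than 0 on every row.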
(5) Sort a DataFrame by a column
result = result.sort_values(by='data_time', ascending=True)
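By default sort_values returns a new DataFrame and leaves the original untouched, so the return value has to be assigned. The short example below, using two made-up rows, shows the difference; reset_index(drop=True) is an optional extra step that renumbers the index after sorting.

```python
import pandas as pd

result = pd.DataFrame({'data_time': ['2023-03-22', '2023-03-15'],
                       'price_index': [689, 450]})

# Return value discarded: result keeps its original row order
result.sort_values(by='data_time', ascending=True)
unchanged_first = result['data_time'].iloc[0]

# Assign the sorted copy back, then renumber the index from 0
result = result.sort_values(by='data_time', ascending=True).reset_index(drop=True)
sorted_first = result['data_time'].iloc[0]

print(unchanged_first, sorted_first)
```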
4. Complete code
from datetime import datetime, timedelta

import pandas as pd


def fill_in_missing_data(df: pd.DataFrame) -> pd.DataFrame:
    if df.empty:
        return df
    else:
        # ISO-formatted date strings sort chronologically
        data_time_list = sorted(df.loc[:, 'data_time'].tolist())
        result = pd.DataFrame()
        for i in range(len(data_time_list)):
            if i == 0:
                # First record: fill the 7 days before it with its price_index
                date = datetime.strptime(data_time_list[i], '%Y-%m-%d')
                one_week_ago = date - timedelta(days=7)
                for j in range(7):
                    date = one_week_ago + timedelta(days=j)
                    price_index = df.loc[df['data_time'] == data_time_list[i], 'price_index'].values[0]
                    date_str = date.strftime('%Y-%m-%d')
                    new_row = {'data_time': date_str, 'price_index': price_index}
                    # Each one-row frame has index 0, so every row of result keeps index 0
                    new_row_df = pd.DataFrame(new_row, index=[0])
                    result = pd.concat([result, new_row_df])
            else:
                # Later records: fill from the previous date up to the day before this one
                start_date = datetime.strptime(data_time_list[i-1], '%Y-%m-%d')
                end_date = datetime.strptime(data_time_list[i], '%Y-%m-%d')
                delta = end_date - start_date
                for k in range(delta.days):
                    current_date = start_date + timedelta(days=k)
                    price_index = df.loc[df['data_time'] == data_time_list[i], 'price_index'].values[0]
                    current_date_str = current_date.strftime('%Y-%m-%d')
                    new_row = {'data_time': current_date_str, 'price_index': price_index}
                    new_row_df = pd.DataFrame(new_row, index=[0])
                    result = pd.concat([result, new_row_df])
        result = result.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
        # sort_values returns a new DataFrame, so assign the result back
        result = result.sort_values(by='data_time', ascending=True)
        return result
# Create sample data
data = {'data_time': ['2023-04-06', '2023-03-29', '2023-03-22', '2023-04-12'],
        'price_index': [132, 689, 450, 765]}
df = pd.DataFrame(data)
df = fill_in_missing_data(df)
print(df)
The output is:
data_time price_index
0 2023-03-15 450
0 2023-03-16 450
0 2023-03-17 450
0 2023-03-18 450
0 2023-03-19 450
0 2023-03-20 450
0 2023-03-21 450
0 2023-03-22 689
0 2023-03-23 689
0 2023-03-24 689
0 2023-03-25 689
0 2023-03-26 689
0 2023-03-27 689
0 2023-03-28 689
0 2023-03-29 132
0 2023-03-30 132
0 2023-03-31 132
0 2023-04-01 132
0 2023-04-02 132
0 2023-04-03 132
0 2023-04-04 132
0 2023-04-05 132
0 2023-04-06 765
0 2023-04-07 765
0 2023-04-08 765
0 2023-04-09 765
0 2023-04-10 765
0 2023-04-11 765
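As a possible alternative to the row-by-row loop, the same fill can be sketched with a daily date_range, reindex and a back-fill. This is a different technique from the code above, not a restatement of it. The function name fill_in_missing_data_vectorized is hypothetical, and the sketch assumes that the data_time values are unique and formatted as YYYY-MM-DD.

```python
import pandas as pd


def fill_in_missing_data_vectorized(df: pd.DataFrame) -> pd.DataFrame:
    if df.empty:
        return df
    # Series of price_index keyed by parsed date, oldest first
    s = (df.assign(data_time=pd.to_datetime(df['data_time']))
           .set_index('data_time')['price_index']
           .sort_index())
    one_day = pd.Timedelta(days=1)
    # Every day from 7 days before the first record to the day before the last
    all_days = pd.date_range(s.index[0] - pd.Timedelta(days=7),
                             s.index[-1] - one_day, freq='D')
    # Move each record to the last day it covers, then fill backwards
    covered = s.copy()
    covered.index = covered.index - one_day
    filled = covered.reindex(all_days).bfill().astype(s.dtype)
    out = filled.rename_axis('data_time').reset_index()
    out['data_time'] = out['data_time'].dt.strftime('%Y-%m-%d')
    return out


data = {'data_time': ['2023-04-06', '2023-03-29', '2023-03-22', '2023-04-12'],
        'price_index': [132, 689, 450, 765]}
out = fill_in_missing_data_vectorized(pd.DataFrame(data))
print(out)
```

On the sample data this yields the same 28 date and price pairs as the output above, except that the index runs from 0 to 27 instead of being 0 on every row.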