
[Python Basics - Pandas] Filling in the data between two dates in a DataFrame

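1. Goal

The DataFrame currently holds the data shown below. Each row is the average price index for the roughly one-week period leading up to that date. For example, the first row, data_time='2023-04-06', price_index=132, means that the average price index from 2023-03-29 through 2023-04-05 was 132. The task is to fill in every date inside these intervals, giving each one the average price index of its interval.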

    data_time  price_index
0  2023-04-06          132
1  2023-03-29          689
2  2023-03-22          450
3  2023-04-12          765

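2. Approach

  • Sort by data_time. The first date, data_time=2023-03-22, is handled separately: take the 7 days immediately before it and set price_index to 450 for each of them.
  • Starting from the second data_time, take every date from the previous data_time up to (but not including) the current data_time, and set price_index for each of them to the value recorded for the current data_time.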
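The two steps above can be sketched as a small function that only computes which date range each record covers, before any DataFrame work. This is a minimal sketch: the name `covered_ranges` and the parameter `first_window_days` are made up for illustration and are not part of the original code.

```python
from datetime import datetime, timedelta


def covered_ranges(dates, first_window_days=7):
    """Return (start, end_exclusive, record_date) tuples, one per record."""
    ordered = sorted(dates)  # zero-padded 'YYYY-MM-DD' strings sort chronologically
    ranges = []
    for i, current in enumerate(ordered):
        end = datetime.strptime(current, '%Y-%m-%d')
        if i == 0:
            # Earliest record: no previous date, so look back a fixed window
            start = end - timedelta(days=first_window_days)
        else:
            # Later records: start at the previous record's date
            start = datetime.strptime(ordered[i - 1], '%Y-%m-%d')
        ranges.append((start.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d'), current))
    return ranges


ranges = covered_ranges(['2023-04-06', '2023-03-29', '2023-03-22', '2023-04-12'])
for start, end, record_date in ranges:
    print(f'{start} <= day < {end} takes the value recorded on {record_date}')
```

Each range is half-open: the record's own date is not part of its range and instead falls into the range of the next record.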

3. Python APIs used

(1) Extract a column of df as a list

data_time_list = sorted(df.loc[:, 'data_time'].tolist())
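A small runnable illustration of this step, using the sample data from this post. Sorting the raw strings gives chronological order only because the dates are zero-padded 'YYYY-MM-DD' strings; for other formats, converting with `pd.to_datetime` first is the safer route. The variable `as_dates` is introduced here purely for comparison.

```python
import pandas as pd

df = pd.DataFrame({'data_time': ['2023-04-06', '2023-03-29', '2023-03-22', '2023-04-12'],
                   'price_index': [132, 689, 450, 765]})

# Plain string sort: works for zero-padded ISO dates
data_time_list = sorted(df.loc[:, 'data_time'].tolist())
print(data_time_list)

# Same ordering obtained by parsing to real timestamps first
as_dates = pd.to_datetime(df['data_time']).sort_values()
print(as_dates.dt.strftime('%Y-%m-%d').tolist())
```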

(2) Convert a date string to a datetime

datetime.strptime(data_time_str, '%Y-%m-%d')
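For illustration, here is how `strptime`, `timedelta` and `strftime` combine to walk through a run of days and to measure the gap between two records. The variable names are only for this example.

```python
from datetime import datetime, timedelta

date = datetime.strptime('2023-03-22', '%Y-%m-%d')
one_week_ago = date - timedelta(days=7)

# The 7 days before 2023-03-22, formatted back to strings
week = [(one_week_ago + timedelta(days=j)).strftime('%Y-%m-%d') for j in range(7)]
print(week)

# Subtracting two datetimes gives a timedelta; .days is the gap in whole days
gap = datetime.strptime('2023-04-06', '%Y-%m-%d') - datetime.strptime('2023-03-29', '%Y-%m-%d')
print(gap.days)
```

Note that the gap between 2023-03-29 and 2023-04-06 is 8 days, not 7, which is why the interval length is computed from the dates rather than hard-coded.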

(3) Look up the value of one column from the value of another column

df.loc[df['data_time'] == data_time_list[i], 'price_index'].values[0]
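One thing to watch with this lookup: if no row matches, the selection is empty and `.values[0]` raises an IndexError. A minimal sketch of a guarded version follows; `lookup_price` is a hypothetical helper, not something the original code defines.

```python
import pandas as pd

df = pd.DataFrame({'data_time': ['2023-04-06', '2023-03-29', '2023-03-22', '2023-04-12'],
                   'price_index': [132, 689, 450, 765]})


def lookup_price(df, day):
    """Return price_index of the first row whose data_time equals day, else None."""
    matches = df.loc[df['data_time'] == day, 'price_index']
    if matches.empty:
        return None  # no such date in the frame
    return int(matches.iloc[0])


print(lookup_price(df, '2023-03-29'))
print(lookup_price(df, '2023-01-01'))
```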

(4) Append a row to df

new_row = {'data_time': date_str, 'price_index': price_index}
new_row_df = pd.DataFrame(new_row, index=[0])
result = pd.concat([result, new_row_df])
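Calling `pd.concat` once per row builds a new DataFrame every time and leaves every appended row with index 0. A common alternative, sketched here with made-up sample rows, is to collect plain dicts in a list and construct the DataFrame once at the end.

```python
import pandas as pd

rows = []
for day in ['2023-03-15', '2023-03-16', '2023-03-17']:
    # Collect each new row as a dict instead of concatenating immediately
    rows.append({'data_time': day, 'price_index': 450})

# One construction at the end; the index is a clean 0..n-1 range
result = pd.DataFrame(rows)
print(result)
```

If concatenation is still preferred, passing `ignore_index=True` to `pd.concat` also yields a fresh 0..n-1 index.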

(5) Sort df by a column

result = result.sort_values(by='data_time', ascending=True)
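By default `sort_values` returns a sorted copy and leaves the original DataFrame untouched, so the return value has to be kept. A short demonstration with two sample rows (the variable `first_before` exists only to show the effect):

```python
import pandas as pd

result = pd.DataFrame({'data_time': ['2023-03-29', '2023-03-22'],
                       'price_index': [689, 450]})

# Return value discarded: result itself is NOT reordered
result.sort_values(by='data_time', ascending=True)
first_before = result['data_time'].iloc[0]

# Keep the sorted copy, and optionally renumber the index
result = result.sort_values(by='data_time', ascending=True).reset_index(drop=True)
print(first_before, result['data_time'].iloc[0])
```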

4. Complete code

import pandas as pd
from datetime import datetime, timedelta


def fill_in_missing_data(df: pd.DataFrame) -> pd.DataFrame:
    """Expand each averaged record into one row per day that it covers."""
    if df.empty:
        return df
    # Zero-padded 'YYYY-MM-DD' strings sort chronologically
    data_time_list = sorted(df.loc[:, 'data_time'].tolist())
    result = pd.DataFrame()
    for i in range(len(data_time_list)):
        # Every day in the current interval takes the current record's value
        price_index = df.loc[df['data_time'] == data_time_list[i], 'price_index'].values[0]
        if i == 0:
            # Earliest record: cover the 7 days immediately before it
            date = datetime.strptime(data_time_list[i], '%Y-%m-%d')
            one_week_ago = date - timedelta(days=7)
            for j in range(7):
                date = one_week_ago + timedelta(days=j)
                date_str = date.strftime('%Y-%m-%d')
                new_row = {'data_time': date_str, 'price_index': price_index}
                new_row_df = pd.DataFrame(new_row, index=[0])
                result = pd.concat([result, new_row_df])
        else:
            # Later records: cover previous date (inclusive) to current date (exclusive)
            start_date = datetime.strptime(data_time_list[i-1], '%Y-%m-%d')
            end_date = datetime.strptime(data_time_list[i], '%Y-%m-%d')
            delta = end_date - start_date
            for k in range(delta.days):
                current_date = start_date + timedelta(days=k)
                current_date_str = current_date.strftime('%Y-%m-%d')
                new_row = {'data_time': current_date_str, 'price_index': price_index}
                new_row_df = pd.DataFrame(new_row, index=[0])
                result = pd.concat([result, new_row_df])
    result = result.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
    # sort_values returns a new DataFrame, so the result must be assigned
    result = result.sort_values(by='data_time', ascending=True)
    return result


# Create sample data
data = {'data_time': ['2023-04-06', '2023-03-29', '2023-03-22', '2023-04-12'],
        'price_index': [132, 689, 450, 765]}
df = pd.DataFrame(data)
df = fill_in_missing_data(df)
print(df)

The output is:

    data_time  price_index
0  2023-03-15          450
0  2023-03-16          450
0  2023-03-17          450
0  2023-03-18          450
0  2023-03-19          450
0  2023-03-20          450
0  2023-03-21          450
0  2023-03-22          689
0  2023-03-23          689
0  2023-03-24          689
0  2023-03-25          689
0  2023-03-26          689
0  2023-03-27          689
0  2023-03-28          689
0  2023-03-29          132
0  2023-03-30          132
0  2023-03-31          132
0  2023-04-01          132
0  2023-04-02          132
0  2023-04-03          132
0  2023-04-04          132
0  2023-04-05          132
0  2023-04-06          765
0  2023-04-07          765
0  2023-04-08          765
0  2023-04-09          765
0  2023-04-10          765
0  2023-04-11          765
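For comparison, the same result can probably be reached without explicit loops by reindexing onto a daily date range and back-filling. This is a different technique from the loop-based function in this post, offered only as a sketch: the name `fill_in_missing_data_vectorized` and the parameter `first_window_days` are assumptions, and the code assumes every data_time appears only once.

```python
import pandas as pd


def fill_in_missing_data_vectorized(df, first_window_days=7):
    """Daily back-fill variant; assumes unique 'YYYY-MM-DD' dates in data_time."""
    if df.empty:
        return df
    s = (df.assign(data_time=pd.to_datetime(df['data_time']))
           .sort_values('data_time')
           .set_index('data_time')['price_index'])
    # A record covers the days strictly before its own date, so move each
    # record one day back: it now sits on the LAST day that it covers.
    s.index = s.index - pd.Timedelta(days=1)
    # The earliest record covers first_window_days days ending on that last day
    start = s.index[0] - pd.Timedelta(days=first_window_days - 1)
    all_days = pd.date_range(start, s.index[-1], freq='D')
    # Missing days become NaN, then take the value of the next record
    filled = s.reindex(all_days).bfill().astype(df['price_index'].dtype)
    return pd.DataFrame({'data_time': filled.index.strftime('%Y-%m-%d'),
                         'price_index': filled.to_numpy()})


sample = pd.DataFrame({'data_time': ['2023-04-06', '2023-03-29', '2023-03-22', '2023-04-12'],
                       'price_index': [132, 689, 450, 765]})
filled_df = fill_in_missing_data_vectorized(sample)
print(filled_df)
```

Unlike the loop version, this variant returns a DataFrame with a fresh 0..n-1 index rather than an index of all zeros.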