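Low-cardinality categories
Features with 10 or fewer categories. For one-hot encoding, the number of categories should ideally not exceed 5.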
LabelEncoder
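A minimal sketch of label encoding with scikit-learn's LabelEncoder, which maps each distinct category to an integer. The `colors` list is hypothetical sample data, not taken from the original notes.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical low-cardinality column (assumption, for illustration only).
colors = ["red", "green", "blue", "green"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the sorted unique labels; each code is an index into it.
print(encoder.classes_.tolist())  # ['blue', 'green', 'red']
print(codes.tolist())             # [2, 1, 0, 1]

# inverse_transform maps the integer codes back to the original labels.
restored = encoder.inverse_transform(codes).tolist()
```

The integer codes imply an ordering (blue < green < red) that the data may not actually have, so this encoding tends to suit tree-based models better than linear ones.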
OneHotEncoder
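A minimal sketch of one-hot encoding with scikit-learn's OneHotEncoder, which produces one binary column per category. The array `X` is hypothetical sample data, and `handle_unknown="ignore"` is one possible choice here, not something the notes prescribe.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with 3 categories (assumption, for illustration only).
# OneHotEncoder expects 2-D input: one row per sample, one column per feature.
X = np.array([["red"], ["green"], ["blue"], ["green"]])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
onehot = encoder.fit_transform(X).toarray()  # the default output is sparse

print(encoder.categories_[0].tolist())  # ['blue', 'green', 'red']
print(onehot.shape)                     # (4, 3)
```

Each category adds a column, which is why the category count is best kept small for this encoding.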
High-cardinality categories
Feature columns with more than 10 categories.
Statistical features
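import pandas as pd

def aggregate_statistic_feature(df, group, target):
    # Per-group summary statistics of `target`, one row per value of `group`.
    tem = df.groupby([group])[target].agg(['max', 'min', 'sum', 'mean', 'median', 'nunique', 'std', 'skew']).reset_index()
    # Prefix each statistic with the target column name (the original used an undefined name `fea`).
    tem.columns = [group] + [target + '_' + col for col in tem.columns.values[1:]]
    return tem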
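A possible usage sketch: compute the per-group statistics and merge them back onto the original rows, so each row carries the statistics of its category. The function is restated here so the block runs on its own, with the statistic columns prefixed by `target` (an assumption about the intended naming). The `city` and `price` columns are hypothetical.

```python
import pandas as pd

def aggregate_statistic_feature(df, group, target):
    # Per-group summary statistics of `target`, one row per value of `group`.
    stats = ['max', 'min', 'sum', 'mean', 'median', 'nunique', 'std', 'skew']
    tem = df.groupby([group])[target].agg(stats).reset_index()
    tem.columns = [group] + [target + '_' + col for col in tem.columns.values[1:]]
    return tem

# Hypothetical data: a categorical column and a numeric column.
df = pd.DataFrame({
    "city": ["a", "a", "a", "b", "b", "b"],
    "price": [1.0, 2.0, 3.0, 10.0, 20.0, 60.0],
})

city_stats = aggregate_statistic_feature(df, "city", "price")

# A left merge keeps every original row and attaches its group's statistics.
features = df.merge(city_stats, on="city", how="left")
print(features.columns.tolist())
```

If the statistics are computed from a supervised target rather than an ordinary feature, computing them on training data only is a common way to avoid leakage.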