
Data Analysis and Mining: Analyzing the Factors Influencing Fiscal Revenue and Building a Prediction Model

1. Background and Mining Objectives

  • The project is Chapter 13 of *Python 数据分析与挖掘实战*: analysis and prediction models for the factors influencing fiscal revenue; the content draws on the book's source code and the blog of u012063773
  • The mining objective is to identify the key features that drive local fiscal revenue, select those features for analysis and modeling, and then forecast fiscal revenue

2. Analysis Method and Process

2.1 Data Exploration

  1. Descriptive analysis of the main variables: y clearly fluctuates a great deal (its standard deviation is on the same order as its mean)
'''Summary statistics of the raw data'''
import numpy as np
import pandas as pd

inputfile = 'chapter13/demo/data/data1.csv'
data = pd.read_csv(inputfile)
r = [data.min(), data.max(), data.mean(), data.std()]
r = pd.DataFrame(r, index=['Min', 'Max', 'Mean', 'STD']).T
r = np.round(r, 2)    # keep two decimal places
r
|     | Min        | Max        | Mean       | STD        |
| --- | ---------- | ---------- | ---------- | ---------- |
| x1  | 3831732.00 | 7599295.00 | 5579519.95 | 1262194.72 |
| x2  | 181.54     | 2110.78    | 765.04     | 595.70     |
| x3  | 448.19     | 6882.85    | 2370.83    | 1919.17    |
| x4  | 7571.00    | 42049.14   | 19644.69   | 10203.02   |
| x5  | 6212.70    | 33156.83   | 15870.95   | 8199.77    |
| x6  | 6370241.00 | 8323096.00 | 7350513.60 | 621341.85  |
| x7  | 525.71     | 4454.55    | 1712.24    | 1184.71    |
| x8  | 985.31     | 15420.14   | 5705.80    | 4478.40    |
| x9  | 60.62      | 228.46     | 129.49     | 50.51      |
| x10 | 65.66      | 852.56     | 340.22     | 251.58     |
| x11 | 97.50      | 120.00     | 103.31     | 5.51       |
| x12 | 1.03       | 1.91       | 1.42       | 0.25       |
| x13 | 5321.00    | 41972.00   | 17273.80   | 11109.19   |
| y   | 64.87      | 2088.14    | 618.08     | 609.25     |
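The same four summary rows can also be obtained with pandas' built-in `describe()`; a minimal sketch on made-up stand-in data (the real data1.csv is not assumed available here):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data1.csv with just two columns
data = pd.DataFrame({'x1': [1.0, 2.0, 3.0, 4.0],
                     'y':  [10.0, 20.0, 30.0, 40.0]})

# describe() already computes min/max/mean/std (and more) per column
r = data.describe().loc[['min', 'max', 'mean', 'std']].T
r.columns = ['Min', 'Max', 'Mean', 'STD']
r = np.round(r, 2)
print(r)
```

This avoids building the list of aggregates by hand, at the cost of computing a few extra statistics that are then discarded.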
  2. Correlation analysis of the raw data: x11 is only weakly correlated with y, and the correlation is negative
'''Pearson correlation coefficients of the raw data'''
pear = np.round(data.corr(method='pearson'), 2)
pear
|     | x1    | x2    | x3    | x4    | x5    | x6    | x7    | x8    | x9    | x10   | x11   | x12   | x13   | y     |
| --- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| x1  | 1.00  | 0.95  | 0.95  | 0.97  | 0.97  | 0.99  | 0.95  | 0.97  | 0.98  | 0.98  | -0.29 | 0.94  | 0.96  | 0.94  |
| x2  | 0.95  | 1.00  | 1.00  | 0.99  | 0.99  | 0.92  | 0.99  | 0.99  | 0.98  | 0.98  | -0.13 | 0.89  | 1.00  | 0.98  |
| x3  | 0.95  | 1.00  | 1.00  | 0.99  | 0.99  | 0.92  | 1.00  | 0.99  | 0.98  | 0.99  | -0.15 | 0.89  | 1.00  | 0.99  |
| x4  | 0.97  | 0.99  | 0.99  | 1.00  | 1.00  | 0.95  | 0.99  | 1.00  | 0.99  | 1.00  | -0.19 | 0.91  | 1.00  | 0.99  |
| x5  | 0.97  | 0.99  | 0.99  | 1.00  | 1.00  | 0.95  | 0.99  | 1.00  | 0.99  | 1.00  | -0.18 | 0.90  | 0.99  | 0.99  |
| x6  | 0.99  | 0.92  | 0.92  | 0.95  | 0.95  | 1.00  | 0.93  | 0.95  | 0.97  | 0.96  | -0.34 | 0.95  | 0.94  | 0.91  |
| x7  | 0.95  | 0.99  | 1.00  | 0.99  | 0.99  | 0.93  | 1.00  | 0.99  | 0.98  | 0.99  | -0.15 | 0.89  | 1.00  | 0.99  |
| x8  | 0.97  | 0.99  | 0.99  | 1.00  | 1.00  | 0.95  | 0.99  | 1.00  | 0.99  | 1.00  | -0.15 | 0.90  | 1.00  | 0.99  |
| x9  | 0.98  | 0.98  | 0.98  | 0.99  | 0.99  | 0.97  | 0.98  | 0.99  | 1.00  | 0.99  | -0.23 | 0.91  | 0.99  | 0.98  |
| x10 | 0.98  | 0.98  | 0.99  | 1.00  | 1.00  | 0.96  | 0.99  | 1.00  | 0.99  | 1.00  | -0.17 | 0.90  | 0.99  | 0.99  |
| x11 | -0.29 | -0.13 | -0.15 | -0.19 | -0.18 | -0.34 | -0.15 | -0.15 | -0.23 | -0.17 | 1.00  | -0.43 | -0.16 | -0.12 |
| x12 | 0.94  | 0.89  | 0.89  | 0.91  | 0.90  | 0.95  | 0.89  | 0.90  | 0.91  | 0.90  | -0.43 | 1.00  | 0.90  | 0.87  |
| x13 | 0.96  | 1.00  | 1.00  | 1.00  | 0.99  | 0.94  | 1.00  | 1.00  | 0.99  | 0.99  | -0.16 | 0.90  | 1.00  | 0.99  |
| y   | 0.94  | 0.98  | 0.99  | 0.99  | 0.99  | 0.91  | 0.99  | 0.99  | 0.98  | 0.99  | -0.12 | 0.87  | 0.99  | 1.00  |
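One way to act on such a correlation table is to keep only the features whose correlation with y exceeds some threshold in absolute value; a small sketch with a hand-made series (the 0.9 cutoff is an assumption for illustration, not a choice made in the book):

```python
import pandas as pd

# Hypothetical correlations with y; x11's weak negative value mirrors the table
corr_with_y = pd.Series({'x1': 0.94, 'x2': 0.98, 'x11': -0.12, 'x12': 0.87})

threshold = 0.9  # assumed cutoff for this sketch
selected = corr_with_y[corr_with_y.abs() > threshold].index.tolist()
print(selected)
```

Using the absolute value matters: a strongly negative correlation would still be informative, whereas x11 here is dropped because its correlation is weak, not because it is negative.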

2.2 Model Construction

1. Lasso variable selection model (note: the book uses Adaptive-Lasso variable selection, but that function could not be found anywhere, so plain Lasso is used instead; the results differ slightly from the book's, and the retained variables below follow the book's selection for now)

'''Lasso variable selection'''
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1, max_iter=100000)
model.fit(data.iloc[:, 0:13], data['y'])
print(model.coef_)
[-3.88351082e-04 -5.85234238e-01  4.38483025e-01 -1.25563758e-01
  1.74517446e-01  8.19661325e-04  2.67660850e-01  2.89486267e-02
 -7.55994563e+00 -8.62534215e-02  3.37878229e+00  0.00000000e+00
 -7.70629587e-03]
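The coefficient vector above can be turned into an explicit kept/dropped split by checking which entries are (near) zero; a sketch using the printed coefficients (the 1e-6 tolerance is an assumption):

```python
import numpy as np

features = ['x{}'.format(i) for i in range(1, 14)]
coef = np.array([-3.88351082e-04, -5.85234238e-01,  4.38483025e-01,
                 -1.25563758e-01,  1.74517446e-01,  8.19661325e-04,
                  2.67660850e-01,  2.89486267e-02, -7.55994563e+00,
                 -8.62534215e-02,  3.37878229e+00,  0.00000000e+00,
                 -7.70629587e-03])

# Lasso zeroes out coefficients of discarded variables; treat tiny values as zero
dropped = [f for f, c in zip(features, coef) if abs(c) < 1e-6]
kept = [f for f, c in zip(features, coef) if abs(c) >= 1e-6]
print(dropped)
```

With alpha=0.1 only x12 is eliminated outright, which is one reason the retained set here differs from the book's Adaptive-Lasso result.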

2. Prediction models for fiscal revenue and each revenue category: since the prediction method is the same for every category, fiscal revenue is taken as the example to describe the grey-model computation; a combined grey-prediction and neural-network model is then built, with error precision 10^-7, 10000 training epochs, and 6 neurons

  • Principle of grey prediction

    Grey prediction preprocesses the raw data, e.g. by accumulation, to generate a data sequence with stronger regularity, then builds a corresponding differential-equation model on it to forecast the future development of the quantity.

    Let $X^{(0)} = \{X^{(0)}(i),\ i = 1, 2, \dots, n\}$ be a non-negative, monotone raw data sequence. One accumulation of $X^{(0)}$ gives $X^{(1)} = \{X^{(1)}(k),\ k = 1, 2, \dots, n\}$. A first-order linear differential equation is set up for $X^{(1)}$, where $a, u$ are constants:
    $$\frac{dX^{(1)}}{dt} + aX^{(1)} = u$$
    Solving the differential equation gives
    $$X^{(1)}(t) = \Big[\int e^{\int a\,dt}\, u\, dt + C\Big]\, e^{-\int a\,dt} \tag{1}$$
    $$\implies X^{(1)}(t) = \Big(\frac{u}{a}\, e^{at} + C\Big)\, e^{-at} \tag{2}$$
    Substituting $X^{(1)}(t_0)$ into (2) and solving for $C$:
    $$C = \Big(X^{(1)}(t_0) - \frac{u}{a}\Big)\, e^{at_0} \tag{3}$$
    Substituting (3) back into (2):
    $$X^{(1)}(t) = \Big[X^{(1)}(t_0) - \frac{u}{a}\Big]\, e^{-a(t-t_0)} + \frac{u}{a} \tag{4}$$
    For discrete values (with $t_0 = 1$):
    $$X^{(1)}(k+1) = \Big[X^{(1)}(1) - \frac{u}{a}\Big]\, e^{-ak} + \frac{u}{a} \tag{5}$$
    Grey prediction estimates $a, u$ by least squares. Since
    $$X^{(1)}(k) - X^{(1)}(k-1) = \frac{\Delta X^{(1)}(k)}{\Delta k} = X^{(0)}(k), \qquad \Delta k = 1 \tag{6}$$
    substituting (6) into the differential equation gives
    $$X^{(0)}(k) = -aX^{(1)}(k) + u \tag{7}$$
    Because $\frac{\Delta X^{(1)}(k)}{\Delta k}$ involves the values of $X^{(1)}$ at two adjacent times, it is more reasonable to replace $X^{(1)}(k)$ in (7) by the mean of the two adjacent values, which yields $Y = BU$, i.e.:
    $$\begin{bmatrix} X^{(0)}(2)\\ X^{(0)}(3)\\ \vdots\\ X^{(0)}(N) \end{bmatrix} = \begin{bmatrix} -\frac{1}{2}\big(X^{(1)}(2) + X^{(1)}(1)\big) & 1 \\ -\frac{1}{2}\big(X^{(1)}(3) + X^{(1)}(2)\big) & 1 \\ \vdots & \vdots \\ -\frac{1}{2}\big(X^{(1)}(N) + X^{(1)}(N-1)\big) & 1 \end{bmatrix} \begin{bmatrix} a \\ u \end{bmatrix} \tag{8}$$
    Least squares then gives
    $$\hat{U} = \begin{bmatrix} \hat{a} \\ \hat{u} \end{bmatrix} = (B^T B)^{-1} B^T Y \tag{9}$$
    Substituting (9) into (5):
    $$X^{(1)}(k+1) = \Big[X^{(1)}(1) - \frac{\hat{u}}{\hat{a}}\Big]\, e^{-\hat{a}k} + \frac{\hat{u}}{\hat{a}} \tag{10}$$
    Substituting (10) into (6) restores the forecast of the original series:
    $$X^{(0)}(k+1) = (1 - e^{\hat{a}})\Big[X^{(0)}(1) - \frac{\hat{u}}{\hat{a}}\Big]\, e^{-\hat{a}k} \tag{11}$$
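The least-squares step of (8)-(9) and the restored forecast (11) can be checked on a toy series: for a made-up series with exact 10% geometric growth, the relation between the adjacent-mean sequence and the original values is exactly linear, so the estimated parameters are recovered exactly and the restored values are close to the originals:

```python
import numpy as np

# Toy series with exact 10% growth (made-up numbers, not the book's data)
x0 = np.array([100.0, 110.0, 121.0, 133.1, 146.41])

x1 = x0.cumsum()                              # accumulated series X^(1)
z1 = (x1[:-1] + x1[1:]) / 2.0                 # adjacent-mean sequence
B = np.column_stack([-z1, np.ones_like(z1)])  # matrix B of eq. (8)
Y = x0[1:]                                    # left-hand side of eq. (8)

# Eq. (9): least-squares estimate of (a, u)
a, u = np.linalg.lstsq(B, Y, rcond=None)[0]

# Eq. (11): restored forecast of the original series for k = 1..n-1
pred = (1 - np.exp(a)) * (x0[0] - u / a) * np.exp(-a * np.arange(1, len(x0)))
rel_err = np.abs(pred - x0[1:]) / x0[1:]
print(a, u, rel_err.max())
```

For this series the regression is exact (a = -2/21), while the restored values carry a small discrepancy because (11) comes from the continuous-time solution; the relative error stays well under 1%.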

'''Grey prediction function'''
def GM11(x0):  # user-defined GM(1,1) grey prediction function
    import numpy as np
    x1 = x0.cumsum()    # accumulated (1-AGO) sequence
    z1 = (x1[:len(x1)-1] + x1[1:])/2.0    # adjacent-mean (MEAN) sequence of n-1 values, better than using the accumulated sequence directly
    z1 = z1.reshape((len(z1), 1))
    B = np.append(-z1, np.ones_like(z1), axis=1)    # matrix B of eq. (8)
    Y = x0[1:].reshape((len(x0)-1, 1))    # vector Y of eq. (8)
    [[a], [u]] = np.dot(np.dot(np.linalg.inv(np.dot(B.T, B)), B.T), Y)    # least-squares parameters, eq. (9)
    f = lambda k: (x0[0]-u/a)*np.exp(-a*(k-1))-(x0[0]-u/a)*np.exp(-a*(k-2))    # restored values, eq. (11)
    delta = np.abs(x0 - np.array([f(i) for i in range(1, len(x0)+1)]))    # residuals
    C = delta.std()/x0.std()    # posterior variance ratio
    P = 1.0*(np.abs(delta - delta.mean()) < 0.6745*x0.std()).sum()/len(x0)    # small-residual probability
    return f, a, u, x0[0], C, P    # returns the prediction function, a, u, first value, variance ratio, small-residual probability

'''Grey prediction of local fiscal revenue'''
import numpy as np
import pandas as pd

inputfile = 'chapter13/demo/data/data1.csv'
outputfile = 'chapter13/demo/tmp2/data1_GM11.xls'
modelfile = 'chapter13/demo/tmp2/net.model'
data = pd.read_csv(inputfile)
data.head()
|   | x1      | x2     | x3     | x4       | x5      | x6      | x7     | x8      | x9    | x10    | x11   | x12   | x13  | y      |
| - | ------- | ------ | ------ | -------- | ------- | ------- | ------ | ------- | ----- | ------ | ----- | ----- | ---- | ------ |
| 0 | 3831732 | 181.54 | 448.19 | 7571.00  | 6212.70 | 6370241 | 525.71 | 985.31  | 60.62 | 65.66  | 120.0 | 1.029 | 5321 | 64.87  |
| 1 | 3913824 | 214.63 | 549.97 | 9038.16  | 7601.73 | 6467115 | 618.25 | 1259.20 | 73.46 | 95.46  | 113.5 | 1.051 | 6529 | 99.75  |
| 2 | 3928907 | 239.56 | 686.44 | 9905.31  | 8092.82 | 6560508 | 638.94 | 1468.06 | 81.16 | 81.16  | 108.2 | 1.064 | 7008 | 88.11  |
| 3 | 4282130 | 261.58 | 802.59 | 10444.60 | 8767.98 | 6664862 | 656.58 | 1678.12 | 85.72 | 91.70  | 102.2 | 1.092 | 7694 | 106.07 |
| 4 | 4453911 | 283.14 | 904.57 | 11255.70 | 9422.33 | 6741400 | 758.83 | 1893.52 | 88.88 | 114.61 | 97.7  | 1.200 | 8027 | 137.32 |
data.index = range(1994, 2014)
data.loc[2014] = None
data.loc[2015] = None
# model accuracy evaluation
l = ['x1', 'x2', 'x3', 'x4', 'x5', 'x7']
for i in l:
    GM = GM11(data[i][list(range(1994, 2014))].values)
    f = GM[0]
    c = GM[-2]
    p = GM[-1]
    data.loc[2014, i] = f(len(data)-1)
    data.loc[2015, i] = f(len(data))
    data[i] = data[i].round(2)
    if (c < 0.35) & (p > 0.95):
        print('Model for {}: accuracy --- good'.format(i))
    elif (c < 0.5) & (p > 0.8):
        print('Model for {}: accuracy --- qualified'.format(i))
    elif (c < 0.65) & (p > 0.7):
        print('Model for {}: accuracy --- barely qualified'.format(i))
    else:
        print('Model for {}: accuracy --- unqualified'.format(i))

data[l+['y']].to_excel(outputfile)
Model for x1: accuracy --- good
Model for x2: accuracy --- good
Model for x3: accuracy --- good
Model for x4: accuracy --- good
Model for x5: accuracy --- good
Model for x7: accuracy --- good
'''Neural network'''
inputfile2 = outputfile
outputfile2 = 'chapter13/demo/tmp2/revenue.xls'
modelfile2 = 'chapter13/demo/tmp2/1-net.model'
data2 = pd.read_excel(inputfile2, index_col=0)

# extract the data
feature = list(data2.columns[:len(data2.columns)-1])
train = data2.loc[list(range(1994, 2014))].copy()
mean = train.mean()
std = train.std()
train = (train - mean) / std    # standardize via z-score (standard-deviation) scaling
x_train = train[feature].values
y_train = train['y'].values

# build the neural network model
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(input_dim=6, units=12))
model.add(Activation('relu'))
model.add(Dense(input_dim=12, units=1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train, y_train, epochs=10000, batch_size=16)
model.save_weights(modelfile2)
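When Keras is not available, the same 6-12-1 topology can be sketched with scikit-learn's MLPRegressor; this is a substitute for illustration, not the book's setup, and the data below are synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))   # 20 standardized samples, 6 features
y = X @ rng.normal(size=6)     # synthetic linear target

# One hidden ReLU layer of 12 units, mirroring the 6-12-1 network above
net = MLPRegressor(hidden_layer_sizes=(12,), activation='relu',
                   solver='adam', max_iter=5000, random_state=0)
net.fit(X, y)
print(net.predict(X[:1]))
```

MLPRegressor handles the output layer and MSE loss internally, so only the hidden layer sizes need to be specified.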

2.3 Data Prediction

  • The predicted values for 1994-2013 almost coincide with the actual values, so the predictions can be considered fairly reliable
# predict and de-standardize the results
x = ((data2[feature] - mean[feature]) / std[feature]).values
data2['y_pred'] = model.predict(x).flatten() * std['y'] + mean['y']
data2.to_excel(outputfile2)

import matplotlib.pyplot as plt
%matplotlib notebook
p = data2[['y', 'y_pred']].plot(style=['b-o', 'r-*'])
p.set_ylim(0, 2500)
p.set_xlim(1993, 2016)
plt.show()

[Figure: prediction plot of actual y vs. predicted y_pred, 1994-2015]

Source code and data files: https://github.com/Raymone23/Data-Mining
