PyMC3 - GLM之线性回归

GLM：线性回归

GLM即Generalized linear model，广义线性模型。
贝叶斯统计的一些软件工具包JAGS, BUGS, Stan 和 PyMC，使用这些工具包需要对将要简历的模型有充分的了解。

线性回归的传统形式

通常，频率学派将线性回归表述为：

Y = X β + ϵ

$Y = X\beta + \epsilon$
其中，

Y $Y$ 是我们期望预测的输出（因变量）；

X $X$ 是预测因子（自变量）；

β $\beta$ 是待估计的模型系数（参数）；

ϵ $\epsilon$ 是误差项，假设为正态分布。

可以使用普通最小二乘或者最大似然法找到 $\beta$ 的最优解。

线性回归的概率形式

贝叶斯学派采样概率分布形式描述线性回归模型，可表达为如下形式：

Y \sim  (X β, σ 2)

$Y \sim \mathcal{N}(X \beta, \sigma^2)$
其中，

Y $Y$ 是一个随机变量，其取值（每个数据点）服从正态分布。这个正态分布的均值有线性变量

Xβ $X \beta$ 决定，方差由

σ2 $\sigma^2$ 决定。

这个描述有如下两个优点：
* 先验：可以将我们已有的任何先验赋予参数。
* 量化不确定性：不仅得到单个 $\beta$ 的估计，而是得到 $\beta$ 的整个后验，表征了 $\beta$ 对不同取值的可能性。

PyMC3中的贝叶斯GLMs

使用PyMC3中的GLM模型可以简单的构建复杂模型。

%matplotlib inline

import pymc3 as pm
import numpy as np
import matplotlib.pyplot as plt

生成数据

以截距和斜率产生一条直线，并以直线位均值在其附近产生正态分布的数据点。

# data size
size = 200

# true line parameter
true_intercept = 1
true_slope = 2

# y = a + b*x
x = np.linspace(0, 1, size)
true_regression_line = true_intercept + true_slope * x

# add noise (sampled data)
y = true_regression_line + np.random.normal(scale=.5, size=size)

# simulate data 
data = dict(x=x, y=y)

# plot the data
fig = plt.figure(figsize=(7, 7));
ax = fig.add_subplot(111, xlabel='x', ylabel='y', title='Generated data and underlying model');
ax.plot(x, y, 'x', label='sampled data');
ax.plot(x, true_regression_line, label='true regression line', lw=2.0);
plt.legend(loc=0);

这里写图片描述

直接建模估计

使用贝叶斯线性回归进行建模并拟合这些采样数据。

with pm.Model() as naive_model: # model specifications in PyMC3 are wrapped in a with-statement
    # Define priors
    intercept = pm.Normal('Intercept', 0, sd=20)
    slope = pm.Normal('Slope', 0, sd=20)
    sigma = pm.HalfCauchy('sigma', beta=10, testval=1.)

    # Define likelihood
    likelihood = pm.Normal('y', mu=intercept + slope * x, sd=sigma, observed=y)

使用默认采样器进行采样，PyMC3不需要特别指定采样初始值。NUTS采样器会使用ADVI进行初始化。

with naive_model:
    naive_trace = pm.sample(2000)

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -157.32: 100%|██████████| 200000/200000 [00:18<00:00, 10888.26it/s]
Finished [100%]: Average ELBO = -157.3
100%|██████████| 2000/2000 [00:04<00:00, 432.16it/s]

绘制拟合的结果可以看出，截距(Intercept)的估计值约1.027(真实1)；斜率的估计值约1.893（真实2）。噪声的方差估计值0.528（真实0.5）。进行了准确的估计。

#pm.traceplot(naive_trace);
pm.plot_posterior(naive_trace);

这里写图片描述

GLM建模估计

使用GLM模块进行建模估计。

with pm.Model() as glm_model:
    pm.glm.glm('y~x',data,\
               {'Intercept': pm.Normal.dist(mu=0, sd=20),\
                'sd':pm.HalfCauchy.dist(beta=10, testval=1.),\
                'x':pm.Normal.dist(mu=0, sd=20)})

??pm.glm.glm

with glm_model:
    glm_trace = pm.sample(2000)

Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -157.31: 100%|██████████| 200000/200000 [00:21<00:00, 9193.40it/s] 
Finished [100%]: Average ELBO = -157.32
100%|██████████| 2000/2000 [00:05<00:00, 391.03it/s]

#pm.traceplot(glm_trace);
pm.plot_posterior(glm_trace);

这里写图片描述

分析模型

贝叶斯推断不仅给出了最优的拟合直线，而且给出了待估计参数的完整后验信息。

左侧给出了三个随机变量的边缘后验：x轴表示可能的值，y轴表示取值对应的可能性。右侧给出了三个随机变量的采样值，可以看出三个随机变量都良好的收敛并稳定（没有出现跳变和奇怪的模式）。

其次，随机变量的最大后验估计很接近与真实值（参数模拟数据是给出的值）。

采用均值后验对参数进行估计。

est_intercept=naive_trace['Intercept'].mean();
est_slope=naive_trace['Slope'].mean();
est_regression_line = est_intercept + est_slope * x;

??pm.glm.plot_posterior_predictive

plt.figure(figsize=(7, 7))
plt.plot(x, y, 'x', label='data');

pm.glm.plot_posterior_predictive(glm_trace,samples=200,c='m');
plt.plot(x, true_regression_line, label='true regression line', lw=3.0, c='y');
plt.plot(x,est_regression_line,label = 'mean estimate regression line',c='k');
plt.title('Posterior predictive regression lines');
plt.legend(loc=0);
plt.xlabel('x');
plt.ylabel('y');

这里写图片描述