
SVM Formula Derivations

Chapter 6: Support Vector Machines

6.1 Margin and Support Vectors

The equation of a hyperplane can be written as:
$$w^Tx+b=0 \tag{6.1}$$

Derivation of (6.1)

The sample data set:
$$\begin{pmatrix} x_1^0,x_2^0,x_3^0,\dots,x_m^0 \\ x_1^1,x_2^1,x_3^1,\dots,x_m^1 \\ x_1^2,x_2^2,x_3^2,\dots,x_m^2 \\ x_1^3,x_2^3,x_3^3,\dots,x_m^3 \\ \dots \\ x_1^n,x_2^n,x_3^n,\dots,x_m^n \end{pmatrix}$$
where $x_m^n$ denotes the $n$-th feature of the $m$-th sample. Every sample is given an extra constant feature $x^0 = 1$.

$$\begin{pmatrix} y_1, y_2, y_3, \dots, y_m \end{pmatrix}$$
where $y_m$ denotes the label of the $m$-th sample.
$w^T = \begin{pmatrix} w_0, w_1, w_2, w_3, \dots, w_n \end{pmatrix}$, where $w$ is the normal vector of the hyperplane and determines its direction, and $w_0 x^0 = b$ is the displacement term, which determines the distance between the hyperplane and the origin.

Below, the hyperplane is denoted $(w,b)$. The distance from any point $x$ in the sample space to the hyperplane $(w,b)$ can be written as
$$r = \frac{|w^Tx+b|}{||w||} \tag{6.2}$$
Derivation of (6.2)
How is the distance $d$ computed in the two-dimensional plane? (The point-to-line distance formula.)
The distance from a point $(x,y)$ to the line $Ax+By+C=0$ is
$$d=\frac{|Ax+By+C|}{\sqrt{A^2+B^2}}$$
Extending to $n$-dimensional space, the line becomes the hyperplane $w^Tx+b=0$ (equivalently $w^Tx_b=0$ when $b$ is absorbed into the augmented vectors), and the denominator becomes
$$||w|| = \sqrt{w_1^2+w_2^2 +\cdots + w_n^2}$$
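The distance formula (6.2) can be checked numerically. The following is a minimal sketch in plain Python; the function name `distance_to_hyperplane` and the sample numbers are assumptions made for illustration, not part of the original text.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance r = |w^T x + b| / ||w|| from point x to the hyperplane (w, b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b   # w^T x + b
    norm = math.sqrt(sum(wi * wi for wi in w))         # ||w||
    return abs(score) / norm

# 2-D check against the point-to-line formula:
# line 3x + 4y - 5 = 0, point (3, 4) -> |9 + 16 - 5| / 5
r = distance_to_hyperplane([3.0, 4.0], -5.0, [3.0, 4.0])
print(r)
```

In two dimensions this reduces to the point-to-line formula with $A=w_1$, $B=w_2$, $C=b$.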

Suppose the hyperplane $(w,b)$ classifies the training samples correctly; that is, for $(x_i,y_i)\in D$, if $y_i = +1$ then $w^Tx_i + b \gt 0$, and if $y_i = -1$ then $w^Tx_i + b \lt 0$. Let
$$\begin{cases} w^Tx_i+b \geq +1, & y_i = +1 \\ w^Tx_i+b \leq -1, & y_i = -1 \end{cases} \tag{6.3}$$
Derivation of (6.3)
The distance from any point in space to the hyperplane is

$$\frac{|w^Tx+b|}{||w||}, \qquad ||w|| = \sqrt{w_1^2+w_2^2 +\cdots + w_n^2}$$

Let $d$ be the distance from the hyperplane to the nearest sample. Because every sample is classified correctly,
$$\begin{cases} \frac{w^Tx_i+b}{||w||}\geq d & \forall y_i=1 \\ \frac{w^Tx_i+b}{||w||}\leq -d & \forall y_i=-1 \end{cases} \Longrightarrow \begin{cases} w_d^T x_i + b_d\geq 1 & \forall y_i=1 \\ w_d^T x_i+ b_d \leq -1 & \forall y_i=-1 \end{cases}$$
where
$$\begin{cases} w_d^T = \frac{w^T}{||w||\,d} \\ b_d = \frac{b}{||w||\,d} \end{cases}$$
The decision-boundary hyperplane then has the equation $w_d^Tx+b_d=0$. It is the same hyperplane as $w^Tx+b=0$, since both sides were only divided by the positive constant $||w||\,d$.
Rename: writing $w^T$ for $w_d^T$ and $b$ for $b_d$ gives the hyperplane $w^Tx+b=0$ together with
$$\begin{cases} w^T x_i + b\geq 1 & \forall y_i=1 \\ w^T x_i+ b \leq -1 & \forall y_i=-1 \end{cases}$$
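The rescaling step in this derivation can be sketched in code: given any separating $(w,b)$, dividing both by the smallest value of $y_i(w^Tx_i+b)$ yields the form in (6.3). The helper name `canonicalize` and the toy data are hypothetical.

```python
def canonicalize(w, b, X, y):
    """Rescale (w, b) so the closest sample satisfies y_i (w^T x_i + b) = 1."""
    margins = [
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        for xi, yi in zip(X, y)
    ]
    smallest = min(margins)  # functional margin of the nearest sample
    if smallest <= 0:
        raise ValueError("(w, b) does not separate the samples")
    return [wj / smallest for wj in w], b / smallest

# Toy data (assumed): two positive samples and one negative sample.
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0]]
y = [1, 1, -1]
w_c, b_c = canonicalize([1.0, 1.0], 0.0, X, y)
print(w_c, b_c)
```

The hyperplane itself is unchanged by the rescaling; only the representation $(w,b)$ is normalized.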
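The training samples closest to the hyperplane make the equalities in (6.3) hold; they are called "support vectors". The sum of the distances from two support vectors of different classes to the hyperplane is
$$\gamma =\frac{2}{||w||} \tag{6.4}$$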
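Derivation of (6.4)
The support vectors of one class, on one side of the hyperplane, lie on the boundary
$$w^T x_i + b = 1$$
By (6.2), the distance from the hyperplane to this boundary is
$$\frac{1}{||w||}$$
The boundary $w^Tx_i+b=-1$ on the other side lies at the same distance, so the distance between the two parallel boundaries (compare the distance between two parallel lines) is $\gamma = \frac{2}{||w||}$.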

To find the separating hyperplane with the "maximum margin", we look for the parameters $w$ and $b$ that satisfy the constraints in (6.3) and make $\gamma$ as large as possible, that is,
$$\max_{w,b}\frac{2}{||w||} \tag{6.5}$$

$$s.t.\quad y_i(w^Tx_i+b) \geq 1,\quad i=1,2,\dots,m$$
Clearly, maximizing the margin $\gamma$ only requires maximizing $\frac{2}{||w||}$, which is equivalent to minimizing $||w||$ (the factor $\frac{1}{2}$ and the square are added only for computational convenience):
$$\min_{w,b}\frac{1}{2} ||w||^2 \tag{6.6}$$

$$s.t.\quad y_i(w^Tx_i+b) \geq 1,\quad i=1,2,\dots,m$$
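As a small illustration of why (6.5) and (6.6) agree, the sketch below compares two feasible hyperplanes for a two-point toy data set. The data and the helper names are assumptions; no optimizer is involved.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_feasible(w, b, X, y, tol=1e-12):
    """True if every sample satisfies y_i (w^T x_i + b) >= 1."""
    return all(yi * (dot(w, xi) + b) >= 1.0 - tol for xi, yi in zip(X, y))

def margin(w):
    """Margin gamma = 2 / ||w|| from (6.4)."""
    return 2.0 / math.sqrt(dot(w, w))

def objective(w):
    """Objective 0.5 * ||w||^2 from (6.6)."""
    return 0.5 * dot(w, w)

X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
wide = ([0.5, 0.5], 0.0)    # constraints hold with equality
narrow = ([1.0, 1.0], 0.0)  # feasible, but ||w|| is larger
for w, b in (wide, narrow):
    print(is_feasible(w, b, X, y), margin(w), objective(w))
```

Both candidates satisfy the constraints, and the one with the smaller objective value is the one with the larger margin.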

6.2 The Dual Problem

Solving (6.6) yields the model corresponding to the maximum-margin separating hyperplane
$$f(x) = w^Tx+b \tag{6.7}$$
Applying the method of Lagrange multipliers to (6.6) gives its "dual problem". Specifically, adding a Lagrange multiplier $\alpha_i \geq 0$ to each constraint of (6.6), the Lagrangian of the problem can be written as
$$L(w,b,\alpha) = \frac{1}{2}||w||^2+\sum_{i=1}^m\alpha_i(1-y_i(w^Tx_i+b)) \tag{6.8}$$
Derivation from (6.8)
Here $\alpha = (\alpha_1;\alpha_2;\dots;\alpha_m)$. Set the partial derivatives of $L(w,b,\alpha)$ with respect to $w$ and $b$ to zero. First expand the Lagrangian:
$$\begin{aligned}L(w,b,\alpha) &= \frac{1}{2}||w||^2+\sum_{i=1}^m\alpha_i(1-y_i(w^Tx_i+b)) \\ &= \frac{1}{2}||w||^2+\sum_{i=1}^m(\alpha_i-\alpha_iy_iw^Tx_i-\alpha_iy_ib)\\ &=\frac{1}{2}||w||^2+\sum_{i=1}^m\alpha_i -\sum_{i=1}^m\alpha_iy_iw^Tx_i -\sum_{i=1}^m\alpha_iy_ib \end{aligned}$$
(1) Take the partial derivatives with respect to $w$ and $b$:

$$\frac {\partial L}{\partial w}=w - \sum_{i=1}^{m}\alpha_iy_ix_i = 0 \Longrightarrow w=\sum_{i=1}^{m}\alpha_iy_ix_i$$

$$\frac {\partial L}{\partial b}=-\sum_{i=1}^{m}\alpha_iy_i = 0 \Longrightarrow \sum_{i=1}^{m}\alpha_iy_i = 0$$
$$w = \sum_{i=1}^m\alpha_iy_ix_i \tag{6.9}$$

$$0=\sum_{i=1}^m\alpha_iy_i \tag{6.10}$$

$$\max_{\alpha}\sum_{i=1}^m\alpha_i - \frac{1}{2}\sum_{i = 1}^m\sum_{j=1}^m\alpha_i \alpha_j y_iy_jx_i^Tx_j \tag{6.11}$$
Derivation of (6.11)
Substituting (6.9) into (6.8) eliminates $w$ and $b$ from $L(w,b,\alpha)$; taking the constraint (6.10) into account then gives the dual problem of (6.6):
$$\begin{aligned} L(w,b,\alpha) &=\frac {1}{2}w^Tw+\sum _{i=1}^m\alpha_i[1-y_i(w^Tx_i+b)]\\ &=\frac {1}{2}w^Tw+\sum _{i=1}^m\alpha_i - \sum _{i=1}^m\alpha_iy_iw^Tx_i-\sum _{i=1}^m\alpha_iy_ib\\ &=\frac {1}{2}w^T\sum _{i=1}^m\alpha_iy_ix_i-w^T\sum _{i=1}^m\alpha_iy_ix_i+\sum _{i=1}^m\alpha_i -\sum _{i=1}^m\alpha_iy_ib\\ &= -\frac {1}{2}w^T\sum _{i=1}^m\alpha_iy_ix_i+\sum _{i=1}^m\alpha_i -b\sum _{i=1}^m\alpha_iy_i\\ &=-\frac {1}{2}\Big(\sum_{i=1}^{m}\alpha_iy_ix_i\Big)^T\Big(\sum _{j=1}^m\alpha_jy_jx_j\Big)+\sum _{i=1}^m\alpha_i -b\sum _{i=1}^m\alpha_iy_i\\ &=-\frac {1}{2}\sum_{i=1}^{m}\alpha_iy_ix_i^T\sum _{j=1}^m\alpha_jy_jx_j+\sum _{i=1}^m\alpha_i \qquad \Big(\text{using } \sum_{i=1}^{m}\alpha_iy_i = 0\Big)\\ &=\sum _{i=1}^m\alpha_i-\frac {1}{2}\sum_{i=1 }^{m}\sum_{j=1}^{m}\alpha_i\alpha_jy_iy_jx_i^Tx_j \end{aligned}$$

$$s.t. \quad \sum_{i=1}^m \alpha_i y_i = 0$$

$$\alpha _i \geq 0,\quad i=1,2,\dots,m$$

Substituting (6.9) into $f(x)$ gives
$$f(x) = w^Tx+b = \sum_{i=1}^{m}\alpha_iy_ix_i^Tx+b \tag{6.12}$$
The $\alpha_i$ obtained from the dual problem (6.11) are the Lagrange multipliers in (6.8), and each corresponds to a training sample $(x_i ,y_i)$. Note that (6.6) has inequality constraints, so the process above must satisfy the KKT (Karush-Kuhn-Tucker) conditions:
$$\begin{cases} \alpha_i \geq 0; \\ y_if(x_i)-1 \geq 0; \\ \alpha_i(y_if(x_i) -1) =0. \end{cases} \tag{6.13}$$
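The relations (6.9), (6.11) and (6.12) can be tried on a two-point toy problem. In the sketch below the multipliers `alpha` are chosen by hand so that (6.9) and (6.10) hold; the data, the value of `b`, and all function names are assumptions for illustration, and no solver is run.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weights_from_alpha(alpha, X, y):
    """w = sum_i alpha_i y_i x_i, as in (6.9)."""
    w = [0.0] * len(X[0])
    for a, xi, yi in zip(alpha, X, y):
        for j, xij in enumerate(xi):
            w[j] += a * yi * xij
    return w

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j, as in (6.11)."""
    m = len(X)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(m) for j in range(m)
    )
    return sum(alpha) - 0.5 * quad

def decision(alpha, X, y, b, x):
    """f(x) = sum_i alpha_i y_i x_i^T x + b, as in (6.12)."""
    return sum(a * yi * dot(xi, x) for a, xi, yi in zip(alpha, X, y)) + b

X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.25, 0.25]   # hand-picked; satisfies sum_i alpha_i y_i = 0
b = 0.0                # assumed bias for this symmetric toy problem

w = weights_from_alpha(alpha, X, y)
print(w, dual_objective(alpha, X, y), decision(alpha, X, y, b, [1.0, 1.0]))
```

For these values the dual objective equals the primal objective $\frac{1}{2}||w||^2$, and both samples satisfy $y_if(x_i)=1$.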
Using the SMO algorithm, fix all parameters other than $\alpha_i$ and $\alpha_j$; then
$$\alpha_iy_i + \alpha_jy_j = c,\quad \alpha_i \geq 0 ,\quad \alpha_j\geq 0 \tag{6.14}$$
where
$$c = -\sum_{k\ne i,j}\alpha_ky_k \tag{6.15}$$
is the constant that keeps $\sum_{i=1}^m\alpha_iy_i=0$. The relation
$$\alpha_iy_i+\alpha_jy_j = c \tag{6.16}$$
can be used to eliminate $\alpha_j$ from (6.11), leaving a single-variable quadratic programming problem in $\alpha_i$ whose only constraint is $\alpha_i \geq 0$.
For any support vector $(x_s,y_s)$,
$$\begin{cases} w^T x_s + b=1 & \forall y_s=1 \\ w^T x_s+ b= -1 & \forall y_s=-1 \end{cases} \Longrightarrow y_sf(x_s)= 1$$
$$y_s\Big(\sum_{i\in S}\alpha_iy_ix_i^Tx_s+b\Big) = 1 \tag{6.17}$$
where $S = \{i \mid \alpha_i > 0,\ i=1,2,\dots,m\}$ is the index set of all support vectors.
Derivation of (6.18)
Multiply both sides of (6.17) by $y_s$:
$$y_s^2\Big(\sum_{i\in S}\alpha_iy_ix_i^Tx_s+b\Big) = y_s,\quad \text{where } y_s^2 =1$$
so $b = y_s - \sum_{i\in S}\alpha_iy_ix_i^Tx_s$ for every support vector. Averaging over all support vectors gives

$$b = \frac{1}{|S|}\sum_{s\in S}\Big(y_s - \sum_{i\in S}\alpha _iy_ix_i^Tx_s\Big) \tag{6.18}$$
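Equation (6.18) translates directly into a short routine. The following sketch assumes the multipliers are already known; the toy data, the hand-picked `alpha`, and the function name `bias_from_support_vectors` are illustrative assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bias_from_support_vectors(alpha, X, y, tol=1e-9):
    """Average of y_s - sum_{i in S} alpha_i y_i x_i^T x_s over s in S, as in (6.18)."""
    S = [i for i, a in enumerate(alpha) if a > tol]  # indices with alpha_i > 0
    total = 0.0
    for s in S:
        total += y[s] - sum(alpha[i] * y[i] * dot(X[i], X[s]) for i in S)
    return total / len(S)

# Toy problem (assumed): the separating hyperplane is x1 + x2 = 2,
# i.e. w = (0.5, 0.5), b = -1 in the scaling of (6.3).
X = [[2.0, 2.0], [0.0, 0.0]]
y = [1, -1]
alpha = [0.25, 0.25]   # hand-picked; satisfies (6.9) and (6.10)
b = bias_from_support_vectors(alpha, X, y)
print(b)
```

Averaging over all support vectors, rather than using a single one, is what (6.18) prescribes; with exact multipliers every support vector yields the same value.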

6.3 Kernel Functions

Let $\phi(x)$ denote the feature vector obtained by mapping $x$ into a suitable higher-dimensional space. The model for the separating hyperplane in that feature space is
$$f(x) = w^T\phi(x) +b \tag{6.19}$$
and, analogously to (6.6),
$$\min_{w,b}\frac{1}{2}||w||^2 \tag{6.20}$$

$$s.t. \quad y_i(w^T\phi(x_i)+ b)\geq 1,\quad i = 1,2,\dots,m$$

Its dual problem is
$$\max_{\alpha}\sum_{i=1}^m\alpha_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j) \tag{6.21}$$

$$s.t. \quad \sum_{i=1}^m\alpha_iy_i = 0$$
$$\alpha_i\geq 0,\quad i =1,2,\dots,m$$
(Related background: positive semidefinite and positive definite matrices.)
Define the kernel function
$$\kappa(x_i,x_j) =\langle \phi(x_i),\phi(x_j) \rangle = \phi(x_i)^T\phi(x_j) \tag{6.22}$$
Then (6.21) can be rewritten as
$$\max_{\alpha} \sum_{i=1}^m\alpha_i - \frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m\alpha_i\alpha_jy_iy_j\kappa(x_i,x_j) \tag{6.23}$$

$$s.t. \quad \sum_{i=1}^m\alpha_iy_i =0$$

$$\alpha_i \geq 0,\quad i =1,2,\dots,m$$
Solving it gives
$$\begin{aligned} f(x) &=w^T\phi(x) +b \\&=\sum_{i=1}^m\alpha_iy_i\phi(x_i)^T\phi(x) +b \\&=\sum_{i=1}^m\alpha_iy_i\kappa(x,x_i) +b\end{aligned} \tag{6.24}$$
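To make (6.22) concrete, the sketch below uses a quadratic kernel on 2-D inputs, $\kappa(x,z)=(x^Tz)^2$, whose explicit feature map is $\phi(x)=(x_1^2,\sqrt{2}x_1x_2,x_2^2)$. This particular kernel is an assumption chosen for illustration; the original text does not single out any kernel.

```python
import math

def phi(x):
    """Explicit feature map for the quadratic kernel on 2-D inputs (assumed example)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def kernel(x, z):
    """kappa(x, z) = (x^T z)^2, computed without forming phi."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = [1.0, 2.0], [3.0, -1.0]
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # phi(x)^T phi(z)
implicit = kernel(x, z)                                # kappa(x, z)
print(explicit, implicit)
```

Both values agree up to floating-point rounding, which is why (6.23) and (6.24) never need $\phi$ explicitly.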
Kernel functions can also be obtained by combining functions, for example:

  • If $\kappa_1$ and $\kappa_2$ are kernel functions, then for any positive numbers $\gamma_1$, $\gamma_2$, their linear combination
    $$\gamma_1\kappa_1+\gamma_2\kappa_2 \tag{6.25}$$
    is also a kernel function.
  • If $\kappa_1$ and $\kappa_2$ are kernel functions, then their direct product
    $$\kappa_1\otimes\kappa_2(x,z) = \kappa_1(x,z)\kappa_2(x,z) \tag{6.26}$$
    is also a kernel function.
  • If $\kappa_1$ is a kernel function, then for any function $g(x)$,
    $$\kappa(x,z)=g(x)\kappa_1(x,z)g(z) \tag{6.27}$$
    is also a kernel function.
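The three constructions (6.25) to (6.27) can be written as higher-order functions. In this sketch the base kernels `linear` and `quadratic`, the function `g`, and the helper names are all assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear(x, z):
    return dot(x, z)

def quadratic(x, z):
    return dot(x, z) ** 2

def combine(k1, k2, g1, g2):
    """(6.25): gamma_1 * k1 + gamma_2 * k2 with positive gamma_1, gamma_2."""
    return lambda x, z: g1 * k1(x, z) + g2 * k2(x, z)

def product(k1, k2):
    """(6.26): pointwise product k1(x, z) * k2(x, z)."""
    return lambda x, z: k1(x, z) * k2(x, z)

def rescale(k1, g):
    """(6.27): g(x) * k1(x, z) * g(z) for an arbitrary function g."""
    return lambda x, z: g(x) * k1(x, z) * g(z)

x, z = [1.0, 2.0], [3.0, -1.0]
k_sum = combine(linear, quadratic, 2.0, 3.0)
k_prod = product(linear, quadratic)
k_g = rescale(linear, lambda v: 1.0 + v[0])
print(k_sum(x, z), k_prod(x, z), k_g(x, z))
```

Each combined function stays symmetric in its two arguments, as any kernel function must be.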

6.4 Soft Margin and Regularization

The hard-margin constraint is
$$y_i(w^Tx_i+b) \geq 1,\quad i=1,2,\dots,m \tag{6.28}$$
Allowing some samples to violate it, while keeping the number of violations small, gives the objective
$$\min_{w,b}\frac{1}{2} ||w||^2+C\sum_{i=1}^{m}l_{0/1}(y_i(w^Tx_i+b)-1) \tag{6.29}$$
where $C>0$ is a constant and $l_{0/1}$ is the 0/1 loss
$$l_{0/1}(z)=\begin{cases} 1, & z<0 \\ 0, & \text{otherwise} \end{cases} \tag{6.30}$$
Starting from (6.6), the samples that violate the constraint are brought into the objective through the $l_{0/1}$ loss, which allows a small number of samples to violate the constraint.
Hinge loss: $$l_{hinge}(z)=\max(0,1-z) \tag{6.31}$$
Exponential loss: $$l_{exp}(z)=\exp(-z) \tag{6.32}$$
Logistic loss: $$l_{log}(z)=\log(1+\exp(-z)) \tag{6.33}$$
Replacing $l_{0/1}$ with the hinge loss gives
$$\min_{w,b}\frac{1}{2} ||w||^2+C\sum_{i=1}^{m}\max(0,1-y_i(w^Tx_i+b)) \tag{6.34}$$
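The four losses (6.30) to (6.33) are simple enough to write down directly. A minimal sketch follows, with hypothetical function names, evaluated at a few values of $z$ (for the surrogate losses, $z = y_if(x_i)$).

```python
import math

def loss_01(z):
    """0/1 loss (6.30): 1 if z < 0, else 0."""
    return 1.0 if z < 0 else 0.0

def loss_hinge(z):
    """Hinge loss (6.31)."""
    return max(0.0, 1.0 - z)

def loss_exp(z):
    """Exponential loss (6.32)."""
    return math.exp(-z)

def loss_log(z):
    """Logistic loss (6.33)."""
    return math.log(1.0 + math.exp(-z))

for z in (-1.0, 0.0, 0.5, 2.0):
    print(z, loss_01(z), loss_hinge(z), loss_exp(z), loss_log(z))
```

Unlike the 0/1 loss, the hinge loss is continuous and convex, and it is never smaller than the 0/1 loss at the same $z$.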

Introduce "slack variables" $\varepsilon_i \geq 0$. To make this easier to picture, the four points lying inside the margin are marked in red in the figure; the pink segment has the length of the functional margin (here 1), the blue segment is $\varepsilon$, and the green segment is $1-\varepsilon$. With the slack variables, (6.34) can be rewritten as
$$\min_{w,b,\varepsilon_i}\frac{1}{2} ||w||^2+C\sum_{i=1}^{m}\varepsilon_i \tag{6.35}$$
$$s.t.\quad y_i(w^Tx_i+b) \geq 1-\varepsilon_i,\quad i=1,2,\dots,m$$
$$\varepsilon_i \geq 0, \quad i= 1,2,\dots,m.$$
[Figure: samples inside the margin, with the segments of length 1, $\varepsilon$, and $1-\varepsilon$ marked]
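For a fixed $(w,b)$, the smallest slack that satisfies the constraints of the soft-margin problem is $\max(0, 1-y_i(w^Tx_i+b))$, which is exactly the hinge loss of sample $i$. The sketch below computes these slacks and the resulting objective value; the data, $w$, $b$, $C$, and the function names are assumed for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def slack_variables(w, b, X, y):
    """Smallest feasible slack per sample: max(0, 1 - y_i (w^T x_i + b))."""
    return [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of slacks)."""
    return 0.5 * dot(w, w) + C * sum(slack_variables(w, b, X, y))

w, b, C = [0.5, 0.5], 0.0, 1.0
X = [[2.0, 2.0], [1.0, 1.0], [0.5, 0.5], [-1.0, -1.0], [-0.5, -0.5]]
y = [1, 1, 1, -1, 1]   # the last sample lies on the wrong side

slacks = slack_variables(w, b, X, y)
print(slacks, soft_margin_objective(w, b, X, y, C))
```

A slack of 0 means the sample is on or outside the margin boundary, a value between 0 and 1 means it is inside the margin but correctly classified, and a value above 1 means it is misclassified.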
The Lagrangian of (6.35) is
$$L(w,b,\alpha,\varepsilon ,\mu) = \frac{1}{2}||w||^2+C\sum_{i=1}^m \varepsilon_i +\sum_{i=1}^m \alpha_i(1-\varepsilon_i-y_i(w^Tx_i+b))-\sum_{i=1}^m\mu_i \varepsilon_i \tag{6.36}$$
Because there are two sets of constraints here, two sets of Lagrange multipliers are introduced, $\alpha_i \geq 0$ and $\mu_i \geq 0$.
Set the derivatives with respect to $w$, $b$, and $\varepsilon_i$ to zero:
$$\frac {\partial L}{\partial w}=w - \sum_{i=1}^{m}\alpha_iy_ix_i = 0 \Longrightarrow w=\sum_{i=1}^{m}\alpha_iy_ix_i \tag{6.37}$$

$$\frac {\partial L}{\partial b}=-\sum_{i=1}^{m}\alpha_iy_i=0 \Longrightarrow \sum_{i=1}^{m}\alpha_iy_i = 0 \tag{6.38}$$

$$\frac{\partial L}{\partial \varepsilon_i}=C-\alpha_i -\mu_i = 0 \Longrightarrow C=\alpha_i +\mu_i \tag{6.39}$$

Substituting (6.37)-(6.39) into (6.36) gives the dual problem of (6.35). (Such pairing is familiar from linear programming, where every problem has a counterpart associated with it; one is called the primal problem and the other the dual problem.)
$$\begin{aligned} L(w,b,\alpha,\varepsilon ,\mu) &= \frac{1}{2}||w||^2+C\sum_{i=1}^m \varepsilon_i+\sum_{i=1}^m \alpha_i(1-\varepsilon_i-y_i(w^Tx_i+b))-\sum_{i=1}^m\mu_i \varepsilon_i \\ &=\frac{1}{2}||w||^2+\sum_{i=1}^m\alpha_i(1-y_i(w^Tx_i+b))+C\sum_{i=1}^m \varepsilon_i-\sum_{i=1}^m \alpha_i \varepsilon_i-\sum_{i=1}^m\mu_i \varepsilon_i \\ &= -\frac {1}{2}\sum_{i=1}^{m}\sum_{j=1}^m\alpha_i\alpha_jy_iy_jx_i^Tx_j+\sum _{i=1}^m\alpha_i +\sum_{i=1}^m (C-\alpha_i-\mu_i)\varepsilon_i \\ &=\sum _{i=1}^m\alpha_i-\frac {1}{2}\sum_{i=1 }^{m}\sum_{j=1}^{m}\alpha_i\alpha_jy_iy_jx_i^Tx_j \end{aligned}$$
Since $\mu_i = C-\alpha_i \geq 0$, each $\alpha_i$ is bounded above by $C$, and the dual problem is
$$\max_{\alpha}\sum _{i=1}^m\alpha_i-\frac {1}{2}\sum_{i=1 }^{m}\sum_{j=1}^{m}\alpha_i\alpha_jy_iy_jx_i^Tx_j \tag{6.40}$$
$$s.t. \quad \sum_{i=1}^m \alpha_i y_i=0$$
$$0 \leq\alpha_i \leq C, \quad i=1,2,\dots ,m$$

KKT conditions. Briefly: for an inequality constraint written as $g \leq 0$ with Lagrange multiplier $\alpha$, the KKT conditions require
$$\begin{cases} \alpha \geq 0 \\ g \leq 0 \\ \alpha g=0 \end{cases}$$
For the soft-margin support vector machine this gives
$$\begin{cases} \alpha_i \geq0, \quad \mu_i \geq0 ,\\ y_if(x_i)-1+\varepsilon_i \geq0,\\ \alpha_i(y_if(x_i)-1+\varepsilon_i)=0,\\ \varepsilon_i \geq0, \quad \mu_i \varepsilon_i=0. \end{cases} \tag{6.41}$$

Abstracting over the different loss functions gives the general form
$$\min_{f} \Omega(f)+C\sum_{i=1}^ml(f(x_i),y_i) \tag{6.42}$$
where $\Omega(f)$ is the regularization term describing the model $f$ itself, and the second term measures how well the model fits the training data.

6.5 Support Vector Regression

A brief remark on how classification and regression differ: the 0/1 classification loss takes only the values 1 or 0, whereas a regression loss takes continuous values. Support vector regression tolerates a deviation of up to $\epsilon$ between $f(x)$ and $y$ (read (6.43) and (6.44) together with the figure):
$$\min_{w,b}\frac{1}{2} ||w||^2+C\sum_{i=1}^{m}l_{\epsilon}(f(x_i)-y_i) \tag{6.43}$$
$$l_{\epsilon}(z)= \begin{cases} 0, & \text{if } |z| \leq \epsilon \\ |z|-\epsilon, & \text{otherwise} \end{cases} \tag{6.44}$$
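The $\epsilon$-insensitive loss (6.44) can be written in one line, since it equals $\max(0, |z|-\epsilon)$. A minimal sketch with a hypothetical function name and made-up residuals:

```python
def eps_insensitive(z, eps):
    """l_eps(z) from (6.44): zero inside the band |z| <= eps, linear outside."""
    return max(0.0, abs(z) - eps)

# Residuals z = f(x_i) - y_i (assumed values) with eps = 0.1.
residuals = [0.05, -0.3, 0.5, -0.1]
losses = [eps_insensitive(z, 0.1) for z in residuals]
print(losses)
```

Residuals inside the band contribute nothing to (6.43); only the part of a residual that exceeds $\epsilon$ is penalized.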
[Figure: the $\epsilon$ band around $f(x)$ and the slack variables on its two sides]
As the figure shows, slack variables $\varepsilon_i$ and $\hat\varepsilon_i$ are introduced on the two sides of the band; they add an extra "tolerance" on top of the width $\epsilon$.
$$\min_{w,b,\varepsilon_i,\hat \varepsilon_i}\frac{1}{2} ||w||^2+C\sum_{i=1}^{m}(\varepsilon_i+\hat \varepsilon_i) \tag{6.45}$$
$$s.t.\quad f(x_i)-y_i\leq\epsilon+\varepsilon_i,$$
$$y_i-f(x_i) \leq\epsilon+\hat\varepsilon_i,$$
$$\varepsilon_i \geq0,\quad \hat\varepsilon_i \geq 0,\quad i=1,2,\dots,m.$$
Introduce Lagrange multipliers $\mu_i \geq0$, $\hat\mu_i\geq0$ for the two slack variables and $\alpha_i\geq0$, $\hat\alpha_i\geq0$ for the two constraints:
$$\begin{aligned} L(w,b,\alpha,\hat\alpha,\varepsilon,\hat\varepsilon ,\mu,\hat\mu) = {} & \frac{1}{2}||w||^2+C\sum_{i=1}^m (\varepsilon_i+\hat\varepsilon_i)-\sum_{i=1}^m\mu_i \varepsilon_i-\sum_{i=1}^m\hat\mu_i \hat\varepsilon_i \\ & +\sum_{i=1}^m \alpha_i(f(x_i)-y_i-\epsilon-\varepsilon_i)+\sum_{i=1}^m\hat\alpha_i(y_i-f(x_i)-\epsilon-\hat\varepsilon_i) \end{aligned} \tag{6.46}$$

Set the partial derivatives of $L(w,b,\alpha,\hat\alpha,\varepsilon,\hat\varepsilon ,\mu,\hat\mu)$ with respect to $w$, $b$, $\varepsilon_i$, $\hat\varepsilon_i$ to zero, using $f(x_i)=w^Tx_i+b$:
$$\frac{\partial L}{\partial w}=w+\sum_{i=1}^m\alpha_ix_i-\sum_{i=1}^m\hat\alpha_ix_i=0 \Longrightarrow w=\sum_{i=1}^m(\hat\alpha_i-\alpha_i)x_i \tag{6.47}$$
$$\frac{\partial L}{\partial b}=\sum_{i=1}^m\alpha_i -\sum_{i=1}^m\hat\alpha_i=0 \Longrightarrow 0=\sum_{i=1}^m(\hat\alpha_i-\alpha_i) \tag{6.48}$$
$$\frac{\partial L}{\partial \varepsilon_i}=C-\alpha_i -\mu_i = 0 \Longrightarrow C=\alpha_i +\mu_i \tag{6.49}$$
$$\frac{\partial L}{\partial \hat\varepsilon_i}=C-\hat\alpha_i -\hat\mu_i = 0 \Longrightarrow C=\hat\alpha_i +\hat\mu_i \tag{6.50}$$

Substituting (6.47)-(6.50) into (6.46) gives the dual problem of SVR:
$$\begin{aligned} L(w,b,\alpha,\hat\alpha,\varepsilon,\hat\varepsilon ,\mu,\hat\mu) = {} & \frac{1}{2}w^Tw+\sum_{i=1}^m\big[(\alpha_i+\mu_i)\varepsilon_i+(\hat\alpha_i+\hat\mu_i)\hat\varepsilon_i\big]-\sum_{i=1}^m\mu_i\varepsilon_i-\sum_{i=1}^m\hat\mu_i\hat\varepsilon_i\\ & +\sum_{i=1}^m\alpha_i\big((w^Tx_i+b)-y_i-\epsilon-\varepsilon_i\big)+\sum_{i=1}^m\hat\alpha_i\big(y_i-(w^Tx_i+b)-\epsilon-\hat\varepsilon_i\big)\\ = {} & \frac{1}{2}w^Tw+\sum_{i=1}^m(\alpha_i\varepsilon_i+\hat\alpha_i\hat\varepsilon_i-\alpha_i\varepsilon_i-\hat\alpha_i\hat\varepsilon_i)\\ & +\sum_{i=1}^m[y_i(\hat\alpha_i-\alpha_i)-\epsilon(\hat\alpha_i+\alpha_i)]-w^T\sum_{i=1}^m(\hat\alpha_i-\alpha_i)x_i-b\sum_{i=1}^m(\hat\alpha_i-\alpha_i)\\ = {} & \sum_{i=1}^m[y_i(\hat\alpha_i-\alpha_i)-\epsilon(\hat\alpha_i+\alpha_i)]+\frac{1}{2}w^Tw-w^Tw\\ = {} & \sum_{i=1}^m[y_i(\hat\alpha_i-\alpha_i)-\epsilon(\hat\alpha_i+\alpha_i)]-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m(\hat\alpha_i-\alpha_i)(\hat\alpha_j-\alpha_j)x_i^Tx_j \end{aligned}$$
The third equality uses (6.47) and (6.48). The dual problem is therefore
$$\max_{\alpha,\hat\alpha}\sum_{i=1}^m[y_i(\hat\alpha_i-\alpha_i)-\epsilon(\hat\alpha_i+\alpha_i)]-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m(\hat\alpha_i-\alpha_i)(\hat\alpha_j-\alpha_j)x_i^Tx_j \tag{6.51}$$
$$s.t.\quad \sum_{i=1}^m(\hat\alpha_i-\alpha_i)=0$$
$$0\leq\alpha_i,\hat\alpha_i\leq C$$

The process above must satisfy the KKT conditions, which require
$$\begin{cases} \alpha_i(f(x_i)-y_i-\epsilon-\varepsilon_i)=0 \\ \hat\alpha_i(y_i-f(x_i)-\epsilon-\hat\varepsilon_i)=0\\ \alpha_i\hat\alpha_i=0,\quad \varepsilon_i\hat\varepsilon_i=0\\ (C-\alpha_i)\varepsilon_i=0,\quad (C-\hat\alpha_i)\hat\varepsilon_i=0 \end{cases} \tag{6.52}$$

Substituting (6.47) into (6.7):
$$f(x) = w^Tx+b \tag{6.7}$$
$$w=\sum_{i=1}^m(\hat\alpha_i-\alpha_i)x_i \tag{6.47}$$
$$f(x)=\sum_{i=1}^m(\hat\alpha_i-\alpha_i)x_i^Tx+b \tag{6.53}$$

For a sample with $0 < \alpha_i < C$, the KKT conditions (6.52) give $\varepsilon_i = 0$ and $f(x_i)-y_i-\epsilon=0$, so
$$b=y_i+\epsilon-\sum_{j=1}^m(\hat\alpha_j-\alpha_j)x_j^Tx_i \tag{6.54}$$

With a feature map $\phi$, (6.47) becomes
$$w=\sum_{i=1}^m(\hat\alpha_i-\alpha_i)\phi (x_i) \tag{6.55}$$

and the SVR model can be written as
$$f(x)=\sum_{i=1}^m(\hat\alpha_i-\alpha_i)\kappa(x,x_i)+b \tag{6.56}$$
where $\kappa(x_i,x_j)=\phi(x_i)^T \phi(x_j)$ is the kernel function.
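The prediction rule (6.56) is a weighted sum of kernel evaluations. The sketch below uses a linear kernel and made-up multipliers that satisfy $\sum_i(\hat\alpha_i-\alpha_i)=0$; they are not the solution of any fitted problem, and all names and values are assumptions for illustration.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def svr_predict(alpha, alpha_hat, X, b, x, kernel):
    """f(x) = sum_i (alpha_hat_i - alpha_i) * kappa(x, x_i) + b, as in (6.56)."""
    return sum(
        (ah - a) * kernel(x, xi) for a, ah, xi in zip(alpha, alpha_hat, X)
    ) + b

X = [[0.0], [2.0]]
alpha = [0.5, 0.0]       # assumed multipliers
alpha_hat = [0.0, 0.5]   # chosen so that sum(alpha_hat - alpha) = 0
b = 0.3                  # assumed bias

pred = svr_predict(alpha, alpha_hat, X, b, [1.5], linear_kernel)
print(pred)
```

With the linear kernel the same value is obtained from $w^Tx+b$ with $w$ given by (6.47); swapping in another kernel function changes only the `kernel` argument.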

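6.6 Kernel Methods [Equations 6.57-6.70]

The core idea is to map the data into a higher-dimensional space in the hope that they are easier to separate there. A kernel function is a way of computing inner products in that higher-dimensional space, so the essence of kernel methods is the inner product, and the inner product in turn defines a notion of similarity.

Reproducing kernel Hilbert space (RKHS)
Let $H$ be the RKHS associated with the kernel $\kappa$, and let $||h||_H$ be the norm of $h$ in $H$. By the representer theorem, for any monotonically increasing function $\Omega$ and any nonnegative loss function $\ell$, the optimization problem
$$\min_{h \in H}F(h) = \Omega(||h||_H)+\ell(h(x_1),h(x_2),\dots, h(x_m)) \tag{6.57}$$

always has a solution of the form
$$h^*(x) = \sum_{i=1}^m\alpha_i\kappa(x,x_i) \tag{6.58}$$
For the proof, see the wiki.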

Map the samples into a higher-dimensional space and perform linear discriminant analysis there:
$$h(x) =w^T\phi(x) \tag{6.59}$$

The learning objective is
$$\max_{w}J(w) = \frac{w^TS_b^ \phi w}{w^TS_w^\phi w} \tag{6.60}$$

where $S_b^\phi$ and $S_w^\phi$ are the between-class and within-class scatter matrices in the feature space. Let $X_i$ be the set of samples of class $i \in \{0,1\}$ and $m_i$ its size. The class mean in the feature space is
$$\mu_i^\phi = \frac{1}{m_i}\sum_{x\in X_i}\phi(x) \tag{6.61}$$

and the two scatter matrices are
$$S_b^\phi =(\mu_1^\phi - \mu_0^\phi )(\mu_1^\phi - \mu_0^\phi )^T \tag{6.62}$$

$$S_w^\phi =\sum_{i=0}^{1}\sum_{x\in X_i}(\phi (x)- \mu_i^\phi )(\phi (x)- \mu_i^\phi )^T \tag{6.63}$$

By (6.58), $h(x)$ can be written as
$$h(x) = \sum_{i=1}^m\alpha_i\kappa(x,x_i) = \sum_{i=1}^m\alpha_i\phi (x)^T\phi (x_i) =w^T\phi(x) \tag{6.64}$$

$$w = \sum_{i=1}^m\alpha_i\phi (x_i) \tag{6.65}$$
Derivation of (6.65)
$$h(x) = \sum_{i=1}^m\alpha_i\phi (x)^T\phi (x_i) = \Big(\sum_{i=1}^m\alpha_i\phi (x_i)\Big)^T\phi(x) =w^T\phi(x)$$
Comparing the last two expressions gives $w = \sum_{i=1}^m\alpha_i\phi (x_i)$.

Let $K \in \mathbb{R}^{m\times m}$ be the kernel matrix with $(K)_{ij} = \kappa(x_i,x_j)$, and let $l_i \in \{1,0\}^{m\times 1}$ be the indicator vector of class $i$, whose $j$-th component is 1 if and only if $x_j \in X_i$. Then
$$\hat{\mu}_0 =\frac{1}{m_0}Kl_0 \tag{6.66}$$

$$\hat{\mu}_1 =\frac{1}{m_1}Kl_1 \tag{6.67}$$

$$M=(\hat{\mu}_0-\hat{\mu}_1)(\hat{\mu}_0-\hat{\mu}_1)^T \tag{6.68}$$

$$N = KK^T-\sum_{i=0}^1m_i\hat{\mu}_i\hat{\mu}_i^T \tag{6.69}$$

$M$ and $N$ play the roles of $S_b^\phi$ and $S_w^\phi$: with (6.65), $w^TS_b^\phi w = \alpha^TM\alpha$ and $w^TS_w^\phi w = \alpha^TN\alpha$, so (6.60) is equivalent to
$$\max_{\alpha}J(\alpha)= \frac{\alpha^TM\alpha}{\alpha^TN\alpha} \tag{6.70}$$
