
An introduction to automatic training with optimizer.minimize in TensorFlow, and how to choose which variables get trained

 

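This post covers the details of automatic training in TensorFlow and ties it back to the underlying formulas. Corrections are welcome.

Why I wrote this: some tutorials are vague and don't convey the purpose, the characteristics, or the use cases.

Audience: beginners who know a little code but are left with a half-understanding because the few tutorials available explain things vaguely.

(More related content, such as breakdowns and manual implementations of the optimization algorithms, and the use of EMA and BatchNormalization, is linked at the bottom.)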

 

 

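Main text

TensorFlow provides several optimizers: plain GradientDescent, and variants such as Adagrad, Momentum, Nesterov, and Adam.

The typical learning step is gradient descent, and an optimizer can carry it out automatically: you specify a loss, which ties all the related variables together into a computation graph, and then optimizer(learning_rate).minimize(loss) performs gradient descent for you. minimize() is itself two operations merged into one; they are taken apart later on.

The computation graph: for a variable to be trained, it must be in the computation graph. Put more plainly, it must appear in the formula or chain of formulas that produces the loss. A variable with no direct or indirect relationship to the loss will not be trained.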
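
The "no relationship to the loss" idea can be checked numerically without TensorFlow. The sketch below is only an illustration under assumptions: the names `loss`, `numeric_grad`, and `bystander` are made up here, and a finite-difference estimate stands in for the gradients TensorFlow derives from the graph.

```python
def loss(params):
    """Squared error of prediction w * x against label 1, with x fixed at 1."""
    return (params['w'] * 1.0 - 1.0) ** 2  # 'bystander' never appears here


def numeric_grad(f, params, name, eps=1e-6):
    """Central-difference estimate of d f / d params[name]."""
    up = dict(params)
    down = dict(params)
    up[name] += eps
    down[name] -= eps
    return (f(up) - f(down)) / (2.0 * eps)


params = {'w': 3.0, 'bystander': 5.0}
print(numeric_grad(loss, params, 'w'))          # about 4.0
print(numeric_grad(loss, params, 'bystander'))  # exactly 0.0
```

A parameter that never enters the loss formula has nothing pushing it in any direction, so no gradient-based update can move it.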

 

 

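Source code

Training is really just the process of modifying the tf.Variable objects in the computation graph, and you can think of all of these variables as weights. To keep things simple, the example below introduces no placeholder and no x, so there is no distinction between x and w. Even so, the variable prediction_to_train=3 is equivalent to:

prediction_to_train(y) = w*x, with initial value w=3 and a hidden, fixed x=1 (that is, a single fixed training sample).

The loss here is the squared difference and the label is 1, so training uses the data point x=1, y=1 and, starting from w=3, trains w toward 1.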


   
   
    import tensorflow as tf  # TensorFlow 1.x graph-mode API

    # define the variable and the label
    label = tf.constant(1, dtype=tf.float32)
    prediction_to_train = tf.Variable(3, dtype=tf.float32)

    # define the loss and the training op
    manual_compute_loss = tf.square(prediction_to_train - label)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    train_step = optimizer.minimize(manual_compute_loss)

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for _ in range(100):
            # print the current state first, then take one training step
            print('variable is ', sess.run(prediction_to_train),
                  ' and the loss is ', sess.run(manual_compute_loss))
            sess.run(train_step)

Output


   
   
    variable is 3.0 and the loss is 4.0
    variable is 2.96 and the loss is 3.8416002
    variable is 2.9208 and the loss is 3.6894724
    variable is 2.882384 and the loss is 3.5433698
    variable is 2.8447363 and the loss is 3.403052
    variable is 2.8078415 and the loss is 3.268291
    ...
    variable is 2.0062745 and the loss is 1.0125883
    variable is 1.986149 and the loss is 0.9724898
    variable is 1.966426 and the loss is 0.9339792
    ...
    variable is 1.0000029 and the loss is 8.185452e-12
    variable is 1.0000029 and the loss is 8.185452e-12
    variable is 1.0000029 and the loss is 8.185452e-12
    variable is 1.0000029 and the loss is 8.185452e-12
    variable is 1.0000029 and the loss is 8.185452e-12
 

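How to restrict which Variables get trained:

Training modifies the tf.Variable objects in the computation graph (by default all of them; var_list lets you name specific ones). So you can define a value as a tf.constant or a plain Python variable to keep it from being trained. This is also a technique used in transfer learning.

Here is a proper training example:

y=w1*x+w2*x+w3*x

Since y=1 and x=1:

1=w1+w2+w3

And since w3=4:

-3=w1+w2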


   
   
    import tensorflow as tf  # TensorFlow 1.x graph-mode API

    # demo2
    # define the variables and the label
    label = tf.constant(1, dtype=tf.float32)
    x = tf.placeholder(dtype=tf.float32)
    w1 = tf.Variable(4, dtype=tf.float32)
    w2 = tf.Variable(4, dtype=tf.float32)
    w3 = tf.constant(4, dtype=tf.float32)  # constant, so never trained
    y_predict = w1*x + w2*x + w3*x

    # define the loss and the training op
    make_up_loss = tf.square(y_predict - label)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    train_step = optimizer.minimize(make_up_loss)

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for _ in range(100):
            w1_, w2_, w3_, loss_ = sess.run([w1, w2, w3, make_up_loss], feed_dict={x: 1})
            print('variable is w1:', w1_, ' w2:', w2_, ' w3:', w3_, ' and the loss is ', loss_)
            sess.run(train_step, {x: 1})

Because w3 is a constant, it escapes training; only w1 and w2 are trained.

The result matches the expected -3=w1+w2:


   
   
    variable is w1: -1.4999986 w2: -1.4999986 w3: 4.0 and the loss is 8.185452e-12
    variable is w1: -1.4999986 w2: -1.4999986 w3: 4.0 and the loss is 8.185452e-12
    variable is w1: -1.4999986 w2: -1.4999986 w3: 4.0 and the loss is 8.185452e-12
    variable is w1: -1.4999986 w2: -1.4999986 w3: 4.0 and the loss is 8.185452e-12
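
Why both weights land at about -1.5: w1 and w2 start equal and receive the same gradient 2(y_predict − label)·x at every step, so they stay equal while their sum heads to -3. Below is a minimal pure-Python sketch of that bookkeeping, assuming plain gradient descent. The dictionary layout and the `train` helper are invented for this illustration and are not TensorFlow APIs.

```python
def train(weights, trainable, x=1.0, label=1.0, learning_rate=0.01, steps=1000):
    """Gradient descent on loss = (sum(w_i * x) - label)**2.

    Only the names listed in `trainable` are updated; every other entry
    is treated like a tf.constant and left untouched.
    """
    weights = dict(weights)  # work on a copy
    for _ in range(steps):
        y_predict = sum(w * x for w in weights.values())
        grad = 2.0 * (y_predict - label) * x  # identical for every w_i
        for name in trainable:
            weights[name] -= learning_rate * grad
    return weights


start = {'w1': 4.0, 'w2': 4.0, 'w3': 4.0}
both = train(start, trainable=['w1', 'w2'])
print(both)  # w1 and w2 move together, w3 stays at 4.0
```

Calling `train(start, trainable=['w2'])` instead leaves w1 at 4.0 and drives w2 toward -7, which is the situation var_list creates.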

Below is an example that uses var_list so that only w2 is trained. Since w1 and w3 both stay at 4 and x=1, the correct answer is for w2 to approach -7.


   
   
    import tensorflow as tf  # TensorFlow 1.x graph-mode API

    # define the variables and the label
    label = tf.constant(1, dtype=tf.float32)
    x = tf.placeholder(dtype=tf.float32)
    w1 = tf.Variable(4, dtype=tf.float32)
    w2 = tf.Variable(4, dtype=tf.float32)
    w3 = tf.constant(4, dtype=tf.float32)
    y_predict = w1*x + w2*x + w3*x

    # define the loss and the training op
    make_up_loss = tf.square(y_predict - label)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    # var_list is documented as a list or tuple of Variables, so wrap w2 in a list
    train_step = optimizer.minimize(make_up_loss, var_list=[w2])

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for _ in range(500):
            w1_, w2_, w3_, loss_ = sess.run([w1, w2, w3, make_up_loss], feed_dict={x: 1})
            print('variable is w1:', w1_, ' w2:', w2_, ' w3:', w3_, ' and the loss is ', loss_)
            sess.run(train_step, {x: 1})

   
   
    variable is w1: 4.0 w2: -6.99948 w3: 4.0 and the loss is 2.7063857e-07
    variable is w1: 4.0 w2: -6.9994903 w3: 4.0 and the loss is 2.5983377e-07
    variable is w1: 4.0 w2: -6.9995003 w3: 4.0 and the loss is 2.4972542e-07
    variable is w1: 4.0 w2: -6.9995103 w3: 4.0 and the loss is 2.398176e-07
    variable is w1: 4.0 w2: -6.9995203 w3: 4.0 and the loss is 2.3011035e-07
    variable is w1: 4.0 w2: -6.99953 w3: 4.0 and the loss is 2.2105178e-07
    variable is w1: 4.0 w2: -6.9995394 w3: 4.0 and the loss is 2.1217511e-07

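What if w1, w2, and w3 are all tf.constant? Unsurprisingly it fails, and the error messages are actually quite friendly.

There are two cases:

If var_list is left to collect all trainable variables automatically, you get an error saying there are no variables to train: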

ValueError: No variables to optimize.
   
   

If you pass a constant explicitly through var_list, the update is not implemented:

NotImplementedError: ('Trying to update a Tensor ', <tf.Tensor 'Const_1:0' shape=() dtype=float32>)
   
   

 

 

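Another way to obtain var_list: tf.get_collection

The various get-style lookups are more practical, because it is not always convenient to reach the tensor through a Python reference.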


   
   
    import tensorflow as tf  # TensorFlow 1.x graph-mode API

    # demo2.2 another way to collect var_list
    label = tf.constant(1, dtype=tf.float32)
    x = tf.placeholder(dtype=tf.float32)
    w1 = tf.Variable(4, dtype=tf.float32)
    with tf.name_scope(name='selected_variable_to_train'):
        w2 = tf.Variable(4, dtype=tf.float32)  # the only variable in this scope
    w3 = tf.constant(4, dtype=tf.float32)
    y_predict = w1*x + w2*x + w3*x

    # define the loss and the training op (note: cubic loss here, not tf.square)
    make_up_loss = (y_predict - label)**3
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    # pick out the trainable variables whose names fall under the scope
    output_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                    scope='selected_variable_to_train')
    train_step = optimizer.minimize(make_up_loss, var_list=output_vars)

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for _ in range(3000):
            w1_, w2_, w3_, loss_ = sess.run([w1, w2, w3, make_up_loss], feed_dict={x: 1})
            print('variable is w1:', w1_, ' w2:', w2_, ' w3:', w3_, ' and the loss is ', loss_)
            sess.run(train_step, {x: 1})

   
   
    variable is w1: 4.0 w2: -6.988893 w3: 4.0 and the loss is 1.3702081e-06
    variable is w1: 4.0 w2: -6.988897 w3: 4.0 and the loss is 1.3687968e-06
    variable is w1: 4.0 w2: -6.9889007 w3: 4.0 and the loss is 1.3673865e-06
    variable is w1: 4.0 w2: -6.9889045 w3: 4.0 and the loss is 1.3659771e-06
    variable is w1: 4.0 w2: -6.9889083 w3: 4.0 and the loss is 1.3645688e-06
    variable is w1: 4.0 w2: -6.988912 w3: 4.0 and the loss is 1.3631613e-06
    variable is w1: 4.0 w2: -6.988916 w3: 4.0 and the loss is 1.3617548e-06
    variable is w1: 4.0 w2: -6.9889197 w3: 4.0 and the loss is 1.3603493e-06

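trainable=False

This is another way to keep a variable from being trained. The principle is similar to the previous method, since both involve tf.GraphKeys.TRAINABLE_VARIABLES. The previous method picks a given scope out of that collection, whereas this one decides at definition time not to insert the variable into it.

Non-trainable is still different from constant: the value can still be modified by hand, for example when maintaining a moving average. Non-trainable is more like a convention aimed specifically at the optimizer.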

 


   
   
    import tensorflow as tf  # TensorFlow 1.x graph-mode API

    # demo2.4 another way to keep a variable from being trained
    label = tf.constant(1, dtype=tf.float32)
    x = tf.placeholder(dtype=tf.float32)
    w1 = tf.Variable(4, dtype=tf.float32, trainable=False)  # kept out of TRAINABLE_VARIABLES
    w2 = tf.Variable(4, dtype=tf.float32)
    w3 = tf.constant(4, dtype=tf.float32)
    y_predict = w1*x + w2*x + w3*x

    # define the loss and the training op (cubic loss)
    make_up_loss = (y_predict - label)**3
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    output_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
    train_step = optimizer.minimize(make_up_loss, var_list=output_vars)

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for _ in range(3000):
            w1_, w2_, w3_, loss_ = sess.run([w1, w2, w3, make_up_loss], feed_dict={x: 1})
            print('variable is w1:', w1_, ' w2:', w2_, ' w3:', w3_, ' and the loss is ', loss_)
            sess.run(train_step, {x: 1})

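Collecting all trainable variables and training on them is the same as training without specifying var_list at all; it is the default, as the docstring says: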


   
   
    var_list: Optional list or tuple of `Variable` objects to update to
      minimize `loss`.  Defaults to the list of variables collected in
      the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.

   
   
    import tensorflow as tf  # TensorFlow 1.x graph-mode API

    # demo2.3 another way to keep a variable from being trained
    label = tf.constant(1, dtype=tf.float32)
    x = tf.placeholder(dtype=tf.float32)
    # w1 = tf.Variable(4, dtype=tf.float32)
    w1 = tf.Variable(4, dtype=tf.float32, trainable=False)
    with tf.name_scope(name='selected_variable_to_train'):
        w2 = tf.Variable(4, dtype=tf.float32)
    w3 = tf.constant(4, dtype=tf.float32)
    y_predict = w1*x + w2*x + w3*x

    # define the loss and the training op (cubic loss)
    make_up_loss = (y_predict - label)**3
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    # no var_list: defaults to every variable in TRAINABLE_VARIABLES, i.e. only w2
    train_step = optimizer.minimize(make_up_loss)

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for _ in range(3000):
            w1_, w2_, w3_, loss_ = sess.run([w1, w2, w3, make_up_loss], feed_dict={x: 1})
            print('variable is w1:', w1_, ' w2:', w2_, ' w3:', w3_, ' and the loss is ', loss_)
            sess.run(train_step, {x: 1})

The actual results are the same as above, so they are omitted.
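
One way to picture the trainable flag is as a registry that the optimizer reads by default. The toy model below is a sketch, not TensorFlow's implementation: the `Var` class, the `TRAINABLE` list, and `sgd_step` are all made up for illustration. It shows the two properties described in this section: a non-trainable variable is skipped by the default var_list, yet it can still be assigned by hand.

```python
TRAINABLE = []  # stand-in for the GraphKeys.TRAINABLE_VARIABLES collection


class Var:
    """Toy variable: a mutable value plus an optional registry entry."""

    def __init__(self, value, trainable=True):
        self.value = float(value)
        if trainable:  # decided once, at definition time
            TRAINABLE.append(self)

    def assign(self, value):
        """Manual update; works whether or not the variable is trainable."""
        self.value = float(value)


def sgd_step(grad_fn, learning_rate, var_list=None):
    """One descent step over var_list (default: everything in TRAINABLE)."""
    targets = TRAINABLE if var_list is None else var_list
    grads = [grad_fn(v) for v in targets]  # compute all gradients first
    for v, g in zip(targets, grads):
        v.value -= learning_rate * g


w1 = Var(4, trainable=False)
w2 = Var(4)
w3 = 4.0  # plain Python number, playing the role of tf.constant


def grad(_var):
    # d/dw of (w1 + w2 + w3 - 1)**3 with x = 1; identical for w1 and w2
    return 3.0 * (w1.value + w2.value + w3 - 1.0) ** 2


for _ in range(3000):
    sgd_step(grad, learning_rate=0.01)

w1.assign(5)  # still allowed: non-trainable is not the same as constant
print(w1.value, w2.value)
```

With the cubic loss the gradient shrinks quadratically near the target, which is why w2 is still only around -6.989 after 3000 steps, in line with the log printed for demo2.2.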

 

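Breaking down minimize()

minimize() is really just compute_gradients() and apply_gradients() combined into one call.

compute_gradients() computes the gradients, and apply_gradients() updates the parameters. With several optimizers you can set up several learning processes with different learning rates, computing gradients and updating parameters separately for different var_lists. This is useful for transfer learning, or for handling mismatched gradient updates in some deep networks. I won't go into it here.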


   
   
    import tensorflow as tf  # TensorFlow 1.x graph-mode API

    # demo3: minimize() split into compute_gradients() and apply_gradients()
    label = tf.constant(1, dtype=tf.float32)
    x = tf.placeholder(dtype=tf.float32)
    w1 = tf.Variable(4, dtype=tf.float32, trainable=False)
    w2 = tf.Variable(4, dtype=tf.float32)
    w3 = tf.Variable(4, dtype=tf.float32)
    y_predict = w1*x + w2*x + w3*x

    # define the loss and the training op (cubic loss)
    make_up_loss = (y_predict - label)**3
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    # step 1: a list of (gradient, variable) pairs, here for w2 only
    w2_gradient = optimizer.compute_gradients(loss=make_up_loss, var_list=[w2])
    # step 2: apply those gradients to their variables
    train_step = optimizer.apply_gradients(grads_and_vars=w2_gradient)

    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for _ in range(300):
            w1_, w2_, w3_, loss_, w2_gradient_ = sess.run(
                [w1, w2, w3, make_up_loss, w2_gradient], feed_dict={x: 1})
            print('variable is w1:', w1_, ' w2:', w2_, ' w3:', w3_, ' and the loss is ', loss_)
            print('gradient:', w2_gradient_)
            sess.run(train_step, {x: 1})
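
The two-stage structure, and the idea of giving different variable groups different learning rates, can also be sketched in plain Python. This is an illustration under assumptions, not TensorFlow's code: `compute_gradients` and `apply_gradients` here are hand-written stand-ins that only borrow the names, the gradient is derived by hand for the squared loss (w1·x + w2·x − label)², and the two learning rates are arbitrary.

```python
def compute_gradients(weights, var_list, x=1.0, label=1.0):
    """Return (gradient, name) pairs for loss = (sum(w_i * x) - label)**2."""
    error = sum(w * x for w in weights.values()) - label
    return [(2.0 * error * x, name) for name in var_list]


def apply_gradients(weights, grads_and_vars, learning_rate):
    """Update weights in place: w <- w - learning_rate * gradient."""
    for grad, name in grads_and_vars:
        weights[name] -= learning_rate * grad


weights = {'w1': 4.0, 'w2': 4.0}
for _ in range(2000):
    # Both gradient sets are computed before either update is applied.
    slow = compute_gradients(weights, ['w1'])
    fast = compute_gradients(weights, ['w2'])
    apply_gradients(weights, slow, learning_rate=0.001)
    apply_gradients(weights, fast, learning_rate=0.01)

print(weights)
```

Because the rate used for w2 is ten times the rate used for w1, w2 covers ten times the distance by the time the loss reaches zero. That is the kind of per-group control that separate optimizers give you.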

 

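Learning rate, steps, the formulas, and a manual gradient descent implementation:

In prediction, x is the variable that y depends on. In training, however, w is the variable of the loss, and x cannot change. So now you know why weights are called Variable (a somewhat forced explanation, admittedly).

Below, gradient descent is implemented by hand using TensorFlow's interfaces.

To make the formulas easier to write, the code below renames the variables to the initials of loss, prediction, gradient, weight, plus y and x. η is the learning rate, and w0, w1, w2, ... denote the value of w at each iteration, not several different variables.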

loss=(y-p)^2=(y-w*x)^2=(y^2-2*y*w*x+w^2*x^2)

dl/dw = 2*w*x^2-2*y*x

Substituting into the gradient descent formula:

w1 = w0 - η*dl/dw|w=w0

w2 = w1 - η*dl/dw|w=w1

w3 = w2 - η*dl/dw|w=w2

 

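Initial: y=3, x=1, w=2, l=1, dl/dw=-2, η=1

Update: w = 2 - 1*(-2) = 4 (now dl/dw = 2)

Update: w = 4 - 1*2 = 2 (now dl/dw = -2)

Update: w = 2 - 1*(-2) = 4

So in this example, with x=1 and y=3, dl/dw happens to equal 2w-2y, that is, twice the gap between prediction and label. With learning rate=1, w bounces back and forth around the correct value and never converges. It is written this way mainly to make the calculation easy to follow. Reduce the learning rate and increase the number of iterations, and it converges.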


   
   
    import tensorflow as tf  # TensorFlow 1.x graph-mode API

    # demo4: manual gradient descent in tensorflow
    y = tf.constant(3, dtype=tf.float32)  # label
    x = tf.placeholder(dtype=tf.float32)
    w = tf.Variable(2, dtype=tf.float32)
    p = w*x                               # prediction
    l = tf.square(p - y)                  # loss
    g = tf.gradients(l, w)                # a list holding dl/dw
    learning_rate = tf.constant(1, dtype=tf.float32)
    # learning_rate = tf.constant(0.11, dtype=tf.float32)
    init = tf.global_variables_initializer()
    # update: w <- w - learning_rate * dl/dw
    update = tf.assign(w, w - learning_rate * g[0])
    with tf.Session() as sess:
        sess.run(init)
        print(sess.run([g, p, w], {x: 1}))
        for _ in range(5):
            w_, g_, l_ = sess.run([w, g, l], feed_dict={x: 1})
            print('variable is w:', w_, ' g is ', g_, ' and the loss is ', l_)
            _ = sess.run(update, feed_dict={x: 1})

Result:

learning rate=1


   
   
    [[-2.0], 2.0, 2.0]
    variable is w: 2.0 g is [-2.0] and the loss is 1.0
    variable is w: 4.0 g is [2.0] and the loss is 1.0
    variable is w: 2.0 g is [-2.0] and the loss is 1.0
    variable is w: 4.0 g is [2.0] and the loss is 1.0
    variable is w: 2.0 g is [-2.0] and the loss is 1.0

The effect is similar to the figure below.

With a smaller learning rate:


   
   
    variable is w: 2.9964619 g is [-0.007575512] and the loss is 1.4347095e-05
    variable is w: 2.996695 g is [-0.0070762634] and the loss is 1.2518376e-05
    variable is w: 2.996913 g is [-0.0066099167] and the loss is 1.0922749e-05
    variable is w: 2.9971166 g is [-0.0061740875] and the loss is 9.529839e-06
    variable is w: 2.9973066 g is [-0.0057668686] and the loss is 8.314193e-06
    variable is w: 2.9974842 g is [-0.0053868294] and the loss is 7.2544826e-06
    variable is w: 2.9976501 g is [-0.0050315857] and the loss is 6.3292136e-06
    variable is w: 2.997805 g is [-0.004699707] and the loss is 5.5218115e-06
    variable is w: 2.9979498 g is [-0.004389763] and the loss is 4.8175043e-06
    variable is w: 2.998085 g is [-0.0041003227] and the loss is 4.2031616e-06
    variable is w: 2.9982114 g is [-0.003829956] and the loss is 3.6671408e-06
    variable is w: 2.9983294 g is [-0.0035772324] and the loss is 3.1991478e-06
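
To see both regimes side by side without TensorFlow, here is a small sketch that applies the hand-derived gradient dl/dw = 2·w·x² − 2·y·x directly. The helper name `run` and the smaller rate 0.1 are choices made for this illustration; the rate behind the second log is not stated in the text.

```python
def run(learning_rate, steps, w=2.0, x=1.0, y=3.0):
    """Manual gradient descent on l = (y - w*x)**2; returns every w visited."""
    path = [w]
    for _ in range(steps):
        grad = 2.0 * w * x ** 2 - 2.0 * y * x  # dl/dw from the derivation
        w = w - learning_rate * grad
        path.append(w)
    return path


print(run(learning_rate=1.0, steps=4))        # bounces between 2.0 and 4.0
print(run(learning_rate=0.1, steps=100)[-1])  # settles close to 3.0
```

With x=1 each step multiplies the distance to the target by 1 − 2η. At η=1 that factor is -1, so the distance only flips sign; at η=0.1 it is 0.8, so the distance shrinks on every step.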

 

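Further reading: automatic and manual implementations of Momentum and Adagrad. They were split out into a separate post because this one was getting too long.

 

Source code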

 

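Additional practical notes:

Real projects often use a global_step variable as the basis for dynamic learning rates, EMA, and Batch Normalization. When an operation is applied to all trainable variables, and especially when EMA selects all trainable variables, global_step can easily be affected: it should increase by exactly 1 per step, but it picks up inertia from the decay coefficient. So global_step should always be defined with trainable=False, and the targets of operations such as EMA should be chosen with care.

EMA and trainable=False are not strictly related, but in practice they usually are. EMA may by default collect all trainable variables, so setting trainable=False on global_step keeps it out of the var_list passed to EMA. This is a common case of "you don't know why it works, you just got lucky and nothing broke"!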
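
What "inertia" means here: an exponential moving average replaces a value by decay·shadow + (1 − decay)·value. Applied to a counter, the averaged copy trails the true count. The following is a minimal sketch; the decay of 0.9 and all names are chosen for illustration, and it models only the averaging formula, not how any TensorFlow class stores its values.

```python
def ema_update(shadow, value, decay):
    """One exponential-moving-average step."""
    return decay * shadow + (1.0 - decay) * value


global_step = 0
shadow_step = 0.0  # averaged copy of the counter
for _ in range(100):
    global_step += 1  # the counter itself must advance by exactly 1
    shadow_step = ema_update(shadow_step, global_step, decay=0.9)

print(global_step, round(shadow_step, 3))
```

The averaged copy ends up about decay / (1 − decay) = 9 steps behind the real counter, which is why a step counter is a poor candidate for averaging.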

By the same logic, the average_mean and average_variance of BatchNormalization should both be set to trainable=False, since they are maintained separately.

 

 

 

 

 