
[Deep Learning] Foundations of Graphical Models (7): Variance Reduction Methods in Machine Learning Optimization (2)

4. Advanced Algorithms

This section explores several extensions of the basic variance reduction (VR) methods. These extensions are designed to handle a broader range of applications, including non-smooth and/or non-strongly-convex problems. In addition, some extensions exploit algorithmic tricks or properties of the problem structure to design algorithms that are more efficient than the basic methods.

4.1. Hybrids of SGD and VR Methods

The convergence rate $\rho$ of VR methods depends on the number of training examples $n$. This differs from classical SGD, whose convergence rate is sublinear but independent of $n$. This means that in early iterations, when $n$ is very large, VR methods may perform worse than classical SGD. For example, in Figure 1 one can see that SGD is competitive with the two VR methods during the first 10 epochs (passes over the data).

To improve the dependence of VR methods on $n$, several hybrids of SGD and VR methods have been proposed. Konečný and Richtárik [32] and Le Roux et al. [37] analyzed variants of SVRG and SAG, respectively, when they are initialized with $n$ iterations of SGD. This does not change the convergence rate, but it significantly improves the dependence on $n$ in the constant factor. However, it requires setting a step size for these initial SGD iterations, which is more delicate than setting the step size for VR methods.

More recently, methods have been explored that simultaneously guarantee a linear convergence rate depending on $n$ and a sublinear rate independent of $n$. For example, Lei and Jordan [39] showed how to achieve such a "best of both worlds" result with a "practical" variant of SVRG that approximates $\nabla f(\bar{x})$ using progressively larger minibatches.
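
To make the hybrid idea concrete, here is a minimal sketch (not the exact schemes of [32] or [37]) of SVRG warm-started by one epoch of SGD; the per-example gradient oracle `grad_i`, the step sizes `gamma_sgd` and `gamma_vr`, and the inner-loop length are placeholder choices:

```python
import numpy as np

def sgd_warmstart_svrg(grad_i, x0, n, gamma_sgd, gamma_vr, epochs, seed=0):
    """Sketch: one epoch of SGD to initialize, followed by plain SVRG.

    grad_i(x, i) returns the gradient of f_i at x.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    # Warm-start phase: one pass of classical SGD with a decaying step size.
    for k in range(n):
        i = rng.integers(n)
        x -= gamma_sgd / (k + 1) * grad_i(x, i)
    # VR phase: standard SVRG outer/inner loops.
    for _ in range(epochs):
        x_ref = x.copy()
        g_ref = np.mean([grad_i(x_ref, i) for i in range(n)], axis=0)  # full gradient
        for _ in range(n):  # inner loop of length n (a common choice)
            i = rng.integers(n)
            x -= gamma_vr * (grad_i(x, i) - grad_i(x_ref, i) + g_ref)
    return x
```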

4.2. Non-Uniform Sampling

Beyond improving the dependence on $n$, a line of work focuses on improving the dependence on the Lipschitz constants $L_i$ by sampling the random training example $i_k$ non-uniformly. In particular, these algorithms bias the sampling toward examples with larger $L_i$ values, meaning that examples whose gradients change more quickly are sampled more often. This is typically combined with a larger step size that depends on the average of the $L_i$ values rather than on the maximum $L_i$ value. With appropriate choices of the sampling probabilities and the step size, this leads to an improved iteration complexity of the form
$$O\left((\kappa_{\text{mean}} + n)\log\left(\frac{1}{\epsilon}\right)\right),$$
which depends on $\kappa_{\text{mean}} := \left(\frac{1}{n}\sum_i L_i\right)/\mu$ rather than on $\kappa_{\max} := L_{\max}/\mu$. The basic VR methods SVRG, SDCA, and SAGA have all been shown to achieve this improved rate under non-uniform sampling.
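
As an illustration, the following is a minimal sketch of the standard Lipschitz-based importance-sampling rule, with $p_i \propto L_i$ and a $1/(n p_i)$ reweighting to keep the gradient estimate unbiased; the array `L` of per-example smoothness constants is assumed to be known:

```python
import numpy as np

def lipschitz_sampling(L, seed=None):
    """Sketch: sample index i with p_i proportional to L_i, with the
    reweighting factor needed to keep the gradient estimate unbiased."""
    rng = np.random.default_rng(seed)
    L = np.asarray(L, dtype=float)
    p = L / L.sum()                 # p_i ∝ L_i: fast-varying gradients sampled more often
    i = rng.choice(len(L), p=p)
    weight = 1.0 / (len(L) * p[i])  # multiply grad f_i by this weight for unbiasedness
    return i, weight

# The step size may then be taken as O(1 / mean(L)) instead of O(1 / max(L)).
```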

Nearly all existing methods use a probability distribution over $\{1, \ldots, n\}$ that is fixed across iterations. However, this choice can be further improved by adaptively changing the probabilities during the execution of the algorithm. The first VR method of this type was AdaSDCA, developed by Csiba et al. [12], which updates the probabilities in SDCA based on so-called dual residuals.

Schmidt et al. [65] proposed an empirical approach that tries to estimate local $L_i$ values (which can be much smaller than the global values), and demonstrated significant gains in experiments. Vainsencher et al. [76] proposed a related approach that uses local $L_i$ estimates and comes with theoretical support.

4.3. Minibatching

Another strategy for improving the dependence on the $L_i$ values is to use minibatching, analogous to classical minibatch SGD, to obtain a better approximation of the gradient. Here we focus on randomly selected sets of a fixed size. That is, for $b \in \mathbb{N}$, we pick a set $B_k \subset \{1, \ldots, n\}$ with $|B_k| = b$ uniformly at random from all subsets with $b$ elements. A VR method can then be implemented by replacing each $\nabla f_i(x_k)$ with the minibatch estimate $\frac{1}{b}\sum_{i \in B_k} \nabla f_i(x_k)$.
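
Concretely, a minibatch SVRG-style estimator averages the corrections over $B_k$; below is a minimal sketch, with a placeholder per-example gradient `grad_i` and a precomputed reference gradient `g_ref`:

```python
import numpy as np

def minibatch_vr_gradient(grad_i, x, x_ref, g_ref, n, b, seed=None):
    """Sketch: minibatch SVRG-type estimator
    g = (1/b) * sum_{i in B} [grad_i(x) - grad_i(x_ref)] + g_ref,
    where B is drawn uniformly from all b-element subsets of {0,...,n-1}."""
    rng = np.random.default_rng(seed)
    B = rng.choice(n, size=b, replace=False)  # uniform over b-element subsets
    diff = np.mean([grad_i(x, i) - grad_i(x_ref, i) for i in B], axis=0)
    return diff + g_ref
```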

There are a variety of earlier minibatch methods, but the most recent methods achieve an iteration complexity of the form
$$O\left(\left(\frac{L(b)}{\mu} + \frac{n}{b}\right)\log\left(\frac{1}{\epsilon}\right)\right)$$
using a step size $\gamma = O\left(\frac{1}{L(b)}\right)$, where
$$L(b) = \frac{1}{b}\,\frac{n-b}{n-1}\,L_{\max} + \frac{n}{b}\,\frac{b-1}{n-1}\,L$$
is the minibatch smoothness constant first defined in the literature [19]. This iteration complexity interpolates between the complexity of full GD, since $L(n) = L$, and that of the basic VR methods, since $L(1) = L_{\max}$. Because $L \leq L_{\max} \leq nL$, it is possible that $L \ll L_{\max}$. Consequently, using larger minibatches can allow substantial speedups, particularly in settings where multiple gradients can be computed in parallel. However, computing $L$ is typically more challenging than computing $L_{\max}$.
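
The interpolation behaviour of $L(b)$ is easy to check numerically; the snippet below evaluates the formula above and verifies the endpoints $L(1) = L_{\max}$ and $L(n) = L$ (the numbers used are arbitrary):

```python
def L_of_b(b, n, L_max, L_bar):
    """Minibatch smoothness constant L(b) from the formula above."""
    return (n - b) / (b * (n - 1)) * L_max + n * (b - 1) / (b * (n - 1)) * L_bar

n, L_max, L_bar = 1000, 50.0, 2.0
assert abs(L_of_b(1, n, L_max, L_bar) - L_max) < 1e-12  # b = 1 recovers L_max
assert abs(L_of_b(n, n, L_max, L_bar) - L_bar) < 1e-12  # b = n recovers L
```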

4.4. Accelerated Variants

Another strategy for improving the dependence on $\kappa_{\max}$ is Nesterov or Polyak acceleration (also known as momentum). It is well known that Nesterov's accelerated GD improves the iteration complexity of the full-gradient method from $O(\kappa_{\max}\log(1/\epsilon))$ to $O(\sqrt{\kappa_{\max}}\log(1/\epsilon))$. While we might hope for the same improvement in VR methods, replacing the $\kappa_{\max}$ dependence with $\sqrt{\kappa_{\max}}$, we now know that the best complexity we can achieve is $O((\sqrt{n\kappa_{\max}} + n)\log(1/\epsilon))$, first achieved by the accelerated SDCA method [71]. Nevertheless, this complexity still guarantees better worst-case performance in ill-conditioned settings, i.e., when $\kappa_{\max} \gg n$.

A variety of VR methods incorporating an acceleration step have been proposed to achieve this improved complexity. Moreover, the "catalyst" framework of Lin et al. [41] can be used to turn any method achieving the complexity $O((\kappa_{\max} + n)\log(1/\epsilon))$ into an accelerated method with complexity $O((\sqrt{n\kappa_{\max}} + n)\log(1/\epsilon))$.

4.5. Relaxing Smoothness Assumptions

Several approaches have been proposed to relax the assumption that $f$ is $L$-smooth. Among the earliest is the stochastic dual coordinate ascent (SDCA) method, which can be applied to problem (2) even when the functions $\{f_i\}$ are non-smooth, because the dual problem is itself a smooth problem. A classic example is the support vector machine (SVM) loss, where $f_i(x) = \max\{0, 1 - b_i a_i^\top x\}$. In this setting the convergence rate becomes $O(1/\epsilon)$ rather than $O(\log(1/\epsilon))$, so SDCA has no worst-case advantage over classical SGD. Unlike classical SGD, however, the step size in SDCA can be set optimally. Indeed, before the recent wave of VR methods, dual coordinate ascent was one of the most popular approaches for solving SVM problems; for example, the widely used libSVM package [7] employs a dual coordinate ascent method.

Another approach to handling non-smooth problems while maintaining a linear convergence rate is to use proximal gradient methods. These apply when $f$ has the form $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x) + \Omega(x)$. In this framework, the average $\frac{1}{n}\sum_i f_i$ is assumed to be $L$-smooth, and the regularizer $\Omega$ is convex on its domain. However, $\Omega$ may be non-smooth and may encode constraints on $x$. Importantly, $\Omega$ must be "simple" enough that its proximal operator can be evaluated efficiently; that is, computing
$$x_{k+1} = \arg\min_{x \in \mathbb{R}^d}\left\{\frac{1}{2}\left\|x - \left(x_k - \gamma\nabla f(x_k)\right)\right\|^2 + \gamma\,\Omega(x)\right\}$$
should be relatively cheap. This method, known as the proximal gradient algorithm [11], achieves an $O(\kappa\log(1/\epsilon))$ iteration complexity even though $\Omega$ (and hence $f$) may be non-smooth or not even differentiable. A common example is $\ell_1$ regularization, where $\Omega(x) = \lambda\|x\|_1$ for a regularization parameter $\lambda > 0$.
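
For the $\ell_1$ example, the proximal operator has the closed-form soft-thresholding solution, so each step is cheap. A minimal sketch, where the gradient estimate `g` may equally come from a VR estimator:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: elementwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_step(x, g, gamma, lam):
    """One proximal gradient step for (1/n) sum_i f_i(x) + lam * ||x||_1,
    where g estimates the gradient of the smooth part at x."""
    return soft_threshold(x - gamma * g, gamma * lam)
```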

4.6. Relaxing Strong Convexity Assumptions

While we have focused on the case where $f$ is strongly convex, these assumptions can be relaxed. For example, if $f$ is convex but not strongly convex, early work showed that VR methods achieve an $O(1/k)$ convergence rate. This matches the rate achieved by GD under these assumptions, and is faster than the $O(1/\sqrt{k})$ rate of SGD in this setting.

More recent work replaces strong convexity with weaker assumptions, such as the Polyak–Łojasiewicz (PL) inequality and the KL inequality [21], [29], which cover standard problems such as unregularized least squares and under which linear convergence can still be established.
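
For reference, a differentiable function $f$ with minimum value $f^*$ is said to satisfy the PL inequality with constant $\mu > 0$ if

$$\frac{1}{2}\left\|\nabla f(x)\right\|^2 \;\geq\; \mu\left(f(x) - f^*\right) \quad \text{for all } x \in \mathbb{R}^d,$$

which holds, for instance, for the least-squares objective $f(x) = \frac{1}{2}\|Ax - b\|^2$ even when $A^\top A$ is singular, i.e., without strong convexity.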

4.7. Non-Convex Problems

Since 2014, a series of papers has progressively relaxed the convexity assumptions on the functions $f_i$ and $f$, adapting variance reduction methods to several non-convex settings and achieving state-of-the-art complexity results. Below we summarize some of these results, starting from the setting closest to strong convexity and gradually relaxing convexity. For more detailed discussions, see [15] and [87].

The first assumption to be relaxed was the convexity of the individual $f_i$. The pioneering work in this direction addressed the PCA problem, where the individual $f_i$ are non-convex [18], [72]. Garber and Hazan [18] and Shalev-Shwartz [69] then showed how to use the catalyst framework [41] to design algorithms with an iteration complexity of $O\left(\left(n + n^{3/4}\sqrt{L_{\max}/\mu}\right)\log(1/\epsilon)\right)$ for $L_i$-smooth but non-convex $f_i$, provided their average $f$ is $\mu$-strongly convex. Recently, this complexity was shown to match a theoretical lower bound for this setting [87]. Allen-Zhu [2] further relaxed the assumption that $f$ is strongly convex, allowing $f$ to be merely convex or even to have "bounded non-convexity". To solve this problem, Allen-Zhu [2] proposed an accelerated variant of SVRG that achieves state-of-the-art complexity results, which have recently been shown to be optimal [87].

More recently, convexity assumptions on $f$ have been dropped entirely. Assuming only that each $f_i$ is smooth and that $f$ is bounded below, Fang et al. [15] proposed an algorithm based on recursive updates that finds an approximate stationary point satisfying $\mathbb{E}[\|\nabla f(x)\|^2] \leq \epsilon$ in $O(n + \sqrt{n}/\epsilon)$ iterations. Concurrently, Zhou et al. [88] proposed a more elaborate SVRG variant that uses multiple reference points and achieves the same iteration complexity. Fang et al. [15] also proved a lower bound showing that this complexity is optimal under these assumptions.

An interesting source of non-convex functions arises in problems where the objective is a composition $f(g(x))$, with $g : \mathbb{R}^d \rightarrow \mathbb{R}^m$ a map. Even when $f$ and $g$ are both convex, their composition can be non-convex; for instance, with $f(y) = (y-1)^2$ and $g(x) = x^2$ both convex, the composition $f(g(x)) = (x^2 - 1)^2$ is non-convex. There are several interesting applications in which $g$ is itself an average (or even an expectation) of maps, or $f$ is an average of functions, or both. In such settings, the finite-sum structure can be exploited to develop VR methods, most of which are based on variants of SVRG. When $g$ is a finite sum, state-of-the-art complexity results have been obtained using recursive updates.

4.8. Second-Order Variants

Newton's method has inspired a class of second-order variants of variance reduction (VR) methods of the form
$$x_{k+1} = x_k - \gamma_k H_k g_k,$$
where $H_k$ is a $d \times d$ estimate of the inverse Hessian $\left(\nabla^2 f(x_k)\right)^{-1}$. The difficulty in designing such methods lies in updating $H_k$ efficiently enough that the estimate is sufficiently accurate, without incurring excessive computational cost. Striking this balance is challenging: if the estimate $H_k$ is inaccurate, it can harm the convergence of the algorithm; on the other hand, if updating $H_k$ is too expensive, the whole method can become impractical when the number of data points is large.
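
To illustrate where this cost/accuracy tension arises, here is a minimal sketch of the classical dense BFGS update of the inverse-Hessian estimate $H_k$; the stochastic variants discussed below replace the exact gradient differences with subsampled or variance-reduced estimates:

```python
import numpy as np

def bfgs_update(H, s, y, eps=1e-10):
    """Sketch: dense BFGS update of the inverse-Hessian estimate H (d x d).

    s = x_{k+1} - x_k and y = grad(x_{k+1}) - grad(x_k); in stochastic
    variants, y is formed from subsampled or variance-reduced gradients.
    """
    sy = float(s @ y)
    if sy <= eps:                      # skip the update if the curvature condition fails
        return H
    rho = 1.0 / sy
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)   # O(d^2) cost per update
```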

Despite these challenges, the potential payoff of second-order methods is significant: they can be insensitive, or even invariant, to coordinate transformations and ill-conditioning. This stands in sharp contrast to first-order methods, which often require feature engineering, preconditioning, and data preprocessing in order to converge.

Most second-order VR methods build on the BFGS quasi-Newton update. The first stochastic BFGS method to exploit subsampling was the online L-BFGS method [67], which uses subsampled gradients to approximate Hessian-vector products. The regularized BFGS method [46] also uses stochastic gradients, and modifies the BFGS update by adding a regularizer to the $H_k$ matrix. The first method to use subsampled Hessian-vector products in the BFGS update, as opposed to differences of stochastic gradients, was the SQN method [6]. Moritz et al. [48] proposed combining SQN with SVRG; the resulting algorithm performed very well in numerical tests and was the first second-order VR method shown to converge linearly, albeit with a complexity worse than the $O((\kappa_{\max} + n)\log(1/\epsilon))$ of standard VR methods. The method of Moritz et al. was later extended to a block quasi-Newton variant and analyzed with an improved complexity by Gower et al. [23]. There are also second-order quasi-Newton VR variants designed specifically for the non-convex setting [78].

To the best of our knowledge, no quasi-Newton variant has yet been shown to achieve an update cost independent of $n$ together with a provable global complexity better than the $O((\kappa_{\max} + n)\log(1/\epsilon))$ of VR methods.

There do, however, exist stochastic Newton-type methods, such as stochastic dual Newton ascent (SDNA) [59], which performs minibatch-style Newton steps in the dual with a cost independent of $n$ and converges faster than SDCA. Kovalev et al. [34] recently proposed a minibatch Newton method with a local linear convergence rate of $O((n/b)\log(1/\epsilon))$ that is independent of the condition number, where $b$ is the minibatch size. In addition, there is an active line of work on stochastic second-order methods that combine variance reduction techniques with cubically regularized Newton methods [79], [89]; these methods replace the Hessian and gradient with VR estimates and, at each iteration, minimize an approximate second-order Taylor expansion with an additional cubic regularization term. They achieve state-of-the-art sample complexities for finding second-order stationary points of smooth non-convex problems, measured in terms of accesses to the gradients and Hessians of the individual functions $f_i$. This remains a very active area of research.

5. Conclusions and Limitations

In Section 4 we explored a variety of extensions of the basic variance reduction (VR) methods. Although these extensions were presented separately, many of them can be combined. For example, one could design an algorithm that combines minibatching with acceleration and uses the proximal gradient framework to handle non-smooth problems. Many of these possible combinations have already been covered in the literature.

It is worth noting that the classical stochastic gradient descent (SGD) method applies to the more general problem of minimizing functions of the form $f(x) = \mathbb{E}_z[f(x, z)]$, which may depend on a random variable $z$. In this work we focused on the training-error setting, where $z$ can take only $n$ distinct values. In machine learning, however, we typically care more about the test error, where $z$ may come from a continuous distribution. If we have access to an unlimited source of samples, we can use them directly within SGD to optimize the test loss. Alternatively, we can view our $n$ training examples as a sample from the test distribution, so that a single pass of SGD over the training examples can be seen as making direct progress on the test error. Although VR methods cannot improve the convergence rate on the test error, some works [17] have shown that, compared with SGD, VR methods can improve the constant factors in the test-error convergence rate.

We focused mainly on applying VR methods to linear models, and mentioned a few other important machine learning problems such as graphical models and principal component analysis. VR methods have had comparatively little impact on the important application of training deep neural networks; indeed, recent work indicates that VR may be ineffective for accelerating the training of deep networks [14]. Nevertheless, VR methods have found use in many other machine learning applications, including policy evaluation in reinforcement learning [56], [77], expectation-maximization algorithms [9], simulation using Monte Carlo methods [90], saddle-point problems [55], and generative adversarial networks [8].

Appendix A

Lemmas

Here we state and prove some auxiliary lemmas.

Lemma 1:
Suppose that each $f_i(x)$, for $i = 1, \ldots, n$, is convex and $L_{\max}$-smooth, and let $x_{\text{ref}}$ denote a minimizer of $f$. Suppose that $i$ is sampled uniformly from the set $\{1, \ldots, n\}$. Then for any $x \in \mathbb{R}^d$, we have
$$\mathbb{E}_i\left[\|\nabla f_i(x) - \nabla f_i(x_{\text{ref}})\|^2\right] \leq 2L_{\max}\left(f(x) - f(x_{\text{ref}})\right).$$

Proof:
Since $f_i$ is convex, we have
$$f_i(z) \geq f_i(x_{\text{ref}}) + \langle \nabla f_i(x_{\text{ref}}), z - x_{\text{ref}} \rangle, \quad \forall z \in \mathbb{R}^d.$$
Since $f_i$ is $L_{\max}$-smooth, we have
$$f_i(z) \leq f_i(x) + \langle \nabla f_i(x), z - x \rangle + \frac{L_{\max}}{2}\|z - x\|^2, \quad \forall z, x \in \mathbb{R}^d.$$
Combining these two bounds, for any $x, z \in \mathbb{R}^d$ we have
$$f_i(x_{\text{ref}}) - f_i(x) \leq \langle \nabla f_i(x_{\text{ref}}), x_{\text{ref}} - z \rangle + \langle \nabla f_i(x), z - x \rangle + \frac{L_{\max}}{2}\|z - x\|^2.$$
To obtain the tightest possible upper bound, we minimize the right-hand side over $z$, which gives
$$z = x - \frac{1}{L_{\max}}\left(\nabla f_i(x) - \nabla f_i(x_{\text{ref}})\right).$$
Substituting this $z$ back in, we obtain
$$f_i(x_{\text{ref}}) - f_i(x) \leq \langle \nabla f_i(x_{\text{ref}}), x_{\text{ref}} - x \rangle - \frac{1}{2L_{\max}}\|\nabla f_i(x) - \nabla f_i(x_{\text{ref}})\|^2.$$
Taking expectations over $i$ and using $\mathbb{E}_i[f_i(x)] = f(x)$ together with $\mathbb{E}_i[\nabla f_i(x_{\text{ref}})] = \nabla f(x_{\text{ref}}) = 0$ (since $x_{\text{ref}}$ minimizes $f$) gives
$$f(x_{\text{ref}}) - f(x) \leq -\frac{1}{2L_{\max}}\,\mathbb{E}_i\left[\|\nabla f_i(x) - \nabla f_i(x_{\text{ref}})\|^2\right],$$
and rearranging yields the result.

Lemma 2:
Let $X \in \mathbb{R}^d$ be a random vector with finite variance. Then
$$\mathbb{E}[\|X - \mathbb{E}[X]\|^2] \leq \mathbb{E}[\|X\|^2].$$

Proof:
$$\mathbb{E}[\|X - \mathbb{E}[X]\|^2] = \mathbb{E}[\|X\|^2] - 2\|\mathbb{E}[X]\|^2 + \|\mathbb{E}[X]\|^2 = \mathbb{E}[\|X\|^2] - \|\mathbb{E}[X]\|^2 \leq \mathbb{E}[\|X\|^2].$$

Appendix B

An Example Convergence Proof: SGD²

The first step of the convergence proof is the same for all VR methods. First, we expand:
$$\|x_{k+1} - x_{\text{ref}}\|^2 = \|x_k - x_{\text{ref}} - \gamma g_k\|^2 = \|x_k - x_{\text{ref}}\|^2 - 2\gamma\langle x_k - x_{\text{ref}}, g_k \rangle + \gamma^2\|g_k\|^2.$$
Taking the expectation conditioned on $x_k$ and using the unbiasedness property (6), $\mathbb{E}_k[g_k] = \nabla f(x_k)$, we obtain
$$\mathbb{E}_k[\|x_{k+1} - x_{\text{ref}}\|^2] = \|x_k - x_{\text{ref}}\|^2 + \gamma^2\,\mathbb{E}_k[\|g_k\|^2] - 2\gamma\langle x_k - x_{\text{ref}}, \nabla f(x_k) \rangle.$$
Convexity or strong convexity can then be used to bound the inner-product term $\langle x_k - x_{\text{ref}}, \nabla f(x_k) \rangle$. In particular, since $f(x)$ is $\mu$-strongly convex, we have
$$\mathbb{E}_k[\|x_{k+1} - x_{\text{ref}}\|^2] \leq (1 - \mu\gamma)\|x_k - x_{\text{ref}}\|^2 + \gamma^2\,\mathbb{E}_k[\|g_k\|^2] - 2\gamma\left(f(x_k) - f(x_{\text{ref}})\right).$$
To complete the proof, we need to bound the second moment $\mathbb{E}_k[\|g_k\|^2]$ of $g_k$. For standard SGD, one typically just assumes that this term is bounded by some unknown constant $B > 0$. This assumption rarely holds in practice, however, and even when it does, the resulting convergence rate depends on the unknown constant $B$. In contrast, for VR methods we can explicitly control the second moment of $g_k$ because we can control its variance: since $g_k$ is unbiased,
$$\mathbb{E}_k[\|g_k - \nabla f(x_k)\|^2] = \mathbb{E}_k[\|g_k\|^2] - \|\nabla f(x_k)\|^2.$$
To illustrate, we now prove the convergence of SGD².

Theorem 1:
Consider the iterates of SGD² given by (13). If Assumptions 1 and 2 hold and $\gamma \leq \frac{1}{L_{\max}}$, then the iterates converge linearly:
$$\mathbb{E}[\|x_{k+1} - x_{\text{ref}}\|^2] \leq (1 - \gamma\mu)\,\mathbb{E}[\|x_k - x_{\text{ref}}\|^2].$$
Consequently, the iteration complexity of SGD² is given by
$$k \geq \frac{L_{\max}}{\mu}\log\left(\frac{1}{\epsilon}\right) \;\Rightarrow\; \mathbb{E}[\|x_k - x_{\text{ref}}\|^2] < \epsilon.$$

Proof:
Using Lemma 1, we have
$$\mathbb{E}_k[\|g_k\|^2] = \mathbb{E}_k[\|\nabla f_i(x_k) - \nabla f_i(x_{\text{ref}})\|^2] \leq 2L_{\max}\left(f(x_k) - f(x_{\text{ref}})\right).$$
Using this inequality in (40), we obtain
$$\mathbb{E}_k[\|x_{k+1} - x_{\text{ref}}\|^2] \leq (1 - \mu\gamma)\|x_k - x_{\text{ref}}\|^2 + 2\gamma(\gamma L_{\max} - 1)\left(f(x_k) - f(x_{\text{ref}})\right).$$
Now, choosing $\gamma \leq \frac{1}{L_{\max}}$ gives $\gamma L_{\max} - 1 \leq 0$, so the term $2\gamma(\gamma L_{\max} - 1)(f(x_k) - f(x_{\text{ref}}))$ is non-positive, because $f(x_k) - f(x_{\text{ref}}) \geq 0$. Therefore, taking total expectations in (45), we obtain
$$\mathbb{E}[\|x_{k+1} - x_{\text{ref}}\|^2] \leq (1 - \mu\gamma)\,\mathbb{E}[\|x_k - x_{\text{ref}}\|^2].$$
This proof also shows that SGD² is a VR method. Indeed, since
$$\mathbb{E}_k[\|g_k - \nabla f(x_k)\|^2] = \mathbb{E}_k[\|\nabla f_i(x_k) - \nabla f_i(x_{\text{ref}}) - \nabla f(x_k)\|^2] \leq \mathbb{E}_k[\|\nabla f_i(x_k) - \nabla f_i(x_{\text{ref}})\|^2] \leq 2L_{\max}\left(f(x_k) - f(x_{\text{ref}})\right),$$
where the first inequality uses Lemma 2 with $X = \nabla f_i(x_k) - \nabla f_i(x_{\text{ref}})$, the variance of $g_k$ vanishes as $x_k$ approaches $x_{\text{ref}}$.
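
As a sanity check of Theorem 1, the following toy experiment (our own illustration, not from the original text) runs the SGD² estimator $g_k = \nabla f_i(x_k) - \nabla f_i(x_{\text{ref}})$ on a strongly convex least-squares problem, with $x_{\text{ref}}$ set to the known minimizer and $\gamma = 1/L_{\max}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# f_i(x) = 0.5 * (a_i^T x - b_i)^2, so f is strongly convex for n > d (a.s.)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer of f, used as reference
L_max = np.max(np.sum(A**2, axis=1))           # L_i = ||a_i||^2
gamma = 1.0 / L_max

x = np.zeros(d)
for k in range(20 * n):
    i = rng.integers(n)
    g = A[i] * (A[i] @ x - b[i]) - A[i] * (A[i] @ x_ref - b[i])  # SGD² estimator
    x -= gamma * g

print(np.linalg.norm(x - x_ref))  # decays linearly in k, per Theorem 1
```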

References

  1. Z. Allen-Zhu, “Katyusha: The first direct acceleration of stochastic gradient methods,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 8194–8244, Jan. 2017.

  2. Z. Allen-Zhu, “Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter,” in Proc. Int. Conf. Mach. Learn., vol. 70, Aug. 2017, pp. 89–97.

  3. L. Armijo, “Minimization of functions having Lipschitz continuous first partial derivatives,” Pacific J. Math., vol. 16, no. 1, pp. 1–3, Jan. 1966.

  4. D. Blatt, A. O. Hero, and H. Gauchman, “A convergent incremental gradient method with a constant step size,” SIAM J. Optim., vol. 18, no. 1, pp. 29–51, Jan. 2007.

  5. C. G. Broyden, “Quasi-Newton methods and their application to function minimisation,” Math. Comput., vol. 21, no. 99, pp. 368–381, 1967.

  6. R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer, “A stochastic quasi-Newton method for large-scale optimization,” SIAM J. Optim., vol. 26, no. 2, pp. 1008–1031, Jan. 2016.

  7. C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1–27, Apr. 2011.

  8. T. Chavdarova, G. Gidel, F. Fleuret, and S. Lacoste-Julien, “Reducing noise in GAN training with variance reduced extragradient,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 393–403.

  9. J. Chen, J. Zhu, Y. W. Teh, and T. Zhang, “Stochastic expectation maximization with variance reduction,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 7967–7977.

  10. M. Collins, A. Globerson, T. Koo, X. Carreras, and P. L. Bartlett, “Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks,” J. Mach. Learn. Res., vol. 9, pp. 1775–1822, Jul. 2008.

  11. P. L. Combettes and J.-C. Pesquet, “Proximal splitting methods in signal processing,” in Fixed-Point Algorithms for Inverse Problems in Science and Engineering. New York, NY, USA: Springer, 2011, pp. 185–212.

  12. D. Csiba, Z. Qu, and P. Richtárik, “Stochastic dual coordinate ascent with adaptive probabilities,” in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 674–683.

  13. A. Defazio, F. Bach, and S. Lacoste-Julien, “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1646–1654.

  14. A. Defazio and L. Bottou, “On the ineffectiveness of variance reduced optimization for deep learning,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 1755–1765.

  15. C. Fang, C. J. Li, Z. Lin, and T. Zhang, “SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 689–699.

  16. R. Fletcher, “A new approach to variable metric algorithms,” Comput. J., vol. 13, no. 3, pp. 317–323, 1970.

  17. R. Frostig, R. Ge, S. M. Kakade, and A. Sidford, “Competing with the empirical risk minimizer in a single pass,” in Proc. Conf. Learn. Theory, 2015, pp. 728–763.

  18. D. Garber and E. Hazan, “Fast and simple PCA via convex optimization,” arXiv:1509.05647, 2015. [Online]. Available: http://arxiv.org/abs/1509.05647

  19. N. Gazagnadou, R. M. Gower, and J. Salmon, “Optimal mini-batch and step sizes for SAGA,” in Proc. 36th Int. Conf. Mach. Learn., vol. 97, 2019, pp. 2142–2150.

  20. D. Goldfarb, “A family of variable-metric methods derived by variational means,” Math. Comput., vol. 24, no. 109, pp. 23–26, 1970.

  21. P. Gong and J. Ye, “Linear convergence of variance-reduced stochastic gradient without strong convexity,” arXiv:1406.1102, 2014. [Online]. Available: http://arxiv.org/abs/1406.1102

  22. E. Gorbunov, F. Hanzely, and P. Richtárik, “A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent,” in Proc. Mach. Learn. Res., vol. 108, S. Chiappa and R. Calandra, Eds. PMLR, Aug. 2020, pp. 680–690.

  23. R. M. Gower, D. Goldfarb, and P. Richtárik, “Stochastic block BFGS: Squeezing more curvature out of data,” in Proc. 33rd Int. Conf. Mach. Learn., vol. 48, 2016, pp. 1869–1878.

  24. R. M. Gower, N. Loizou, X. Qian, A. Sailanbayev, E. Shulgin, and P. Richtárik, “SGD: General analysis and improved rates,” in Proc. 36th Int. Conf. Mach. Learn., vol. 97, 2019, pp. 5200–5209.

  25. R. M. Gower, P. Richtárik, and F. Bach, “Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching,” Math. Program., May 2020. [Online]. Available: https://link.springer.com/article/10.1007%2Fs10107-020-01506-0

  26. R. Harikandeh, M. O. Ahmed, A. Virani, M. Schmidt, J. Konečný, and S. Sallinen, “Stop wasting my gradients: Practical SVRG,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2251–2259.

  27. T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams, “Variance reduced stochastic gradient descent with neighbors,” in Proc. Neural Inf. Process. Syst., 2015, pp. 2305–2313.

  28. R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 315–323.

  29. H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-Lojasiewicz condition,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discovery Databases. Springer, 2016, pp. 795–811.

  30. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. 3rd Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–13.

  31. J. Konečný, J. Liu, P. Richtárik, and M. Takáč, “Mini-batch semi-stochastic gradient descent in the proximal setting,” IEEE J. Sel. Topics Signal Process., vol. 10, no. 2, pp. 242–255, Mar. 2016.

  32. J. Konečný and P. Richtárik, “Semi-stochastic gradient descent methods,” CoRR, vol. abs/1312.1666, pp. 1–9, May 2013.

  33. D. Kovalev, S. Horváth, and P. Richtárik, “Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop,” in Proc. 31st Int. Conf. Algorithmic Learn. Theory, 2020, pp. 451–467.

  34. D. Kovalev, K. Mishchenko, and P. Richtárik, “Stochastic Newton and cubic Newton methods with simple local linear-quadratic rates,” in Proc. NeurIPS Beyond 1st Order Methods Workshop, 2019, pp. 1–16.

  35. A. Kulunchakov and J. Mairal, “Estimate sequences for variance-reduced stochastic composite optimization,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 3541–3550.

  36. G. Lan and Y. Zhou, “An optimal randomized incremental gradient method,” Math. Program., vol. 171, nos. 1–2, pp. 167–215, Sep. 2018.

  37. N. Le Roux, M. Schmidt, and F. Bach, “A stochastic gradient method with an exponential convergence rate for finite training sets,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 2663–2671.

  38. R. Leblond, F. Pedregosa, and S. Lacoste-Julien, “ASAGA: Asynchronous parallel SAGA,” in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2017, pp. 46–54.

  39. L. Lei and M. Jordan, “Less than a single pass: Stochastically controlled stochastic gradient,” in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2017, pp. 148–156.

  40. X. Lian, M. Wang, and J. Liu, “Finite-sum composition optimization via variance reduced gradient descent,” in Proc. Int. Conf. Artif. Intell. Statist., 2017, pp. 1159–1167.

  41. H. Lin, J. Mairal, and Z. Harchaoui, “Catalyst acceleration for first-order convex optimization: From theory to practice,” J. Mach. Learn. Res., vol. 18, no. 212, pp. 1–54, 2018.

  42. S. Lohr, Sampling: Design and Analysis. Duxbury Press, 1999.

  43. M. Mahdavi and R. Jin, “MixedGrad: An O(1/T) convergence rate algorithm for stochastic smooth optimization,” arXiv:1307.7192, 2013. [Online]. Available: http://arxiv.org/abs/1307.7192

  44. J. Mairal, “Incremental majorization-minimization optimization with application to large-scale machine learning,” SIAM J. Optim., vol. 25, no. 2, pp. 829–855, Jan. 2015.

  45. Y. Malitsky and K. Mishchenko, “Adaptive gradient descent without descent,” arXiv:1910.09529, 2019. [Online]. Available: http://arxiv.org/abs/1910.09529

  46. A. Mokhtari and A. Ribeiro, “RES: Regularized stochastic BFGS algorithm,” IEEE Trans. Signal Process., vol. 62, no. 23, pp. 1109–1112, Dec. 2014.

  47. A. Mokhtari and A. Ribeiro, “Global convergence of online limited memory BFGS,” J. Mach. Learn. Res., vol. 16, no. 1, pp. 3151–3181, 2015.

  48. P. Moritz, R. Nishihara, and M. I. Jordan, “A linearly-convergent stochastic L-BFGS algorithm,” in Proc. Int. Conf. Artif. Intell. Statist., 2016, pp. 249–258.

  49. D. Needell, N. Srebro, and R. Ward, “Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm,” Math. Program., vol. 155, nos. 1–2, pp. 549–573, Jan. 2016.

  50. Y. Nesterov, “A method for solving a convex programming problem with convergence rate $O(1/k^2)$,” Sov. Math. Doklady, vol. 27, no. 2, pp. 372–376, 1983.

  51. Y. Nesterov, “Efficiency of coordinate descent methods on huge-scale optimization problems,” SIAM J. Optim., vol. 22, no. 2, pp. 341–362, Jan. 2012.

  52. Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, 2nd ed. New York, NY, USA: Springer, 2014.

  53. Y. Nesterov and B. T. Polyak, “Cubic regularization of Newton method and its global performance,” Math. Program., vol. 108, pp. 177–205, Apr. 2006.

  54. L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč, “SARAH: A novel method for machine learning problems using stochastic recursive gradient,” in Proc. 34th Int. Conf. Mach. Learn., vol. 70, 2017, pp. 2613–2621.

  55. B. Palaniappan and F. Bach, “Stochastic variance reduction methods for saddle-point problems,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 1416–1424.

  56. M. Papini, D. Binaghi, G. Canonaco, M. Pirotta, and M. Restelli, “Stochastic variance-reduced policy gradient,” in Proc. 35th Int. Conf. Mach. Learn., vol. 80, 2018, pp. 4026–4035.

  57. B. Polyak, “Gradient methods for the minimisation of functionals,” in Proc. USSR Comput. Math. Math. Phys., vol. 3, 1963, pp. 864–878.

  58. X. Qian, Z. Qu, and P. Richtárik, “SAGA with arbitrary sampling,” in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 5190–5199.

  59. Z. Qu, P. Richtárik, M. Takáč, and O. Fercoq, “SDNA: Stochastic dual Newton ascent for empirical risk minimization,” in Proc. 33rd Int. Conf. Mach. Learn., 2016, pp. 1823–1832.

  60. Z. Qu, P. Richtárik, and T. Zhang, “Quartz: Randomized dual coordinate ascent with arbitrary sampling,” in Proc. 28th Int. Conf. Neural Inf. Process. Syst., 2015, pp. 865–873.

  61. S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. Smola, “Stochastic variance reduction for nonconvex optimization,” in Proc. 33rd Int. Conf. Mach. Learn., vol. 48, 2016, pp. 314–323.

  62. S. J. Reddi, S. Sra, B. Poczos, and A. Smola, “Fast incremental method for smooth nonconvex optimization,” in Proc. IEEE 55th Conf. Decis. Control (CDC), Dec. 2016, pp. 1971–1977.

  63. P. Richtárik and M. Takáč, “Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function,” Math. Program., vol. 144, no. 1, pp. 1–38, Dec. 2012.

  64. H. Robbins and S. Monro, “A stochastic approximation method,” Ann. Math. Statist., vol. 22, no. 3, pp. 400–407, Sep. 1951.

  65. M. Schmidt, N. Le Roux, and F. Bach, “Minimizing finite sums with the stochastic average gradient,” Math. Program., vol. 162, nos. 1–2, pp. 83–112, 2017.

  66. M. W. Schmidt, R. Babanezhad, M. O. Ahmed, A. Defazio, A. Clifton, and A. Sarkar, “Non-uniform stochastic average gradient method for training conditional random fields,” in Proc. 18th Int. Conf. Artif. Intell. Statist. (AISTATS), 2015, pp. 819–828.

  67. N. N. Schraudolph and G. Simon, “A stochastic quasi-Newton method for online convex optimization,” in Proc. 11th Int. Conf. Artif. Intell. Statist., 2007, pp. 436–443.

  68. O. Sebbouh, N. Gazagnadou, S. Jelassi, F. Bach, and R. M. Gower, “Towards closing the gap between the theory and practice of SVRG,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 646–656.

  69. S. Shalev-Shwartz, “SDCA without duality, regularization, and individual convexity,” in Proc. 33rd Int. Conf. Mach. Learn., vol. 48, 2016, pp. 747–754.

  70. S. Shalev-Shwartz and T. Zhang, “Stochastic dual coordinate ascent methods for regularized loss,” J. Mach. Learn. Res., vol. 14, no. 1, pp. 567–599, 2013.

  71. S. Shalev-Shwartz and T. Zhang, “Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization,” in Proc. Int. Conf. Mach. Learn., 2014, pp. 64–72.

  72. O. Shamir, “A stochastic PCA and SVD algorithm with an exponential convergence rate,” in Proc. 32nd Int. Conf. Mach. Learn. (ICML), vol. 37, 2015, pp. 144–152.

  73. D. F. Shanno, “Conditioning of quasi-Newton methods for function minimization,” Math. Comput., vol. 24, no. 111, pp. 647–656, 1970.

  74. T. Strohmer and R. Vershynin, “A randomized Kaczmarz algorithm with exponential convergence,” J. Fourier Anal. Appl., vol. 15, no. 2, p. 262, 2009.

  75. M. Takáč, A. Bijral, P. Richtárik, and N. Srebro, “Mini-batch primal and dual methods for SVMs,” in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 537–552.

  76. D. Vainsencher, H. Liu, and T. Zhang, “Local smoothness in variance reduced optimization,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2179–2187.

  77. H.-T. Wai, M. Hong, Z. Yang, Z. Wang, and K. Tang, “Variance reduced policy evaluation with smooth function approximation,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5776–5787.

  78. X. Wang, S. Ma, D. Goldfarb, and W. Liu, “Stochastic quasi-Newton methods for nonconvex stochastic optimization,” SIAM J. Optim., vol. 27, no. 2, pp. 927–956, 2017.

  79. Z. Wang, Y. Zhou, Y. Liang, and G. Lan, “Stochastic variance-reduced cubic regularization for nonconvex optimization,” arXiv:1802.07372, 2018. [Online]. Available: http://arxiv.org/abs/1802.07372

  80. B. E. Woodworth and N. Srebro, “Tight complexity bounds for optimizing composite objectives,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3639–3647.

  81. L. Xiao and T. Zhang, “A proximal stochastic gradient method with progressive variance reduction,” SIAM J. Optim., vol. 24, no. 4, pp. 2057–2075, Jan. 2014.

  82. P. Xu, F. Gao, and Q. Gu, “An improved convergence analysis of stochastic variance-reduced policy gradient,” in Proc. 35th Conf. Uncertainty Artif. Intell., 2019, pp. 541–551.

  83. J. Zhang and L. Xiao, “Stochastic variance-reduced prox-linear algorithms for nonconvex composite optimization,” Microsoft, Albuquerque, NM, USA, Tech. Rep. MSR-TR-2020-11, 2020.

  84. L. Zhang, M. Mahdavi, and R. Jin, “Linear convergence with condition number independent access of full gradients,” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 980–988.

  85. Y. Zhang and L. Xiao, “Stochastic primal-dual coordinate method for regularized empirical risk minimization,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 2939–2980, Jan. 2017.

  86. L. W. Zhong and J. T. Kwok, “Fast stochastic alternating direction method of multipliers,” in Proc. 31st Int. Conf. Mach. Learn., vol. 32, 2014, pp. 46–54.

  87. D. Zhou and Q. Gu, “Lower bounds for smooth nonconvex finite-sum optimization,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 7574–7583.

  88. D. Zhou, P. Xu, and Q. Gu, “Stochastic nested variance reduced gradient descent for nonconvex optimization,” in Proc. Adv. Neural Inf. Process. Syst., 2018, pp. 3921–3932.

  89. D. Zhou, P. Xu, and Q. Gu, “Stochastic variance-reduced cubic regularization methods,” J. Mach. Learn. Res., vol. 20, no. 134, pp. 1–47, 2019.

  90. D. Zou, P. Xu, and Q. Gu, “Stochastic gradient Hamiltonian Monte Carlo methods with recursive variance reduction,” in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 3835–3846.
