😉Lecture 8-9
● Fancier optimizers
○ Problems with SGD (stochastic gradient descent):
What goes wrong? // when the loss function has a high condition number, SGD zig-zags (fast along steep directions, slow along flat ones)
Training can also stall where the gradient is ~0 (local minima, saddle points)
Why these problems can be fixed, e.g. by momentum
momentum
Nesterov:
Nesterov Momentum: first step to the position the velocity points to; compute the gradient there
Benefit: if the starting velocity is wrong, computing the gradient after the look-ahead step reduces the damage of that error
// plain momentum tends to overshoot
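The two momentum variants can be sketched as plain update rules (a minimal illustration; the function names and defaults `lr=1e-2`, `rho=0.9` are my own assumptions, not from the lecture):

```python
def sgd_momentum(w, dw, v, lr=1e-2, rho=0.9):
    """SGD + momentum: keep a running velocity of past gradients."""
    v = rho * v - lr * dw   # accumulate velocity (decayed by rho)
    w = w + v               # step along the velocity, not the raw gradient
    return w, v

def nesterov_momentum(w, dw_at_lookahead, v, lr=1e-2, rho=0.9):
    """Nesterov: the caller evaluates the gradient at the look-ahead
    point w + rho*v, so a bad velocity gets corrected before stepping."""
    v = rho * v - lr * dw_at_lookahead
    w = w + v
    return w, v
```

Usage: for Nesterov, pass the gradient evaluated at `w + rho * v` rather than at `w`; that is the whole difference from plain momentum.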
AdaGrad:
Different parameters get different learning rates
Steep directions are damped harder
(because the squared gradients accumulate in the denominator)
// since the sum only keeps growing, weight updates eventually shrink toward 0
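A minimal AdaGrad sketch (illustrative only; names and `lr`/`eps` defaults are my own assumptions):

```python
import numpy as np

def adagrad_step(w, dw, grad_sq_sum, lr=1e-2, eps=1e-7):
    """AdaGrad: per-parameter learning rate, scaled by the root of the
    running sum of squared gradients. Steep directions (large gradients)
    accumulate a big denominator and get damped; flat directions keep a
    relatively larger step. The sum only grows, so steps shrink over time."""
    grad_sq_sum = grad_sq_sum + dw * dw
    w = w - lr * dw / (np.sqrt(grad_sq_sum) + eps)
    return w, grad_sq_sum
```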
RMSProp: fixes AdaGrad; also called "leaky AdaGrad"
! Adds a decay rate (0.9 or 0.99)
e.g. 0.9 * accumulated squared-gradient average + 0.1 * current squared gradient
Prevents the denominator from growing without bound and the learning rate from slowing to zero
// and it doesn't overshoot as much as SGD+momentum
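The "leaky" accumulator can be sketched as follows (my own sketch; `decay=0.9` matches the 0.9 / 0.1 split in the note above, other defaults are assumptions):

```python
import numpy as np

def rmsprop_step(w, dw, grad_sq_avg, lr=1e-3, decay=0.9, eps=1e-7):
    """RMSProp ('leaky AdaGrad'): an exponential moving average of squared
    gradients replaces AdaGrad's ever-growing sum, so the denominator stays
    bounded and the effective learning rate does not decay to zero."""
    grad_sq_avg = decay * grad_sq_avg + (1 - decay) * dw * dw
    w = w - lr * dw / (np.sqrt(grad_sq_avg) + eps)
    return w, grad_sq_avg
```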
Quick summary:
Momentum: adds a velocity term
RMSProp: adaptive learning rate
Adam
momentum + RMSProp
Beta1 / Beta2: the decay rates (for the first / second moment)
Problem: the moments are initialized at 0, so the estimates are biased toward 0 during the first steps
// Beta1**t converges to 0 as t grows,
so the correction denominator 1 - Beta1**t converges to 1 - 0 = 1 (the correction only matters early on)
Fix:
Normalize the first and second moments one more time to get the unbiased estimates
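Putting the pieces together, an Adam step with bias correction looks roughly like this (a sketch; `lr`, `eps` and the function name are my own assumptions, `beta1=0.9` / `beta2=0.999` are the usual defaults):

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam = momentum (first moment m) + RMSProp (second moment v).
    m and v start at 0, so early estimates are biased toward 0; dividing
    by (1 - beta**t) unbiases them. As t grows, beta**t -> 0 and the
    denominator -> 1, so the correction only affects the first steps."""
    m = beta1 * m + (1 - beta1) * dw          # momentum-style first moment
    v = beta2 * v + (1 - beta2) * dw * dw     # RMSProp-style second moment
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```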
AdaGrad, RMSProp, Adam: smoother loss curves
They can only take small steps, otherwise the loss might increase
Fix:
Use the Hessian (second derivative)
Benefit: it adjusts the step size
// Downside: lots of memory, slow
Maybe:
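The second-order idea above can be sketched as a single Newton step (my own NumPy illustration, not from the lecture; it shows both the benefit, no learning rate needed, and why forming/solving an N×N Hessian is too expensive for big networks):

```python
import numpy as np

def newton_step(w, grad, hessian):
    """One Newton step: w <- w - H^{-1} grad. The Hessian sets the step
    size per direction, so no learning rate is needed. Downside: an N x N
    Hessian costs O(N^2) memory and O(N^3) time to solve."""
    return w - np.linalg.solve(hessian, grad)
```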
● Learning rate schedules
Why we may want to reduce the learning rate
A: to avoid overshooting
// usable with SGD+momentum (when to decay; what value to set)
NLP can also use a linear decay schedule
/// Adam usually uses a constant learning rate, e.g. 0.0001,
because it figures out by itself how much to update
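Two common schedules from the notes, as tiny functions (my own sketch; the drop factor and interval are illustrative assumptions):

```python
def step_decay(lr0, epoch, drop=0.1, every=30):
    """Step decay: multiply the lr by `drop` every `every` epochs.
    Commonly paired with SGD+momentum; Adam usually just keeps a
    constant lr like 1e-4."""
    return lr0 * (drop ** (epoch // every))

def linear_decay(lr0, epoch, total_epochs):
    """Linear decay to 0 over the whole run (often used in NLP)."""
    return lr0 * max(0.0, 1 - epoch / total_epochs)
```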