Lecture 8-9

● Fancier optimizers

○ Problems with SGD (stochastic gradient descent)

  • What are the problems? //the loss function can have a high condition number (progress zig-zags along the steep direction and is slow along the shallow one); where the gradient is (near) zero,

training can stop
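A minimal sketch of the slow-progress problem on a toy high-condition-number loss (the quadratic and the learning rate here are hypothetical choices for illustration):

```python
import numpy as np

# Toy loss with a high condition number: L(x, y) = 0.5*(x**2 + 100*y**2),
# so the gradient is (x, 100*y). All values here are illustrative.
def grad(w):
    return np.array([w[0], 100.0 * w[1]])

w = np.array([1.0, 1.0])
lr = 5e-3  # must stay small, or the steep y direction diverges

for _ in range(100):
    w -= lr * grad(w)  # vanilla SGD update

# The steep y coordinate converges fast, but the shallow x coordinate
# has only shrunk to (1 - lr)**100, about 0.6 -- slow progress.
```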

  • Why do the fixes help? e.g. momentum

    • Momentum: keep a velocity (a running mean of gradients) and step along it
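The momentum update can be sketched as follows (toy 1-D quadratic loss; the hyperparameter values are typical but hypothetical):

```python
# SGD + momentum sketch: build up a velocity as a decaying running sum
# of gradients, then step along the velocity instead of the raw gradient.
def sgd_momentum_step(w, v, grad, lr=1e-2, rho=0.9):
    v = rho * v + grad  # accumulate velocity (rho acts like friction)
    w = w - lr * v      # step along the velocity
    return w, v

# Usage on a toy 1-D quadratic L(w) = 0.5*w**2, whose gradient is w:
w, v = 5.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grad=w)
```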

  • Nesterov:

Nesterov momentum: first step to where the velocity would carry you; compute the gradient there

Benefit: if the starting velocity is wrong, computing the gradient at the lookahead point reduces the harm of that error

//Plain momentum: tends to overshoot
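A sketch of the lookahead idea (one common way to write the Nesterov update; the toy loss and hyperparameters are hypothetical):

```python
# Nesterov momentum: evaluate the gradient at the lookahead point
# w + rho*v (where the velocity would carry us), not at w itself.
def nesterov_step(w, v, grad_fn, lr=1e-2, rho=0.9):
    v = rho * v - lr * grad_fn(w + rho * v)  # gradient at the lookahead
    w = w + v
    return w, v

# Toy 1-D quadratic L(w) = 0.5*w**2, gradient = w:
w, v = 5.0, 0.0
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn=lambda x: x)
```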

  • AdaGrad:

Different parameters get different effective learning rates

Steep directions are damped, flat directions are sped up

(because the squared gradients are accumulated in the denominator)

//Since the sum only ever grows, the weight updates eventually shrink to 0
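A sketch of the AdaGrad update on the same kind of badly-conditioned toy quadratic (all values hypothetical):

```python
import numpy as np

# AdaGrad: divide each coordinate's step by the root of its accumulated
# squared gradients, giving per-parameter effective learning rates.
def adagrad_step(w, sq_sum, grad, lr=0.1, eps=1e-7):
    sq_sum = sq_sum + grad ** 2  # this sum only ever grows
    w = w - lr * grad / (np.sqrt(sq_sum) + eps)
    return w, sq_sum

# Toy loss L = 0.5*(x**2 + 100*y**2), gradient (x, 100*y):
w = np.array([1.0, 1.0])
s = np.zeros(2)
for _ in range(100):
    w, s = adagrad_step(w, s, np.array([w[0], 100.0 * w[1]]))
# Because s keeps growing, the effective step size decays toward zero.
```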

  • RMSProp: fixes AdaGrad; also called "leaky AdaGrad"

Adds a decay rate (0.9 or 0.99)

e.g. 0.9 * accumulated squared-gradient sum + 0.1 * current squared gradient

This prevents the denominator from growing without bound and the learning rate from slowing to a halt

//It also does not overshoot the way SGD+momentum does
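The leaky accumulator can be sketched like this (the decay rate and toy loss are illustrative):

```python
import numpy as np

# RMSProp ("leaky AdaGrad"): decay the squared-gradient accumulator so
# the denominator cannot grow without bound.
def rmsprop_step(w, sq_avg, grad, lr=1e-2, decay=0.9, eps=1e-7):
    sq_avg = decay * sq_avg + (1 - decay) * grad ** 2  # leaky sum
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)
    return w, sq_avg

# Toy loss L = 0.5*(x**2 + 100*y**2), gradient (x, 100*y):
w = np.array([1.0, 1.0])
s = np.zeros(2)
for _ in range(500):
    w, s = rmsprop_step(w, s, np.array([w[0], 100.0 * w[1]]))
```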

Mini-summary:

Momentum: adds a velocity term

RMSProp: adaptive learning rate

  • Adam

momentum + RMSProp

beta1/beta2: the decay rates for the first and second moments

//

Problem: the moment estimates are biased toward 0 at the first steps (they are initialized at 0)

//beta1**t converges to 0 as t grows,

so the denominator 1 - beta1**t converges to 1 - 0 = 1 (the correction only matters early on)

Solution:

normalize the first and second moments one more time to get unbiased estimates
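Putting the pieces together, the Adam update with bias correction can be sketched as (the beta defaults are the usual Adam values; the toy loss and learning rate here are illustrative):

```python
# Adam sketch: first moment = momentum, second moment = RMSProp,
# plus the bias correction discussed above.
def adam_step(w, m, v, t, grad, lr=5e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (RMSProp)
    m_hat = m / (1 - beta1 ** t)  # bias correction; at large t the
    v_hat = v / (1 - beta2 ** t)  # denominators converge to 1 - 0 = 1
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v

# Toy 1-D quadratic L(w) = 0.5*w**2, gradient = w; t starts at 1:
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, m, v, t, grad=w)
```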

//

AdaGrad, RMSProp, Adam: smoother loss curves

But they can only take small steps, otherwise the loss may increase

Fix:

Use the Hessian (second derivative)

Benefit: it adjusts the step size per direction

//Downside: lots of memory, slow
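A sketch of why the Hessian helps: on a quadratic, one Newton step lands at the minimum with no learning rate at all (the toy loss is hypothetical; for N parameters the N x N Hessian is what costs memory):

```python
import numpy as np

# Newton step: precondition the gradient with the inverse Hessian, which
# sets the step size per direction automatically (no learning rate).
def newton_step(w, grad, hessian):
    return w - np.linalg.solve(hessian, grad)

# Toy quadratic L = 0.5*(x**2 + 100*y**2): constant Hessian diag(1, 100),
# gradient = H @ w. A single step jumps straight to the minimum.
H = np.diag([1.0, 100.0])
w = np.array([3.0, -2.0])
w = newton_step(w, H @ w, H)
```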

Maybe:

● Learning rate schedules

Why might we want to reduce the learning rate?

A: to avoid overshooting

//Used with SGD+momentum (when to decay; what value to set)

NLP can also use a linear decay schedule

///Adam is usually run with a constant learning rate, e.g. 0.0001,

because it figures out on its own how much to update
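The two schedules above can be sketched as plain functions (the drop interval, factor, and epoch counts are hypothetical):

```python
# Step decay (common with SGD+momentum): cut the learning rate by a
# fixed factor every few epochs.
def step_decay(base_lr, epoch, drop_every=30, factor=0.1):
    return base_lr * factor ** (epoch // drop_every)

# Linear decay (common in NLP): ramp linearly from base_lr down to 0.
def linear_decay(base_lr, epoch, total_epochs):
    return base_lr * (1.0 - epoch / total_epochs)
```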
