😉Lecture 8-9
● Fancier optimizers
○ Problems with SGD (stochastic gradient descent):
What goes wrong? // when the loss function has a high condition number, SGD zig-zags (fast along steep directions, slow along flat ones)
Training can also stall where the gradient is ~0 (local minima, saddle points)
Why these problems can be fixed, e.g. by momentum
momentum
Nesterov:
Nesterov Momentum: first step to the position the velocity points to; compute the gradient there
Benefit: if the starting velocity is wrong, computing the gradient after the look-ahead step reduces the damage of that error
// plain momentum tends to overshoot
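The two momentum variants can be sketched as plain update rules (a minimal illustration; the function names and defaults `lr=1e-2`, `rho=0.9` are my own assumptions, not from the lecture):

```python
def sgd_momentum(w, dw, v, lr=1e-2, rho=0.9):
    """SGD + momentum: keep a running velocity of past gradients."""
    v = rho * v - lr * dw   # accumulate velocity (decayed by rho)
    w = w + v               # step along the velocity, not the raw gradient
    return w, v

def nesterov_momentum(w, dw_at_lookahead, v, lr=1e-2, rho=0.9):
    """Nesterov: the caller evaluates the gradient at the look-ahead
    point w + rho*v, so a bad velocity gets corrected before stepping."""
    v = rho * v - lr * dw_at_lookahead
    w = w + v
    return w, v
```

Usage: for Nesterov, pass the gradient evaluated at `w + rho * v` rather than at `w`; that is the whole difference from plain momentum.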
AdaGrad:
Different parameters get different learning rates
Steep directions are damped harder
(because the squared gradients accumulate in the denominator)
// since the sum only keeps growing, weight updates eventually shrink toward 0
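A minimal AdaGrad sketch (illustrative only; names and `lr`/`eps` defaults are my own assumptions):

```python
import numpy as np

def adagrad_step(w, dw, grad_sq_sum, lr=1e-2, eps=1e-7):
    """AdaGrad: per-parameter learning rate, scaled by the root of the
    running sum of squared gradients. Steep directions (large gradients)
    accumulate a big denominator and get damped; flat directions keep a
    relatively larger step. The sum only grows, so steps shrink over time."""
    grad_sq_sum = grad_sq_sum + dw * dw
    w = w - lr * dw / (np.sqrt(grad_sq_sum) + eps)
    return w, grad_sq_sum
```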
RMSProp: fixes AdaGrad; also called "leaky AdaGrad"
! Adds a decay rate (0.9 or 0.99)
e.g. 0.9 * accumulated squared-gradient average + 0.1 * current squared gradient
Prevents the denominator from growing without bound and the learning rate from slowing to zero
// and it doesn't overshoot as much as SGD+momentum
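The "leaky" accumulator can be sketched as follows (my own sketch; `decay=0.9` matches the 0.9 / 0.1 split in the note above, other defaults are assumptions):

```python
import numpy as np

def rmsprop_step(w, dw, grad_sq_avg, lr=1e-3, decay=0.9, eps=1e-7):
    """RMSProp ('leaky AdaGrad'): an exponential moving average of squared
    gradients replaces AdaGrad's ever-growing sum, so the denominator stays
    bounded and the effective learning rate does not decay to zero."""
    grad_sq_avg = decay * grad_sq_avg + (1 - decay) * dw * dw
    w = w - lr * dw / (np.sqrt(grad_sq_avg) + eps)
    return w, grad_sq_avg
```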
Quick summary:
Momentum: adds a velocity term
RMSProp: adaptive learning rate
Adam
momentum + RMSProp
Beta1 / Beta2: the decay rates (for the first / second moment)
Problem: the moments are initialized at 0, so the estimates are biased toward 0 during the first steps
// Beta1**t converges to 0 as t grows,
so the correction denominator 1 - Beta1**t converges to 1 - 0 = 1 (the correction only matters early on)
Fix:
Normalize the first and second moments one more time to get the unbiased estimates
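Putting the pieces together, an Adam step with bias correction looks roughly like this (a sketch; `lr`, `eps` and the function name are my own assumptions, `beta1=0.9` / `beta2=0.999` are the usual defaults):

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam = momentum (first moment m) + RMSProp (second moment v).
    m and v start at 0, so early estimates are biased toward 0; dividing
    by (1 - beta**t) unbiases them. As t grows, beta**t -> 0 and the
    denominator -> 1, so the correction only affects the first steps."""
    m = beta1 * m + (1 - beta1) * dw          # momentum-style first moment
    v = beta2 * v + (1 - beta2) * dw * dw     # RMSProp-style second moment
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```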
AdaGrad, RMSProp, Adam: smoother loss curves
They can only take small steps, otherwise the loss might increase
Fix:
Use the Hessian (second derivative)
Benefit: it adjusts the step size
// Downside: lots of memory, slow
Maybe:
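The second-order idea above can be sketched as a single Newton step (my own NumPy illustration, not from the lecture; it shows both the benefit, no learning rate needed, and why forming/solving an N×N Hessian is too expensive for big networks):

```python
import numpy as np

def newton_step(w, grad, hessian):
    """One Newton step: w <- w - H^{-1} grad. The Hessian sets the step
    size per direction, so no learning rate is needed. Downside: an N x N
    Hessian costs O(N^2) memory and O(N^3) time to solve."""
    return w - np.linalg.solve(hessian, grad)
```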
● Learning rate schedules
Why we may want to reduce the learning rate
A: to avoid overshooting
// usable with SGD+momentum (when to decay; what value to set)
NLP can also use a linear decay schedule
/// Adam usually uses a constant learning rate, e.g. 0.0001,
because it figures out by itself how much to update
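Two common schedules from the notes, as tiny functions (my own sketch; the drop factor and interval are illustrative assumptions):

```python
def step_decay(lr0, epoch, drop=0.1, every=30):
    """Step decay: multiply the lr by `drop` every `every` epochs.
    Commonly paired with SGD+momentum; Adam usually just keeps a
    constant lr like 1e-4."""
    return lr0 * (drop ** (epoch // every))

def linear_decay(lr0, epoch, total_epochs):
    """Linear decay to 0 over the whole run (often used in NLP)."""
    return lr0 * max(0.0, 1 - epoch / total_epochs)
```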