🫐Lecture 18-19

● Sequence-to-Sequence

// But if the sequence is too long, a single context vector c cannot represent it all

● Attention, Self-attention

E: how relevant each h_i (encoder state) is to s_0 (decoder state) → normalize the scores into a probability distribution

A: one scalar weight per encoder state (4 in the example), non-negative, summing to 1

AH: scalar × vector (weighted sum of the encoder states)
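The E → A → AH steps above can be sketched in NumPy. A minimal illustration assuming dot-product similarity, 4 encoder states, and made-up dimensions (the random values stand in for real hidden states):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))   # 4 encoder hidden states h_1..h_4, each of dim 8
s0 = rng.normal(size=(8,))    # initial decoder state s_0

# E: alignment scores e_i = similarity(s_0, h_i), dot product here
e = H @ s0                    # shape (4,)

# A: softmax turns scores into non-negative weights that sum to 1
a = np.exp(e - e.max())
a = a / a.sum()

# AH: context vector = sum of (scalar weight * encoder state)
c = a @ H                     # shape (8,)
```

At every decoder step this is recomputed with the new decoder state, so each output step gets its own context vector.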

Brighter regions in the attention visualization: higher attention weight


e_{i,j}: relevance of the i-th query to the j-th input


X is used both to compare similarity and to produce the output; a different weight matrix W can be used for each role

Attention: a way to create a context vector appropriate for decoding the output at every time step

// Self-attention layer

Transform x into Query, Key, and Value


// Project x with learned matrices W_Q, W_K, W_V
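A minimal self-attention sketch of the Query/Key/Value projections above, with random matrices standing in for the learned W_Q, W_K, W_V and illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5, 8                   # sequence length and model dim (illustrative)
X = rng.normal(size=(T, d))   # input sequence

# Three separate projections of the same x (random here, learned in practice)
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Scaled dot-product self-attention: every position attends to every position
scores = Q @ K.T / np.sqrt(d)                               # (T, T)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)     # rows sum to 1
out = weights @ V                                           # (T, d)
```

Note the output is permutation-equivariant in the inputs, which is why positional encoding (below) is needed for sequences.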

// A 1×1 convolution can be used to change the channel dimension
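A quick sketch of why a 1×1 convolution changes dimension: it is just a per-position linear map over channels, so no positions are mixed (sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
C_in, C_out, T = 8, 16, 5          # channels in/out, sequence length
x = rng.normal(size=(C_in, T))     # feature map: channels x positions
W = rng.normal(size=(C_out, C_in)) # a 1x1 conv kernel is a plain matrix

# Applied independently at every position: channel dim 8 -> 16
y = W @ x                          # shape (C_out, T)
```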


! To apply attention to sequences

→ use the Transformer architecture

● Transformer

What attention means

What we want to do with attention

Positional encoding
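A sketch of the standard sinusoidal positional encoding, which adds position information that self-attention alone lacks (the formula uses sin at even indices and cos at odd indices; sizes are illustrative):

```python
import numpy as np

def positional_encoding(T, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(T)[:, None]          # (T, 1) positions
    i = np.arange(0, d, 2)[None, :]      # (1, d/2) even feature indices
    angles = pos / (10000 ** (i / d))
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)         # even dims: sin
    pe[:, 1::2] = np.cos(angles)         # odd dims: cos
    return pe

pe = positional_encoding(6, 8)           # added to the input embeddings
```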

// Transformer: a neural network block in which self-attention is the only operation that exchanges information between positions

// Layer normalization: can improve gradient flow
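A minimal layer-normalization sketch: normalize each vector over its feature dimension, then apply a learnable scale (gamma) and shift (beta). Shapes and the identity gamma/beta are illustrative:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Mean/variance per vector (last axis), unlike batch norm's per-feature stats
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(3)
x = rng.normal(size=(5, 8))                           # 5 token vectors of dim 8
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8)) # each row: mean 0, var 1
```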


GPT: Generative Pre-trained Transformer
