🫐Lecture 18-19
● Sequence-to-Sequence
// But if the sequence is too long, a single context vector c cannot represent it all
● Attention, Self-attention
E: alignment scores: how relevant each encoder hidden state h_i is to the decoder state s0 → normalize them into a probability distribution
A: attention weights: one scalar per encoder state (e.g. 4 of them), non-negative, summing to 1
AH: context vector: sum of scalar * vector (each attention weight times its hidden state)
Bright regions in attention visualizations: higher attention weight
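A minimal NumPy sketch of these steps (alignment scores E, softmax weights A, weighted sum AH); the shapes (4 encoder states, dim 8) and the dot-product scoring function are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    # subtract max for numerical stability
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical shapes: 4 encoder hidden states h_i, one decoder state s0
H = np.random.randn(4, 8)   # encoder states h_1..h_4, each dim 8
s0 = np.random.randn(8)     # initial decoder state

# E: alignment scores e_i = f_att(s0, h_i); a plain dot product here for simplicity
e = H @ s0                  # shape (4,)

# A: softmax turns scores into non-negative weights that sum to 1
a = softmax(e)              # shape (4,)

# AH: context vector = weighted sum of encoder states (scalar * vector)
c = a @ H                   # shape (8,)
```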
e_ij: how relevant the i-th query is to the j-th input
X is used both to compare similarity and to produce the output; a different projection matrix W can be used for each role
Attention: a way to create context vectors appropriate for decoding the output at every time step
///self-attention layer
Transform x into Query, Key, and Value
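A sketch of a self-attention layer under assumed sizes (N=3 inputs of dim 4); the random Wq/Wk/Wv matrices stand in for learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: sequence of N=3 inputs, input dim D=4, query/key dim Dq=4
N, D, Dq = 3, 4, 4
X = np.random.randn(N, D)

# Separate projection matrices W for Query, Key, and Value
Wq, Wk, Wv = (np.random.randn(D, Dq) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Scaled dot-product attention: each query is compared against all keys
E = Q @ K.T / np.sqrt(Dq)     # similarity scores e_ij, shape (N, N)
A = softmax(E, axis=1)        # each row is a probability distribution
Y = A @ V                     # outputs: weighted sums of values, shape (N, Dq)
```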
// project x
// a 1×1 conv can be used to change the channel dimension
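A NumPy sanity check of the 1×1-conv note: a 1×1 convolution is just a linear map over channels applied at every spatial position (the 64→32 channel counts here are made up):

```python
import numpy as np

# Hypothetical feature map: C_in=64 channels over a 7x7 spatial grid
x = np.random.randn(64, 7, 7)
W = np.random.randn(32, 64)   # a 1x1 conv kernel is just a (C_out, C_in) matrix

# einsum sums over the input-channel axis at each (h, w) location,
# which is exactly what a 1x1 convolution does
y = np.einsum('oc,chw->ohw', W, x)
print(y.shape)                # (32, 7, 7): channel dimension changed 64 -> 32
```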
! to apply attention to sequences
→ use the Transformer architecture
● Transformer
What attention means
What we want to do with attention
Positional encoding
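A sketch of the standard sinusoidal positional encoding, which is added to the input embeddings so the model can tell positions apart; the `seq_len` and `d_model` values are arbitrary:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding:
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(10, 16)  # one d_model-dim vector per position
```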
// Transformer: a neural network block in which self-attention is the only interaction between vectors
// layer normalization: can improve gradient flow
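Putting the pieces together, a minimal single-head Transformer block sketch: self-attention is the only place vectors interact, the MLP is applied to each vector independently, and layer norm plus residual connections help gradient flow. This uses the pre-norm variant, which is an assumption (the original paper uses post-norm), and all weights are random stand-ins:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # normalize each vector to zero mean / unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    # self-attention sub-block (the only interaction between vectors)
    h = layer_norm(x)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=1)
    x = x + A @ V                        # residual connection
    # MLP sub-block, applied to each vector independently
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0) @ W2   # ReLU MLP + residual
    return x

N, D = 5, 16
x = np.random.randn(N, D)
Wq, Wk, Wv = (np.random.randn(D, D) for _ in range(3))
out = transformer_block(x, Wq, Wk, Wv,
                        np.random.randn(D, 4 * D), np.random.randn(4 * D, D))
```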
//
GPT: Generative Pre-trained Transformer