🫐Lecture 18-19

● Sequence-to-Sequence

// But if the sequence is too long, a single context vector c cannot represent it all

● Attention, Self-attention

E: how relevant each h_i (encoder state) is to s_0 (decoder state) → normalize the scores into a probability distribution

A: one scalar weight per encoder state (4 in the example), non-negative, summing to 1

AH: scalar × vector (weighted sum of the encoder states)
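The E → A → AH steps above can be sketched in NumPy. A minimal illustration assuming dot-product similarity, 4 encoder states, and made-up dimensions (the random values stand in for real hidden states):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))   # 4 encoder hidden states h_1..h_4, each of dim 8
s0 = rng.normal(size=(8,))    # initial decoder state s_0

# E: alignment scores e_i = similarity(s_0, h_i), dot product here
e = H @ s0                    # shape (4,)

# A: softmax turns scores into non-negative weights that sum to 1
a = np.exp(e - e.max())
a = a / a.sum()

# AH: context vector = sum of (scalar weight * encoder state)
c = a @ H                     # shape (8,)
```

At every decoder step this is recomputed with the new decoder state, so each output step gets its own context vector.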

Brighter regions in the attention visualization: higher attention weight


e_{i,j}: relevance of the i-th query to the j-th input


X is used both to compare similarity and to produce the output; a different weight matrix W can be used for each role

Attention: a way to create a context vector appropriate for decoding the output at every time step

// Self-attention layer

Transform x into Query, Key, and Value


// Project x with learned matrices W_Q, W_K, W_V
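A minimal self-attention sketch of the Query/Key/Value projections above, with random matrices standing in for the learned W_Q, W_K, W_V and illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5, 8                   # sequence length and model dim (illustrative)
X = rng.normal(size=(T, d))   # input sequence

# Three separate projections of the same x (random here, learned in practice)
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Scaled dot-product self-attention: every position attends to every position
scores = Q @ K.T / np.sqrt(d)                               # (T, T)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)     # rows sum to 1
out = weights @ V                                           # (T, d)
```

Note the output is permutation-equivariant in the inputs, which is why positional encoding (below) is needed for sequences.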

// A 1×1 convolution can be used to change the channel dimension
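A quick sketch of why a 1×1 convolution changes dimension: it is just a per-position linear map over channels, so no positions are mixed (sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
C_in, C_out, T = 8, 16, 5          # channels in/out, sequence length
x = rng.normal(size=(C_in, T))     # feature map: channels x positions
W = rng.normal(size=(C_out, C_in)) # a 1x1 conv kernel is a plain matrix

# Applied independently at every position: channel dim 8 -> 16
y = W @ x                          # shape (C_out, T)
```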


! To apply attention to sequences

→ use the Transformer architecture

● Transformer

What attention means

What we want to do with attention

Positional encoding
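A sketch of the standard sinusoidal positional encoding, which adds position information that self-attention alone lacks (the formula uses sin at even indices and cos at odd indices; sizes are illustrative):

```python
import numpy as np

def positional_encoding(T, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(T)[:, None]          # (T, 1) positions
    i = np.arange(0, d, 2)[None, :]      # (1, d/2) even feature indices
    angles = pos / (10000 ** (i / d))
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)         # even dims: sin
    pe[:, 1::2] = np.cos(angles)         # odd dims: cos
    return pe

pe = positional_encoding(6, 8)           # added to the input embeddings
```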

// Transformer: a neural network block in which self-attention is the only operation that exchanges information between positions

// Layer normalization: can improve gradient flow
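A minimal layer-normalization sketch: normalize each vector over its feature dimension, then apply a learnable scale (gamma) and shift (beta). Shapes and the identity gamma/beta are illustrative:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Mean/variance per vector (last axis), unlike batch norm's per-feature stats
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(3)
x = rng.normal(size=(5, 8))                           # 5 token vectors of dim 8
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8)) # each row: mean 0, var 1
```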


GPT: Generative Pre-trained Transformer
