LG Aimers: AI Expert Course, Session 4
Module 6. 『Deep Learning』
ㅇ Instructor: Professor Jaegul Choo, KAIST
ㅇ Learning objectives
You will learn the basic concepts of deep learning, a branch of neural networks, and the training principles of its representative models.
You will learn deep learning models and training principles for image and language-model training.
Part 5. Transformer
-How Transformer Model Works
-Review: Seq2Seq with Attention
-Transformer: High-level View
• The attention module can serve as both a sequence encoder and a decoder in seq2seq with attention.
• In other words, RNNs and CNNs are no longer necessary; attention modules are all we need.
-Transformer: Scaled Dot-product Attention
• Problem
• As 𝑑𝑘 gets large, the variance of 𝑞^𝑇*𝑘 increases.
• Some values inside the softmax get large.
• The softmax gets very peaked.
• Hence its gradient gets smaller.
• Solution
• Scale by the square root of the dimension of the query/key vectors: Attention(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾^𝑇 / √𝑑𝑘) 𝑉 (a short sketch follows).
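As a concrete reference (not code from the lecture), here is a minimal NumPy sketch of scaled dot-product attention; the shapes and random toy inputs are my own choices:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaling keeps the score variance from growing with d_k
    weights = softmax(scores, axis=-1)
    return weights @ V

# toy example: 3 query positions, 4 key/value positions, d_k = d_v = 8 (arbitrary sizes)
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```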
-Transformer: Multi-head Attention
• The input word vectors can be the queries, keys and values.
• In other words, the word vectors themselves select one another.
• Problem: only one way for words to interact with one another.
• Solution: multi-head attention maps 𝑄, 𝐾, and 𝑉 into ℎ lower-dimensional spaces via learned 𝑊 matrices.
• Afterwards, apply attention in each space, then concatenate the outputs and pass them through a linear layer (see the sketch below).
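A rough NumPy sketch of the idea (my own illustration; slicing per-head columns out of single large projection matrices is one common convention, not necessarily the lecture's exact formulation):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    """Project X into h lower-dimensional spaces, attend in each, concatenate, project back."""
    d_model = X.shape[-1]
    d_head = d_model // h
    heads = []
    for i in range(h):
        sl = slice(i * d_head, (i + 1) * d_head)        # columns belonging to head i
        Q, K, V = X @ W_q[:, sl], X @ W_k[:, sl], X @ W_v[:, sl]
        heads.append(attention(Q, K, V))                # (seq_len, d_head)
    return np.concatenate(heads, axis=-1) @ W_o         # concat, then final linear layer

# self-attention: queries, keys and values all come from the same word vectors X
rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_self_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (5, 16)
```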
-Transformer: Block-Based Model
Each block has two sub-layers
• Multi-head attention
• Two-layer feed-forward NN (with ReLU)
Each of these two sub-layers also has
• A residual connection and layer normalization:
LayerNorm(𝑥 + sublayer(𝑥)) (a block skeleton is sketched below)
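A skeleton of one encoder block in NumPy (my own sketch; the identity lambda is only a placeholder standing in for the multi-head attention sub-layer, which is not re-implemented here):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean, var = x.mean(axis=-1, keepdims=True), x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def feed_forward(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # two-layer feed-forward NN with ReLU

def encoder_block(x, self_attn, ffn, p):
    x = layer_norm(x + self_attn(x), p["g1"], p["b1"])  # sub-layer 1: LayerNorm(x + MHA(x))
    x = layer_norm(x + ffn(x), p["g2"], p["b2"])        # sub-layer 2: LayerNorm(x + FFN(x))
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64
x = rng.normal(size=(seq_len, d_model))
p = {"g1": np.ones(d_model), "b1": np.zeros(d_model),
     "g2": np.ones(d_model), "b2": np.zeros(d_model)}
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = encoder_block(x, self_attn=lambda h: h,  # placeholder for multi-head attention
                    ffn=lambda h: feed_forward(h, W1, b1, W2, b2), p=p)
print(out.shape)  # (5, 16)
```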
-Layer Normalization
Layer normalization consists of two steps:
• Normalization of each word vector to have zero mean and unit variance.
• Affine transformation of each sequence vector with learnable parameters (checked numerically in the sketch below).
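A small numerical check of those two steps (my own sketch; gamma and beta would be learned in practice, here they are just identity values):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # step 1: normalize each word vector to zero mean and unit variance
    z = (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    # step 2: learnable affine transformation
    return gamma * z + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 16))          # 4 word vectors of dimension 16
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=-1).round(6))  # ~0 for every vector
print(y.var(axis=-1).round(3))   # ~1 for every vector
```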
-Transformer: Positional Encoding
• Use sinusoidal functions of different frequencies:
𝑃𝐸(pos, 2𝑖) = sin(pos / 10000^(2𝑖/𝑑_model))
𝑃𝐸(pos, 2𝑖+1) = cos(pos / 10000^(2𝑖/𝑑_model))
• The model can easily learn to attend by relative position, since for any fixed offset 𝑘, 𝑃𝐸(pos+𝑘) can be represented as a linear function of 𝑃𝐸(pos).
• Other positional encodings can also be used (e.g., the positional encoding in ConvS2S); a sketch of the sinusoidal version follows.
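A sketch of the sinusoidal encoding above in NumPy (my own code, with arbitrary max_len and d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # a different frequency per dimension pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16); added element-wise to the input word embeddings
```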
-Transformer: Warm-up Learning Rate Scheduler
• learning rate = 𝑑_model^(−0.5) ⋅ min(#step^(−0.5), #step ⋅ warmup_steps^(−1.5))
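In code, the schedule looks roughly like this (my own sketch; d_model = 512 and warmup_steps = 4000 are the defaults from the original Transformer paper, not values stated in this summary):

```python
def warmup_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# the learning rate rises linearly during warm-up, then decays proportionally to step^-0.5
for s in (1, 1000, 4000, 10000, 100000):
    print(s, round(warmup_lr(s), 6))
```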
-Transformer: Encoder Self-attention Visualization
• Words start to pay attention to other words in sensible ways.
-Transformer: Decoder
• Two sub-layer changes in the decoder:
• Masked decoder self-attention on previously generated outputs
• Encoder-decoder attention, where queries come from the previous decoder layer and keys and values come from the output of the encoder (see the sketch below)
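A minimal sketch of that encoder-decoder attention pattern (my own NumPy illustration; single-head, with hypothetical projection matrices W_q, W_k, W_v):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(dec_hidden, enc_output, W_q, W_k, W_v):
    Q = dec_hidden @ W_q          # queries from the previous decoder layer
    K = enc_output @ W_k          # keys from the encoder output
    V = enc_output @ W_v          # values from the encoder output
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V    # (decoder_len, d_model)

rng = np.random.default_rng(0)
d_model = 16
dec = rng.normal(size=(3, d_model))   # 3 target positions generated so far
enc = rng.normal(size=(7, d_model))   # 7 source positions
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(encoder_decoder_attention(dec, enc, W_q, W_k, W_v).shape)  # (3, 16)
```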
-Transformer: Masked Self-attention
• Words that have not yet been generated cannot be accessed at inference time.
• Masking out those positions and renormalizing the softmax output prevents the model from accessing not-yet-generated words (sketched below).
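A sketch of the masking in NumPy (my own illustration): scores above the diagonal are set to -inf before the softmax, so each position attends only to itself and earlier positions, and the softmax renormalizes over those visible words.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def masked_self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)                # block not-yet-generated positions
    weights = softmax(scores)                               # masked entries become exactly 0
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                                 # 4 positions generated so far
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(masked_self_attention(X, W_q, W_k, W_v).shape)        # (4, 8)
```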
-Recent Trends
• The Transformer model and its self-attention block have become a general-purpose sequence (or set) encoder in recent NLP applications as well as in other areas.
• Training deeply stacked Transformer models in a self-supervised learning framework has significantly advanced various NLP tasks via transfer learning, e.g., BERT, GPT-2, GPT-3, XLNet, ALBERT, RoBERTa, Reformer, T5, …
• Other areas are quickly adopting the self-attention architecture and self-supervised learning settings, e.g., computer vision, recommender systems, drug discovery, and so on.
• As for natural language generation, self-attention models still require greedy decoding of words one at a time.