update chapter 03
MikeySaw committed Apr 23, 2024
1 parent d46dc12 commit 64b61d0
Showing 7 changed files with 27 additions and 20 deletions.
2 changes: 1 addition & 1 deletion content/chapters/02_dl_basics/02_04_tokenization.md
@@ -2,7 +2,7 @@
title: "Chapter 02.04 Revisiting words: Tokenization"
weight: 2004
---
This chapter is about Tokenization, which is the process of breaking down a sequence of text into smaller, meaningful units, such as words or subwords, to facilitate natural language processing tasks. Various tokenization methods exist, including Byte Pair Encoding (BPE) or WordPiece, each with its own approach to dividing text into tokens. BPE and WordPiece are subword tokenization techniques that iteratively merge frequent character sequences to form larger units, effectively capturing both common words and rare morphological variations.
This chapter is about tokenization: the process of breaking a sequence of text down into smaller, meaningful units, such as words or subwords, to facilitate natural language processing tasks. Various tokenization methods exist, such as Byte Pair Encoding (BPE) [1] and WordPiece [2], each with its own approach to dividing text into tokens. BPE and WordPiece are subword tokenization techniques that iteratively merge frequent character sequences into larger units, effectively capturing both common words and rare morphological variations.
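
A minimal sketch of the BPE merge loop on a toy corpus (the word frequencies and the five-merge budget are made up for illustration): each iteration counts adjacent symbol pairs and merges the most frequent one into a new subword symbol.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the (word -> frequency) vocabulary."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of the given pair with a merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, with an end-of-word marker.
vocab = {("l", "o", "w", "</w>"): 5,
         ("l", "o", "w", "e", "r", "</w>"): 2,
         ("n", "e", "w", "e", "s", "t", "</w>"): 6}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(vocab, best)    # merge it into a new subword symbol
    print(f"merge {step + 1}: {best}")
```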

<!--more-->

10 changes: 8 additions & 2 deletions content/chapters/03_transformer/03_01_intro_trafo.md
@@ -1,8 +1,10 @@
---
title: "Chapter 3.1: A universal deep learning architecture"
title: "Chapter 03.01: A universal deep learning architecture"
weight: 3001
---
This chapter briefly introduces different use cases of the Transformer.
Transformers have been adapted to a wide range of domains and tasks beyond traditional sequence-to-sequence problems in NLP. This chapter mentions a few examples of models that apply the transformer architecture to other domains.
Examples include the Vision Transformer (ViT) [1], which applies the transformer architecture to image classification and achieves performance competitive with convolutional neural networks (CNNs), and CLIP [2], which connects images and text through a shared embedding space, enabling tasks such as zero-shot image classification and image-text retrieval.
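
To make the ViT idea concrete, here is a small numpy sketch of how an image can be turned into a sequence of patch tokens. The image size, patch size, and embedding width are illustrative, and the random projection matrix stands in for learned weights.

```python
import numpy as np

image = np.random.rand(224, 224, 3)   # H x W x C (illustrative size)
patch = 16                            # patch side length
d_model = 64                          # embedding width (hypothetical)

# Split the image into non-overlapping 16x16 patches and flatten each one.
h, w, c = image.shape
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)  # (196, 768)

# A linear projection maps each flattened patch to a token embedding;
# a random matrix stands in for the learned weights here.
W_embed = np.random.rand(patch * patch * c, d_model)
tokens = patches @ W_embed            # (196, 64): a "sentence" of image patches
print(tokens.shape)
```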


<!--more-->

@@ -16,4 +18,8 @@ This chapter briefly introduces different use cases of the Transformer.

### References

- [1] [Dosovitskiy et al., 2021](https://arxiv.org/abs/2010.11929)
- [2] [Radford et al., 2021](https://arxiv.org/abs/2103.00020)



8 changes: 2 additions & 6 deletions content/chapters/03_transformer/03_02_encoder.md
@@ -1,8 +1,8 @@
---
title: "Chapter 3.2: The Encoder"
title: "Chapter 03.02: The Encoder"
weight: 3002
---
This chapter further elaborates on the Transformer by focusing on the Encoder part and introducing the concepts of self- and cross attention.
The Encoder in a transformer model is responsible for processing the input sequence and generating contextualized representations of each token, capturing both local and global dependencies within the sequence. It achieves this through self-attention, which lets each token attend to all other tokens in the input sequence, so the model can capture relationships and dependencies between tokens regardless of their positions. Positional encodings added to the input embeddings provide information about token order, and position-wise feedforward networks further refine the representations.
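
A minimal numpy sketch of one encoder layer, single-headed and without layer normalization, just to show the self-attention plus position-wise feedforward structure (all dimensions and weights are toy values):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X (T x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every token attends to every token
    return softmax(scores) @ V

def encoder_layer(X, Wq, Wk, Wv, W1, W2):
    """Self-attention followed by a position-wise feedforward network, with residuals."""
    X = X + self_attention(X, Wq, Wk, Wv)     # residual connection (layer norm omitted)
    return X + np.maximum(0, X @ W1) @ W2     # ReLU feedforward applied per position

# Toy dimensions; real models use multiple heads and layer normalization.
T, d, d_ff = 5, 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))                   # token embeddings + positional encodings
params = [rng.normal(size=s) for s in [(d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]]
print(encoder_layer(X, *params).shape)        # (5, 8)
```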

<!--more-->

@@ -14,7 +14,3 @@ This chapter further elaborates on the Transformer by focusing on the Encoder pa
### Lecture Slides
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-32-encoder.pdf" >}}

### References



7 changes: 2 additions & 5 deletions content/chapters/03_transformer/03_03_decoder.md
@@ -1,8 +1,8 @@
---
title: "Chapter 3.3: The Decoder"
title: "Chapter 03.03: The Decoder"
weight: 3003
---
This chapter is about the decoder part of the transformer and masked self attention.
The Decoder in a transformer model is responsible for generating an output sequence based on the contextualized representations produced by the encoder, facilitating tasks such as sequence generation and machine translation. It uses masked self-attention to capture dependencies within the sequence generated so far, and cross-attention to attend to the encoder output, enabling the model to focus on relevant parts of the input during decoding. Additionally, the decoder includes position-wise feedforward networks to further refine the representations, and the output sequence is generated token by token.
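
A small numpy sketch of masked (causal) self-attention, the piece that distinguishes the decoder's self-attention from the encoder's: scores toward future positions are set to a large negative value before the softmax (toy dimensions, no layer norm or cross-attention):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X, Wq, Wk, Wv):
    """Causal self-attention: position t may only attend to positions <= t."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    causal_mask = np.triu(np.ones_like(scores), k=1).astype(bool)  # True above the diagonal
    scores = np.where(causal_mask, -1e9, scores)                   # block attention to future tokens
    return softmax(scores) @ V

T, d = 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(masked_self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```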

<!--more-->

@@ -14,6 +14,3 @@ This chapter is about the decoder part of the transformer and masked self attent
### Lecture Slides
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-33-decoder.pdf" >}}

### References


5 changes: 3 additions & 2 deletions content/chapters/03_transformer/03_04_trafo_xl.md
@@ -1,8 +1,8 @@
---
title: "Chapter 3.4: Long Sequences: Transformer-XL"
title: "Chapter 03.04: Long Sequences: Transformer-XL"
weight: 3004
---
This chapter is about the Transformer-XL and how it deals with the issue of long sequences.
This chapter is about the Transformer-XL [1] and how it deals with the issue of long sequences. Transformer-XL is an extension of the original Transformer architecture designed to address the limitations of long-range dependency modeling in sequence-to-sequence tasks. It introduces a segment-level recurrence mechanism that caches the hidden states of previous segments, allowing the model to capture dependencies beyond a fixed-length context window while keeping the computation per segment bounded. Additionally, Transformer-XL uses relative positional encodings so that positional information is handled consistently across segments.
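
A rough numpy sketch of the segment-level recurrence idea: keys and values are computed over the cached previous segment concatenated with the current one, while queries come only from the current segment. Relative positional encodings and the stop-gradient on the cache are omitted, and all sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def segment_attention(X, memory, Wq, Wk, Wv):
    """Attend over the cached previous segment plus the current one."""
    context = np.concatenate([memory, X], axis=0)   # (M + T, d)
    Q, K, V = X @ Wq, context @ Wk, context @ Wv    # queries only from the current segment
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (T, M + T)
    return softmax(scores) @ V                      # (T, d)

T, d = 4, 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

memory = np.zeros((0, d))                           # no cache before the first segment
for segment in range(3):
    X = rng.normal(size=(T, d))                     # hidden states of the current segment
    out = segment_attention(X, memory, Wq, Wk, Wv)
    memory = X                                      # cache this segment for the next one
    print(f"segment {segment}: output {out.shape}")
```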

<!--more-->

@@ -16,3 +16,4 @@ This chapter is about the Transformer-XL and how it deals with the issue of long

### References

- [1] [Dai et al., 2019](https://arxiv.org/abs/1901.02860)
8 changes: 5 additions & 3 deletions content/chapters/03_transformer/03_05_efficient.md
@@ -1,8 +1,8 @@
---
title: "Chapter 3.5: Efficient Transformers"
title: "Chapter 03.05: Efficient Transformers"
weight: 3005
---
This chapter discusses the efficiency problems and shortcomings of transformer-based models and briefly talks about ways to deal with these issues.
Efficient Transformers are designed to mitigate the computational and memory requirements of standard transformer architectures, particularly when dealing with large-scale datasets or resource-constrained environments. They aim to improve the scalability and efficiency of both training and inference. A common approach is to replace standard self-attention with lighter-weight variants that reduce the quadratic cost of attending to long sequences, for example by approximating the attention matrix with low-rank factorizations or by restricting attention to local or sparse regions of the sequence. These approaches make transformers more practical for real-world applications where computational resources are limited.
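
A toy numpy sketch of one such lightweight variant, sliding-window (local) attention: each token may only attend to neighbors within a fixed window. For clarity the full score matrix is still materialized here; real efficient-attention implementations avoid exactly that cost.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(X, Wq, Wk, Wv, window=2):
    """Sparse attention where each token only attends to a local window of neighbors."""
    T = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    idx = np.arange(T)
    far_away = np.abs(idx[:, None] - idx[None, :]) > window  # True outside the window
    scores = np.where(far_away, -1e9, scores)                # mask distant positions
    return softmax(scores) @ V

T, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(local_attention(X, Wq, Wk, Wv, window=2).shape)  # (8, 16)
```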

<!--more-->

@@ -14,5 +14,7 @@ This chapter discusses the efficiency problems and shortcomings of transformer-b
### Lecture Slides
{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-35-efficient.pdf" >}}

### References
### Additional Resources

- [Blogpost about Flash Attention](https://huggingface.co/docs/text-generation-inference/conceptual/flash_attention)

7 changes: 6 additions & 1 deletion content/chapters/03_transformer/_index.md
@@ -1,10 +1,15 @@
---
title: "Chapter 3: Transformer"
---
This chapter will introduce the Transformer architecture as introduced in [1]. We explore the different parts of the transformer model (Encoder and Decoder) and discuss ways to improve the architecture, such as Transformer-XL and Efficient Transformers.
The Transformer, as introduced in [1], is a deep learning architecture originally designed for sequence-to-sequence tasks in natural language processing. It replaces recurrent layers with self-attention, which allows entire sequences to be processed in parallel and overcomes the sequential-processing bottleneck of traditional RNN-based models such as LSTMs. The architecture has become the foundation for state-of-the-art models in NLP tasks such as machine translation, text summarization, and language understanding. In this chapter we first introduce the transformer, then explore its main components (Encoder and Decoder), and finally discuss ways to improve the architecture, such as Transformer-XL and Efficient Transformers.
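
The central operation throughout the chapter is scaled dot-product attention from [1], where the queries $Q$, keys $K$, and values $V$ are linear projections of the token representations and $d_k$ is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
```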

<!--more-->

### References

- [1] [Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)

### Additional Resources

- [Very good video explaining the Transformer and Attention](https://www.youtube.com/watch?v=bCz4OMemCcA&t)
- [3Blue1Brown Videoseries about the Transformer](https://www.youtube.com/watch?v=wjZofJX0v4M&t)
