diff --git a/content/chapters/03_transformer/03_04_trafo-params.md b/content/chapters/03_transformer/03_04_trafo-params.md
new file mode 100644
index 0000000..82cf4d7
--- /dev/null
+++ b/content/chapters/03_transformer/03_04_trafo-params.md
@@ -0,0 +1,29 @@
+---
+title: "Chapter 03.04: Transformer Parameter Count"
+weight: 3004
+---
+This chapter deals with the parameter count of the transformer. The parameter count of a transformer model is the total number of learnable parameters in its architecture, distributed across the model's components.
+These components typically include:
+
+1. **Embedding Layers**: Parameters of the input and output token embeddings, which encode the tokens' semantic meanings.
+2. **Encoder Layers**: Parameters within each encoder layer, including the self-attention mechanism, the position-wise feedforward network, and layer normalization.
+3. **Decoder Layers**: Parameters within each decoder layer, including the self-attention mechanism, the cross-attention mechanism, the position-wise feedforward network, and layer normalization.
+4. **Positional Encodings**: Parameters used to encode positional information in the input sequences (only if learned positional embeddings are used; fixed sinusoidal encodings add no learnable parameters).
+
+The total parameter count of a transformer model is the sum of the parameters of all these components and varies with the specific architecture and hyperparameters chosen for the model.
+
+
+
+
+
+
+### Lecture Slides
+{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/chapter11/slides/chapter03-transformer/slides-34-trafo-params.pdf" >}}
+
+### Additional Resources
+
+- [Blog about the Transformer Parameter Count](https://towardsdatascience.com/how-to-estimate-the-number-of-parameters-in-transformer-models-ca0f57d8dff0)
+
diff --git a/content/chapters/03_transformer/03_04_trafo_xl.md b/content/chapters/03_transformer/03_05_trafo_xl.md
similarity index 88%
rename from content/chapters/03_transformer/03_04_trafo_xl.md
rename to content/chapters/03_transformer/03_05_trafo_xl.md
index 75fbc07..e00564c 100644
--- a/content/chapters/03_transformer/03_04_trafo_xl.md
+++ b/content/chapters/03_transformer/03_05_trafo_xl.md
@@ -1,6 +1,6 @@
 ---
-title: "Chapter 03.04: Long Sequences: Transformer-XL"
-weight: 3004
+title: "Chapter 03.05: Long Sequences: Transformer-XL"
+weight: 3005
 ---
 This chapter is about the Transformer-XL [1] and how it deals with the issue of long sequences. Transformer-XL is an extension of the original Transformer architecture designed to address the limitations of long-range dependency modeling in sequence-to-sequence tasks. It aims to solve the problem of capturing and retaining information over long sequences by introducing a segment-level recurrence mechanism, enabling the model to process sequences of arbitrary length without being constrained by fixed-length contexts or running into computational limitations. Additionally, Transformer-XL incorporates relative positional embeddings to better capture positional information across segments of varying lengths.
 
@@ -12,7 +12,7 @@ This chapter is about the Transformer-XL [1] and how it deals with the issue of
 -->
 
 ### Lecture Slides
-{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-34-trafo-xl.pdf" >}}
+{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-35-trafo-xl.pdf" >}}
 
 ### References
 
diff --git a/content/chapters/03_transformer/03_05_efficient.md b/content/chapters/03_transformer/03_06_efficient.md
similarity index 89%
rename from content/chapters/03_transformer/03_05_efficient.md
rename to content/chapters/03_transformer/03_06_efficient.md
index cfa5ea4..017f650 100644
--- a/content/chapters/03_transformer/03_05_efficient.md
+++ b/content/chapters/03_transformer/03_06_efficient.md
@@ -1,6 +1,6 @@
 ---
-title: "Chapter 03.05: Efficient Transformers"
-weight: 3005
+title: "Chapter 03.06: Efficient Transformers"
+weight: 3006
 ---
 Efficient Transformers are designed to mitigate the computational and memory requirements of standard transformer architectures, particularly when dealing with large-scale datasets or resource-constrained environments. They aim to address issues such as scalability and efficiency in training and inference. One approach used in efficient transformers is replacing the standard self-attention mechanism with more lightweight attention mechanisms, which reduce the computational complexity of attending to long sequences by approximating the attention mechanism with lower-rank matrices or restricting attention to local or sparse regions of the sequence. These approaches enable transformers to be more practical for real-world applications where computational resources are limited.
 
@@ -12,7 +12,7 @@ Efficient Transformers are designed to mitigate the computational and memory req
 -->
 
 ### Lecture Slides
-{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-35-efficient.pdf" >}}
+{{< pdfjs file="https://github.com/slds-lmu/lecture_dl4nlp/blob/main/slides/chapter03-transformer/slides-36-efficient.pdf" >}}
 
 ### Additional Resources
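
The new chapter 03.04 page above describes the parameter components only in words; the sketch below shows how their contributions add up for a plain encoder-decoder transformer. It is a minimal back-of-the-envelope illustration, not material from the lecture; all hyperparameter values (`d_model`, `d_ff`, layer counts, `vocab`, `max_len`) are assumptions, chosen to roughly match the "base" configuration of Vaswani et al. (2017).

```python
# Rough parameter-count estimate for an encoder-decoder transformer.
# All hyperparameter values are illustrative assumptions.

d_model = 512     # model / embedding dimension
d_ff    = 2048    # hidden size of the position-wise feedforward network
n_enc   = 6       # number of encoder layers
n_dec   = 6       # number of decoder layers
vocab   = 37000   # shared vocabulary size (assumption)
max_len = 512     # only relevant for *learned* positional embeddings

def attention_params(d):
    # Q, K, V and output projections: four d x d weight matrices plus biases.
    return 4 * (d * d + d)

def ffn_params(d, d_hidden):
    # Two affine maps: d -> d_hidden and d_hidden -> d (weights + biases).
    return d * d_hidden + d_hidden + d_hidden * d + d

def layernorm_params(d):
    # One scale and one shift vector per layer normalization.
    return 2 * d

# Post-LN layout: 2 layer norms per encoder layer, 3 per decoder layer.
enc_layer = attention_params(d_model) + ffn_params(d_model, d_ff) + 2 * layernorm_params(d_model)
dec_layer = 2 * attention_params(d_model) + ffn_params(d_model, d_ff) + 3 * layernorm_params(d_model)

embeddings = vocab * d_model      # shared input/output embedding matrix (assumption)
positional = 0                    # sinusoidal encodings: no learnable parameters
# positional = max_len * d_model  # use this instead for learned positional embeddings

total = embeddings + positional + n_enc * enc_layer + n_dec * dec_layer
print(f"approx. {total / 1e6:.1f} M parameters")  # ~63.1 M with these settings
```

With these assumptions the estimate lands near 63 M parameters, in the same ballpark as the roughly 65 M reported for the base model; the exact figure depends on weight tying, bias terms, and the vocabulary size.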
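For the renamed Transformer-XL chapter (03.05), the segment-level recurrence idea can also be illustrated in a few lines: hidden states of the previous segment are cached and prepended to the attention context of the current one. The sketch below is a deliberately stripped-down NumPy illustration under strong simplifying assumptions (single head, no projections, no gradients, no relative positional embeddings), not the actual Transformer-XL implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seg_len = 16, 8

def attend(h_current, memory):
    # Keys/values come from [cached previous segment; current segment],
    # queries only from the current segment (projections omitted for brevity).
    context = h_current if memory is None else np.concatenate([memory, h_current], axis=0)
    scores = h_current @ context.T / np.sqrt(d)           # (seg_len, seg_len + mem_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context                              # (seg_len, d)

memory = None
for segment in range(3):                                  # process a long sequence segment by segment
    h = rng.normal(size=(seg_len, d))                     # stand-in for token representations
    out = attend(h, memory)
    memory = h                                            # cache this segment for the next one
    print(segment, out.shape)
```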
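Similarly, one of the ideas named in the Efficient Transformers description (now chapter 03.06), restricting attention to local regions of the sequence, can be sketched as follows. Window size and dimensions are illustrative assumptions, and real implementations (e.g. sliding-window attention) are vectorised rather than written as a Python loop.

```python
import numpy as np

# Local (windowed) attention: each position attends only to a neighbourhood
# of width 2*w + 1, reducing the cost from O(n^2) to O(n * w).
rng = np.random.default_rng(0)
n, d, w = 32, 16, 4                            # sequence length, dim, one-sided window size
q = rng.normal(size=(n, d))
k = rng.normal(size=(n, d))
v = rng.normal(size=(n, d))

out = np.zeros_like(v)
for i in range(n):
    lo, hi = max(0, i - w), min(n, i + w + 1)  # local neighbourhood of position i
    scores = q[i] @ k[lo:hi].T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out[i] = weights @ v[lo:hi]

print(out.shape)  # (32, 16): same output shape as full attention, far fewer score entries
```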