---
title: "Multimodal Deep Learning"
author: ""
date: "`r Sys.Date()`"
documentclass: krantz
bibliography: [book.bib, packages.bib]
biblio-style: apalike
link-citations: yes
colorlinks: yes
lot: False
lof: False
site: bookdown::bookdown_site
description: "."
graphics: yes
---
<!--- cover-image: images/cover.png -->
```{r setup, include=FALSE}
options(
  htmltools.dir.version = FALSE, formatR.indent = 2, width = 55, digits = 4
)
# Detect the pandoc output format so that HTML-only content
# (e.g., the cover image below) can be included conditionally.
output <- knitr::opts_knit$get("rmarkdown.pandoc.to")
is.html <- !is.null(output) && output == "html"
```
# Preface {-}
*Author: Matthias Aßenmacher*
```{r cover, cache=FALSE, out.width="800", fig.align="center", fig.cap="LMU seal (left), style-transferred to Van Gogh's Sunflowers painting (center), and blended with the prompt 'Van Gogh, sunflowers' via CLIP+VQGAN (right).", echo=FALSE, eval = is.html}
knitr::include_graphics('figures/titelbild.png')
```
In the last few years, there have been several breakthroughs in the methodologies used in Natural Language Processing (NLP) as well as Computer Vision (CV). Beyond these improvements on single-modality models, large-scale multi-modal approaches have become a very active area of research.
In this seminar, we reviewed these approaches and attempted to create a solid overview of the field, starting with the current state-of-the-art approaches in the two subfields of Deep Learning individually. Further, modeling frameworks are discussed where one modality is transformed into the other (Chapter \@ref(c02-01-img2text) and Chapter \@ref(c02-02-text2img)), as well as models in which one modality is utilized to enhance representation learning for the other (Chapter \@ref(c02-03-img-support-text) and Chapter \@ref(c02-04-text-support-img)). To conclude the second part, architectures with a focus on handling both modalities simultaneously are introduced (Chapter \@ref(c02-05-text-plus-img)). Finally, we also cover other modalities (Chapter \@ref(c03-01-further-modalities) and Chapter \@ref(c03-02-structured-unstructured)) as well as general-purpose multi-modal models (Chapter \@ref(c03-03-multi-purpose)), which are able to handle different tasks on different modalities within one unified architecture. One particularly interesting application (Generative Art, Chapter \@ref(c03-04-usecase)) caps off this booklet.
<!---
Cover by [\@YvonneDoinel](https://twitter.com/YvonneDoinel)
-->
## Citation {-}
If you refer to the book, please use the following citation (authors in alphabetical order):
```
@misc{seminar_22_multimodal,
  title  = {Multimodal Deep Learning},
  author = {Akkus, Cem and Chu, Luyang and Djakovic, Vladana and
            Jauch-Walser, Steffen and Koch, Philipp and
            Loss, Giacomo and Marquardt, Christopher and
            Moldovan, Marco and Sauter, Nadja and
            Schneider, Maximilian and Schulte, Rickmer and
            Urbanczyk, Karol and Goschenhofer, Jann and
            Heumann, Christian and Hvingelby, Rasmus and
            Schalk, Daniel and Aßenmacher, Matthias},
  url    = {https://slds-lmu.github.io/seminar_multimodal_dl/},
  day    = {30},
  month  = {Sep},
  year   = {2022}
}
```

This book is licensed under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).
\mainmatter
# Foreword {-}
*Author: Matthias Aßenmacher*
This book is the result of an experiment in university teaching. We were inspired by a group of other PhD students around Christoph Molnar, who conducted a [seminar on Interpretable Machine Learning](https://compstat-lmu.github.io/iml_methods_limitations/) in this format.
Instead of letting every student work on a seminar paper, more or less in isolation from the other students, we wanted to foster collaboration between the students and enable them to produce a tangible output (one that isn't written only to spend the rest of its days in (digital) drawers).
In the summer term 2022, a group of Statistics, Data Science and Computer Science students signed up for our seminar entitled "Multimodal Deep Learning" and had (before the kick-off meeting) no idea what they had signed up for: writing an entire book by the end of the semester.
We were bound by the examination rules for conducting the seminar, but otherwise we could deviate from the traditional format.
We deviated in several ways:
1. Each student project is a chapter of this booklet, linked in terms of content to the other chapters, since the topics partly overlap substantially.
1. Instead of assigning papers, we gave the students challenges: each challenge was to investigate a specific impactful recent model or method from the field of NLP, Computer Vision or Multimodal Learning.
1. We designed the work to live beyond the seminar.
1. We emphasized collaboration: students wrote the introductions to chapters in teams and reviewed each other's individual texts.
<!-- Technical setup -->
## Technical Setup {-}
The book chapters are written in the Markdown language.
The simulations, data examples and visualizations were created with R [@rlang].
To combine R code and Markdown, we used the rmarkdown package.
The book was compiled with the bookdown package.
We collaborated using Git and GitHub.
For details, head over to the [book's repository](https://github.com/slds-lmu/seminar_multimodal_dl).
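For readers who would like to build the book themselves, a minimal sketch is shown below. It assumes the repository has been cloned and R is installed; the `gitbook` output format is an assumption for illustration, not necessarily the exact configuration used for the published book.

```r
# Minimal sketch for building the book locally.
# Assumes the working directory is the cloned repository root;
# the gitbook output format is an assumption, not necessarily
# the exact configuration used for the published book.
install.packages(c("rmarkdown", "bookdown"))

# Render the whole book starting from index.Rmd.
bookdown::render_book("index.Rmd", output_format = "bookdown::gitbook")
```

Chapter-specific R packages may also need to be installed; the repository itself documents the authoritative setup.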