# Introduction
*Author: Nadja Sauter*
*Supervisor: Matthias Aßenmacher*
## Introduction to Multimodal Deep Learning
There are five basic human senses: hearing, touch, smell, taste and sight. Possessing these five modalities, we are able to perceive and understand the world around us. Thus, "multimodal" means to combine different channels of information simultaneously to understand our surroundings. For example, when toddlers learn the word “cat”, they use different modalities by saying the word out loud, pointing at cats and making sounds like “meow”. Using the human learning process as a role model, artificial intelligence (AI) researchers also try to combine different modalities to train deep learning models. On a superficial level, deep learning algorithms are based on a neural network that is trained to optimize some objective which is mathematically defined via the so-called loss function. The optimization, i.e. minimizing the loss, is done via a numerical procedure called gradient descent. Consequently, deep learning models can only handle numeric input and can only produce numeric output. However, in multimodal tasks we are often confronted with unstructured data like pictures or text. Thus, the first major problem is how to represent the input numerically. The second issue with regard to multimodal tasks is how exactly to combine different modalities. For instance, a typical task could be to train a deep learning model to generate a picture of a cat. First of all, the computer needs to understand the text input “cat” and then somehow translate this information into a specific image. Therefore, it is necessary to identify the contextual relationships between words in the text input and the spatial relationships between pixels in the image output. What might be easy for a toddler in preschool is a huge challenge for the computer. Both have to learn some understanding of the word "cat" that comprises the meaning and appearance of the animal. A common approach in modern deep learning is to generate embeddings that represent the cat numerically as a vector in some latent space. However, to achieve this, different approaches and algorithmic architectures have been developed in recent years. This book gives an overview of the different methods used in state-of-the-art (SOTA) multimodal deep learning to overcome challenges arising from unstructured data and combining inputs of different modalities.
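To make the idea of a shared latent space slightly more concrete, the following toy sketch (not taken from any model discussed later; the vocabulary, image size and all weights are random placeholders, and `numpy` is assumed to be available) maps the word "cat" and a small image into vectors of the same dimension and compares them with cosine similarity:

```python
# Toy sketch: mapping a word and an image into a shared latent space.
# All weights are random stand-ins for what a trained model would learn.
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 8

# "cat" must first become numbers: here an index into a tiny vocabulary.
vocab = {"cat": 0, "dog": 1, "car": 2}
text_embedding_matrix = rng.normal(size=(len(vocab), latent_dim))
text_vec = text_embedding_matrix[vocab["cat"]]           # text -> latent vector

# An image is already numeric (pixel values); a projection maps it into the
# same latent space so both modalities become comparable.
image = rng.random((16, 16)).flatten()                    # toy 16x16 grayscale image
image_projection = rng.normal(size=(image.size, latent_dim))
image_vec = image @ image_projection                      # image -> latent vector

# In a trained multimodal model, matching text/image pairs would end up close
# together in this space, e.g. as measured by cosine similarity.
cos_sim = np.dot(text_vec, image_vec) / (
    np.linalg.norm(text_vec) * np.linalg.norm(image_vec)
)
print(f"cosine similarity between 'cat' text and image vector: {cos_sim:.3f}")
```

With random weights the similarity is of course meaningless; the point is only that both modalities end up as vectors of the same dimension, which is the prerequisite for combining them.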
## Outline of the Booklet
Since multimodal models often use text and images as input or output, methods of Natural Language Processing (NLP) and Computer Vision (CV) are introduced as a foundation in Chapter \@ref(c01-00-intro-modalities). Methods in the area of NLP try to handle text data, whereas CV deals with image processing. With regard to NLP (subsection \@ref(c01-01-sota-nlp)), one concept of major importance is the so-called word embedding, which is nowadays an essential part of (nearly) all multimodal deep learning architectures. This concept also sets the foundation for transformer-based models like BERT [@BERT], which achieved a huge improvement in several NLP tasks. Especially the (self-)attention mechanism [@attention] of transformers revolutionized NLP models, which is why most of them rely on the transformer as a backbone. In Computer Vision (subsection \@ref(c01-02-sota-cv)), different network architectures, namely ResNet [@ResNet], EfficientNet [@EfficientNet], SimCLR [@SimCLR] and BYOL [@BYOL], will be introduced. In both fields it is of great interest to compare the different approaches and their performance on challenging benchmarks. For this reason, the last subsection \@ref(c01-03-benchmarks) of Chapter \@ref(c01-00-intro-modalities) gives an overview of different data sets, pre-training tasks and benchmarks for CV as well as for NLP.
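As a rough illustration of the attention mechanism mentioned above, the following minimal sketch (a deliberate simplification; real transformers additionally use learned query/key/value projections, multiple heads and masking) implements scaled dot-product self-attention with `numpy`:

```python
# Minimal sketch of scaled dot-product attention:
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                              # weighted sum of the values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                             # 4 tokens, 8-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))             # token embeddings
# In self-attention, queries, keys and values all come from the same sequence.
out = scaled_dot_product_attention(X, X, X)
print(out.shape)                                    # (4, 8): one updated vector per token
```

Each output row is a mixture of all input token vectors, weighted by how relevant the other tokens are to it; this is the contextualization step that made transformer-based models so effective.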
The second chapter (see \@ref(c02-00-multimodal)) focuses on different multimodal architectures, covering a wide variety of ways in which text and images can be combined. The presented models combine and advance different methods of NLP and CV. First of all, looking at Img2Text tasks (subsection \@ref(c02-01-img2text)), the data set Microsoft COCO for object recognition [@COCO] and the meshed-memory transformer for Image Captioning (M^2^ Transformer) [@meshed_memory] will be presented. Conversely, researchers have developed methods to generate pictures based on a short text prompt (subsection \@ref(c02-02-text2img)). The first models accomplishing this task were generative adversarial networks (GANs) [@GAN] and Variational Autoencoders (VAEs) [@VAE]. These methods were improved in recent years, and today's SOTA transformer architectures and text-guided diffusion models like DALL-E [@DALLE] and GLIDE [@GLIDE] achieve remarkable results. Another interesting question is how images can be utilized to support language models (subsection \@ref(c02-03-img-support-text)). This can be done via sequential embeddings, more advanced grounded embeddings or, again, inside transformers. On the other hand, one can also look at text supporting CV models like CLIP [@CLIP], ALIGN [@ALIGN] and Florence [@yuan2021florence] (subsection \@ref(c02-04-text-support-img)). These are foundation models, meaning they can be reused in other architectures (e.g. CLIP inside DALL-E 2), and they rely on a contrastive loss to connect text with images. Moreover, zero-shot learning makes it possible to classify new and unseen data without expensive fine-tuning. Especially the open-source architecture CLIP [@CLIP] for image classification and generation attracted a lot of attention last year. At the end of the second chapter, some further architectures to handle text and images simultaneously are introduced (subsection \@ref(c02-05-text-plus-img)). For instance, Data2Vec uses the same learning method for speech, vision and language and in this way aims to find a general approach to handle different modalities in one architecture. Furthermore, VilBert [@VilBert] extends the popular BERT architecture to handle both image and text as input by implementing co-attention. This method is also used in Google Deepmind's Flamingo [@alayrac2022flamingo]. In addition, Flamingo aims to tackle multiple tasks with a single visual language model via few-shot learning and by freezing the pre-trained vision and language models.
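To give an impression of the contrastive objective used by models such as CLIP and ALIGN, the following simplified sketch (a toy re-implementation, not the original code of either model; the function names and the temperature value are illustrative choices) computes a symmetric contrastive loss over a batch of image and text embeddings:

```python
# Simplified sketch of a CLIP-style symmetric contrastive loss: matching
# image/text pairs (the diagonal of the similarity matrix) should score
# higher than all mismatched combinations in the batch.
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature    # (batch, batch) similarities
    labels = np.arange(len(logits))                  # i-th image matches i-th text

    def cross_entropy(lgts):
        lgts = lgts - lgts.max(axis=1, keepdims=True)
        log_probs = lgts - np.log(np.exp(lgts).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lgts)), labels].mean()

    # cross-entropy in both directions: image->text and text->image
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
loss = clip_style_contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32)))
print(f"contrastive loss on random embeddings: {loss:.3f}")
```

Minimizing such a loss pulls matching text and image embeddings together and pushes non-matching pairs apart, which is also what enables zero-shot classification via similarity to text prompts.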
In the last chapter (see \@ref(c03-00-further)), methods are introduced that are also able to handle modalities other than text and image, e.g. video, speech or tabular data. The overall goal here is to find a general multimodal architecture based on challenges rather than modalities. Therefore, one needs to handle problems of multimodal fusion and alignment and decide whether to use a joint or a coordinated representation (subsection \@ref(c03-01-further-modalities)). Moreover, we go into more detail about how exactly to combine structured and unstructured data (subsection \@ref(c03-02-structured-unstructured)). To this end, different fusion strategies which have evolved in recent years will be presented, as sketched below. This is illustrated in this book by two use cases in survival analysis and economics. Besides this, another interesting research question is how to tackle different tasks within one so-called multi-purpose model (subsection \@ref(c03-03-multi-purpose)), as Google researchers intend with their “Pathways” model [@Pathways]. Last but not least, we show one exemplary application of Multimodal Deep Learning in the arts scene, where image generation models like DALL-E [@DALLE] are used to create art pieces in the area of Generative Arts (subsection \@ref(c03-04-usecase)).
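To make the notion of fusion strategies slightly more tangible, the following toy sketch (purely illustrative and not from the booklet; the feature dimensions, random weights and sigmoid predictors are arbitrary stand-ins) contrasts a simple early-fusion and a late-fusion approach for combining tabular features with an image-derived embedding:

```python
# Toy sketch of two simple fusion strategies for structured (tabular) and
# unstructured (image) data: early fusion concatenates feature vectors before
# a joint predictor, late fusion averages per-modality predictions.
import numpy as np

rng = np.random.default_rng(0)
tabular = rng.normal(size=5)                  # 5 structured features (e.g. clinical data)
image_emb = rng.normal(size=16)               # embedding from some vision model

# Early fusion: one joint representation fed into a single predictor.
joint_repr = np.concatenate([tabular, image_emb])
w_joint = rng.normal(size=joint_repr.size)
early_pred = 1 / (1 + np.exp(-joint_repr @ w_joint))

# Late fusion: separate predictors per modality, combined at the decision level.
w_tab, w_img = rng.normal(size=tabular.size), rng.normal(size=image_emb.size)
late_pred = 0.5 * (1 / (1 + np.exp(-tabular @ w_tab))
                   + 1 / (1 + np.exp(-image_emb @ w_img)))
print(f"early fusion: {early_pred:.3f}, late fusion: {late_pred:.3f}")
```

Real fusion architectures are of course learned end to end and may also fuse at intermediate layers; the sketch only illustrates where in the pipeline the modalities are combined.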