# Multimodal architectures {#c02-00-multimodal}
*Authors: Luyang Chu, Karol Urbanczyk, Giacomo Loss, Max Schneider, Steffen Jauch-Walser*
*Supervisor: Christian Heumann*
Multimodal learning refers to the process of learning representations from different types of input modalities, such as image data, text or speech.
Due to methodological breakthroughs in the fields of Natural Language Processing (NLP) as well as Computer Vision (CV) in recent years, multimodal models have gained increasing attention as they are able to strengthen predictions and better emulate the way humans learn.
This chapter focuses on images and text as input data.
The remainder of the chapter is structured as follows:
The first part, “Image2Text”, discusses how transformer-based architectures improve meaningful captioning for complex images, using the large-scale, richly annotated COCO dataset [@mccoco; @cornia2020m2].
While looking at a photograph and describing it or parsing a complex scene and describing its context is not a difficult task for humans, it appears to be much more complex and challenging for computers.
We start by focusing on images as an input modality.
In 2014, Microsoft COCO was developed with the primary goal of advancing the state-of-the-art (SOTA) in object recognition by diving deeper into the broader question of scene understanding [@mccoco].
"COCO" in this case is the acronym for _Common Objects in Context_.
It addresses three core problems in scene understanding: object detection (non-iconic views), segmentation, and captioning.
While transformer-based architectures are already widely used in NLP for tasks like machine translation and language understanding, their potential for applications in the multimodal context has not been fully exploited yet.
With the help of the MS COCO dataset, the transformer-based architecture "Meshed-Memory Transformer for Image Captioning" ($M^2$) will be introduced, which was able to improve both the image encoding and the language generation steps [@cornia2020m2].
The performance of $M^2$ and other fully-attentive models will be compared on the MS COCO dataset.
Next, in *Text2Image*, the idea of incorporating textual input in order to generate visual representations is described. Current advancements in this field have been made possible largely due to recent breakthroughs in NLP, which first allowed for learning contextual representations of text. Transformer-like architectures are being used to encode the input into embedding vectors, which are later helpful in guiding the process of image generation. The chapter discusses the development of the field in chronological order, looking into details of the most recent milestones. Concepts such as generative adversarial networks (GAN), variational auto-encoders (VAE), VAE with vector quantization (VQ-VAE), diffusion, and autoregressive models are covered to provide the reader with a better understanding of the roots of the current research and where it might be heading. Some of the most outstanding outputs generated by state-of-the-art works are also presented in the chapter.
The third part, “Images supporting Language Models”, deals with the integration of visual elements in pure textual language models.
Distributional semantic models such as Word2Vec and BERT assume that the meaning of a given word or sentence can be understood by looking at how (in which context) and when the word or sentence appears in the text corpus, namely from its “distribution” within the text.
But this assumption has been historically questioned, because words and sentences must be grounded in other perceptual dimensions in order to understand their meaning [see for example the “symbol grounding problem”\; @harnad1990symbol].
For these reasons, a broad range of models has been developed with the aim of improving pure language models by leveraging the addition of other perceptual dimensions, such as the visual one.
This subchapter focuses in particular on the integration of visual elements (here: images) to support pure language models for various tasks at the word-/token-level as well as on the sentence-level.
The starting point in this case is always a language model, into which visual representations (often extracted with the help of large pools of images from datasets like MS COCO; see the section "Image2Text" for further references) are to be “integrated”.
But how?
A wide range of solutions has been proposed:
On one side of the spectrum, textual elements and visual ones are learned separately and then “combined” afterwards, whereas on the other side, the learning of textual and visual features takes place simultaneously/jointly.
```{r f02-00-01, fig.align = 'center', out.width = '100%',echo=FALSE, fig.cap="(ref:f02-00-01)"}
knitr::include_graphics("figures/02-chapter2/Img_Ch_Intro.png")
```
(ref:f02-00-01) Left: @silberer2014learning stack autoencoders to learn higher-level embeddings from textual and visual modalities, encoded as vectors of attributes. Right: @bordes2020incorporating fuse textual and visual information in an intermediate space denoted as "grounded space"; the "grounding objective function" is not applied directly on sentence embeddings but trained on this intermediate space, on which sentence embeddings are projected.
For example, @silberer2014learning implement a model where a one-to-one correspondence between textual and visual space is assumed.
Text and visual representations are passed to two separate unimodal encoders and both outputs are then fed to a bimodal autoencoder.
On the other side, @bordes2020incorporating propose a “text objective function” whose parameters are shared with an additional “grounded objective function”.
The training of the latter takes place in what the authors call a “grounded space”, which makes it possible to avoid the one-to-one correspondence between textual and visual space.
These are just introductory examples; between these two approaches there are many shades of gray (probably even more than fifty...).
In many instances these models exhibit better performance than pure language models, but they still struggle in some respects, for example when dealing with abstract words and sentences.
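As a more tangible illustration of the "learn separately, then combine" end of this spectrum, the following toy sketch shows the forward pass of a small bimodal autoencoder, loosely in the spirit of @silberer2014learning. It is purely illustrative: the random weights, layer sizes and activation functions are assumptions made here for the example, and no training loop is shown.

```{r c02-00-bimodal-ae}
# Illustrative only: a tiny "bimodal autoencoder" forward pass with random
# weights; dimensions and activations are assumptions, not those of the paper.
set.seed(1)
d_text <- 8; d_img <- 10; d_uni <- 4; d_joint <- 3

x_text <- rnorm(d_text)   # e.g. a textual attribute vector
x_img  <- rnorm(d_img)    # e.g. a visual attribute vector

W_t <- matrix(rnorm(d_uni * d_text), d_uni, d_text)            # unimodal text encoder
W_v <- matrix(rnorm(d_uni * d_img),  d_uni, d_img)             # unimodal image encoder
W_j <- matrix(rnorm(d_joint * 2 * d_uni), d_joint, 2 * d_uni)  # bimodal encoder
W_o <- matrix(rnorm(2 * d_uni * d_joint), 2 * d_uni, d_joint)  # bimodal decoder

z_text  <- tanh(W_t %*% x_text)                  # unimodal encodings
z_img   <- tanh(W_v %*% x_img)
z_joint <- tanh(W_j %*% rbind(z_text, z_img))    # shared multimodal embedding
recon   <- W_o %*% z_joint                       # reconstruction of both encodings

mean((rbind(z_text, z_img) - recon)^2)           # reconstruction error
```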
Afterwards, in the subchapter on “Text supporting Image Models”, approaches where natural language is used as additional supervision for CV models are described.
Intuitively, these models should be more powerful than models supervised solely by manually labeled data, simply because there is much more signal available in the training data.
One prominent example for this is the CLIP model [@radford2021learning] with its new dataset WIT (WebImageText) comprising 400 million text-image pairs scraped from the internet.
Similar to "Text2Image", the recent success stories in NLP have inspired most of the new approaches in this field, most importantly pre-training methods which learn directly from raw text [e.g. GPT-n, Generative Pre-trained Transformer\; @brown2020language].
Hence, the acronym CLIP stands for _C_ontrastive _L_anguage-_I_mage _P_re-training.
A transformer-like architecture is used for jointly pre-training a text encoder and an image encoder.
For this, a contrastive objective is employed: correctly predicting which natural language text pertains to which image within a given batch.
Training this way turned out to be more efficient than generating captions for images.
This leads to a flexible model, which at test time uses the learned text encoder as a “zero-shot” classifier on embeddings of the target dataset’s classes.
The model can, for example, perform optical character recognition, geo-location detection and action recognition.
Performance-wise, CLIP can be competitive with task-specific supervised models without ever having seen an instance of the specific dataset before.
This suggests an important step towards closing the “robustness gap”, where machine learning models fail to meet the expectations set by their previous performance -- especially on ImageNet test-sets -- on new datasets.
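The contrastive objective and the zero-shot idea can be illustrated with a toy example. The sketch below is not the actual CLIP implementation: random matrices stand in for the outputs of the image and text encoders, and the batch size, embedding dimension and temperature value are arbitrary assumptions.

```{r c02-00-clip-sketch}
# Illustrative only: CLIP-style symmetric contrastive loss on a toy batch.
set.seed(2)
n <- 4; d <- 16                                       # batch size, embedding dimension
l2_normalize <- function(M) M / sqrt(rowSums(M^2))

img_emb  <- l2_normalize(matrix(rnorm(n * d), n, d))  # stand-in image encoder output
text_emb <- l2_normalize(matrix(rnorm(n * d), n, d))  # stand-in text encoder output

temperature <- 0.07
logits <- (img_emb %*% t(text_emb)) / temperature     # n x n similarity matrix

cross_entropy <- function(logits, targets) {
  log_probs <- logits - log(rowSums(exp(logits)))     # row-wise log-softmax
  -mean(log_probs[cbind(seq_len(nrow(logits)), targets)])
}

# The i-th image belongs to the i-th text and vice versa (diagonal targets).
loss_i2t <- cross_entropy(logits,    seq_len(n))
loss_t2i <- cross_entropy(t(logits), seq_len(n))
(loss_i2t + loss_t2i) / 2                             # symmetric contrastive loss

# Zero-shot classification: embed one prompt per class ("a photo of a ...")
# and pick the class whose text embedding is most similar to a new image.
class_prompts <- l2_normalize(matrix(rnorm(3 * d), 3, d))
new_img       <- l2_normalize(matrix(rnorm(d), 1, d))
which.max(new_img %*% t(class_prompts))
```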
Finally, the subchapter “Models for both modalities” discusses how text and image inputs can be incorporated into a single unifying framework in order to get closer to a general self-supervised learning framework.
There are two key advantages that make such an architecture particularly interesting.
Similar to the models mentioned in previous parts, self-supervised models do not rely on human labelling and therefore do not suffer from the same capacity constraints as regular supervised learning models.
On top of that, while there have been notable advances in dealing with different modalities using single-modality models, it is often unclear to which extent a model structure generalizes across different modalities.
Rather than potentially learning modality-specific biases, a general multipurpose framework can help increase robustness while also simplifying the learner portfolio.
In order to investigate different challenges and trends in vision-and-language modelling,
this section takes a closer look at three different models, namely data2vec [@baevski2022data2vec], VilBert [@lu2019vilbert] and Flamingo [@alayrac2022flamingo].
Data2vec is a new multimodal self-supervised learning model which uses a single framework to process either speech, natural language or visual information.
This is in contrast to earlier models which used different algorithms for different modalities.
The core idea of data2vec, developed by Meta AI, is to predict latent representations of the full input data based on a masked view of the input, in a self-distillation setup using a standard transformer architecture [@baevski2022data2vec].
As a result, the main improvement lies in the framework itself, not in the underlying architectures.
For example, the transformer architecture being used follows @vaswani2017attention.
Thanks to their parallelizability, transformers have several advantages over RNNs/CNNs, particularly when
large amounts of data are being used, making them the de facto standard approach in vision-language modelling [@dosovitskiy2020image].
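To make data2vec's self-distillation setup more tangible, the following deliberately simplified sketch shows a single training step. Linear maps stand in for the transformer encoder, and the dimensions, mask ratio and EMA decay are assumptions chosen for the example, not the values used in the paper.

```{r c02-00-data2vec-sketch}
# Illustrative only: the data2vec training signal in one step.
# The student sees a masked view of the input, the teacher (an exponential
# moving average of the student) sees the full input, and the student
# regresses onto the teacher's latent representations.
set.seed(3)
d_in <- 12; d_rep <- 6; tau <- 0.999     # assumed dimensions and EMA decay

x      <- rnorm(d_in)                    # one input (image patches, tokens, ...)
mask   <- runif(d_in) < 0.3              # randomly masked positions
x_mask <- ifelse(mask, 0, x)             # masked view fed to the student

W_student <- matrix(rnorm(d_rep * d_in), d_rep, d_in)
W_teacher <- W_student                   # teacher starts as a copy of the student

target <- W_teacher %*% x                # teacher: latent targets from the full input
pred   <- W_student %*% x_mask           # student: prediction from the masked view
mean((pred - target)^2)                  # regression loss on latent targets

# After each gradient step on W_student, the teacher tracks the student via an
# exponential moving average instead of receiving gradients itself.
W_teacher <- tau * W_teacher + (1 - tau) * W_student
```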
VilBert is an earlier model which, in contrast to data2vec, can handle cross-modality tasks.
Finally, Flamingo is a modern few-shot learning model which features 80B parameters,
significantly more than the other two models. Through a large language model incorporated in its architecture, it has strong text-generating capabilities to tackle open-ended tasks. It also addresses the question of how to efficiently train increasingly large models, shows the effectiveness of using perceiver architectures [@jaegle2021perceiver] to encode inputs from different modalities, and demonstrates how to leverage communication between pretrained and frozen models.