## Image2Text {#c02-01-img2text}
*Author: Luyang Chu*
*Supervisor: Christian Heumann*
Image captioning is the task of producing descriptive text for a given image. It has attracted growing interest in both natural language processing and computer vision research in recent years, since it requires a semantic comprehension of images as well as the capacity to generate accurate and well-formed description sentences.
### Microsoft COCO: Common Objects in Context
The understanding of visual scenes plays an important role in computer vision (CV) research. It comprises many tasks, such as image classification, object detection, object localization and semantic scene labeling.
Throughout the history of CV research, high-quality image data sets have played a critical role. They are not only essential for training and evaluating new algorithms, but also lead the research into new, challenging directions [@mccoco]. In the early years, researchers developed data sets [@deng2009imagenet], [@sun], [@pascalvoc] which enabled the direct comparison of hundreds of image recognition algorithms and thereby drove early progress in object recognition. In the more recent past, ImageNet [@deng2009imagenet], which contains millions of images, has enabled breakthroughs in both object classification and detection research using new deep learning algorithms.
With the goal of advancing the state-of-the-art in object recognition, especially scene understanding, a new large-scale data set called "Microsoft Common Objects in Context" (MS COCO) was published in 2014. MS COCO focuses on three core problems in scene understanding: detecting non-iconic views, detecting the semantic relationships between objects and determining the precise localization of objects [@mccoco].
The MS COCO data set contains 91 common object categories with a total of 328,000 images as well as 2,500,000 instance labels. The authors claim that all of these categories could be recognized by a 4-year-old child. 82 of the categories include more than 5,000 labeled instances. These labeled instances may support the detection of relationships between objects in MS COCO. In order to provide precise localization of object instances, only "thing" categories such as car, table, or dog were included. Objects which do not have clear boundaries, such as sky, sea, or grass, were not included. In current object recognition research, algorithms perform well on images with iconic views. Images with an iconic view are defined as containing a single object category of interest centered in the image. To accomplish the goal of detecting the contextual relationships between objects, more complex images with multiple objects, as well as natural images from daily life, were also gathered for the data set.
In addition to MS COCO, researchers have been working on the development of other large databases. In recent years many large databases like ImageNet, PASCAL VOC [@pascalvoc] and SUN [@sun] have been developed in the field of computer vision. Each of these data sets has its own specific focus.
Datasets for object recognition can be roughly split into three groups: object classification, object detection and semantic scene labeling.
Object classification requires binary labels indicating whether objects are present in an image. ImageNet [@deng2009imagenet] is clearly distinguishable from other data sets in terms of size: it contains 22k categories with 500-1,000 images each. In total, ImageNet thus contains over 14 million labeled images with both entity-level and fine-grained categories, organized via the WordNet hierarchy, and has enabled significant advances in image classification.
Detecting an object includes two steps: the first is to ensure that an object from a specified class is present, the second is to localize the object in the image with a bounding box. This can be applied to solve tasks like face detection or pedestrian detection. The PASCAL VOC [@pascalvoc] data set can be used for the detection of basic object categories. With 20 object categories and over 11,000 images, PASCAL VOC contains over 27,000 labeled object instances annotated with bounding boxes. Almost 7,000 of these object instances come with detailed segmentations [@mccoco].
Labeling semantic objects in a scene requires that each pixel of an image is labeled as belonging to a category, such as sky, chair, etc., but individual instances of objects do not need to be segmented [@mccoco]. Objects without clear boundaries, like sky, grass or street, can also be defined and labeled in this way.
The SUN data set [@sun] combines many of the properties of both object detection and semantic scene labeling data sets for the task of scene understanding: it contains 908 scene categories from the WordNet dictionary [@WordNet] with segmented objects.
Its 3,819 object categories cover both those typical of object detection data sets (person, chair) and those of semantic scene labeling (wall, sky, floor) [@mccoco].
#### Image Collection and Annotation for MS COCO
MS COCO is a large-scale, richly annotated data set. The process of building it consisted of two phases: data collection and image annotation.
In order to select representative object categories for the images in MS COCO, the researchers collected candidate categories from different existing data sets like PASCAL VOC [@pascalvoc] and other sources. All these object categories could, according to the authors, be recognized by children between 4 and 8 years of age. The quality of the object categories was ensured by the co-authors, who rated the categories on a scale from 1 to 5 depending on their common occurrence, practical applicability and diversity from other categories [@mccoco]. The final list contained 91 categories. All categories from PASCAL VOC are included in MS COCO.
With the help of representative object categories, the authors of MS COCO wanted to collect a data set in which a majority of the included images are non-iconic. All included images can be roughly divided into three types according to Fig. \@ref(fig:imagetype): iconic-object images, iconic-scene images and non-iconic images [@mccoco].
```{r imagetype, fig.align = 'center', out.width = '100%',echo=FALSE, fig.cap="(ref:imagetype)" }
knitr::include_graphics("figures/02-01/2.1 imagetype.png")
```
(ref:imagetype) Type of images in the data set [@mccoco].
Images were collected through two strategies: first, images from Flickr, a platform for photos uploaded by amateur photographers, were collected together with their keywords; second, researchers searched for pairwise combinations of object categories like "dog + car" to gather more non-iconic images and images with rich contextual relationships [@mccoco].
Due to the scale of the data set and the high cost of the annotation process, designing a high-quality annotation pipeline at reasonable cost was a difficult task. The annotation pipeline in Fig. \@ref(fig:cocoannotation) for MS COCO was split into three primary tasks: 1. category labeling, 2. instance spotting, and 3. instance segmentation [@mccoco].
```{r cocoannotation, fig.align = 'center', out.width = '100%',echo=FALSE, fig.cap="(ref:cocoannotation)" }
knitr::include_graphics("figures/02-01/2.1 annotation pipeline.png")
```
(ref:cocoannotation) Annotation pipeline for MS COCO [@mccoco].
As one can see in Fig. \@ref(fig:cocoannotation), the object categories contained in each image were determined in the first step. Due to the large number of images and categories, a hierarchical approach was used instead of doing binary classification for each category: all 91 categories were grouped into 11 super-categories, and for each super-category the annotator then examined whether instances belonging to it were present in the image.
Workers only had to label one instance of each present super-category with a category's icon [@mccoco]. Each image was labeled by eight workers. This hierarchical approach helped to reduce the time needed for labeling. Nevertheless, the first phase still took ∼20k worker hours to complete.
In the next step, all instances of the object categories in an image were spotted; at most 10 instances of a given category per image were labeled by each worker. In both the instance spotting and the instance segmentation steps, the locations of the instances found by workers in the previous stage were shown to the current worker. Each image was again labeled by eight workers, summing up to a total of ∼10k worker hours.
In the final segmentation stage, each object instance was segmented; the segmentations of other instances and the specification of the object instance from the previous stage were again shown to the worker. Segmenting 2.5 million object instances was an extremely time-consuming task which required over 22 worker hours per 1,000 segmentations. To minimize cost and improve the quality of the segmentation, all workers were required to complete a training task for each object category.
In order to further ensure quality, an explicit verification step on each segmented instance was performed as well.
#### Comparison with other data sets
In recent years, researchers have developed several pre-training data sets and benchmarks which helped the development of CV algorithms.
Each of these data sets varies significantly in size, number of categories and types of images.
In the previous part, we also introduced the different research foci of some data sets, e.g. ImageNet [@deng2009imagenet], PASCAL VOC [@pascalvoc] and SUN [@sun].
ImageNet, containing millions of images, has enabled major breakthroughs in both object classification and detection research using a new class of deep learning algorithms. It was created with the intention to capture a large number of object categories, many of which are fine-grained. SUN focuses on labeling scene types and the objects that commonly occur in them. Finally, PASCAL VOC’s primary application is in object detection in natural images. MS COCO is designed for the detection and segmentation of objects occurring in their natural context [@mccoco].
With the help of Fig. \@ref(fig:cococomparison), one can compare MS COCO to ImageNet, PASCAL VOC and SUN with respect to different aspects [@mccoco].
```{r cococomparison, fig.align = 'center', out.width = '100%',echo=FALSE, fig.cap="(ref:cococomparison)" }
knitr::include_graphics("figures/02-01/2.1 coco comparison.png")
```
(ref:cococomparison) Comparison MS COCO with PASCAL VOC, SUN and ImageNet [@mccoco].
The number of instances per category for all 91 categories in MS COCO and PASCAL VOC is shown in subfigure \@ref(fig:cococomparison) (a). Compared to PASCAL VOC, MS COCO has both more categories and (on average) more instances per category. The number of object categories and the number of instances per category for all the data sets is shown in subfigure \@ref(fig:cococomparison) (d). MS COCO has fewer categories than ImageNet and SUN, but it has the highest average number of instances per category among all the data sets, which, according to the authors, might be useful for learning complex models capable of precise localization [@mccoco].
Subfigures \@ref(fig:cococomparison) (b) and (c) show the number of annotated categories and annotated instances per image for MS COCO, ImageNet, PASCAL VOC and SUN (average number of categories and instances are shown in parentheses). On average MS COCO contains 3.5 categories and 7.7 instances per image. ImageNet and PASCAL VOC both have on average less than 2 categories and 3 instances per image. The SUN data set has the most contextual information, on average 9.8 categories and 17 instances per image. Subfigure \@ref(fig:cococomparison) (e) depicts the distribution of instance sizes for the MS COCO, ImageNet Detection, PASCAL VOC and SUN data set.
#### Discussion
MS COCO is a large-scale data set for detecting and segmenting objects found in everyday life, with the aim of improving the state-of-the-art in object recognition and scene understanding. It focuses on non-iconic images of objects in natural environments and contains rich contextual information with many objects present per image. Like other commonly used vision data sets, MS COCO was labor-intensive and costly to create.
At vast cost and with over 70,000 worker hours, 2.5 million instances were annotated to drive the advancement of object detection and segmentation algorithms. MS COCO is still a widely used benchmark in the field of CV [@mccoco].
The MS COCO team also points out directions for future work. For example, "stuff" labels like "sky", "grass" and "street" may also be included in the data set, since "stuff" categories provide significant contextual information for object detection.
### Models for Image Captioning
The image captioning task is generally to describe the visual content of an image in natural language, so it requires an algorithm to understand and model the relationships between visual and textual elements, and to generate a sequence of output words [@cornia2020m2].
In the last few years, a large collection of methods has been proposed for image captioning. Earlier approaches were based on the generation of simple templates, which were filled with the output of an object detector or attribute predictor [@Socher10connectingmodalities], [@5487377].
Due to the sequential nature of language, most research on image captioning has focused on deep learning techniques, especially Recurrent Neural Networks (RNNs) [@vinyals], [@karpthy1] or one of their special variants (e.g. LSTMs). Mostly, RNNs are used as language models for sequence generation, while the visual information is encoded in the output of a CNN. With the aim of modeling the relationships between image regions and words, graph convolutional neural networks [@yao1] or single-layer attention mechanisms [@xu1] have been proposed on the image encoding side to incorporate more semantic and spatial relationships between objects.
RNN-based models are widely adopted; however, they are limited both in their representation power and by their sequential nature [@cornia2020m2].
Recently, new fully-attentive models, in which the use of self-attention has replaced the recurrence, have been proposed. New approaches apply the Transformer architecture [@NIPS2017_3f5ee243] and BERT [@devlin-etal-2019-bert] models to solve image captioning tasks.
The transformer consists of an encoder with a stack of self-attention and feed-forward layers, and a decoder which uses (masked) self-attention on words and cross-attention over the output of the last encoder layer [@cornia2020m2]. In some other transformer-based approaches, a transformer-like encoder was paired with an LSTM decoder, while the aforementioned approaches have exploited the original transformer architecture.
Others [@HerdadeKBS19] proposed a transformer architecture for image captioning that additionally focuses on geometric relations between input objects. Specifically, additional geometric weights between object pairs, which are used to scale the attention weights, are computed. Similarly, an extension of the attention operator, in which the final attended information is weighted by a gate guided by the context, was introduced at around the same time [@huang1].
### Meshed-Memory Transformer for Image Captioning ($M^2$)
Transformer-based architectures have been widely implemented in sequence modeling tasks like machine translation and language understanding. However, their applicability for multi-modal tasks like image captioning has still been largely under-explored [@cornia2020m2].
```{r m2arc1, fig.align = 'center', out.width = '100%',echo=FALSE, fig.cap="(ref:m2arc1)" }
knitr::include_graphics("figures/02-01/2.2 m2arc1.png")
```
(ref:m2arc1) $M^2$ Transformer [@cornia2020m2].
A novel fully-attentive approach called Meshed-Memory Transformer for Image Captioning ($M^2$) was proposed in 2020 [@cornia2020m2] with the aim of improving the design of both the image encoder and the language decoder. Compared to previous image captioning models, $M^2$ (see Fig. \@ref(fig:m2arc1)) has two novelties: the encoder encodes a multi-level representation of the relationships between image regions, covering both low-level and high-level relations, and a-priori knowledge can be learned and modeled by using persistent memory vectors. The multi-layer architecture exploits both low- and high-level visual relationships through a learned gating mechanism, which weights the contribution of each level; thereby, a mesh-like connectivity between encoder and decoder layers is created for the sentence generation process [@cornia2020m2].
#### $M^2$ Transformer Architecture
```{r m2arc2, fig.align = 'center', out.width = '100%',echo=FALSE, fig.cap="(ref:m2arc2)" }
knitr::include_graphics("figures/02-01/2.1 m2.png")
```
(ref:m2arc2) $M^2$ Transformer Architecture [@cornia2020m2].
Fig. \@ref(fig:m2arc2) shows the detailed architecture of the $M^2$ Transformer. It can be divided into an encoder module (left) and a decoder module (right), both consisting of multiple layers. Given the input image regions $X$, the image features are passed through attention and feed-forward layers. The relationships between image regions, enriched with a-priori knowledge, are encoded in each encoding layer; the outputs of all encoding layers are then read by the decoding layers to generate the caption word by word [@cornia2020m2].
All interactions between word- and image-level features of the input image $X$ are modeled by using scaled dot-product attention. Attention operates on vectors of queries $q$, keys $k$ and values $v$, and takes a weighted sum of the value vectors according to a similarity distribution between query and key vectors.
Attention can be defined as follows [@cornia2020m2]:
\begin{equation}
Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d}}) V
(\#eq:binom)
\end{equation}
where $Q$ is a matrix of $n_q$ query vectors, $K$ and $V$ both contain $n_k$ keys and values, all vectors have the same dimensionality, and $d$ is a scaling factor.
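To make the formula above concrete, here is a minimal sketch of scaled dot-product attention in base R with toy dimensions; the function names and the random inputs are purely illustrative and do not correspond to any particular implementation.

```{r attention-sketch}
# Minimal sketch of scaled dot-product attention (toy dimensions, for illustration only)
softmax_rows <- function(M) {
  M <- M - apply(M, 1, max)            # subtract row-wise max for numerical stability
  exp(M) / rowSums(exp(M))
}

attention <- function(Q, K, V) {
  d <- ncol(K)                         # common dimensionality of queries and keys
  scores <- Q %*% t(K) / sqrt(d)       # similarity between each query and each key
  softmax_rows(scores) %*% V           # weighted sum of the value vectors
}

set.seed(1)
Q <- matrix(rnorm(2 * 4), nrow = 2)    # n_q = 2 queries, d = 4
K <- matrix(rnorm(3 * 4), nrow = 3)    # n_k = 3 keys
V <- matrix(rnorm(3 * 4), nrow = 3)    # n_k = 3 values
round(attention(Q, K, V), 3)           # one attended vector per query
```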
##### Memory-Augmented Encoder
For the given image regions $X$, attention can be used to obtain a permutation-invariant encoding of $X$ through self-attention operations; the operator from the Transformer can be defined as follows [@cornia2020m2]:
\begin{equation}
S(X) = Attention(W_q X, W_k X, W_vX)
\end{equation}
In this case, queries, keys and values are linear projections of the input features, and $W_q$, $W_k$, $W_v$ are their learnable weights. The output thus depends solely on the pairwise similarities between linear projections of the input set $X$: the self-attention operator encodes the pairwise relationships inside the input.
However, self-attention also has a limitation: a-priori knowledge about relationships between image regions cannot be modelled.
To overcome this limitation, the authors introduce a **Memory-Augmented Attention** operator which extends the keys and values with additional prior information that does not depend on the image regions $X$.
The additional keys and values are initialized as plain learnable vectors which can be directly updated via SGD.
The operator can be defined as follows [@cornia2020m2]:
\begin{align}
M_{mem}(X) &= Attention(W_qX, K, V ) \notag \\
K &= [W_kX, M_k]\notag \\
V &= [W_vX, M_v]
\end{align}
Here, $M_k$ and $M_v$ are learnable matrices with $n_m$ rows, and [·,·] indicates concatenation. The additional keys and values help to retrieve a-priori knowledge from the input while keeping the queries unchanged [@cornia2020m2].
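As a rough illustration, the following sketch (re-using the `attention()` helper from above) appends hypothetical memory keys and values to the projected inputs; all weights here are placeholders rather than learned parameters.

```{r memory-attention-sketch}
# Sketch of memory-augmented attention: keys and values are extended with
# n_m memory slots (random placeholders here; learned via SGD in the paper).
memory_attention <- function(X, W_q, W_k, W_v, M_k, M_v) {
  K <- rbind(X %*% W_k, M_k)           # concatenate projected keys with memory keys
  V <- rbind(X %*% W_v, M_v)           # concatenate projected values with memory values
  attention(X %*% W_q, K, V)           # queries come only from the image regions
}

d   <- 4; n_r <- 3; n_m <- 2
X   <- matrix(rnorm(n_r * d), nrow = n_r)        # n_r image region features
W_q <- diag(d); W_k <- diag(d); W_v <- diag(d)   # identity projections as placeholders
M_k <- matrix(rnorm(n_m * d), nrow = n_m)        # memory keys
M_v <- matrix(rnorm(n_m * d), nrow = n_m)        # memory values
dim(memory_attention(X, W_q, W_k, W_v, M_k, M_v))  # still one output vector per region
```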
For the **Encoding Layer**, the memory-augmented operator is injected into a transformer-like layer, and its output is fed into a position-wise feed-forward layer [@cornia2020m2]:
\begin{equation}
F(X)_i= U\sigma(V X_i + b) + c;
\end{equation}
$X_i$ indicates the $i$-th vector of the input set, and $F(X)_i$ the $i$-th vector of the output. Also, $\sigma(·)$ is the ReLU activation function, $V$ and $U$ are learnable weight matrices, $b$ and $c$ are bias terms [@cornia2020m2].
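Applied to a whole set of region vectors at once, this position-wise feed-forward layer could be transcribed as follows (the weight shapes are assumptions; `V_ff` and `U_ff` are named to avoid clashing with the attention value matrix $V$):

```{r ffn-sketch}
# Position-wise feed-forward layer applied to every row of Z independently:
# F(X)_i = U * ReLU(V * X_i + b) + c
ffn <- function(Z, V_ff, b_ff, U_ff, c_ff) {
  H <- sweep(Z %*% V_ff, 2, b_ff, "+")       # inner affine map V X_i + b
  sweep(pmax(H, 0) %*% U_ff, 2, c_ff, "+")   # ReLU, then outer affine map U(.) + c
}
```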
Each component is additionally wrapped in a residual connection and a layer normalization operation. The complete definition of an encoding layer can finally be written as [@cornia2020m2]:
\begin{align}
Z &= AddNorm(M_{mem}(X))\notag \\
\tilde{X}&=AddNorm(F(Z))
\end{align}
Finally, the **Full Encoder** stacks multiple encoding layers in a sequential fashion, so that the $i$-th layer uses the output set computed by layer $i-1$. Higher encoding layers can thereby exploit and refine relationships already identified by previous layers. $n$ encoding layers produce the outputs $\tilde{X} = (\tilde{X}^1 \dots \tilde{X}^n)$ [@cornia2020m2].
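Putting the pieces together, a schematic encoding layer and encoder stack might look as sketched below, re-using `memory_attention()` and `ffn()` from the previous snippets; the `AddNorm` wrapper and all parameter shapes are simplified assumptions, not the authors' implementation.

```{r encoder-stack-sketch}
# Simplified AddNorm: residual connection followed by per-row layer normalization
layer_norm <- function(X) t(apply(X, 1, function(x) (x - mean(x)) / (sd(x) + 1e-6)))
add_norm   <- function(X, sublayer_out) layer_norm(X + sublayer_out)

# One encoding layer: memory-augmented attention, then position-wise feed-forward,
# each wrapped in AddNorm
encoding_layer <- function(X, p) {
  Z <- add_norm(X, memory_attention(X, p$W_q, p$W_k, p$W_v, p$M_k, p$M_v))
  add_norm(Z, ffn(Z, p$V_ff, p$b_ff, p$U_ff, p$c_ff))
}

# Full encoder: every layer feeds the next one, and *all* layer outputs are kept,
# because the meshed decoder later attends to each of them
full_encoder <- function(X, layer_params) {
  outputs <- vector("list", length(layer_params))
  for (i in seq_along(layer_params)) {
    X <- encoding_layer(X, layer_params[[i]])
    outputs[[i]] <- X
  }
  outputs                                # list (X^1, ..., X^n)
}
```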
##### Meshed Decoder
The decoder depends on both previously generated words and image region encodings.
**Meshed Cross-Attention** can take advantage of all encoder layers when generating the caption. The right side of Fig. \@ref(fig:m2arc2) shows the structure of the meshed decoder. The input sequence $Y$ and the outputs of all encoder layers $\tilde{X}$ are connected by the meshed attention operator via gated cross-attention. The meshed attention operator is formally defined as [@cornia2020m2]:
\begin{equation}
M_{mesh}(\tilde{X}, Y) =\sum_{i = 1}^{N}\alpha_i C(\tilde{X^i}, Y)
\end{equation}
$C(·,·)$ stands for the encoder-decoder cross-attention; it is computed with queries from the decoder, while keys and values come from the encoder [@cornia2020m2]:
\begin{equation}
C(\tilde{X^i}, Y) = Attention(W_q Y, W_k \tilde{X^i}, W_v \tilde{X^i})
\end{equation}
$\alpha_i$ is a matrix of weights of the same size as the cross-attention results; it models both the single contribution of each encoder layer and the relative importance between different layers [@cornia2020m2]:
\begin{equation}
\alpha_i = \sigma(W_i [Y,C(\tilde{X^i}, Y)]+ b_i)
\end{equation}
Here, [·,·] indicates concatenation, $\sigma(·)$ is the sigmoid activation function, $W_i$ is a weight matrix, and $b_i$ is a learnable bias vector [@cornia2020m2].
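The gated, meshed cross-attention can be sketched as follows, again re-using the `attention()` helper from above; the per-layer gate parameters `W_gate` and `b_gate` are illustrative placeholders.

```{r meshed-attention-sketch}
sigmoid <- function(x) 1 / (1 + exp(-x))

# Meshed cross-attention: attend from the word sequence Y to every encoder layer
# output X^i, gate each contribution element-wise with alpha_i, and sum over layers.
meshed_attention <- function(X_list, Y, W_q, W_k, W_v, W_gate, b_gate) {
  out <- 0
  for (i in seq_along(X_list)) {
    C_i     <- attention(Y %*% W_q, X_list[[i]] %*% W_k, X_list[[i]] %*% W_v)
    gate_in <- cbind(Y, C_i)                               # concatenation [Y, C(X^i, Y)]
    alpha_i <- sigmoid(sweep(gate_in %*% W_gate[[i]], 2, b_gate[[i]], "+"))
    out     <- out + alpha_i * C_i                         # element-wise gating, summed
  }
  out
}
```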
In the decoder layers, the prediction of a word should only depend on the previously generated words, so each decoder layer comprises a masked self-attention operation: the operator can only connect queries derived from the $t$-th element of its input sequence $Y$ with keys and values obtained from the left sub-sequence, i.e. $Y_{\leq t}$.
Similar to the encoder layers, the decoder layers also contain a position-wise feed-forward layer, so the decoder layer can finally be defined as [@cornia2020m2]:
\begin{align}
Z &= AddNorm(M_{mesh}(\tilde{X}, AddNorm(S_{mask}(Y)))) \notag \\
\tilde{Y} &= AddNorm(F(Z)),
\end{align}
where $S_{mask}$ indicates a masked self-attention over time [@cornia2020m2].
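The masking itself can be sketched by setting the attention scores of all future positions to $-\infty$ before the softmax (re-using `softmax_rows()` from above):

```{r masked-attention-sketch}
# Masked self-attention: position t may only attend to positions 1..t (causal mask);
# scores must be square here, since queries and keys come from the same sequence Y.
masked_attention <- function(Q, K, V) {
  d <- ncol(K)
  scores <- Q %*% t(K) / sqrt(d)
  scores[upper.tri(scores)] <- -Inf     # block attention to future positions
  softmax_rows(scores) %*% V
}
```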
The full decoder with multiple decoder layers takes the input word vectors as well as the $t$-th element (and all elements prior to it) of its output sequence to predict the word at position $t+1$, conditioned on $Y_{\leq t}$. Finally, the decoder applies a linear projection and a softmax operation, which yields a probability distribution over all words in the vocabulary [@cornia2020m2].
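As a final, purely illustrative step, projecting a single decoder output vector to a probability distribution over a toy vocabulary could look like this (the vocabulary and weights are made-up placeholders):

```{r vocab-softmax-sketch}
# Project one decoder output vector to vocabulary probabilities (linear layer + softmax)
predict_next_word <- function(y_t, W_vocab, b_vocab) {
  logits <- as.vector(y_t %*% W_vocab) + b_vocab              # projection to vocab size
  exp(logits - max(logits)) / sum(exp(logits - max(logits)))  # softmax over the vocabulary
}

vocab   <- c("a", "dog", "on", "the", "street")    # toy vocabulary
W_vocab <- matrix(rnorm(4 * length(vocab)), nrow = 4)
b_vocab <- rep(0, length(vocab))
y_t     <- rnorm(4)                                # hypothetical decoder output (d = 4)
round(setNames(predict_next_word(y_t, W_vocab, b_vocab), vocab), 3)
```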
##### Comparison with other models on the MS COCO data sets
The $M^2$ Transformer was evaluated on MS COCO, which is still one of the most commonly used test data sets for image captioning. Instead of using the original MS COCO split, @cornia2020m2 follow the split provided by @karpthy1, which uses 5,000 images for validation, 5,000 images for testing and the rest for training.
For model evaluation and comparison, standard metrics for evaluating generated sequences, like BLEU [@papineni-etal-2002-bleu], METEOR [@meteor], ROUGE [@lin-2004-rouge], CIDEr [@cider], and SPICE [@spice], which have been introduced in the second chapter, are used.
```{r compare1, fig.align = 'center', out.width = '100%',echo=FALSE, fig.cap="(ref:compare1)" }
knitr::include_graphics("figures/02-01/02-02 compare1.png")
```
(ref:compare1) Comparison of $M^2$ with Transformer-based alternatives [@cornia2020m2]
The transformer architecture in its original configuration with six layers had already been applied to captioning, but researchers speculated that task-specific architectures might be required, so several variations of the original transformer are compared with the $M^2$ Transformer: a transformer with three layers and a transformer that applies the “Attention on Attention” (AoA) approach [@huang1] in the attentive layers, both in the encoder and in the decoder [@cornia2020m2].
The second part of the evaluation assesses the importance of the meshed connections between encoder and decoder layers. The $M^2$ Transformer (1 to 1) is a reduced version of the original $M^2$ Transformer, in which each encoder layer is connected only to the corresponding decoder layer instead of to all decoder layers.
As one can see in Fig. \@ref(fig:compare1), the original Transformer achieves a CIDEr score of 121.8, whereas the reduced $M^2$ Transformer (1 to 1) already improves this to 129.2 CIDEr. Adding the meshed connectivity, which exploits relationships encoded at all layers and weights them with a sigmoid gating, yields a further improvement in CIDEr from 129.2 to 131.2. The roles of the memory vectors and of the softmax gating schema for the $M^2$ Transformer are also reported in the table: eliminating the memory vectors reduces the performance by nearly 1 CIDEr point for both the reduced and the full $M^2$ Transformer [@cornia2020m2].
```{r compare2, fig.align = 'center', out.width = '100%',echo=FALSE, fig.cap="(ref:compare2)" }
knitr::include_graphics("figures/02-01/02-02 compare2.png")
```
(ref:compare2) Comparison with the state-of-the-art on the “Karpathy” test split, in single-model setting [@cornia2020m2].
Fig. \@ref(fig:compare2) compares the performance of the $M^2$ Transformer with several recently proposed models for image captioning.
SCST [@8099614] and Up-Down [@8578734] use attention over the grid of features and attention over regions, respectively. RFNet [@renet] uses a recurrent fusion network to merge different CNN features; GCN-LSTM [@GCN-LSTM] uses a Graph CNN to exploit pairwise relationships between image regions; SGAE [@Yang_2019_CVPR] instead uses auto-encoding scene graphs. The original AoA-Net [@huang1] approach uses attention on attention for encoding image regions and an LSTM language model. Finally, ORT [@HerdadeKBS19] uses a plain transformer and weights attention scores in the region encoder with pairwise distances between detections [@cornia2020m2].
As shown in Fig. \@ref(fig:compare2), the $M^2$ Transformer exceeds all other models on BLEU-4, METEOR, and CIDEr. Its performance is very close and competitive to that of SGAE on BLEU-1 and to that of ORT with respect to SPICE.
```{r example2, fig.align = 'center', out.width = '100%',echo=FALSE, fig.cap="(ref:example2)"}
knitr::include_graphics("figures/02-01/02-02 example2.png")
```
(ref:example2) Examples of captions generated by $M^2$ Transformer and the original Transformer model, as well as the corresponding ground-truths [@cornia2020m2].
Fig. \@ref(fig:example2) shows some examples of captions generated by the $M^2$ Transformer and the original transformer model, together with the corresponding ground truths. In the selected examples, the $M^2$ Transformer generates more accurate descriptions of the images and is able to detect more detailed relationships between image regions [@cornia2020m2].
The $M^2$ Transformer is a new transformer-based architecture for image captioning. It improves the image encoding by learning a multi-level representation of the relationships between image regions while exploiting a-priori knowledge in each encoder layer, and it uses a mesh-like connectivity at the decoding stage to exploit both low- and high-level features during language generation. The evaluation on MS COCO shows that the $M^2$ Transformer surpasses most recent approaches and achieves a new state of the art on MS COCO [@cornia2020m2].