diff --git a/docs/image_captioning.md b/docs/image_captioning.md
index 0a195f6c6..b08533c26 100644
--- a/docs/image_captioning.md
+++ b/docs/image_captioning.md
@@ -3,9 +3,19 @@ layout: default
 ---
 
 # Image Captioning
 
-This module extends Sockeye to perform image captioning. It follows the same logic of sequence-to-sequence frameworks, which consist of encoder-decoder models.
+Sockeye also provides a module for image captioning.
+It follows the same logic as sequence-to-sequence frameworks, which consist of encoder-decoder models.
 In this case the encoder takes an image instead of a sentence and encodes it in a feature representation.
 This is decoded with attention (optionally) using exactly the same models of Sockeye (RNNs, transformers, or CNNs).
+This tutorial explains how to train image captioning models.
+
+
+## Citation
+
+For technical information about the image captioning module, see our paper on arXiv ([BibTeX](sockeye_captioning.bib)):
+
+> Loris Bazzani, Tobias Domhan, and Felix Hieber. 2018.
+> [Image Captioning as Neural Machine Translation Task in SOCKEYE](https://arxiv.org/abs/1810.04101). ArXiv e-prints.
 
 ## Installation
@@ -22,9 +32,7 @@ Optionally you can also install matplotlib for visualization:
 ```
 
-## First Steps
-
-### Train
+## Train
 
 In order to train your first image captioning model you will need two sets of parallel files: one for training
 and one for validation.
 The latter will be used for computing various metrics during training.
@@ -91,7 +99,7 @@ There is an initial overhead to load the feature (training does not start immedi
 You can add the options `--decode-and-evaluate 200 --max-output-length 60` to perform captioning of the part
 of the validation set (200 samples in this case) during training.
 
-### Image to Text
+## Image to Text
 
 Assuming that features were pre-extracted, you can do image captioning as follows:
 
@@ -126,7 +134,7 @@ You can also caption directly from image with the option `--extract-image-featur
 ```
 
-#### Using Lexical Constrains
+### Using Lexical Constraints
 
 It is also possible to use lexical constraints during inference as described [here](inference.html#lexical-constraints).
 The input JSON object needs to have the following form, with the image path in the `text` field, and constraints specified as usual:
 
@@ -139,7 +147,7 @@
 You can use the `sockeye.lexical_constraints` module to generate this (for usage, run `python3 -m sockeye.lexical_constraints`).
 Once the file is generated, the CLI option `--json-input` needs to be passed to `sockeye.image_captioning.captioner`.
 
-### Visualization
+## Visualization
 
 You can now visualize the results in a nice format as follows:
 
diff --git a/docs/index.md b/docs/index.md
index b885de3e6..43ed555cf 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -18,6 +18,7 @@ It implements state-of-the-art encoder-decoder architectures, such as
 - Fully convolutional sequence-to-sequence models [[Gehring et al, '17](https://arxiv.org/abs/1705.03122)]
 
 In addition, this framework provides an experimental [image-to-description module](https://github.com/awslabs/sockeye/tree/master/sockeye/image_captioning) that can be used for [image captioning](image_captioning.html).
+
 Recent developments and changes are tracked in our [CHANGELOG](https://github.com/awslabs/sockeye/blob/master/CHANGELOG.md).
 
 If you are interested in collaborating or have any questions, please submit a pull request or [issue](https://github.com/awslabs/sockeye/issues/new).
diff --git a/docs/sockeye_captioning.bib b/docs/sockeye_captioning.bib
new file mode 100644
index 000000000..4c26cffb1
--- /dev/null
+++ b/docs/sockeye_captioning.bib
@@ -0,0 +1,12 @@
+@article{SockeyeCaptioning:18,
+  author = {Bazzani, Loris and Domhan, Tobias and Hieber, Felix},
+  title = "{Image Captioning as Neural Machine Translation Task in SOCKEYE}",
+  journal = {arXiv preprint arXiv:1810.04101},
+  archivePrefix = "arXiv",
+  eprint = {1810.04101},
+  primaryClass = "cs.CV",
+  keywords = {Computer Science - Computer Vision and Pattern Recognition},
+  year = 2018,
+  month = oct,
+  url = {https://arxiv.org/abs/1810.04101}
+}
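
To make the lexical constraints hunk above concrete: a minimal sketch of one line of `--json-input` for `sockeye.image_captioning.captioner`, with the image path in the `text` field and constraints specified as usual per [inference.html](inference.html#lexical-constraints). The path and constraint phrase here are made-up placeholders, not from the docs:

```json
{"text": "/path/to/image.jpg", "constraints": ["a brown dog"]}
```

Each object occupies one input line, and constraint tokens should be preprocessed (e.g. with BPE) the same way as the target side of the training data; the `sockeye.lexical_constraints` module referenced in the docs can generate such objects.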