diff --git a/docs/image_captioning.md b/docs/image_captioning.md
index 0a195f6c6..b08533c26 100644
--- a/docs/image_captioning.md
+++ b/docs/image_captioning.md
@@ -3,9 +3,19 @@ layout: default
 ---
 
 # Image Captioning
 
-This module extends Sockeye to perform image captioning. It follows the same logic of sequence-to-sequence frameworks, which consist of encoder-decoder models.
+Sockeye also provides a module for image captioning.
+It follows the same logic as sequence-to-sequence frameworks, which consist of encoder-decoder models.
 In this case the encoder takes an image instead of a sentence and encodes it in a feature representation.
 This is decoded with attention (optionally) using exactly the same models of Sockeye (RNNs, transformers, or CNNs).
+This tutorial explains how to train image captioning models.
+
+
+## Citation
+
+For technical information about the image captioning module, see our paper on arXiv ([BibTeX](sockeye_captioning.bib)):
+
+> Loris Bazzani, Tobias Domhan, and Felix Hieber. 2018.
+> [Image Captioning as Neural Machine Translation Task in SOCKEYE](https://arxiv.org/abs/1810.04101). ArXiv e-prints.
 
 ## Installation
@@ -22,9 +32,7 @@ Optionally you can also install matplotlib for visualization:
 ```
 
-## First Steps
-
-### Train
+## Train
 
 In order to train your first image captioning model you will need two sets of parallel files: one for training
 and one for validation.
 The latter will be used for computing various metrics during training.
@@ -91,7 +99,7 @@ There is an initial overhead to load the feature (training does not start immedi
 You can add the options `--decode-and-evaluate 200 --max-output-length 60` to perform captioning of the part
 of the validation set (200 samples in this case) during training.
 
-### Image to Text
+## Image to Text
 
 Assuming that features were pre-extracted, you can do image captioning as follows:
 
@@ -126,7 +134,7 @@ You can also caption directly from image with the option `--extract-image-featur
 ```
 
-#### Using Lexical Constrains
+### Using Lexical Constraints
 
 It is also possible to use lexical constraints during inference as described [here](inference.html#lexical-constraints).
 The input JSON object needs to have the following form, with the image path in the `text` field, and constraints specified as usual:
 
@@ -139,7 +147,7 @@
 You can use the `sockeye.lexical_constraints` module to generate this (for usage, run `python3 -m sockeye.lexical_constraints`).
 Once the file is generated, the CLI option `--json-input` needs to be passed to `sockeye.image_captioning.captioner`.
 
-### Visualization
+## Visualization
 
 You can now visualize the results in a nice format as follows:
 
diff --git a/docs/index.md b/docs/index.md
index b885de3e6..43ed555cf 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -18,6 +18,7 @@ It implements state-of-the-art encoder-decoder architectures, such as
 - Fully convolutional sequence-to-sequence models [[Gehring et al, '17](https://arxiv.org/abs/1705.03122)]
 
 In addition, this framework provides an experimental [image-to-description module](https://github.com/awslabs/sockeye/tree/master/sockeye/image_captioning) that can be used for [image captioning](image_captioning.html).
+
 Recent developments and changes are tracked in our [CHANGELOG](https://github.com/awslabs/sockeye/blob/master/CHANGELOG.md).
 
 If you are interested in collaborating or have any questions, please submit a pull request or [issue](https://github.com/awslabs/sockeye/issues/new).
diff --git a/docs/sockeye_captioning.bib b/docs/sockeye_captioning.bib
new file mode 100644
index 000000000..4c26cffb1
--- /dev/null
+++ b/docs/sockeye_captioning.bib
@@ -0,0 +1,12 @@
+@article{SockeyeCaptioning:18,
+  author = {Bazzani, Loris and Domhan, Tobias and Hieber, Felix},
+  title = "{Image Captioning as Neural Machine Translation Task in SOCKEYE}",
+  journal = {arXiv preprint arXiv:1810.04101},
+  archivePrefix = "arXiv",
+  eprint = {1810.04101},
+  primaryClass = "cs.CV",
+  keywords = {Computer Science - Computer Vision and Pattern Recognition},
+  year = 2018,
+  month = oct,
+  url = {https://arxiv.org/abs/1810.04101}
+}
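
To make the lexical constraints hunk above concrete: a minimal sketch of one line of `--json-input` for `sockeye.image_captioning.captioner`, with the image path in the `text` field and constraints specified as usual per [inference.html](inference.html#lexical-constraints). The path and constraint phrase here are made-up placeholders, not from the docs:

```json
{"text": "/path/to/image.jpg", "constraints": ["a brown dog"]}
```

Each object occupies one input line, and constraint tokens should be preprocessed (e.g. with BPE) the same way as the target side of the training data; the `sockeye.lexical_constraints` module referenced in the docs can generate such objects.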