Set up your environment:

```bash
conda create -n abe python=3.11
conda activate abe
pip install -r requirements.txt
```
Input streams are tab-separated JSON dictionaries, one per model, containing that model's inputs.
With a few simple bash commands, we can create our input stream:

```bash
echo "This is a test." | python ensembling/build/bilingual-no-tags > input.1
echo "This is a test." | python ensembling/build/src-tgt eng_Latn deu_Latn > input.2
```
Then we can paste these files together and pipe them into our ensembling code:

```bash
paste input.1 input.2 | python ensembling/ensemble.py -m rewicks/baseline_en-de_64k_ep25 facebook/nllb-200-distilled-600M -l 256 beam
```

You can run with the `-d` flag to see the beams at each time step:

```bash
paste input.1 input.2 | python ensembling/ensemble.py -m rewicks/baseline_en-de_64k_ep25 facebook/nllb-200-distilled-600M -l 256 -d beam
```
- A small note on running: you need a flag between `-m` (a `nargs='*'` argument) and `beam`, or you will get an `argparse` error:

  ```
  ensemble.py: error: the following arguments are required: command
  ```
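This error comes from standard `argparse` behavior: an option with `nargs='*'` greedily consumes every following token, including the trailing positional, unless another flag (such as `-l 256`) ends the list first. Below is a minimal, standalone illustration of that behavior; it is not the project's actual argument parser, and the option names only mirror the ones described above.

```python
import argparse

# Minimal reproduction of the argparse behavior described in the note above.
parser = argparse.ArgumentParser(prog="ensemble.py")
parser.add_argument("-m", "--models", nargs="*")  # greedy: consumes following tokens
parser.add_argument("-l", "--max-length", type=int, default=256)
parser.add_argument("command")                    # positional, e.g. "beam"

# Fails: "beam" is swallowed by -m, so the positional is never filled.
# parser.parse_args(["-m", "model_a", "model_b", "beam"])
#   -> ensemble.py: error: the following arguments are required: command

# Works: the -l flag terminates the -m list before the positional.
args = parser.parse_args(["-m", "model_a", "model_b", "-l", "256", "beam"])
print(args.models, args.command)  # ['model_a', 'model_b'] beam
```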
All code for our method of ensembling can be found in the `ensembling` directory. The important files are:

- `ensemble.py`, which contains the main function;
- `models.py`, which has the model wrappers that let each model maintain its own hidden state (a conceptual sketch follows after this list);
- `search.py`, which has our cube-pruning-esque search algorithm;
- `utils.py`, which has helper functions for tokenization.
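The wrapper idea in `models.py` can be pictured roughly as follows. This is only a conceptual sketch under assumed names, not the project's actual class, but it shows why each model carries its own tokenizer and decoder cache during the joint search.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class ModelState:
    """Conceptual sketch (hypothetical names): each ensembled model keeps its own
    tokenizer, its own emitted-token prefix, and its own decoder cache, so hidden
    states are never shared or mixed across models during the joint search."""
    model: Any                                       # e.g. a Hugging Face seq2seq or decoder-only model
    tokenizer: Any                                   # that model's own tokenizer
    cache: Any = None                                # that model's own past_key_values
    prefix: List[int] = field(default_factory=list)  # token ids this model has produced so far

    def extend(self, token_id: int, new_cache: Any) -> None:
        # Advance only this model's hypothesis and cache.
        self.prefix.append(token_id)
        self.cache = new_cache
```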
For all our experiments, we use WMT24 data (en-XX, but mostly en-de).
The raw inputs can be found in `refs`. These were made via commands such as:

```bash
sacrebleu -t wmt24 -l en-de --echo src > wmt24.en-de.en
sacrebleu -t wmt24 -l en-de --echo ref > wmt24.en-de.de
```
These inputs are unsegmented (multiple sentences per line), which can cause some machine translation models to add or remove content. To circumvent these issues, we first segment these files into sentences, translate, and then reconcatenate. This requires an intermediate file (the sentences with their associated line numbers), which we create using `ersatz`:

```bash
cat wmt24.en-de.en | awk '{print NR "\t" $0}' | ersatz -m en -C 1 > wmt24.en-de.en.sentences
```
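After translation, reconcatenation is just a matter of grouping the sentence-level outputs by that leading line-number column. A minimal sketch of the idea, assuming the intermediate file keeps the original line number in the first tab-separated column and the sentence-level translations come back one per line in the same order:

```python
import sys
from collections import OrderedDict

# Sketch: regroup sentence-level translations into the original document lines.
# Assumes <index_file> has "LINE_NUMBER<TAB>SENTENCE" rows and <hyp_file> has one
# translated sentence per row, in the same order.
def reconcatenate(index_path: str, translations_path: str) -> None:
    lines = OrderedDict()
    with open(index_path) as idx, open(translations_path) as hyp:
        for index_line, translation in zip(idx, hyp):
            line_number = index_line.split("\t", 1)[0]
            lines.setdefault(line_number, []).append(translation.strip())
    for sentences in lines.values():
        print(" ".join(sentences))

if __name__ == "__main__":
    reconcatenate(sys.argv[1], sys.argv[2])
```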
Our ensembling code requires `jsonl` inputs. We provide several scripts to automatically create these from plain-text inputs. All scripts are in `ensembling/build/`:

- `bilingual-no-tags` creates inputs for a traditional encoder-decoder model which takes the input line as encoder input and has no additional special tags. We use these for our Marian `en-de` models.
- `empty` creates an empty input. This would be used for a traditional decoder-only model that does not take prompts.
- `prompt` creates input for both LLaMA and Tower specifically for translation. This is highly constrained to the set of languages we cover, but we provide both 0-shot and 3-shot options. Calling looks like:

  ```bash
  echo "This is a test." | python ensembling/build/prompt llama3-0-shot English German
  ```

- `src-tgt` creates input for both M2M and NLLB by taking the source language token and the target language token. Calling looks like:

  ```bash
  echo "This is a test." | python ensembling/build/src-tgt eng_Latn deu_Latn
  ```
The processed inputs (`jsonl`) can be found in `input_data`. They are labelled by model and language pair.
Outputs that were generated by our ensembling method can be found at `translations/wmt24/$LANGUAGE_PAIR`. The `sentences` directory contains the sentence-level translations. The `targets` directory contains the concatenated translations that align to the original reference file.
They were created by calling `translation.sh`.
Outputs that were generated natively (as if each model were run alone) can be found at `baselines/simple-translations/outputs`. As above, the `sentences` directory contains the sentence-level translations, while the `targets` directory contains the concatenated translations that align to the original reference file.
These files were created by calling `baselines/simple-translations/scripts/*.py` (the script depends on the specific model).
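For a rough picture of what a "native" single-model baseline amounts to (the scripts under `baselines/simple-translations/scripts/` are model-specific; this is a generic Hugging Face sketch rather than a copy of any of them, and the decoding settings are illustrative):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Generic single-model baseline: translate one sentence with beam search.
model_id = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("This is a test.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
    num_beams=5,
    max_new_tokens=256,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```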
Outputs that were generated using a more traditional linear interpolation of the log probabilities (only for our models, which guarantee the same vocabulary) can be found at `baselines/interpolation/outputs`. Again, the `sentences` directory contains the sentence-level translations, while the `targets` directory contains the concatenated translations that align to the original reference file.
These files were created by calling `baselines/interpolation/interpolate-translate.py`.
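Conceptually, this baseline scores each candidate next token with a weighted average of the per-model log probabilities, which is only well-defined when the models share a vocabulary. A toy sketch of the scoring step (not the actual `interpolate-translate.py`):

```python
import torch

def interpolated_log_probs(logits_a: torch.Tensor,
                           logits_b: torch.Tensor,
                           weight: float = 0.5) -> torch.Tensor:
    """Toy sketch: linearly interpolate the next-token log probabilities of two
    models that share a vocabulary. Both inputs are raw next-token logits of
    shape (vocab_size,); the result can be scored by any standard beam search."""
    log_p_a = torch.log_softmax(logits_a, dim=-1)
    log_p_b = torch.log_softmax(logits_b, dim=-1)
    return weight * log_p_a + (1.0 - weight) * log_p_b
```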
All scores are handled in the `scoring` directory. We score both BLEU and COMET.
- `bleu-scores` is generated by `bleu.py` and contains BLEU scores of our ensembled outputs. The format is `tsv` with columns `MODEL_ONE`, `MODEL_TWO`, and `BLEU_SCORE`, respectively.
- `comet-scores` is generated by `comet.py` and contains COMET scores of our ensembled outputs. The format is `tsv` with columns `MODEL_ONE`, `MODEL_TWO`, and `COMET_SCORE`, respectively.
- `simple-bleu-scores` is generated by `simple-bleu.py` and contains BLEU scores of the individual model outputs. The format is `tsv` with columns `MODEL` and `BLEU_SCORE`.
- `simple-comet-scores` is generated by `simple-comet.py` and contains COMET scores of the individual model outputs. The format is `tsv` with columns `MODEL` and `COMET_SCORE`.
- `interpolate-bleu-scores` is generated by `interpolate-bleu.py` and contains BLEU scores of the models ensembled via linear interpolation of the log probs. The format is `tsv` with columns `MODEL_ONE`, `MODEL_TWO`, and `BLEU_SCORE`, respectively.
- `interpolate-comet-scores` is generated by `interpolate-comet.py` and contains COMET scores of the models ensembled via linear interpolation of the log probs. The format is `tsv` with columns `MODEL_ONE`, `MODEL_TWO`, and `COMET_SCORE`, respectively.
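At their core, the BLEU scripts compare one concatenated hypothesis file from a `targets` directory against the corresponding reference. A stripped-down sketch of that computation using sacrebleu's Python API (the real scripts also handle the model-pair bookkeeping and write out the `tsv` rows):

```python
import sacrebleu

# Sketch: score one concatenated hypothesis file against the WMT24 reference.
def score_file(hypothesis_path: str, reference_path: str) -> float:
    with open(hypothesis_path) as h, open(reference_path) as r:
        hypotheses = [line.strip() for line in h]
        references = [line.strip() for line in r]
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```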
The main file to edit is `utils.py`, which has a variable called `TOKENIZER_CONFIG`.
This is a dictionary of per-model settings that tells our code how tokenization is handled.
As long as a new model is reasonably similar to these standard tokenization schemes, it should slot in seamlessly.
Recall that we detokenize to byte strings for agreement comparison, so [de-]tokenization is extremely important to get correct.
For example, the tokenization scheme for NLLB is:

```python
"facebook/nllb-200-distilled-600M": {
    "lstrip": False,
    "special_character": SPIECE_UNDERLINE,
    "begin_word": True,
    "byte_map": BYTE_MAP,
    "add_space": True
},
```

To add a new model, you add a new key with the Hugging Face model id. The fields are:
- `lstrip`: Does the tokenizer use `lstrip` on word beginnings? I.e., if we decode `▁Hello`, does it decode as `Hello` or `[SPACE]Hello`?
- `special_character`: What is the whitespace special character? Common examples are `▁` or `Ġ`.
- `begin_word`: Does this special character begin the word?
- `byte_map`: The mapping from how the vocabulary stores bytes to the underlying byte. For example, many SPM models store a byte as the string `<0xBYTE>`.
- `add_space`: Do we need to add a space to the beginning of the string? This is typically because the model removes spaces at the beginning of sentences.
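Putting the fields together, a hypothetical new entry might look like the following. Every value here is a placeholder to show the shape of an entry; check each one against the new model's tokenizer (`SPIECE_UNDERLINE` and `BYTE_MAP` are the constants already defined in `utils.py`).

```python
# Hypothetical entry for TOKENIZER_CONFIG in utils.py; placeholder values only.
"your-org/your-new-model": {
    "lstrip": False,                        # does the tokenizer lstrip word beginnings?
    "special_character": SPIECE_UNDERLINE,  # whitespace marker, e.g. ▁ or Ġ
    "begin_word": True,                     # the marker begins (rather than ends) a word
    "byte_map": BYTE_MAP,                   # how byte tokens like <0xBYTE> map to raw bytes
    "add_space": True,                      # re-add the space the model strips at sentence start
},
```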
If you run into problems, please contact the authors (e.g., rewicks@jhu.edu) or file an issue for assistance.