Set up your environment:

```bash
conda create -n abe python=3.11
conda activate abe
pip install -r requirements.txt
```
Input streams are tab-separated JSON dictionaries, one per model, containing that model's inputs.
With a few simple bash commands, we can create our input stream:

```bash
echo "This is a test." | python ensembling/build/bilingual-no-tags > input.1
echo "This is a test." | python ensembling/build/src-tgt eng_Latn deu_Latn > input.2
```
Then we can paste these files together and pipe them into our ensembling code:

```bash
paste input.1 input.2 | python ensembling/ensemble.py -m rewicks/baseline_en-de_64k_ep25 facebook/nllb-200-distilled-600M -l 256 beam
```

You can run with the `-d` flag to see the beams at each time step:

```bash
paste input.1 input.2 | python ensembling/ensemble.py -m rewicks/baseline_en-de_64k_ep25 facebook/nllb-200-distilled-600M -l 256 -d beam
```
- A small note on running: you need a flag between `-m` (a `nargs='*'` argument) and `beam`, or you will get an `argparse` error:

  ```
  ensemble.py: error: the following arguments are required: command
  ```
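This error comes from standard `argparse` behavior: an option with `nargs='*'` greedily consumes every following token, including the trailing positional, unless another flag (such as `-l 256`) ends the list first. Below is a minimal, standalone illustration of that behavior; it is not the project's actual argument parser, and the option names only mirror the ones described above.

```python
import argparse

# Minimal reproduction of the argparse behavior described in the note above.
parser = argparse.ArgumentParser(prog="ensemble.py")
parser.add_argument("-m", "--models", nargs="*")  # greedy: consumes following tokens
parser.add_argument("-l", "--max-length", type=int, default=256)
parser.add_argument("command")                    # positional, e.g. "beam"

# Fails: "beam" is swallowed by -m, so the positional is never filled.
# parser.parse_args(["-m", "model_a", "model_b", "beam"])
#   -> ensemble.py: error: the following arguments are required: command

# Works: the -l flag terminates the -m list before the positional.
args = parser.parse_args(["-m", "model_a", "model_b", "-l", "256", "beam"])
print(args.models, args.command)  # ['model_a', 'model_b'] beam
```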
All code for our method of ensembling can be found in the `ensembling` directory. The important files are:

- `ensemble.py`, which contains the main function;
- `models.py`, which has the model wrappers that let each model maintain its own hidden state (a conceptual sketch follows after this list);
- `search.py`, which has our cube-pruning-esque search algorithm;
- `utils.py`, which has helper functions for tokenization.
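The wrapper idea in `models.py` can be pictured roughly as follows. This is only a conceptual sketch under assumed names, not the project's actual class, but it shows why each model carries its own tokenizer and decoder cache during the joint search.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class ModelState:
    """Conceptual sketch (hypothetical names): each ensembled model keeps its own
    tokenizer, its own emitted-token prefix, and its own decoder cache, so hidden
    states are never shared or mixed across models during the joint search."""
    model: Any                                       # e.g. a Hugging Face seq2seq or decoder-only model
    tokenizer: Any                                   # that model's own tokenizer
    cache: Any = None                                # that model's own past_key_values
    prefix: List[int] = field(default_factory=list)  # token ids this model has produced so far

    def extend(self, token_id: int, new_cache: Any) -> None:
        # Advance only this model's hypothesis and cache.
        self.prefix.append(token_id)
        self.cache = new_cache
```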
For all our experiments, we use WMT24 data (en-XX, but mostly en-de).
The raw inputs can be found in `refs`. These were made via commands such as:

```bash
sacrebleu -t wmt24 -l en-de --echo src > wmt24.en-de.en
sacrebleu -t wmt24 -l en-de --echo ref > wmt24.en-de.de
```
These inputs are unsegmented (multiple sentences per line), which can cause some machine translation models to add or remove content. To circumvent these issues, we first segment these files into sentences, translate, and then reconcatenate. This requires an intermediate file (the sentences with their associated line numbers), which we create using `ersatz`:

```bash
cat wmt24.en-de.en | awk '{print NR "\t" $0}' | ersatz -m en -C 1 > wmt24.en-de.en.sentences
```
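After translation, reconcatenation is just a matter of grouping the sentence-level outputs by that leading line-number column. A minimal sketch of the idea, assuming the intermediate file keeps the original line number in the first tab-separated column and the sentence-level translations come back one per line in the same order:

```python
import sys
from collections import OrderedDict

# Sketch: regroup sentence-level translations into the original document lines.
# Assumes <index_file> has "LINE_NUMBER<TAB>SENTENCE" rows and <hyp_file> has one
# translated sentence per row, in the same order.
def reconcatenate(index_path: str, translations_path: str) -> None:
    lines = OrderedDict()
    with open(index_path) as idx, open(translations_path) as hyp:
        for index_line, translation in zip(idx, hyp):
            line_number = index_line.split("\t", 1)[0]
            lines.setdefault(line_number, []).append(translation.strip())
    for sentences in lines.values():
        print(" ".join(sentences))

if __name__ == "__main__":
    reconcatenate(sys.argv[1], sys.argv[2])
```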
Our ensembling code requires `jsonl` inputs. We provide several scripts to automatically create these from plain-text inputs. All scripts are in `ensembling/build/`:

- `bilingual-no-tags` creates inputs for a traditional encoder-decoder model which takes the input line as encoder input and has no additional special tags. We use these for our Marian `en-de` models.
- `empty` creates an empty input. This would be used for a traditional decoder-only model that does not take prompts.
- `prompt` creates input for both LLaMA and Tower specifically for translation. This is highly constrained to the set of languages we cover, but we provide both 0-shot and 3-shot options. Calling looks like:

  ```bash
  echo "This is a test." | python ensembling/build/prompt llama3-0-shot English German
  ```

- `src-tgt` creates input for both M2M and NLLB by taking the source language token and the target language token. Calling looks like:

  ```bash
  echo "This is a test." | python ensembling/build/src-tgt eng_Latn deu_Latn
  ```
The processed inputs (`jsonl`) can be found in `input_data`. They are labelled by model and language pair.
Outputs that were generated by our ensembling method can be found at `translations/wmt24/$LANGUAGE_PAIR`. The `sentences` directory contains the sentence-level translations. The `targets` directory contains the concatenated translations that align to the original reference file.
They were created by calling `translation.sh`.
Outputs that were generated natively (as if each model were run alone) can be found at `baselines/simple-translations/outputs`. As above, the `sentences` directory contains the sentence-level translations, while the `targets` directory contains the concatenated translations that align to the original reference file.
These files were created by calling `baselines/simple-translations/scripts/*.py` (the script depends on the specific model).
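For a rough picture of what a "native" single-model baseline amounts to (the scripts under `baselines/simple-translations/scripts/` are model-specific; this is a generic Hugging Face sketch rather than a copy of any of them, and the decoding settings are illustrative):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Generic single-model baseline: translate one sentence with beam search.
model_id = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("This is a test.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("deu_Latn"),
    num_beams=5,
    max_new_tokens=256,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```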
Outputs that were generated using a more traditional linear interpolation of the log probabilities (only for our models, which guarantee the same vocabulary) can be found at `baselines/interpolation/outputs`. Again, the `sentences` directory contains the sentence-level translations, while the `targets` directory contains the concatenated translations that align to the original reference file.
These files were created by calling `baselines/interpolation/interpolate-translate.py`.
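Conceptually, this baseline scores each candidate next token with a weighted average of the per-model log probabilities, which is only well-defined when the models share a vocabulary. A toy sketch of the scoring step (not the actual `interpolate-translate.py`):

```python
import torch

def interpolated_log_probs(logits_a: torch.Tensor,
                           logits_b: torch.Tensor,
                           weight: float = 0.5) -> torch.Tensor:
    """Toy sketch: linearly interpolate the next-token log probabilities of two
    models that share a vocabulary. Both inputs are raw next-token logits of
    shape (vocab_size,); the result can be scored by any standard beam search."""
    log_p_a = torch.log_softmax(logits_a, dim=-1)
    log_p_b = torch.log_softmax(logits_b, dim=-1)
    return weight * log_p_a + (1.0 - weight) * log_p_b
```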
All scores are handled in the `scoring` directory. We score both BLEU and COMET.
- `bleu-scores` is generated by `bleu.py` and contains BLEU scores of our ensembled outputs. The format is `tsv` with columns `MODEL_ONE`, `MODEL_TWO`, and `BLEU_SCORE`, respectively.
- `comet-scores` is generated by `comet.py` and contains COMET scores of our ensembled outputs. The format is `tsv` with columns `MODEL_ONE`, `MODEL_TWO`, and `COMET_SCORE`, respectively.
- `simple-bleu-scores` is generated by `simple-bleu.py` and contains BLEU scores of the individual model outputs. The format is `tsv` with columns `MODEL` and `BLEU_SCORE`.
- `simple-comet-scores` is generated by `simple-comet.py` and contains COMET scores of the individual model outputs. The format is `tsv` with columns `MODEL` and `COMET_SCORE`.
- `interpolate-bleu-scores` is generated by `interpolate-bleu.py` and contains BLEU scores of the models ensembled via linear interpolation of the log probs. The format is `tsv` with columns `MODEL_ONE`, `MODEL_TWO`, and `BLEU_SCORE`, respectively.
- `interpolate-comet-scores` is generated by `interpolate-comet.py` and contains COMET scores of the models ensembled via linear interpolation of the log probs. The format is `tsv` with columns `MODEL_ONE`, `MODEL_TWO`, and `COMET_SCORE`, respectively.
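At their core, the BLEU scripts compare one concatenated hypothesis file from a `targets` directory against the corresponding reference. A stripped-down sketch of that computation using sacrebleu's Python API (the real scripts also handle the model-pair bookkeeping and write out the `tsv` rows):

```python
import sacrebleu

# Sketch: score one concatenated hypothesis file against the WMT24 reference.
def score_file(hypothesis_path: str, reference_path: str) -> float:
    with open(hypothesis_path) as h, open(reference_path) as r:
        hypotheses = [line.strip() for line in h]
        references = [line.strip() for line in r]
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```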
The main file to edit is `utils.py`, which has a variable called `TOKENIZER_CONFIG`.
This is a dictionary of per-model settings that tells our code how tokenization is handled.
As long as a new model is reasonably similar to these standard tokenization schemes, it should slot in seamlessly.
Recall that we detokenize to byte strings for agreement comparison, so [de-]tokenization is extremely important to get correct.
For example, the tokenization scheme for NLLB is:

```python
"facebook/nllb-200-distilled-600M": {
    "lstrip": False,
    "special_character": SPIECE_UNDERLINE,
    "begin_word": True,
    "byte_map": BYTE_MAP,
    "add_space": True
},
```

To add a new model, you add a new key with the Hugging Face model id. The fields are:
- `lstrip`: Does the tokenizer use `lstrip` on word beginnings? I.e., if we decode `▁Hello`, does it decode as `Hello` or `[SPACE]Hello`?
- `special_character`: What is the whitespace special character? Common examples are `▁` or `Ġ`.
- `begin_word`: Does this special character begin the word?
- `byte_map`: The mapping from how the vocabulary stores bytes to the underlying byte. For example, many SPM models store a byte as the string `<0xBYTE>`.
- `add_space`: Do we need to add a space to the beginning of the string? This is typically because the model removes spaces at the beginning of sentences.
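Putting the fields together, a hypothetical new entry might look like the following. Every value here is a placeholder to show the shape of an entry; check each one against the new model's tokenizer (`SPIECE_UNDERLINE` and `BYTE_MAP` are the constants already defined in `utils.py`).

```python
# Hypothetical entry for TOKENIZER_CONFIG in utils.py; placeholder values only.
"your-org/your-new-model": {
    "lstrip": False,                        # does the tokenizer lstrip word beginnings?
    "special_character": SPIECE_UNDERLINE,  # whitespace marker, e.g. ▁ or Ġ
    "begin_word": True,                     # the marker begins (rather than ends) a word
    "byte_map": BYTE_MAP,                   # how byte tokens like <0xBYTE> map to raw bytes
    "add_space": True,                      # re-add the space the model strips at sentence start
},
```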
If you run into problems, please contact the authors (e.g., rewicks@jhu.edu) or file an issue for assistance.