mjpost/abe

Code and data for "agreement-based ensembling": token-level ensembling of models with different vocabularies

Quick Start

Environment

Set up your environment:

conda create -n abe python=3.11
conda activate abe
pip install -r requirements.txt

Inputs

Input streams are tab-separated dictionaries, one dictionary per model, containing each model's input.

Using a few simple shell commands, we can create one input stream per model:

echo "This is a test." | python ensembling/build/bilingual-no-tags > input.1
echo "This is a test." | python ensembling/build/src-tgt eng_Latn deu_Latn > input.2

Run

Then we can paste these files together and pipe into our ensembling code:

paste input.1 input.2 | python ensembling/ensemble.py -m rewicks/baseline_en-de_64k_ep25 facebook/nllb-200-distilled-600M -l 256 beam

You can run with the flag -d to see the beams at each time step.

paste input.1 input.2 | python ensembling/ensemble.py -m rewicks/baseline_en-de_64k_ep25 facebook/nllb-200-distilled-600M -l 256 -d beam
  • A small note on running: another flag (here, -l) must come between -m (a nargs='*' argument) and the positional beam command, or argparse will consume beam as a model name and report:
ensemble.py: error: the following arguments are required: command

Repository Structure

Ensembling Code

All code for our method of ensembling can be found in the ensembling directory. The important files include:

  • ensemble.py, which contains the main function;
  • models.py, which has the model wrappers that let each model maintain its own hidden state;
  • search.py, which has our cube-pruning-esque search algorithm;
  • utils.py, which has helper functions for tokenization.
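
As a rough intuition for the agreement step (a toy illustration only, not the code in search.py): each partial hypothesis is detokenized to a byte string, and models with different vocabularies are only advanced together while one surface string is a prefix of the other.

# Toy illustration of byte-level agreement between two models' partial
# hypotheses. This is NOT the repository's implementation; it only sketches
# the idea of comparing detokenized byte strings.
def compatible(surface_a: bytes, surface_b: bytes) -> bool:
    # Two partial detokenizations agree if one is a prefix of the other.
    shorter, longer = sorted((surface_a, surface_b), key=len)
    return longer.startswith(shorter)

# "Das ist" vs. "Das is": the second model can still catch up, so they agree.
assert compatible("Das ist".encode("utf-8"), "Das is".encode("utf-8"))
# "Das ist" vs. "Der": the surface strings diverge, so the pair is pruned.
assert not compatible("Das ist".encode("utf-8"), "Der".encode("utf-8"))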

Data

Inputs

For all our experiments, we use WMT24 data (en-XX, but mostly en-de). The raw inputs can be found in refs. These were made via commands such as:

sacrebleu -t wmt24 -l en-de --echo src > wmt24.en-de.en
sacrebleu -t wmt24 -l en-de --echo ref > wmt24.en-de.de

Creation

These inputs are unsegmented (multiple sentences per line), which can make some machine translation models add or remove content. To circumvent this, we first segment the files into sentences, translate them, and then reconcatenate the translations. This requires an intermediate file that pairs each sentence with its original line number. We create it using ersatz:

cat wmt24.en-de.en | awk '{print NR "\t" $0}' | ersatz -m en -C 1 > wmt24.en-de.en.sentences
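
The reconcatenation step then groups the sentence-level translations back by their original line number. Below is a minimal sketch of that step, assuming the translated sentences are line-aligned with wmt24.en-de.en.sentences (whose first column carries the original line number); the repository's own scripts may do this differently.

import sys
from collections import defaultdict

# Sketch of reconcatenation: join sentence-level translations back into one
# output line per original input line. Assumes two line-aligned files:
#   argv[1]: the .sentences file ("<line_number>\t<sentence>")
#   argv[2]: the corresponding sentence-level translations, one per line.
lines = defaultdict(list)
with open(sys.argv[1]) as sents, open(sys.argv[2]) as trans:
    for sent_line, translation in zip(sents, trans):
        line_number = int(sent_line.split("\t", 1)[0])
        lines[line_number].append(translation.strip())

for line_number in sorted(lines):
    print(" ".join(lines[line_number]))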

Our ensembling code requires jsonl inputs. We provide several scripts to automatically create these from plain-text inputs. All scripts are in ensembling/build/:

  1. bilingual-no-tags creates inputs for a traditional encoder-decoder model which takes the input line as encoder input and has no additional special tags. We use these for our Marian en-de models.
  2. empty creates an empty input. This would be used for a traditional decoder-only model that does not take prompts.
  3. prompt creates input for both LLAMA and Tower specifically for translation. This is highly constrained to the set of languages we cover but we provide both 0-shot and 3-shot options. Calling looks like echo "This is a test." | python ensembling/build/prompt llama3-0-shot English German
  4. src-tgt creates input for both M2M and NLLB by taking the source language token and the target language token. Calling looks like echo "This is a test." | python ensembling/build/src-tgt eng_Latn deu_Latn

The processed inputs (jsonl) can be found in input_data. They are labelled by model and language pair.

Outputs

Ensembling

Outputs that were generated by our ensembling method can be found at translations/wmt24/$LANGUAGE_PAIR. The sentences directory contains the sentence-level translations. The targets directory contains the concatenated translations that align to the original reference file.

They were created by calling translation.sh.

Baselines

Simple Translations

Outputs that were generated natively (as if the model were run alone) can be found at baselines/simple-translations/outputs. Similar to the above, the sentences directory contains the sentence-level translations, while the targets directory contains the concatenated translations that align to the original reference file.

These files were created by calling baselines/simple-translations/scripts/*.py (the script depends on the specific model).

Linear Interpolation

Outputs that were generated using a more traditional linear interpolation of the log probabilities (only for our models which guarantee the same vocabulary) can be found at baselines/interpolation/outputs. Again, the sentences directory contains the sentence-level translations, while the targets directory contains the concatenated translations that align to the original reference file.

These files were created by calling baselines/interpolation/interpolate-translate.py.
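
The idea behind this baseline is the standard one: at each decoding step, the models' next-token log probabilities (over their shared vocabulary) are mixed with fixed weights. A minimal sketch of that mixing with equal weights is below; the exact weighting and normalization in interpolate-translate.py may differ.

import numpy as np

def interpolate_log_probs(log_probs_a: np.ndarray,
                          log_probs_b: np.ndarray,
                          weight: float = 0.5) -> np.ndarray:
    # Weighted sum of two next-token log-probability vectors over the same
    # vocabulary, renormalized so the result is again a log distribution.
    mixed = weight * log_probs_a + (1.0 - weight) * log_probs_b
    return mixed - np.logaddexp.reduce(mixed)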

Scoring

All scores are handled in the scoring directory. We score both BLEU and COMET.

  • bleu-scores is generated by bleu.py and creates a file of BLEU scores of our ensembled outputs. The format is tsv where the columns are MODEL_ONE, MODEL_TWO, and BLEU_SCORE respectively.
  • comet-scores is generated by comet.py and creates a file of COMET scores of our ensembled outputs. The format is tsv where the columns are MODEL_ONE, MODEL_TWO, and COMET_SCORE respectively.
  • simple-bleu-scores is generated by simple-bleu.py and creates a file of BLEU scores of the individual model outputs. The format is tsv where the columns are MODEL and BLEU_SCORE.
  • simple-comet-scores is generated by simple-comet.py and creates a file of COMET scores of the individual model outputs. The format is tsv where the columns are MODEL and COMET_SCORE.
  • interpolate-bleu-scores is generated by interpolate-bleu.py and creates a file of BLEU scores of the models ensembled via linear interpolation of the log probs. The format is tsv where the columns are MODEL_ONE, MODEL_TWO, and BLEU_SCORE respectively.
  • interpolate-comet-scores is generated by interpolate-comet.py and creates a file of COMET scores of the models ensembled via linear interpolation of the log probs. The format is tsv where the columns are MODEL_ONE, MODEL_TWO, and COMET_SCORE respectively.
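
Because all of these files share a simple TSV layout, they are easy to inspect programmatically. As a small illustrative example (assuming the files live in the scoring directory and have no header row), the best-scoring ensemble pair in bleu-scores could be found like this:

import csv

# Read the ensemble BLEU scores (columns: MODEL_ONE, MODEL_TWO, BLEU_SCORE)
# and print the highest-scoring model pair. Adjust the path if needed.
with open("scoring/bleu-scores") as f:
    rows = [(m1, m2, float(bleu))
            for m1, m2, bleu in csv.reader(f, delimiter="\t")]

best = max(rows, key=lambda row: row[2])
print(f"best pair: {best[0]} + {best[1]} (BLEU {best[2]:.1f})")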

Adding a New Model

The main file to edit is utils.py, which has a variable called TOKENIZER_CONFIG: a dictionary, keyed by Hugging Face model id, of settings that tell our code how each model's tokenization is handled. So long as the new model is reasonably similar to these standard tokenization modes, it should slot in seamlessly. Recall that we detokenize to byte strings for agreement comparison, so getting [de-]tokenization right is extremely important.

For example:

    "facebook/nllb-200-distilled-600M": {
        "lstrip": False,
        "special_character": SPIECE_UNDERLINE,
        "begin_word": True,
        "byte_map": BYTE_MAP,
        "add_space": True
    },

is the tokenization scheme for NLLB. To add a new model, add a new key with its Hugging Face model id.

  • lstrip: Does the tokenizer use lstrip on word beginnings? That is, does ▁Hello decode as Hello or [SPACE]Hello?
  • special_character: The whitespace special character. Common examples are ▁ or Ġ.
  • begin_word: Does this special character begin the word?
  • byte_map: The mapping from how the vocabulary stores bytes to the underlying bytes. For example, many SPM models store a byte as the string <0xBYTE>.
  • add_space: Do we need to add a space to the beginning of the string? This is typically needed because the model removes spaces at the beginning of sentences.
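
For instance, registering a new SentencePiece-style model might look like the entry below. The model id and every value here are placeholders, not a tested configuration; inspect your tokenizer's behavior to pick the right settings.

    # Hypothetical entry: the model id and all values are placeholders, not a
    # real configuration. Check how your tokenizer marks whitespace and bytes.
    "your-org/your-new-model": {
        "lstrip": False,                        # tokens are not lstrip'd at word beginnings
        "special_character": SPIECE_UNDERLINE,  # whitespace marker used by the vocabulary
        "begin_word": True,                     # the marker begins each word
        "byte_map": BYTE_MAP,                   # e.g., SPM-style <0xBYTE> strings
        "add_space": True                       # prepend a space; the model strips sentence-initial spaces
    },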

If you run into problems, please contact the authors (e.g., rewicks@jhu.edu) or file an issue for assistance.
