
Releases: huggingface/tokenizers

Release v0.20.0: faster encode, better Python support

08 Aug 16:56

Release v0.20.0

This release is focused on performance and user experience.

Performance

First off, we did a bit of benchmarking and found some room for improvement!
With a few minor changes (mostly #1587), here is what we get on Llama 3, running on an AWS g6 instance, using https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py:
[Figure: benchmark results]
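If you want a rough reproduction, the linked script benchmarks encode against tiktoken; below is a minimal timing sketch in the same spirit (the checkpoint and batch contents are placeholders, not the exact benchmark setup, and Llama 3 checkpoints require authentication):

import time
from tokenizers import Tokenizer

# Placeholder checkpoint; swap in the tokenizer you actually want to measure.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
batch = ["The quick brown fox jumps over the lazy dog."] * 10_000

start = time.perf_counter()
tokenizer.encode_batch(batch)
print(f"encode_batch: {time.perf_counter() - start:.3f}s for {len(batch)} sequences")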

Python API

We shipped better deserialization errors in general, and support for __str__ and __repr__ for all objects. This makes debugging a lot easier; see for yourself:

>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))

>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))

The pre_tokenizer.Sequence and normalizer.Sequence are also more accessible now:

from tokenizers import normalizers

norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]                    # access the Strip normalizer inside the Sequence
norm[1].lowercase = False  # tweak the BertNormalizer in place
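The same indexing works on pre-tokenizer sequences; a minimal sketch (the components chosen here are just an illustration):

from tokenizers import pre_tokenizers

pre_tok = pre_tokenizers.Sequence([pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()])
pre_tok[0]         # access the WhitespaceSplit pre-tokenizer inside the Sequence
print(pre_tok[1])  # __repr__ / __str__ also work on the individual components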

What's Changed

New Contributors

Full Changelog: v0.19.1...v0.20.0rc1

v0.19.1

17 Apr 21:37

What's Changed

Full Changelog: v0.19.0...v0.19.1

v0.19.0

17 Apr 08:51

What's Changed

Full Changelog: v0.15.2...v0.19.0

v0.19.0rc0 (Pre-release)

16 Apr 14:06

Bumping 3 versions because of this: https://github.com/huggingface/transformers/blob/60dea593edd0b94ee15dc3917900b26e3acfbbee/setup.py#L177

What's Changed

  • chore: Remove CLI - this was originally intended for local development by @bryantbiggs in #1442
  • [remove black] And use ruff by @ArthurZucker in #1436
  • Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in #1456
  • Added ability to inspect a 'Sequence' decoder and the AddedVocabulary. by @eaplatanios in #1443
  • 🚨🚨 BREAKING CHANGE 🚨🚨: Refactor Metaspace (add_prefix_space is dropped; everything now uses the prepend_scheme enum instead; see the sketch after this list) by @ArthurZucker in #1476
  • Add more support for tiktoken based tokenizers by @ArthurZucker in #1493
  • PyO3 0.21. by @Narsil in #1494
  • Remove 3.13 (potential undefined behavior.) by @Narsil in #1497
  • Bumping all versions 3 times (ty transformers :) ) by @Narsil in #1498
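For the Metaspace breaking change above, a minimal sketch of the new-style configuration (the concrete values are illustrative, not taken from the PR):

from tokenizers import pre_tokenizers

# prepend_scheme replaces the old add_prefix_space flag; it accepts
# "always", "never" or "first".
pre_tok = pre_tokenizers.Metaspace(replacement="▁", prepend_scheme="always")
print(pre_tok.pre_tokenize_str("Hello world"))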

Full Changelog: v0.15.2...v0.19.0rc0

v0.15.2

12 Feb 02:35

What's Changed

Big shoutout to @rlrs for the fast Replace normalizer PR. This boosts the performance of the tokenizers:
[Figure: benchmark results]
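For context, a minimal sketch of the Replace normalizer that this PR speeds up (the pattern here is just an illustration):

from tokenizers import normalizers

# Replace every occurrence of a literal pattern during normalization.
norm = normalizers.Replace("``", '"')
print(norm.normalize_str("``quoted``"))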

New Contributors

Full Changelog: v0.15.1...v0.15.2rc1

v0.15.1

22 Jan 16:49

What's Changed

New Contributors

Full Changelog: v0.15.0...v0.15.1

v0.15.1.rc0 (Pre-release)

18 Jan 16:34

What's Changed

New Contributors

Full Changelog: v0.13.4.rc2...v0.15.1.rc0

v0.15.0

14 Nov 19:06

What's Changed

New Contributors

Full Changelog: v0.14.1...v0.15.0

v0.14.1

06 Oct 11:10

What's Changed

New Contributors

Full Changelog: v0.13.3...v0.14.1

v0.14.1rc1 (Pre-release)

05 Oct 13:56

What's Changed

New Contributors

Full Changelog: v0.13.4.rc2...v0.14.1rc1