
Releases: huggingface/tokenizers

Release v0.20.0: faster encode, better Python support

08 Aug 16:56

Release v0.20.0

This release is focused on performance and user experience.

Performance

First off, we did a bit of benchmarking and found some room for improvement!
With a few minor changes (mostly #1587), here is what we get on Llama 3, running on an AWS g6 instance, using https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py:
[Figure: benchmark results]
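If you want a rough reproduction, the linked script benchmarks encode against tiktoken; below is a minimal timing sketch in the same spirit (the checkpoint and batch contents are placeholders, not the exact benchmark setup, and Llama 3 checkpoints require authentication):

import time
from tokenizers import Tokenizer

# Placeholder checkpoint; swap in the tokenizer you actually want to measure.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
batch = ["The quick brown fox jumps over the lazy dog."] * 10_000

start = time.perf_counter()
tokenizer.encode_batch(batch)
print(f"encode_batch: {time.perf_counter() - start:.3f}s for {len(batch)} sequences")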

Python API

We shipped better deserialization errors in general, and support for __str__ and __repr__ for all objects. This makes debugging a lot easier; see for yourself:

>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))

>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))

The pre_tokenizer.Sequence and normalizer.Sequence are also more accessible now:

from tokenizers import normalizers

norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]                    # access the Strip normalizer inside the Sequence
norm[1].lowercase = False  # tweak the BertNormalizer in place
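The same indexing works on pre-tokenizer sequences; a minimal sketch (the components chosen here are just an illustration):

from tokenizers import pre_tokenizers

pre_tok = pre_tokenizers.Sequence([pre_tokenizers.WhitespaceSplit(), pre_tokenizers.Punctuation()])
pre_tok[0]         # access the WhitespaceSplit pre-tokenizer inside the Sequence
print(pre_tok[1])  # __repr__ / __str__ also work on the individual components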

What's Changed

New Contributors

Full Changelog: v0.19.1...v0.20.0rc1

v0.19.1

17 Apr 21:37

What's Changed

Full Changelog: v0.19.0...v0.19.1

v0.19.0

17 Apr 08:51

What's Changed

Full Changelog: v0.15.2...v0.19.0

v0.19.0rc0 (Pre-release)

16 Apr 14:06

Bumping 3 versions because of this: https://github.com/huggingface/transformers/blob/60dea593edd0b94ee15dc3917900b26e3acfbbee/setup.py#L177

What's Changed

  • chore: Remove CLI - this was originally intended for local development by @bryantbiggs in #1442
  • [remove black] And use ruff by @ArthurZucker in #1436
  • Bump ip from 2.0.0 to 2.0.1 in /bindings/node by @dependabot in #1456
  • Added ability to inspect a 'Sequence' decoder and the AddedVocabulary. by @eaplatanios in #1443
  • 🚨🚨 BREAKING CHANGE 🚨🚨: Refactor Metaspace (add_prefix_space is dropped; everything now uses the prepend_scheme enum instead; see the sketch after this list) by @ArthurZucker in #1476
  • Add more support for tiktoken based tokenizers by @ArthurZucker in #1493
  • PyO3 0.21. by @Narsil in #1494
  • Remove 3.13 (potential undefined behavior.) by @Narsil in #1497
  • Bumping all versions 3 times (ty transformers :) ) by @Narsil in #1498
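For the Metaspace breaking change above, a minimal sketch of the new-style configuration (the concrete values are illustrative, not taken from the PR):

from tokenizers import pre_tokenizers

# prepend_scheme replaces the old add_prefix_space flag; it accepts
# "always", "never" or "first".
pre_tok = pre_tokenizers.Metaspace(replacement="▁", prepend_scheme="always")
print(pre_tok.pre_tokenize_str("Hello world"))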

Full Changelog: v0.15.2...v0.19.0rc0

v0.15.2

12 Feb 02:35

What's Changed

Big shoutout to @rlrs for the fast Replace normalizer PR. This boosts the performance of the tokenizers:
[Figure: benchmark results]
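For context, a minimal sketch of the Replace normalizer that this PR speeds up (the pattern here is just an illustration):

from tokenizers import normalizers

# Replace every occurrence of a literal pattern during normalization.
norm = normalizers.Replace("``", '"')
print(norm.normalize_str("``quoted``"))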

New Contributors

Full Changelog: v0.15.1...v0.15.2rc1

v0.15.1

22 Jan 16:49

What's Changed

New Contributors

Full Changelog: v0.15.0...v0.15.1

v0.15.1.rc0 (Pre-release)

18 Jan 16:34

What's Changed

New Contributors

Full Changelog: v0.13.4.rc2...v0.15.1.rc0

v0.15.0

14 Nov 19:06

What's Changed

New Contributors

Full Changelog: v0.14.1...v0.15.0

v0.14.1

06 Oct 11:10

What's Changed

New Contributors

Full Changelog: v0.13.3...v0.14.1

v0.14.1rc1 (Pre-release)

05 Oct 13:56

What's Changed

New Contributors

Full Changelog: v0.13.4.rc2...v0.14.1rc1