Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bonus material: extending tokenizers #496

Merged
merged 2 commits into from
Jan 22, 2025
Merged

Bonus material: extending tokenizers #496

merged 2 commits into from
Jan 22, 2025

Conversation

rasbt
Copy link
Owner

@rasbt rasbt commented Jan 21, 2025

Adds bonus material to add special tokens to a tokenizer and update the LLM accordingly.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@rasbt rasbt merged commit a22d612 into main Jan 22, 2025
8 checks passed
@rasbt rasbt deleted the extend-tiktoken branch January 22, 2025 15:26
jiyangzh added a commit to jiyangzh/LLMs-from-scratch that referenced this pull request Feb 1, 2025
* Add "What's next" section (rasbt#432)

* Add What's next section

* Delete appendix-D/01_main-chapter-code/appendix-D-Copy2.ipynb

* Delete ch03/01_main-chapter-code/ch03-Copy1.ipynb

* Delete appendix-D/01_main-chapter-code/appendix-D-Copy1.ipynb

* Update ch07.ipynb

* Update ch07.ipynb

* Add chapter names

* Add missing device transfer in gpt_generate.py (rasbt#436)

* Add utility to prevent double execution of certain cells (rasbt#437)

* Add flexible padding bonus experiment (rasbt#438)

* Add flexible padding bonus experiment

* fix links

* Fixed command for row 16 additional experiment (rasbt#439)

* fixed command for row 16 experiment

* Update README.md

---------

Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>

* [minor] typo & comments (rasbt#441)

* typo & comment

- safe -> save
- commenting code: batch_size, seq_len = in_idx.shape

* comment

- adding # NEW for assert num_heads % num_kv_groups == 0

* update memory wording

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>

* fix misplaced parenthesis and update license (rasbt#466)

* Minor readability improvement in dataloader.ipynb (rasbt#461)

* Minor readability improvement in dataloader.ipynb

- The tokenizer and encoded_text variables at the root level are unused.
- The default params for create_dataloader_v1 are confusing, especially for the default batch_size 4, which happens to be the same as the max_length.

* readability improvements

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>

* typo fixed (rasbt#468)

* typo fixed

* only update plot

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>

* Add backup URL for gpt2 weights (rasbt#469)

* Add backup URL for gpt2 weights

* newline

* fix ch07 unit test (rasbt#470)

* adds no-grad context for reference model to DPO (rasbt#473)

* Auto download DPO dataset if not already available in path (rasbt#479)

* Auto download DPO dataset if not already available in path

* update tests to account for latest HF transformers release in unit tests

* pep 8

* fix reward margins plot label in dpo nb

* Print out embeddings for more illustrative learning (rasbt#481)

* print out embeddings for illustrative learning

* suggestion print embeddingcontents

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>

* Include mathematical breakdown for exercise solution 4.1 (rasbt#483)

* 04_optional-aws-sagemaker-notebook (rasbt#451)

* 04_optional-aws-sagemaker-notebook

* Update setup/04_optional-aws-sagemaker-notebook/cloudformation-template.yml

* Update README.md

---------

Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>

* Implementingthe  BPE Tokenizer from Scratch (rasbt#487)

* BPE: fixed typo (rasbt#492)

* fixed typo

* use rel path if exists

* mod gitignore and use existing vocab files

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>

* fix: preserve newline tokens in BPE encoder (rasbt#495)

* fix: preserve newline tokens in BPE encoder

* further fixes

* more fixes

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>

* add GPT2TokenizerFast to BPE comparison (rasbt#498)

* added HF BPE Fast

* update benchmarks

* add note about performance

* revert accidental changes

---------

Co-authored-by: rasbt <mail@sebastianraschka.com>

* Bonus material: extending tokenizers (rasbt#496)

* Bonus material: extending tokenizers

* small wording update

* Test for PyTorch 2.6 release candidate (rasbt#500)

* Test for PyTorch 2.6 release candidate

* update

* update

* remove extra added file

* A few cosmetic updates (rasbt#504)

* Fix default argument in ex 7.2 (rasbt#506)

* Alternative weight loading via .safetensors (rasbt#507)

* Test PyTorch nightly releases (rasbt#509)

---------

Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
Co-authored-by: Daniel Kleine <53251018+d-kleine@users.noreply.github.com>
Co-authored-by: casinca <47400729+casinca@users.noreply.github.com>
Co-authored-by: Tao Qian <taoxqian@gmail.com>
Co-authored-by: QS <41225783+Mike-7777777@users.noreply.github.com>
Co-authored-by: Henry Shi <henrythe9th@gmail.com>
Co-authored-by: rvaneijk <rob@blaeu.com>
Co-authored-by: Austin Welch <austinmw@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant