Bonus material: extending tokenizers #496

rasbt · 2025-01-21T21:57:47Z

Adds bonus material on extending a tokenizer with special tokens and updating the LLM accordingly.
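As a rough illustration of what adding special tokens and updating the model involves, here is a library-free toy sketch; it is not the code from this PR. It assumes a tokenizer vocabulary can be modeled as a token-to-ID dict and the model's embedding table (and, analogously, its output-layer weights) as a list of rows, one per token ID. The helper names `extend_vocab` and `extend_rows`, the token strings `<|MyNewToken_1|>` and `<|MyNewToken_2|>`, and the 0.02 initialization scale are all hypothetical.

```python
import random


def extend_vocab(vocab, new_tokens):
    """Return a copy of `vocab` with `new_tokens` appended at the next free IDs."""
    extended = dict(vocab)
    next_id = max(extended.values()) + 1 if extended else 0
    for token in new_tokens:
        if token in extended:  # skip tokens that already have an ID
            continue
        extended[token] = next_id
        next_id += 1
    return extended


def extend_rows(table, num_new, seed=0):
    """Return a copy of `table` with `num_new` small random rows appended.

    Existing rows are copied unchanged, so the representations of
    already-known tokens are preserved.
    """
    dim = len(table[0])
    rng = random.Random(seed)
    new_rows = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_new)]
    return [row[:] for row in table] + new_rows


# Toy vocabulary and a 3 x 4 embedding table (hypothetical values)
vocab = {"Hello": 0, "world": 1, "<|endoftext|>": 2}
embedding = [[0.1, 0.2, 0.3, 0.4] for _ in range(len(vocab))]

new_vocab = extend_vocab(vocab, ["<|MyNewToken_1|>", "<|MyNewToken_2|>"])
num_added = len(new_vocab) - len(vocab)

# The embedding table needs one row per token ID; an output layer that
# produces one logit per token would need the same kind of extension.
new_embedding = extend_rows(embedding, num_added)
```

The point of the sketch is only that the vocabulary and every vocabulary-sized weight matrix must grow together, while the rows for existing tokens stay untouched.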

review-notebook-app · 2025-01-21T21:57:52Z

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

* Add "What's next" section (rasbt#432)
* Add What's next section
* Delete appendix-D/01_main-chapter-code/appendix-D-Copy2.ipynb
* Delete ch03/01_main-chapter-code/ch03-Copy1.ipynb
* Delete appendix-D/01_main-chapter-code/appendix-D-Copy1.ipynb
* Update ch07.ipynb
* Update ch07.ipynb
* Add chapter names
* Add missing device transfer in gpt_generate.py (rasbt#436)
* Add utility to prevent double execution of certain cells (rasbt#437)
* Add flexible padding bonus experiment (rasbt#438)
* Add flexible padding bonus experiment
* fix links
* Fixed command for row 16 additional experiment (rasbt#439)
* fixed command for row 16 experiment
* Update README.md
---------
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
* [minor] typo & comments (rasbt#441)
* typo & comment
  - safe -> save
  - commenting code: batch_size, seq_len = in_idx.shape
* comment
  - adding # NEW for assert num_heads % num_kv_groups == 0
* update memory wording
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
* fix misplaced parenthesis and update license (rasbt#466)
* Minor readability improvement in dataloader.ipynb (rasbt#461)
* Minor readability improvement in dataloader.ipynb
  - The tokenizer and encoded_text variables at the root level are unused.
  - The default params for create_dataloader_v1 are confusing, especially for the default batch_size 4, which happens to be the same as the max_length.
* readability improvements
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
* typo fixed (rasbt#468)
* typo fixed
* only update plot
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
* Add backup URL for gpt2 weights (rasbt#469)
* Add backup URL for gpt2 weights
* newline
* fix ch07 unit test (rasbt#470)
* adds no-grad context for reference model to DPO (rasbt#473)
* Auto download DPO dataset if not already available in path (rasbt#479)
* Auto download DPO dataset if not already available in path
* update tests to account for latest HF transformers release in unit tests
* pep 8
* fix reward margins plot label in dpo nb
* Print out embeddings for more illustrative learning (rasbt#481)
* print out embeddings for illustrative learning
* suggestion print embedding contents
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
* Include mathematical breakdown for exercise solution 4.1 (rasbt#483)
* 04_optional-aws-sagemaker-notebook (rasbt#451)
* 04_optional-aws-sagemaker-notebook
* Update setup/04_optional-aws-sagemaker-notebook/cloudformation-template.yml
* Update README.md
---------
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
* Implementing the BPE Tokenizer from Scratch (rasbt#487)
* BPE: fixed typo (rasbt#492)
* fixed typo
* use rel path if exists
* mod gitignore and use existing vocab files
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
* fix: preserve newline tokens in BPE encoder (rasbt#495)
* fix: preserve newline tokens in BPE encoder
* further fixes
* more fixes
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
* add GPT2TokenizerFast to BPE comparison (rasbt#498)
* added HF BPE Fast
* update benchmarks
* add note about performance
* revert accidental changes
---------
Co-authored-by: rasbt <mail@sebastianraschka.com>
* Bonus material: extending tokenizers (rasbt#496)
* Bonus material: extending tokenizers
* small wording update
* Test for PyTorch 2.6 release candidate (rasbt#500)
* Test for PyTorch 2.6 release candidate
* update
* update
* remove extra added file
* A few cosmetic updates (rasbt#504)
* Fix default argument in ex 7.2 (rasbt#506)
* Alternative weight loading via .safetensors (rasbt#507)
* Test PyTorch nightly releases (rasbt#509)
---------
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
Co-authored-by: Daniel Kleine <53251018+d-kleine@users.noreply.github.com>
Co-authored-by: casinca <47400729+casinca@users.noreply.github.com>
Co-authored-by: Tao Qian <taoxqian@gmail.com>
Co-authored-by: QS <41225783+Mike-7777777@users.noreply.github.com>
Co-authored-by: Henry Shi <henrythe9th@gmail.com>
Co-authored-by: rvaneijk <rob@blaeu.com>
Co-authored-by: Austin Welch <austinmw@users.noreply.github.com>

Bonus material: extending tokenizers

eddc5ff

small wording update

73b4a34

rasbt merged commit a22d612 into main Jan 22, 2025
8 checks passed

rasbt deleted the extend-tiktoken branch January 22, 2025 15:26
