
Commit

Update documentation to reflect tokenizers refactor under transformers module
Ankur-singh committed Jan 5, 2025
1 parent a87dfc8 commit f8732e8
Showing 2 changed files with 11 additions and 11 deletions.
docs/source/api_ref_modules.rst — 6 additions & 6 deletions
@@ -48,10 +48,10 @@ model specific tokenizers.
    :toctree: generated/
    :nosignatures:

-   tokenizers.SentencePieceBaseTokenizer
-   tokenizers.TikTokenBaseTokenizer
-   tokenizers.ModelTokenizer
-   tokenizers.BaseTokenizer
+   transforms.tokenizers.SentencePieceBaseTokenizer
+   transforms.tokenizers.TikTokenBaseTokenizer
+   transforms.tokenizers.ModelTokenizer
+   transforms.tokenizers.BaseTokenizer

 Tokenizer Utilities
 -------------------
@@ -61,8 +61,8 @@ These are helper methods that can be used by any tokenizer.
    :toctree: generated/
    :nosignatures:

-   tokenizers.tokenize_messages_no_special_tokens
-   tokenizers.parse_hf_tokenizer_json
+   transforms.tokenizers.tokenize_messages_no_special_tokens
+   transforms.tokenizers.parse_hf_tokenizer_json


 PEFT Components
docs/source/basics/tokenizers.rst — 5 additions & 5 deletions
@@ -168,7 +168,7 @@ For example, here we change the ``"<|begin_of_text|>"`` and ``"<|end_of_text|>"``
 Base tokenizers
 ---------------

-:class:`~torchtune.modules.tokenizers.BaseTokenizer` are the underlying byte-pair encoding modules that perform the actual raw string to token ID conversion and back.
+:class:`~torchtune.modules.transforms.tokenizers.BaseTokenizer` are the underlying byte-pair encoding modules that perform the actual raw string to token ID conversion and back.
 In torchtune, they are required to implement ``encode`` and ``decode`` methods, which are called by the :ref:`model_tokenizers` to convert
 between raw text and token IDs.

@@ -202,13 +202,13 @@ between raw text and token IDs.
         """
         pass

-If you load any :ref:`model_tokenizers`, you can see that it calls its underlying :class:`~torchtune.modules.tokenizers.BaseTokenizer`
+If you load any :ref:`model_tokenizers`, you can see that it calls its underlying :class:`~torchtune.modules.transforms.tokenizers.BaseTokenizer`
 to do the actual encoding and decoding.

 .. code-block:: python

     from torchtune.models.mistral import mistral_tokenizer
-    from torchtune.modules.tokenizers import SentencePieceBaseTokenizer
+    from torchtune.modules.transforms.tokenizers import SentencePieceBaseTokenizer

     m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")
     # Mistral uses SentencePiece for its underlying BPE
@@ -227,7 +227,7 @@ to do the actual encoding and decoding.
 Model tokenizers
 ----------------

-:class:`~torchtune.modules.tokenizers.ModelTokenizer` are specific to a particular model. They are required to implement the ``tokenize_messages`` method,
+:class:`~torchtune.modules.transforms.tokenizers.ModelTokenizer` are specific to a particular model. They are required to implement the ``tokenize_messages`` method,
 which converts a list of Messages into a list of token IDs.

 .. code-block:: python

@@ -259,7 +259,7 @@ is because they add all the necessary special tokens or prompt templates require
     from torchtune.models.mistral import mistral_tokenizer
-    from torchtune.modules.tokenizers import SentencePieceBaseTokenizer
+    from torchtune.modules.transforms.tokenizers import SentencePieceBaseTokenizer
     from torchtune.data import Message

     m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")
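For downstream code, this refactor is purely an import-path rename: every dotted path under ``torchtune.modules.tokenizers`` moves to ``torchtune.modules.transforms.tokenizers``. A minimal sketch of that mapping (this helper is hypothetical and not part of torchtune; only the prefix strings come from the diffs above):

```python
# Hypothetical migration helper (not part of torchtune): rewrites a
# pre-refactor tokenizer import path to its post-refactor location.
OLD_PREFIX = "torchtune.modules.tokenizers"
NEW_PREFIX = "torchtune.modules.transforms.tokenizers"


def migrate_import_path(path: str) -> str:
    """Map an old dotted path under the tokenizers module to the new one."""
    if path == OLD_PREFIX or path.startswith(OLD_PREFIX + "."):
        return NEW_PREFIX + path[len(OLD_PREFIX):]
    return path  # paths outside the refactored module are unchanged


print(migrate_import_path("torchtune.modules.tokenizers.SentencePieceBaseTokenizer"))
# → torchtune.modules.transforms.tokenizers.SentencePieceBaseTokenizer
```

In practice this means only updating ``from torchtune.modules.tokenizers import ...`` statements to the new module path, as the doc diffs above do.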
