Merge remote-tracking branch 'origin/new_tokenization_docs'
nsheff committed May 24, 2024
2 parents 738d3ee + 305f0b7 commit ed030fc
Showing 1 changed file with 9 additions and 8 deletions: docs/geniml/tutorials/tokenization.md
The `geniml` tokenizers are used to prepare data for training, evaluation, and inference.

All tokenizers require a *universe file* (or vocab file). This is a BED file that contains all possible regions that can be tokenized. It may also include special tokens, such as start, end, unknown, and padding tokens.
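For illustration, a universe file is an ordinary BED file with one region per line (chromosome, start, end). The coordinates below are invented purely for the sake of example:

```
chr1	10000	10500
chr1	10500	11000
chr2	20000	20750
```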

Our tokenizers are implemented in Rust for speed and efficiency. They live in the `geniml` companion library, [`genimtools`](https://github.com/databio/genimtools). Two tokenizers are currently available: the TreeTokenizer and the AnnDataTokenizer. The TreeTokenizer is a simple, flexible tokenizer suitable for any type of data; the AnnDataTokenizer is designed specifically for single-cell AnnData objects from the `anndata` library.
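As a rough sketch of how the AnnDataTokenizer might be invoked, assuming it mirrors the TreeTokenizer's constructor and call syntax (the import path, the `read_h5ad` input, and the call signature below are assumptions, not confirmed API):

```python
import anndata as ad

from geniml.tokenization import AnnDataTokenizer  # assumed import path

# Assumption: the AnnData object stores a cells-by-regions matrix, and the
# tokenizer maps each cell's regions onto the universe.
adata = ad.read_h5ad("/path/to/data.h5ad")
t = AnnDataTokenizer("/path/to/universe.bed")
tokens = t(adata)
```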

The API is loosely based on the Hugging Face [`tokenizers`](https://github.com/huggingface/tokenizers) library, so it should be familiar to users of that library.

## Using the tokenizers
To start using a tokenizer, simply pass it an appropriate universe file:

```python
from geniml.tokenization import TreeTokenizer # or any other tokenizer
from geniml.io import RegionSet

rs = RegionSet("/path/to/file.bed")
t = TreeTokenizer("/path/to/universe.bed")

tokens = t(rs)
for token in tokens:
    print(f"{token.chr}:{token.start}-{token.end}")
```
Tokenizers can also prepare training data for models such as Region2Vec: tokenized regions are converted to integer IDs that the model consumes.

```python
from geniml.io import RegionSet
from geniml.region2vec import Region2Vec  # import path assumed
from geniml.tokenization import TreeTokenizer

rs = RegionSet("/path/to/file.bed")
t = TreeTokenizer("/path/to/universe.bed")

model = Region2Vec(len(t), 100)  # 100-dimensional embeddings
tokens = t(rs)
ids = tokens.to_ids()
```
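The resulting IDs can then be fed to the model. A minimal sketch, assuming Region2Vec behaves as a standard PyTorch module that maps a tensor of token IDs to one embedding vector per token:

```python
import torch

# Assumption: the model accepts a 1-D LongTensor of token IDs and returns
# a (num_tokens, 100) tensor of embeddings.
id_tensor = torch.tensor(ids, dtype=torch.long)
out = model(id_tensor)
print(out.shape)
```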

## Future work
Genomic region tokenization is an active area of research. We will implement new tokenizers as they are developed. If you have a tokenizer you'd like to see implemented, please open an issue or submit a pull request.

For core development of our tokenizers, see the [genimtools](https://github.com/databio/genimtools) repository.
