
How many tokens per word using given tokenizer and given dataset?


sorenmulli/tokens-per-word



$ pip install .

$ cat raw-sentences.txt | tokens-per-word -t "north/t5_base_scand3M"
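# or (if you have csvkit/csvtools)
$ csvcut -c "text-column" tabular-dataset.csv | tokens-per-word -t "north/t5_base_scand3M"
# or (if you have jq; the bracket form is needed because the field name contains a hyphen, and -r prints raw strings without JSON quotes)
$ cat line-formatted-dataset.jsonl | jq -r '.["text-field"]' | tokens-per-word -t "north/t5_base_scand3M"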
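
To make the reported number concrete, here is a minimal Python sketch of one way a tokens-per-word ratio could be computed from lines of text. It is not the tool's actual implementation: the function names `tokens_per_word` and `toy_tokenize` are hypothetical, counting words by whitespace splitting is an assumption, and the toy tokenizer only stands in for a real one so that the example runs without downloading a model.

```python
from typing import Callable, Iterable, List


def tokens_per_word(lines: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Return total token count divided by total whitespace-separated word count.

    Assumption: a "word" is any whitespace-delimited chunk of text.
    """
    n_tokens = 0
    n_words = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore empty lines
        n_words += len(line.split())
        n_tokens += len(tokenize(line))
    if n_words == 0:
        raise ValueError("input contains no words")
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into pieces of at most 4 characters."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


if __name__ == "__main__":
    sample = ["hello tokenization world", "", "short words"]
    print(tokens_per_word(sample, toy_tokenize))
```

To try this with a real tokenizer, one option (assuming the `transformers` package is installed and the model can be downloaded) is to pass `AutoTokenizer.from_pretrained("north/t5_base_scand3M").tokenize` as the `tokenize` argument in place of `toy_tokenize`.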
