Fine-tune tokenizers #80

Open
rth opened this issue Jun 15, 2020 · 0 comments
rth commented Jun 15, 2020

It can happen that the tokenization results are unsatisfactory in some way, and the question is what the mechanism to customize/improve them should be. It could be either,
a) adding options to make these improvements in the tokenizer itself. The issue is that some of these options might be relevant to multiple tokenizers.
b) adding a new step later in the pipeline. That's probably the best way to allow arbitrary customization, but some post-processing steps might be specific to the previous step, and adding them to the library might be confusing.

There is probably a balance that needs to be found between the two.
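For illustration, a post-processing step as in (b) could be chained with a tokenizer roughly as follows. This is only a sketch, and TokenizerPipeline is a hypothetical name, not an existing class in the library:

    from typing import Callable, List

    class TokenizerPipeline:
        """Hypothetical sketch of option (b): a base tokenizer followed by
        post-processing steps, each mapping a list of tokens to a list of tokens."""

        def __init__(self, tokenizer, steps: List[Callable[[List[str]], List[str]]]):
            self.tokenizer = tokenizer
            self.steps = steps

        def tokenize(self, text: str) -> List[str]:
            # run the base tokenizer, then apply each post-processing step in order
            tokens = self.tokenizer.tokenize(text)
            for step in self.steps:
                tokens = step(tokens)
            return tokens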

For instance,

  1. PunctuationTokenizer,
    • currently doesn't take into account repeated punctuation
      >>> PunctuationTokenizer().tokenize("test!!!")                                                                                                     
      ['test!', '!', '!']
    • will tokenize abbreviation parts separated by "." as separate sentences
      >>> PunctuationTokenizer().tokenize("W.T.O.")
      ['W.', 'T.', 'O.']
    both could probably be addressed by adding an option to force sentences to be longer than some minimal length (and otherwise append them to the previous token); see the sketch after this list.
  2. UnicodeSentenceTokenizer,
    will not split sentences separated by punctuation without a following space, e.g.,
    >>> UnicodeSentenceTokenizer().tokenize('One sentence.Another sentence.')
    ['One sentence.Another sentence.']
    That's a very common occurrence in real-world text, and I think a workaround should be found (e.g. an additional tokenization pass with a regex/punctuation tokenizer; see the sketch below).
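To make this concrete, below is a rough sketch of the kind of post-processing steps that could address both examples; the function names and the default minimal length are made up, and the heuristics are intentionally simple:

    import re
    from typing import List

    def merge_short_sentences(sentences: List[str], min_len: int = 4) -> List[str]:
        """Append sentences shorter than `min_len` characters to the previous
        sentence, so that e.g. ['test!', '!', '!'] becomes ['test!!!']."""
        out: List[str] = []
        for sent in sentences:
            if out and len(sent.strip()) < min_len:
                out[-1] += sent
            else:
                out.append(sent)
        return out

    def split_missing_space(sentences: List[str]) -> List[str]:
        """Re-split sentences at '.', '!' or '?' directly followed by an
        uppercase letter (a crude heuristic for a missing space)."""
        out: List[str] = []
        for sent in sentences:
            out.extend(re.split(r"(?<=[.!?])(?=[A-Z])", sent))
        return out

Applied to the outputs above:

    >>> merge_short_sentences(['test!', '!', '!'])
    ['test!!!']
    >>> merge_short_sentences(['W.', 'T.', 'O.'])
    ['W.T.O.']
    >>> split_missing_space(['One sentence.Another sentence.'])
    ['One sentence.', 'Another sentence.']

The uppercase-letter heuristic is crude (on its own it would also re-split "W.T.O."), so it would need to run before the minimum-length merge, or be replaced by something smarter.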

Generally, it would be good to add some evaluation benchmarks for sentence tokenization to the evaluation/ folder.
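As a minimal illustration of what such a benchmark could compute (a sketch only; it assumes the predicted and gold sentences each concatenate back to the same original text, and the function name is made up):

    from typing import List, Set

    def sentence_boundary_f1(predicted: List[str], gold: List[str]) -> float:
        """Boundary-level F1: compare the character offsets at which the
        predicted and gold segmentations end their sentences."""
        def boundaries(sentences: List[str]) -> Set[int]:
            offsets, pos = set(), 0
            for sent in sentences:
                pos += len(sent)
                offsets.add(pos)
            return offsets

        pred_b, gold_b = boundaries(predicted), boundaries(gold)
        tp = len(pred_b & gold_b)
        if tp == 0:
            return 0.0
        precision, recall = tp / len(pred_b), tp / len(gold_b)
        return 2 * precision * recall / (precision + recall)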

  3. UnicodeTokenizer is currently extended in VTextTokenizer (for lack of a better name) with a few additional rules. Maybe this could have been a separate token-processing step, particularly if one imagines that more rules could be added (or potentially even an ML model used); see the sketch below.
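For example, such a separate step could look roughly like the following; the rule shown is purely hypothetical and only illustrates the shape it could take (it could then be chained after the base tokenizer via something like the TokenizerPipeline sketch above):

    from typing import List

    def merge_hyphenated(tokens: List[str]) -> List[str]:
        """Hypothetical token-processing rule: re-join tokens around '-' so that
        e.g. ['state', '-', 'of', '-', 'the', '-', 'art'] becomes ['state-of-the-art']."""
        out: List[str] = []
        i = 0
        while i < len(tokens):
            if tokens[i] == "-" and out and i + 1 < len(tokens):
                # glue the hyphen and the following token onto the previous one
                out[-1] = out[-1] + "-" + tokens[i + 1]
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        return out

    >>> merge_hyphenated(['state', '-', 'of', '-', 'the', '-', 'art'])
    ['state-of-the-art']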