There are several tokenization conventions (e.g., the tokens used for padding, separating segments, etc.) that need to be specified when doing wordpiece tokenization for BERT. Currently, some of these conventions are hard-coded, while others are exposed as function parameters. We should decide on a consistent approach here.
Also, more clearly delineate what belongs in RBERT vs. wordpiece.
(macmillancontentscience/wordpiece#15)
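To make the trade-off concrete, here is a minimal sketch (in Python, not the actual wordpiece/RBERT API; the function and parameter names are hypothetical) of the "everything as parameters with BERT defaults" approach, where all special-token conventions live in one signature instead of being scattered between constants and arguments:

```python
# Hypothetical sketch, not the wordpiece/RBERT API: every special-token
# convention is a keyword parameter with BERT's standard default, so callers
# can override any of them without touching package internals.

def tokenize_for_bert(
    pieces,
    cls_token="[CLS]",   # marks the start of a sequence
    sep_token="[SEP]",   # separates segments
    pad_token="[PAD]",   # pads sequences out to max_length
    max_length=8,
):
    """Wrap wordpiece output in special tokens and pad/truncate to max_length."""
    tokens = [cls_token] + list(pieces) + [sep_token]
    tokens += [pad_token] * (max_length - len(tokens))
    return tokens[:max_length]

print(tokenize_for_bert(["hello", "world"]))
# ['[CLS]', 'hello', 'world', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
```

The alternative consistent approach would be the mirror image: hard-code all of these as package-level constants and expose none of them. Either way, the decision about which tokens are conventions of the *model* (wordpiece's concern) versus conventions of the *pipeline* (RBERT's concern) would help settle where each default should live.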