Preparing text for generating word vectors with Floret #11285

orglce · 2022-08-09T14:43:52Z

I have been trying to generate my own word vectors with Floret, and I was wondering whether there are any recommended preprocessing steps besides tokenization. Would it improve the downstream accuracy of the pipeline if the text were:

converted to lowercase
numbers removed
punctuation removed
lemmatized

I reckon none of these steps would benefit the pipeline as a whole (POS, NER, lemmatization...), but I don't know exactly how spaCy uses the vectors under the hood.

adrianeboyd · 2022-08-09T17:59:10Z

I would recommend only tokenizing. The static vectors currently look up a token's vector by the token text (token.orth_), so lowercasing or lemmatizing the training text is not useful: at runtime the lookup would still use the original, unmodified token text.

(It's technically possible to have vectors for a token attribute other than ORTH, but the token attribute setting is not currently exposed in the config, so it would currently require extra work to implement a custom tok2vec embed architecture.)
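To illustrate why the surface form matters for subword vectors, here is a minimal pure-Python sketch of character n-gram extraction. It is a simplification, not floret's actual implementation: the helper name `char_ngrams` and the `minn`/`maxn` values are assumptions chosen for illustration, and the hashing of n-grams into buckets is left out entirely.

```python
def char_ngrams(token: str, minn: int = 4, maxn: int = 5) -> set[str]:
    """Return the character n-grams of a token wrapped in < and > boundary markers.

    Hypothetical helper for illustration only; minn and maxn are assumed values.
    """
    wrapped = f"<{token}>"
    return {
        wrapped[i : i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    }


# The token text as it appears in a document versus its lowercased form.
original = char_ngrams("Apple")
lowercased = char_ngrams("apple")

# N-grams that a model trained on lowercased text would also have seen.
shared = original & lowercased
# N-grams that only occur when the original casing is preserved.
only_in_original = original - lowercased

print(sorted(shared))
print(sorted(only_in_original))
```

In this sketch, only the n-grams that do not touch the capital letter are shared between the two forms. If the vectors were trained on lowercased text, the remaining n-grams of the original token would not have been seen in training, which is one way to see why the training text should keep the same form that is looked up at runtime.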
