Analyzing support tickets with spaCy #12130

Gitclop · 2023-01-19T11:35:21Z

Hi, I am using spaCy to analyse support tickets. The tickets are written by humans and contain a lot of domain-specific words. With only a little training of the standard model, I managed to implement a sentiment analysis tool to calculate ticket urgency.
Now I want to move on to the next NLP tasks, such as clustering and text similarity.
The tickets have a very technical background, with abbreviations, technical terms and gibberish.
I am a bit lost on how to approach those next NLP tasks. Should I train custom word vectors, or just NER? Train BERT on unlabeled text, or do I need POS and dependency tags, etc.?
In short: I want a solid model based on my "non-standard" data and need some guidance on what kind of training results in the most solid model that can be used to work on a multitude of NLP tasks.

danieldk · 2023-01-23T13:01:02Z

In short: I want a solid model based on my "non-standard" data and need some guidance on what kind of training results in the most solid model that can be used to work on a multitude of nlp-tasks.

It is hard to say a priori which representations work for document similarity and clustering. It really depends on the vocabulary, the amount of noise, etc. of the dataset. At any rate, I would recommend setting up reproducible evaluations, so that you can easily compare how well different measures and document representations work. It probably also makes sense to tackle document similarity before document clustering, since many clustering methods require a document similarity measure.
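A reproducible evaluation can start very small. The following is only a sketch under assumptions: the labeled ticket pairs, the two example measures (`token_jaccard`, `trigram_jaccard`) and the AUC-style `ranking_score` are all invented for illustration and are not part of spaCy. The idea is that every candidate similarity measure is scored against the same fixed set of labeled pairs.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical labeled pairs: (ticket_a, ticket_b, is_similar).
# In practice these would come from a hand-annotated sample of real tickets.
LABELED_PAIRS: List[Tuple[str, str, bool]] = [
    ("VPN client drops connection after login",
     "vpn connection drops right after login", True),
    ("Printer shows error E04 on startup",
     "error E04 when the printer starts", True),
    ("VPN client drops connection after login",
     "Printer shows error E04 on startup", False),
    ("Cannot reset password in portal",
     "error E04 when the printer starts", False),
]


def set_jaccard(x: set, y: set) -> float:
    """Jaccard index of two sets (0.0 when both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def token_jaccard(a: str, b: str) -> float:
    """Overlap of lowercased whitespace tokens."""
    return set_jaccard(set(a.lower().split()), set(b.lower().split()))


def trigram_jaccard(a: str, b: str) -> float:
    """Overlap of character trigrams, which tolerates typos a bit better."""
    def trigrams(text: str) -> set:
        s = text.lower()
        return {s[i:i + 3] for i in range(len(s) - 2)}
    return set_jaccard(trigrams(a), trigrams(b))


def ranking_score(measure: Callable[[str, str], float],
                  pairs: List[Tuple[str, str, bool]]) -> float:
    """Fraction of (similar, dissimilar) combinations in which the similar
    pair gets the higher score; ties count as half. 1.0 is perfect."""
    pos = [measure(a, b) for a, b, label in pairs if label]
    neg = [measure(a, b) for a, b, label in pairs if not label]
    if not pos or not neg:
        raise ValueError("need at least one similar and one dissimilar pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def compare(measures: Dict[str, Callable[[str, str], float]],
            pairs: List[Tuple[str, str, bool]]) -> Dict[str, float]:
    """Score every measure on the same labeled pairs."""
    return {name: ranking_score(fn, pairs) for name, fn in measures.items()}


results = compare(
    {"token_jaccard": token_jaccard, "trigram_jaccard": trigram_jaccard},
    LABELED_PAIRS,
)
for name, score in results.items():
    print(f"{name}: {score:.2f}")
```

Any other measure, for example one wrapping Doc.similarity or a sentence transformer, can be added to the dictionary passed to `compare` as long as it takes two strings and returns a number.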

spaCy itself has the Doc.similarity method, which computes the cosine similarity between the average token vectors of two documents. It uses the word vectors of the model that was used to annotate the documents. An alternative is Sentence Transformer models, which compose the piece representations of a document with a transformer model and may give more powerful document representations at the cost of memory and speed.
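To make "cosine similarity between the average vectors" concrete, here is a small pure-Python sketch of the idea. It is not spaCy's implementation: the three-dimensional `TOY_VECTORS` table and the function names are made up for illustration, and tokenization is a plain whitespace split.

```python
import math
from typing import Dict, List, Optional

# Toy 3-dimensional word vectors, invented for illustration only.
TOY_VECTORS: Dict[str, List[float]] = {
    "printer":   [0.9, 0.1, 0.0],
    "scanner":   [0.8, 0.2, 0.1],
    "broken":    [0.1, 0.9, 0.0],
    "defective": [0.2, 0.8, 0.1],
    "invoice":   [0.0, 0.1, 0.9],
}


def average_vector(text: str,
                   vectors: Dict[str, List[float]]) -> Optional[List[float]]:
    """Mean of the vectors of all known tokens, or None if none is known.

    Skipping unknown tokens is a simplification. For cosine similarity it
    gives the same result as averaging in all-zero vectors, because cosine
    ignores vector length.
    """
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def avg_vector_similarity(a: str, b: str,
                          vectors: Dict[str, List[float]] = TOY_VECTORS) -> float:
    """Cosine similarity of the averaged token vectors of two texts."""
    va, vb = average_vector(a, vectors), average_vector(b, vectors)
    if va is None or vb is None:
        return 0.0  # no usable vectors on at least one side
    return cosine(va, vb)


related = avg_vector_similarity("printer broken", "scanner defective")
unrelated = avg_vector_similarity("printer broken", "invoice")
jargon = avg_vector_similarity("xj9 frobnicate", "printer broken")
print(related, unrelated, jargon)
```

The last call shows the failure mode for noisy text: when none of the tokens in a ticket has a vector, there is nothing to average, so the score carries no information.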

3 replies

Gitclop Jan 24, 2023
Author

Thanks a lot for your reply!
I've tried spaCy's Doc.similarity, but the results were not as good as I had hoped. Probably because there is a lot of noise in my text, as well as tech jargon for which spaCy doesn't have any word vectors.
That's why I thought training a language model with my data should be the first step. But I am not sure if I have to train the model from scratch with POS and dependency tagging, or if it is enough to just calculate new word vectors and maybe train the NER tagger to recognise the tech jargon.
Then again, I thought maybe this is all "outdated" and I should train a transformer model with my data.

I could probably calculate new word vectors, train the NER tagger and get a decent similarity score with these two components. But is that "future-proof"? There are so many possibilities and I would like to have a solid foundation, also for future analysis.

danieldk Jan 24, 2023

That's why I thought training a language model with my data should be the first step.

Sounds fair!

But I am not sure if I have to train the model from scratch with POS and dependency tagging, or if it is enough to just calculate new word vectors and maybe train the NER tagger to recognise the tech jargon.

You can train a spaCy pipeline with just NER. For some tasks/datasets, multi-task learning can give better results because there is a certain amount of 'cross-pollination' between tasks, but it is by no means necessary. In fact, many pretrained spaCy convolutional pipelines use a separate tok2vec model for NER, so the contextual representations are not shared between NER and other tasks.

But is that "future-proof"? There are so many possibilities and I would like to have a solid foundation,

This is why a good evaluation setup will help. It is likely that sentence transformers will be among the best-performing models. But if a model with newly-trained word vectors can provide similar accuracy (e.g. because similarity/clustering can be done mainly using term overlap between tickets), it would be much faster, easier to interpret, etc. So, if you have the chance and time to compare a couple of approaches, it's definitely worthwhile to do so!
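One cheap way to check how far plain term overlap gets you is a TF-IDF baseline. This is an assumption on my part, since the thread does not name a specific term-overlap method. The sketch below uses only the standard library, and the example tickets, `tfidf_vectors` and `sparse_cosine` are hypothetical names.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical tickets, for illustration only.
TICKETS = [
    "error E04 on printer HP-200",
    "printer HP-200 shows error E04",
    "cannot login to vpn",
]


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """One sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF, so that a term occurring in every document still
    # keeps a small positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def sparse_cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


vecs = tfidf_vectors(TICKETS)
same_issue = sparse_cosine(vecs[0], vecs[1])
different_issue = sparse_cosine(vecs[0], vecs[2])
print(f"same issue: {same_issue:.2f}, different issue: {different_issue:.2f}")
```

Because every score traces back to shared terms and their weights, it is easy to see why two tickets were judged similar, which is the interpretability advantage mentioned above. Tickets that describe the same problem in different words score low, and that is where vector-based models would have to earn their extra cost.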

Gitclop Jan 24, 2023
Author

I am setting up a training set right now and will start evaluating different approaches. Sometimes you just need a little nudge in the right direction. Thanks!
