DupPredictor is a framework for detecting duplicate questions on Stack Overflow using machine learning techniques. It works in two phases: model building and prediction.

During model building, a model is trained on a dataset of past duplicate questions. The data is obtained by running a query on StackExchange and consists of 50,000 rows pairing original questions with their duplicates. Preprocessing removes HTML tags, punctuation, and stopwords from the titles and bodies, then stems the remaining text. Title similarity scores are computed from the words that two question titles have in common. The model also uses Latent Dirichlet Allocation (LDA) for topic modeling to classify the text into topics. Training uses a set of 300 duplicate questions, and the model parameters are estimated with either a sample-based greedy method or a gradient-based optimization method.
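The preprocessing and title-similarity steps described above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual code: the stopword list, the `crude_stem` helper, and the choice of cosine similarity over word counts are all assumptions made for the example. A real pipeline would more likely use a full stopword list and an established stemmer such as Porter.

```python
import html
import math
import re
from collections import Counter

# Hypothetical, deliberately small stopword list (assumption for this sketch).
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "to",
             "how", "do", "i", "and", "for", "with"}

TAG_RE = re.compile(r"<[^>]+>")            # matches HTML tags such as <p> or </code>
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")   # matches punctuation after lowercasing


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Remove HTML tags, punctuation, and stopwords, then stem each token."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text).lower()
    text = NON_WORD_RE.sub(" ", text)
    return [crude_stem(tok) for tok in text.split() if tok not in STOPWORDS]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bags of words (assumed similarity measure)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    original = preprocess("<p>How to sort a list in Python?</p>")
    candidate = preprocess("Sorting lists in Python")
    print(original, candidate, cosine_similarity(original, candidate))
```

Titles that share no words after preprocessing score 0, and titles that reduce to the same bag of words score 1, so the score can be thresholded or combined with other signals such as LDA topic similarity.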
Abhi7410/StackOverFlow_Duplicate_Question_Detection