“A fool's brain digests philosophy into madness, science into superstition and art into pedantry. Hence a university education.” George Bernard Shaw
- Introduction
- Preprocessing and Cleaning dataset
- Story Generation and Visualization from reviews
- Text reviews
- Extracting Features from Cleaned reviews
- Model Building: Sentiment Analysis
- Group Project
This project proposes the analysis of consumer behaviour in order to help a business build an effective and targeted marketing strategy.
To do this we will build predictive models on datasets compiled from the e-commerce giants Amazon and Walmart.
• ***Build a Sentiment Analysis model*** to predict the effect of customer reviews on sales.
• ***Build a Market Basket Model on the Amazon dataset***. This will enable the enterprise to predict consumer behaviour by suggesting complementary goods to purchase.
• ***Analyse the conversion rates*** in this dataset, also with a view to building a model to increase them.
• Examine customer sensitivity to price by building a linear regression model on the Walmart dataset.
The retail industry has taken a 180-degree turn with the rise in online shopping. In 2019, retail e-commerce sales worldwide amounted to 3.53 trillion US dollars, and e-retail revenues are projected to grow to 6.54 trillion US dollars by 2022.
It was predicted that in 2020 the global e-commerce market would exceed 4 trillion dollars, and according to an Invespcro (2020) report, one in every four online consumers purchases from stores once a week.
Importing Libraries
- Visualization libraries
Pandas, Seaborn, Matplotlib.pyplot, Plotly.express as px
- NLTK libraries
nltk, re, WordCloud, PorterStemmer, TfidfVectorizer, stopwords, word_tokenize, TextBlob
- Machine Learning libraries
sklearn, SVC, LabelEncoder, StandardScaler, normalize, ExtraTreesClassifier, GridSearchCV
- Machine Learning Models
LogisticRegression, DecisionTreeClassifier, BernoulliNB, KNeighborsClassifier, OneVsRestClassifier
train_test_split, label_binarize
- Other Libraries
Counter, SMOTE, CountVectorizer
⌛️ Dataset features
uniq_id, product_name, manufacturer, price, number_available_in_stock, number_of_reviews, number_of_answered_questions, average_review_rating, amazon_category_and_sub_category, customers_who_bought_this_also_bought, description, product_information, product_description, items_customers_buy_after_viewing_this_item, customer_questions_and_answers, customer_reviews, sellers
Going further in the exploratory data analysis of the texts, we try to understand which features contribute to the sentiment category.
Prior analysis assumptions:
- The higher the rating, the more positive the sentiment
- There are many positive-sentiment reviews, which leads to bias
- These assumptions will be verified with our plots; we will also do text analysis
NLTK's stop word list contains words such as not, hasn't and wouldn't, which actually convey a negative sentiment. Removing them would end up contradicting the target variable (sentiment). So I have curated a stop word list that excludes any word carrying negative sentiment or negation.
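The idea of keeping negations can be sketched as follows. This is a minimal illustration, not the project's actual list: `BASE_STOP_WORDS` is a small hypothetical stand-in for NLTK's English stop words, and `NEGATION_WORDS` and `remove_stop_words` are names assumed for this example.

```python
import re

# Hypothetical stand-in for NLTK's English stop word list; the real list
# is longer and needs the NLTK "stopwords" corpus to be downloaded.
BASE_STOP_WORDS = {
    "the", "a", "an", "is", "it", "this", "was", "and", "to", "of",
    "not", "no", "nor", "hasn't", "wouldn't", "don't", "isn't",
}

# Words that carry negation and must survive cleaning.
# Illustrative selection only, not the author's exact curated list.
NEGATION_WORDS = {"not", "no", "nor", "hasn't", "wouldn't", "don't", "isn't"}

CURATED_STOP_WORDS = BASE_STOP_WORDS - NEGATION_WORDS


def remove_stop_words(review: str) -> str:
    """Lowercase a review and drop curated stop words, keeping negations."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return " ".join(t for t in tokens if t not in CURATED_STOP_WORDS)


cleaned = remove_stop_words("This toy is not good")
print(cleaned)
```

With the full NLTK list, the same set difference would keep "not" in the cleaned review, so "not good" is not reduced to "good".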
Create polarity, review length and word count
Polarity: a sentiment score computed with TextBlob, in the range [-1, 1], where -1 is negative and 1 is positive
Review length: the number of characters in the review, including letters and spaces
Word count: the number of words in the customer review column
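A minimal sketch of these three features for a single review is shown below. The function name `text_features` and the dictionary keys are assumptions made for illustration; the polarity is only computed when TextBlob is installed, otherwise it is left as `None`.

```python
try:
    from textblob import TextBlob  # optional; used only for polarity
except ImportError:
    TextBlob = None


def text_features(review: str) -> dict:
    """Return polarity, review length and word count for one review."""
    polarity = None
    if TextBlob is not None:
        # TextBlob's default analyzer returns a polarity in [-1.0, 1.0].
        polarity = TextBlob(review).sentiment.polarity
    return {
        "polarity": polarity,
        "review_len": len(review),          # characters, spaces included
        "word_count": len(review.split()),  # whitespace-separated tokens
    }


features = text_features("Great toy, my son loves it")
print(features)
```

On a pandas DataFrame, the same function could be applied to the `customer_reviews` column row by row.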
Before we build the model for our sentiment analysis, the review texts must be converted into vectors, as a computer cannot work directly with words and their sentiment. In this project, we use the TF-IDF method to convert the texts.
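To show what TF-IDF computes, here is a hand-rolled sketch of the textbook formula (term frequency times the log of inverse document frequency). It is not scikit-learn's `TfidfVectorizer`, which the project lists and which by default applies a smoothed IDF and L2 normalisation, so the numbers differ. The function name `tfidf` and the sample reviews are assumptions for this example.

```python
import math
from collections import Counter


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Textbook TF-IDF: tf = count / doc length, idf = ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of reviews that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


reviews = ["great toy", "bad toy", "great great value"]
vectors = tfidf(reviews)
print(vectors[1])
```

In the second review, "bad" appears in only one document and so receives a higher weight than "toy", which appears in two.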
Encoding target variable-sentiment
Stemming is a method of deriving the root word from an inflected word. Here we extract the customer reviews and convert each word to its root.
There is another technique known as lemmatization, which converts words into root words that have a semantic meaning.
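A small sketch of stemming a review with NLTK's `PorterStemmer`, which the project lists among its libraries. The helper name `stem_review` and the regex-based tokenisation are assumptions for this example; the project itself may tokenise with `word_tokenize` instead.

```python
import re

from nltk.stem import PorterStemmer  # algorithmic; needs no corpus download

stemmer = PorterStemmer()


def stem_review(review: str) -> str:
    """Lowercase a review, split it into words and stem each word."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return " ".join(stemmer.stem(token) for token in tokens)


stemmed = stem_review("Kids loved playing with these toys")
print(stemmed)
```

A stem is not guaranteed to be a dictionary word, which is the main practical difference from lemmatization.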
We noticed that we got many more positive sentiments than negative and neutral ones, so it is crucial to balance the classes. SMOTE (Synthetic Minority Oversampling Technique) is used to address the imbalanced dataset problem. It balances the class distribution by generating synthetic minority-class examples, interpolating between an existing minority sample and its nearest minority neighbours, rather than simply replicating existing examples.
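The interpolation step behind SMOTE can be sketched in plain Python. This is a simplified illustration under assumed names (`smote_sketch`, `minority`), not the imbalanced-learn implementation; in practice the project would rely on `SMOTE` from `imblearn.over_sampling` and its `fit_resample` method.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(base, neighbour)]
        )
    return synthetic


# Toy 2-D feature vectors standing in for minority-class reviews.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
new_points = smote_sketch(minority, n_new=4)
print(new_points)
```

Each synthetic point is a new vector near existing minority samples, not a copy of one, which is what distinguishes SMOTE from plain random oversampling.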
Sentiment Analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Understanding people’s emotions is essential for businesses, since customers are able to express their thoughts and feelings more openly than ever before. It is quite hard for a human to go through every single line and identify the emotion behind the user experience. With today's machine learning models we can automatically analyse customer feedback, from product reviews and survey responses to social media conversations, which allows businesses to tailor products and services to meet customer needs.
CCT COLLEGE DUBLIN
Higher Diploma in Science in Data Analytics for Business
Under Supervision of: GRAHAM GLANVILLE & MARK MORRISSEY
Released in March 2021.
This project is under the MIT license.
Made with love by Sirlene Andreis 💚🚀