- Project Question and Dataset Description
- Loading and viewing dataset
- Pre-processing
- Exploratory Data Analysis
- Modeling
- Outlier removal and re-shaping dataframe
- Linear Regression
- Random Forest Regression
- Take-aways
- References
The data comes from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Facebook+metrics
1. Question: Which posts have the highest engagement, and can we predict which posts will get higher engagement?
- higher Facebook engagement leads to more downloads and overall engagement with the app
- There have been no changes to Facebook's feed algorithm since the data was posted (in reality, it has already changed)
- Page total likes: how many likes the page had when post went live
- Type: what kind of post (video, link, status, photo)
- Category: not sure, 3 levels
- Post Month
- Post Weekday
- Post Hour
- Paid
- like
- share
- comment
- total interactions
- Post metrics (recorded after posting, EDA only)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv("~/Desktop/Facebook_metrics/dataset_Facebook.csv",delimiter=';')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
Page total likes 500 non-null int64
Type 500 non-null object
Category 500 non-null int64
Post Month 500 non-null int64
Post Weekday 500 non-null int64
Post Hour 500 non-null int64
Paid 499 non-null float64
Lifetime Post Total Reach 500 non-null int64
Lifetime Post Total Impressions 500 non-null int64
Lifetime Engaged Users 500 non-null int64
Lifetime Post Consumers 500 non-null int64
Lifetime Post Consumptions 500 non-null int64
Lifetime Post Impressions by people who have liked your Page 500 non-null int64
Lifetime Post reach by people who like your Page 500 non-null int64
Lifetime People who have liked your Page and engaged with your post 500 non-null int64
comment 500 non-null int64
like 499 non-null float64
share 496 non-null float64
Total Interactions 500 non-null int64
dtypes: float64(3), int64(15), object(1)
memory usage: 74.3+ KB
| | Page total likes | Type | Category | Post Month | Post Weekday | Post Hour | Paid | Lifetime Post Total Reach | Lifetime Post Total Impressions | Lifetime Engaged Users | Lifetime Post Consumers | Lifetime Post Consumptions | Lifetime Post Impressions by people who have liked your Page | Lifetime Post reach by people who like your Page | Lifetime People who have liked your Page and engaged with your post | comment | like | share | Total Interactions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 139441 | Photo | 2 | 12 | 4 | 3 | 0.0 | 2752 | 5091 | 178 | 109 | 159 | 3078 | 1640 | 119 | 4 | 79.0 | 17.0 | 100 |
| 1 | 139441 | Status | 2 | 12 | 3 | 10 | 0.0 | 10460 | 19057 | 1457 | 1361 | 1674 | 11710 | 6112 | 1108 | 5 | 130.0 | 29.0 | 164 |
| 2 | 139441 | Photo | 3 | 12 | 3 | 3 | 0.0 | 2413 | 4373 | 177 | 113 | 154 | 2812 | 1503 | 132 | 0 | 66.0 | 14.0 | 80 |
Because columns 7 to 15 ("Lifetime post total impressions","Lifetime Engaged users", etc.) are recorded after posting, they will not be used for modeling. However, they can give useful information about post reach, and we will be looking at them in EDA.
# axis=1,inplace=True)
We need to fill the values that are blank with 0:
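The cell for this step is not shown here. A minimal sketch of the fill, on a toy frame whose values are invented for illustration (on the real data the equivalent would presumably be `df = df.fillna(0)`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; only the columns that
# contain missing values (Paid, like, share) are represented.
toy = pd.DataFrame({
    "Paid": [0.0, 1.0, np.nan],
    "like": [79.0, np.nan, 66.0],
    "share": [17.0, 29.0, np.nan],
})

# Replace every missing value with 0.
filled = toy.fillna(0)

# Count the missing values that remain after the fill.
print(filled.isnull().sum().sum())
```

Filling with 0 treats a blank `Paid` as non-paid and a blank `like` or `share` as no engagement, which is an assumption about why the values are missing.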
Goal: predict Total Interactions based on features
I have excluded comment, like, and share and instead focused on Total Interactions, which is what we will be modeling for.
An outlier is easily visible for total interactions, at around 6000.
count 499.000000
mean 177.945892
std 323.398742
min 0.000000
25% 56.500000
50% 101.000000
75% 187.500000
max 5172.000000
Name: like, dtype: float64
Photo 426
Status 45
Link 22
Video 7
Name: Type, dtype: int64
#df.drop(df.columns[1],axis=1,inplace = True)
df['Type'] = df['Type'].apply(lambda x: str(x))
plotdf = df.drop(df.columns[7:15],axis =1)
sns.distplot(df['Page total likes'],bins=20)
plt.title("Distribution of Total Page Likes")
plt.xlabel("# of Likes",fontsize=15)
plt.title('Distribution of Like/Post',fontsize=20)
#plt.savefig('NumLikes.png', bbox_inches='tight')
sns.countplot(df['Post Month'],palette='viridis')
#plt.savefig('monthCount.png', bbox_inches='tight')
from collections import OrderedDict
f, ax = plt.subplots()
sns.countplot(df['Post Weekday'],palette='viridis')
handles = ["Su", "M", "Tu", "W", "Th", "F",'Sa']
labels = [0,1,2,3,4,5,6]
plt.xticks(labels, handles)
sns.despine(offset = 5, trim = True)
plt.title("Frequency of Posts by Weekday")
#plt.savefig('dayCount.png', bbox_inches='tight')
f, ax = plt.subplots()
sns.countplot(df['Post Hour'],palette='viridis',)
#handles = ["Su", "M", "Tu", "W", "Th", "F",'Sa']
#labels = [0,1,2,3,4,5,6]
#plt.xticks(labels, handles)
sns.despine(offset = 5, trim = True)
plt.title("Frequency of Posts by Hour")
#plt.savefig('hourCount.png', bbox_inches='tight')
sns.boxplot(x='Post Weekday',y='like',data=df,palette='viridis')
plt.title("Distribution of Likes/Post by Weekday Posted")
plt.ylabel("# of Likes")
#plt.savefig('dayBox.png', bbox_inches='tight')
sns.boxplot(x='Post Weekday',y='like',hue='Paid',data=df,palette='viridis')
plt.title("Distribution of Post Likes by Weekday Posted")
plt.ylabel("# of Likes")
sns.countplot(x='Post Weekday',hue='Paid',data=df,palette='viridis')
plt.title("Number of Posts by Weekday Posted and Paid Status")
Facebook uses its own PrestoDB, in which the day of the week is numbered 1-7 and Sunday is the first day of the week.
sns.boxplot(x='Post Hour',y='like',data=df,palette='viridis')
plt.title("Distribution of Post Likes by Hour Posted")
plt.ylabel("# of Likes")
plt.savefig('hourBox.png', bbox_inches='tight')
sns.boxplot(x='Post Hour',y='like',hue='Paid',data=df,palette='viridis')
plt.title("Distribution of Post Likes by Hour Posted and Paid Status")
plt.ylabel("# of Likes")
sns.countplot(x='Post Hour',hue='Paid',data=df,palette='viridis')
plt.title("Number of Posts by Hour Posted and Paid Status")
sns.stripplot(x='Post Hour',y='like',data=df,palette='viridis')
plt.title("Post Likes by Hour Posted")
plt.ylabel("# of Likes")
sns.boxplot(x='Post Month',y='like',data=df,palette='viridis')
plt.title("Distribution of Post Likes by Month Posted")
plt.ylabel("# of Likes")
plt.savefig('monthBox.png', bbox_inches='tight')
When did the majority of posts go live?
sns.boxplot(x='Post Month',y='like',hue='Paid',data=df,palette='viridis')
plt.title("Distribution of Post Likes by Month Posted and Paid Status")
plt.ylabel("# of Likes")
sns.lmplot(x='Page total likes',y='like',
plt.ylabel("# of Likes")
plt.title('# Page Likes vs. # Post Likes')
#plt.savefig('pageLikeScatter.png', bbox_inches='tight')
Does the number of likes per post increase as the page's total likes increase? It depends. There is a weak positive trend for paid posts but no increase for non-paid posts.
plt.ylabel("# of Likes")
plt.title("Number of Likes vs. Post Category")
plt.savefig('catCount.png', bbox_inches='tight')
plt.ylabel("# of Likes")
plt.title("Number of Likes vs. Post Category")
plt.savefig('catBox.png', bbox_inches='tight')
plt.ylabel("# of likes")
plt.legend(loc='upper left')
plt.title("Likes/Post by Category and if Posts were Paid")
plt.savefig('catPaidCount.png', bbox_inches='tight')
- Categories 2 and 3 performed similarly, while Category 1 performed the worst
- Most posts were Category 1 posts
- Paid Category 1 posts saw the largest increase, followed by 2 and then 3.
plt.title("Likes/Post by Post Type")
plt.ylabel("# of likes")
#plt.legend(loc='upper left')
plt.savefig('typeBox.png', bbox_inches='tight')
plt.title("Number of Posts by Type")
plt.ylabel("# of posts")
#plt.legend(loc='upper left')
plt.savefig('typeCount.png', bbox_inches='tight')
plt.title("Total Interactions by Post Type")
plt.ylabel("# of likes")
plt.savefig('typePaidBox.png', bbox_inches='tight')
plt.title("Frequency of Post Type by Paid")
plt.savefig('typePaidCount.png', bbox_inches='tight')
Video posts had the highest mean, median, and percentiles. Photo posts had the largest range, from 0 to 6000.
- On average video posts had higher engagement
- Photo posts had the largest range
- This suggests that total interactions can depend on the photo posted
- Links performed the worst, with the lowest mean, range, and median
- Status posts performed second best in terms of mean engagement, with the second-largest range.
- No difference in paid link posts
- Paid Status Posts had higher likes on average when compared to non-paid status posts
- Possibly due to how well status updates integrate in newsfeed
- Paid Photo Posts performed marginally better than non-paid posts on average, though non-paid posts had a larger range (more outliers)
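The comparisons in the bullets above rest on per-type summary statistics. A minimal sketch of how they could be tabulated, using a toy frame with invented values (the column names `Type`, `Paid`, and `like` follow the dataset; the numbers do not):

```python
import pandas as pd

# Toy data: values are invented and do not reproduce the real results.
toy = pd.DataFrame({
    "Type": ["Photo", "Photo", "Photo", "Status", "Status", "Video", "Link"],
    "Paid": [0, 1, 0, 0, 1, 0, 0],
    "like": [80, 120, 500, 130, 200, 330, 40],
})

# Mean, median, and range of likes for each post type.
by_type = toy.groupby("Type")["like"].agg(["mean", "median", "min", "max"])
by_type["range"] = by_type["max"] - by_type["min"]

# Median likes split by post type (rows) and paid status (columns).
by_type_paid = toy.groupby(["Type", "Paid"])["like"].median().unstack()

print(by_type)
print(by_type_paid)
```

Combinations that never occur in the data, such as a paid Video post in this toy frame, show up as `NaN` in the second table.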
plt.title("likes/Post by Paid/Non-Paid Posts")
plt.ylabel("# of likes")
plt.savefig('paidBox.png', bbox_inches='tight')
from numpy import median
plt.title("likes/Post by Paid/Non-Paid Posts")
plt.ylabel("# of Posts")
plt.savefig('paidCount.png', bbox_inches='tight')
- There were more non-paid posts than paid posts
- The median paid post received about 30 more likes than the median non-paid post
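A median gap of this kind can be computed with a single group-by. The sketch below uses invented numbers, so the gap it prints is not the one reported above:

```python
import pandas as pd

# Toy data: three non-paid (0) and three paid (1) posts with invented like counts.
toy = pd.DataFrame({
    "Paid": [0, 0, 0, 1, 1, 1],
    "like": [50, 90, 100, 100, 130, 400],
})

# Median likes per group, then the difference between paid and non-paid.
medians = toy.groupby("Paid")["like"].median()
median_gap = medians.loc[1] - medians.loc[0]
print(median_gap)
```

The median is used rather than the mean because a few very popular posts, like the 400-like post here, would otherwise dominate the comparison.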
Index([u'Page total likes', u'Type', u'Category', u'Post Month',
u'Post Weekday', u'Post Hour', u'Paid', u'comment', u'like', u'share',
u'Total Interactions', u'Video', u'Status', u'Photo', u'Cat_1',
timePivot = pd.pivot_table(df,aggfunc='median',
columns='Post Hour',
index='Post Weekday',
timePivot = timePivot[[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]]
cmap= sns.cubehelix_palette(8, start=.5, rot=-.75,as_cmap=True)
sns.heatmap(timePivot,cbar=False,cmap=cmap,annot=True, fmt='g')
#plt.pcolor(lnch_pivot,cmap=plt.cm.Blues, alpha=0.8)
plt.ylabel('Post Weekday',fontsize=20)
plt.xlabel('Post Hour',fontsize=20)
plt.title('Median Likes Per Post by Post Weekday and Hour',fontsize=20)
plt.savefig('medianLikeHeatmap.png', bbox_inches='tight')
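The `pd.pivot_table` call above is cut off in this export. As a sketch of the general pattern only, the example below builds a weekday-by-hour table of median likes from a toy frame; `values="like"` is an assumption inferred from the plot title, and the data is invented:

```python
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "Post Weekday": [1, 1, 2, 2, 2, 3],
    "Post Hour": [3, 3, 10, 10, 13, 3],
    "like": [40, 60, 100, 300, 80, 20],
})

# One row per weekday, one column per hour, each cell the median like count.
time_pivot = pd.pivot_table(
    toy,
    values="like",          # assumed target column
    index="Post Weekday",
    columns="Post Hour",
    aggfunc="median",
)
print(time_pivot)
```

Weekday and hour combinations with no posts come out as `NaN`, which a heatmap leaves blank. Cells backed by only one or two posts are not reliable medians, so it helps to read the heatmap alongside the post counts.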
from sklearn import linear_model
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
To deal with the sizeable number of outliers, I will remove any posts whose like count is at or above the 90th percentile.
After that, I will expand the categorical columns into binary features. To avoid multicollinearity, I will keep n-1 columns for each feature.
outlierCut = np.percentile(df['like'],90)
df = df[df['like']<outlierCut]
scaler = StandardScaler()
# StandardScaler expects 2-D input, so pass the column as a one-column DataFrame
df['Page total likes'] = scaler.fit_transform(df[['Page total likes']]).ravel()
The function below will translate the weekdays to their labels, rather than 1-7.
def Weekday(x):
    if x == 1:
        return 'Su'
    elif x == 2:
        return 'Mo'
    elif x == 3:
        return 'Tu'
    elif x == 4:
        return 'We'
    elif x == 5:
        return 'Th'
    elif x == 6:
        return 'Fr'
    elif x == 7:
        return 'Sa'
df['Weekday'] = df['Post Weekday'].apply(lambda x: Weekday(x))
dayDf = pd.get_dummies(df['Weekday'])
df = pd.concat([df,dayDf],axis=1)
# Build the labels 'hr_0' ... 'hr_17'
hours = ['hr_' + str(h) for h in range(0, 18)]
hourDf = pd.get_dummies(df['Post Hour'],prefix='hr_')
df = pd.concat([df,hourDf],axis=1)
monthDf = pd.get_dummies(df['Post Month'],prefix='Mo')
df = pd.concat([df,monthDf],axis=1)
df['Video'] = pd.get_dummies(df['Type'])['Video']
df['Status'] = pd.get_dummies(df['Type'])['Status']
df['Photo'] = pd.get_dummies(df['Type'])['Photo']
df['Cat_1'] = pd.get_dummies(df['Category'])[1]
df['Cat_2'] = pd.get_dummies(df['Category'])[2]
#To avoid multicollinearity with the post types I am not including Links.
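As an alternative to selecting the indicator columns by hand, `pd.get_dummies` can drop one level itself. This sketch on a toy `Type` column is not the approach used in the cells above:

```python
import pandas as pd

# Toy column with the four post types from the dataset.
toy = pd.DataFrame({"Type": ["Photo", "Status", "Link", "Video", "Photo"]})

# drop_first=True drops the first level in sorted order ("Link" here),
# leaving n-1 indicator columns for n levels.
type_dummies = pd.get_dummies(toy["Type"], drop_first=True)
print(list(type_dummies.columns))
```

A Link post is then represented by zeros in all three remaining columns. This matches the choice above to leave Links out as the reference level.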
| | Page total likes | Type | Category | Post Month | Post Weekday | Post Hour | Paid | Lifetime Post Total Reach | Lifetime Post Total Impressions | Lifetime Engaged Users | ... | Mo_8 | Mo_9 | Mo_10 | Mo_11 | Mo_12 | Video | Status | Photo | Cat_1 | Cat_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.012339 | Photo | 2 | 12 | 4 | 3 | 0.0 | 2752 | 5091 | 178 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1.012339 | Status | 2 | 12 | 3 | 10 | 0.0 | 10460 | 19057 | 1457 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 1.012339 | Photo | 3 | 12 | 3 | 3 | 0.0 | 2413 | 4373 | 177 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1.012339 | Photo | 2 | 12 | 2 | 3 | 0.0 | 7244 | 13594 | 671 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 5 | 1.012339 | Status | 2 | 12 | 1 | 9 | 0.0 | 10472 | 20849 | 1191 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |

5 rows × 66 columns
x = df[['Page total likes','Paid','Video','Status','Photo',
'hr__17','hr__1','hr__2','hr__3','hr__4','hr__5', 'hr__6','hr__7','hr__8',
y = df['like']
The train/test split uses a test size of 0.3 and a random state of 50. With roughly 500 samples, a 30% hold-out leaves enough rows for both training and evaluation.
x_train,x_test,y_train, y_test = train_test_split(x,
y, test_size=0.3,
Index([u'Page total likes', u'Paid', u'Video', u'Status', u'Photo', u'Cat_1',
u'Cat_2', u'Mo', u'Tu', u'Sa', u'We', u'Th', u'Fr', u'hr__17', u'hr__1',
u'hr__2', u'hr__3', u'hr__4', u'hr__5', u'hr__6', u'hr__7', u'hr__8',
u'hr__9', u'hr__10', u'hr__11', u'hr__12', u'hr__13', u'hr__14',
u'hr__15', u'hr__16', u'Mo_1', u'Mo_2', u'Mo_12', u'Mo_4', u'Mo_5',
u'Mo_6', u'Mo_7', u'Mo_8', u'Mo_9', u'Mo_11', u'Mo_10'],
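The split call above is cut off in this export. A self-contained sketch of a 70/30 split with the stated `test_size=0.3` and `random_state=50`, on synthetic features rather than the real `x` and `y`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the like counts.
rng = np.random.RandomState(0)
x = pd.DataFrame({
    "Page total likes": rng.normal(size=100),
    "Paid": rng.randint(0, 2, size=100),
})
y = pd.Series(rng.poisson(100, size=100), name="like")

# Hold out 30% of the rows for testing; the fixed seed makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=50
)
print(x_train.shape, x_test.shape)
```

Fixing `random_state` matters here because the dataset is small: different splits can give noticeably different test scores.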
reg = linear_model.LinearRegression(normalize=True)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
array([ 6.62755296e+01, 2.04249966e+01, 1.20935659e+02,
6.47837430e+01, 3.58486420e+01, -3.69391903e+01,
-3.01933766e+01, -2.22099310e+01, -9.81411164e+00,
-1.91510126e+01, -5.87872308e+00, -7.15162930e+00,
-1.60445105e+01, 7.74818880e+01, 1.07194064e+02,
-5.88115195e-01, 2.49142215e+01, 2.58633358e+01,
4.69715758e+01, 4.78092128e+00, 1.18189111e+01,
1.19654452e+01, 3.45213517e+01, 2.97217808e+01,
2.58060841e+01, -8.82700496e+00, 3.86708467e+01,
1.76495671e+00, 2.43896549e+01, -2.68267979e-14,
9.09873691e+01, 8.90868100e+01, -1.18923199e+02,
-1.26603031e+01, -8.41581204e+00, -9.03806384e+01,
-8.31484193e+01, -1.06783529e+02, -9.10567480e+01,
-1.11633228e+02, -1.28826344e+02])
pred = reg.predict(x_test)
pred_train = reg.predict(x_train)
testScore = r2_score(y_pred=pred,y_true=y_test)
trainScore = r2_score(y_pred=pred_train,y_true=y_train)
rfResults = pd.DataFrame()
rfResults['Score'] = [trainScore,testScore]
rfResults['Step'] = ['train','test']
| | Score | Step |
|---|---|---|
| 0 | 0.216025 | train |
| 1 | 0.046039 | test |
plt.title('R^2 Scores')
The linear regression model performed poorly overall.
- Overfitting: R^2 falls from 0.216 on the training set to 0.046 on the test set
from sklearn.ensemble import RandomForestRegressor
For the Random Forest we will use a minimum samples split of 10 to reduce overfitting.
rf = RandomForestRegressor(n_estimators=500,min_samples_split=10)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=10, min_weight_fraction_leaf=0.0,
n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)
from sklearn.metrics import r2_score
from scipy.stats import spearmanr, pearsonr
predicted_train = rf.predict(x_train)
predicted_test = rf.predict(x_test)
test_score = r2_score(y_test, predicted_test)
spearman = spearmanr(y_test, predicted_test)
pearson = pearsonr(y_test, predicted_test)
print('Test data R-2 score: {}'.format(test_score))
print('Test data Spearman correlation: {}'.format(spearman[0]))
print('Test data Pearson correlation: {}'.format(pearson[0]))
train_score = r2_score(y_train, predicted_train)
spearmanTrain = spearmanr(y_train, predicted_train)
pearsonTrain = pearsonr(y_train, predicted_train)
print(' ')
print('Train data R-2 score: {}'.format(train_score))
print('Train data Spearman correlation: {}'.format(spearmanTrain[0]))
print('Train data Pearson correlation: {}'.format(pearsonTrain[0]))
Test data R-2 score: 0.135590561118
Test data Spearman correlation: 0.448279100475
Test data Pearson correlation: 0.37984956077
Train data R-2 score: 0.594805178695
Train data Spearman correlation: 0.828225597588
Train data Pearson correlation: 0.817565023318
RFperf = pd.DataFrame()
RFperf['Score'] = [train_score,test_score]
RFperf['Step'] = ['train','test']
| | Score | Step |
|---|---|---|
| 0 | 0.594805 | train |
| 1 | 0.135591 | test |
plt.title('R^2 Scores')
We had solid performance on the training set, with:
- 0.59 R^2 value
- 0.828 Spearman correlation

But the model fell apart on the test set, showing clear signs of overfitting:
- 0.14 R^2 value
- 0.448 Spearman correlation
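A train/test gap like this can also be checked with the forest's out-of-bag (OOB) score, which evaluates each tree on the rows left out of its bootstrap sample. The sketch below runs on synthetic data and is not part of the original analysis; `oob_score=True` is a real `RandomForestRegressor` option, while the data and the other settings are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem: one informative feature plus noise.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

# oob_score=True asks the forest to score itself on out-of-bag rows.
rf = RandomForestRegressor(
    n_estimators=200, min_samples_split=10, oob_score=True, random_state=0
)
rf.fit(X, y)

train_r2 = rf.score(X, y)   # R^2 on the rows the forest was trained on
oob_r2 = rf.oob_score_      # R^2 estimated from out-of-bag predictions
print(train_r2, oob_r2)
```

The training R^2 is optimistic because every row was seen by some of the trees. The OOB estimate behaves more like a held-out score and needs no separate test set.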
predicted_test = rf.predict(x_test)
fI = pd.DataFrame()
fI['Variable'] = list(x_train.columns)
fI['Importance'] = rf.feature_importances_
| | Variable | Importance |
|---|---|---|
| 0 | Page total likes | 0.207723 |
| 5 | Cat_1 | 0.107980 |
| 1 | Paid | 0.048507 |
| 40 | Mo_10 | 0.044828 |
| 9 | Sa | 0.043304 |
| 26 | hr__13 | 0.041345 |
| 8 | Tu | 0.030655 |
| 36 | Mo_7 | 0.030090 |
| 17 | hr__4 | 0.029611 |
| 11 | Th | 0.027883 |
| 23 | hr__10 | 0.026175 |
| 12 | Fr | 0.025307 |
| 34 | Mo_5 | 0.021988 |
| 6 | Cat_2 | 0.021105 |
| 22 | hr__9 | 0.020185 |
topVars= list(fI.sort_values(by='Importance',ascending=False)[0:15]['Variable'])
['Page total likes',
x = df[topVars]
x_train,x_test,y_train, y_test = train_test_split(x,
y, test_size=0.3,
rf = RandomForestRegressor(n_estimators=500,min_samples_split=15)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=15, min_weight_fraction_leaf=0.0,
n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)
from sklearn.metrics import r2_score
from scipy.stats import spearmanr, pearsonr
predicted_train = rf.predict(x_train)
predicted_test = rf.predict(x_test)
test_score = r2_score(y_test, predicted_test)
spearman = spearmanr(y_test, predicted_test)
pearson = pearsonr(y_test, predicted_test)
#print(f'Out-of-bag R-2 score estimate: {rf.oob_score_:>5.3}')
print('Test data R-2 score: {}'.format(test_score))
print('Test data Spearman correlation: {}'.format(spearman[0]))
print('Test data Pearson correlation: {}'.format(pearson[0]))
train_score = r2_score(y_train, predicted_train)
spearmanTrain = spearmanr(y_train, predicted_train)
pearsonTrain = pearsonr(y_train, predicted_train)
print(' ')
#print(f'Out-of-bag R-2 score estimate: {rf.oob_score_:>5.3}')
print('Train data R-2 score: {}'.format(train_score))
print('Train data Spearman correlation: {}'.format(spearmanTrain[0]))
print('Train data Pearson correlation: {}'.format(pearsonTrain[0]))
Test data R-2 score: 0.0735779002202
Test data Spearman correlation: 0.366331840555
Test data Pearson correlation: 0.30490941908
Train data R-2 score: 0.463470449251
Train data Spearman correlation: 0.709508194683
Train data Pearson correlation: 0.71481692673
The model performed substantially worse when taking the top 15 features by importance from the old model.
Refitting the random forest on only the most important variables brought no improvement, which suggests that the data is not rich enough to predict likes from the information available here.
- Paid posts on average have higher engagement
- Paid Status/Photo and Category 1 and 2 posts show the largest improvement, so
  - more paid posts of these types
- Video posts have the highest average engagement, and it's not close
  - more Video posts
- Posting Monday in the mid-morning and Wednesday in the early morning
  - Identify and post during times of high engagement
- Place importance on increasing Page Likes
  - higher probability that a post has “high” engagement
- Need a richer dataset to predict accurately
  - identify potentially useful data points and collect them
(Moro et al., 2016) Moro, S., Rita, P., & Vala, B. (2016). Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, 69(9), 3341-3351.