Analyzing and Predicting Facebook Post Engagement

Project Structure

  1. Project Question and Dataset Description
  2. Loading and viewing dataset
  3. Pre-processing
  4. Exploratory Data Analysis
  5. Modeling
    • Outlier removal and re-shaping dataframe
    • Linear Regression
    • Random Forest Regression
  6. Take-aways
  7. References

Data comes from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Facebook+metrics

1. Question: Which posts have the highest engagement, and can we predict which posts will get higher engagement?

Assumptions:

  • Higher Facebook engagement leads to more downloads and more overall engagement with the app
  • Facebook's feed algorithm has not changed since the data was collected (in reality, it already has)

Dataset info

Variables
  • Page total likes: how many likes the page had when the post went live
  • Type: what kind of post (video, link, status, photo)
  • Category: content category (3 levels; the exact meaning is not documented in the source)
  • Post Month
  • Post Weekday
  • Post Hour
  • Paid
  • like
  • share
  • comment
  • total interactions
  • Post metrics (recorded after posting, EDA only)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

2. Loading and Viewing the data

df = pd.read_csv("~/Desktop/Facebook_metrics/dataset_Facebook.csv",delimiter=';')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 19 columns):
Page total likes                                                       500 non-null int64
Type                                                                   500 non-null object
Category                                                               500 non-null int64
Post Month                                                             500 non-null int64
Post Weekday                                                           500 non-null int64
Post Hour                                                              500 non-null int64
Paid                                                                   499 non-null float64
Lifetime Post Total Reach                                              500 non-null int64
Lifetime Post Total Impressions                                        500 non-null int64
Lifetime Engaged Users                                                 500 non-null int64
Lifetime Post Consumers                                                500 non-null int64
Lifetime Post Consumptions                                             500 non-null int64
Lifetime Post Impressions by people who have liked your Page           500 non-null int64
Lifetime Post reach by people who like your Page                       500 non-null int64
Lifetime People who have liked your Page and engaged with your post    500 non-null int64
comment                                                                500 non-null int64
like                                                                   499 non-null float64
share                                                                  496 non-null float64
Total Interactions                                                     500 non-null int64
dtypes: float64(3), int64(15), object(1)
memory usage: 74.3+ KB
df.head(3)
Page total likes Type Category Post Month Post Weekday Post Hour Paid Lifetime Post Total Reach Lifetime Post Total Impressions Lifetime Engaged Users Lifetime Post Consumers Lifetime Post Consumptions Lifetime Post Impressions by people who have liked your Page Lifetime Post reach by people who like your Page Lifetime People who have liked your Page and engaged with your post comment like share Total Interactions
0 139441 Photo 2 12 4 3 0.0 2752 5091 178 109 159 3078 1640 119 4 79.0 17.0 100
1 139441 Status 2 12 3 10 0.0 10460 19057 1457 1361 1674 11710 6112 1108 5 130.0 29.0 164
2 139441 Photo 3 12 3 3 0.0 2413 4373 177 113 154 2812 1503 132 0 66.0 14.0 80

Because the "Lifetime" columns (columns 7 through 14: "Lifetime Post Total Impressions", "Lifetime Engaged Users", etc.) are recorded after posting, they will not be used for modeling. However, they do give useful information about post reach, so we will look at them in EDA.

#df.drop(df.columns[7:15],
#        axis=1,inplace=True)
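
A more explicit, name-based alternative (a sketch, not what was run here) keeps the full dataframe for EDA and builds a separate view for modeling by selecting every column whose name starts with "Lifetime":

# Hypothetical name-based alternative to the positional drop above.
lifetime_cols = [c for c in df.columns if c.startswith('Lifetime')]
model_df = df.drop(lifetime_cols, axis=1)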

We need to fill the values that are blank with 0:

Goal: predict post engagement from the features known before posting

Comment, like, and share sum to Total Interactions, so they are excluded from the predictor set. Since likes account for the bulk of Total Interactions, the modeling section below targets the number of likes per post.

An outlier is easily visible for total interactions, at around 6000.
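
As a quick sanity check before pre-processing, the missing values and the largest Total Interactions values can be inspected directly (a minimal sketch):

# Count missing values per column and peek at the largest Total Interactions values.
print(df.isnull().sum())
print(df['Total Interactions'].nlargest(5))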

Pre-processing

df['like'].describe()
count     499.000000
mean      177.945892
std       323.398742
min         0.000000
25%        56.500000
50%       101.000000
75%       187.500000
max      5172.000000
Name: like, dtype: float64
df['Type'].value_counts()
Photo     426
Status     45
Link       22
Video       7
Name: Type, dtype: int64
df['Paid'].value_counts()
df['like'].fillna(0,inplace=True)
df['share'].fillna(0,inplace=True)
df['Paid'].fillna(0,inplace=True)
#df.drop(df.columns[1],axis=1,inplace = True)

EDA

df['Type'] = df['Type'].apply(lambda x: str(x))
plotdf = df.drop(df.columns[7:15],axis =1)
sns.pairplot(data=plotdf)
<seaborn.axisgrid.PairGrid at 0x14ad25050>

png

plt.figure(figsize=(12,10))
sns.heatmap(df.corr(),cmap='viridis',annot=True,cbar=False)
<matplotlib.axes._subplots.AxesSubplot at 0x17bba0f90>

png

Univariate- Page Likes and Post Likes

sns.distplot(df['Page total likes'],bins=20)
plt.title("Distribution of Total Page Likes")
<matplotlib.text.Text at 0x169333f10>

png

plt.figure(figsize=(14,8))
sns.distplot(df[df['like']<1000]['like'],bins=20,color='dodgerblue',hist_kws={'alpha':0.6})
plt.xlim(0,1000)
plt.xlabel("# of Likes",fontsize=15)
plt.ylabel('Frequency',fontsize=15)
plt.title('Distribution of Like/Post',fontsize=20)
#plt.savefig('NumLikes.png', bbox_inches='tight')
<matplotlib.text.Text at 0x1770f2450>

png

Analysis of Time and Category Variables

sns.countplot(df['Category'],palette='viridis')
<matplotlib.axes._subplots.AxesSubplot at 0x17c0b6f10>

png

sns.countplot(df['Post Month'],palette='viridis')
#plt.savefig('monthCount.png', bbox_inches='tight')
<matplotlib.axes._subplots.AxesSubplot at 0x175cb7cd0>

png

sns.set_style("whitegrid")

f, ax = plt.subplots()
sns.countplot(df['Post Weekday'],palette='viridis')
day_labels = ["Su", "M", "Tu", "W", "Th", "F", "Sa"]
tick_positions = [0, 1, 2, 3, 4, 5, 6]
plt.xticks(tick_positions, day_labels)
ax.set_ylabel("Frequency")
sns.despine(offset = 5, trim = True)
plt.title("Frequency of Posts by Weekday")
#plt.savefig('dayCount.png', bbox_inches='tight')
<matplotlib.text.Text at 0x16d6b99d0>

png

sns.set_style("whitegrid")

f, ax = plt.subplots()
sns.countplot(df['Post Hour'],palette='viridis',)
#handles = ["Su", "M", "Tu", "W", "Th", "F",'Sa']
#labels = [0,1,2,3,4,5,6]
#plt.xticks(labels, handles)
ax.set_ylabel("Frequency")
sns.despine(offset = 5, trim = True)
plt.title("Frequency of Posts by Hour")
#plt.savefig('hourCount.png', bbox_inches='tight')
<matplotlib.text.Text at 0x14aee7fd0>

png

plt.figure(figsize=(8,6))
sns.boxplot(x='Post Weekday',y='like',data=df,palette='viridis')
plt.ylim(0,750)
plt.title("Distribution of Likes/Post by Weekday Posted")
plt.ylabel("# of Likes")
#plt.savefig('dayBox.png', bbox_inches='tight')
<matplotlib.text.Text at 0x16b00ee50>

png

plt.figure(figsize=(8,6))
sns.boxplot(x='Post Weekday',y='like',hue='Paid',data=df,palette='viridis')
plt.ylim(0,2200)
plt.title("Distribution of Post Likes by Weekday Posted")
plt.ylabel("# of Likes")
<matplotlib.text.Text at 0x131562290>

png

plt.figure(figsize=(8,6))
sns.countplot(x='Post Weekday',hue='Paid',data=df,palette='viridis')
plt.title("Distribution of Total Interactions by Weekday Posted")
<matplotlib.text.Text at 0x12e0c2f90>

png

Facebook uses its own PrestoDB, which encodes the day of the week as 1-7, with Sunday as the first day of the week.

plt.figure(figsize=(8,6))
sns.boxplot(x='Post Hour',y='like',data=df,palette='viridis')
plt.ylim(0,750)
plt.title("Distribution of Post Likes by Hour Posted")
plt.ylabel("# of Likes")
plt.savefig('hourBox.png', bbox_inches='tight')

png

plt.figure(figsize=(8,6))
sns.boxplot(x='Post Hour',y='like',hue='Paid',data=df,palette='viridis')
plt.ylim(0,1750)
plt.title("Distribution of Total Interactions by Hour Posted")
plt.ylabel("# of Likes")
<matplotlib.text.Text at 0x130e8e990>

png

plt.figure(figsize=(8,6))
sns.countplot(x='Post Hour',hue='Paid',data=df,palette='viridis')
#plt.ylim(0,2200)
plt.title("Distribution of Total Interactions by Hour Posted")
<matplotlib.text.Text at 0x12af34850>

png

plt.figure(figsize=(8,6))
sns.stripplot(x='Post Hour',y='like',data=df,palette='viridis')
plt.ylim(0,2200)
plt.title("Distribution of Total Interactions by Hour Posted")
plt.ylabel("# of Likes")
<matplotlib.text.Text at 0x12fe58290>

png

plt.figure(figsize=(8,6))
sns.boxplot(x='Post Month',y='like',data=df,palette='viridis')
plt.ylim(0,750)
plt.title("Distribution of Total Interactions by Month Posted")
plt.ylabel("# of Likes")
plt.savefig('monthBox.png', bbox_inches='tight')

png

When did the majority of posts go live?

plt.figure(figsize=(8,6))
sns.boxplot(x='Post Month',y='like',hue='Paid',data=df,palette='viridis')
plt.ylim(0,2200)
plt.title("Distribution of Total Interactions by Month Posted")
plt.ylabel("# of Likes")
<matplotlib.text.Text at 0x136070110>

png

Total Page Likes vs. Post likes

sns.lmplot(x='Page total likes',y='like',
           hue='Paid',data=df,fit_reg=True,
          scatter_kws={'alpha':0.6},size=(6))
plt.legend()
plt.ylim(0,1000)
plt.ylabel("# of Likes")
plt.title('# Page Likes vs. # Post Likes')
#plt.savefig('pageLikeScatter.png', bbox_inches='tight')
<matplotlib.text.Text at 0x1697a1750>

png

Does the number of likes per post increase as total page likes increase? It depends: there is a weak positive trend for paid posts, but essentially none for non-paid posts. A rough quantification is sketched below.
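
To put rough numbers on those trends, separate least-squares slopes can be fit for paid and non-paid posts; a minimal sketch using np.polyfit:

# Slope of likes vs. total page likes, fit separately for paid and non-paid posts.
for paid_flag in (0.0, 1.0):
    subset = df[df['Paid'] == paid_flag]
    slope, intercept = np.polyfit(subset['Page total likes'], subset['like'], 1)
    print('Paid = {}: slope = {:.4f}'.format(paid_flag, slope))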

Category vs. Likes

plt.figure(figsize=(8,6))
sns.countplot(x='Category',data=df,palette='viridis')
plt.ylabel("# of Likes")
plt.title("Number of Likes vs. Post Category")
plt.savefig('catCount.png', bbox_inches='tight')

png

plt.figure(figsize=(8,6))
sns.boxplot(x='Category',y='like',data=df,palette='viridis')
plt.ylim(0,750)
plt.ylabel("# of Likes")
plt.title("Number of Likes vs. Post Category")
plt.savefig('catBox.png', bbox_inches='tight')

png

plt.figure(figsize=(8,6))
sns.boxplot(x='Category',y='like',data=df,hue='Paid',palette='viridis')
plt.ylim(0,750)
plt.ylabel("# of likes")
plt.legend(loc='upper left')
plt.title("Likes/Post by Category and if Posts were Paid")
plt.savefig('catPaidCount.png', bbox_inches='tight')

png

Observations:

  • Categories 2 and 3 performed similarly, while Category 1 performed the worst
  • Most posts were Category 1
  • Paid Category 1 posts saw the largest increase, followed by 2 and then 3 (the group medians sketched below back this up)
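
A quick way to back these observations up numerically is to group by Category and Paid and compare median likes; a minimal sketch:

# Median likes per post by Category, split into paid vs. non-paid.
print(df.groupby(['Category', 'Paid'])['like'].median().unstack('Paid'))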

Post Type vs. Likes

plt.figure(figsize=(10,6))
sns.boxplot(x='Type',y='like',data=df,palette='viridis')
plt.ylim(0,750)
#sns.despine(offset=4,bottom=True)
plt.title("Likes/Post by Post Type")
plt.ylabel("# of likes")
#plt.legend(loc='upper left')
plt.savefig('typeBox.png', bbox_inches='tight')

png

plt.figure(figsize=(10,6))
sns.countplot(x='Type',data=df,palette='viridis')
#plt.ylim(0,750)
#sns.despine(offset=4,bottom=True)
plt.title("Number of Posts by Type")
plt.ylabel("# of likes")
#plt.legend(loc='upper left')
plt.savefig('typeCount.png', bbox_inches='tight')

png

plt.figure(figsize=(10,6))
sns.boxplot(x='Type',y='like',hue='Paid',data=df,palette='viridis')
plt.ylim(0,750)
#sns.despine(offset=4,bottom=True)
plt.title("Total Interactions by Post Type")
plt.ylabel("# of likes")
plt.savefig('typePaidBox.png', bbox_inches='tight')

png

plt.figure(figsize=(10,6))
sns.countplot(x='Type',hue='Paid',data=df,palette='viridis')
plt.ylim(0,320)
#sns.despine(offset=4,bottom=True)
plt.title("Frequency of Post Type by Paid")
plt.ylabel("Frequency")
plt.savefig('typePaidCount.png', bbox_inches='tight')

png

Video posts had the highest mean, median, and upper percentiles. Photo posts had by far the largest range.

Take-aways:

  • On average, video posts had higher engagement
  • Photo posts had the largest range
  • This suggests that engagement with a photo post depends heavily on the particular photo posted
  • Links performed the worst, with the lowest mean, median, and range
  • Status posts performed second best in terms of mean engagement, with the second-largest range
  • There was no difference between paid and non-paid link posts
  • Paid status posts had higher likes on average than non-paid status posts
    • Possibly due to how well status updates integrate into the news feed
  • Paid photo posts performed marginally better than non-paid posts on average, though non-paid posts had a larger range (more outliers)

Paid Posts vs. Likes

plt.figure(figsize=(10,6))
sns.boxplot(x='Paid',y='like',data=df,palette='viridis')
plt.ylim(0,500)
#sns.despine(offset=4,bottom=True)
plt.title("likes/Post by Paid/Non-Paid Posts")
plt.ylabel("# of likes")
plt.savefig('paidBox.png', bbox_inches='tight')

png

from numpy import median
print(median(df[df['Paid']==0]['like']))
print(median(df[df['Paid']==1]['like']))
96.0
128.0
plt.figure(figsize=(10,6))
sns.countplot(x='Paid',data=df,palette='viridis')
#sns.despine(offset=4,bottom=True)
plt.title("likes/Post by Paid/Non-Paid Posts")
plt.ylabel("# of Posts")
plt.savefig('paidCount.png', bbox_inches='tight')

png

Observations:

  • There were more non-paid posts than paid posts
  • The median paid post received about 30 more likes than the median non-paid post (128 vs. 96)

Which days and hours performed the best?

df.columns
Index([u'Page total likes', u'Type', u'Category', u'Post Month',
       u'Post Weekday', u'Post Hour', u'Paid', u'comment', u'like', u'share',
       u'Total Interactions', u'Video', u'Status', u'Photo', u'Cat_1',
       u'Cat_2'],
      dtype='object')
timePivot = pd.pivot_table(df,aggfunc='median',
                           columns='Post Hour',
                           index='Post Weekday',
                           values='like')
timePivot = timePivot[[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]]
plt.figure(figsize=(18,10))
cmap= sns.cubehelix_palette(8, start=.5, rot=-.75,as_cmap=True)
sns.heatmap(timePivot,cbar=False,cmap=cmap,annot=True, fmt='g')
#plt.pcolor(lnch_pivot,cmap=plt.cm.Blues, alpha=0.8)
plt.yticks(np.arange(7),['Sa','F','Th','W','Tu','M','Su'],fontsize=15)
plt.xticks(fontsize=15)
plt.ylabel('Post Weekday',fontsize=20)
plt.xlabel('Post Hour',fontsize=20)
plt.title('Median Likes Per Post by Post Weekday and Hour',fontsize=20)
plt.savefig('medianLikeHeatmap.png', bbox_inches='tight')

png
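
To pull the strongest cells out of the heatmap numerically, the pivot can be stacked and sorted; a small sketch using the timePivot defined above:

# Top weekday/hour combinations by median likes per post.
top_cells = timePivot.stack().sort_values(ascending=False).head(5)
print(top_cells)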

Modeling: Can we predict the number of likes a post will get?

from sklearn import linear_model
from sklearn.cross_validation import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler

Removing Outliers and Binarizing Categorical Variables

To deal with the sizable number of outliers, I will remove any observations whose like count is above the 90th percentile.

Then I will expand the categorical columns into binary (dummy) features. To avoid multicollinearity, I will keep n-1 dummy columns for each feature.

Removing Outliers

outlierCut = np.percentile(df['like'],90)
outlierCut
330.10000000000002
df = df[df['like']<outlierCut]
scaler = StandardScaler()
scaler.fit(df[['Page total likes']])
#print(scaler.mean_)
df['Page total likes'] = scaler.transform(df[['Page total likes']]).ravel()

The function below translates the numeric weekdays (1-7) into day labels.

def Weekday(x):
    if x == 1:
        return 'Su'
    elif x== 2:
        return 'Mo'
    elif x == 3:
        return 'Tu'
    elif x == 4:
        return 'We'
    elif x == 5:
        return 'Th'
    elif x ==6:
        return 'Fr'
    elif x == 7:
        return "Sa"

df['Weekday'] = df['Post Weekday'].apply(lambda x: Weekday(x))
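
An equivalent, more compact version of the same mapping uses a lookup dict with Series.map; it produces the same 'Weekday' column:

# Equivalent one-liner using a lookup dict instead of the if/elif chain.
weekday_map = {1: 'Su', 2: 'Mo', 3: 'Tu', 4: 'We', 5: 'Th', 6: 'Fr', 7: 'Sa'}
df['Weekday'] = df['Post Weekday'].map(weekday_map)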

Expanding categorical columns into dummy variables

dayDf = pd.get_dummies(df['Weekday'])
df = pd.concat([df,dayDf],axis=1)
hours = list(range(0,18))
#hours
for i in hours:
    hours[i] = str(hours[i])
    hours[i]='hr_'+ hours[i]
    #print str(hours[i])
hourDf = pd.get_dummies(df['Post Hour'],prefix='hr_')
df = pd.concat([df,hourDf],axis=1)
monthDf = pd.get_dummies(df['Post Month'],prefix='Mo')
df = pd.concat([df,monthDf],axis=1)
df['Video'] = pd.get_dummies(df['Type'])['Video']
df['Status'] = pd.get_dummies(df['Type'])['Status']
df['Photo'] = pd.get_dummies(df['Type'])['Photo']
df['Cat_1'] = pd.get_dummies(df['Category'])[1]
df['Cat_2'] = pd.get_dummies(df['Category'])[2]
#To avoid multicollinearity, the Link post type and Category 3 are not included.
df.head()
Page total likes Type Category Post Month Post Weekday Post Hour Paid Lifetime Post Total Reach Lifetime Post Total Impressions Lifetime Engaged Users ... Mo_8 Mo_9 Mo_10 Mo_11 Mo_12 Video Status Photo Cat_1 Cat_2
0 1.012339 Photo 2 12 4 3 0.0 2752 5091 178 ... 0 0 0 0 1 0 0 1 0 1
1 1.012339 Status 2 12 3 10 0.0 10460 19057 1457 ... 0 0 0 0 1 0 1 0 0 1
2 1.012339 Photo 3 12 3 3 0.0 2413 4373 177 ... 0 0 0 0 1 0 0 1 0 0
4 1.012339 Photo 2 12 2 3 0.0 7244 13594 671 ... 0 0 0 0 1 0 0 1 0 1
5 1.012339 Status 2 12 1 9 0.0 10472 20849 1191 ... 0 0 0 0 1 0 1 0 0 1

5 rows × 66 columns
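
As an aside, the manual dummy-building above could also be done in one step with pd.get_dummies and drop_first=True (available in pandas 0.18+); a sketch of that alternative, shown only for comparison:

# Hypothetical one-step alternative: one-hot encode the categorical columns,
# dropping the first level of each to avoid perfect multicollinearity.
cat_cols = ['Type', 'Category', 'Post Weekday', 'Post Hour', 'Post Month']
encoded = pd.get_dummies(df[cat_cols].astype(str), drop_first=True)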

Train Test Split

#
x = df[['Page total likes','Paid','Video','Status','Photo',
    'Cat_1','Cat_2','Mo','Tu','Sa',"We",'Th','Fr',
       'hr__17','hr__1','hr__2','hr__3','hr__4','hr__5', 'hr__6','hr__7','hr__8',
        'hr__9','hr__10','hr__11','hr__12','hr__13','hr__14','hr__15','hr__16','Mo_1',
       'Mo_2','Mo_12','Mo_4','Mo_5','Mo_6','Mo_7','Mo_8','Mo_9','Mo_11','Mo_10']]
y = df['like']

The train/test split uses a test size of 0.3 and a random state of 50. With roughly 450 samples remaining after outlier removal, a 30% test set still leaves enough data to train on while providing a reasonable holdout for evaluation.

x_train,x_test,y_train, y_test = train_test_split(x,
                                                  y, test_size=0.3,
                                                  random_state=50)
y_test.count()
135
x_test.columns
Index([u'Page total likes', u'Paid', u'Video', u'Status', u'Photo', u'Cat_1',
       u'Cat_2', u'Mo', u'Tu', u'Sa', u'We', u'Th', u'Fr', u'hr__17', u'hr__1',
       u'hr__2', u'hr__3', u'hr__4', u'hr__5', u'hr__6', u'hr__7', u'hr__8',
       u'hr__9', u'hr__10', u'hr__11', u'hr__12', u'hr__13', u'hr__14',
       u'hr__15', u'hr__16', u'Mo_1', u'Mo_2', u'Mo_12', u'Mo_4', u'Mo_5',
       u'Mo_6', u'Mo_7', u'Mo_8', u'Mo_9', u'Mo_11', u'Mo_10'],
      dtype='object')

Linear Regression

reg = linear_model.LinearRegression(normalize=True)
reg.fit(x_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
pred = reg.predict(x_test)
len(pred)
135
reg.coef_
array([  6.62755296e+01,   2.04249966e+01,   1.20935659e+02,
         6.47837430e+01,   3.58486420e+01,  -3.69391903e+01,
        -3.01933766e+01,  -2.22099310e+01,  -9.81411164e+00,
        -1.91510126e+01,  -5.87872308e+00,  -7.15162930e+00,
        -1.60445105e+01,   7.74818880e+01,   1.07194064e+02,
        -5.88115195e-01,   2.49142215e+01,   2.58633358e+01,
         4.69715758e+01,   4.78092128e+00,   1.18189111e+01,
         1.19654452e+01,   3.45213517e+01,   2.97217808e+01,
         2.58060841e+01,  -8.82700496e+00,   3.86708467e+01,
         1.76495671e+00,   2.43896549e+01,  -2.68267979e-14,
         9.09873691e+01,   8.90868100e+01,  -1.18923199e+02,
        -1.26603031e+01,  -8.41581204e+00,  -9.03806384e+01,
        -8.31484193e+01,  -1.06783529e+02,  -9.10567480e+01,
        -1.11633228e+02,  -1.28826344e+02])

Model Validation

pred = reg.predict(x_test)
pred_train = reg.predict(x_train)
testScore = r2_score(y_pred=pred,y_true=y_test)
trainScore = r2_score(y_pred=pred_train,y_true=y_train)
rfResults = pd.DataFrame()
rfResults['Score'] = [trainScore,testScore]
rfResults['Step'] = ['train','test']
rfResults
Score Step
0 0.216025 train
1 0.046039 test
sns.pointplot(y=rfResults['Score'],x=rfResults['Step'])
plt.ylim([-.1,1])
plt.title('R^2 Scores')
plt.savefig('LRScores.png',bbox_inches='tight')

png

The linear regression model performed poorly overall.

  • Slight overfitting: an R^2 of 0.216 on the train set drops to 0.046 on the test set (a cross-validation check is sketched below)
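
One way to check that this gap is not just an artifact of a single split is k-fold cross-validation; a minimal sketch using the full x and y defined above (the import path is sklearn.model_selection in newer scikit-learn versions):

# 5-fold cross-validated R^2 for the linear model.
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
cv_scores = cross_val_score(linear_model.LinearRegression(normalize=True), x, y, scoring='r2', cv=5)
print('Mean CV R^2: {:.3f} (+/- {:.3f})'.format(cv_scores.mean(), cv_scores.std()))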

Let's try a different approach: a Random Forest Regressor.

Random Forest Regression

from sklearn.ensemble import RandomForestRegressor

For the Random Forest we will use a minimum samples split of 10, to help avoid overfitting.

rf = RandomForestRegressor(n_estimators=500,min_samples_split=10)
rf.fit(x_train,y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=10, min_weight_fraction_leaf=0.0,
           n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
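
The min_samples_split of 10 was fixed by hand; it could instead be tuned with a small grid search (a hedged sketch; GridSearchCV lives in sklearn.model_selection on newer scikit-learn versions, sklearn.grid_search on the older API used here):

# Small grid search over the regularizing parameters of the random forest.
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions
param_grid = {'min_samples_split': [5, 10, 20], 'max_depth': [None, 5, 10]}
search = GridSearchCV(RandomForestRegressor(n_estimators=200), param_grid, scoring='r2', cv=5)
search.fit(x_train, y_train)
print('Best params: {}, best CV R^2: {:.3f}'.format(search.best_params_, search.best_score_))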

Model Validation

from sklearn.metrics import r2_score
from scipy.stats import spearmanr, pearsonr
predicted_train = rf.predict(x_train)
predicted_test = rf.predict(x_test)
test_score = r2_score(y_test, predicted_test)
spearman = spearmanr(y_test, predicted_test)
pearson = pearsonr(y_test, predicted_test)

print('Test data R-2 score: {}'.format(test_score))
print('Test data Spearman correlation: {}'.format(spearman[0]))
print('Test data Pearson correlation: {}'.format(pearson[0]))

train_score = r2_score(y_train, predicted_train)
spearmanTrain = spearmanr(y_train, predicted_train)
pearsonTrain = pearsonr(y_train, predicted_train)

print(' ')

print('Train data R-2 score: {}'.format(train_score))
print('Train data Spearman correlation: {}'.format(spearmanTrain[0]))
print('Train data Pearson correlation: {}'.format(pearsonTrain[0]))
Test data R-2 score: 0.135590561118
Test data Spearman correlation: 0.448279100475
Test data Pearson correlation: 0.37984956077
 
Train data R-2 score: 0.594805178695
Train data Spearman correlation: 0.828225597588
Train data Pearson correlation: 0.817565023318
RFperf = pd.DataFrame()
RFperf['Score'] = [train_score,test_score]
RFperf['Step'] = ['train','test']
RFperf
Score Step
0 0.594805 train
1 0.135591 test
sns.pointplot(y=RFperf['Score'],x=RFperf['Step'],color='Red')
plt.ylim([-.1,1])
plt.title('R^2 Scores')
plt.savefig('RFScores.png',bbox_inches='tight')

png

We had solid performance on the training set: an R^2 of 0.59 and a Spearman correlation of 0.83.

But the model fell apart on the test set, showing clear signs of overfitting: an R^2 of 0.13 and a Spearman correlation of 0.45.

Feature Importance

predicted_test = rf.predict(x_test)

fI = pd.DataFrame()
fI['Variable'] = list(x_train.columns)
fI['Importance'] = rf.feature_importances_
fI.sort_values(by='Importance',ascending=False)[0:15]
Variable Importance
0 Page total likes 0.207723
5 Cat_1 0.107980
1 Paid 0.048507
40 Mo_10 0.044828
9 Sa 0.043304
26 hr__13 0.041345
8 Tu 0.030655
36 Mo_7 0.030090
17 hr__4 0.029611
11 Th 0.027883
23 hr__10 0.026175
12 Fr 0.025307
34 Mo_5 0.021988
6 Cat_2 0.021105
22 hr__9 0.020185
topVars= list(fI.sort_values(by='Importance',ascending=False)[0:15]['Variable'])
topVars
['Page total likes',
 'Cat_1',
 'Paid',
 'Mo_10',
 'hr__13',
 'Sa',
 'Mo_7',
 'Th',
 'hr__10',
 'hr__4',
 'Tu',
 'Fr',
 'Video',
 'Cat_2',
 'hr__1']

Can we improve the model by using only the top 15 features?

x = df[topVars]
x_train,x_test,y_train, y_test = train_test_split(x,
                                                  y, test_size=0.3,
                                                  random_state=50)
rf = RandomForestRegressor(n_estimators=500,min_samples_split=15)
rf.fit(x_train,y_train)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=15, min_weight_fraction_leaf=0.0,
           n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
from sklearn.metrics import r2_score
from scipy.stats import spearmanr, pearsonr
predicted_train = rf.predict(x_train)
predicted_test = rf.predict(x_test)
test_score = r2_score(y_test, predicted_test)
spearman = spearmanr(y_test, predicted_test)
pearson = pearsonr(y_test, predicted_test)

#print(f'Out-of-bag R-2 score estimate: {rf.oob_score_:>5.3}')
print('Test data R-2 score: {}'.format(test_score))
print('Test data Spearman correlation: {}'.format(spearman[0]))
print('Test data Pearson correlation: {}'.format(pearson[0]))

train_score = r2_score(y_train, predicted_train)
spearmanTrain = spearmanr(y_train, predicted_train)
pearsonTrain = pearsonr(y_train, predicted_train)

print(' ')

#print(f'Out-of-bag R-2 score estimate: {rf.oob_score_:>5.3}')
print('Train data R-2 score: {}'.format(train_score))
print('Train data Spearman correlation: {}'.format(spearmanTrain[0]))
print('Train data Pearson correlation: {}'.format(pearsonTrain[0]))
Test data R-2 score: 0.0735779002202
Test data Spearman correlation: 0.366331840555
Test data Pearson correlation: 0.30490941908
 
Train data R-2 score: 0.463470449251
Train data Spearman correlation: 0.709508194683
Train data Pearson correlation: 0.71481692673

The model performed substantially worse when trained on only the top 15 features (by importance) from the previous model.

Modeling Conclusion

Re-fitting the random forest on only the most important variables brought no improvement, which suggests that this dataset is not rich enough to reliably predict likes from the pre-posting information alone.
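
One more check that would make this concrete is comparing against a trivial baseline that always predicts the mean of the training likes; a sketch using DummyRegressor:

# Baseline that always predicts the training-set mean; any useful model
# should at least beat this on the test set.
from sklearn.dummy import DummyRegressor
baseline = DummyRegressor(strategy='mean')
baseline.fit(x_train, y_train)
print('Baseline test R^2: {:.3f}'.format(r2_score(y_test, baseline.predict(x_test))))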

Takeaways

  1. Paid posts have higher engagement on average
  2. Paid status/photo posts and Category 1 and 2 posts saw the largest improvement from being paid, so:
    • run more paid posts of these types
  3. Video posts have the highest average engagement, and it's not close
    • post more videos
  4. Posts made Monday mid-morning and Wednesday early morning stood out as high-engagement slots
    • identify and post during times of high engagement
  5. Place importance on increasing page likes
    • a larger page following raises the probability that a post gets "high" engagement
  6. A richer dataset is needed to predict engagement accurately
    • identify potentially useful data points and collect them

References

(Moro et al., 2016) Moro, S., Rita, P., & Vala, B. (2016). Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, 69(9), 3341-3351.