TikTok data analysis and content classification

This project provides a concrete example of a typical Machine Learning engineering problem. The dataset, from TikTok, contains video data, including interaction metrics and transcripts. The objective is to classify each video as either a claim or an opinion.

Tasks performed in this project include:

  1. Exploratory analysis of the dataset, including numerical, categorical, and text attributes.
  2. Implementation of a data preprocessing pipeline, incorporating Keras preprocessing layers.
  3. Implementation of a neural network to perform the classification.
  4. Training of the model and identification of the optimal set of weights.
  5. Evaluation of the model.
  6. Examples of use of the trained model.

Note: you can execute the code in the file example.ipynb.

1. Exploratory data analysis

We start by loading the basic libraries and reading the dataset. Note that the columns # and video_id are dropped from the analysis, since they do not provide relevant information for the classification.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
# read dataset
data_path = 'tiktok_dataset.csv'
data = pd.read_csv(data_path)

# drop columns
data = data.drop(['#', 'video_id'], axis=1)

data.head()
| | claim_status | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | claim | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296 | 19425 | 241 | 1 | 0 |
| 1 | claim | 32 | someone shared with me that there are more mic... | not verified | active | 140877 | 77355 | 19034 | 1161 | 684 |
| 2 | claim | 31 | someone shared with me that american industria... | not verified | active | 902185 | 97690 | 2858 | 833 | 329 |
| 3 | claim | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506 | 239954 | 34812 | 1234 | 584 |
| 4 | claim | 19 | someone shared with me that the number of busi... | not verified | active | 56167 | 34987 | 4110 | 547 | 152 |

Subsequently, we can observe that there are no null or missing values in the dataset:

data.isnull().sum()
claim_status                0
video_duration_sec          0
video_transcription_text    0
verified_status             0
author_ban_status           0
video_view_count            0
video_like_count            0
video_share_count           0
video_download_count        0
video_comment_count         0
dtype: int64

The following figure shows the distribution of values of the attribute video_duration_sec for the claim and opinion classes. Both classes exhibit a similar distribution, so this attribute does not contain relevant information for the classification and is therefore dropped from the dataset.

sns.histplot(data, x='video_duration_sec', hue='claim_status', kde=True)
<Axes: xlabel='video_duration_sec', ylabel='Count'>

[figure: distribution of video_duration_sec for the claim and opinion classes]

# drop column
data = data.drop('video_duration_sec', axis=1)

A basic statistical summary reveals that the numerical attributes span wide ranges of values and have high standard deviations, which hinders the learning process of neural networks. Consequently, it is necessary to normalize these attributes.

data.describe()
| | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|
| count | 19084.000000 | 19084.000000 | 19084.000000 | 19084.000000 | 19084.000000 |
| mean | 254708.558688 | 84304.636030 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 322893.280814 | 133420.546814 | 32036.174350 | 2004.299894 | 799.638865 |
| min | 20.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 4942.500000 | 810.750000 | 115.000000 | 7.000000 | 1.000000 |
| 50% | 9954.500000 | 3403.500000 | 717.000000 | 46.000000 | 9.000000 |
| 75% | 504327.000000 | 125020.000000 | 18222.000000 | 1156.250000 | 292.000000 |
| max | 999817.000000 | 657830.000000 | 256130.000000 | 14994.000000 | 9599.000000 |
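
For reference, the standardization applied later by the Normalization layer amounts to a z-score transform. A minimal illustrative sketch (not part of the original pipeline) of its effect on one attribute:

# z-score standardization of a single attribute (illustrative only)
views = data['video_view_count']
z = (views - views.mean()) / views.std()
print(round(z.mean(), 6), round(z.std(), 6))  # approximately 0.0 and 1.0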

Finally, the following figure illustrates the pairwise relationships between the numerical attributes, with each class shown in a different color. Note that opinion instances take lower values on all attributes, while claim instances take higher ones. This reveals that the numerical attributes (video_view_count, video_like_count, video_share_count, video_download_count and video_comment_count) contain valuable information for the classification. Therefore, they are kept in the analysis, together with the video transcripts (video_transcription_text) and the author data (verified_status, author_ban_status).

sns.set_theme(style='ticks')
g = sns.pairplot(data, hue='claim_status', corner=True)

# logarithmic scale
g.set(xscale="log")
g.set(yscale="log")
<seaborn.axisgrid.PairGrid at 0x2478335b460>

[figure: pairwise relationships between the numerical attributes, colored by class (log scale)]

2. Data preprocessing pipeline

Regarding the preprocessing pipeline, a function is implemented to transform a pandas DataFrame into a TensorFlow Dataset; note that this process includes batching. The data is then split into training, validation, and test sets, and the function is applied to each split.

def df_to_dataset(dataframe, y_label, batch=1):
  dataframe = dataframe.copy()
  
  # extract labels and transform to integers
  labels = dataframe.pop(y_label).values
  _, labels = np.unique(labels,  return_inverse=True)
  
  # Dataset from tensor and labels
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels)) 
  return ds.batch(batch) # batch

# data split
train, test = train_test_split(data, test_size=0.2, random_state=1)
train, val = train_test_split(train, test_size=0.2, random_state=1)

# transform Dataframe into Dataset
BATCH_SIZE = 16
train = df_to_dataset(train, y_label='claim_status', batch=BATCH_SIZE)
test = df_to_dataset(test, y_label='claim_status', batch=BATCH_SIZE)
val = df_to_dataset(val, y_label='claim_status', batch=BATCH_SIZE)

print('Number of instances on:')
print(f'- train: {BATCH_SIZE*len(train)}')
print(f'- val: {BATCH_SIZE*len(val)}')
print(f'- test: {BATCH_SIZE*len(test)}')
Number of instances on:
- train: 12224
- val: 3056
- test: 3824

Note: in this step, the label discretization process transforms the classes:

$$claim\rightarrow 0\quad \text{and}\quad opinion\rightarrow 1$$
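
This mapping comes from np.unique with return_inverse=True in df_to_dataset, which sorts the class names alphabetically before encoding them; a minimal sketch of that behavior:

# np.unique sorts the labels alphabetically: 'claim' -> 0, 'opinion' -> 1
labels = np.array(['claim', 'opinion', 'claim'])
classes, encoded = np.unique(labels, return_inverse=True)
print(classes)   # ['claim' 'opinion']
print(encoded)   # [0 1 0]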

A function is implemented to prepare the data preprocessing pipeline, depending on the characteristics of each attribute. For numeric values, a Normalization layer is used, which standardizes each attribute to zero mean and unit variance. For categorical attributes, a StringLookup layer is used, which transforms the input values into a one-hot encoding representation. Finally, for strings, a TextVectorization layer is used to transform the input text into a numeric representation, where each integer represents a word in a vocabulary built from the most frequent words in the dataset.

# input pipeline
def preprocessing_layer(name, dataset, type, max_tokens=10000, output_length=20):
    # extract the values of the given feature
    feature_ds = dataset.map(lambda x, y: x[name])
    
    # preprocessing options
    if type=='numeric':
        auxIn = tf.keras.Input(shape=(1,), name=name, dtype='int64')
        layer = tf.keras.layers.Normalization(axis=None)
        layer.adapt(feature_ds)
    elif type=='categorical':
        auxIn = tf.keras.Input(shape=(1,), name=name, dtype='string')
        layer = tf.keras.layers.StringLookup(num_oov_indices=0, output_mode='one_hot')
        layer.adapt(feature_ds)
    elif type=='text':
        auxIn = tf.keras.Input(shape=(1,), name=name, dtype='string')
        layer = tf.keras.layers.TextVectorization(max_tokens=max_tokens, output_sequence_length=output_length)
        layer.adapt(feature_ds)
    
    encoded = layer(auxIn)

    # return both, the input and preprocessing layer
    return auxIn, encoded
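
To make the text case concrete, the following self-contained sketch (toy sentences, not from the dataset) shows what an adapted TextVectorization layer produces:

# illustrative TextVectorization example on toy data
demo = tf.keras.layers.TextVectorization(max_tokens=50, output_sequence_length=5)
demo.adapt(['drone deliveries are coming', 'a friend told me a story'])
print(demo(['a drone story']))  # integer token ids, zero-padded to length 5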

This function is applied to all attributes and adapted to the information in the training set. Note that each feature receives its data through an input layer, which is then passed through its respective preprocessing layer. Consequently, the neural network can process raw inputs, as the preprocessing is performed during model evaluation.

inputs = []
encoded_inputs = []

# continuous columns
numerical = ['video_view_count', 'video_like_count', 'video_share_count', 'video_download_count',
             'video_comment_count']
for ni in numerical:
    auxIn, auxEn = preprocessing_layer(name=ni, dataset=train, type='numeric')
    inputs.append(auxIn)
    encoded_inputs.append(auxEn)

# categorical columns
categorical = ['verified_status', 'author_ban_status']
for ci in categorical:
    auxIn, auxEn = preprocessing_layer(name=ci, dataset=train, type='categorical')
    inputs.append(auxIn)
    encoded_inputs.append(auxEn)

# transcription text
textIn, textEn = preprocessing_layer(name='video_transcription_text', dataset=train, type='text')
textEn = tf.cast(textEn, tf.float32) # cast to float32
inputs.append(textIn)
encoded_inputs.append(textEn)
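
Before assembling the model, it can help to verify the shape of each encoded feature (an illustrative check; the batch dimension is None):

# expected shapes: (None, 1) for numeric inputs, (None, vocabulary_size)
# for one-hot categorical features, and (None, 20) for the text vector
for t in encoded_inputs:
    print(t.shape)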

3. Implementation of the classification model

A neural network is implemented to perform the classification task. The model first passes its inputs through the preprocessing layers and then feeds the results to fully connected layers. The text attribute (video_transcription_text) is processed separately from the rest of the features; at the end of the network, both branches are concatenated to perform the classification.

concat_encoded = tf.keras.layers.Concatenate(axis=-1)(encoded_inputs)

# numerical/categorical branch (all but the last 20 columns, since the
# TextVectorization output has length 20 and was appended last)
nums = concat_encoded[:,:-20]
nd1 = tf.keras.layers.Dense(32, 'relu')(nums)
nd2 = tf.keras.layers.Dense(16, 'relu')(nd1)

# text branch (last 20 columns)
text = concat_encoded[:,-20:]
td1 = tf.keras.layers.Dense(32, 'relu')(text)
td2 = tf.keras.layers.Dense(16, 'relu')(td1)

# concat
full_concat = tf.keras.layers.Concatenate(axis=-1)([nd2, td2])

# classification
out = tf.keras.layers.Dense(units=1, activation='sigmoid')(full_concat)

# model definition
model = tf.keras.Model(inputs, out)

# plot model
tf.keras.utils.plot_model(model, show_shapes=False, rankdir="LR", dpi=300)

[figure: model architecture diagram]

4. Model training

Regarding the model training, a custom Callback is implemented that identifies, during training, the epoch in which the architecture obtains the highest accuracy on the validation set. In this way, the weights that perform best on data not seen during training (val) can be recovered, mitigating overfitting.

class MaxAccEpoch(tf.keras.callbacks.Callback):
    def __init__(self, epochs):
        super().__init__()
        self.epochs = epochs # number of training epochs
        self.val_loss = [] # validation loss history

        self.max_epoch = 0
        self.max_val_acc = 0.0
        self.max_weights = None

    def on_epoch_end(self, epoch, logs=None):
        # when the callback identifies a new maximum validation accuracy,
        # store the epoch, the accuracy and a copy of the current weights
        if logs.get('val_acc') > self.max_val_acc:
            self.max_epoch = epoch
            self.max_val_acc = logs.get('val_acc')
            self.max_weights = self.model.get_weights() # copy weights
        self.val_loss.append(logs.get('val_loss'))
    
EPOCHS = 20
cb = MaxAccEpoch(EPOCHS) # callback

# model compilation
model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer='adam', metrics=['acc'])

# training
metrics = model.fit(x=train, validation_data=val, epochs=EPOCHS, callbacks=cb)
Epoch 1/20
764/764 [==============================] - 11s 11ms/step - loss: 2.3888 - acc: 0.8577 - val_loss: 0.1115 - val_acc: 0.9686
Epoch 2/20
764/764 [==============================] - 8s 11ms/step - loss: 0.0904 - acc: 0.9758 - val_loss: 0.0535 - val_acc: 0.9895
Epoch 3/20
764/764 [==============================] - 9s 11ms/step - loss: 0.0609 - acc: 0.9838 - val_loss: 0.0445 - val_acc: 0.9902
Epoch 4/20
764/764 [==============================] - 9s 11ms/step - loss: 0.0525 - acc: 0.9859 - val_loss: 0.0373 - val_acc: 0.9905
Epoch 5/20
764/764 [==============================] - 9s 12ms/step - loss: 0.0482 - acc: 0.9864 - val_loss: 0.0325 - val_acc: 0.9915
Epoch 6/20
764/764 [==============================] - 8s 11ms/step - loss: 0.0445 - acc: 0.9878 - val_loss: 0.0302 - val_acc: 0.9921
Epoch 7/20
764/764 [==============================] - 8s 10ms/step - loss: 0.0412 - acc: 0.9887 - val_loss: 0.0318 - val_acc: 0.9921
Epoch 8/20
764/764 [==============================] - 8s 10ms/step - loss: 0.0402 - acc: 0.9891 - val_loss: 0.0331 - val_acc: 0.9918
Epoch 9/20
764/764 [==============================] - 8s 11ms/step - loss: 0.0388 - acc: 0.9893 - val_loss: 0.0378 - val_acc: 0.9925
Epoch 10/20
764/764 [==============================] - 8s 11ms/step - loss: 0.0361 - acc: 0.9898 - val_loss: 0.0466 - val_acc: 0.9925
Epoch 11/20
764/764 [==============================] - 8s 11ms/step - loss: 0.0346 - acc: 0.9898 - val_loss: 0.0395 - val_acc: 0.9925
Epoch 12/20
764/764 [==============================] - 8s 10ms/step - loss: 0.0335 - acc: 0.9901 - val_loss: 0.0433 - val_acc: 0.9921
Epoch 13/20
764/764 [==============================] - 8s 11ms/step - loss: 0.0327 - acc: 0.9906 - val_loss: 0.0313 - val_acc: 0.9928
Epoch 14/20
764/764 [==============================] - 9s 11ms/step - loss: 0.0312 - acc: 0.9902 - val_loss: 0.0283 - val_acc: 0.9931
Epoch 15/20
764/764 [==============================] - 8s 11ms/step - loss: 0.0313 - acc: 0.9903 - val_loss: 0.0285 - val_acc: 0.9928
Epoch 16/20
764/764 [==============================] - 8s 11ms/step - loss: 0.0291 - acc: 0.9909 - val_loss: 0.0296 - val_acc: 0.9928
Epoch 17/20
764/764 [==============================] - 8s 11ms/step - loss: 0.0288 - acc: 0.9915 - val_loss: 0.0293 - val_acc: 0.9931
Epoch 18/20
764/764 [==============================] - 8s 11ms/step - loss: 0.0262 - acc: 0.9918 - val_loss: 0.0359 - val_acc: 0.9925
Epoch 19/20
764/764 [==============================] - 8s 11ms/step - loss: 0.0277 - acc: 0.9912 - val_loss: 0.0336 - val_acc: 0.9921
Epoch 20/20
764/764 [==============================] - 8s 10ms/step - loss: 0.0251 - acc: 0.9925 - val_loss: 0.0627 - val_acc: 0.9889

Below are the curves obtained during the training of the model, showing the results on both the train and val sets.

fig, axs = plt.subplots(1,2)
fig.set_size_inches((10,4))

labs = ['acc', 'loss']

for i, li in enumerate(labs):
    axs[i].plot(metrics.history[li], label='train')
    axs[i].plot(metrics.history[f'val_{li}'], label='val')

    axs[i].set_title(f'Model {li}')
    axs[i].set_ylabel(li)
    axs[i].set_xlabel('epoch')

handles, labels = axs[-1].get_legend_handles_labels()
fig.legend(handles, labels, bbox_to_anchor=(0.5,-0.15), loc='outside lower center', ncol=2, labelspacing=0., fontsize=14)

plt.show()

[figure: accuracy and loss curves on the train and val sets]

Note that the accuracy on the validation set starts to decrease in the last training epochs, which is attributed to overfitting. For this reason, the callback is used to restore the model configuration that performed best.

print(f'Optimal EPOCH: {cb.max_epoch}')
model.set_weights(cb.max_weights)
Optimal EPOCH: 13
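
For reference, a similar effect can be obtained with Keras' built-in EarlyStopping callback, which can also restore the best weights observed on a validation metric. A minimal sketch (not used in this project):

# built-in alternative to MaxAccEpoch: monitor val_acc and restore the best weights
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_acc',
    mode='max',
    patience=EPOCHS,            # patience as large as EPOCHS, so training runs to completion
    restore_best_weights=True,  # roll back to the best weights when training ends
)
# model.fit(x=train, validation_data=val, epochs=EPOCHS, callbacks=early_stop)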

5. Evaluation of the model

Finally, the model is evaluated under its optimal configuration. This analysis includes the accuracy, the F1-score, and the confusion matrix obtained from the predictions on the test set.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import f1_score

# accuracy on test set
test_acc = model.evaluate(test)

y_test = []
for _, label in test:
    y_test += list(np.array(label))


preds = model.predict(test)
preds = np.round(preds)
preds = np.reshape(preds, (preds.shape[0],))

# confusion matrix
cm = confusion_matrix(y_test, preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()

# f1 score
f1 = f1_score(y_test, preds)

print(f'Loss on Test set: {test_acc[0]}')
print(f'Accuracy on Test set: {test_acc[1]}')
print(f'F1-Score on test set: {f1}')
239/239 [==============================] - 2s 8ms/step - loss: 0.0529 - acc: 0.9914
239/239 [==============================] - 1s 5ms/step
Loss on Test set: 0.05289480462670326
Accuracy on Test set: 0.9913544654846191
F1-Score on test set: 0.9912582781456953

[figure: confusion matrix on the test set]

The accuracy metric reveals that the model generalizes almost perfectly to new information (the test set). Furthermore, the confusion matrix shows low rates of false positives and false negatives, which is reflected in a high F1-score. Therefore, we can conclude that the model is capable of making reliable predictions.
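
As a quick sanity check, precision and recall can be derived directly from the confusion matrix computed above (a minimal sketch reusing cm; with the mapping claim=0 and opinion=1, the positive class is opinion):

# scikit-learn's binary confusion matrix is laid out as [[tn, fp], [fn, tp]]
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f'Precision: {precision:.4f}, Recall: {recall:.4f}')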

Finally, the trained model is saved for possible deployment in an application.

model.save('final_model')
INFO:tensorflow:Assets written to: final_model\assets

6. Example of use

Finally, a simple example of using the pre-trained model is included. First, the model is loaded with its configuration.

reloaded_model = tf.keras.models.load_model('final_model')

Then, given an input instance, the model is used to make predictions. Note the format of the input, which consists of raw data. Since the model includes the preprocessing pipeline, it can receive this type of input directly, making it easy to deploy without preprocessing the information beforehand.

def predict_with_model(model, inputs, labels, batch=False):
    if batch:
        inputs = {name: tf.convert_to_tensor(value) for name, value in inputs.items()}
    else: # only one instance
        inputs = {name: tf.convert_to_tensor([value]) for name, value in inputs.items()}
    preds = np.array(model.predict(inputs))
    
    # index labels
    inds = np.round(preds)
    outs = [labels[i[0]] for i in inds]
    return outs

# test with raw input
raw_input = {'video_view_count': 20000, 'video_like_count': 200, 'video_share_count':200, 'video_download_count':20,
             'video_comment_count':50, 'verified_status':'not verified', 'author_ban_status':'active',
             'video_transcription_text': "my colleagues' point of view is that 90 percent of goods are shipped by ocean freight"}

# labels names
LABELS = {0:'claim', 1:'opinion'}

# class prediction with pre-trained model
pred = predict_with_model(reloaded_model, raw_input, LABELS)
print(f'Prediction: {pred}')
1/1 [==============================] - 0s 56ms/step
Prediction: ['opinion']

In addition, the prediction method supports multi-instance inference, as illustrated below:

# raw inputs
raw_input = {'video_view_count': [20000,10000],
             'video_like_count': [200,500],
             'video_share_count':[200,400],
             'video_download_count':[20,30],
             'video_comment_count':[50,1000],
             'verified_status':['not verified', 'not verified'],
             'author_ban_status':['active','active'],
             'video_transcription_text': ["my colleagues' point of view is that 90 percent of goods are shipped by ocean freight",
                                          "a friend learned  from the news that at birth, baby pandas are smaller than mice"]}

labels = {0:'claim', 1:'opinion'}

# class predictions with pre-trained model
preds = predict_with_model(reloaded_model, raw_input, LABELS, batch=True)
print(f'Predictions: {preds}')
1/1 [==============================] - 0s 70ms/step
Predictions: ['opinion', 'claim']
