An Open-source Benchmark of Deep Learning Models for Audio-visual Apparent and Self-reported Personality Recognition
This is the official code repo of An Open-source Benchmark of Deep Learning Models for Audio-visual Apparent and Self-reported Personality Recognition (https://arxiv.org/abs/2210.09138).
The accepted supplementary material is provided in Personality_benchmark_Supplementary_.pdf
In this project, seven visual models, six audio models and five audio-visual models have been reproduced and evaluated. In addition, seven widely-used visual deep learning models that had not previously been applied to video-based personality computing are also benchmarked. A detailed description can be found in our paper.
All benchmarked models are evaluated on the ChaLearn First Impression dataset and the ChaLearn UDIVA self-reported personality dataset.
This project is currently under active development. Documentation, examples, and tutorials will be progressively detailed.
Project setup: you can use either Conda or Virtualenv/pipenv to create a virtual environment for running this program.
# create and activate a virtual environment
virtualenv -p python38 venv
source venv/bin/activate
pip install deep_personality
# clone current repo
git clone DeepPersonality
cd DeepPersonality
# install required packages and dependencies
pip install -r requirements.txt
The datasets we used for the benchmark are ChaLearn First Impression and UDIVA.
- The former contains 10,000 video clips from 2,764 YouTube users for apparent personality recognition (impression), where each video lasts about 15 seconds at 30 fps.
- The latter, for self-reported personality, records 188 dyadic-interaction video clips between 147 voluntary participants, with a total of 90.5 hours of recordings. Every clip contains two audiovisual files, each of which records a single participant's behaviours.
- Each video in both datasets is labelled with the Big-Five personality traits.
To meet the various requirements of different models and experiments, we extract the raw audio file and all frames from each video, and then extract face images from each full frame, termed face frames.
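As a rough, illustrative sketch of this preprocessing (assuming OpenCV and ffmpeg are available; this is not the repository's own preparation script, and the audio settings and face-cropping step are placeholders):

import subprocess
from pathlib import Path

import cv2

def extract_frames_and_audio(video_path: str, out_dir: str) -> None:
    """Dump every frame of a video as a JPEG and its audio track as a WAV file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # 1) full frames
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(str(out / f"frame_{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
    # 2) raw audio track (mono WAV; codec/sample-rate choices are placeholders)
    wav_path = out / (Path(video_path).stem + ".wav")
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-ac", "1", str(wav_path)], check=True)
    # face frames would additionally be cropped from each full frame with a face detector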
For a quick start and demonstration, we provide a tiny ChaLearn 2016 dataset containing 100 videos, of which 60 are for training, 20 for validation and 20 for testing. Please find the processing methods in dataset preparation.
For your convenience, we also provide the processed face-frame dataset for ChaLearn 2016, since that dataset is publicly available, which allows us to make our processed data open to the community.
We employ a build-from-config approach to conduct experiments. After setting up the environment and preparing the required data, a quick start can be made with the following command line:
# cd DeepPersonality # top directory
python ./tools/run_exp.py --config path/to/exp_config.yaml
For a quick start with the tiny ChaLearn 2016 dataset, if you have prepared the data following the instructions in the section above, the following command will launch an experiment for the bimodal-resnet18 model.
# cd DeepPersonality # top directory
python ./tools/run_exp.py --config config/demo/bimodal_resnet18.yaml
A detailed description of the arguments is presented in the command line interface file.
For a quick-start demonstration, please find the Colab Notebook: QuickStart
For experiments starting from raw video processing, please find this Colab Notebook: StartFromDataProcessing
We use pipeline config files and a registration mechanism to organize our experiments.
Within the framework, users can design and configure their own spatial-temporal data preprocessing approaches.
If users want to add their own models or algorithms for experiments, please refer to the Colab Notebook TrainYourModel.
To allow users to use different data augmentation strategies, the framework provides a command line argument "--set DATA_LOADER.TRANSFORM <augmentation strategy>" to choose the data augmentation strategy. The available strategies are listed as follows:
- standard_frame_transform: the default data augmentation strategy.
- strong_frame_transform: a stronger data augmentation strategy with multiple image transforms.
- customized_frame_transform: a customized data augmentation strategy, which can be defined by users.
Step 1: Resize the input image (or cropped face image) to a 3-channel rectangle whose shorter edge is 256 pixels long. For example, if an image has a size of (width: 720, height: 512), the resized image will have a size of (width: 360, height: 256);
Step 2: the resized image is flipped horizontally with a probability of 0.5;
Step 3: an image patch of size (width: 224, height: 224) is cropped from the flipped image;
Step 4: the pixel values (P) in the cropped image are normalized to the numerical range (-1, 1) by P_norm = (P / 255 - mean) / std, where the per-channel mean and std are both set to 0.5.
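The four steps above can be expressed as a minimal torchvision sketch; whether the 224x224 crop in the released code is random or centred is an assumption here, and the variable name standard_frame_transform_sketch is ours:

import torchvision.transforms as T

standard_frame_transform_sketch = T.Compose([
    T.Resize(256),                     # Step 1: shorter edge -> 256 px, aspect ratio preserved
    T.RandomHorizontalFlip(p=0.5),     # Step 2: horizontal flip with probability 0.5
    T.RandomCrop((224, 224)),          # Step 3: 224 x 224 patch (random crop assumed)
    T.ToTensor(),                      # scales pixel values to [0, 1]
    T.Normalize(mean=[0.5, 0.5, 0.5],  # Step 4: (P/255 - 0.5) / 0.5 -> (-1, 1)
                std=[0.5, 0.5, 0.5]),
])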
Besides the first two augmentation steps in the standard_frame_transform strategy (resize and horizontal flip), this strategy also sequentially employs random rotation, color jitter, and random resized crop, followed by the same image normalization operation (Step 4) described in standard_frame_transform.
- random rotation: the image is rotated by a randomly selected angle within the range of (-5, 5) degrees.
- color jitter: the brightness, contrast and saturation of the image are randomly changed, where the ranges of the scaling/jittering factors for brightness, contrast and saturation are (0.9, 1.1).
- random resized crop: a patch (whose size is 0.8 to 1.0 of the original image) is randomly cropped from the image and then resized to (width: 224, height: 224).
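A corresponding torchvision sketch of the strong strategy is given below; mapping the ranges above onto torchvision arguments (e.g. a jitter factor of 0.1 for the (0.9, 1.1) ranges) and the exact ordering of the transforms are our assumptions, not taken from the released code:

import torchvision.transforms as T

strong_frame_transform_sketch = T.Compose([
    T.Resize(256),                                      # shared with standard_frame_transform
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=5),                        # angle drawn from (-5, 5) degrees
    T.ColorJitter(brightness=0.1,                       # factors drawn from (0.9, 1.1)
                  contrast=0.1,
                  saturation=0.1),
    T.RandomResizedCrop((224, 224), scale=(0.8, 1.0)),  # crop 80-100% of the image, resize to 224
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])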
The command line for using the strong data augmentation strategy is:
python ./tools/run_exp.py --config config/demo/(model.yaml, e.g., bimodal_resnet18.yaml) \
--set DATA_LOADER.TRANSFORM strong_frame_transform
First, define your own data transform function and register it to the registry.
# step 1: import the registry for data transforms
from dpcv.data.transforms.build import TRANSFORM_REGISTRY

# step 2: define your own data transform function and register it to the registry
@TRANSFORM_REGISTRY.register()
def my_frame_transform(cfg):
    """
    Define your own data transform function.
    """
    # the transform classes from torchvision are used here for reference
    import torchvision.transforms as transforms
    transform = transforms.Compose([
        transforms.CenterCrop((112, 112)),
        transforms.ToTensor(),
    ])
    return transform
The command line for using the customized frame transform strategy is:
python ./tools/run_exp.py --config config/demo/bimodal_resnet18.yaml \
--set DATA_LOADER.TRANSFORM my_frame_transform
This framework also allows users to train models with metadata or additional data. The guideline is provided below:
- Step 1: Prepare the additional data
  - Prepare the metadata or additional data in a CSV file (a hypothetical example CSV is sketched after Step 3 below).
  - The first column should be the video file name.
- Step 2: Modify the training config file:
# file: config/demo/hrnet_use_other_data.yaml  (the training config file to modify)
DATA:
  ROOT: "datasets/chalearn2021"
  # the paths to the metadata or additional data
  TRAIN_OTHER_DATA: "datasets/chalearn21_metadata/metadata_train.csv"
  VALID_OTHER_DATA: "datasets/chalearn21_metadata/metadata_valid.csv"
  TEST_OTHER_DATA: "datasets/chalearn21_metadata/metadata_test.csv"
DATA_LOADER:
  # the name of the data loader used for adding additional data
  NAME: "all_true_personality_with_other_data_loader"
MODEL:
  # the name of the model used for training with additional data (please select and modify your model)
  NAME: "hr_net_true_personality_with_meta_data"
  USE_OTHER_DATA: True
  # the dimension of the metadata/additional data (please customize this number)
  OTHER_DATA_DIM: 3
- Step 3: Train the model
python ./tools/run_exp.py --config config/demo/hrnet_use_other_data.yaml
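As a purely hypothetical illustration of the CSV layout described in Step 1 (only the convention that the first column holds the video file name comes from the guideline; the column names, values and file names below are made up, with three metadata columns to match OTHER_DATA_DIM: 3):

import pandas as pd

# first column: video file name; the remaining columns are hypothetical metadata features
metadata = pd.DataFrame({
    "video": ["video_001.mp4", "video_002.mp4"],
    "meta_1": [0.42, 0.17],
    "meta_2": [1.00, 0.00],
    "meta_3": [-0.13, 0.55],
})
metadata.to_csv("metadata_train.csv", index=False)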
Model | Modality | ChaLearn2016 cfgs | ChaLearn2016 weights |
---|---|---|---|
DAN | visual | cfg | weight |
CAM-DAN+ | visual | cfg | weight |
ResNet | visual | cfg | weight |
HRNet | visual | cfg-frame/cfg-face | weight-frame/weight-face |
SENet | visual | cfg-frame/cfg-face | weight |
3D-ResNet | visual | cfg-frame/cfg-face | weight |
Slow-Fast | visual | cfg-frame/cfg-face | weight |
TPN | visual | cfg-frame/cfg-face | weight |
Swin-Transformer | visual | cfg-frame/cfg-face | weight |
VAT | visual | cfg-frame/cfg-face | weight |
Interpret Audio CNN | audio | cfg | weight |
Bi-modal CNN-LSTM | audiovisual | cfg | weight |
Bi-modal ResNet | audiovisual | cfg | weight |
PersEmoN | audiovisual | cfg | weight |
CRNet | audiovisual | cfg | weight |
Amb-Fac | audiovisual | cfg-frame, cfg-face | weight |
The models that we have reproduced or will reproduce:
- Deep bimodal regression of apparent personality traits from short video sequences
- Bi-modal first impressions recognition using temporally ordered deep audio and stochastic visual features
- Deep impression: Audiovisual deep residual networks for multimodal apparent personality trait recognition
- Cr-net: A deep classification-regression network for multimodal apparent personality analysis
- Interpreting cnn models for apparent personality trait regression
- On the use of interpretable cnn for personality trait recognition from audio
- Persemon: a deep network for joint analysis of apparent personality, emotion and their relationship
- A multi-modal personality prediction system
- Squeeze-and-excitation networks
- Deep high-resolution representation learning for visual recognition
- Swin transformer: Hierarchical vision transformer using shifted windows
- Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet
- Slowfast networks for video recognition
- Temporal pyramid network for action recognition
- Video action transformer network
If you use our code for a publication, please kindly cite it as:
@article{liao2024open,
title={An open-source benchmark of deep learning models for audio-visual apparent and self-reported personality recognition},
author={Liao, Rongfan and Song, Siyang and Gunes, Hatice},
journal={IEEE Transactions on Affective Computing},
year={2024},
publisher={IEEE}
}
- 2022/10/17 - Paper submitted and project made publicly available.
- Test Pip install
- Description of adding new models
- Model zoo
- Notebook tutorials