This project aims to develop a neural network that tracks an object using YOLO V3 and LSTM. YOLO V3 detects objects in each image, while the LSTM treats their locations across frames as sequential historical data. In this project, three models are trained and evaluated. YOTMCLS uses both the coordinates and the image feature from YOLO V3 as input. YOTMPMO does not use the image feature; instead, the coordinates are converted into a probability map and used as input. YOTMMLP feeds each of Cx, Cy, W, and H into its own separate LSTM network.
The input of YOTMCLS consists of the image feature and the coordinates of an object from the YOLO output. The location of the object is predicted by the LSTM.
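A minimal sketch of this architecture in PyTorch; the class name, feature size, and hidden size are illustrative assumptions, not the project's exact implementation:

```python
import torch
import torch.nn as nn

class YOTMCLSSketch(nn.Module):
    """Hypothetical sketch: YOLO image feature + box coordinates -> LSTM -> box."""
    def __init__(self, feat_size=512, coord_size=4, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_size + coord_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, coord_size)  # predict (Cx, Cy, W, H)

    def forward(self, feats, coords):
        # feats:  (batch, seq_len, feat_size) image features from YOLO V3
        # coords: (batch, seq_len, coord_size) normalized box coordinates
        x = torch.cat([feats, coords], dim=-1)  # concatenate both inputs per frame
        out, _ = self.lstm(x)
        return self.fc(out)  # predicted location for each frame
```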
The coordinates are converted into a probability map and then fed to the LSTM. However, there is no direct way to convert the probability map back into coordinates, so this project proposes a way to convert the LSTM output into coordinates using the equation below.
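The project's own conversion equation is given below. Purely as a generic illustration of the idea, one common reconversion is a soft-argmax, i.e. taking the expected cell index under the probability map. A minimal sketch, assuming the map is a normalized grid_h x grid_w grid (the function name and normalization are assumptions, not the project's exact equation):

```python
import torch

def prob_map_to_center(prob_map):
    """Illustrative soft-argmax: recover (Cx, Cy) in [0, 1] as the expected
    cell index under a probability map of shape (grid_h, grid_w) summing to 1.
    This is a generic technique, not necessarily the project's equation."""
    grid_h, grid_w = prob_map.shape
    ys = torch.arange(grid_h, dtype=prob_map.dtype)
    xs = torch.arange(grid_w, dtype=prob_map.dtype)
    cy = (prob_map.sum(dim=1) * ys).sum() / (grid_h - 1)  # row marginal -> expected y
    cx = (prob_map.sum(dim=0) * xs).sum() / (grid_w - 1)  # column marginal -> expected x
    return cx, cy
```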
In the YOTMMLP model there are four LSTMs, one per coordinate, so Cx, Cy, W, and H are predicted independently. Because separating the coordinates keeps each prediction model simple, the performance is expected to improve.
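A minimal sketch of this per-coordinate design in PyTorch; the class name and hidden size are assumptions:

```python
import torch
import torch.nn as nn

class YOTMMLPSketch(nn.Module):
    """Hypothetical sketch: one independent LSTM per coordinate (Cx, Cy, W, H)."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstms = nn.ModuleList(
            [nn.LSTM(1, hidden_size, batch_first=True) for _ in range(4)])
        self.heads = nn.ModuleList([nn.Linear(hidden_size, 1) for _ in range(4)])

    def forward(self, coords):
        # coords: (batch, seq_len, 4); each coordinate gets its own LSTM
        outs = []
        for i, (lstm, head) in enumerate(zip(self.lstms, self.heads)):
            h, _ = lstm(coords[..., i:i + 1])  # feed one coordinate channel
            outs.append(head(h))
        return torch.cat(outs, dim=-1)  # (batch, seq_len, 4)
```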
Python 3.7
PyTorch 1.3
To train YOT, 27 sequences from the TB-100 benchmark (http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html) are used.
For each video clip, 60% of the frames are used to train the networks and 20% are used to validate them. The IoU scores of the YOLO outputs on the training and validation sets are 0.641 and 0.646, respectively.
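A minimal sketch of this per-clip split, assuming frames are taken in temporal order (the function name is an assumption):

```python
def split_frames(frames, train_ratio=0.6, val_ratio=0.2):
    """Take the first 60% of a clip's frames for training and the
    next 20% for validation."""
    n_train = int(len(frames) * train_ratio)
    n_val = int(len(frames) * val_ratio)
    return frames[:n_train], frames[n_train:n_train + n_val]
```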
When YOLO V3 does not detect the object, its default output for the predicted coordinates is (0, 0, 0, 0, 0). However, using Cx=0 and Cy=0 may introduce a bias, because (0, 0) corresponds to the top-left corner of the image. In this project, (0.5, 0.5, 0, 0, 0), which places the undetected box at the image center, is used as the default value for undetected objects.
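A minimal sketch of this substitution, assuming YOLO's per-object output is a 5-element tensor (the function name is an assumption):

```python
import torch

def fill_undetected(box, default=(0.5, 0.5, 0.0, 0.0, 0.0)):
    """Replace an all-zero YOLO output (object not detected) with the
    image-center default, so Cx=Cy=0 is not mistaken for the top-left corner."""
    if bool((box == 0).all()):
        return box.new_tensor(default)
    return box
```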
This model does not show good performance. The image feature seems to reduce performance due to the added complexity.
This model shows poor performance, with overfitting.
With a hidden size of 64, YOTMMLP shows good performance.
Demo videos are available.
The ground truth is also sequential data, so training with both the ground truth and the YOLO output is expected to improve performance. With an LSTM hidden size of 32, this model shows slightly better performance than YOTMMLP trained without the ground truth.