September 2019
tl;dr: use a combination of appearance metric and bbox for tracking.
The problem with sort is the frequent ID switches as sort uses a simple motion model and does not handle occluded tracks well.
Deep sort uses the appearance features to track objects through longer periods of occlusion. In reality the cost only consists of appearance metrics, although bbox distance is used as a gating process.
deepSORT extracts features from image patches, while PointTrack extracts features from 2D point cloud.
- Same as sort:
- Kalman filter with constant velocity motion and linear observation model where bounding box as direct observations of the object state. --> maybe take a look at the Bayesian object detection?
- For each track we track the age of the track where it is not associated
-
Difference from sort:
- instead of using IOU metric, a weighted sum of Mahalanobis distance (between predicted bbox and measurement) with appearance distance, between i-th track and j-th detection
- the bbox distance
$d^{(1)} = (d_j - y_i)^T S_i^{-1} (d_j - y_i)$ , where (y, S) is the i-th track in measurement space, and d is the j-th detection. - the appearance distance
$d^{(2)} = min{1 - r_j^T r_k| r_k \in R_i }$ - Cost matrix
$c_{i,j} = \lambda d^{(1)} + (1-\lambda) d^{(2)}$ (in practice \lambda is set to 0 where there is substantial camera motion) - Gate matrix
$b_{i,j} = \mathbb{1}[d^{(1)} \le t^{(1)}] \times \mathbb{1}[d^{(2)} \le t^{(2)}]$ (even though \lambda=0, the Mahalabobis distance is still used to discard impossible measurements/detections)
- Prioritize more recently seen objects (smaller age)
- Deep appearance feature: l2 normalized feature, trained with triplet loss on re-ID database. See paper from the same author at WACV 2018 (triplet loss is still SOTA on Re-ID).
- ID switch is among the smallest and around half of sort.
- However the FP is much higher than sort, mainly due to maximum age A_max = 30 frames.
Code available at this repo.