This repository contains an implementation of the rectangular histogram of oriented gradients feature descriptor (R-HOG) using integral histograms. The integral histogram representation makes it possible to compute HOG features for arbitrary rectangular subregions of an image in constant time. This is particularly useful if features must be computed repeatedly for the same image, e.g., in a sliding window manner.
HOG features may be seen as a special case of the Scale-invariant Feature Transform (SIFT) computed over a dense grid of keypoints where each block is additionally contrast-normalized.
- C++ templated implementation
- Python support for 32-, 64-, and 80-bit floating point precision
- Unrestricted input size (e.g., OpenCV as of version 4.5.5 requires the input size to be a multiple of the block size)
- Support for arbitrary integer (8-bit to 64-bit, both signed and unsigned) and floating point input (e.g., OpenCV requires 8-bit unsigned integer input)
- Masking support (i.e., spatial exclusion of gradient magnitudes from contributing to features)
For a complete summary of differences between HOGpp and existing implementations, refer to the feature matrix below.
- C++17 compiler
- Boost 1.70
- CMake 3.15
- Eigen 3.4.0
- fmt 6.0
- OpenCV 4.0
- pybind11 2.6.2 (version 2.9.0 is required for use with Visual Studio 17 2022 and above)
More recent versions of the above are expected to work as well.
In Python:
from hogpp import IntegralHOGDescriptor
desc = IntegralHOGDescriptor()
# Load image
image = ...  # placeholder: load the image here
# Precompute the gradient histograms. This needs to be done only once for each image.
desc.compute(image)
# Extract the feature descriptor of a region of interest. The method can be
# called multiple times for different subregions of the above image. Note the
# use of matrix indexing (row, column) along each axis as opposed to Cartesian
# coordinates.
roi = (0, 0, 128, 64)  # top-left corner (row, column), then size (height, width)
X = desc(roi)
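Because the histograms are precomputed once per image, the descriptor is typically queried for many regions. The following is a minimal sketch of how sliding-window regions could be enumerated in the same (row, column, height, width) layout as the `roi` above. The helper name `sliding_rois` and the stride of 8 pixels are assumptions made for illustration and are not part of HOGpp.

```python
def sliding_rois(image_shape, window=(128, 64), stride=(8, 8)):
    """Yield (row, column, height, width) tuples that fit inside the image.

    The tuple layout follows the matrix-indexing convention of the ROI in the
    usage example; the function itself is a hypothetical helper.
    """
    rows, cols = image_shape[:2]
    height, width = window
    for top in range(0, rows - height + 1, stride[0]):
        for left in range(0, cols - width + 1, stride[1]):
            yield (top, left, height, width)


rois = list(sliding_rois((160, 96)))
print(len(rois))  # 25 windows (5 vertical x 5 horizontal positions)
```

Each yielded tuple could then be passed to the descriptor after a single call to `desc.compute(image)`, as in the usage example.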
The following feature matrix summarizes the differences between existing implementations.
Library | Signed Orientations | Custom Gradients | Masking | Arbitrary Input Size | Implementation |
---|---|---|---|---|---|
HOGpp | ✔️ | ✔️ | ✔️ | ✔️ | C++ |
OpenCV | ✔️ | ✖ | ✖ | ✖ | C++ |
scikit-image | ✖ | ✖ | ✖ | ✔️ | Cython/Python |
When using HOGpp, one should be aware of subtle differences between the integral histogram implementation and the one originally proposed by Dalal & Triggs.
In general, computing R-HOG consists of the following steps:
- (optional) gamma correction
- gradient computation
- orientation binning within a cell
- down-weighting of pixels using a Gaussian with respect to their position within a block
- trilinear interpolation of magnitude votes between neighboring bins in both orientation and position
- block normalization
Given these steps, R-HOG extracted using an integral histogram is slightly inferior to the original formulation because neither Gaussian down-weighting of pixels nor trilinear interpolation can be performed efficiently within the integral histogram framework. However, the integral histogram formulation is substantially faster while remaining a sufficiently close approximation of the original R-HOG formulation.
Despite the above limitations, our evaluation on the INRIA person dataset and the comparison against OpenCV's HOGDescriptor indicate that Gaussian down-weighting in particular does not necessarily improve the generalization ability of the associated classifiers.
For a comparison of both approaches, the interested reader should refer to:
Qiang Zhu, Mei-Chen Yeh, Kwang-Ting Cheng, & Avidan, S. (2006). Fast Human Detection Using a Cascade of Histograms of Oriented Gradients. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 2, pp. 1491–1498). IEEE. DOI: 10.1109/CVPR.2006.119
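To make the integral histogram idea more concrete, the following pure-Python sketch builds one cumulative table per orientation bin and then reads the histogram of any rectangle with four lookups per bin, independent of the rectangle's size. It is an illustration under simplifying assumptions (central differences with clamped borders, unsigned orientations, hard binning without interpolation or Gaussian weighting) and is not the HOGpp implementation. The function names are hypothetical.

```python
import math


def integral_histogram(image, bins=9):
    """Build a (rows + 1) x (cols + 1) x bins table of cumulative votes.

    `image` is a list of rows of grayscale values. Gradients use central
    differences with clamped borders; orientations are unsigned, in [0, 180).
    """
    rows, cols = len(image), len(image[0])
    table = [[[0.0] * bins for _ in range(cols + 1)] for _ in range(rows + 1)]
    bin_width = 180.0 / bins
    for r in range(rows):
        for c in range(cols):
            dy = image[min(r + 1, rows - 1)][c] - image[max(r - 1, 0)][c]
            dx = image[r][min(c + 1, cols - 1)] - image[r][max(c - 1, 0)]
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 180.0
            voted = int(angle / bin_width) % bins  # hard assignment to one bin
            for b in range(bins):
                vote = magnitude if b == voted else 0.0
                table[r + 1][c + 1][b] = (
                    table[r][c + 1][b] + table[r + 1][c][b] - table[r][c][b] + vote
                )
    return table


def region_histogram(table, top, left, height, width):
    """Histogram of a rectangle using four table lookups per bin."""
    bottom, right = top + height, left + width
    return [
        table[bottom][right][b] - table[top][right][b]
        - table[bottom][left][b] + table[top][left][b]
        for b in range(len(table[0][0]))
    ]


# A horizontal ramp has a constant gradient of 2 in its interior.
ramp = [[float(c) for c in range(6)] for _ in range(6)]
table = integral_histogram(ramp)
print(region_histogram(table, 1, 1, 2, 2)[0])  # 8.0
```

Cell histograms obtained this way would still need to be grouped into blocks and normalized to form an R-HOG descriptor.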
The HOGpp implementation was validated by applying it to the task of pedestrian detection. For the most part, the experiments by Dalal & Triggs were replicated with few alterations.
More specifically, we trained a linear support vector machine (SVM) in the primal using stochastic gradient descent (SGD) on features extracted from cropped annotations of the INRIA person dataset. We then quantitatively compared the performance of the obtained classifier against models trained on descriptors extracted using OpenCV and scikit-image.
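As a rough illustration of what training a linear SVM in the primal with SGD means, the sketch below minimizes the L2-regularized hinge loss one sample at a time. It is not the training code used for the experiments: the constant step size, the fixed number of epochs, the absence of early stopping, and the function names are all assumptions made for brevity.

```python
import random


def train_linear_svm(samples, labels, penalty=0.01, epochs=50, step=0.1, seed=0):
    """Minimize penalty / 2 * ||w||^2 + hinge loss with plain SGD.

    Labels must be +1 or -1. Returns the weight vector and the bias.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # Gradient step on the L2 penalty: shrink the weights.
            weights = [w * (1.0 - step * penalty) for w in weights]
            if margin < 1.0:
                # The hinge loss is active: the sample lies inside the margin
                # or on the wrong side, so move the hyperplane toward it.
                weights = [w + step * y * v for w, v in zip(weights, x)]
                bias += step * y
    return weights, bias


def decision_function(weights, bias, x):
    """Signed score; its magnitude grows with distance from the boundary."""
    return sum(w * v for w, v in zip(weights, x)) + bias


samples = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
labels = [1, 1, 1, -1, -1, -1]
weights, bias = train_linear_svm(samples, labels)
print([decision_function(weights, bias, x) > 0 for x in samples])
```

In the experiments described here, each sample would be a 3780-dimensional HOG descriptor rather than a two-dimensional toy point.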
The following figure provides an intuition of the steps involved in training a pedestrian classifier and its use on HOG features for predicting the corresponding class.
At a high level, HOG features describe the silhouette of a pedestrian, which the classifier then uses much like a template in template matching while tolerating some pose variation.
For comparison purposes, we trained all classifiers using the same fixed set of HOG parameters, which produces a 3780-dimensional feature vector. Specifically, the parameters employed were:
- 9 orientation bins constructed from unsigned gradients
- cell size of 8×8 pixels
- overlapping blocks consisting of 16×16 pixels (or equivalently, 2×2 cells)
- l2-hys block normalization clipped at 0.2
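The parameters above can be checked with a short sketch that derives the descriptor length for a 128×64 window and illustrates the clipped block normalization. It assumes that blocks are shifted by one cell at a time, which is consistent with the stated 3780 dimensions but is not spelled out in the list. The small `eps` constant and the function names are likewise assumptions.

```python
import math


def descriptor_length(window=(128, 64), cell=8, block_cells=2, bins=9):
    """Number of features for a window, with blocks shifted one cell at a time."""
    cells_y, cells_x = window[0] // cell, window[1] // cell
    blocks_y = cells_y - block_cells + 1
    blocks_x = cells_x - block_cells + 1
    return blocks_y * blocks_x * block_cells * block_cells * bins


def l2_hys(block, clip=0.2, eps=1e-12):
    """L2-normalize, clip each entry at `clip`, then L2-normalize again."""
    norm = math.sqrt(sum(v * v for v in block) + eps)
    clipped = [min(v / norm, clip) for v in block]
    norm = math.sqrt(sum(v * v for v in clipped) + eps)
    return [v / norm for v in clipped]


print(descriptor_length())  # 15 * 7 blocks * 4 cells * 9 bins = 3780
```

Clipping limits the influence of a few very strong gradients within a block before the final renormalization.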
We then trained an initial SVM classifier using 5-fold stratified inner cross-validation while optimizing the regularization term penalty using grid search. 20% of the samples of each training split were additionally used as a validation split to allow for early stopping.
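The splitting scheme described above can be sketched without any machine learning library. The example below assigns samples to five stratified folds and then holds out 20% of a training split for early stopping. Whether the validation split was itself stratified is not stated, so doing so here is an assumption, as are the function names. The grid search over the regularization penalty is omitted.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign every sample index to one of `n_folds` folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def split_validation(train_indices, labels, fraction=0.2, seed=0):
    """Hold out `fraction` of each class in a training split."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index in train_indices:
        by_class[labels[index]].append(index)
    fit, validation = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * fraction)
        validation.extend(indices[:cut])
        fit.extend(indices[cut:])
    return fit, validation


labels = [1] * 20 + [-1] * 80
folds = stratified_folds(labels)
# Use fold 0 for testing and the remaining four folds for training.
train = [index for fold in folds[1:] for index in fold]
fit, validation = split_validation(train, labels)
print(len(fit), len(validation), len(folds[0]))  # 64 16 20
```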
After obtaining the initial model, we used each classifier to perform an exhaustive search for false positives (i.e., hard mining) and retrained the classifiers by including the hard mined samples.
We used confidence-based sampling, as opposed to random sampling, to subsample the large set of false positives. Specifically, up to 30 of the most confident false positives (i.e., the samples farthest away from the decision boundary) were selected as hard negatives.
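Confidence-based selection of hard negatives amounts to ranking false positives by their decision value. The sketch below assumes a decision threshold of zero and a hypothetical function name. The scope of the limit of 30 (for example, per image or per batch) is not specified here, so the function simply applies it to whatever list of scores it is given.

```python
import heapq


def select_hard_negatives(scores, limit=30):
    """Return indices of the highest-scoring false positives, best first.

    `scores` holds the classifier's decision value for each window taken from
    a negative image; a positive value means the window was wrongly
    classified as a pedestrian.
    """
    false_positives = [(s, i) for i, s in enumerate(scores) if s > 0.0]
    return [i for _, i in heapq.nlargest(limit, false_positives)]


scores = [-0.4, 2.1, 0.3, -1.0, 0.9]
print(select_hard_negatives(scores, limit=2))  # [1, 4]
```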
The following plot summarizes the performance of refined models at various thresholds.
Overall, the HOGpp based model outperforms models that use OpenCV and scikit-image HOG descriptors.
A detailed look at additional classification metrics, however, shows that HOGpp achieves a lower precision than the other two implementations. Yet, its recall and consequently its F₁ score are considerably higher, so it outperforms both implementations overall.
Implementation | Precision | Recall | F₁ score | Accuracy |
---|---|---|---|---|
hogpp | 95.45% | 90.75% | 93.04% | 97.20% |
skimage | 96.95% | 83.53% | 89.74% | 96.06% |
cv2 | 98.32% | 79.46% | 87.89% | 95.48% |
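The F₁ scores in the table follow directly from the listed precision and recall values, since F₁ is their harmonic mean. The short check below reproduces them from the rounded table entries.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


table = {
    "hogpp": (0.9545, 0.9075),
    "skimage": (0.9695, 0.8353),
    "cv2": (0.9832, 0.7946),
}
for name, (precision, recall) in table.items():
    print(f"{name}: {100.0 * f1_score(precision, recall):.2f}%")
```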
It is also important to consider the number of hard negatives produced by each of the HOG descriptor implementations. The following table provides an overview of the corresponding absolute numbers.
Implementation | Hard negatives |
---|---|
hogpp | 30584 |
cv2 | 31113 |
skimage | 33433 |
In this specific application, the initial model obtained from HOGpp descriptors generates the fewest false positives usable for further refinement. Although its overall number of training samples is therefore the lowest, the HOGpp model still achieves the best performance in terms of the F₁ score and ROC AUC. At the same time, this indicates that the initial HOGpp model already generalizes better than the OpenCV and scikit-image based models.
Due to the probabilistic nature of the learning process, the number of hard negatives in particular can vary depending on the chosen seed. The corresponding numbers should therefore be taken with a grain of salt: at times, the OpenCV based model produces fewer hard negatives than HOGpp. This observation, however, does not affect the generalization ability of the refined models on this task.
The following bar plot summarizes the average runtime of individual HOG implementations for extracting the descriptor of a single 128×64 (height×width) region of interest (ROI) within a larger image as performed during hard mining.
The runtime of the precompute stage, which applies only to HOGpp, is negligible and is therefore hardly visible in the bar plot. The extract stage is computationally more expensive. Nevertheless, HOGpp outperforms both other implementations in terms of the average cumulative runtime for a single ROI, which is around 32 μs.
The speed-up factor achieved by HOGpp with respect to the OpenCV and scikit-image implementations is as follows:
Implementation | cv2 | skimage |
---|---|---|
hogpp | ×2.4 | ×7.3 |
As always, the provided results are specific to the described experiment, environment, and the setup used to evaluate the models, and therefore should not be extrapolated to different tasks without validation.
Porikli, F. (2005). Integral histogram: a fast way to extract histograms in Cartesian spaces. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 829–836). IEEE. DOI: 10.1109/CVPR.2005.188
Dalal, N., & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Vol. 1, pp. 886–893). IEEE. DOI: 10.1109/CVPR.2005.177
Dollar, P., Tu, Z., Perona, P., & Belongie, S. (2009). Integral Channel Features. In Proceedings of the British Machine Vision Conference 2009 (Vol. 30, pp. 91.1–91.11). British Machine Vision Association. DOI: 10.5244/C.23.91
This document and all figures are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
HOGpp is provided under the Apache License 2.0.