This project leverages Metaflow, PyTorch, AWS S3, Elasticsearch, FastAPI and Docker to create a production-ready facial recognition solution. It demonstrates the practical use of deep metric learning to recognize previously unseen faces without retraining.
Streamlit demo:
Note that due to the small size of the training data (~5k images), recognition power might not be optimal; production facial recognition systems are typically trained on millions of samples.
The architecture, as implemented locally:
Overview of the techniques and network architectures used.
I'm using a dataset called "Labeled Faces in the Wild" (LFW) that contains more than 13,000 images of faces collected from the web. Sample images:
A large set of data augmentations is available. I applied a high level of image augmentation, since we don't have a lot of images. Here is an example of an image before and after augmentation:
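As a rough illustration, a heavy augmentation pipeline along these lines could be built with torchvision (the exact transforms and parameters here are assumptions, not necessarily what the project uses):

```python
from torchvision import transforms

# Hypothetical heavy-augmentation pipeline; the project's actual transforms may differ.
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    # ImageNet normalization defaults, matching the training settings listed below
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```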
The dataset was randomly split into train/dev/test by person_id, meaning different people end up in different splits. I only included people with at least two images, since that's the minimum for triplet mining (at least one positive sample per anchor); a minimal sketch of this grouping logic follows the split counts below.
Data was divided as follows:
- Training split: 1008 people (5299 images)
- Dev split: 336 people (1695 images)
- Test split: 336 people (2170 images)
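A minimal sketch of this split-by-person logic, assuming a pandas DataFrame with one row per image (the column names and toy data are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real image metadata table.
df = pd.DataFrame({
    "person_id": ["a", "a", "b", "b", "c"],
    "image_path": ["a1.jpg", "a2.jpg", "b1.jpg", "b2.jpg", "c1.jpg"],
})

# Keep only people with at least two images (needed for triplet mining).
counts = df.groupby("person_id")["image_path"].count()
eligible = counts[counts >= 2].index.to_numpy()

# Shuffle people, then split ~60/20/20 so no identity crosses splits.
rng = np.random.default_rng(42)
rng.shuffle(eligible)
n_train, n_dev = int(0.6 * len(eligible)), int(0.2 * len(eligible))
train_ids = set(eligible[:n_train])
dev_ids = set(eligible[n_train:n_train + n_dev])
test_ids = set(eligible[n_train + n_dev:])

train_df = df[df["person_id"].isin(train_ids)]
```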
I used an EfficientNet (B0) that was trained (by the paper authors) on ImageNet-1k and unlabeled JFT-300M using Noisy Student semi-supervised learning.
Here is a diagram of how this works in practice:
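Loading such a checkpoint is straightforward with the timm library. A sketch, assuming timm's Noisy Student EfficientNet-b0 weights (`tf_efficientnet_b0_ns`):

```python
import timm
import torch

# EfficientNet-b0 with Noisy Student pretraining; num_classes=0 strips the
# classifier head, leaving the pooled 1280-d feature extractor.
backbone = timm.create_model("tf_efficientnet_b0_ns", pretrained=True, num_classes=0)

with torch.no_grad():
    features = backbone(torch.randn(1, 3, 224, 224))
print(features.shape)  # torch.Size([1, 1280])
```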
Since my objective here is to build a system capable of recognizing any person's face, traditional methods like multiclass classification were discarded.
I fine-tuned this model via metric learning on online-mined triplets, leveraging the SuperTriplets Python library. By doing this, the model learned embeddings capable of differentiating faces via cosine similarity.
Online Mined Hard Triplets Loss
The concept of triplets is central to this technique. A triplet consists of three samples: an anchor, a positive, and a negative instance. In the supervised mode with multimodal data, an anchor could be an image of a dog and its caption, the positive sample another image of a different dog and its caption, while the negative sample could be an image of a different animal, let's say a cat, and its caption.
The goal of a model updated by a triplet loss is to correctly discriminate between positive and negative instances while also ensuring that the embeddings of positive instances are closer together than those of negative instances. However, randomly selecting triplets during training can lead to slow convergence and suboptimal results. To address this, we employ online hard triplet mining, where we dynamically select the hardest triplets during each training iteration. This focuses the training process on the most informative and challenging instances, leading to more robust representations.
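A minimal sketch of batch-hard triplet loss in plain PyTorch (the project uses SuperTriplets' `BatchHardTripletLoss`, whose implementation may differ; this assumes the batch sampler puts at least two images per identity in each batch):

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    # Pairwise euclidean distances between all embeddings in the batch.
    dists = torch.cdist(embeddings, embeddings, p=2)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)

    # Hardest positive per anchor: farthest sample sharing the anchor's label.
    pos_dists = dists.clone()
    pos_dists[~same] = 0.0
    hardest_pos = pos_dists.max(dim=1).values

    # Hardest negative per anchor: closest sample with a different label.
    neg_dists = dists.clone()
    neg_dists[same] = float("inf")
    hardest_neg = neg_dists.min(dim=1).values

    # Hinge loss: push the hardest negative at least `margin` farther
    # away than the hardest positive.
    return torch.relu(hardest_pos - hardest_neg + margin).mean()

emb = torch.randn(8, 300, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(batch_hard_triplet_loss(emb, labels))
```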
Here is a scheme of one triplet loss optimization step (single triplet):
Optimization settings
Here are the settings used during training:
- Criterion: BatchHardTripletLoss (standard batch hard triplet loss, with a margin param)
- Distance optimized during training: Euclidean
- Optimizer: AdamW
- Learning rate: 1e-3
- Weight decay: 1e-2
- Max epochs: 10
- Early stopping with frequent evaluation: saves best model, stops training given a patience of 3
- Training batch size: 32
- Image size: 224x224
- Image normalization: imagenet defaults
- Linear projection (last layer) output dimension: 300
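A rough sketch of how these settings fit together in plain PyTorch, reusing the `backbone` from the timm sketch above (SuperTriplets drives the actual training loop; names and placeholders here are illustrative):

```python
import torch
from torch import nn, optim

encoder = nn.Sequential(backbone, nn.Linear(1280, 300))  # 300-d linear projection
optimizer = optim.AdamW(encoder.parameters(), lr=1e-3, weight_decay=1e-2)

best, patience, bad_evals = float("inf"), 3, 0
for epoch in range(10):  # max epochs
    # ... one epoch of batch-hard triplet training (batch size 32) ...
    dev_loss = evaluate_on_dev(encoder)  # hypothetical evaluation helper
    if dev_loss < best:
        best, bad_evals = dev_loss, 0
        torch.save(encoder.state_dict(), "best_model.pt")  # keep best checkpoint
    else:
        bad_evals += 1
        if bad_evals >= patience:
            break  # early stopping after 3 evaluations without improvement
```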
Calibrating Face Matching Probabilities
I also trained a scikit-learn `IsotonicRegression` estimator to calibrate face matching probabilities from the cosine similarity scores, making outputs as reliable as possible. It was fitted on the dev split and tested on the test split:
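Conceptually, fitting the calibrator looks like this (toy data; the real fit uses dev-split cosine similarities and binary same-person labels):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy example: cosine similarity scores with binary "same person" labels.
scores = np.array([0.15, 0.30, 0.45, 0.60, 0.75, 0.90])
is_match = np.array([0, 0, 0, 1, 1, 1])

calibrator = IsotonicRegression(out_of_bounds="clip")  # clip unseen extremes
calibrator.fit(scores, is_match)

print(calibrator.predict([0.80]))  # calibrated match probability
```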
Overview of the inference stack.
FastAPI is a modern, fast (high-performance), web framework for building APIs with Python 3.7+ based on standard Python type hints. It is used in this project to build the interface between the trained models/stack and potential users.
On startup, the API will connect to Metaflow/S3/Elasticsearch and load the encoder, probability calibration model and every other configuration needed for production (e.g. image preprocessing transforms, elasticsearch client).
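A hedged sketch of that startup wiring (names, ports, and placeholders are illustrative, not the project's actual code):

```python
from elasticsearch import Elasticsearch
from fastapi import FastAPI

app = FastAPI(title="facial-recognition-api")

@app.on_event("startup")
def load_resources():
    # Connect the Elasticsearch client shared by both endpoints.
    app.state.es = Elasticsearch("http://localhost:9200")
    # Placeholders: in practice these are loaded from the Metaflow run's artifacts.
    app.state.encoder = ...      # trained face encoder
    app.state.calibrator = ...   # isotonic probability calibrator
    app.state.transforms = ...   # image preprocessing transforms
```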
There are two main endpoints in the `facial-recognition-api`:
Uses the trained encoder to calculate the vector representation of an input image. Indexes a document in Elasticsearch with the person's name and embeddings, in the form of:
```json
{
  "embeddings": [0.12, 0.523, 0.32, 0.96, 0.04, 0.77],
  "name": "elon musk"
}
```
Returns the status and the Elasticsearch message for the indexing attempt:
```json
{
  "error": false,
  "msg": "created"
}
```
Uses the trained encoder to calculate the vector representation of an input image, then uses it as a query to search Elasticsearch for the closest known (indexed) person. With the closest person's similarity score in hand, the probability calibrator is invoked, and the calibrated prediction is returned:
```json
{
  "error": false,
  "results": {
    "pred": "elon musk",
    "proba": 0.95
  }
}
```
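A sketch of the search side, assuming Elasticsearch 8.x's approximate kNN API and the same assumed index name (the query vector would come from the encoder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="faces",  # assumed index name
    knn={
        "field": "embeddings",
        "query_vector": [0.12, 0.523, 0.32, 0.96, 0.04, 0.77],  # encoder output
        "k": 1,
        "num_candidates": 100,
    },
)
hit = resp["hits"]["hits"][0]
name = hit["_source"]["name"]
score = hit["_score"]  # cosine-derived score, then fed to the calibrator
```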
Elasticsearch is a distributed search and analytics engine built on Apache Lucene. It is used here to store, index, and search for faces given their embeddings. I configured an index with approximate nearest neighbor search via cosine similarity enabled.
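For reference, such a mapping in Elasticsearch 8.x looks roughly like this (index and field names follow the API examples above; the project's actual index configuration may differ):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# dense_vector field indexed for approximate kNN with cosine similarity.
es.indices.create(
    index="faces",
    mappings={
        "properties": {
            "embeddings": {
                "type": "dense_vector",
                "dims": 300,  # matches the encoder's projection dimension
                "index": True,
                "similarity": "cosine",
            },
            "name": {"type": "keyword"},
        }
    },
)
```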
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1.
Embeddings of Elon Musk and Mark Zuckerberg faces in two dimensions.
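In code, this is just a normalized dot product:

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```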
Similarity search is the most general term used for a range of mechanisms which share the principle of searching (typically, very large) spaces of objects where the only available comparator is the similarity between any pair of objects.
Nearest neighbor search algorithms aim to find the closest data point(s) to a given query point from a dataset. In some cases, particularly when dealing with very large datasets, exact nearest neighbor search can be very computationally expensive (both in space and time, see "curse of dimensionality"), sometimes even impractical.
To speed up the search, approximation methods are often used. They reduce the quality of the search results, but in many domains finding an approximate nearest neighbor is an acceptable solution.
You will need:
- python3.10, python3.10-venv
- docker and docker-compose
- make
- Clone https://github.com/gabrieltardochi/metaflow-docker-deployment and follow the instructions there to start the Metaflow stack
- Run `make dev-venv` to create your venv. Activate it with `source .venv/bin/activate`
- Download and set everything up with `python project_setup.py`
- Bring up your local infrastructure stack (S3 and Elasticsearch) by running `docker-compose -f docker-compose-infra.yaml up -d`. You might need to `sudo chown 1000:1000 es-data` and rerun if Elasticsearch fails to start
- Now you are ready to train with custom params. Check `python metaflow_train.py --help`
- Once trained, update `.env` with the correct Metaflow training run
- Build and run the facial recognition API by running `docker-compose -f docker-compose-api.yaml up -d --build`
- Docs are available at `localhost:8080/docs`
- Now you can also run the Streamlit webapp with `streamlit run streamlit_webapp.py`
- Index new faces or search by interacting with the webapp at `localhost:8501`
There's a utility script `./index_people.sh` in case you want to batch index some people's faces (the people in `sample-faces/index`). You might need to `chmod +x index_people.sh` in order to run it.
Here are solutions for some common situations you might find yourself in:
- Check if ES is running fine: `curl http://localhost:${ES_API_PORT} -ku 'admin:admin'`
- Delete the ES index: `curl -X DELETE "localhost:${ES_API_PORT}/${ES_INDEX}"`
- Manually create the ES index: `docker-compose -f docker-compose-infra.yaml up facial-recognition-es-create-index`
- Get ES index metrics (such as number of docs): `curl -X GET "localhost:${ES_API_PORT}/${ES_INDEX}/_stats?pretty"`