Paper: FaceNet: A Unified Embedding for Face Recognition and Clustering
Authors: Florian Schroff (Google), Dmitry Kalenichenko (Google), and James Philbin (Google)
Area: Computer Vision, Clustering, Classification, Deep Learning
Year: 2015
Highlighted Paper
Background:
Key Contributions:
- Learns a mapping from face images to a compact Euclidean space where distances directly correspond to a measure of face similarity. The method uses a CNN to directly optimize the embedding itself.
- Training uses triplets of roughly aligned matching/non-matching face patches generated with an online triplet mining method.
Why is this novel? Unlike previous approaches, where a final classification layer is used to predict the class, here we are learning an embedding which can then be used directly for tasks such as verification, recognition, and clustering (see the sketch below).
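A minimal sketch (not the authors' code) of why the embedding view is useful: once a trained network maps aligned face crops to L2-normalized 128-D vectors, verification reduces to thresholding a squared Euclidean distance. `embed` is a hypothetical stand-in for such a trained model, and the threshold value is illustrative.

```python
import numpy as np

def embed(face_crop: np.ndarray) -> np.ndarray:
    """Hypothetical trained embedding network; returns an L2-normalized 128-D vector."""
    v = np.random.rand(128)            # placeholder for a real CNN forward pass
    return v / np.linalg.norm(v)

def same_person(face_a: np.ndarray, face_b: np.ndarray, threshold: float = 1.1) -> bool:
    # Small squared distance between embeddings => same identity.
    d2 = np.sum((embed(face_a) - embed(face_b)) ** 2)
    return d2 < threshold              # threshold would be tuned on a validation set
```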
Model
Triplet Loss:
$$\sum_{i}^{N}\left[\lVert f(x_{i}^{a}) - f(x_{i}^{p}) \rVert_{2}^{2} - \lVert f(x_{i}^{a}) - f(x_{i}^{n}) \rVert_{2}^{2} + \alpha\right]_{+}$$
\(a\) is the anchor, \(p\) is a positive (same identity), and \(n\) is a negative (different identity). You essentially want to optimize for making the positive closer to the anchor than the negative, by at least the margin \(\alpha\).
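A short PyTorch sketch of this loss, with the hinge implemented via clamping at zero. `emb_a`, `emb_p`, `emb_n` are assumed to be batches of L2-normalized embeddings of shape (N, 128) produced by the same network \(f\); the margin of 0.2 is the value used in the paper.

```python
import torch

def triplet_loss(emb_a, emb_p, emb_n, alpha: float = 0.2):
    d_ap = (emb_a - emb_p).pow(2).sum(dim=1)   # ||f(x_a) - f(x_p)||_2^2
    d_an = (emb_a - emb_n).pow(2).sum(dim=1)   # ||f(x_a) - f(x_n)||_2^2
    return torch.clamp(d_ap - d_an + alpha, min=0).sum()
```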
Triplet Selection:
We focus on online generation: use large mini-batches on the order of a few thousand exemplars and compute the \(argmin\) and \(argmax\) only within that mini-batch.
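A simplified sketch of mining within a mini-batch, in the spirit of the paper's semi-hard selection: for each anchor-positive pair, pick a negative that is farther than the positive but still inside the margin. `labels` is assumed to be a 1-D integer tensor of identities aligned with the embedding batch; this is illustrative, not the paper's exact procedure.

```python
import torch

def pairwise_sq_dists(emb: torch.Tensor) -> torch.Tensor:
    # (B, B) matrix of squared Euclidean distances between all embeddings in the batch.
    return torch.cdist(emb, emb, p=2).pow(2)

def semi_hard_triplets(emb: torch.Tensor, labels: torch.Tensor, alpha: float = 0.2):
    d = pairwise_sq_dists(emb)
    triplets = []
    for a in range(len(labels)):
        for p in range(len(labels)):
            if p == a or labels[p] != labels[a]:
                continue
            # Candidate negatives: different identity, farther than the positive,
            # but still violating the margin (d_an < d_ap + alpha).
            mask = (labels != labels[a]) & (d[a] > d[a, p]) & (d[a] < d[a, p] + alpha)
            candidates = mask.nonzero(as_tuple=True)[0]
            if len(candidates) > 0:
                n = candidates[d[a, candidates].argmin()]   # hardest of the semi-hard
                triplets.append((a, p, n.item()))
    return triplets
```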
CNN Architectures:
Existing architectures are used - the Inception model and the Zeiler & Fergus model.
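Regardless of the backbone, the key architectural point is the head: project features to 128 dimensions and L2-normalize so embeddings live on a hypersphere and squared distances are directly comparable. A minimal sketch (not the paper's exact architecture), assuming any CNN backbone that outputs a flat feature vector:

```python
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    def __init__(self, backbone: nn.Module, backbone_dim: int, emb_dim: int = 128):
        super().__init__()
        self.backbone = backbone                  # e.g. an Inception- or Zeiler&Fergus-style CNN
        self.fc = nn.Linear(backbone_dim, emb_dim)

    def forward(self, x):
        features = self.backbone(x)               # (B, backbone_dim)
        emb = self.fc(features)                   # (B, 128)
        return F.normalize(emb, p=2, dim=1)       # constrain ||f(x)||_2 = 1
```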
Some implementations:
tbmoon's implementation using PyTorch
timesler's implementation
Datasets
Labeled Faces in the Wild (LFW), courtesy of UMass
YouTube Faces DB