Paper: FaceNet: A Unified Embedding for Face Recognition and Clustering
Authors: Florian Schroff (Google), Dmitry Kalenichenko (Google) and James Philbin (Google)
Area: Computer Vision, Clustering, Classification, Deep Learning
Year: 2015
Highlighted Paper

Background:

  • Euclidean space, distance and vector norms
  • Nearest neighbor/k-means/clustering
  • CNNs: Stanford's CS231n course notes; Deep Learning book, chapter 9
  • Important architectures for this paper:
    1. Zeiler & Fergus
    2. GoogLeNet (Inception) model
  • Key Contributions:

    1. Learns a mapping from face images to a compact Euclidean space in which distances directly correspond to a measure of face similarity. The method uses a CNN to optimize the embedding itself directly.
    2. Trains on triplets of roughly aligned matching/non-matching face patches, generated with an online triplet mining method.

    Why is this novel? Unlike previous approaches, where a final classification layer is used to predict the class, here we are learning an embedding that can then be reused for verification, recognition, and clustering.
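
    As a rough illustration of how such an embedding can be reused, the sketch below treats face verification as thresholding the squared distance between two L2-normalized embeddings. The function names and the threshold value are assumptions made for this example; in practice the threshold would be tuned on held-out pairs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm so embeddings lie on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def same_identity(emb_a, emb_b, threshold=1.1):
    """Hypothetical verification rule: same person if squared distance is small.

    The default threshold is an illustrative assumption, not a tuned value.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(emb_a, emb_b))
    return squared_distance < threshold


# Toy 2-D "embeddings"; a real model would output a higher-dimensional vector.
face_1 = l2_normalize([3.0, 4.0])
face_2 = l2_normalize([4.0, 3.0])
face_3 = l2_normalize([-3.0, -4.0])
```

    The same distances could feed a k-nearest-neighbor classifier or a clustering algorithm without retraining the network.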

    Model Triplet Loss:
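    $$\sum_{i}^{N}\left[\lVert f(x_{i}^{a}) - f(x_{i}^{p}) \rVert_{2}^{2} - \lVert f(x_{i}^{a}) - f(x_{i}^{n}) \rVert_{2}^{2} + \alpha\right]_{+}$$ Here \(a\) is the anchor, \(p\) is a positive (same identity as the anchor), \(n\) is a negative (a different identity), \(\alpha\) is the margin enforced between positive and negative pairs, and \([\cdot]_{+}\) denotes \(\max(\cdot, 0)\). You essentially want the positive to end up closer to the anchor than the negative by at least \(\alpha\).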
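
    A minimal pure-Python sketch of this loss, assuming the embeddings \(f(x)\) have already been computed and are passed in as plain lists of floats. Each term is clamped at zero, so triplets that already satisfy the margin contribute nothing. The function names and the default `alpha` are illustrative choices.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, positives, negatives, alpha=0.2):
    """Sum over triplets of max(d(a, p) - d(a, n) + alpha, 0).

    anchors, positives, negatives: equal-length lists of embedding vectors.
    alpha: margin; the default here is only an illustrative value.
    """
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(squared_l2(a, p) - squared_l2(a, n) + alpha, 0.0)
    return total


# An "easy" triplet (negative far away) and a "hard" one (negative as close as the positive).
easy = triplet_loss([[0.0, 0.0]], [[1.0, 0.0]], [[0.0, 2.0]])
hard = triplet_loss([[0.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]])
```

    The easy triplet yields zero loss, while the hard one is penalized by exactly the margin, which is why the choice of triplets matters for training.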

    Triplet Selection:
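    The paper focuses on online triplet generation: it uses large mini-batches on the order of a few thousand exemplars and computes the \(\arg\max\) (hardest positive, farthest from the anchor) and \(\arg\min\) (hardest negative, closest to the anchor) only within that mini-batch.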
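
    The within-batch selection can be sketched as follows. This is a simplified illustration that picks the single hardest positive and hardest negative for every anchor; it is not the paper's exact mining procedure, and the function name `mine_hard_triplets` is hypothetical.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mine_hard_triplets(embeddings, labels):
    """Return (anchor, positive, negative) index triplets mined from one mini-batch.

    For each anchor: the positive is the same-label example farthest away
    (argmax), and the negative is the different-label example closest (argmin).
    Anchors with no positive or no negative in the batch are skipped.
    """
    triplets = []
    for a, (emb_a, label_a) in enumerate(zip(embeddings, labels)):
        positives = [i for i, lab in enumerate(labels) if lab == label_a and i != a]
        negatives = [i for i, lab in enumerate(labels) if lab != label_a]
        if not positives or not negatives:
            continue
        hard_p = max(positives, key=lambda i: squared_l2(emb_a, embeddings[i]))
        hard_n = min(negatives, key=lambda i: squared_l2(emb_a, embeddings[i]))
        triplets.append((a, hard_p, hard_n))
    return triplets


# Toy batch: three embeddings of identity 0 and two of identity 1.
batch = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.5, 0.0], [5.0, 0.0]]
ids = [0, 0, 0, 1, 1]
mined = mine_hard_triplets(batch, ids)
```

    Restricting the search to the current mini-batch keeps mining cheap compared with scanning the whole training set, at the cost of only finding the hardest examples that happen to be in the batch.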

    CNN Architectures:
    The paper builds on two existing architectures: the Zeiler & Fergus model and the Inception model.

    Some implementations:
    tbmoon's implementation using PyTorch
    timesler's implementation

    Datasets
    Labeled Faces in the Wild (LFW), courtesy of UMass
    YouTube Faces DB