In this article, we look at deep metric learning, a technique for improving image classification models by leveraging semantic context from large language models such as BERT and RoBERTa. Because these models encode semantic context and capture semantic relationships between classes, they can guide deep metric learning toward better-structured embedding spaces.
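As a rough illustration of what "semantic context from a language model" can mean in practice, the sketch below embeds textual class names with a pretrained RoBERTa model and measures their pairwise similarities. The model choice, mean pooling, and placeholder class names are assumptions made for illustration, not details taken from the article.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Hypothetical sketch: embed class names with a pretrained language model so that
# their pairwise similarities can serve as semantic context for the image
# embedding space. Model choice, pooling, and class names are assumptions.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
language_model = AutoModel.from_pretrained("roberta-base").eval()

class_names = ["sparrow", "warbler", "jay"]  # placeholder class names
tokens = tokenizer(class_names, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = language_model(**tokens).last_hidden_state      # (C, T, D)
    mask = tokens["attention_mask"].unsqueeze(-1)             # ignore padding tokens
    class_embeddings = (hidden * mask).sum(1) / mask.sum(1)   # mean-pooled, (C, D)

# Cosine similarities between class-name embeddings encode semantic relationships.
z = F.normalize(class_embeddings, dim=1)
semantic_similarity = z @ z.t()                               # (C, C)
```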
The article first revisits the Proxy-Anchor loss, whose formulation weights the gradient of each positive and negative example by its relative difficulty. This encourages intra-class compactness and inter-class separability while effectively selecting informative triplets of samples, which is crucial for building highly separable embedding spaces.
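For reference, here is a minimal sketch of the Proxy-Anchor loss, in which harder positives and negatives receive larger weight through the exponential terms; the hyperparameter names α (scaling) and δ (margin) and the function signature are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, delta=0.1):
    """Sketch of the Proxy-Anchor loss: hard examples receive larger gradients.

    embeddings: (B, D) mini-batch features, labels: (B,) class indices,
    proxies: (C, D) one learnable proxy per class.
    """
    z = F.normalize(embeddings, dim=1)
    p = F.normalize(proxies, dim=1)
    sim = z @ p.t()                                        # (B, C) cosine similarities

    pos_mask = F.one_hot(labels, proxies.size(0)).bool()   # positive (sample, proxy) pairs
    neg_mask = ~pos_mask

    # Relative difficulty enters through the exponentials: harder pairs dominate the sums.
    pos_term = torch.exp(-alpha * (sim - delta)) * pos_mask
    neg_term = torch.exp(alpha * (sim + delta)) * neg_mask

    with_pos = pos_mask.any(dim=0)                         # proxies with positives in the batch
    pos_loss = torch.log1p(pos_term.sum(dim=0))[with_pos].mean()
    neg_loss = torch.log1p(neg_term.sum(dim=0)).mean()
    return pos_loss + neg_loss
```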
The hyperparameter α controls how strongly informative triplets are emphasized, and its best value depends on factors such as the dataset and the network architecture. However, computing the similarity of feature representations for the entire dataset with the current network becomes computationally infeasible for large datasets. In practice, sample mining is therefore performed per mini-batch B, computing similarities only against a subset of samples S′.
The next step performs a sub-proxy aggregation on S′ by summing the cosine similarities of all sub-proxies that belong to the same class in C. This yields a final similarity matrix S of dimensions B × C, which can be highly sparse, with many zero entries, for small values of K.
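A minimal sketch of these two steps, mini-batch similarity computation followed by per-class sub-proxy aggregation, is shown below; the helper name, argument names, and shapes are assumptions, with the sampled subset S′ passed in explicitly.

```python
import torch
import torch.nn.functional as F

def aggregate_sub_proxy_similarities(embeddings, sub_proxies, sub_proxy_classes, num_classes):
    """Mini-batch similarity computation followed by per-class sub-proxy aggregation.

    embeddings:        (B, D) features of the current mini-batch
    sub_proxies:       (P, D) the sampled subset S' of sub-proxies
    sub_proxy_classes: (P,)   class index in [0, C) of each sampled sub-proxy
    """
    z = F.normalize(embeddings, dim=1)
    p = F.normalize(sub_proxies, dim=1)
    sim = z @ p.t()                                        # (B, P): similarities with S' only

    # Sum the similarities of all sub-proxies that belong to the same class.
    S = torch.zeros(z.size(0), num_classes, device=z.device)
    index = sub_proxy_classes.unsqueeze(0).expand_as(sim)
    S.scatter_add_(1, index, sim)
    return S  # (B, C); classes with no sampled sub-proxy keep zero entries
```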
To address this issue, a masked softmax is applied, with Mij = 0 if Sij = 0 and Mij = 1 if Sij ≠ 0, so that the zero entries do not inflate the denominator when computing the probabilities Pij. The cross-entropy loss is then computed over Pij as:
LCE = –[ yij · log(Pij) + (1 – yij) · log(1 – Pij) ].
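Putting the masked softmax and the cross-entropy step together, a minimal sketch could look as follows; the function name, the one-hot construction of the targets yij, and the epsilon used for numerical stability are assumptions.

```python
import torch

def masked_softmax_cross_entropy(S, labels, eps=1e-12):
    """Masked softmax over the sparse similarity matrix S, then cross entropy.

    S:      (B, C) aggregated similarities; zero entries mean "no sub-proxy sampled"
    labels: (B,)   ground-truth class index of each sample
    """
    M = (S != 0).float()                                   # Mij = 1 if Sij != 0, else 0
    exp_S = torch.exp(S) * M                               # masked exponentials
    P = exp_S / exp_S.sum(dim=1, keepdim=True).clamp_min(eps)   # masked softmax Pij

    Y = torch.zeros_like(P)
    Y.scatter_(1, labels.unsqueeze(1), 1.0)                # one-hot targets yij

    # Binary cross entropy per entry, averaged over the non-masked positions.
    bce = -(Y * torch.log(P + eps) + (1 - Y) * torch.log(1 - P + eps))
    return (bce * M).sum() / M.sum()
```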
In summary, deep metric learning can improve image classification models by leveraging large language models to encode semantic context and model semantic relationships correctly, producing highly separable embedding spaces that capture the underlying structure of the data. Mini-batch sampling keeps the similarity computation tractable on large datasets, and the masked softmax ensures accurate probability computation over the resulting sparse similarity matrix without inflating the denominator.