Demystifying Unsupervised
Semantic Correspondence Estimation


Mehmet Aygun
Oisin Mac Aodha

University of Edinburgh
In European Conference on Computer Vision, 2022

 [Paper]  [Code]

Abstract

We explore semantic correspondence estimation through the lens of unsupervised learning. We thoroughly evaluate several recently proposed unsupervised methods across multiple challenging datasets using a standardized evaluation protocol where we vary factors such as the backbone architecture, the pre-training strategy, and the pre-training and finetuning datasets. To better understand the failure modes of these methods, and in order to provide a clearer path for improvement, we provide a new diagnostic framework along with a new performance metric that is better suited to the semantic matching task. Finally, we introduce a new unsupervised correspondence approach which utilizes the strength of pre-trained features while encouraging better matches during training. This results in significantly better matching performance compared to current state-of-the-art methods.




Evaluation Framework

Our paper introduces a detailed evaluation framework, along with a new metric, for understanding the failure cases of semantic correspondence methods. We define additional error metrics to analyze the performance of different methods in more detail. A visual overview is given in Fig. 2.



As our goal is to estimate semantic correspondence, a prediction should match the correct semantic part. We therefore propose a new version of PCK that also penalizes swap errors. Under this metric, a prediction is correct only if it lies close to the corresponding keypoint and the keypoint closest to it is that same semantic keypoint.
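The difference between the two metrics can be illustrated with a minimal sketch. This is not the released implementation: the function name `pck_variants`, the array layout, and the use of a fixed pixel threshold are assumptions made for illustration (PCK thresholds are often set relative to the object or image size instead).

```python
import numpy as np


def pck_variants(pred, gt, threshold):
    """Return (standard PCK, swap-penalizing PCK) for one image pair.

    Assumed layout (hypothetical, for illustration only):
      pred[i] is the predicted location for semantic keypoint i,
      gt[i]   is the ground-truth location of keypoint i,
      both given as (K, 2) arrays in pixels.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)

    # dists[i, j] = distance from prediction i to ground-truth keypoint j
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    idx = np.arange(len(gt))

    # Standard PCK condition: prediction is close to its own keypoint.
    close = dists[idx, idx] <= threshold

    # Extra condition: the nearest ground-truth keypoint is the same
    # semantic keypoint, so swaps no longer count as correct.
    nearest_is_same = dists.argmin(axis=1) == idx

    pck = float(close.mean())
    pck_strict = float((close & nearest_is_same).mean())
    return pck, pck_strict


# Toy example: the second prediction is within the threshold of its own
# keypoint but lies closer to the first keypoint (a swap).
gt = [[0, 0], [10, 0], [100, 100]]
pred = [[1, 0], [4, 0], [150, 150]]
pck, pck_strict = pck_variants(pred, gt, threshold=7)
```

In this toy example the second prediction counts as correct under standard PCK but not under the stricter variant, while the third prediction is incorrect under both.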





Key Results

Overall, our proposed ASYM approach obtains better scores than other unsupervised methods on all datasets, independent of the choice of backbone or pre-training method, with the exception of the AFLW face dataset. Compared to LEAD, our proposed adaptation improves performance on datasets where the visual diversity is high (i.e., non-face datasets). EQ and DVE perform poorly on the datasets where the variation in visual appearance across instances is high, but it is worth noting that these methods were originally designed for the end-to-end trained setting.



Detailed error analysis: For the unsupervised methods, the most common error type is the miss, across all methods. While ASYM reduces misses compared to other unsupervised methods, it is not as good as the supervised approaches. Compared to the more sophisticated supervised approaches, it also generates more swaps. We argue that while more supervision might help to reduce misses, better matching mechanisms are needed in order to reduce swaps. Finally, we can see that our new PCK metric is reduced by 20% compared to the original PCK metric in all cases. For some applications these errors might not affect the end performance drastically, while for others this disparity could be significant.





Paper

    Aygun and Mac Aodha

Demystifying Unsupervised Semantic Correspondence Estimation

ECCV, 2022. [Paper] [Bibtex]



Acknowledgements

Thanks to Hakan Bilen and Omiros Pantazis for their valuable feedback. This work was in part supported by the Turing 2.0 'Enabling Advanced Autonomy' project funded by the EPSRC and the Alan Turing Institute. This webpage template was borrowed from colorful folks.