Abstract: Recent works have shown that, when trained at scale, unimodal 2D vision and text encoders converge to learned features that share remarkable structural properties, despite arising from ...