Abstract: Recent works have shown that, when trained at scale, unimodal 2D vision and text encoders converge to learned features that share remarkable structural properties, despite arising from ...