ImageBind is one of the recent multimodal works that goes beyond audio-visual learning. It proposes using one modality as an anchor, chosen as the modality most likely to have the largest amount of paired data available. This lets disparate modalities that were never paired with each other arrive at a similar joint representation, which enables emergent behaviors and removes the requirement that every modality be present for every data sample.
|True multimodal learning needs a large amount of paired (complete) data.||ImageBind uses images as the anchor and learns from image-paired data for each of the other modalities.|
The authors claim that ImageBind outperforms the following settings:
Each sample has one positive pair, and all other samples are treated as negatives (InfoNCE).
InfoNCE (Noise-Contrastive Estimation) loss:
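To make the "one positive, all others negative" idea concrete, here is a minimal NumPy sketch of an InfoNCE-style loss over a batch of paired embeddings. It is an illustration under stated assumptions, not the paper's implementation: the function names `info_nce` and `symmetric_info_nce`, the temperature value of 0.07, and the use of in-batch negatives with L2-normalised embeddings are choices made for this example.

```python
import numpy as np

def info_nce(q, k, temperature=0.07):
    """InfoNCE over a batch: row i of q is the positive for row i of k.

    q, k: arrays of shape (batch, dim). Every other row of k in the batch
    acts as a negative for q[i]. The temperature is an illustrative value.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)

    # Similarity of every q[i] with every k[j], scaled by the temperature.
    logits = q @ k.T / temperature                      # (batch, batch)

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positives sit on the diagonal; average their negative log-probability.
    return -np.mean(np.diag(log_prob))

def symmetric_info_nce(q, k, temperature=0.07):
    """Sum of both directions (q -> k and k -> q)."""
    return info_nce(q, k, temperature) + info_nce(k, q, temperature)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(8, 16))                    # hypothetical anchor embeddings
    other_emb = image_emb + 0.01 * rng.normal(size=(8, 16))  # hypothetical paired modality
    print(symmetric_info_nce(image_emb, other_emb))
```

In this sketch, the loss is small when each embedding is closest to its own pair and grows when the pairing is scrambled, which is the behaviour the contrastive objective is meant to reward.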
Good explanation of many types of SSL losses here
Emergent alignment (zero-shot learning?): the model has not been trained for an ability, yet it performs well (outperforming task-specific baselines) when tested on that ability.
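One way to picture this: if two modalities were each aligned only to images, their embeddings still live in the same space, so one can be matched against the other without ever having been trained together. The sketch below shows that matching step as a nearest-neighbour lookup by cosine similarity. The function `zero_shot_classify`, the class names, and the toy vectors are all hypothetical stand-ins for real encoder outputs.

```python
import numpy as np

def zero_shot_classify(query_emb, class_embs, class_names):
    """Pick the class whose embedding is most similar to the query.

    query_emb:  (dim,) embedding from one modality (e.g. an audio clip).
    class_embs: (num_classes, dim) embeddings from another modality
                (e.g. text prompts), assumed to share the same space.
    """
    # Normalise so that the dot product is a cosine similarity.
    query = query_emb / np.linalg.norm(query_emb)
    classes = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)

    scores = classes @ query                 # one similarity per class
    best = int(np.argmax(scores))
    return class_names[best], scores

if __name__ == "__main__":
    # Toy vectors standing in for encoder outputs (hypothetical values).
    names = ["dog barking", "rain"]
    text_embs = np.array([[1.0, 0.0, 0.0],
                          [0.0, 1.0, 0.0]])
    audio_emb = np.array([0.9, 0.1, 0.0])
    label, scores = zero_shot_classify(audio_emb, text_embs, names)
    print(label, scores)
```

No classifier is trained here; the prediction comes entirely from the geometry of the shared embedding space, which is what makes the ability "emergent".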