Paper Link
Authors
Tadas Baltrušaitis, Chaitanya Ahuja, Louis-Philippe Morency
Summary
- Multimodal machine learning integrates different information sources (called modalities) to build more performant and robust models. The authors survey recent progress in multimodal ML and unify these advances under a common taxonomy.
- The authors review the history of multimodal ML applications, such as audio-visual speech recognition and image captioning. They categorize these applications and use the categories to motivate a discussion of the technical challenges the applications share. The paper identifies five core challenges: representation, translation, alignment, fusion, and co-learning.
- For each of these five challenges, the authors review prior approaches and define a common taxonomy. Representation is either joint or coordinated; translation is either example-based or generative; alignment is either explicit or implicit; fusion is either model-agnostic or model-based; and co-learning is parallel, non-parallel, or hybrid, depending on the resources available for each modality.
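
To make the joint versus coordinated distinction in the representation taxonomy more concrete, here is a minimal sketch in plain Python. It is an illustration, not code from the paper: the function names, the toy feature vectors, and the fixed "encoders" (simple scaling and `tanh`) are all hypothetical stand-ins for learned networks. The idea it tries to show is that a joint representation maps all modalities into one shared vector, while a coordinated representation keeps one vector per modality and relates them through a similarity constraint.

```python
import math


def joint_representation(image_feats, audio_feats):
    """Joint: both modalities go into ONE function that returns ONE vector."""
    # Hypothetical stand-in for a learned network: concatenate, then squash.
    return [math.tanh(v) for v in image_feats + audio_feats]


def coordinated_representations(image_feats, audio_feats):
    """Coordinated: each modality has its own encoder and its own vector."""
    f = [2.0 * v for v in image_feats]  # hypothetical image encoder
    g = [0.5 * v for v in audio_feats]  # hypothetical audio encoder
    return f, g


def cosine_similarity(u, v):
    """Similarity measure that a coordination constraint could encourage."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy inputs (made up for illustration only).
image = [0.2, 0.4, 0.6]
audio = [0.8, 1.6, 2.4]

joint = joint_representation(image, audio)
f_img, g_aud = coordinated_representations(image, audio)
similarity = cosine_similarity(f_img, g_aud)
```

In this toy setup `joint` is a single six-dimensional vector, whereas `f_img` and `g_aud` remain separate three-dimensional vectors whose cosine similarity is what training would push up for matching image and audio pairs.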