Modality: the way in which something is expressed or perceived.
Multimodal, a dictionary definition: with multiple modalities.
Multimodal, a research-oriented definition: the science of heterogeneous and interconnected data.
Information present in different modalities will often show diverse qualities, structures and representations.
Dimensions of Heterogeneity – Examples:
- Structure: static, temporal, spatial, hierarchical, invariances
- Representation space: discrete, continuous, interpretable
- Information: entropy, density, information overlap, range
- Granularity: sampling rate, resolution, precision
- Noise: uncertainty, signal-to-noise ratio, missing data
- Relevance: task relevance, context dependence
- Connections
- Cross-modal interactions
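As a small illustration of the granularity dimension, the sketch below brings a high-rate stream down to the rate of a slower one by average pooling. The function name `average_pool`, the example values, and the 4:1 rate ratio are assumptions made for illustration; they do not come from the tutorial.

```python
def average_pool(sequence, factor):
    """Downsample a 1-D feature sequence by averaging non-overlapping windows.

    `factor` is the (assumed) ratio between the fast and slow sampling rates.
    A shorter trailing window is averaged over the samples it actually has.
    """
    if factor <= 0:
        raise ValueError("factor must be a positive integer")
    pooled = []
    for start in range(0, len(sequence), factor):
        window = sequence[start:start + factor]
        pooled.append(sum(window) / len(window))
    return pooled


# Hypothetical audio features sampled four times as often as video frames.
audio_features = [0.0, 2.0, 4.0, 6.0, 1.0, 1.0, 1.0, 1.0, 10.0]
audio_at_video_rate = average_pool(audio_features, factor=4)
```

Pooling is only one option; interpolation or learned temporal alignment are alternatives when averaging would discard too much information.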
Dimensions of Cross-modal Interactions:
- Additive, multiplicative, non-additive
- Bimodal, trimodal, high-modal
- Equivalence, correspondence, dependency
- Dominance, entailment, divergence
- Modulation, attention, transfer
- Causality, influences, directionality
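The difference between additive and multiplicative interactions in the list above can be sketched with two toy response functions over scalar inputs. Everything here (function names, unit weights, the finite-difference check) is an illustrative assumption rather than the tutorial's own formulation.

```python
def additive(x_a, x_b, w_a=1.0, w_b=1.0):
    """Each modality contributes independently to the response."""
    return w_a * x_a + w_b * x_b


def multiplicative(x_a, x_b, w_ab=1.0):
    """The contribution of one modality is scaled by the other."""
    return w_ab * x_a * x_b


def interaction_effect(f, a0, a1, b0, b1):
    """Second-order finite difference of f over a 2x2 grid of inputs.

    It is zero for any purely additive f, because the effect of changing
    modality A is then the same whatever the value of modality B.
    """
    return f(a1, b1) - f(a1, b0) - f(a0, b1) + f(a0, b0)


additive_effect = interaction_effect(additive, 0.0, 1.0, 0.0, 1.0)
multiplicative_effect = interaction_effect(multiplicative, 0.0, 1.0, 0.0, 1.0)
```

With these inputs the additive response has no interaction effect, while the multiplicative response does, which is the sense in which the second is "non-additive".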
Four eras of multimodal research:
- The “behavioral” era (1970s until late 1980s)
  - Language and gestures
- The “computational” era (late 1980s until 2000)
  - Audio-Visual Speech Recognition (AVSR)
  - Multimodal/multisensory interfaces
  - Multimedia computing
- The “interaction” era (2000 - 2010)
  - Modeling human multimodal interaction: AMI Project, CHIL Project, CALO Project (Siri), SSP Project
- The “deep learning” era (2010s until ...)
  - Main focus of this tutorial: last 5 years
Multimodal Machine Learning (ML) is the study of computer algorithms that learn and improve through the use and experience of data from multiple modalities.
Multimodal Artificial Intelligence (AI) studies computer agents able to demonstrate intelligence capabilities, such as understanding, reasoning and planning, through multimodal experiences and data.
Multimodal AI is a superset of Multimodal ML.
Definition (Representation): Learning representations that reflect cross-modal interactions between individual elements across different modalities.
This is a core building block for most multimodal modeling problems!
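A minimal sketch of two simple ways to build a joint representation from two unimodal feature vectors: plain concatenation, and an outer product with a constant 1 prepended to each vector so that unimodal and pairwise terms both survive (in the spirit of tensor-style fusion). The function names and example vectors are hypothetical, and real systems would typically learn these representations rather than compute them by hand.

```python
def concat_fusion(z_a, z_b):
    """Joint vector that keeps each modality's features side by side."""
    return list(z_a) + list(z_b)


def outer_product_fusion(z_a, z_b):
    """Joint vector holding a bias term, unimodal terms and pairwise products.

    Prepending 1.0 to each input means the flattened outer product contains
    1, every z_b[j], every z_a[i], and every product z_a[i] * z_b[j].
    """
    a = [1.0] + list(z_a)
    b = [1.0] + list(z_b)
    return [a_i * b_j for a_i in a for b_j in b]


# Hypothetical unimodal features, e.g. from a text and an audio encoder.
z_text = [2.0, 3.0]
z_audio = [5.0]
joint_concat = concat_fusion(z_text, z_audio)
joint_outer = outer_product_fusion(z_text, z_audio)
```

Concatenation leaves any interaction between modalities to later layers, whereas the outer product makes pairwise interactions explicit at the cost of a representation whose size grows multiplicatively.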
Definition (Alignment): Identifying and modeling cross-modal connections between all elements of multiple modalities, building from the data structure.
Most modalities have internal structure with multiple elements
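One simple way to picture connections between elements is a similarity-based matching: each element of one modality is linked to its most similar element in the other. The sketch below assumes the elements are already embedded in a shared vector space, which is itself a strong assumption; the names `cosine` and `align` and the example vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(elements_a, elements_b):
    """For each element of modality A, return the index of its best match in B."""
    return [
        max(range(len(elements_b)), key=lambda j: cosine(a, elements_b[j]))
        for a in elements_a
    ]


# Hypothetical embeddings: three words and two image regions in a shared space.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
region_vectors = [[0.0, 2.0], [3.0, 0.0]]
word_to_region = align(word_vectors, region_vectors)
```

This hard, one-directional matching ignores the data structure mentioned in the definition (for example temporal order); soft attention weights or structure-aware methods such as dynamic time warping are common alternatives.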
Definition (Reasoning): Combining knowledge, usually through multiple inferential steps, exploiting multimodal alignment and problem structure.
Definition (Generation): Learning a generative process to produce raw modalities that reflects cross-modal interactions, structure and coherence.
Definition (Transference): Transferring knowledge between modalities, usually to help a target modality that may be noisy or have limited resources.
Definition (Quantification): Empirical and theoretical study to better understand heterogeneity, cross-modal interactions and the multimodal learning process.