Multi-modal applications increasingly require a human-like understanding of how perception differs across modalities. For example, while something might have a clear image in a visual context, it might be perceived as too technical in a textual context. Such differences, which stem from a semantic gap, make transferring between modalities or combining modalities in multi-modal processing a difficult task. Imageability, a concept from psycholinguistics, offers promising insight into the human perception of vision and language. To understand cross-modal differences in semantics, we create and analyze a cross-modal dataset for imageability. We estimate three imageability values grounded in 1) a visual space from a large set of images, 2) a textual space from Web-trained word embeddings, and 3) a phonetic space based on word pronunciations. A subset of the corpus is evaluated against an existing imageability dictionary to ensure basic generalization, but the dataset otherwise targets finding cross-modal differences and outliers. We visualize the dataset and analyze it with regard to outliers and differences for each modality. As an additional source of knowledge, the part of speech and etymological origin of all words are estimated and analyzed in the context of the modalities. The dataset of multi-modal imageability values and an interactive browser will be made publicly available.
Type: 4th IEEE International Conference on Multimedia Information Processing and Retrieval (MIPR2021)
Publication date: September 2021
Links: [ github ] [ supplemental visualizations ]