Dr. Marc A. Kastner

IPA-CLIP: Integrating Phonetic Priors into Vision and Language Pretraining

Authors: Chihaya Matsuhira, Marc A. Kastner, Takahiro Komamizu, Takatsugu Hirayama, Keisuke Doman, Yasutomo Kawanishi, Ichiro Ide

Abstract:

Recently, large-scale Vision and Language (V&L) pretraining has become the standard backbone of many multimedia systems. While it has shown remarkable performance even in unseen situations, it often behaves in ways that are not intuitive to humans. In particular, such models usually do not consider the pronunciation of the input, which humans rely on to understand language, especially when it comes to unknown words. Thus, this paper inserts a phonetic prior into Contrastive Language-Image Pretraining (CLIP), one of the V&L pretrained models, to make it consider pronunciation similarity among its inputs. To achieve this, we first propose a phoneme embedding that utilizes the phoneme relationships provided by the International Phonetic Alphabet (IPA) chart as a phonetic prior. Next, by distilling the frozen CLIP text encoder, we train a pronunciation encoder employing the IPA-based embedding. The proposed model, named IPA-CLIP, comprises this pronunciation encoder and the original CLIP encoders (image and text). Quantitative evaluation reveals that the phoneme distribution in the embedding space represents phonetic relationships more accurately when using the proposed phoneme embedding. Furthermore, in some multimodal retrieval tasks, we confirm that the proposed pronunciation encoder enhances the performance of the text encoder and that the pronunciation encoder handles nonsense words in a more phonetic manner than the text encoder. Finally, qualitative evaluation verifies the correlation between the pronunciation encoder and human perception regarding pronunciation similarity.

Type: arXiv preprint 2303.03144

Publication date: March 2023

DOI: 10.48550/arXiv.2303.03144

Links: [ preprint ]


If you have questions or comments about this research, please leave a comment below or send me an email. I will reply promptly.