Most summarization models are designed to summarize the textual content of a set of documents; summarizing the visual content of a set of images remains challenging. In this presentation, we propose a method that summarizes the visual content of an image set by combining the scene graphs of multiple images and generating a single caption that describes the whole set. We also present a method that uses ConceptNet to find a common context word, which improves the description of the image set. In this method, we build word relations among different words, such as synonyms and category words, and select the representative word in each word relation. We evaluate the proposed method on the MSCOCO dataset against other text generation methods; the results show a promising direction for this research.
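The two ideas in the abstract, combining per-image scene graphs and replacing specific words with a common context word, can be illustrated with a minimal sketch. This is not the authors' implementation: the scene graphs, the `GENERALIZES_TO` table (a hand-written stand-in for ConceptNet lookups), the function names, and the `min_fraction` threshold are all hypothetical, and the caption generation step is omitted.

```python
from collections import Counter

# Hypothetical scene graphs: one list of (subject, predicate, object)
# triples per image in the set.
scene_graphs = [
    [("man", "riding", "horse"), ("horse", "on", "grass")],
    [("woman", "riding", "horse"), ("horse", "on", "field")],
    [("boy", "riding", "pony"), ("pony", "on", "grass")],
]

# Toy stand-in for ConceptNet: maps a word to a more general context word.
# These entries are invented for illustration, not queried from ConceptNet.
GENERALIZES_TO = {
    "man": "person",
    "woman": "person",
    "boy": "person",
    "pony": "horse",
    "field": "grass",
}


def representative(word):
    """Return the context word for `word`, or the word itself if unknown."""
    return GENERALIZES_TO.get(word, word)


def merge_scene_graphs(graphs):
    """Count normalized triples across images (at most once per image)."""
    counts = Counter()
    for graph in graphs:
        normalized = {
            (representative(s), p, representative(o)) for s, p, o in graph
        }
        counts.update(normalized)
    return counts


def shared_triples(graphs, min_fraction=0.5):
    """Keep triples that appear in at least `min_fraction` of the images."""
    counts = merge_scene_graphs(graphs)
    threshold = min_fraction * len(graphs)
    return sorted(t for t, c in counts.items() if c >= threshold)


print(shared_triples(scene_graphs))
# [('horse', 'on', 'grass'), ('person', 'riding', 'horse')]
```

In this toy example, "man", "woman", and "boy" collapse to "person", so the triple shared by all three images becomes visible only after the words are mapped to a common context word. A caption model would then take the shared triples as input.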
Type: Poster at MIRU Symposium (Meeting on Image Recognition and Understanding; 画像の認識・理解シンポジウム)
Publication date: July 2022