Abstract
When looking at an image, humans shift their attention toward interesting regions, making sequences of eye fixations. When describing an image, they also come up with simple sentences that highlight the key elements in the scene. What is the correlation between where people look and what they describe in an image? To investigate this problem intuitively, we develop a visual analytics system, CapVis, to look into visual attention and image captioning, two types of subjective annotations that are relatively task-free and natural. Using these annotations, we propose a word-weighting scheme to extract visual and verbal saliency ranks and compare them against each other. In our approach, a number of low-level and semantic-level features relevant to visual-verbal saliency consistency are proposed and visualized for a better understanding of image content. Our method also shows the different ways that humans and computational models look at and describe images, which provides reliable information for building captioning models. Experiments also show that the visualized features can be integrated into a computational model to effectively predict the consistency between the two modalities on an image dataset with both types of annotations.
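As a rough illustration of how a visual-verbal consistency score of this kind might be computed, the sketch below pairs fixation counts on labeled object regions (visual saliency) with how often each object is mentioned across captions (verbal saliency), and compares the two rankings with a Spearman correlation. This is not the paper's implementation: the object labels, fixation counts, captions, and the simple mention-count weighting are hypothetical stand-ins for the word-weighting scheme described in the abstract.

```python
# Minimal sketch (assumed, not the authors' method): compare a visual saliency
# ranking derived from eye fixations against a verbal saliency ranking derived
# from caption mentions, using Spearman rank correlation as the consistency score.
from collections import Counter
from scipy.stats import spearmanr

# Fixation counts per labeled object region (hypothetical data).
fixations = {"dog": 42, "frisbee": 30, "grass": 8, "tree": 3}

# Crowd-sourced captions for the same image (hypothetical data).
captions = [
    "a dog jumps to catch a frisbee",
    "a dog playing with a frisbee on the grass",
    "a brown dog leaping for a frisbee",
    "a dog catches a frisbee in a park",
    "a dog in mid-air with a frisbee",
]

# Verbal saliency: weight each object by the number of captions that mention it.
mentions = Counter()
for cap in captions:
    words = set(cap.split())
    for obj in fixations:
        if obj in words:
            mentions[obj] += 1

objects = sorted(fixations)
visual_scores = [fixations[o] for o in objects]
verbal_scores = [mentions[o] for o in objects]  # Counter returns 0 for unmentioned objects

# Spearman correlation between the two rankings: values near 1 indicate that
# what people look at and what they describe agree for this image.
rho, _ = spearmanr(visual_scores, verbal_scores)
print(f"visual-verbal consistency (Spearman rho): {rho:.2f}")
```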
Original language | English (US) |
---|---|
Article number | a8 |
Journal | ACM Transactions on Intelligent Systems and Technology |
Volume | 10 |
Issue number | 1 |
DOIs | https://doi.org/10.1145/3200767 |
State | Published - Nov 2018 |
Bibliographical note
Funding Information: This work is an extended version of a previously accepted conference paper: H. Liang et al., "Visual-Verbal Consistency of Image Saliency," IEEE International Conference on Systems, Man, and Cybernetics, 2017. This work is supported by the National Science Foundation of China under grants 61702457 and 61602409, and a University of Minnesota Department of Computer Science and Engineering start-up fund (QZ). This work was done when Haoran Liang was a visiting student in the Zhao Lab. Authors' addresses: H. Liang and R. Liang (corresponding author), Department of Information Engineering, Zhejiang University of Technology, 288 Liuhe Rd, Xihu District, Hangzhou, 310013, PR China; emails: {haoran, rhliang}@zjut.edu.cn; M. Jiang and Q. Zhao, Department of Computer Science and Engineering, University of Minnesota, MN, 55455, USA; emails: {mjiang, qzhao}@cs.umn.edu.
Publisher Copyright:
© 2018 Association for Computing Machinery.
Keywords
- Image captioning
- Visual analytics
- Visual saliency