CapVis: Toward better understanding of visual-verbal saliency consistency

Haoran Liang, Ming Jiang, Ronghua Liang, Qi Zhao

Research output: Contribution to journal, Article, peer-review

1 Scopus citation


When looking at an image, humans shift their attention toward interesting regions, making sequences of eye fixations. When describing an image, they also come up with simple sentences that highlight the key elements in the scene. What is the correlation between where people look and what they describe in an image? To investigate this problem intuitively, we develop a visual analytics system, CapVis, to look into visual attention and image captioning, two types of subjective annotations that are relatively task-free and natural. Using these annotations, we propose a word-weighting scheme to extract visual and verbal saliency ranks to compare against each other. In our approach, a number of low-level and semantic-level features relevant to visual-verbal saliency consistency are proposed and visualized for a better understanding of image content. Our method also shows the different ways that a human and a computational model look at and describe images, which provides reliable information for a captioning model. Experiments also show that the visualized features can be integrated into a computational model to effectively predict the consistency between the two modalities on an image dataset with both types of annotations.
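The abstract does not spell out the word-weighting scheme, so the following is only a minimal sketch of the general idea of comparing a visual saliency rank with a verbal one. It assumes, hypothetically, that visual saliency is the share of fixations landing on each labeled object, that verbal saliency is the fraction of captions mentioning that object, and that consistency is measured with Spearman's rank correlation. The function names and the toy data are illustrative and are not taken from the paper.

```python
from collections import Counter


def average_ranks(scores):
    """Rank labels by descending score; tied labels share the average rank."""
    ordered = sorted(scores, key=lambda label: -scores[label])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to the end of the group of labels tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # 1-based average position of the tie group
        for label in ordered[i:j + 1]:
            ranks[label] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def verbal_saliency(captions, labels):
    """Assumed weighting: fraction of captions that mention each label."""
    counts = Counter()
    for caption in captions:
        words = set(caption.lower().split())
        for label in labels:
            if label in words:
                counts[label] += 1
    return {label: counts[label] / len(captions) for label in labels}


def consistency(fixation_counts, captions):
    """Spearman correlation between visual and verbal saliency ranks."""
    total = sum(fixation_counts.values())
    if total == 0 or not captions:
        raise ValueError("need at least one fixation and one caption")
    labels = sorted(fixation_counts)
    visual = {label: fixation_counts[label] / total for label in labels}
    verbal = verbal_saliency(captions, labels)
    visual_rank = average_ranks(visual)
    verbal_rank = average_ranks(verbal)
    # Spearman's rho is the Pearson correlation of the two rank vectors.
    return pearson([visual_rank[l] for l in labels],
                   [verbal_rank[l] for l in labels])


captions = [
    "a dog chases a ball",
    "a dog runs on grass",
    "dog playing near a tree with a ball",
]
score = consistency({"dog": 12, "ball": 5, "tree": 1}, captions)
print(round(score, 3))
```

In this toy example the most-fixated objects are also the most-mentioned ones, so the score is close to 1; reversing the fixation counts drives it toward -1. A real pipeline would need proper object annotations and word-to-object matching rather than exact token lookup.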

Original language: English (US)
Article number: a8
Journal: ACM Transactions on Intelligent Systems and Technology
Issue number: 1
State: Published - Nov 2018

Bibliographical note

Funding Information:
This work is an extended version of a previously accepted conference paper: H. Liang et al., Visual-Verbal Consistency of Image Saliency, IEEE International Conference on Systems, Man, and Cybernetics, 2017. This work is supported by the National Science Foundation of China under grants 61702457 and 61602409, and by a University of Minnesota Department of Computer Science and Engineering Start-up Fund (QZ). This work was done when Haoran Liang was a visiting student in the Zhao Lab.

Authors’ addresses: H. Liang and R. Liang (corresponding author), Department of Information Engineering, Zhejiang University of Technology, 288 Liuhe Rd, Xihu District, Hangzhou, 310013, PR China; emails: {haoran, rhliang}; M. Jiang and Q. Zhao, Department of Computer Science and Engineering, University of Minnesota, MN, 55455, USA; emails: {mjiang, qzhao}.

Publisher Copyright:
© 2018 Association for Computing Machinery.


Keywords

  • Image captioning
  • Visual analytics
  • Visual saliency

