When looking at an image, humans shift their attention toward interesting regions, producing sequences of eye fixations. When describing an image, they likewise produce simple sentences that highlight the key elements of the scene. How strongly is where people look correlated with what they describe? To investigate this question, we examine eye fixations and image captions, two types of subjective annotation that are relatively task-free and natural. From these annotations we extract visual and verbal saliency ranks and compare them against each other. We then propose a number of low-level and semantic-level features relevant to visual-verbal consistency. Integrated into a computational model, the proposed features effectively predict the consistency between the two modalities on SALICON, a large dataset containing both types of annotations.
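The abstract's core comparison, ranking objects by visual saliency (from fixations) and by verbal saliency (from captions), then measuring agreement, can be illustrated with a rank correlation. The sketch below is not the paper's model; the object scores are invented, and a plain Spearman correlation stands in for the paper's features and learned predictor.

```python
def rankdata(values):
    """Assign ranks (1 = largest value), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation between two equal-length score lists."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented per-object scores for one hypothetical image
# (objects: person, dog, ball, tree).
visual = [0.45, 0.30, 0.15, 0.10]  # e.g. fraction of fixations on each object
verbal = [0.50, 0.20, 0.25, 0.05]  # e.g. mention frequency across captions

print(round(spearman(visual, verbal), 3))  # prints 0.8
```

A high coefficient would indicate a visually-verbally consistent image; the paper's contribution is predicting this consistency from image features rather than measuring it after the fact.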
Original language: English (US)
Title of host publication: 2017 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 6
State: Published - Nov 27 2017
Event: 2017 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2017 - Banff, Canada (Oct 5 2017 - Oct 8 2017)
Bibliographical note: Publisher Copyright © 2017 IEEE.
Keywords:
- Image caption
- Visual saliency
- Visual-verbal consistency