VisualHow: Multimodal Problem Solving

Jinhui Yang, Xianyu Chen, Ming Jiang, Shi Chen, Louis Wang, Qi Zhao

Research output: Chapter in Book/Report/Conference proceedingConference contribution

2 Scopus citations

Abstract

Recent progress in the interdisciplinary studies of computer vision (CV) and natural language processing (NLP) has enabled the development of intelligent systems that can describe what they see and answer questions accordingly. However, despite showing usefulness in performing these vision-language tasks, existing methods still struggle in understanding real-life problems (i.e., how to do something) and suggesting step-by-step guidance to solve them. With an overarching goal of developing intelligent systems to assist humans in various daily activities, we propose VisualHow, a free-form and open-ended research that focuses on understanding a real-life problem and deriving its solution by incorporating key components across multiple modalities. We develop a new dataset with 20,028 real-life problems and 102,933 steps that constitute their solutions, where each step consists of both a visual illustration and a textual description that guide the problem solving. To establish better understanding of problems and solutions, we also provide annotations of multimodal attention that localizes important components across modalities and solution graphs that encapsulate different steps in structured representations. These data and annotations enable a family of new vision-language tasks that solve real-life problems. Through extensive experiments with representative models, we demonstrate their effectiveness on training and testing models for the new tasks, and there is significant scope for improvement by learning effective attention mechanisms. Our dataset and models are available at https://github.com/formidify/VisualHow.

Original languageEnglish (US)
Title of host publicationProceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
PublisherIEEE Computer Society
Pages15606-15616
Number of pages11
ISBN (Electronic)9781665469463
DOIs
StatePublished - 2022
Event2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022 - New Orleans, United States
Duration: Jun 19 2022Jun 24 2022

Publication series

NameProceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume2022-June
ISSN (Print)1063-6919

Conference

Conference2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
Country/TerritoryUnited States
CityNew Orleans
Period6/19/226/24/22

Bibliographical note

Funding Information:
This work is supported by NSF Grant 1908711.

Publisher Copyright:
© 2022 IEEE.

Keywords

  • Datasets and evaluation
  • Vision + language

Fingerprint

Dive into the research topics of 'VisualHow: Multimodal Problem Solving'. Together they form a unique fingerprint.

Cite this