New Datasets and Models for Contextual Reasoning in Visual Dialog

Yifeng Zhang, Ming Jiang, Qi Zhao

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)


Visual Dialog (VD) is a vision-language task that requires AI systems to maintain a natural question-answering dialog about visual content. Using the dialog history as context, VD models have achieved promising performance on public benchmarks. However, prior VD datasets do not provide enough contextually dependent questions, that is, questions that require knowledge from the dialog history to answer. As a result, advanced VQA models can still perform well without considering the dialog context. In this work, we focus on developing new datasets and models to highlight the role of contextual reasoning in VD. We define a hierarchy of contextual patterns to represent and organize the dialog context, enabling quantitative analyses of contextual dependencies and the design of new VD datasets and models. We then develop two new datasets, namely CLEVR-VD and GQA-VD, offering context-rich dialogs over synthetic and realistic images, respectively. Furthermore, we propose a novel neural module network method featuring contextual reasoning in VD. We demonstrate the effectiveness of our proposed datasets and method with experimental results and model comparisons across different datasets. Our code and data are available at

Original language: English (US)
Title of host publication: Computer Vision – ECCV 2022 - 17th European Conference, 2022, Proceedings
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Publisher: Springer Science and Business Media Deutschland GmbH
Number of pages: 18
ISBN (Print): 9783031200588
State: Published - 2022
Event: 17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel
Duration: Oct 23 2022 to Oct 27 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13696 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 17th European Conference on Computer Vision, ECCV 2022
City: Tel Aviv

Bibliographical note

Funding Information:
This work is supported by NSF Grants 1908711 and 1849107.

Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.

