Zero-shot Natural Language Video Localization

Jinwoo Nam, Daechul Ahn, Dongyeop Kang, Seong Jong Ha, Jonghyun Choi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

23 Scopus citations

Abstract

Understanding videos to localize moments with natural language often requires large, expensive sets of annotated video regions paired with language queries. To eliminate the annotation costs, we make a first attempt to train a natural language video localization (NLVL) model in a zero-shot manner. Inspired by the unsupervised image captioning setup, we require only random text corpora, unlabeled video collections, and an off-the-shelf object detector to train a model. With the unpaired data, we propose to generate pseudo-supervision consisting of candidate temporal regions and corresponding query sentences, and develop a simple NLVL model to train with the pseudo-supervision. Our empirical validations show that the proposed pseudo-supervised method outperforms several baseline approaches and a number of methods using stronger supervision on Charades-STA and ActivityNet-Captions.

Original language: English (US)
Title of host publication: Proceedings - 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1450-1459
Number of pages: 10
ISBN (Electronic): 9781665428125
State: Published - 2021
Event: 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021 - Virtual, Online, Canada
Duration: Oct 11 2021 to Oct 17 2021

Publication series

Name: Proceedings of the IEEE International Conference on Computer Vision
ISSN (Print): 1550-5499

Conference

Conference: 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021
Country/Territory: Canada
City: Virtual, Online
Period: 10/11/21 to 10/17/21

Bibliographical note

Funding Information:
Acknowledgement. This work was partly supported by NCSOFT, the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.2019R1C1C1009283) and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-01842, Artificial Intelligence Graduate School Program (GIST)) and (No.2019-0-01351, Development of Ultra Low-Power Mobile Deep Learning Semiconductor With Compression/Decompression of Activation/Kernel Data, 20%), (No. 2021-0-02068, Artificial Intelligence Innovation Hub) and was conducted by Center for Applied Research in Artificial Intelligence (CARAI) grant funded by DAPA and ADD (UD190031RD).

Publisher Copyright:
© 2021 IEEE
