Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier work focused on generating a single-sentence description of visual content, recent work has studied paragraph generation. In this paper, we introduce the problem of video storytelling, which aims to generate coherent and succinct stories for long videos. Video storytelling introduces new challenges, mainly due to the diversity of the stories and the length and complexity of the videos. We propose novel methods to address these challenges. First, we propose a context-aware framework for multimodal embedding learning, in which we design a residual bidirectional recurrent neural network to leverage contextual information from both past and future. The multimodal embedding is then used to retrieve sentences for video clips. Second, we propose a Narrator model to select clips that are representative of the underlying storyline. The Narrator is formulated as a reinforcement learning agent, which is trained by directly optimizing the textual metric of the generated story. We evaluate our method on the Video Story dataset, a new dataset that we have collected to enable this study. We compare our method with multiple state-of-the-art baselines and show that it achieves better performance in both quantitative measures and a user study.
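As a rough illustration of the sentence-retrieval step, the sketch below picks, for a single clip, the candidate sentence whose embedding lies closest to the clip embedding in a shared space. It assumes the embeddings have already been produced by a learned multimodal model. The use of cosine similarity, the function names, and the toy vectors are illustrative assumptions and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_sentence(clip_embedding, sentence_embeddings):
    """Return (index, score) of the candidate sentence most similar to the clip.

    Hypothetical helper: the similarity measure is an assumption, not the
    paper's stated choice.
    """
    scores = [cosine(clip_embedding, s) for s in sentence_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy embeddings standing in for the output of a learned multimodal model.
clip = [0.9, 0.1, 0.0]
sentences = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
index, score = retrieve_sentence(clip, sentences)
print(index, round(score, 3))
```

In a full pipeline, a retrieval step of this kind would be applied only to the clips chosen by the clip-selection model, and the retrieved sentences would be concatenated in temporal order to form the story.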
Bibliographical note
Funding Information:
Manuscript received July 11, 2018; revised December 23, 2018 and May 27, 2019; accepted July 9, 2019. Date of publication July 22, 2019; date of current version January 24, 2020. This work was supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Strategic Capability Research Centres Funding Initiative. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Marco Bertini. (Corresponding author: Junnan Li.) J. Li is with the Graduate School for Integrative Sciences and Engineering, National University of Singapore, Singapore 119077 (e-mail: email@example.com).
© 1999-2012 IEEE.
- Video storytelling
- multimodal embedding learning
- sentence retrieval
- video captioning