Abstract
Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs). Previous approaches for VLMMs involve Supervised Fine-Tuning (SFT) with instruction-tuning datasets, integrating an LLM with visual encoders, and adding learnable parameters. Here, aligning video with text, and vice versa, remains a challenge, primarily because multimodal instruction-tuning data falls short of text-only data in both quality and quantity. This discrepancy often results in alignments that poorly ground the video content. To address this, we present a novel alignment strategy that employs a multimodal AI system equipped with Reinforcement Learning from AI Feedback (RLAIF), providing self-preference feedback to refine itself and facilitating the alignment of video and text modalities. Our approach uniquely integrates detailed video descriptions as context into a multimodal AI system during preference feedback generation to enrich the understanding of video content, a process we call context-aware reward modeling. Empirical evaluations on various video benchmarks demonstrate that our VLM-RLAIF outperforms existing approaches, including the SFT model. We commit to open-sourcing our code, models, and datasets to foster further research in this area. https://github.com/yonseivnl/vlm-rlaif.
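As a rough illustration of the idea of context-aware preference feedback described in the abstract, the following Python sketch shows how a video description might be placed into the prompt given to a judge model, and how a resulting preference pair could be scored. This is a sketch under stated assumptions, not the authors' implementation: the function names, the prompt wording, and the choice of a Bradley-Terry style pairwise loss (a common choice for reward modeling) are all hypothetical and are not taken from the abstract.

```python
import math


def build_preference_prompt(question, answer_a, answer_b, video_description=None):
    """Assemble a judge prompt for comparing two candidate answers.

    Hypothetical helper: the prompt wording is an assumption. When a video
    description is supplied, it is prepended as extra context, which is the
    "context-aware" part of this sketch.
    """
    parts = []
    if video_description:
        parts.append("Video description (context):\n" + video_description.strip())
    parts.append("Question:\n" + question.strip())
    parts.append("Answer A:\n" + answer_a.strip())
    parts.append("Answer B:\n" + answer_b.strip())
    parts.append("Which answer is better grounded in the video? Reply with 'A' or 'B'.")
    return "\n\n".join(parts)


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Assumed loss, not confirmed by the source. Written in a numerically
    stable form so that large negative margins do not overflow.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    prompt = build_preference_prompt(
        question="What is the person doing?",
        answer_a="They are slicing vegetables on a cutting board.",
        answer_b="They are riding a bicycle.",
        video_description="A person stands in a kitchen and chops carrots.",
    )
    print(prompt)
    # The loss is small when the chosen answer receives the higher reward.
    print(pairwise_preference_loss(2.0, 0.0))
```

In this sketch the description is plain text prepended to the prompt, so the same judge can be run with or without context to compare the preferences it produces.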
Original language | English (US) |
---|---|
Title of host publication | Long Papers |
Editors | Lun-Wei Ku, Andre F. T. Martins, Vivek Srikumar |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 923-940 |
Number of pages | 18 |
ISBN (Electronic) | 9798891760943 |
State | Published - 2024 |
Event | 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Bangkok, Thailand; Duration: Aug 11 2024 → Aug 16 2024 |
Publication series
Name | Proceedings of the Annual Meeting of the Association for Computational Linguistics |
---|---|
Volume | 1 |
ISSN (Print) | 0736-587X |
Conference
Conference | 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 |
---|---|
Country/Territory | Thailand |
City | Bangkok |
Period | 8/11/24 → 8/16/24 |
Bibliographical note
Publisher Copyright: © 2024 Association for Computational Linguistics.