Abstract
Story visualization (SV) is a challenging text-to-image generation task because it requires not only rendering visual details from text descriptions but also encoding long-term context across multiple sentences. While prior efforts mostly focus on generating a semantically relevant image for each sentence, encoding the context spread across the given paragraph to generate contextually convincing images (e.g., with the correct character or a proper scene background) remains a challenge. To this end, we propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation that generates multiple pseudo-descriptions as supplementary supervision during training for better generalization to language variation at inference. In extensive experiments on two popular SV benchmarks, Pororo-SV and Flintstones-SV, the proposed method significantly outperforms the state of the art on various metrics, including FID, character F1, frame accuracy, BLEU-2/3, and R-precision, with similar or lower computational complexity.
| Original language | English (US) |
|---|---|
| Title of host publication | Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 3102-3112 |
| Number of pages | 11 |
| ISBN (Electronic) | 9798350307184 |
| State | Published - 2023 |
| Event | 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France. Duration: Oct 2 2023 → Oct 6 2023 |
Publication series
| Name | Proceedings of the IEEE International Conference on Computer Vision |
|---|---|
| ISSN (Print) | 1550-5499 |
Conference
| Conference | 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 |
|---|---|
| Country/Territory | France |
| City | Paris |
| Period | 10/2/23 → 10/6/23 |
Bibliographical note
Publisher Copyright: © 2023 IEEE.