Text-and-Image-To-Image (TI2I), an extension of Text-To-Image (T2I), integrates image inputs with textual instructions to enhance image generation. Existing methods often exploit image inputs only partially, focusing on specific elements such as objects or styles, or they suffer a decline in generation quality under complex, multi-image instructions. To overcome these challenges, we introduce Training-Free Text-and-Image-to-Image (TF-TI2I), which adapts cutting-edge T2I models such as SD3 without any additional training. Our method builds on the MM-DiT architecture, where we observe that textual tokens can implicitly absorb visual information from vision tokens. We exploit this interaction by extracting a condensed visual representation from reference images and enabling selective information sharing through Reference Contextual Masking, which restricts contextual tokens to instruction-relevant visual information. In addition, our Winner-Takes-All module mitigates distribution shifts by prioritizing the most pertinent reference for each vision token. To address the gap in TI2I evaluation, we also introduce FG-TI2I Bench, a comprehensive benchmark tailored to TI2I and compatible with existing T2I methods. Our approach shows robust performance across various benchmarks, confirming its effectiveness on complex image-generation tasks.
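To make the two mechanisms named above concrete, the following is a minimal, hypothetical sketch (not the authors' released code): a toy attention step in which generated-image tokens query contextual tokens extracted from several reference images, a boolean mask stands in for Reference Contextual Masking, and a per-token argmax over references stands in for the Winner-Takes-All selection. All names, shapes, and the relevance heuristic are illustrative assumptions.

```python
import torch


def tf_ti2i_attention_sketch(q_gen, k_refs, v_refs, ref_context_mask):
    """Toy reference-aware attention step.

    q_gen:            (N_gen, d)    queries from the generated image's vision tokens
    k_refs, v_refs:   (R, N_ref, d) keys/values of contextual tokens from R references
    ref_context_mask: (R, N_ref)    bool, True where a reference token is
                                    instruction-relevant (assumed to come from
                                    text-token relevance); each reference is
                                    assumed to keep at least one visible token
    Returns:          (N_gen, d)    aggregated visual context per generated token
    """
    R, N_ref, d = k_refs.shape
    scale = d ** -0.5

    # Attention logits from every generated token to every reference token.
    logits = torch.einsum("nd,rmd->rnm", q_gen, k_refs) * scale   # (R, N_gen, N_ref)

    # Reference Contextual Masking: hide reference tokens the instruction
    # does not select, so only instruction-relevant visual info is shared.
    logits = logits.masked_fill(~ref_context_mask[:, None, :], float("-inf"))

    # Per-reference relevance of each generated token: max logit over that
    # reference's visible tokens (a simple stand-in for the paper's criterion).
    relevance = logits.amax(dim=-1)                                # (R, N_gen)

    # Winner-Takes-All: each generated token reads from its single most
    # pertinent reference, avoiding a blend over all references.
    winner = relevance.argmax(dim=0)                               # (N_gen,)

    attn = logits.softmax(dim=-1)                                  # (R, N_gen, N_ref)
    out_per_ref = torch.einsum("rnm,rmd->rnd", attn, v_refs)       # (R, N_gen, d)
    idx = winner[None, :, None].expand(1, -1, d)
    return out_per_ref.gather(0, idx).squeeze(0)                   # (N_gen, d)


# Toy usage: 2 references, 16 contextual tokens each, 64 generated tokens.
q = torch.randn(64, 32)
k = torch.randn(2, 16, 32)
v = torch.randn(2, 16, 32)
mask = torch.rand(2, 16) > 0.3
print(tf_ti2i_attention_sketch(q, k, v, mask).shape)   # torch.Size([64, 32])
```

In the actual method these operations would sit inside the joint attention of a frozen MM-DiT block; the sketch only illustrates how masking and a per-token winner selection can be composed without retraining.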