FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision–Language Generation

Authors: Eric Tillmann Bill, Enis Simsar, Alessio Tonioni, Thomas Hofmann
Published in ArXiv, 2026

Abstract

Modern text-to-image diffusion models encode rich visual priors, but expose them only through one-way text-conditioned generation. Existing unified vision–language models derived from them recover bidirectional capability through large-scale joint pretraining or substantial retraining of the text pathway, discarding the strong image prior the text-to-image backbone already encodes. We introduce \emph{FullFlow}, a parameter-efficient recipe that upgrades a pretrained rectified-flow text-to-image model into a bidirectional vision–language generator by training only LoRA adapters and lightweight text heads. FullFlow keeps images in their native continuous flow and adds a discrete insertion process for text. Separate image and text timesteps turn inference into trajectory selection in a two-dimensional generative space, enabling textimage, imagetext, joint sampling, and partial-text prediction with a single backbone. On Stable Diffusion 3 (SD3) under an identical trainable-parameter count and matched LoRA rank, FullFlow improves textimage FID from to and imagetext CIDEr from to over a LoRA equivalent following the previous SOTA formulation (Dual Diffusion) at matched wall-clock training time, while reducing peak VRAM from \,GB to \,GB and raising throughput by on two RTX A5000 GPUs in under 24 hours, training only of the backbone parameters. The same recipe transfers to FLUX.1-dev and supports downstream VQA through partial-text generation. These results show that strong bidirectional vision–language capability can be unlocked from pretrained text-to-image flow models without full …

Paper