X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
In-context generation is a key component of large language models' (LLMs) open-task generalization capability. By leveraging a few examples as context, LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in auto-regressive vision-language models (VLMs) built upon LLMs have showcased impressive performance in text-to-image generation. However, the potential of in-context learning for general image generation tasks remains largely unexplored. To address this, we introduce X-Prompt, a purely auto-regressive large vision-language model designed to deliver competitive performance across a wide range of both seen and unseen image generation tasks, all within a unified in-context learning framework. X-Prompt incorporates a specialized design that efficiently compresses valuable features from in-context examples, supporting longer in-context token sequences and improving its ability to generalize to unseen tasks. A unified training task for both text and image prediction enables X-Prompt to handle general image generation with enhanced task awareness from in-context examples. Extensive experiments validate the model's performance across diverse seen image generation tasks and its capacity to generalize to previously unseen tasks.
Discussion
Host: Hey everyone, and welcome back to the podcast! Today, we're diving into the fascinating world of AI image generation, specifically, a really cool new model called X-Prompt. I've got Zeyi Sun, one of the lead researchers behind this project, with me to break it all down. Zeyi, welcome to the show!
Guest: Thanks for having me, Leo! Excited to be here.
Host: So, X-Prompt. The name itself sounds intriguing. What's the core idea behind it?
Guest: The big idea is to make image generation more versatile and less reliant on specific, pre-defined tasks. Think about how large language models like GPT-3 can handle many different prompts; X-Prompt aims to do something similar for images. We want a model that can generate images, edit them, perform dense prediction tasks like semantic segmentation, and handle various low-level vision tasks, all within a single framework, guided by just a few examples.
Host: That's ambitious! Most image generation models I've heard about specialize in one thing – text-to-image, maybe image-to-image translation, but rarely do they combine so many functionalities. What's the secret sauce?
Guest: It's a combination of things. First, we build on a purely autoregressive vision-language model, specifically Chameleon. The model predicts each token of the image sequentially, which makes it more naturally compatible with in-context learning than the diffusion models that dominate image generation today. However, Chameleon alone struggled with the longer context lengths needed to represent multiple images. That's where X-Prompt's innovations come in: we developed a novel compression mechanism that efficiently stores the information from in-context examples, reducing the need for massive context windows during training and enabling a more unified, generalizable approach.
Host: So, you're compressing the information from example images to make the model more efficient and adaptable? That's smart! How does that compression work, exactly?
Guest: We use three types of tokens: In-Context Example Tokens (IE), X-Prompt Tokens (XP), and TODO Tokens (TD). The IE tokens represent the input example images, and the XP tokens are learnable. The key is that we use attention masking so the TODO tokens (which are the model's output) cannot attend to the IE tokens directly. This forces the model to funnel everything through the learnable XP tokens, which become a compressed representation of the entire context. Think of the XP tokens as distilling the essence of the examples, the 'know-how' the model needs. This distilled knowledge is then used to predict the TODO tokens, effectively generating the target image.
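For readers who want to see the masking idea concretely, here is a minimal sketch, in PyTorch, of how such an attention mask could be built. The [IE | XP | TD] sequence layout, the tensor names, and the token counts are illustrative assumptions, not the authors' implementation.

```python
import torch

def build_xprompt_mask(n_ie: int, n_xp: int, n_td: int) -> torch.Tensor:
    """Illustrative attention mask: True = attention allowed.

    Assumed sequence layout: [IE tokens | XP tokens | TD tokens].
    Standard causal masking applies everywhere, with one extra rule:
    TD tokens may not attend to IE tokens, so everything they need
    from the in-context example must flow through the XP tokens.
    """
    n = n_ie + n_xp + n_td
    mask = torch.tril(torch.ones(n, n, dtype=torch.bool))  # causal base

    td_start = n_ie + n_xp
    mask[td_start:, :n_ie] = False  # block TD -> IE attention

    return mask

# Example: 16 in-context tokens compressed into 4 X-Prompt tokens,
# then 8 TODO tokens are predicted from the compressed representation.
m = build_xprompt_mask(n_ie=16, n_xp=4, n_td=8)
print(m.shape)           # torch.Size([28, 28])
print(m[20, :16].any())  # False: a TD token sees no IE token directly
```

Because the TD rows have their IE columns zeroed out, the only path from the example images to the generated output runs through the XP tokens, which is what makes them act as a learned compression of the context.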
Host: That makes sense. It's like creating a concise summary of the example images that the model can then use to generate new ones, kind of like visual prompt engineering. But it's integrated into the model itself, which is quite different from approaches that rely on external prompt generation techniques.
Guest: Exactly! And to further improve the model's ability to understand the task implied by the examples, we also introduced task augmentation. We don't just train it on the given examples; we also create reversed versions of the tasks. For example, if we're training on deraining images, we also train on 'adding rain' to the same image, which gives the model a better sense of how the transformation between the images works.
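A rough sketch of what reversed-task augmentation could look like in a data pipeline is shown below; the instruction strings and the pairing table are purely illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of reversed-task augmentation (not the paper's code).
# The instruction strings and their pairing are assumptions for illustration.

REVERSED_INSTRUCTION = {
    "remove the rain from the image": "add rain to the image",
    "convert the image to a depth map": "generate an image from the depth map",
}

def augment_with_reversed_task(sample):
    """Given one (instruction, source, target) training triple, also yield
    the reversed triple so the model sees the transformation in both
    directions (e.g. deraining and 'adding rain')."""
    instruction, source_img, target_img = sample
    yield instruction, source_img, target_img

    reversed_instruction = REVERSED_INSTRUCTION.get(instruction)
    if reversed_instruction is not None:
        # Swap input and output: the old target becomes the new source.
        yield reversed_instruction, target_img, source_img
```

The point is simply that every forward example (rainy to clean) also yields a backward example (clean to rainy) with inputs and outputs swapped, so the model sees the transformation from both sides.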
Host: That's clever! That addresses a potential limitation I can see: how the model interprets the implicit task represented by the example images. It's not explicitly told what to do; it has to figure that out from the examples, right? The reversed task helps the model learn the relationships involved.
Guest: Absolutely. We also supplement this with a text prediction task. We use a separate vision-language model, Qwen2-VL, to generate descriptive captions detailing the differences between the images in an example pair. Then we train X-Prompt to predict these captions too, reinforcing its understanding of the image transformations involved.
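To illustrate how an image prediction target and a text prediction target can share one autoregressive training sequence, here is a hedged sketch; the sequence layout, the loss-mask convention, and the assumption that the difference caption was generated offline by a captioning VLM are illustrative, not the authors' recipe.

```python
# Illustrative sketch of a joint image + text prediction target
# (not the authors' code). Token layout and masking are assumptions.

from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    input_ids: List[int]   # full sequence fed to the autoregressive model
    loss_mask: List[int]   # 1 where the model is trained to predict the token

def build_sample(source_image_tokens: List[int],
                 diff_caption_tokens: List[int],
                 target_image_tokens: List[int]) -> TrainingSample:
    """Concatenate [source image | difference caption | target image] so the
    model learns to *describe* the transformation before *applying* it.
    Loss is applied only to the caption and the target image tokens."""
    input_ids = source_image_tokens + diff_caption_tokens + target_image_tokens
    loss_mask = ([0] * len(source_image_tokens)
                 + [1] * len(diff_caption_tokens)
                 + [1] * len(target_image_tokens))
    return TrainingSample(input_ids, loss_mask)
```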
Host: So, you're essentially teaching the model to 'explain' what's happening in the images. This multi-faceted training seems to be key to X-Prompt's success. Now, let's talk about the experiments. You tested it on a wide range of tasks. What were some of the highlights?
Guest: We saw some really impressive results across diverse tasks. In text-to-image generation, we achieved competitive performance with other autoregressive models on the GenEval benchmark. On dense prediction tasks like semantic segmentation and depth estimation, it performed surprisingly well, even against specialized models, demonstrating strong generalization ability. On image editing, the model showed a fascinating ability to understand and apply editing instructions, especially when we incorporated our Retrieval-Augmented Image Editing (RAIE) technique, which pulls similar editing examples from a database to use as context.
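The retrieval step in RAIE can be pictured as a simple nearest-neighbour lookup; the sketch below assumes precomputed embeddings (for example, of the editing instructions) and is only meant to convey the idea, not the authors' implementation.

```python
# Illustrative nearest-neighbour retrieval of an editing example
# (not the authors' RAIE code). The embedding function and database
# layout are assumptions for illustration.

import numpy as np

def retrieve_in_context_example(query_embedding: np.ndarray,
                                example_embeddings: np.ndarray,
                                examples: list):
    """Return the stored (source, target, instruction) example whose
    embedding is most similar to the query; it is then placed in front
    of the new editing request as the in-context example."""
    q = query_embedding / np.linalg.norm(query_embedding)
    db = example_embeddings / np.linalg.norm(example_embeddings,
                                             axis=1, keepdims=True)
    scores = db @ q  # cosine similarity against every stored example
    return examples[int(np.argmax(scores))]
```

Whatever example comes back is simply prepended to the prompt as the in-context pair, so the model sees a closely related edit before performing the new one.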
Host: That's quite a feat! And what about the in-context learning aspect? How well did X-Prompt generalize to unseen tasks?
Guest: That was the most exciting part. Given just one example of a new, unseen task, X-Prompt showed significant improvement compared to not having any example at all. We tested it on things like low-light enhancement, deraining, and object addition/removal. The results clearly demonstrated that the in-context example is doing real work: the model picks up tasks it was never trained on from a single demonstration.