Goku: Flow Based Video Generative Foundation Models
This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.
Discussion
Host: Hey everyone, and welcome back to the podcast! Today, we're diving deep into the fascinating world of generative AI, specifically focusing on a new model that's making waves in image and video creation. I'm really stoked about this one. I mean, we've seen AI generate images, and we've seen AI generate videos, but this aims for something more ambitious, something…joint.
Host: We're going to be unpacking a paper titled 'Goku: Flow Based Video Generative Foundation Models'. Sounds like a Dragon Ball Z reference, right? Anyway, don't let the name fool you, it is actually a significant leap forward. We will be exploring the core ideas behind Goku, covering everything from how the data is meticulously prepared to the intricate design of the model itself and the optimized infrastructure that enables it to run at scale. We’ll also delve into how it stacks up against other state-of-the-art models. It is supposed to be industry-leading, so that's pretty significant.
Host: So, let's start with the big picture. This paper introduces Goku, a family of what they call 'rectified flow Transformer models.' Rectified flow transformers... it already sounds complicated. But the goal is simple: to generate both images and videos at a quality level that's ready for real-world applications, hence the 'industry-leading performance'. And they're not just making claims; they're backing it up with some impressive numbers on various benchmarks.
Host: The abstract actually highlights four crucial elements that make Goku tick: first, the 'data curation pipeline.' This is all about collecting and cleaning the massive amounts of data needed to train such a powerful model. Think of it as the foundation upon which everything else is built. Second, there's the 'model architecture design.' This is the blueprint for how the AI itself is structured, including the choice of Transformer networks and how they're connected. It's the engine that drives the generation process. Third is the 'flow formulation.' This is the mathematical approach used to guide the generation, using something called 'rectified flow' to smoothly transform random noise into meaningful images and videos. And finally, there's 'training infrastructure optimization.' This involves all the tricks and techniques used to make the training process as efficient and stable as possible, allowing them to handle the massive scale of the data and model. Basically, it's the pipeline that keeps everything running smoothly at scale. Without it, they couldn't train this thing properly.
Host: Let's start breaking it down, beginning with the Introduction. The paper itself emphasizes the rising importance of video generation. And they're right, video is everywhere: media, advertising, games, even simulators. Think about how much we consume in these forms. This is really only the start, and the applications will only continue to grow. A lot of that progress has been driven by advances in generative algorithms, scalable model architectures, an ongoing expansion of computing capabilities, and vast amounts of internet-sourced data.
Host: The authors present Goku as their approach, a family of rectified flow Transformer models for joint image and video generation, establishing a pathway toward industry-grade performance. They focus on four key components: data curation, model architecture design, flow formulation, and training infrastructure optimization, each rigorously refined to meet the demands of high-quality, large-scale video generation. Let's dive into those key components one by one. What do they say about data curation?
Host: Okay, so the first thing they mention is a 'comprehensive data processing pipeline' designed to create large-scale, high-quality image and video-text datasets. This pipeline uses techniques like aesthetic scores to filter videos and images, OCR-driven content analysis, and even subjective evaluations. They're trying to make sure that the data is not only massive but also visually appealing and contextually relevant.
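To make the filtering step concrete, here is a minimal sketch of metadata-based curation in the spirit of what the host describes. The field names, helper structure, and thresholds are illustrative assumptions, not Goku's actual pipeline code.

```python
from dataclasses import dataclass

# A minimal sketch of metadata-based filtering for video curation.
# Field names and thresholds are illustrative assumptions, not values from the paper.

@dataclass
class ClipMeta:
    aesthetic_score: float   # score from an aesthetic predictor, higher is better
    ocr_text_ratio: float    # fraction of frames dominated by on-screen text
    duration_s: float        # clip duration in seconds

def keep_clip(meta: ClipMeta,
              min_aesthetic: float = 4.5,
              max_text_ratio: float = 0.3,
              min_duration: float = 2.0) -> bool:
    """Return True if a clip passes the (hypothetical) quality filters."""
    return (meta.aesthetic_score >= min_aesthetic
            and meta.ocr_text_ratio <= max_text_ratio
            and meta.duration_s >= min_duration)

# Example: filter a list of clip metadata records.
clips = [ClipMeta(5.2, 0.05, 8.0), ClipMeta(3.1, 0.60, 1.0)]
curated = [c for c in clips if keep_clip(c)]   # keeps only the first clip
```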
Host: The paper also mentions using multimodal large language models (MLLMs) to generate dense and contextually aligned captions. Then, they refine those captions using another large language model (LLM) to improve accuracy, fluency, and descriptive richness. So, they take captions from one model and enhance them with another. It's a clever way to get the most detailed and accurate descriptions possible. As a result, they've put together a training dataset of around 36 million video-text pairs and 160 million image-text pairs. That’s a HUGE amount of data. Apparently, it's enough to train industry-level generative models, according to them.
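The two-stage captioning flow (MLLM draft, then LLM rewrite) can be sketched as follows. The prompt wording and the callable interface are assumptions made for illustration; the paper does not publish its prompts.

```python
from typing import Callable

# Sketch of two-stage captioning: an MLLM drafts a dense caption from the video,
# then an LLM rewrites it for accuracy and fluency. Both models are passed in
# as plain callables; prompts here are illustrative only.

def caption_video(frames,
                  mllm: Callable[[object, str], str],
                  llm: Callable[[str], str]) -> str:
    draft = mllm(frames, "Describe this video in detail.")   # stage 1: dense caption
    rewrite_prompt = (
        "Rewrite the caption below so it is accurate, fluent, and descriptive:\n" + draft
    )
    return llm(rewrite_prompt)                               # stage 2: LLM refinement

# Usage with trivial stand-in models, just to show the call pattern.
fake_mllm = lambda frames, prompt: "a dog runs on a beach"
fake_llm = lambda prompt: prompt.splitlines()[-1].capitalize() + "."
print(caption_video(frames=None, mllm=fake_mllm, llm=fake_llm))
```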
Host: Moving on to model architecture design. They've taken a 'pioneering step' by applying rectified flow formulation for joint image and video generation. They use the Goku model family, which includes Transformer architectures with 2 billion and 8 billion parameters. At the heart of Goku is a 3D joint image-video variational autoencoder (VAE). This VAE compresses image and video inputs into a shared latent space, allowing for a unified representation. This shared latent space is then combined with a full-attention mechanism, which allows the model to train on images and videos together. The result is high-quality, coherent outputs across both images and videos, all within a single framework.
Host: And to handle the massive scale of training Goku, they've built a robust infrastructure with advanced parallelism strategies to manage memory during long-context training. They're also using ByteCheckpoint for high-performance checkpointing and fault-tolerant mechanisms from MegaScale to ensure stability and scalability on large GPU clusters. Basically, they've optimized every aspect of the training process to handle the computational and data challenges of generative modeling with exceptional efficiency and reliability. It sounds like they threw everything but the kitchen sink at it.
Host: To prove Goku's worth, they tested it on both text-to-image and text-to-video benchmarks. For text-to-image, Goku-T2I performed strongly on benchmarks like T2I-CompBench, GenEval, and DPG-Bench, showing excellence in visual quality and text-image alignment. In text-to-video benchmarks, Goku-T2V achieved state-of-the-art performance on the UCF-101 zero-shot generation task. It also scored 84.85 on VBench, topping the leaderboard and outperforming several leading commercial text-to-video models. Looking at Figure 1, the images do look good, but showcase figures tend to be cherry-picked, so take them with a grain of salt.
Host: In this section, the paper dives deeper into three core components of Goku: the image-video joint VAE, the Goku Transformer architecture, and the rectified flow formulation. These are designed to work together synergistically, creating a scalable framework for joint image and video generation. During training, each raw video input is encoded from pixel space to latent space using a 3D image-video joint VAE. Then, the encoded latents are organized into mini-batches containing both video and image representations, helping the model learn a unified cross-modal representation. Finally, the rectified flow formulation is applied to these latents, using a series of Transformer blocks to model complex temporal and spatial dependencies.
Host: Let's begin with this Image-Video Joint VAE. Apparently, earlier research has shown that diffusion and flow-based models can really improve efficiency and performance by modeling in a latent space produced by a Variational Auto-Encoder (VAE). And following OpenAI's Sora, the open-source community introduced 3D-VAEs to explore spatio-temporal compression within latent spaces for video generation tasks. To get the benefits of latent-space modeling across multiple media formats, including both images and videos, they use a jointly trained Image-Video VAE that handles both image and video data within one unified framework. So, it's all about leveraging the efficiency and performance gains of latent-space modeling while handling both image and video in a cohesive way. Specifically, for videos, they apply a compression stride of 8x8x4 across height, width, and temporal dimensions. For images, the compression stride is 8x8 in spatial dimensions.
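The quoted compression strides translate into simple latent-shape arithmetic. This is a back-of-the-envelope sketch that assumes plain division by the strides; real 3D-VAE designs may treat the first frame or the channel count differently.

```python
import math

# Latent-shape arithmetic from the quoted strides: 8x8 spatial and x4 temporal
# for videos, 8x8 spatial for images. Illustrative only.

def video_latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int]:
    return (math.ceil(frames / 4), height // 8, width // 8)

def image_latent_shape(height: int, width: int) -> tuple[int, int]:
    return (height // 8, width // 8)

# Example: a 4-second 480p clip at 24 fps, and a 1024x1024 image.
print(video_latent_shape(frames=96, height=480, width=832))  # -> (24, 60, 104)
print(image_latent_shape(height=1024, width=1024))           # -> (128, 128)
```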
Host: Okay, let’s unpack the Transformer Architectures section. The Goku Transformer block builds upon GenTron, which itself is an extension of the class-conditioned diffusion transformer. Essentially, it's a specialized Transformer designed for text-to-image/video tasks. It's made up of a self-attention module (for capturing inter-token correlations), a cross-attention layer (to integrate textual conditional embeddings extracted from the Flan-T5 language model), a feed-forward network (FFN) for feature projection, and a layer-wise adaLN-Zero block that incorporates timestep information. They've also added some recent design enhancements to improve model performance and training stability, which we'll dive into shortly. So, it's a Transformer block with a few extra bells and whistles tailored for the specific task of generating images and videos from text.
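Here is a simplified, hypothetical rendering of a block with that structure: self-attention, cross-attention over text embeddings (for example from Flan-T5), an FFN, and adaLN-Zero modulation driven by the timestep embedding. Dimensions, norm placement, and initialization details are illustrative, not Goku's exact configuration.

```python
import torch
import torch.nn as nn

# Simplified sketch of a text-conditioned Transformer block: self-attention,
# cross-attention over text embeddings, FFN, and adaLN-Zero timestep modulation.

class GokuStyleBlock(nn.Module):
    def __init__(self, dim: int = 1152, heads: int = 16, text_dim: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                                vdim=text_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # adaLN-Zero: timestep embedding -> per-sublayer shift, scale, gate,
        # zero-initialized so each sub-layer starts as an identity residual.
        self.adaln = nn.Linear(dim, 9 * dim)
        nn.init.zeros_(self.adaln.weight); nn.init.zeros_(self.adaln.bias)

    def forward(self, x, text, t_emb):
        s1, b1, g1, s2, b2, g2, s3, b3, g3 = self.adaln(t_emb).unsqueeze(1).chunk(9, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.self_attn(h, h, h, need_weights=False)[0]      # inter-token correlations
        h = self.norm2(x) * (1 + s2) + b2
        x = x + g2 * self.cross_attn(h, text, text, need_weights=False)[0]  # text conditioning
        h = self.norm3(x) * (1 + s3) + b3
        return x + g3 * self.ffn(h)                                      # feature projection

# Example shapes: 16 latent tokens, 8 text tokens, one timestep embedding per sample.
block = GokuStyleBlock()
out = block(torch.randn(2, 16, 1152), torch.randn(2, 8, 2048), torch.randn(2, 1152))
```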
Host: One of the key design choices they made was to use plain full attention. In Transformer-based video generative models, previous approaches typically combined temporal attention with spatial attention to extend text-to-image generation to video. While this can reduce computational cost, it's not ideal for modeling complex temporal motions. So, in Goku, they use full attention to model multi-modal tokens (both images and videos) within a unified network. However, this creates a challenge as the number of video tokens remaining after VAE processing can be quite high. This is especially true for high-frame-rate, long-duration videos. To deal with this, they leverage FlashAttention and sequence parallelism to optimize both GPU memory usage and computational efficiency.
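To get a feel for why this matters, here is some rough token-count arithmetic using the VAE strides mentioned earlier, plus a call to PyTorch's fused scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported GPUs. This illustrates the memory pressure; it is not Goku's actual attention stack, which also relies on sequence parallelism.

```python
import torch
import torch.nn.functional as F

# Rough token-count arithmetic for full attention over a video latent.
frames, height, width = 96, 480, 832
tokens = (frames // 4) * (height // 8) * (width // 8)   # VAE stride 4x8x8
print(tokens)  # 149760 tokens for one 4-second 480p clip at 24 fps

# Fused attention via PyTorch; on supported hardware this can use a
# FlashAttention backend, avoiding the O(seq^2) attention matrix in memory.
q = k = v = torch.randn(1, 16, 1024, 64)                 # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v)
```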
Host: To enable joint training on images and videos of varying aspect ratios and lengths, they follow the approach from NaViT, packing both modalities into a single minibatch along the sequence dimension. This allows flexible mixing of training instances with different sequence lengths into a single batch, removing the need for data buckets. During joint training, they extend 3D RoPE (rotary position) embeddings to image and video tokens, leveraging their extrapolation capability to handle diverse resolutions and video lengths.
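A minimal sketch of sequence packing in the NaViT style: variable-length token sequences are concatenated along the sequence dimension, and a block-diagonal mask keeps attention within each original sample. Real implementations usually pass cumulative sequence lengths to a fused kernel instead of materializing the mask; this version is just for illustration.

```python
import torch

# Pack variable-length token sequences (images and videos) into one long sequence
# and build a block-diagonal mask so tokens only attend within their own sample.

def pack(sequences: list[torch.Tensor]):
    lengths = [s.shape[0] for s in sequences]
    packed = torch.cat(sequences, dim=0)                       # (total_tokens, dim)
    ids = torch.repeat_interleave(torch.arange(len(sequences)), torch.tensor(lengths))
    attn_mask = ids[:, None] == ids[None, :]                   # True where attention is allowed
    return packed, attn_mask

# Example: one short image sequence and one longer video sequence in the same batch.
image_tokens = torch.randn(16, 1152)
video_tokens = torch.randn(96, 1152)
packed, mask = pack([image_tokens, video_tokens])
print(packed.shape, mask.shape)   # torch.Size([112, 1152]) torch.Size([112, 112])
```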
Host: They also use Q-K Normalization. Training large-scale Transformers can lead to loss spikes, which may cause model corruption, or pure noise in generated images or videos. To fix this, they incorporate query-key normalization to stabilize the training process. Specifically, they apply RMSNorm to each query-key feature before attention computation, making the training dynamics smoother and more reliable.
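A minimal sketch of query-key normalization, assuming a plain RMS normalization without a learnable scale (a full RMSNorm layer would add one): queries and keys are normalized per head before the dot product, which bounds the attention logits.

```python
import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # RMS normalization over the last (head) dimension; no learnable scale here.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

head_dim = 64
q = torch.randn(2, 16, 1024, head_dim)   # (batch, heads, seq, head_dim)
k = torch.randn(2, 16, 1024, head_dim)
q, k = rms_norm(q), rms_norm(k)          # normalize queries and keys before attention
scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5   # bounded logits -> smoother training
```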
Host: To handle varying computational demands and performance requirements, they've designed three model variants: Goku-1B, Goku-2B, and Goku-8B. The Goku-1B model is the smallest and serves as a lightweight option for pilot experiments. The Goku-2B variant has 28 layers with a model dimension of 1792 and 28 attention heads, providing a balance between computational efficiency and expressive capacity. The largest Goku-8B variant features 40 layers, a model dimension of 3072, and 48 attention heads, delivering the capacity needed for high generation quality. The configurations are summarized in Table 1 of the paper.
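For quick reference, the variant configurations quoted in the discussion can be collected as follows; the 1B configuration is taken from the ImageNet pilot experiment mentioned later.

```python
# Goku variant configurations as quoted in the discussion (Goku-1B figures come
# from the ImageNet-1K pilot experiment described in the flow-based training section).
GOKU_VARIANTS = {
    "Goku-1B": {"layers": 28, "model_dim": 1152, "heads": 16},
    "Goku-2B": {"layers": 28, "model_dim": 1792, "heads": 28},
    "Goku-8B": {"layers": 40, "model_dim": 3072, "heads": 48},
}
```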
Host: Now we are at Flow-based Training. Their flow-based formulation is rooted in the rectified flow (RF) algorithm, where a sample is progressively transformed from a prior distribution (like a standard normal distribution) to the target data distribution. They do this by defining the forward process as a series of linear interpolations between the prior and target distributions. Given a real data sample x1 from the target distribution and a noise sample x0 from the prior distribution, a training example is made through linear interpolation: x_t = t * x_1 + (1 - t) * x_0, where t is the interpolation coefficient between 0 and 1.
Host: The model is trained to predict the velocity, which is the time derivative of x_t, or v_t = d(x_t) / dt; for this linear path, that derivative is simply x_1 - x_0. This guides the transformation of intermediate samples x_t towards the real data x_1 during inference. By setting up a direct, linear interpolation between data and noise, RF simplifies the modeling process, offering better theoretical properties, conceptual clarity, and faster convergence across data distributions. Goku takes a further step by adopting this flow-based formulation for joint image-and-video generation. They conduct a pilot experiment to see how quickly flow-based training converges by performing class-conditional generation with Goku-1B on ImageNet-1K. They configured the model with 28 layers, an attention dimension of 1152, and 16 attention heads. They compared key metrics, like FID-50K and Inception Score (IS), for models trained using both the denoising diffusion probabilistic model (DDPM) and rectified flow, and they report that rectified flow converges noticeably faster than DDPM.
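A minimal sketch of one rectified-flow training step under these definitions: sample noise, interpolate x_t, and regress the model output onto the velocity x_1 - x_0. Uniform timestep sampling and a plain MSE objective are assumptions here; the paper may use a different timestep schedule or weighting.

```python
import torch
import torch.nn.functional as F

# One rectified-flow training step: x_t = t*x1 + (1-t)*x0, target velocity x1 - x0.
# `model` is any network taking (x_t, t); uniform t sampling is assumed here.

def rf_training_step(model, x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                              # noise sample from the prior
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))   # one t per sample, broadcastable
    x_t = t * x1 + (1 - t) * x0                            # linear interpolation
    v_target = x1 - x0                                     # velocity of the linear path
    v_pred = model(x_t, t.flatten())
    return F.mse_loss(v_pred, v_target)

# Example with a trivial stand-in model over latents of shape (B, C, T, H, W).
model = lambda x, t: x * 0.0
loss = rf_training_step(model, torch.randn(2, 4, 8, 16, 16))
```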