ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. Most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees), and they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model in the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI-connected graph, adaptively identifying their redundant relationships, and using these as criteria for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing of multi-turn query-action sequences per screenshot to enhance training efficiency; (iii) Small-scale, High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data-type imbalances. With the above components, ShowUI, a lightweight 2B model trained on 256K samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further reduces redundant visual tokens by 33% during training and speeds up training by 1.4x. Navigation experiments across web (Mind2Web), mobile (AITW), and online (MiniWob) environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.
Discussion
Host: Hey everyone, and welcome back to another episode of the podcast! Today, we're diving deep into the exciting world of GUI visual agents, and I'm thrilled to have you all listening. We're going to be discussing ShowUI, a really innovative model that's pushing the boundaries of how we interact with digital interfaces. It's a pretty dense paper, so buckle up, it's going to be a ride!
Guest: Thanks for having me, Leo! I'm excited to talk about ShowUI. It tackles some really challenging problems in the field, and the results are pretty impressive.
Host: Absolutely! So, before we get into the nitty-gritty details, can you give us a quick overview of what ShowUI is all about? I mean, what problem is it trying to solve, and how does it go about doing it?
Guest: Sure. ShowUI is essentially a vision-language-action model designed to make GUI assistants smarter and more efficient. Most current GUI agents rely heavily on text-based information, like HTML or accessibility trees. But humans primarily interact with GUIs visually. ShowUI changes that by directly processing screenshots, mimicking how humans understand and interact with these interfaces. It does this through three key innovations.
Host: That's a big leap forward. I've always been fascinated by how much more contextual information a visual representation offers. So, what are these three innovations you mentioned? Let's break them down one by one.
Guest: Okay, the first is UI-Guided Visual Token Selection. Think about it: screenshots are high-resolution images, resulting in tons of visual tokens. Processing all of them is computationally expensive. ShowUI cleverly addresses this by treating the screenshot as a graph. Patches (or parts) of the image with similar RGB values are grouped together. These groups represent redundant information. During self-attention, ShowUI selectively processes tokens only from the essential parts, massively reducing computational costs. The paper shows a 33% reduction in redundant tokens and a 1.4x speedup in training. Pretty neat, huh?
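To make the idea concrete, here is a minimal sketch of that kind of UI-guided grouping and token selection, assuming a fixed patch grid and a simple mean-RGB similarity threshold; the function names, threshold, and keep ratio are illustrative and not the paper's exact implementation:

```python
import numpy as np

def build_ui_patch_graph(image, patch=28, tol=8.0):
    """Group screenshot patches into connected components by mean-RGB similarity.

    A patch is linked to its right/down neighbor when their mean RGB colors differ
    by less than `tol`; union-find merges linked patches into components that are
    treated as visually redundant regions (e.g., uniform backgrounds)."""
    img = image.astype(np.float32)
    H, W, _ = img.shape
    gh, gw = H // patch, W // patch
    means = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, 3).mean(axis=(1, 3))

    parent = list(range(gh * gw))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i in range(gh):
        for j in range(gw):
            idx = i * gw + j
            if j + 1 < gw and np.linalg.norm(means[i, j] - means[i, j + 1]) < tol:
                union(idx, idx + 1)
            if i + 1 < gh and np.linalg.norm(means[i, j] - means[i + 1, j]) < tol:
                union(idx, idx + gw)

    components = {}
    for idx in range(gh * gw):
        components.setdefault(find(idx), []).append(idx)
    return list(components.values())

def select_tokens(components, keep_ratio=0.67, rng=np.random.default_rng(0)):
    """During training, randomly keep only a fraction of the tokens inside each
    redundant component; singleton components (distinct patches) are always kept."""
    keep = []
    for comp in components:
        if len(comp) == 1:
            keep.extend(comp)
        else:
            k = max(1, int(len(comp) * keep_ratio))
            keep.extend(rng.choice(comp, size=k, replace=False).tolist())
    return sorted(keep)
```

The key point in this sketch is that singleton components, which usually correspond to distinct UI elements, are always kept, while large uniform regions contribute only a random subset of their tokens during training.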
Host: That's incredibly smart! It's like the model learns to identify the visual 'noise' and ignore it, focusing only on the relevant information. It reminds me of those attention mechanisms in other models, but this seems to be taking it a step further by pre-processing the visual data before even feeding it to the attention layers. What about the second innovation?
Guest: The second is Interleaved Vision-Language-Action Streaming. GUI interactions aren't just about language; they involve actions like clicks, scrolls, typing. ShowUI unifies all these modalities—vision, language, and action—in a continuous stream. This helps the model understand the context of actions within a sequence of interactions. For example, in navigation, it remembers previous screenshots and actions to make more informed decisions about the next step. This interleaving is particularly crucial for multi-step tasks.
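As a rough illustration of what such an interleaved stream could look like, here is a hedged sketch; the message layout, action vocabulary, coordinates, and file names are assumptions made for this example, not ShowUI's exact format:

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Screenshot:
    path: str  # placeholder: the model sees this as a block of visual tokens

@dataclass
class Action:
    kind: str                  # e.g. "CLICK", "TYPE", "SCROLL"
    value: Union[str, tuple]   # coordinates or text payload

def build_navigation_stream(task: str, steps: List[Tuple[str, Action]]) -> list:
    """Interleave vision, language, and action for multi-step navigation:
    [task query, screenshot_1, action_1, screenshot_2, action_2, ...]
    so each predicted action is conditioned on the full visual-action history."""
    stream: list = [f"Task: {task}"]
    for shot, action in steps:
        stream.append(Screenshot(shot))
        stream.append(f"Action: {action.kind} {action.value}")
    return stream

def build_multiturn_grounding(shot: str, turns: List[Tuple[str, Action]]) -> list:
    """Pair several query->action turns with a single screenshot, so one set of
    visual tokens is reused across turns instead of one image per query."""
    stream: list = [Screenshot(shot)]
    for query, action in turns:
        stream.append(f"Query: {query}")
        stream.append(f"Action: {action.kind} {action.value}")
    return stream

# Hypothetical usage:
nav = build_navigation_stream(
    "Book a one-way flight to Tokyo",
    [("home.png", Action("CLICK", (0.42, 0.13))),
     ("search.png", Action("TYPE", "Tokyo"))],
)
ground = build_multiturn_grounding(
    "settings.png",
    [("the dark-mode toggle", Action("CLICK", (0.81, 0.33))),
     ("the back arrow", Action("CLICK", (0.05, 0.04)))],
)
```

The multi-turn grounding variant is what makes training cheaper: many query-action pairs share the cost of encoding a single screenshot.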
Host: So, it's not just processing a single image and instruction, but rather a sequence of actions and observations, similar to how a human would navigate a website or application. That makes intuitive sense. It's amazing how much context is lost when we break down the interaction into individual steps. What about the third innovation?
Guest: The final one focuses on the dataset. They didn't just throw together any data they could find; they carefully curated a high-quality, small-scale dataset. They realized that different data types, like web, mobile, and desktop data, have different properties. For example, in web data, visual elements like buttons are more informative than text, since most VLMs are already good at OCR. This focus on high-quality data allows ShowUI to achieve strong performance even with a comparatively smaller training set. They also addressed data imbalances using a resampling strategy to ensure fair representation of different data types.
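For the resampling idea, here is one simple way such a strategy could be implemented, down-weighting over-represented sources with a power-law scaling; the source sizes and exponent below are hypothetical, not the paper's actual numbers:

```python
import random

def balanced_sampling_weights(dataset_sizes: dict, power: float = 0.5) -> dict:
    """Make each source's sampling probability proportional to size**power,
    so smaller sources (e.g. desktop) are sampled more often than strictly
    size-proportional sampling would allow."""
    scaled = {name: n ** power for name, n in dataset_sizes.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}

# Hypothetical source sizes, for illustration only.
sizes = {"web": 22_000, "mobile": 76_000, "desktop": 8_000}
weights = balanced_sampling_weights(sizes)

def sample_source(rng=random.Random(0)):
    """Draw the data source for the next training example according to the weights."""
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]

print(weights)          # roughly {'web': 0.29, 'mobile': 0.54, 'desktop': 0.17}
print(sample_source())  # e.g. 'mobile'
```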
Host: That's a critical point. Data quality is often overlooked, but it's fundamental to the success of any machine learning model. So, with these three innovations, what kind of results did ShowUI achieve? How does it compare to other models?
Guest: ShowUI, a remarkably lightweight 2B-parameter model trained on just 256K samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. That's a significant result, especially considering the model's size and the limited amount of training data. It even outperforms much larger models on several benchmarks. They evaluated it on grounding and navigation tasks across web, mobile, and online environments, showing competitive performance throughout.
Host: Wow, those are some impressive numbers! It seems like ShowUI really delivers on its promise of being a lightweight yet powerful model. The fact that it performs so well with limited data is particularly noteworthy. Let's delve a little deeper into the experiments and benchmarks. What datasets were used for evaluation, and what metrics were employed to assess the model's performance?
Guest: Certainly. For grounding, they used ScreenSpot, a benchmark for zero-shot grounding across different devices. For navigation, they tested on Mind2Web (web), AITW (mobile), and MiniWob (online), each offering unique challenges. Metrics included grounding accuracy and navigation success rate, along with benchmark-specific measures where applicable. I think the most crucial point is that they conducted extensive ablation studies to demonstrate the effectiveness of each component of ShowUI, showing exactly what each innovation contributes.