OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
The development of general robotic systems capable of manipulation in unstructured environments is a significant challenge. While Vision-Language Models (VLMs) excel in high-level commonsense reasoning, they lack the fine-grained 3D spatial understanding required for precise manipulation tasks. Fine-tuning VLMs on robotic datasets to create Vision-Language-Action Models (VLAs) is a potential solution, but it is hindered by high data collection costs and generalization issues. To address these challenges, we propose a novel object-centric representation that bridges the gap between the VLM's high-level reasoning and the low-level precision required for manipulation. Our key insight is that an object's canonical space, defined by its functional affordances, provides a structured and semantically meaningful way to describe interaction primitives, such as points and directions. These primitives act as a bridge, translating the VLM's commonsense reasoning into actionable 3D spatial constraints. In this context, we introduce a dual closed-loop, open-vocabulary robotic manipulation system: one loop for high-level planning through primitive resampling, interaction rendering, and VLM checking, and another for low-level execution via 6D pose tracking. This design ensures robust, real-time control without requiring VLM fine-tuning. Extensive experiments demonstrate strong zero-shot generalization across diverse robotic manipulation tasks, highlighting the potential of this approach for automating large-scale simulation data generation.
Discussion
Host: Hey everyone, and welcome back to the podcast! Today, we're diving into some seriously fascinating stuff in the world of robotics. I'm your host, Leo, and I'm super excited about this topic because it touches on AI, vision, and of course, robots working in the real world. We've got a fantastic paper to unpack, something about making robots more generally capable in everyday environments. It’s a pretty big challenge, getting robots to do things as we humans do, but there’s some promising work being done.
Host: So, the paper is titled 'OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints.' It’s a mouthful, I know, but the core ideas are pretty cool. What’s particularly interesting is how they’re trying to bridge the gap between the high-level understanding of large language and vision models and the super precise movements a robot needs to make to, say, pour a cup of tea. I mean, think about how much we just take for granted, from knowing how to grasp a handle to knowing how far to tilt to not spill. It's not an easy problem for our robot friends. I guess, let's just jump into it?
Host: Alright, let's start at the very beginning, shall we? The introduction. The researchers here are basically stating the big challenge: making a robot that can manipulate things in unstructured environments is hard, like, really hard. We humans navigate chaos every day, but for robots, it’s a real mountain to climb. They mention that we have made great strides in Large Language Models (LLMs) and Vision-Language Models (VLMs), these AI systems that have learned a huge amount of commonsense knowledge from mountains of internet data. They have seen how cats act and that spoons are used to scoop things. But when it comes to robots, these models lack the fine-grained spatial understanding needed to actually grasp that spoon and scoop the food properly.
Host: And that's the crux of the problem, right? You can have an AI that understands that 'pour liquid' involves tilting a container, but unless it knows the precise 3D positions and orientations of the container and the cup, you’re just going to end up with a big mess. That's why many folks have tried to fine-tune these VLMs on robot-specific data, basically creating what they call Vision-Language-Action Models, or VLAs. The goal is to get AI models to learn directly from robot demonstrations, but there's a catch, and it's a two-sided data problem. It's expensive and time-consuming to gather high-quality robot data, and even when you do, the model becomes tied to that specific robot, which reduces generalizability. I mean, you teach a model how to pour using a robot arm from brand A, and it may not be able to do it with a robotic arm from brand B.
Host: Exactly! It’s not a 'one-size-fits-all' solution. They point out the alternative, which is to break down robot actions into more basic interaction primitives. Imagine these as building blocks, like a point, a direction, or a grasp, and then VLMs are used to figure out the spatial relationships between those building blocks. Traditional algorithms handle the execution once this high-level plan is defined. But here's where the older methods start to fall short. Existing approaches for generating these ‘primitive’ actions are often kind of random and don't really consider the task at hand. It’s like throwing darts in the dark. They then rely on manual rules to refine them, which, as you can imagine, isn't the most stable thing. This lack of a systematic approach to defining and using primitives is why they propose their own approach.
Host: Right, and that leads us to their central idea, which is this ‘object-centric intermediate representation’. Instead of just seeing the world as a jumble of pixels, they’re focusing on individual objects. And, crucially, each object is defined not just by its shape but also by its ‘canonical space’. They argue that an object’s canonical space is naturally linked to its functional affordances, which are basically the things you can do with it. Like, a teapot has a handle for grasping, a spout for pouring, and a base to set down on a surface. All these features are defined within its canonical space, which is unique to each object. They emphasize that you can describe the functionality of an object in this canonical space because it's always consistent. Recent developments in pose estimation, especially universal pose estimation, have made it possible to canonicalize a very wide range of object types.
Host: So, if I'm understanding this correctly, they're not just seeing a generic 'cup' but a cup with specific, consistent interaction points and directions relative to a virtual coordinate system on the cup itself. Like, instead of thinking of the cup handle as some random collection of pixels, they are thinking about it as a consistent place to grasp, defined in a coordinate system attached to the cup. It's a way to bring more structure and meaning to how the robot perceives objects, and how it plans to interact with them. This way it does not have to relearn the concept of handles every time for each different object. And with their method using universal 6D pose estimation, they can calculate how an object is positioned and oriented in space, alongside mesh representations generated by a 3D generation network. This is where we start to see how it all comes together.
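(A quick aside for readers who want to see the idea in code: a minimal sketch, assuming an object's 6D pose has already been estimated as a rotation R and translation t, of how an interaction point and direction defined in the canonical frame map into the world frame. All names and numbers here are invented for illustration.)

```python
import numpy as np

def primitive_to_world(R, t, point_c, direction_c):
    """Map a canonical-frame interaction primitive into the world frame.

    Points transform with rotation and translation; directions with rotation only.
    """
    point_w = R @ point_c + t
    direction_w = R @ direction_c
    return point_w, direction_w / np.linalg.norm(direction_w)

# Hypothetical mug: the handle center and grasp axis are fixed in canonical space.
R = np.eye(3)                                  # estimated orientation (identity here)
t = np.array([0.40, -0.10, 0.05])              # estimated position, meters
handle_point_c = np.array([0.06, 0.00, 0.04])  # canonical handle center
grasp_dir_c = np.array([1.0, 0.0, 0.0])        # canonical grasp direction

p_w, v_w = primitive_to_world(R, t, handle_point_c, grasp_dir_c)
print("world grasp point:", p_w)
print("world grasp direction:", v_w)
```

The point is that the primitive itself never changes; only the pose transform does, which is exactly what lets the robot reuse the same 'handle' definition across scenes.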
Host: Yeah, it’s like giving the robot a set of blueprints for how things work. Once they've canonicalized an object, they start looking at the potential interaction directions, which is a really clever part. They start by proposing interaction directions along the main axes of the objects, which aligns with the way humans often interact with things, a natural approach. Think about it, we often pick something up along its principal axis of symmetry. To figure out which of these axes is actually relevant for a given task, they use a VLM together with an LLM: the VLM generates a description of each axis, and then a large language model (LLM) scores how relevant each description is to the task. It’s like asking the AI, “if you want to pour tea, which of these directions on the teapot seems most helpful?” It's a smart way to combine semantic reasoning with spatial understanding.
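(For concreteness, the candidate set they start from is essentially the signed principal axes of the canonical frame; a tiny illustrative sketch, not the paper's code.)

```python
import numpy as np

# Candidate interaction directions: the signed principal axes of the object's
# canonical frame. Which axis is task-relevant is decided later by VLM captioning
# and LLM scoring; here we only enumerate the candidates.
def candidate_directions():
    axes = np.vstack([np.eye(3), -np.eye(3)])    # +x, +y, +z, -x, -y, -z
    labels = ["+x", "+y", "+z", "-x", "-y", "-z"]
    return list(zip(labels, axes))

for label, axis in candidate_directions():
    print(label, axis)
```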
Host: Okay, so they are not randomly selecting directions, but intelligently selecting a few using LLMs and VLMs, which also helps reduce processing overhead. I mean, what's the point of looking at every direction on a cup when you only need to consider the handle to pick it up? It does not help if the model starts considering the bottom and the side of the cup. It’s like giving the robot only the most relevant options to make its calculations easier. And then they have this ‘dual closed-loop, open-vocabulary robotic manipulation system.’ This sounds fancy. The closed-loop part means there is feedback involved that improves the system. I am thinking here of those control systems where you correct mistakes as you go instead of just executing based on an initial calculation, but dual? What’s the other loop for?
Host: Right, so the 'dual' refers to two interconnected feedback loops. One is the high-level planning loop. This loop is responsible for deciding the 'what' and 'how' of manipulation. They don’t just make a plan and go with it; they use their proposed ‘interaction rendering’ and ‘primitive resampling’ alongside the VLM to check and recheck the plausibility and correctness of each step. It's like creating a simulation of the action, showing it to the VLM, and asking, “does this look right? If not, try again”. This helps catch errors that arise from the VLM. The other loop is the low-level execution loop, which deals with precise movement. This loop is all about getting the robot arm to move correctly and accurately. By continuously tracking the object's 6D pose, which is its position and orientation in space, they're making sure that the actual movement matches the desired outcome. This is all done in real time. And, importantly, the whole thing is ‘open-vocabulary,’ which means it's designed to work with many objects and doesn’t require the VLM to be specifically trained on robotic data.
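(Schematically, the dual-loop structure composes like the toy skeleton below; every function is a stand-in for the real components, purely to show how the two loops nest around the task stages.)

```python
# The dual-loop structure at a glance. Both loops are stand-ins: the real outer
# loop uses interaction rendering and a VLM check, and the real inner loop uses
# constraint optimization driven by real-time 6D pose tracking.

def plan_stage(stage):
    """Outer loop stand-in: resample / render / VLM-check until a primitive passes."""
    return f"validated primitive for '{stage}'"

def execute_stage(primitive):
    """Inner loop stand-in: track the object's 6D pose and keep re-solving the gripper pose."""
    print(f"executing with {primitive}")

for stage in ["grasp the teapot", "pour into the cup"]:
    primitive = plan_stage(stage)    # closed-loop planning
    execute_stage(primitive)         # closed-loop execution
```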
Host: Okay, so it's like a two-stage correction system, a safety net at every layer. The planning loop checks whether the action makes sense in simulation, and the execution loop ensures that the robot arm moves correctly with up-to-date information, even if things change during execution. The authors also emphasize that this system operates without fine-tuning the VLM. In short, the VLM does not have to be retrained on robotic data; instead, it works with its general knowledge and is guided by the object primitives from the system. I think I'm starting to understand this. Let's look into the details, and for that, let's move into the related works section.
Host: Sure. So in the related work, they first cover how LLMs and VLMs have been used for high-level task planning in robotics, breaking an instruction down into subtasks the robot can tackle one at a time. But, of course, that high-level plan still needs to be converted into a set of robot arm movements.
Host: And there’s the rub, right? As good as VLMs are at understanding the semantics of a task, they are trained mostly on 2D data, so they’re not great at the precise 3D spatial reasoning required for robotic manipulation. Some researchers have tried to fine-tune VLMs on robot datasets, which creates those specialized VLA models we were talking about earlier, but as we also discussed, that suffers from generalization issues and high data collection costs. The paper also highlights that other researchers have tried to extract operation primitives and feed them to VLMs as prompts or directions. That way, the VLMs are only responsible for high-level reasoning, and motion planners handle the low-level control. However, this approach has a limitation of its own: the way the primitives are fed into the VLM is ambiguous, because 3D primitives have to be compressed into 2D images or 1D text. The fact that VLMs themselves have a tendency to hallucinate adds even more complexity. It’s like trying to describe a complex 3D structure using only words or 2D sketches; it’s hard to be precise. This is where OmniManip claims to do better, with a representation that captures the spatial structure of the world and a mechanism for mitigating the hallucination tendencies of these models.
Host: Absolutely. It's about how you represent the world to the AI, and that takes us to another interesting part of this section: representations for manipulation. This is like choosing the right language to talk to the robot. There’s been a lot of work using ‘keypoints’, which are essentially points on an object. These have the benefit of flexibility and are very good at modeling variations, but they typically require manual, task-specific annotations, which is not very scalable. That’s why researchers have also tried converting keypoints into visual prompts and using VLMs to generate high-level plans. However, keypoints can be unstable, because of things like occlusion or the difficulty of selecting good points, which brings us to 6D poses. 6D poses capture the position and orientation of an object, they are more stable and robust under occlusion, and they can capture long-range dependencies between objects as well. But these methods need prior modeling of geometric relationships and may not provide fine-grained geometry; for example, a pose is good enough for picking a cup up, but not for more precise manipulations like inserting a pen into a specific hole.
Host: Exactly. Keypoints are good for capturing specific points of interaction, but they're not always reliable. Poses are great for overall object location, but not precise enough for fine-grained manipulation. OmniManip combines the best of both worlds, according to the researchers. It uses the fine-grained geometric understanding of keypoints, but within the stable framework of the object’s 6D pose. It's about automatically identifying these important functional points and directions within the object's coordinate system, and using that information for precise manipulation with VLMs. This also makes it more robust to occlusion, because if the primary point is hidden, the system can still track the orientation of the object as a whole and derive the point from the object's pose. So instead of just seeing points in an image, they're looking at where these points sit relative to the canonical space of the object itself, a very smart move.
Host: Yeah, it’s all about creating a more structured, more meaningful representation for the robot. Now, let’s move to the juicy part, the ‘Method’ section. This is where the authors dive into the details of how their system actually works. They start by asking a few key questions. First, how do they formulate robotic manipulation using primitives as spatial constraints? Next, how do they extract these canonical interaction primitives in a generic, open-vocabulary way? And, finally, how does their method achieve a dual closed-loop system? These are the core questions they aim to answer in this section.
Host: Okay, so let's tackle that first question: how do they formulate robotic manipulation with interaction primitives as spatial constraints? This all starts with decomposing the complex task, like pouring tea or assembling some object, into smaller, more manageable stages. They don't try to solve everything at once, but instead break it down into discrete steps. Each stage is defined by the interaction between objects. In the pouring tea example, you have a stage of grasping the teapot, and then the stage of pouring the tea into the cup. Each stage is formalized as an action together with the object that actively initiates it and the object being acted upon. So, for pouring tea, the teapot is the object initiating the action, and the cup is the object being acted upon. This is a key part of their object-centric view of the world.
Host: And within each of these stages, they define the interactions using object-centric primitives. Each primitive is defined by an ‘interaction point,’ which is where the interaction happens on the object, and an ‘interaction direction,’ which indicates the functional axis along which the interaction occurs. For instance, when grasping a teapot, the interaction point could be the center of the handle, and the direction could be along the handle. The primitives, as they say, are always defined in their respective canonical spaces, so they are consistent across scenarios and reusable. It’s like a library of interaction points and directions that the robot can pull from. This representation, which they call ‘O’, is the interaction point ‘p’ and the interaction direction ‘v’ encapsulated together. This then leads to the definition of spatial constraints.
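(As a sketch, such a primitive could be represented by a small data structure like the hypothetical one below; this is illustrative, not the authors' implementation, and the teapot values are made up.)

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class InteractionPrimitive:
    """An object-centric primitive O: a point p and a direction v in canonical space."""
    name: str
    point: np.ndarray      # p: interaction point, canonical frame, meters
    direction: np.ndarray  # v: interaction direction, canonical frame, unit vector

# Hypothetical teapot primitives.
handle_grasp = InteractionPrimitive("handle_grasp",
                                    point=np.array([0.10, 0.00, 0.05]),
                                    direction=np.array([0.0, 0.0, 1.0]))
spout_pour = InteractionPrimitive("spout_pour",
                                  point=np.array([-0.12, 0.00, 0.09]),
                                  direction=np.array([-1.0, 0.0, 0.0]))
print(handle_grasp)
print(spout_pour)
```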
Host: Right. So, each stage is not just a sequence of actions, but a set of specific spatial relationships between objects. They define these relationships using spatial constraints, denoted with a ‘C’, which are divided into ‘distance constraints’ that control the distance between interaction points and ‘angular constraints’ that ensure the proper alignment of the interaction directions. For instance, when you’re pouring tea, the distance between the teapot spout and the cup rim, and the angle between the teapot's pouring direction and the cup's opening, are the spatial constraints. So, when they are doing calculations, these are the actual metrics they are considering. The overall spatial constraint of each stage is defined by an active object and a passive object along with the distance and angular constraints. After that, they form an optimization problem where they try to find the best execution strategy that conforms to all these constraints, a very elegant approach if you ask me.
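(And here is a minimal sketch of what those two constraint types could reduce to numerically, assuming the active and passive primitives have already been expressed in the world frame; the target values and poses are made up.)

```python
import numpy as np

def distance_residual(p_active, p_passive, target=0.0):
    """Distance constraint: how far the two interaction points are from the target gap."""
    return np.linalg.norm(p_active - p_passive) - target

def angular_residual(v_active, v_passive, target_deg=180.0):
    """Angular constraint: deviation (degrees) from the desired angle between directions."""
    cos = np.clip(np.dot(v_active, v_passive) /
                  (np.linalg.norm(v_active) * np.linalg.norm(v_passive)), -1.0, 1.0)
    return np.degrees(np.arccos(cos)) - target_deg

# Hypothetical pouring stage: teapot spout (active) above the cup opening (passive).
spout_p, spout_v = np.array([0.30, 0.00, 0.25]), np.array([0.0, 0.0, -1.0])
cup_p, cup_v     = np.array([0.30, 0.02, 0.10]), np.array([0.0, 0.0,  1.0])

print("distance residual:", distance_residual(spout_p, cup_p, target=0.10))
print("angular residual:", angular_residual(spout_v, cup_v, target_deg=180.0))
```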
Host: It's like building a precise puzzle where each piece, each object and its interaction primitive, has to fit just right. The second part of the method section asks, “How do you extract the interaction primitives in an open-vocabulary way?”, which is crucial because you want a robot that can work with new objects, not only the ones it has seen before. For this, they use both single-view 3D generation and object pose estimation: from an RGB-D image, a 3D generation network produces a mesh for each object, then Omni6DPose canonicalizes the objects by estimating their poses, and that gives them the canonical object space they will use for interaction. After that, they extract the task-relevant interaction primitives, specifically the interaction point and the interaction direction.
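(Conceptually, the extraction pipeline could be outlined as below. Every function is a named placeholder standing in for the components the paper mentions, a single-view 3D generation network, Omni6DPose, and the VLM/LLM grounding step, not their real APIs.)

```python
import numpy as np

# Placeholder pipeline: each step stands in for a real component (3D generation,
# Omni6DPose-style canonicalization, VLM/LLM primitive grounding). None of these
# names are real APIs; they only show the order of operations.

def generate_mesh(rgbd):                    # single-view 3D generation (stub)
    return {"vertices": np.zeros((8, 3)), "faces": []}

def estimate_canonical_pose(rgbd, mesh):    # canonicalization via pose estimation (stub)
    return np.eye(3), np.zeros(3)           # rotation R, translation t

def ground_primitives(mesh, task):          # VLM/LLM-driven primitive extraction (stub)
    return {"point": np.array([0.1, 0.0, 0.05]), "direction": np.array([0.0, 0.0, 1.0])}

def extract_object_space(rgbd, task):
    mesh = generate_mesh(rgbd)
    R, t = estimate_canonical_pose(rgbd, mesh)
    primitive = ground_primitives(mesh, task)
    return {"pose": (R, t), "primitive": primitive}

print(extract_object_space(rgbd=None, task="pour tea into the cup"))
```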
Host: Let’s talk more about that interaction point extraction, which they call ‘grounding the interaction point’. They separate points into two main types: ‘visible’ or ‘tangible’ points, which are easy to find on an object, like the handle of a teapot, and ‘invisible’ or ‘intangible’ points, like the center of the teapot opening. For the visible points, they employ a technique called SCAFFOLD, which basically overlays a grid on the image, and the points are localized directly on the 2D image plane. For the invisible points, they use reasoning from multiple views of the object: the inference starts from the current view, and if there is any ambiguity, it moves to a different view, most commonly an orthogonal one, to better infer where the point lies. This multi-view reasoning helps to more reliably pinpoint locations where the interaction must occur, and for grasping tasks they build heatmaps from these interaction points, making the grasping more robust.
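(For the visible case, once a grid cell has been picked in the image, turning that pixel into a 3D point is standard pinhole back-projection using the depth map; a generic sketch with invented camera intrinsics and depth values.)

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with depth (meters) into a 3D camera-frame point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical intrinsics and a depth reading at the selected grid cell.
point_cam = backproject(u=412, v=305, depth=0.62, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
print("camera-frame interaction point:", point_cam)
# In the paper's setup, this camera-frame point would then be expressed in the
# object's canonical frame using the estimated 6D pose.
```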
Host: Right, if the target point is hidden, reasoning from other perspectives definitely helps. Now, how about ‘sampling the interaction direction’? As they explain, the principal axes of an object in its canonical space are often functionally relevant, so they start with these principal axes as the candidate interaction directions. The challenge is figuring out which of these is actually relevant: if you want to pour tea, the axis you care about will probably be the one that aligns with the direction of the spout. For that, they use the VLM captioning and LLM scoring mechanism. First the VLM generates a description of each candidate axis, and then the LLM scores each description according to its relevance to the task, which gives them an ordered set of interaction directions. It’s quite clever; they’re not just blindly picking axes, they’re ranking them by task relevance. After all this, they finally have their interaction primitives with constraints, ordered by their potential for successfully accomplishing the task.
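(A toy version of that ranking step, with hard-coded scores standing in for whatever the LLM would actually return; the captions and numbers are invented.)

```python
# Toy ranking of candidate axes by task relevance. In the real system a VLM
# captions each axis and an LLM assigns the scores; here they are hard-coded.
candidates = {
    "+x (toward the spout)": 0.92,
    "-x (toward the handle)": 0.35,
    "+z (up through the lid)": 0.48,
    "-z (down through the base)": 0.10,
}

task = "pour tea from the teapot into the cup"
ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
print(f"task: {task}")
for description, score in ranked:
    print(f"  {score:.2f}  {description}")
```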
Host: Exactly. They have effectively narrowed down the problem from a large number of possibilities to a well-defined set of task-specific choices. This also means that the search space for the execution plan is now much smaller and more targeted, which makes the planning and the execution faster and more accurate. Now, let's look at that all-important ‘dual closed-loop’ part, which is the final question of this section. As we mentioned earlier, the interaction primitives along with their constraints are a good start, but on their own they are still open-loop inference, which limits robustness. There are two reasons for that: first, the hallucination of large models, and second, the dynamic environment. If the robot is executing a plan and an object moves in the real world, the plan can easily fail. So they introduce the ‘Resampling, Rendering and Checking (RRC)’ mechanism, which we talked about earlier; this is how they achieve the planning half of their dual closed-loop system.
Host: Right, the RRC process is what allows them to implement ‘closed-loop planning’. In short, this mechanism uses feedback from the VLM to identify and fix planning errors. It has two phases. In the initial phase, the system evaluates a constraint by rendering the interaction and submitting it to the VLM, which returns success, failure, or refinement. If successful, the system moves on to execution; on failure, it moves to the next constraint; and on refinement, it enters the refinement phase. This all happens inside a while loop. In the refinement phase, the system resamples the interaction directions to correct the alignment between the functional and geometric axes, and then starts from the beginning again with the newly resampled primitives. That’s the self-correction mechanism, and the whole process is laid out as an algorithm in the paper.
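(A stripped-down sketch of that loop, with the rendering and VLM check replaced by stubs; the three verdicts mirror the success/failure/refinement outcomes described above.)

```python
import random

def render_and_check(primitive):
    """Stub for 'render the interaction and ask the VLM'; returns one of three verdicts."""
    return random.choice(["success", "failure", "refinement"])

def resample_directions(primitive):
    """Stub for resampling: nudge the primitive toward a better functional axis."""
    return primitive + "*"

def rrc(candidates, max_iters=20):
    """Resampling, Rendering and Checking: iterate until a candidate passes the VLM check."""
    queue = list(candidates)
    for _ in range(max_iters):
        if not queue:
            break
        primitive = queue.pop(0)
        verdict = render_and_check(primitive)
        if verdict == "success":
            return primitive                                     # move on to execution
        if verdict == "refinement":
            queue.insert(0, resample_directions(primitive))      # retry with resampled primitive
        # on "failure": fall through to the next candidate constraint
    return None

print("selected primitive:", rrc(["grasp-along-handle", "grasp-along-rim"]))
```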
Host: So, it’s a dynamic, iterative process. The system is always checking its work and making adjustments, instead of just blindly following an initial plan, which is essential in the real world, where things rarely go exactly as planned. But that’s just the planning loop. What about the execution loop? Well, once the primitives and spatial constraints are defined, task execution is viewed as an optimization problem. Basically, they want to find the best end-effector pose that minimizes a loss function, which is a combination of a constraint loss, a collision loss, and a path loss. The constraint loss ensures that the action satisfies the spatial constraints, the collision loss ensures that the robot arm does not bump into things, and the path loss keeps the motion smooth. By minimizing these losses, the robot arm is able to adjust its pose dynamically.
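(A hedged sketch of that optimization, shrunk down to a single 3D end-effector position with toy stand-ins for the three loss terms and solved with SciPy; the weights and geometry are invented, and the real system optimizes a full 6D pose.)

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-ins for the three loss terms.
target_point = np.array([0.35, 0.00, 0.20])     # where the spatial constraint wants us
obstacle     = np.array([0.30, 0.05, 0.15])     # a point obstacle to stay away from
current_pose = np.array([0.20, -0.10, 0.30])    # current end-effector position

def total_loss(x, w_constraint=1.0, w_collision=0.5, w_path=0.1):
    constraint_loss = np.sum((x - target_point) ** 2)                # satisfy the constraint
    collision_loss = np.exp(-10.0 * np.linalg.norm(x - obstacle))    # penalize proximity
    path_loss = np.sum((x - current_pose) ** 2)                      # keep the motion short
    return w_constraint * constraint_loss + w_collision * collision_loss + w_path * path_loss

result = minimize(total_loss, x0=current_pose)
print("optimized end-effector position:", result.x)
```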
Host: And that’s how they bridge the gap between the plan and the real world movement! But even with that, there can be variations in how the robot actually grasps the object or other environmental changes. This is why they use the 6D pose tracking in the execution loop. This means they continuously track the 6D poses of the objects in real-time, and they will update their plans as new information becomes available, which leads to continuous adjustment to the end effector and accurate execution in a closed loop. In essence, this real-time feedback from the pose tracker ensures the robot can adjust to unexpected changes, and that’s how they accomplish ‘closed-loop execution’ which is robust against uncertainty and unforeseen circumstances. It’s a very detailed explanation, and I think we should move into the experiment section to see how this all performs in practice.
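(Finally, the execution loop in miniature: a simulated object that drifts a little each step, a stand-in tracker that reports its position, and a gripper target recomputed from the tracked pose rather than from the original plan. Everything here is simulated; no real tracker or robot is involved, and the identity orientation is assumed for simplicity.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated world: the object's true position drifts slightly every step.
object_pos = np.array([0.40, 0.00, 0.10])
grasp_offset_canonical = np.array([0.06, 0.00, 0.04])   # handle offset in the object frame
gripper = np.array([0.20, -0.20, 0.30])

for step in range(5):
    object_pos = object_pos + rng.normal(scale=0.005, size=3)   # environment changes
    tracked_pos = object_pos                                    # stand-in for 6D pose tracking
    target = tracked_pos + grasp_offset_canonical               # re-derive the grasp target
    gripper = gripper + 0.5 * (target - gripper)                # move part-way toward it
    print(f"step {step}: gripper at {np.round(gripper, 3)}, target {np.round(target, 3)}")
```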