While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations even for simple tasks. In this work, we introduce Sphinx: Salient Point-based Hybrid ImitatioN and eXecution, a flexible IL policy that leverages multimodal observations (point clouds and wrist images), along with a hybrid action space of low-frequency, sparse waypoints and high-frequency, dense end-effector movements. Given 3D point cloud observations, Sphinx learns to infer task-relevant points within a point cloud, or salient points, which support spatial generalization by focusing on semantically meaningful features. These salient points serve as anchors for predicting waypoints for long-range movement, such as reaching target poses in free space. Once near a salient point, Sphinx learns to switch to predicting dense end-effector movements from close-up wrist images for the precise phases of a task. By exploiting the strengths of different input modalities and action representations for different manipulation phases, Sphinx tackles complex tasks in a sample-efficient, generalizable manner.
Our method achieves 86.7% success across 4 real-world and 2 simulated tasks, outperforming the next best state-of-the-art IL baseline by 41.1% on average across 440 real-world trials. Sphinx additionally generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds, achieving a 1.7x speedup over the most competitive baseline.
Sphinx learns to switch between two modes (\(m_t\)) of execution: in waypoint mode, the waypoint policy predicts a waypoint (\(w_t\)) as an offset (\(\phi_t\)) to a salient point \(z_t\) (e.g., mug handle, coffee pod) using a point cloud. Once the waypoint is reached, Sphinx switches to a wrist-camera image-based Diffusion Policy which predicts dense actions (\(a_t\)) for precise manipulation around a salient point. On the right, Sphinx interleaves both modes of execution to complete a long-horizon coffee-making task, guided by salient points and mode switches.
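To make this mode-switching control flow concrete, below is a minimal Python sketch of how the two modes might be interleaved at execution time. All class, method, and field names (`waypoint_policy`, `dense_policy`, `env.reach_waypoint`, `pred.next_mode`, etc.) are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of Sphinx-style mode switching; names are hypothetical.
WAYPOINT, DENSE, DONE = "waypoint", "dense", "done"

def rollout(env, waypoint_policy, dense_policy, max_steps=500):
    obs = env.reset()
    mode = WAYPOINT  # m_t: start with long-range reaching in waypoint mode
    for _ in range(max_steps):
        if mode == WAYPOINT:
            # Predict salient point z_t, offset phi_t, rotation, gripper,
            # and the next mode from the downsampled point cloud.
            pred = waypoint_policy(obs["point_cloud"])
            w_t = pred.salient_point + pred.offset   # w_t = z_t + phi_t
            obs = env.reach_waypoint(w_t, pred.rotation, pred.gripper)
            mode = pred.next_mode                    # e.g. switch to DENSE near z_t
        elif mode == DENSE:
            # Near the salient point, predict dense end-effector actions a_t
            # from the close-up wrist image.
            a_t, mode = dense_policy(obs["wrist_image"])
            obs = env.step(a_t)
        else:  # DONE
            break
    return obs
```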
Sphinx's waypoint policy is a GPT-2-style Transformer backbone which takes downsampled point clouds as input and predicts per-point saliency probabilities, per-point waypoint offsets, and the end-effector rotation, gripper state, and next mode as separate tokens. At test time, we take the predicted point with the highest saliency probability as the salient point and use the offset corresponding to this point to recover a full waypoint action. Accordingly, during training we only penalize offset predictions at high-probability salient points.
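For illustration, here is a rough sketch (PyTorch-style, with assumed tensor names and shapes) of recovering the positional part of a waypoint from the per-point predictions; the training-time offset masking follows the same indexing idea.

```python
import torch

def decode_waypoint(saliency_logits, offsets, points):
    """Hypothetical decoding of a waypoint position from per-point outputs.

    saliency_logits: (N,)   per-point salient-point logits
    offsets:         (N, 3) per-point 3D offsets to the waypoint position
    points:          (N, 3) xyz coordinates of the downsampled point cloud
    """
    probs = torch.softmax(saliency_logits, dim=0)
    idx = torch.argmax(probs)                     # most salient point
    salient_point = points[idx]                   # z_t
    waypoint_pos = salient_point + offsets[idx]   # w_t = z_t + phi_t
    return salient_point, waypoint_pos
```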
We design a data collection interface which allows a demonstrator to specify salient points and waypoints (which are then reached by a controller) with simple click-and-drag interactions in a web UI, and dense actions using a SpaceMouse:
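As a rough illustration of what such hybrid demonstrations might contain, the schema below sketches one possible record format; all field names and shapes are assumptions for exposition, not the actual dataset format.

```python
from dataclasses import dataclass
from typing import List, Union
import numpy as np

@dataclass
class WaypointStep:
    point_cloud: np.ndarray    # (N, 3) scene point cloud at this step
    salient_point: np.ndarray  # (3,) clicked salient point z_t
    offset: np.ndarray         # (3,) dragged offset phi_t, so w_t = z_t + phi_t
    rotation: np.ndarray       # (4,) target end-effector orientation (quaternion)
    gripper: float             # target gripper state
    next_mode: str             # "waypoint" or "dense"

@dataclass
class DenseStep:
    wrist_image: np.ndarray    # (H, W, 3) wrist-camera frame
    action: np.ndarray         # (7,) delta pose + gripper from the SpaceMouse
    next_mode: str

@dataclass
class HybridDemo:
    steps: List[Union[WaypointStep, DenseStep]]  # interleaved sparse/dense segments
```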
We compare Sphinx on 4 real and 2 simulated tasks against several state-of-the-art IL algorithms:
| Method | Waypoint Mode | Dense Mode | Point Cloud Input | Image Input | Salient Points | Offset Action Parameterization |
|---|---|---|---|---|---|---|
| DP | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ |
| 3D DP | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Fine-tuned OpenVLA | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ |
| HYDRA | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
| Vanilla Waypoint | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |
| Vanilla Waypoint + Aux. SP | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ |
| SPHINX | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Sphinx outperforms baselines across 4 real-world and 2 simulated tasks by 41.1% on average. By effectively leveraging mode switches, Sphinx executes long-range movements with high precision while better navigating difficult bottleneck states such as coffee pod insertion or placing the train on the bridge. Dense-only baselines struggle with long-horizon execution and spatial reasoning, while waypoint-only baselines cannot perform complex manipulation.
Other policies that omit salient-point and offset prediction (Vanilla Waypoint) or do not exploit both point clouds and wrist images (HYDRA) are less precise and thus more error-prone.
We visualize the distribution of initial states across successful Sphinx rollouts, spanning a wide range of object positions and orientations across the table.
Using (calibrated) point clouds makes Sphinx's waypoint policy robust to new camera viewpoints. The wrist-based dense policy is unaffected by this change, as well as by visual distractors in the surrounding scene.
Since Sphinx uses a controller to execute waypoint actions, we can arbitrarily increase the controller's maximum positional delta at test time, enabling faster execution than the rate at which data was collected. By doubling the controller's max delta at test time, Sphinx achieves a 1.7x speedup over Diffusion Policy while attaining far higher success.
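A minimal sketch of this test-time speedup, assuming a controller object with a `max_pos_delta` attribute (a hypothetical name):

```python
def speed_up_controller(controller, speedup=2.0):
    # Waypoints are position targets, so they remain valid when the controller
    # is allowed to cover more distance per control step.
    controller.max_pos_delta *= speedup
    return controller

# Usage: roughly 2x faster waypoint execution than the data-collection rate.
# fast_controller = speed_up_controller(controller, speedup=2.0)
```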