While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations even for simple tasks. In this work, we introduce Sphinx: Salient Point-based Hybrid ImitatioN and eXecution, a flexible IL policy that leverages multimodal observations (point clouds and wrist images), along with a hybrid action space of low-frequency, sparse waypoints and high-frequency, dense end-effector movements. Given 3D point cloud observations, Sphinx learns to infer task-relevant points within a point cloud, or salient points, which support spatial generalization by focusing on semantically meaningful features. These salient points serve as anchors for predicting waypoints for long-range movement, such as reaching target poses in free space. Once near a salient point, Sphinx learns to switch to predicting dense end-effector movements from close-up wrist images for the precise phases of a task. By exploiting the strengths of different input modalities and action representations for different manipulation phases, Sphinx tackles complex tasks in a sample-efficient, generalizable manner.
Our method achieves 86.7% success across 4 real-world and 2 simulated tasks, outperforming the next best state-of-the-art IL baseline by 41.1% on average across 440 real-world trials. Sphinx additionally generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds, achieving a 1.7x speedup over the most competitive baseline.
Sphinx learns to switch between two modes (\(m_t\)) of execution: in waypoint mode, the waypoint policy predicts a waypoint (\(w_t\)) as an offset (\(\phi_t\)) to a salient point \(z_t\) (e.g., mug handle, coffee pod) using a point cloud. Once the waypoint is reached, Sphinx switches to a wrist-camera image-based Diffusion Policy which predicts dense actions (\(a_t\)) for precise manipulation around a salient point. On the right, Sphinx interleaves both modes of execution to complete a long-horizon coffee-making task, guided by salient points and mode switches.
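To make this mode-switching control flow concrete, below is a minimal Python sketch of how the two modes might be interleaved at execution time. All class, method, and field names (`waypoint_policy`, `dense_policy`, `env.reach_waypoint`, `pred.next_mode`, etc.) are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of Sphinx-style mode switching; names are hypothetical.
WAYPOINT, DENSE, DONE = "waypoint", "dense", "done"

def rollout(env, waypoint_policy, dense_policy, max_steps=500):
    obs = env.reset()
    mode = WAYPOINT  # m_t: start with long-range reaching in waypoint mode
    for _ in range(max_steps):
        if mode == WAYPOINT:
            # Predict salient point z_t, offset phi_t, rotation, gripper,
            # and the next mode from the downsampled point cloud.
            pred = waypoint_policy(obs["point_cloud"])
            w_t = pred.salient_point + pred.offset   # w_t = z_t + phi_t
            obs = env.reach_waypoint(w_t, pred.rotation, pred.gripper)
            mode = pred.next_mode                    # e.g. switch to DENSE near z_t
        elif mode == DENSE:
            # Near the salient point, predict dense end-effector actions a_t
            # from the close-up wrist image.
            a_t, mode = dense_policy(obs["wrist_image"])
            obs = env.step(a_t)
        else:  # DONE
            break
    return obs
```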
Sphinx's waypoint policy is a GPT-2-style Transformer backbone which takes downsampled point clouds as input and predicts per-point saliency probabilities, per-point waypoint offsets, and the end-effector rotation, gripper state, and next mode as separate tokens. At test time, we take the predicted point with the highest saliency probability as the salient point and use the offset corresponding to this point to recover a full waypoint action. Accordingly, during training we only penalize offset predictions at high-probability salient points.
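For illustration, here is a rough sketch (PyTorch-style, with assumed tensor names and shapes) of recovering the positional part of a waypoint from the per-point predictions; the training-time offset masking follows the same indexing idea.

```python
import torch

def decode_waypoint(saliency_logits, offsets, points):
    """Hypothetical decoding of a waypoint position from per-point outputs.

    saliency_logits: (N,)   per-point salient-point logits
    offsets:         (N, 3) per-point 3D offsets to the waypoint position
    points:          (N, 3) xyz coordinates of the downsampled point cloud
    """
    probs = torch.softmax(saliency_logits, dim=0)
    idx = torch.argmax(probs)                     # most salient point
    salient_point = points[idx]                   # z_t
    waypoint_pos = salient_point + offsets[idx]   # w_t = z_t + phi_t
    return salient_point, waypoint_pos
```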
We design a data collection interface which allows a demonstrator to specify salient points and waypoints (which are then reached by a controller) with simple click-and-drag interactions in a web UI, and dense actions using a SpaceMouse:
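As a rough illustration of what such hybrid demonstrations might contain, the schema below sketches one possible record format; all field names and shapes are assumptions for exposition, not the actual dataset format.

```python
from dataclasses import dataclass
from typing import List, Union
import numpy as np

@dataclass
class WaypointStep:
    point_cloud: np.ndarray    # (N, 3) scene point cloud at this step
    salient_point: np.ndarray  # (3,) clicked salient point z_t
    offset: np.ndarray         # (3,) dragged offset phi_t, so w_t = z_t + phi_t
    rotation: np.ndarray       # (4,) target end-effector orientation (quaternion)
    gripper: float             # target gripper state
    next_mode: str             # "waypoint" or "dense"

@dataclass
class DenseStep:
    wrist_image: np.ndarray    # (H, W, 3) wrist-camera frame
    action: np.ndarray         # (7,) delta pose + gripper from the SpaceMouse
    next_mode: str

@dataclass
class HybridDemo:
    steps: List[Union[WaypointStep, DenseStep]]  # interleaved sparse/dense segments
```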
We compare Sphinx on 4 real and 2 simulated tasks against several state-of-the-art IL algorithms:
| Method | Waypoint Mode | Dense Mode | Point Cloud Input | Image Input | Salient Points | Offset Action Parameterization |
|---|---|---|---|---|---|---|
| DP | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ |
| 3D DP | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Fine-tuned OpenVLA | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ |
| HYDRA | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
| Vanilla Waypoint | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ |
| Vanilla Waypoint + Aux. SP | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ |
| SPHINX | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Sphinx outperforms baselines across 4 real-world and 2 simulated tasks by 41.1% on average. By effectively leveraging mode switches, Sphinx executes long-range movements with high precision while better navigating difficult bottleneck states such as coffee pod insertion or placing the train on the bridge. Dense-only baselines struggle with long-horizon execution and spatial reasoning, while waypoint-only baselines cannot perform complex manipulation.
Other policies that omit salient-point and offset prediction (Vanilla Waypoint) or do not exploit both point clouds and wrist images (HYDRA) are less precise and thus more error-prone.
We visualize the distribution of initial states across successful Sphinx rollouts, spanning a wide range of object positions and orientations across the table.
Using (calibrated) point clouds makes Sphinx's waypoint policy robust to new camera viewpoints. The wrist-based dense policy is unaffected by this change, as well as by visual distractors in the surrounding scene.
Since Sphinx uses a controller to execute waypoint actions, we can arbitrarily increase the controller's maximum positional delta at test time, enabling faster execution than the rate at which data was collected. By doubling the controller's max delta at test time, Sphinx achieves a 1.7x speedup over Diffusion Policy while attaining far higher success.
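A minimal sketch of this test-time speedup, assuming a controller object with a `max_pos_delta` attribute (a hypothetical name):

```python
def speed_up_controller(controller, speedup=2.0):
    # Waypoints are position targets, so they remain valid when the controller
    # is allowed to cover more distance per control step.
    controller.max_pos_delta *= speedup
    return controller

# Usage: roughly 2x faster waypoint execution than the data-collection rate.
# fast_controller = speed_up_controller(controller, speedup=2.0)
```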