GR00T N1.6: Task vs Scene Geometry
at the System 2 → System 1 Interface

Robotics Security Division, ETHRC

Abstract

GR00T N1.6 is a hierarchical Vision-Language-Action model with two coupled subsystems: a high-level System 2 (Eagle2 VLM backbone) that processes observations and language instructions, and a low-level System 1 (flow-matching diffusion policy) that generates actions. System 2 passes conditioning tokens to System 1 at each step — a bottleneck through which all semantic and visual information must flow.

We probe these interface tokens during rollouts on four tabletop manipulation tasks in SimplerEnv to ask: does the S2 → S1 conditioning representation organize around task identity or scene identity? We study nvidia/GR00T-N1.6-fractal, a model fine-tuned on the Fractal dataset that generalizes reliably across scene variants (see Experiment 1). If its latent space clusters by task — grouping same-task embeddings from different scenes closer than different-task embeddings from the same scene — the interface encodes what to do rather than where you are.

Architecture & Hook Point

[Figure: GR00T N1.6 architecture. System 2 (VLM) → S2→S1 interface → System 1 (Diffusion Transformer). Hook point: conditioning tokens are extracted at the S2 → S1 interface.]

The conditioning tokens are extracted with a register_forward_hook on the final layer of the System 2 transformer, before its output is consumed by System 1’s cross-attention. Each backbone call produces one set of conditioning tokens, which we later pool into a single embedding vector for analysis.
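A minimal sketch of how such a hook can be attached. The attribute path to the final System 2 module (policy.model.backbone) and the .npz layout are assumptions for illustration; the real module path depends on how the GR00T policy object exposes its submodules.

```python
import numpy as np

# Sketch only: `policy.model.backbone` is a hypothetical path to the final
# System 2 layer; substitute the module actually exposed by the GR00T policy.
conditioning_buffer = []

def save_conditioning(module, inputs, output):
    # `output` is assumed to be the (1, 107, 2048) tensor handed to
    # System 1's cross-attention; detach so the rollout is unaffected.
    conditioning_buffer.append(output.detach().cpu().numpy())

hook_handle = policy.model.backbone.register_forward_hook(save_conditioning)

# ... run the rollout: every backbone call appends one conditioning tensor ...

hook_handle.remove()
np.savez("episode_0000.npz",
         backbone_features=np.stack(conditioning_buffer))  # (N_calls, 1, 107, 2048)
```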

We collect embeddings across:

  • 4 tasks — Pick Coke Can, Close Drawer, Open Drawer, Move Near
  • Scene variants — SimplerEnv provides per-task visual variants (object placement, background, lighting)
  • N steps per episode — sampled uniformly per rollout

Each embedding is labeled with its task and scene variant. Analysis then asks: which label dominates the latent geometry?
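One way to keep this bookkeeping explicit is a small record per embedding; the field names below are hypothetical, not the repository’s actual schema.

```python
from dataclasses import dataclass

import numpy as np

# Hypothetical per-embedding record; field names are illustrative only.
@dataclass
class InterfaceSample:
    embedding: np.ndarray   # (2048,) mean-pooled conditioning tokens
    task: str               # e.g. "pick_coke_can", "close_drawer"
    scene_variant: str      # e.g. a SimplerEnv visual-variant identifier
    episode_id: int
    step: int               # backbone-call index within the episode
```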

Hypotheses

H1 — Task clustering (expected)

The conditioning tokens cluster by task regardless of scene variant. Same-task embeddings drawn from different scene variants are closer to each other than different-task embeddings drawn from the same scene. The interface encodes what to do; language conditioning is geometrically active in the bottleneck.

H2 — Scene clustering (alternative)

The conditioning tokens cluster by visual scene. Scene identity dominates the interface embedding over task semantics. The model encodes where you are rather than what to do, and task structure is absent or weak in the bottleneck.

Why fractal? nvidia/GR00T-N1.6-fractal generalizes across scene variants in SimplerEnv (Experiment 1), so it is the right model to probe: any task-clustered geometry we find is a genuine property of the learned bottleneck, not an artifact of a model that fails everywhere.

Setup

Component        Detail
Simulation       SimplerEnv (MuJoCo, headless via EGL)
Robot            Google Robot (OXE_GOOGLE embodiment)
Model            nvidia/GR00T-N1.6-fractal — fine-tuned on OXE Fractal
Task (current)   Pick Coke Can — multi-task coming next
Action chunking  ac1 — 1 action executed per backbone call, 8 predicted
Hook point       S2 → S1 conditioning output, every backbone call
Hardware         AWS g5.2xlarge — NVIDIA A10G 24 GB, 32 GB RAM

Data Structure

Each episode saves one .npz file alongside its video. The two key arrays per file:

backbone_features
(N_calls, 1, 107, 2048)  float32

The S2 → S1 conditioning output at every backbone call. 107 tokens — vision patch tokens + language tokens concatenated. 2048 — embedding dimension. One row per backbone invocation during the episode.

actions_action.{x,y,z,roll,pitch,yaw,gripper}
(N_calls, 1, 8, 1)  float32  × 7

The action chunk decoded by System 1 at each backbone call. 8 steps predicted per call; only step 0 is executed (ac1). 7 DOFs — Cartesian EEF (x, y, z, roll, pitch, yaw) + gripper.
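A sketch of loading one episode file and assembling the per-call action chunks, assuming the key names follow the actions_action.{dof} pattern described above (the filename is a placeholder).

```python
import numpy as np

DOF_KEYS = ["x", "y", "z", "roll", "pitch", "yaw", "gripper"]

# Hypothetical filename; one .npz per episode as described above.
ep = np.load("episode_0000.npz")

feats = ep["backbone_features"]                        # (N_calls, 1, 107, 2048)
# Stack the seven per-DOF arrays, each (N_calls, 1, 8, 1), into (N_calls, 8, 7).
chunks = np.concatenate(
    [ep[f"actions_action.{k}"][:, 0] for k in DOF_KEYS], axis=-1
)
print(feats.shape, chunks.shape)
```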

For analysis we flatten each action chunk to a 56-dim vector (8 × 7 DOFs) and mean-pool the backbone tokens to a 2048-dim vector. Both are L2-normalised before computing cosine similarity.
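Continuing from the arrays loaded above, the flattening, pooling, and normalisation can look like this (a sketch, not the analysis code itself).

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

# Action space: flatten each 8-step x 7-DOF chunk to a 56-dim vector.
action_vecs = l2_normalize(chunks.reshape(len(chunks), -1))   # (N_calls, 56)

# Latent space: mean-pool the 107 conditioning tokens per backbone call.
latent_vecs = l2_normalize(feats[:, 0].mean(axis=1))          # (N_calls, 2048)

# With unit-norm rows, cosine similarity between calls is a dot product.
cos_actions = action_vecs @ action_vecs.T
cos_latents = latent_vecs @ latent_vecs.T
```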

Current Data

All runs are on Pick Coke Can with nvidia/GR00T-N1.6-fractal. Two seeds — same environment initialisation within each seed, independent rollouts across runs.

Seed     Runs  Success  N calls range  Notes
seed42   3     3 / 3    34 – 190       Initial exploration — no action data
seed42   3     3 / 3    34 – 190       Re-run with action chunks saved
seed1    20    17 / 20  19 – 300       Main batch — backbone features + action chunks

Preliminary Exploration

Single-task (Pick Coke Can), single-seed (seed1), 20 runs. These plots probe the relationship between action outputs and backbone representations before scaling to multiple tasks.

Episode length variability

Number of backbone calls per run. Each call = one observation processed, one 8-step action chunk output, one action executed. Failures hit the 300-call episode limit. Successful episodes vary from 19 to 231 calls — the model finds the object at very different speeds despite the same seed.
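The call counts can be read straight off the saved arrays; the directory layout below is a placeholder.

```python
import glob

import numpy as np

# Backbone calls per episode = first dimension of backbone_features.
lengths = {
    path: np.load(path)["backbone_features"].shape[0]
    for path in sorted(glob.glob("rollouts/seed1/*.npz"))
}
print(min(lengths.values()), max(lengths.values()))
```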

Action similarity vs Latent similarity

For every pair of runs we compute the full NA × NB cosine similarity matrix in both action space (56-dim flattened chunks) and latent space (2048-dim mean-pooled backbone tokens). The left scatter plots every individual call-pair; the right histograms show how widely the similarities spread in each space.
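A sketch of that pairwise comparison, assuming one L2-normalised (N_calls_i, 56) action array and one (N_calls_i, 2048) latent array per run, built as in the preprocessing snippet above (variable names are placeholders).

```python
import numpy as np
from scipy.stats import pearsonr

def pairwise_run_similarity(action_runs, latent_runs):
    """Flattened per-call-pair cosine similarities across all run pairs."""
    act_vals, lat_vals = [], []
    for i in range(len(action_runs)):
        for j in range(i + 1, len(action_runs)):
            # Rows are already L2-normalised, so the dot product is cosine.
            act_vals.append((action_runs[i] @ action_runs[j].T).ravel())
            lat_vals.append((latent_runs[i] @ latent_runs[j].T).ravel())
    return np.concatenate(act_vals), np.concatenate(lat_vals)

act_sim, lat_sim = pairwise_run_similarity(action_runs, latent_runs)
r, _ = pearsonr(act_sim, lat_sim)
print(f"sigma_action={act_sim.std():.2f}  sigma_latent={lat_sim.std():.3f}  r={r:.2f}")
```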

The key observation: action similarity spans nearly the full range (−0.55 – 1.0, σ = 0.37) while latent similarity stays tightly clustered (0.55 – 1.0, σ = 0.049). Pearson r between the two is 0.16 — the backbone produces near-identical representations regardless of the action chunk output.

Next: collect the same data across all 4 tasks and multiple scene variants, then run UMAP on the backbone representations coloured by task vs scene to answer H1 vs H2.
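A sketch of that next step with umap-learn, assuming the pooled embeddings X and the two label arrays have been collected across tasks and scene variants as above (all names here are placeholders).

```python
import matplotlib.pyplot as plt
import numpy as np
import umap  # umap-learn

# X: (N, 2048) mean-pooled, L2-normalised backbone embeddings;
# task_labels / scene_labels: length-N label arrays.
reducer = umap.UMAP(n_components=2, metric="cosine", random_state=0)
xy = reducer.fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, labels, title in [(axes[0], task_labels, "coloured by task"),
                          (axes[1], scene_labels, "coloured by scene variant")]:
    for lab in np.unique(labels):
        m = labels == lab
        ax.scatter(xy[m, 0], xy[m, 1], s=4, label=str(lab))
    ax.set_title(title)
    ax.legend(fontsize=6)
plt.tight_layout()
plt.show()
```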