GR00T N1.6: Task vs Target Position Decoding
on the Full 6-Layout Suite

Robotics Security Division, ETHRC

Abstract

We evaluate whether the frozen System 2→System 1 boundary of nvidia/GR00T-N1.6-fractal linearly separates target object and target position on a controlled tabletop suite. The suite spans 6 clean object-layout permutations and 3 target objects (bottle, coke, orange), with the final dataset containing 18 successful sequences and 376 sampled latent points.

The main result is positive for both labels. Under leave-one-layout-out evaluation, task decoding is nearly perfect: 0.998 / 1.000 / 0.986 for pooled all / image / text token slices. Target-position decoding is also strong: 0.964 / 0.955 / 0.939. This corrects the earlier interpretation from the partial left/right-only study: once the clean permutation table is complete, the boundary carries a strong object/task code as well as a strong position code.

Layout Suite

Six clean permutations of bottle / coke / orange were used. Existing bottle and orange runs were reused; only runs for the missing `pick_coke` target had to be collected.

Six clean object-layout permutations for the observer experiment:

| Layout code | Left / Middle / Right | `pick_bottle` | `pick_coke` | `pick_orange` |
|---|---|---|---|---|
| OCB | orange / coke / bottle | right | middle | left |
| BCO | bottle / coke / orange | left | middle | right |
| BOC | bottle / orange / coke | left | right | middle |
| COB | coke / orange / bottle | right | left | middle |
| OBC | orange / bottle / coke | middle | right | left |
| CBO | coke / bottle / orange | middle | left | right |
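For concreteness, the table can be generated mechanically. A minimal Python sketch (the object-code letters and slot names are the only assumptions):

```python
from itertools import permutations

objects = {"B": "bottle", "C": "coke", "O": "orange"}
slots = ["left", "middle", "right"]

# Enumerate all 3! = 6 object-layout permutations and derive, for each
# pick_<object> prompt, the slot that object occupies in that layout.
for layout in permutations("BCO"):
    code = "".join(layout)                      # e.g. "OCB"
    arrangement = " / ".join(objects[c] for c in layout)
    targets = {f"pick_{objects[c]}": slots[i] for i, c in enumerate(layout)}
    print(code, arrangement, targets)
```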

How the Classifier Was Trained

Both classifiers use the same frozen latent dataset. Only the target label changes.

| Setting | Value |
|---|---|
| Model state | GR00T is frozen; no policy fine-tuning during probing |
| Input to the probe | One pooled S2→S1 boundary feature vector from one sampled rollout timestep |
| Dataset | 18 successful sequences, 376 sampled latent points |
| Labels | task ∈ {bottle, coke, orange} or target_position ∈ {left, middle, right} |
| Probe | Single linear layer trained with class-balanced cross-entropy |
| Optimizer | AdamW, 300 epochs, learning rate 0.05, weight decay 1e-4 |
| Feature variants | pooled_all, pooled_image, pooled_text |
The model itself is never retrained here. The only learned component is the small linear probe placed on top of the frozen boundary latent.
x ∈ R^d   (standardized pooled boundary latent)
z = W x + b
p(y = c | x) = exp(z_c) / Σ_j exp(z_j)
L(x, y) = −α_y log p(y | x)

Here W ∈ R^(K×d) and b ∈ R^K are the learned parameters, K = 3 classes, and α_y is the class weight used in the cross-entropy loss.
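As a concrete sketch of the training recipe, assuming the latents have already been extracted (tensor names and full-batch training are assumptions; the hyperparameters match the table above):

```python
import torch
import torch.nn as nn

# Minimal linear-probe sketch (assumed tensor names; the actual latent
# extraction pipeline is not shown here).
# X: (N, d) pooled boundary latents, y: (N,) integer labels in {0, 1, 2}.
def train_probe(X: torch.Tensor, y: torch.Tensor, num_classes: int = 3):
    # Standardize features, as in the probe definition above.
    X = (X - X.mean(dim=0)) / (X.std(dim=0) + 1e-8)

    # Class-balanced weights alpha_y: inverse class frequency.
    counts = torch.bincount(y, minlength=num_classes).float()
    alpha = counts.sum() / (num_classes * counts)

    probe = nn.Linear(X.shape[1], num_classes)       # z = W x + b
    loss_fn = nn.CrossEntropyLoss(weight=alpha)      # -alpha_y log p(y|x)
    opt = torch.optim.AdamW(probe.parameters(), lr=0.05, weight_decay=1e-4)

    for _ in range(300):                             # 300 full-batch epochs
        opt.zero_grad()
        loss = loss_fn(probe(X), y)
        loss.backward()
        opt.step()
    return probe
```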

How It Was Tested

The main metric is leave-one-layout-out accuracy.

1. Hold out one layout

Choose one of the 6 layouts, for example OCB, and remove every sample from that layout from training.

2. Train on the other 5

Fit the linear probe on samples from the remaining 5 layouts only.

3. Test on the hidden layout

Evaluate only on the held-out layout, then repeat this once for each of the 6 layouts and average the 6 accuracies.

Example: if OCB is the held-out layout, the probe trains on BCO, BOC, COB, OBC, and CBO, and is tested only on OCB. This is stricter than a random train/test split because the classifier must generalize to a full object-layout permutation it never saw during training.
Successful episodes per prompt in each held-out layout:

| Held-out layout | `pick_bottle` | `pick_coke` | `pick_orange` |
|---|---|---|---|
| OCB | 6 | 6 | 5 |
| BCO | 6 | 4 | 6 |
| BOC | 6 | 3 | 6 |
| COB | 6 | 6 | 6 |
| OBC | 6 | 5 | 3 |
| CBO | 6 | 5 | 3 |
Every layout used at test time therefore includes all three prompts: `pick_bottle`, `pick_coke`, and `pick_orange`. What varies by layout is only how many successful episodes are available for each prompt.
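The whole protocol fits in a short loop. A sketch, with array names as assumptions and sklearn's LogisticRegression standing in for the AdamW-trained probe above (the protocol is the same):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

LAYOUTS = ["OCB", "BCO", "BOC", "COB", "OBC", "CBO"]

# Leave-one-layout-out evaluation (assumed arrays):
# X: (N, d) pooled latents, y: (N,) labels, layout: (N,) layout codes.
def leave_one_layout_out(X, y, layout):
    accs = []
    for held_out in LAYOUTS:
        test = layout == held_out          # every sample from one layout
        train = ~test                      # the remaining five layouts
        clf = LogisticRegression(max_iter=1000, class_weight="balanced")
        clf.fit(X[train], y[train])
        accs.append(clf.score(X[test], y[test]))
    return float(np.mean(accs))            # mean over the 6 held-out layouts
```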

Main Result

The table below reports leave-one-layout-out mean accuracy across the 6 held-out layouts; it also contains the image-token and text-token slice comparison.

| Target | All tokens | Image token slice | Text token slice |
|---|---|---|---|
| task | 0.998 | 1.000 | 0.986 |
| target_position | 0.964 | 0.955 | 0.939 |

Task vs target-position leave-one-layout-out accuracy on the full 6-layout suite.
Reading the table by columns: image-token slices are slightly stronger than text-token slices for both labels, but text-token slices are still very strong.
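Operationally, a "slice" is one feature vector per sample. A hedged sketch, assuming mean pooling over the boundary tokens of each modality (the exact pooling operator and mask names are assumptions):

```python
import torch

# Hypothetical pooling over boundary tokens.
# tokens: (T, d) boundary latents for one sample;
# is_image / is_text: (T,) boolean masks over token positions.
def pool_slices(tokens: torch.Tensor, is_image: torch.Tensor, is_text: torch.Tensor):
    return {
        "pooled_all": tokens.mean(dim=0),          # all boundary tokens
        "pooled_image": tokens[is_image].mean(dim=0),
        "pooled_text": tokens[is_text].mean(dim=0),
    }
```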

Takeaway

Main Conclusion
  • The frozen S2→S1 boundary linearly separates target object.
  • The same boundary also linearly separates target position.
  • On the full clean 6-layout suite, task/object decoding is slightly stronger than position decoding.

2D Geometry

Two complementary 2D views of the same pooled latent representations.

PCA is a true 2D projection of the pooled 2048-D latent. The classifier coordinate plane instead uses the trained classifiers themselves to define the axes: x from the target-position probe and y from the target-object probe.

PCA of pooled latents. This is an unsupervised 2D view: each point starts as a pooled latent vector and is projected down with PCA.

PCA projection of pooled task observer latents for all, image, and text token slices
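A sketch of how such a view can be produced (array names are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Unsupervised 2D view of the pooled latents.
# X: (N, d) pooled-latent matrix; labels: (N,) per-sample class labels.
def plot_pca(X: np.ndarray, labels: np.ndarray) -> None:
    xy = PCA(n_components=2).fit_transform(X)   # project d dims down to 2
    for lab in np.unique(labels):
        m = labels == lab
        plt.scatter(xy[m, 0], xy[m, 1], s=10, label=str(lab))
    plt.legend()
    plt.show()
```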

Classifier Coordinate Plane. Here the axes are semantic: the x-axis is defined by the target-position classifier and the y-axis by the target-object classifier.

Classifier coordinate plane for pooled task observer latents across all, image, and text token slices
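One way to build these axes, as a sketch under an explicit assumption: the text above does not spell out how each probe's 3-class outputs are reduced to a single coordinate, so this sketch takes the first principal component of each probe's logits.

```python
import numpy as np
from sklearn.decomposition import PCA

# Semantic 2D view: x from the position probe, y from the object probe.
# pos_probe / obj_probe are fitted sklearn-style linear classifiers.
def classifier_plane(X: np.ndarray, pos_probe, obj_probe) -> np.ndarray:
    x = PCA(n_components=1).fit_transform(pos_probe.decision_function(X))
    y = PCA(n_components=1).fit_transform(obj_probe.decision_function(X))
    return np.hstack([x, y])   # (N, 2) semantic coordinates for plotting
```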

Setup

| Component | Details |
|---|---|
| Simulation | SimplerEnv custom 3-object pick layouts |
| Robot | Google Robot (OXE_GOOGLE) |
| Model | nvidia/GR00T-N1.6-fractal |
| Probe | linear classifier on frozen boundary latents |
| Instance | g5.2xlarge |
| GPU | NVIDIA A10G (23 GB) |
| RAM | 30 GB |
| OS | Ubuntu 24.04.4 LTS |
Reproduction commands are in the accompanying README.

Next Steps

Prompt control

Recompute the same rollout states under counterfactual prompts to separate prompt-conditioned signal from scene or behavior state.

OOD scenes

Move beyond the clean 6-layout suite into clutter and out-of-distribution scenes while keeping the same labels and probe protocol.

Token importance

Replace coarse slice ablations with leave-one-text-token-out and grouped image-token masking to see which tokens matter most.
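As a starting point for that step, a hedged sketch of leave-one-text-token-out scoring, assuming pooled_text is the mean of the text-token latents (so dropping token i just re-averages the rest):

```python
import numpy as np

# Leave-one-text-token-out importance sketch.
# tokens: (T, d) text-token latents for one sample; label: true class index;
# probe: fitted sklearn-style classifier over pooled features.
def text_token_importance(tokens: np.ndarray, label: int, probe) -> np.ndarray:
    T = tokens.shape[0]
    base = probe.predict_proba(tokens.mean(axis=0, keepdims=True))[0, label]
    drops = np.empty(T)
    for i in range(T):
        rest = np.delete(tokens, i, axis=0).mean(axis=0, keepdims=True)
        drops[i] = base - probe.predict_proba(rest)[0, label]
    return drops   # large positive drop => token i mattered for the label
```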