GR00T N1.6: Task vs Target Position Decoding
on the Full 6-Layout Suite

Robotics Security Division, ETHRC

Abstract

We evaluate whether the frozen System 2→System 1 boundary of nvidia/GR00T-N1.6-fractal linearly separates target object and target position on a controlled tabletop suite. The suite spans 6 clean object-layout permutations and 3 target objects (bottle, coke, orange), with the final dataset containing 18 successful sequences and 376 sampled latent points.

The main result is positive for both labels. Under leave-one-layout-out evaluation, task decoding is nearly perfect: 0.998 / 1.000 / 0.986 for pooled all / image / text token slices. Target-position decoding is also strong: 0.964 / 0.955 / 0.939. This corrects the earlier interpretation from the partial left/right-only study: once the clean permutation table is complete, the boundary carries a strong object/task code as well as a strong position code.

Layout Suite

Six clean permutations of bottle / coke / orange were used. Existing bottle and orange runs were reused; only runs for the missing `pick_coke` target had to be collected.

Six clean object-layout permutations for the observer experiment:

| Layout code | Left / Middle / Right | `pick_bottle` | `pick_coke` | `pick_orange` |
|---|---|---|---|---|
| OCB | orange / coke / bottle | right | middle | left |
| BCO | bottle / coke / orange | left | middle | right |
| BOC | bottle / orange / coke | left | right | middle |
| COB | coke / orange / bottle | right | left | middle |
| OBC | orange / bottle / coke | middle | right | left |
| CBO | coke / bottle / orange | middle | left | right |
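For concreteness, the table can be generated mechanically. A minimal Python sketch (the object-code letters and slot names are the only assumptions):

```python
from itertools import permutations

objects = {"B": "bottle", "C": "coke", "O": "orange"}
slots = ["left", "middle", "right"]

# Enumerate all 3! = 6 object-layout permutations and derive, for each
# pick_<object> prompt, the slot that object occupies in that layout.
for layout in permutations("BCO"):
    code = "".join(layout)                      # e.g. "OCB"
    arrangement = " / ".join(objects[c] for c in layout)
    targets = {f"pick_{objects[c]}": slots[i] for i, c in enumerate(layout)}
    print(code, arrangement, targets)
```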

How the Classifier Was Trained

Both classifiers use the same frozen latent dataset. Only the target label changes.

| Setting | Value |
|---|---|
| Model state | GR00T is frozen; no policy fine-tuning during probing |
| Input to the probe | One pooled S2→S1 boundary feature vector from one sampled rollout timestep |
| Dataset | 18 successful sequences, 376 sampled latent points |
| Labels | task ∈ {bottle, coke, orange} or target_position ∈ {left, middle, right} |
| Probe | Single linear layer trained with class-balanced cross-entropy |
| Optimizer | AdamW, 300 epochs, learning rate 0.05, weight decay 1e-4 |
| Feature variants | pooled_all, pooled_image, pooled_text |
The model itself is never retrained here. The only learned component is the small linear probe placed on top of the frozen boundary latent.
x ∈ R^d   (standardized pooled boundary latent)
z = W x + b
p(y = c | x) = exp(z_c) / Σ_j exp(z_j)
L(x, y) = −α_y log p(y | x)

Here W ∈ R^(K×d) and b ∈ R^K are the learned parameters, K = 3 classes, and α_y is the class weight used in the cross-entropy loss.
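As a concrete sketch of the training recipe, assuming the latents have already been extracted (tensor names and full-batch training are assumptions; the hyperparameters match the table above):

```python
import torch
import torch.nn as nn

# Minimal linear-probe sketch (assumed tensor names; the actual latent
# extraction pipeline is not shown here).
# X: (N, d) pooled boundary latents, y: (N,) integer labels in {0, 1, 2}.
def train_probe(X: torch.Tensor, y: torch.Tensor, num_classes: int = 3):
    # Standardize features, as in the probe definition above.
    X = (X - X.mean(dim=0)) / (X.std(dim=0) + 1e-8)

    # Class-balanced weights alpha_y: inverse class frequency.
    counts = torch.bincount(y, minlength=num_classes).float()
    alpha = counts.sum() / (num_classes * counts)

    probe = nn.Linear(X.shape[1], num_classes)       # z = W x + b
    loss_fn = nn.CrossEntropyLoss(weight=alpha)      # -alpha_y log p(y|x)
    opt = torch.optim.AdamW(probe.parameters(), lr=0.05, weight_decay=1e-4)

    for _ in range(300):                             # 300 full-batch epochs
        opt.zero_grad()
        loss = loss_fn(probe(X), y)
        loss.backward()
        opt.step()
    return probe
```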

How It Was Tested

The main metric is leave-one-layout-out accuracy.

1. Hold out one layout

Choose one of the 6 layouts, for example OCB, and remove every sample from that layout from training.

2. Train on the other 5

Fit the linear probe on samples from the remaining 5 layouts only.

3. Test on the hidden layout

Evaluate only on the held-out layout, then repeat this once for each of the 6 layouts and average the 6 accuracies.

Example: if OCB is the held-out layout, the probe trains on BCO, BOC, COB, OBC, and CBO, and is tested only on OCB. This is stricter than a random train/test split because the classifier must generalize to a full object-layout permutation it never saw during training.
Successful episodes per prompt in each held-out layout:

| Held-out layout | `pick_bottle` | `pick_coke` | `pick_orange` |
|---|---|---|---|
| OCB | 6 | 6 | 5 |
| BCO | 6 | 4 | 6 |
| BOC | 6 | 3 | 6 |
| COB | 6 | 6 | 6 |
| OBC | 6 | 5 | 3 |
| CBO | 6 | 5 | 3 |
Every layout used at test time therefore includes all three prompts: `pick_bottle`, `pick_coke`, and `pick_orange`. What varies by layout is only how many successful episodes are available for each prompt.
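The whole protocol fits in a short loop. A sketch, with array names as assumptions and sklearn's LogisticRegression standing in for the AdamW-trained probe above (the protocol is the same):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

LAYOUTS = ["OCB", "BCO", "BOC", "COB", "OBC", "CBO"]

# Leave-one-layout-out evaluation (assumed arrays):
# X: (N, d) pooled latents, y: (N,) labels, layout: (N,) layout codes.
def leave_one_layout_out(X, y, layout):
    accs = []
    for held_out in LAYOUTS:
        test = layout == held_out          # every sample from one layout
        train = ~test                      # the remaining five layouts
        clf = LogisticRegression(max_iter=1000, class_weight="balanced")
        clf.fit(X[train], y[train])
        accs.append(clf.score(X[test], y[test]))
    return float(np.mean(accs))            # mean over the 6 held-out layouts
```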

Main Result

The table below reports leave-one-layout-out mean accuracy across the 6 held-out layouts; it also contains the image-token and text-token slice comparison.

| Target | All tokens | Image token slice | Text token slice |
|---|---|---|---|
| task | 0.998 | 1.000 | 0.986 |
| target_position | 0.964 | 0.955 | 0.939 |

Task vs target-position leave-one-layout-out accuracy on the full 6-layout suite.
Reading the table by columns: image-token slices are slightly stronger than text-token slices for both labels, but text-token slices are still very strong.
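Operationally, a "slice" is one feature vector per sample. A hedged sketch, assuming mean pooling over the boundary tokens of each modality (the exact pooling operator and mask names are assumptions):

```python
import torch

# Hypothetical pooling over boundary tokens.
# tokens: (T, d) boundary latents for one sample;
# is_image / is_text: (T,) boolean masks over token positions.
def pool_slices(tokens: torch.Tensor, is_image: torch.Tensor, is_text: torch.Tensor):
    return {
        "pooled_all": tokens.mean(dim=0),          # all boundary tokens
        "pooled_image": tokens[is_image].mean(dim=0),
        "pooled_text": tokens[is_text].mean(dim=0),
    }
```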

Takeaway

Main Conclusion
  • The frozen S2→S1 boundary linearly separates target object.
  • The same boundary also linearly separates target position.
  • On the full clean 6-layout suite, task/object decoding is slightly stronger than position decoding.

2D Geometry

Two complementary 2D views of the same pooled latent representations.

PCA is a true 2D projection of the pooled 2048-D latent. The classifier coordinate plane instead uses the trained classifiers themselves to define the axes: x from the target-position probe and y from the target-object probe.

PCA of pooled latents. This is an unsupervised 2D view: each point starts as a pooled latent vector and is projected down with PCA.

PCA projection of pooled task observer latents for all, image, and text token slices
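A sketch of how such a view can be produced (array names are assumptions):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Unsupervised 2D view of the pooled latents.
# X: (N, d) pooled-latent matrix; labels: (N,) per-sample class labels.
def plot_pca(X: np.ndarray, labels: np.ndarray) -> None:
    xy = PCA(n_components=2).fit_transform(X)   # project d dims down to 2
    for lab in np.unique(labels):
        m = labels == lab
        plt.scatter(xy[m, 0], xy[m, 1], s=10, label=str(lab))
    plt.legend()
    plt.show()
```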

Classifier Coordinate Plane. Here the axes are semantic: the x-axis is defined by the target-position classifier and the y-axis by the target-object classifier.

Classifier coordinate plane for pooled task observer latents across all, image, and text token slices
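One way to build these axes, as a sketch under an explicit assumption: the text above does not spell out how each probe's 3-class outputs are reduced to a single coordinate, so this sketch takes the first principal component of each probe's logits.

```python
import numpy as np
from sklearn.decomposition import PCA

# Semantic 2D view: x from the position probe, y from the object probe.
# pos_probe / obj_probe are fitted sklearn-style linear classifiers.
def classifier_plane(X: np.ndarray, pos_probe, obj_probe) -> np.ndarray:
    x = PCA(n_components=1).fit_transform(pos_probe.decision_function(X))
    y = PCA(n_components=1).fit_transform(obj_probe.decision_function(X))
    return np.hstack([x, y])   # (N, 2) semantic coordinates for plotting
```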

Setup

| Component | Details |
|---|---|
| Simulation | SimplerEnv custom 3-object pick layouts |
| Robot | Google Robot (OXE_GOOGLE) |
| Model | nvidia/GR00T-N1.6-fractal |
| Probe | linear classifier on frozen boundary latents |
| Instance | g5.2xlarge |
| GPU | NVIDIA A10G (23 GB) |
| RAM | 30 GB |
| OS | Ubuntu 24.04.4 LTS |
Reproduction commands are in the accompanying README.

Next Steps

Prompt control

Recompute the same rollout states under counterfactual prompts to separate prompt-conditioned signal from scene or behavior state.

OOD scenes

Move beyond the clean 6-layout suite into clutter and out-of-distribution scenes while keeping the same labels and probe protocol.

Token importance

Replace coarse slice ablations with leave-one-text-token-out and grouped image-token masking to see which tokens matter most.
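As a starting point for that step, a hedged sketch of leave-one-text-token-out scoring, assuming pooled_text is the mean of the text-token latents (so dropping token i just re-averages the rest):

```python
import numpy as np

# Leave-one-text-token-out importance sketch.
# tokens: (T, d) text-token latents for one sample; label: true class index;
# probe: fitted sklearn-style classifier over pooled features.
def text_token_importance(tokens: np.ndarray, label: int, probe) -> np.ndarray:
    T = tokens.shape[0]
    base = probe.predict_proba(tokens.mean(axis=0, keepdims=True))[0, label]
    drops = np.empty(T)
    for i in range(T):
        rest = np.delete(tokens, i, axis=0).mean(axis=0, keepdims=True)
        drops[i] = base - probe.predict_proba(rest)[0, label]
    return drops   # large positive drop => token i mattered for the label
```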