Experiment 4

DAgger Steering on GR00T N1.6

A single fixed DAgger delta steers the frozen policy on the clean mirrored left/right scenes. Transfer beyond that train family remains open.

Can a single fixed representation-space delta make a frozen policy behave as if the prompt were "pick up the bottle" even when the runtime prompt remains "pick up the orange"?

Can we steer the policy by intervening only at the boundary between the backbone and the action head?

The main positive result is that a single fixed delta can steer the frozen policy toward the bottle on both clean mirrored left/right scenes while the runtime prompt still says "pick up the orange". That delta was trained only on left/right data. The next question is transfer: on the new symmetric up/down holdout axis, the first preview still scores 0 / 2 on both front and back by simulator reward.

Exact DAgger Protocol

This page is about DAgger steering, not patching and not prompt editing. The model stays frozen. The only learned object is a fixed image-token delta added at the representation boundary before the action head.
student rollout: policy(image, state, "pick up the orange")  →  teacher relabel: policy(image, state, "pick up the bottle")  →  train one fixed delta  →  eval: policy(image, state, "pick up the orange") + delta
  1. Collect student states on two clean mirrored train scenes: bottle-left and bottle-right.
  2. Relabel those exact visited states offline with the bottle teacher prompt.
  3. Train one full image-token delta on the relabeled action targets.
  4. Evaluate the same fixed vector on seen scenes and then on an orthogonal up/down holdout axis.
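The four-step protocol above can be sketched as a toy DAgger round. Everything here is a hypothetical stand-in: `boundary_features` and `action_head` mimic a frozen policy whose behavior shifts with the prompt, and only the additive `delta` is trained, matching the setup described in this section.

```python
import numpy as np

# Toy stand-ins for the frozen policy (not the real GR00T model).
rng = np.random.default_rng(0)
D = 8  # toy boundary dimension; the real run uses a 108 x 2048 token block

def boundary_features(state, prompt):
    # Frozen backbone stand-in: the prompt shifts the boundary features.
    shift = 1.0 if prompt == "pick up the bottle" else 0.0
    return state + shift

def action_head(features):
    # Frozen action head stand-in (linear, slope 2).
    return 2.0 * features

delta = np.zeros(D)                                # start from zero delta
states = [rng.normal(size=D) for _ in range(32)]   # 1. student rollout states

# 2. offline teacher relabel: actions under the bottle prompt at those states
teacher = [action_head(boundary_features(s, "pick up the bottle")) for s in states]

# 3. train the single fixed delta on the relabeled action targets
for _ in range(200):
    grad = np.zeros(D)
    for s, a_t in zip(states, teacher):
        pred = action_head(boundary_features(s, "pick up the orange") + delta)
        # d/d(delta) of ||pred - a_t||^2, chained through the linear head
        grad += 2.0 * 2.0 * (pred - a_t) / len(states)
    delta -= 0.05 * grad

# 4. eval: orange prompt + delta should now reproduce the bottle teacher
err = np.mean([np.abs(action_head(boundary_features(s, "pick up the orange") + delta) - t).max()
               for s, t in zip(states, teacher)])
assert err < 1e-3
```

In this toy the delta converges to the constant feature shift induced by swapping the prompt, which is exactly the object the protocol tries to learn at the representation boundary.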

The current published checkpoint is the round-2 delta trained from 3098 teacher-labeled frames after an on-policy recollection pass.

What Round 2 Means Here

Round 1

Start from zero delta.

Collect student rollouts on left/right only with prompt "pick up the orange", no steering.

Offline teacher relabeling uses "pick up the bottle".

Round 2

Start from the round-1 delta.

Recollect fresh student rollouts on the same left/right train family, now with the current steered policy active.

Teacher relabel again, then retrain a new delta from that updated on-policy state distribution.
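The two-round schedule can be summarized in a few lines. This is a toy sketch: `collect_rollouts`, `relabel`, and `train_delta` are hypothetical stand-ins (the real pipeline rolls out the frozen policy in simulation and relabels with the bottle prompt), but the control flow matches the rounds described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_rollouts(delta, n=12):
    # Toy stand-in: states visited by the currently steered student policy.
    # With steering active, the visited-state distribution shifts with delta.
    return rng.normal(size=(n, 4)) + delta

def relabel(states):
    # Toy teacher: the bottle-prompt action target is a fixed offset from
    # the visited state (a real teacher queries the frozen policy offline).
    return states + 1.0

def train_delta(states, targets, init):
    # Toy training step: closed-form least squares for an additive delta.
    return init + np.mean(targets - (states + init), axis=0)

def dagger_round(delta_init):
    states = collect_rollouts(delta_init)   # collect with current steering
    targets = relabel(states)               # offline teacher relabel
    return train_delta(states, targets, delta_init)

round1 = dagger_round(np.zeros(4))  # round 1: start from zero delta
round2 = dagger_round(round1)       # round 2: warm-start, on-policy recollect
```

The point of round 2 is the recollection step: the delta is retrained on the state distribution the *steered* policy actually visits, not the unsteered round-1 distribution.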

What It Is Not

Round 2 is not training on front/back.

It is also not reusing a frozen logged dataset from a different branch.

The only train scenes remain bottle-left and bottle-right.

What The Published Delta Actually Saw

Train Scenes Only

The vector was trained only on the two mirrored seen scenes:

bottle-left and bottle-right.

No frames from the up/down (front/back) holdout were used during DAgger collection or training.

Different Rollout Randomness

The checkpoint was not fit to one deterministic clip.

Round 2 recollected 6 student episodes per side, then bottle-teacher relabeled the visited states offline.

Seen-scene evaluation videos are fresh episodes, not recycled training videos.

Strict Holdout

The validated up/down axis appears only at test time.

Train: left/right only.

Test: front/back only.

That is why the holdout failure matters: it is a true geometry transfer test.

Scene Family Used In This Page

Only the validated holdout geometry is shown here. Earlier front/back layouts were rejected and should not be quoted.

Seen Train Scene — Bottle Left
Left train scene

Bottle on the left, orange on the right. This is one of the two mirrored train scenes.

Seen Train Scene — Bottle Right
Right train scene

Bottle on the right, orange on the left. Same clean two-object setup.

Holdout v4 — Bottle Front
Front holdout scene

Validated up/down holdout: bottle upper-center, orange lower-center.

Holdout v4 — Bottle Back
Back holdout scene

Mirror of the previous scene: bottle lower-center, orange upper-center.

The Learned Vector Itself

Visualization of the learned DAgger delta

Learned steering tensor for the published checkpoint. This run uses a 108 × 2048 boundary delta and steers the image-token rows only, so the final text-token rows stay at zero by design.
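The masking described above can be sketched as follows. The 108 × 2048 shape comes from the writeup; the exact image/text row split is an assumption for illustration, since this page only states that the trailing text-token rows are held at zero.

```python
import numpy as np

# Shapes from the writeup: 108 boundary tokens x 2048 dims.
N_TOKENS, D_MODEL = 108, 2048
N_IMAGE_TOKENS = 96  # hypothetical split; only "text rows are zero" is stated

rng = np.random.default_rng(0)

# A trained delta would come from the DAgger rounds; random here for shape checks.
delta = rng.normal(size=(N_TOKENS, D_MODEL)).astype(np.float32)
delta[N_IMAGE_TOKENS:] = 0.0  # text-token rows stay at zero by design

def steer(boundary_features: np.ndarray) -> np.ndarray:
    """Add the fixed delta to the frozen backbone output before the action head."""
    return boundary_features + delta

features = rng.normal(size=(N_TOKENS, D_MODEL)).astype(np.float32)
steered = steer(features)

# Text-token rows pass through untouched; image-token rows are steered.
assert np.array_equal(steered[N_IMAGE_TOKENS:], features[N_IMAGE_TOKENS:])
assert not np.array_equal(steered[:N_IMAGE_TOKENS], features[:N_IMAGE_TOKENS])
```

Zeroing the text rows means the intervention cannot directly rewrite the prompt representation; it can only shift how the image tokens are read by the action head.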

Current Intermediate Results

Published checkpoint = round-2 delta, trained only on the left/right train family and evaluated on fresh rollouts.

| Setting | Episodes | Simulator Success | Manual Bottle Intent | Readout |
| --- | --- | --- | --- | --- |
| Seen scene — bottle left | 10 | 0 / 10 | 10 / 10 | The vector redirects target selection correctly, but the grasp does not complete reliably enough for the env reward. |
| Seen scene — bottle right | 10 | 10 / 10 | 10 / 10 | Intent and simulator success align. |
| Holdout v4 — front | 2 | 0 / 2 | 0 / 2 | No transfer by current intent grading or simulator reward on the validated up/down holdout. |
| Holdout v4 — back | 2 | 0 / 2 | 0 / 2 | No transfer by current intent grading or simulator reward on the validated up/down holdout. |

Representative Videos

These are illustrative, not exhaustive. The left example is a simulator-marked failure from the seen-scene eval, but it was part of the manually reviewed set judged bottle-directed.

Seen Scene — Left (sim fail, manual bottle intent)
Seen Scene — Right (sim success)
Holdout v4 — Front (sim fail, no bottle intent)
Holdout v4 — Back (sim fail, no bottle intent)

Interim Takeaway

The intermediate story is: there exists a fixed DAgger steering vector that redirects the frozen policy on the clean mirrored left/right train family. What remains open is whether that same vector transfers beyond the train axis. That is why this page keeps the seen-scene result and the negative holdout result side by side.