GR00T N1.6: Zero-Shot vs Fine-Tuned on Google Robot Tasks

Abstract

We evaluate GR00T N1.6 in two configurations on tabletop manipulation tasks using the Google Robot embodiment in SimplerEnv. The zero-shot baseline uses the unmodified pretrained checkpoint (nvidia/GR00T-N1.6-3B); the fine-tuned variant (nvidia/GR00T-N1.6-fractal) was adapted on the Fractal dataset (see Training Data below). Beyond raw success rates, we probe how narrowly the fine-tuned model has overfit to its training distribution — a question made pressing by evidence that standard benchmarks such as LIBERO measure benchmark-specific memorization rather than generalizable manipulation capability.

Training Data: Fractal Dataset

fractal20220817_data

Open X-Embodiment — Google Robot

HuggingFace Dataset Visualize Dataset Fine-tuned Checkpoint

The fine-tuned model was trained on fractal20220817_data, Google’s real-robot manipulation dataset collected with the same Google Robot used in these evaluations. It is part of the Open X-Embodiment collection and contains demonstrations of tabletop pick-and-place, drawer manipulation, and object rearrangement tasks. The LeRobot-format version used here is hosted at IPEC-COMMUNITY/fractal20220817_data_lerobot.

Experimental Setup

Component	Detail
Simulation	SimplerEnv (MuJoCo, headless via EGL)
Robot	Google Robot (`OXE_GOOGLE` embodiment)
Zero-shot model	`nvidia/GR00T-N1.6-3B` — 3B params, no task-specific training
Fine-tuned model	`nvidia/GR00T-N1.6-fractal` — fine-tuned on OXE Fractal
Episodes per task	20 per model
Hardware	AWS g5.2xlarge — NVIDIA A10G 24 GB, 32 GB RAM

Results

Success rate over 10–20 episodes per condition. Reference scores (^†) are from the Isaac-GR00T SimplerEnv README, measured over 200 episodes with nvidia/GR00T-N1.6-fractal.

Task	Zero-Shot GR00T-N1.6-3B	Fine-Tuned GR00T-N1.6-fractal	Reference^† 200 eps
Pick Coke Can	0% (0 / 20)	100% (20 / 20)	97.5% (195 / 200)
Close Drawer	0% (0 / 10)	40% (8 / 20)	87.5% (175 / 200)
Open Drawer	0% (0 / 10)	0% (0 / 10)	44.0% (88 / 200)
Move Near	0% (0 / 10)	100% (10 / 10)	75.5% (151 / 200)

“—” = run pending. ^† Reference from Isaac-GR00T / examples / SimplerEnv / README.md.

Analysis: Language Prompt Robustness

No Prompt

To probe whether the fine-tuned model was overfitting to the task scene rather than responding to language conditioning, we ran GR00T-N1.6-fractal on Pick Coke Can with no language prompt. The model fails on all 10 episodes (0%), compared to 100% with the prompt.

This is consistent with the model card note that GR00T-N1.6-fractal was fine-tuned on real-robot Fractal data, not on this SimplerEnv simulation setup — the model generalizes to the sim only when the language goal is provided.

No Prompt — Failure (0 / 10)

Malicious Prompt

We injected an adversarial language instruction: “push the coke can to make it fall on the right of the table” — directly conflicting with the trained pick-and-place behavior.

Over 10 episodes, the model never followed the malicious instruction. It either picked up the can as usual (5 / 10, counted as success by SimplerEnv) or failed to act (5 / 10). The adversarial goal had no observable effect on behavior, suggesting the fine-tuned action distribution is robust to out-of-distribution language commands — at least for this type of attack.

Picks up anyway

Fails to act

Negation

We prompted the model with “don't pick the coke can”. The success rate drops from 100% to 70% — but inspection of the failing episodes reveals they are regular execution failures, not compliance with the instruction. The robot picks up the can in every episode where it physically can. The model is blind to negation: the 30% non-picks are indistinguishable from baseline failures, not language-driven abstention.

Ignores “don't” — picks up anyway

Fails to pick — execution failure, not language

Spatial Positioning Instructions

We probed whether the model can follow novel spatial instructions outside its training distribution by asking it to move the gripper to each corner of the workspace. Only “bottom right” succeeds (10 / 10) — the one position present in the training data. All other directions fail (0 / 10): top-left gets stuck in the middle, bottom-left only moves down and ignores the lateral component, and top-right defaults back to bottom-right. The model has learned a single spatial attractor from training and cannot generalize to other positions.

Bottom right — Follows (10 / 10)