GR00T N1.6: LIBERO Is Not a Reliable Benchmark

Robotics Security Division, ETHRC

Abstract

We probe whether high scores on the LIBERO benchmark reflect genuine manipulation capability or benchmark-specific memorization. We take 0xAnkitSingh/GR00T-N1.6-LIBERO — GR00T N1.6 fine-tuned on LIBERO — and evaluate it in two settings: (1) inside LIBERO itself, with and without language prompts, to test whether performance survives removal of the language signal; (2) on RoboCasa tasks using the same Franka Panda arm, to test whether performance survives a change of environment. We find that scores collapse in both cases, indicating the model has overfit to the benchmark scenes rather than learning transferable skills. For completeness we also confirm that the zero-shot base model (nvidia/GR00T-N1.6-3B) fails on all tasks without fine-tuning.

Experimental Setup

| Component | Detail |
| --- | --- |
| Simulation environments | LIBERO, RoboCasa |
| Robot | Franka Panda (LIBERO); PandaOmron fixed-base (RoboCasa) |
| Primary model | 0xAnkitSingh/GR00T-N1.6-LIBERO (fine-tuned on LIBERO) |
| Zero-shot model | nvidia/GR00T-N1.6-3B (3B params, no task-specific training) |
| LIBERO conditions | Fine-tuned + prompt; fine-tuned, no prompt |
| Episodes per task | 10 (LIBERO fine-tuned); 5 (no-prompt); 10 (RoboCasa) |
| Hardware | AWS g5.2xlarge (NVIDIA A10G 24 GB, 32 GB RAM) |
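
For reference, the LIBERO rollouts follow the benchmark's standard task-suite API. The policy object and its `get_action` method are a hypothetical wrapper around the 0xAnkitSingh/GR00T-N1.6-LIBERO checkpoint (the real GR00T inference interface differs in detail); a minimal sketch of the per-suite loop:

```python
import os

from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv

EPISODES_PER_TASK = 10  # 5 in the no-prompt condition
MAX_STEPS = 520         # horizon; an unfinished motion counts as a failure

def evaluate_suite(policy, suite_name: str) -> float:
    """Roll out `policy` on every task in a LIBERO suite (e.g. "libero_spatial")."""
    task_suite = benchmark.get_benchmark_dict()[suite_name]()
    successes = total = 0
    for task_id in range(task_suite.n_tasks):
        task = task_suite.get_task(task_id)
        bddl_file = os.path.join(
            get_libero_path("bddl_files"), task.problem_folder, task.bddl_file
        )
        env = OffScreenRenderEnv(bddl_file_name=bddl_file)
        init_states = task_suite.get_task_init_states(task_id)
        for ep in range(EPISODES_PER_TASK):
            env.reset()
            obs = env.set_init_state(init_states[ep])
            done = False
            for _ in range(MAX_STEPS):
                # Hypothetical wrapper: raw observation + instruction -> action.
                action = policy.get_action(obs, task.language)
                obs, reward, done, info = env.step(action)
                if done:  # task predicate satisfied
                    break
            successes += int(done)
            total += 1
        env.close()
    return successes / total
```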

Results

The key question: does the fine-tuned model follow language instructions, or has it overfit to the benchmark scenes? The No Prompt column runs the same fine-tuned model with no language input — if scores hold, the model ignores language entirely. Reference scores are taken from the model card (200 episodes per suite).

| Suite | Fine-Tuned + Prompt (GR00T-N1.6-LIBERO) | No Prompt (GR00T-N1.6-LIBERO) | Reference (200 eps) |
| --- | --- | --- | --- |
| LIBERO-Spatial | 100% (100/100) | 63% (32/51) | 96.0% (192/200) |
| LIBERO-Object | 97% (97/100) | 46% (23/50) | 100.0% (200/200) |
| LIBERO-Goal | 98% (98/100) | 14% (7/50) | 98.0% (196/200) |
| LIBERO-10 | — | — | 97.5% (195/200) |

“—” = run pending.   Reference from 0xAnkitSingh/GR00T-N1.6-LIBERO model card.
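
Concretely, the no-prompt condition is a one-line change to the rollout sketch above: the instruction handed to the policy is blanked, so the model must act from pixels and proprioception alone. Names are hypothetical, as before:

```python
# The two evaluation conditions differ only in what the policy is told.
CONDITIONS = {
    "prompt": lambda task: task.language,  # normal language conditioning
    "no_prompt": lambda task: "",          # scene-only ablation
}

# Inside the rollout loop, the action query becomes:
#   action = policy.get_action(obs, CONDITIONS[condition](task))
```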

Observation 1 — scene overfitting, not prompt following. Without any language input, the model still moves to the correct region, picks up an object, and executes a plausible manipulation sequence. The physical behaviour is driven by scene familiarity, not the language prompt. The prompt's role is narrower: it disambiguates which object to pick up. Remove the prompt and scores collapse — not because the arm stops moving, but because it grabs the wrong object. High benchmark accuracy therefore reflects memorization of the training scenes, not genuine language-conditioned generalization.

Observation 2 — the fine-tuned score drop reflects episode timeouts. Inspecting the failing fine-tuned episodes shows the model executing the correct motion sequence in every case; the episode simply times out before the motion completes. The shortfall relative to the reference scores is an artifact of episode-length limits, not a capability gap.
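
One way to verify this, assuming each rollout was logged with a success flag and step count (an assumed log format, not something the benchmark emits): a failure that consumed the full horizon is a timeout, while anything shorter failed for another reason.

```python
def classify_failures(episodes, max_steps=520):
    """Split failed episodes into horizon timeouts vs. other errors.

    `episodes` is a list of dicts with `success` (bool) and `steps` (int),
    an assumed logging format.
    """
    failures = [e for e in episodes if not e["success"]]
    timeouts = sum(e["steps"] >= max_steps for e in failures)
    return timeouts, len(failures) - timeouts
```

On the fine-tuned runs, Observation 2 corresponds to the second count being zero for every suite.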

Rollout Videos

LIBERO-Spatial — Fine-Tuned

Fine-tuned: 100%  (100 / 100)

Table center (10/10)

Next to plate (10/10)

Next to ramekin (10/10)

Next to cookie box (10/10)

Between plate & ramekin (10/10)

On ramekin (10/10)

On cookie box (10/10)

On stove (10/10)

On wooden cabinet (10/10)

In top drawer (10/10)

LIBERO-Spatial — No Prompt

No prompt: 63%  (32 / 51)  vs fine-tuned + prompt: 100%

Table center (5/5)

Next to plate (5/5)

Next to ramekin (1/5)

Next to cookie box (5/5)

Between plate & ramekin (5/5)

On ramekin (5/5)

On cookie box (3/6)

On stove (0/5)

On wooden cabinet (3/5)

In top drawer (0/5)

LIBERO-Object — Fine-Tuned

Fine-tuned: 97%  (97 / 100)

Alphabet soup (10/10)

BBQ sauce (9/10)

Butter (8/10)

Chocolate pudding (10/10)

Cream cheese (10/10)

Ketchup (10/10)

Milk (10/10)

Orange juice (10/10)

Salad dressing (10/10)

Tomato sauce (10/10)

LIBERO-Object — No Prompt

No prompt: 46%  (23 / 50)  vs fine-tuned + prompt: 97%

Alphabet soup (3/5)

BBQ sauce (0/5)

Butter (5/5)

Chocolate pudding (5/5)

Cream cheese (0/5)

Ketchup (5/5)

Milk (5/5)

Orange juice (0/5)

Salad dressing (0/5)

Tomato sauce (0/5)

LIBERO-Goal — Fine-Tuned

Fine-tuned: 98%  (98 / 100)

Open middle drawer (10/10)

Open top drawer, put bowl (10/10)

Bowl on plate (10/10)

Bowl on stove (10/10)

Bowl on cabinet (10/10)

Cream cheese in bowl (10/10)

Wine on rack (10/10)

Wine on cabinet (10/10)

Turn on stove (10/10)

Push plate to stove (8/10)

LIBERO-Goal — No Prompt

No prompt: 14%  (7 / 50)  vs fine-tuned + prompt: 98%

Open middle drawer (0/5)

Open top drawer, put bowl (0/5)

Bowl on plate (4/5)

Bowl on stove (0/5)

Bowl on cabinet (3/5)

Cream cheese in bowl (0/5)

Wine on rack (0/5)

Wine on cabinet (0/5)

Turn on stove (0/5)

Push plate to stove (0/5)

Cross-Environment: RoboCasa

The LIBERO fine-tuned model is evaluated on RoboCasa tasks using a PandaOmron fixed-base embodiment — the same Franka Panda arm, but in a completely different simulation environment from the LIBERO training distribution.
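
The rollout loop mirrors the LIBERO one; only the environment construction changes. A sketch assuming RoboCasa's robosuite-style `make()` entry point and its per-episode language metadata, with `policy.get_action` the same hypothetical wrapper as before:

```python
import robosuite as suite

def evaluate_robocasa(policy, task_name: str, episodes: int = 10,
                      max_steps: int = 700) -> float:
    """Drop the LIBERO-tuned policy into a RoboCasa kitchen task."""
    env = suite.make(
        env_name=task_name,   # e.g. "PnPCounterToCab"
        robots="PandaOmron",  # same Franka arm, fixed base
        has_renderer=False,
        use_camera_obs=True,
    )
    successes = 0
    for _ in range(episodes):
        obs = env.reset()
        # RoboCasa attaches the episode's language instruction to its metadata.
        lang = env.get_ep_meta().get("lang", "")
        for _ in range(max_steps):
            obs, _, _, _ = env.step(policy.get_action(obs, lang))
            if env._check_success():
                successes += 1
                break
    env.close()
    return successes / episodes
```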

| Task | Fine-Tuned + Prompt (GR00T-N1.6-LIBERO) |
| --- | --- |
| OpenSingleDoor | 0% (0/4) |
| PnPCounterToCab | 0% (0/10) |
| PnPCounterToSink | 0% (0/10) |
| PnPCounterToStove | 0% (0/10) |
| PnPCabToCounter | 0% (0/10) |
| PnPStoveToCounter | 0% (0/10) |

Complete failure on out-of-distribution environments. Despite near-perfect scores on LIBERO, the fine-tuned model fails on every episode in RoboCasa. The robot arm is identical — the only change is the simulation environment. The model has learned LIBERO-specific scene representations that do not transfer to a new simulator or new task layouts.

OpenSingleDoor — Fine-Tuned (RoboCasa)

Fine-tuned: 0%  (0 / 4)

Episode 1

Episode 2

Episode 3

Episode 4

Pick-and-Place Tasks — Fine-Tuned (RoboCasa)

Fine-tuned: 0%  across all 5 tasks — 0 / 50 episodes

PnPCounterToCab (0/10)

PnPCounterToSink (0/10)

PnPCounterToStove (0/10)

PnPCabToCounter (0/10)

PnPStoveToCounter (0/10)