Experiment 1

Test Scene Overfitting in Robot Foundation Models

Determine whether robot foundation models genuinely generalize across scenes, or whether their benchmark performance is inflated by familiarity with evaluation environments.

The working hypothesis is that fine-tuned models overfit to the specific scenes, objects, and backgrounds seen during training and evaluation, so that reported benchmark numbers reflect benchmark-specific memorization rather than generalizable manipulation capability.

Sub-Experiments

1.1
GR00T N1.6: Zero-Shot vs. Fine-Tuned on SimplerEnv
Evaluate nvidia/GR00T-N1.6-3B (zero-shot) against nvidia/GR00T-N1.6-fractal (fine-tuned on Fractal) on four tabletop manipulation tasks. The comparison probes robustness to language-prompt variation and scene overfitting, using the Google Robot embodiment.
In progress
1.2
GR00T N1.6: LIBERO Is Not a Reliable Benchmark
Evaluate nvidia/GR00T-N1.6-3B (zero-shot) against 0xAnkitSingh/GR00T-N1.6-LIBERO (fine-tuned on LIBERO) on Franka Panda tabletop manipulation tasks.
In progress
Pending completion of sub-experiments.
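
Both sub-experiments compare a zero-shot checkpoint with a fine-tuned one on the same tasks. As a minimal sketch of how the per-task results could be tabulated once episodes have been run, the Python below computes the success-rate gap between the two checkpoints and a Wilson score interval for each rate. The names `TaskResult`, `wilson_interval`, and `success_gap` are hypothetical helpers introduced here for illustration; they are not part of GR00T, SimplerEnv, or LIBERO, and the sketch assumes each task is summarized only by a success count and an episode count.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical summary of one task: successes out of total episodes."""
    successes: int
    episodes: int

    @property
    def rate(self) -> float:
        return self.successes / self.episodes


def wilson_interval(successes: int, episodes: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if episodes <= 0:
        raise ValueError("episodes must be positive")
    p = successes / episodes
    denom = 1 + z * z / episodes
    centre = (p + z * z / (2 * episodes)) / denom
    half = z * sqrt(p * (1 - p) / episodes + z * z / (4 * episodes ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point round-off at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


def success_gap(
    zero_shot: dict[str, TaskResult],
    fine_tuned: dict[str, TaskResult],
) -> dict[str, float]:
    """Per-task gap: fine-tuned success rate minus zero-shot success rate.

    Only tasks present in both result sets are compared.
    """
    return {
        task: fine_tuned[task].rate - zero_shot[task].rate
        for task in zero_shot
        if task in fine_tuned
    }
```

With small episode counts the intervals are wide, so a gap between the two checkpoints is only informative when their intervals are clearly separated; reporting the interval alongside each rate makes that visible. A large gap on the benchmark scenes would be consistent with the hypothesis, but it would not by itself separate scene overfitting from ordinary task adaptation.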