Experiment 1 — Test Scene Overfitting

Goal

Determine whether robot foundation models genuinely generalize across scenes, or whether their benchmark performance is inflated by familiarity with evaluation environments.

Hypothesis

Fine-tuned models overfit to the specific scenes, objects, and backgrounds present during training and evaluation. Reported benchmark numbers reflect benchmark-specific memorization rather than generalizable manipulation capability.

Sub-Experiments

1.1

GR00T N1.6: Zero-Shot vs. Fine-Tuned on SimplerEnv

Evaluate nvidia/GR00T-N1.6-3B (zero-shot) against nvidia/GR00T-N1.6-fractal (fine-tuned on Fractal) on four tabletop manipulation tasks in SimplerEnv, using the Google Robot embodiment. This sub-experiment probes two things: robustness to variations in the language prompt, and scene overfitting.
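
One way the zero-shot vs. fine-tuned comparison could be tallied is sketched below. It is a minimal sketch, not the project's actual evaluation harness: the episode record layout (keys "model", "task", "success") and the helper names success_rates and fine_tuning_delta are assumptions made for illustration, and no SimplerEnv or GR00T API is called.

```python
from collections import defaultdict


def success_rates(episodes):
    """Aggregate episode outcomes into a success rate per (model, task).

    `episodes` is an iterable of dicts with the hypothetical keys
    "model", "task", and "success" (bool); this layout is an assumption.
    """
    counts = defaultdict(lambda: [0, 0])  # (model, task) -> [successes, total]
    for ep in episodes:
        key = (ep["model"], ep["task"])
        counts[key][0] += int(ep["success"])
        counts[key][1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


def fine_tuning_delta(rates, zero_shot, fine_tuned):
    """Per-task difference: fine-tuned success rate minus zero-shot rate.

    Tasks that lack a rate for either model are skipped.
    """
    tasks = {task for (_, task) in rates}
    return {
        task: rates[(fine_tuned, task)] - rates[(zero_shot, task)]
        for task in sorted(tasks)
        if (zero_shot, task) in rates and (fine_tuned, task) in rates
    }
```

In use, each rollout would append one record, with "model" set to nvidia/GR00T-N1.6-3B or nvidia/GR00T-N1.6-fractal. Adding a prompt-variant or scene-variant field to the grouping key would let the same tally separate prompt robustness from scene overfitting.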

In progress

1.2

GR00T N1.6: LIBERO Is Not a Reliable Benchmark

Evaluate nvidia/GR00T-N1.6-3B (zero-shot) against 0xAnkitSingh/GR00T-N1.6-LIBERO (fine-tuned on LIBERO) on Franka Panda tabletop manipulation tasks.

In progress

Overall Conclusion

Pending completion of sub-experiments.