Determine whether robot foundation models genuinely generalize across scenes, or whether their benchmark performance is inflated by familiarity with evaluation environments.
Hypothesis: fine-tuned models overfit to the specific scenes, objects, and backgrounds present during training and evaluation, so reported benchmark numbers reflect benchmark-specific memorization rather than generalizable manipulation capability.
Compares nvidia/GR00T-N1.6-3B (zero-shot) against
nvidia/GR00T-N1.6-fractal (fine-tuned on Fractal) on four
tabletop manipulation tasks using the Google Robot embodiment.
Probes language-prompt robustness and scene overfitting.
Compares nvidia/GR00T-N1.6-3B (zero-shot) against
0xAnkitSingh/GR00T-N1.6-LIBERO (fine-tuned on LIBERO) on
Franka Panda tabletop manipulation tasks.
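
One way to quantify the scene-overfitting question for either comparison is to contrast each model's success rate on the original evaluation scenes with its success rate on altered scenes. The sketch below is a minimal illustration, not this project's evaluation code: the `Episode` record, the `"original"`/`"perturbed"` condition labels, and the `scene_gap` helper are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Episode:
    """One evaluation rollout (hypothetical record layout)."""
    model: str       # illustrative label, e.g. "zero-shot" or "fine-tuned"
    condition: str   # assumed split: "original" or "perturbed" scene
    success: bool    # whether the task was completed


def success_rate(episodes: Sequence[Episode], model: str, condition: str) -> float:
    """Fraction of successful episodes for one model under one condition."""
    outcomes = [
        e.success
        for e in episodes
        if e.model == model and e.condition == condition
    ]
    if not outcomes:
        raise ValueError(f"no episodes for {model!r} under {condition!r}")
    return sum(outcomes) / len(outcomes)


def scene_gap(episodes: Sequence[Episode], model: str) -> float:
    """Success rate on original scenes minus success rate on perturbed scenes.

    A larger positive gap means the model loses more performance when the
    scene changes.
    """
    return (
        success_rate(episodes, model, "original")
        - success_rate(episodes, model, "perturbed")
    )
```

Under these assumptions, a clearly larger `scene_gap` for the fine-tuned checkpoint than for the zero-shot one would be consistent with the overfitting hypothesis. Similar gaps would suggest the fine-tuned gains carry over to new scenes.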