Determine whether the latent representations at the System 2 → System 1 interface of hierarchical VLAs encode task semantics or scene identity, and whether fine-tuning on narrow datasets corrupts this geometric structure.
The S2 → S1 conditioning bottleneck exhibits task-clustered geometry: embeddings from the same task, collected across different scene variants, are closer to each other than embeddings from different tasks in the same scene. Geometric structure in the latent space tracks what to do, not where you are.
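The comparison above can be made concrete with a small sketch. It assumes the interface embeddings are stacked in a NumPy array alongside integer task and scene labels; the function names, array shapes, and synthetic data are illustrative assumptions, not the project's actual code or results.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


def task_vs_scene_similarity(embeddings, task_ids, scene_ids):
    """Mean cosine similarity for (same task, different scene) pairs
    and for (different task, same scene) pairs."""
    sim = cosine_similarity_matrix(embeddings)
    same_task = task_ids[:, None] == task_ids[None, :]
    same_scene = scene_ids[:, None] == scene_ids[None, :]
    # Both masks exclude the diagonal, since a sample shares task and scene with itself.
    within_task = sim[same_task & ~same_scene].mean()
    within_scene = sim[~same_task & same_scene].mean()
    return within_task, within_scene


# Synthetic stand-in data (hypothetical): a strong task component plus a weak scene component.
rng = np.random.default_rng(0)
n_tasks, n_scenes, dim = 4, 3, 32
task_dirs = rng.normal(size=(n_tasks, dim))
scene_dirs = rng.normal(size=(n_scenes, dim))
task_ids = np.repeat(np.arange(n_tasks), n_scenes)
scene_ids = np.tile(np.arange(n_scenes), n_tasks)
embeddings = (
    3.0 * task_dirs[task_ids]
    + 0.5 * scene_dirs[scene_ids]
    + 0.1 * rng.normal(size=(n_tasks * n_scenes, dim))
)

within_task, within_scene = task_vs_scene_similarity(embeddings, task_ids, scene_ids)
print(f"same task, different scene: {within_task:.3f}")
print(f"different task, same scene: {within_scene:.3f}")
```

Task-clustered geometry corresponds to the first number exceeding the second; scene-clustered geometry would reverse the ordering.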
Both task and target_position are decoded from the interface embeddings
under leave-one-layout-out evaluation. With full permutation coverage,
task decoding is nearly perfect and position decoding is also strong.
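A leave-one-layout-out evaluation of this kind can be sketched as follows. The source does not say which decoder is used, so a nearest-centroid classifier stands in here as an assumption; the function name, the label arrays, and the synthetic data are likewise hypothetical.

```python
import numpy as np


def leave_one_layout_out_accuracy(embeddings, labels, layout_ids):
    """Hold out each layout in turn, fit class centroids on the remaining
    layouts, and classify held-out samples by nearest centroid."""
    correct = 0
    for layout in np.unique(layout_ids):
        test = layout_ids == layout
        train = ~test
        classes = np.unique(labels[train])
        centroids = np.stack(
            [embeddings[train & (labels == c)].mean(axis=0) for c in classes]
        )
        # Euclidean distance from every held-out sample to every class centroid.
        dists = np.linalg.norm(
            embeddings[test][:, None, :] - centroids[None, :, :], axis=2
        )
        predictions = classes[dists.argmin(axis=1)]
        correct += int((predictions == labels[test]).sum())
    return correct / len(labels)


# Synthetic stand-in data (hypothetical): 4 tasks x 3 layouts x 5 samples each.
rng = np.random.default_rng(0)
n_tasks, n_layouts, n_repeats, dim = 4, 3, 5, 32
task_dirs = rng.normal(size=(n_tasks, dim))
layout_dirs = rng.normal(size=(n_layouts, dim))
task_labels = np.repeat(np.arange(n_tasks), n_layouts * n_repeats)
layout_ids = np.tile(np.repeat(np.arange(n_layouts), n_repeats), n_tasks)
embeddings = (
    3.0 * task_dirs[task_labels]
    + 0.5 * layout_dirs[layout_ids]
    + 0.1 * rng.normal(size=(len(task_labels), dim))
)

accuracy = leave_one_layout_out_accuracy(embeddings, task_labels, layout_ids)
print(f"leave-one-layout-out task accuracy: {accuracy:.2f}")
```

The same function could be applied to target_position by passing position labels in place of task labels, provided positions are treated as discrete classes. Because the held-out layout never contributes to the centroids, high accuracy indicates that the decoded variable generalizes across layouts rather than being memorized per scene.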
Embeddings are extracted from nvidia/GR00T-N1.6-fractal during rollouts
on four tabletop manipulation tasks across scene variants.
UMAP projections and cosine similarity analysis characterize whether the
interface geometry clusters by task or by visual scene.
Uses the Google Robot embodiment in SimplerEnv.