Experiment 2 — Latent Representation Analysis in Hierarchical VLAs

Goal

Determine whether the latent representations at the System 2 → System 1 interface of hierarchical VLAs encode task semantics or scene identity, and whether fine-tuning on narrow datasets corrupts the geometric structure of those representations.

Hypothesis

The S2 → S1 conditioning bottleneck exhibits task-clustered geometry: embeddings from the same task, collected across different scene variants, are closer to each other than embeddings from different tasks in the same scene. Geometric structure in the latent space tracks what to do, not where you are.
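One way to make the hypothesis testable is to compare two averages of pairwise cosine similarity: same task across different scenes, and different tasks within the same scene. The sketch below is illustrative only. The function names, the array shapes, and the synthetic embeddings are assumptions made for the example, not the extraction pipeline or data used in these experiments, and cosine similarity is just one possible choice of closeness measure.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    x = np.asarray(x, dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


def task_vs_scene_similarity(embeddings, tasks, scenes):
    """Return (mean sim for same task / different scene,
               mean sim for different task / same scene)."""
    sim = cosine_similarity_matrix(embeddings)
    tasks = np.asarray(tasks)
    scenes = np.asarray(scenes)
    same_task = tasks[:, None] == tasks[None, :]
    same_scene = scenes[:, None] == scenes[None, :]
    # Both masks exclude the diagonal, since a sample shares task AND scene with itself.
    same_task_diff_scene = sim[same_task & ~same_scene].mean()
    diff_task_same_scene = sim[~same_task & same_scene].mean()
    return same_task_diff_scene, diff_task_same_scene


# Synthetic stand-in data (assumption): a strong task component plus a weak scene component.
rng = np.random.default_rng(0)
n_tasks, n_scenes, dim = 4, 3, 32
task_dirs = rng.normal(size=(n_tasks, dim))
scene_dirs = rng.normal(size=(n_scenes, dim))

embeddings, tasks, scenes = [], [], []
for t in range(n_tasks):
    for s in range(n_scenes):
        noise = 0.1 * rng.normal(size=dim)
        embeddings.append(3.0 * task_dirs[t] + 0.5 * scene_dirs[s] + noise)
        tasks.append(t)
        scenes.append(s)

within_task, within_scene = task_vs_scene_similarity(np.array(embeddings), tasks, scenes)
print(f"same task, different scene: {within_task:.3f}")
print(f"different task, same scene: {within_scene:.3f}")
```

Under the hypothesis, the first number should exceed the second. If the interface instead tracked scene identity, the ordering would flip.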

Sub-Experiments

2.2

GR00T N1.6: Task vs Target Position Decoding on the Full 6-Layout Suite

On a controlled 6-layout, 3-object tabletop pick suite, linear probes on frozen System 2 → System 1 boundary latents decode both task and target_position under leave-one-layout-out evaluation. With full permutation coverage, task decoding is nearly perfect and position decoding is also strong.

First result
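The leave-one-layout-out protocol described for 2.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the probe used in the experiment: it fits a ridge-regularized least-squares classifier (the write-up says only "linear probes"), treats the task label as the target object, assumes the six layouts are the 3! arrangements of three objects over three positions, and uses synthetic vectors in place of real boundary latents. All function and variable names are hypothetical.

```python
import itertools

import numpy as np


def fit_linear_probe(x, y, n_classes, l2=1e-2):
    """Fit a ridge-regularized least-squares classifier on one-hot targets."""
    x1 = np.hstack([x, np.ones((len(x), 1))])  # append a bias column
    targets = np.eye(n_classes)[y]
    gram = x1.T @ x1 + l2 * np.eye(x1.shape[1])
    return np.linalg.solve(gram, x1.T @ targets)


def predict(weights, x):
    """Predict the class with the largest linear score."""
    x1 = np.hstack([x, np.ones((len(x), 1))])
    return (x1 @ weights).argmax(axis=1)


def leave_one_layout_out_accuracy(x, y, layouts):
    """Train on all layouts but one, test on the held-out layout, average over folds."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    layouts = np.asarray(layouts)
    n_classes = int(y.max()) + 1
    accuracies = []
    for held_out in np.unique(layouts):
        test = layouts == held_out
        weights = fit_linear_probe(x[~test], y[~test], n_classes)
        accuracies.append((predict(weights, x[test]) == y[test]).mean())
    return float(np.mean(accuracies))


# Synthetic stand-in latents (assumption): additive object and position components.
rng = np.random.default_rng(0)
dim, n_rollouts = 32, 20
object_dirs = rng.normal(size=(3, dim))
position_dirs = rng.normal(size=(3, dim))

x, task, target_position, layout = [], [], [], []
for layout_id, perm in enumerate(itertools.permutations(range(3))):
    layout_offset = 0.2 * rng.normal(size=dim)  # layout-specific nuisance shift
    for obj in range(3):
        pos = perm[obj]  # slot holding the target object in this layout
        for _ in range(n_rollouts):
            noise = 0.3 * rng.normal(size=dim)
            x.append(object_dirs[obj] + position_dirs[pos] + layout_offset + noise)
            task.append(obj)
            target_position.append(pos)
            layout.append(layout_id)

x = np.array(x)
task_acc = leave_one_layout_out_accuracy(x, task, layout)
pos_acc = leave_one_layout_out_accuracy(x, target_position, layout)
print(f"task: {task_acc:.2f}  target_position: {pos_acc:.2f}")
```

Holding out a whole layout, rather than random rollouts, means the probe is always scored on an object arrangement it never saw during fitting. With all six permutations present, every (object, position) pairing still appears in one training layout for each fold, which is one reading of why full permutation coverage matters for position decoding.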

2.1

GR00T N1.6: Latent Representations on SimplerEnv

Extract S2 → S1 conditioning tokens from nvidia/GR00T-N1.6-fractal during rollouts on four tabletop manipulation tasks across scene variants. UMAP projections and cosine similarity analysis characterize whether the interface geometry clusters by task or by visual scene. The rollouts use the Google Robot embodiment in SimplerEnv.

In progress

Overall Conclusion

First result: on the clean full 6-layout suite, frozen System 2 → System 1 latents support near-perfect linear decoding of the task (target object) and strong decoding of target position under leave-one-layout-out evaluation. The next steps are prompt controls, token-importance analysis, and clutter / OOD scene transfers.