Loss target. The supervised block is the OXE-Google action head: the first 8 timesteps × 7 dims (x, y, z, roll, pitch, yaw, gripper) of the model's padded action tensor. The remaining horizon and the unused embodiment dimensions are masked out. Real proprioceptive state is recorded alongside every frame; because the policy conditions on both vision and state, omitting state would make the training signal invalid.
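A minimal sketch of that mask, assuming the padded action tensor has a hypothetical shape `(T_MAX, D_MAX)` (the padded dimensions here are illustrative, not taken from the model); only the first 8 timesteps × 7 dims contribute to the loss:

```python
import numpy as np

T_SUP, D_SUP = 8, 7    # supervised horizon and OXE-Google action dims
T_MAX, D_MAX = 16, 14  # hypothetical padded tensor shape (illustrative)

def masked_action_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error restricted to the supervised block.

    Entries outside the first T_SUP timesteps or the first D_SUP action
    dims are zeroed by the mask and carry no gradient signal.
    """
    mask = np.zeros((T_MAX, D_MAX))
    mask[:T_SUP, :D_SUP] = 1.0
    return float(((pred - target) ** 2 * mask).sum() / mask.sum())
```

Perturbing an entry in the masked-out region leaves the loss unchanged, which is the property the masking is meant to guarantee.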
Why “action sequence” and not classification. We do not train a classifier and we do not optimise against a downstream reward. The patch is shaped purely by the action chunks the model emits at each frame; because these are continuous regression targets rather than class labels or reward signals, the optimisation is differentiable end-to-end through the frozen policy.
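The end-to-end flow can be illustrated with a toy stand-in: a frozen linear “policy” `W` maps a flattened patch to an action chunk, and gradient descent updates only the patch. All names and shapes here are hypothetical; the real policy is a deep network, but the gradient path is the same in spirit:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(7, 16))     # frozen "policy" weights (never updated)
target = rng.normal(size=7)      # action chunk we want the policy to emit

def loss_and_grad(patch: np.ndarray) -> tuple[float, np.ndarray]:
    """Squared error on the emitted action chunk, differentiated w.r.t. the patch."""
    action = W @ patch                 # forward pass through the frozen policy
    residual = action - target
    loss = float(residual @ residual)
    grad = 2.0 * W.T @ residual        # gradient flows through W to the patch
    return loss, grad

patch = np.zeros(16)
for _ in range(500):
    loss, grad = loss_and_grad(patch)
    patch -= 0.01 * grad               # only the patch is updated
```

Because the loss is a smooth function of the emitted actions, no reward model or discrete labels are needed: the patch pixels receive gradients directly.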