Test whether the language conditioning of a vision-language-action policy can be cleanly hijacked through the vision channel alone: a fixed RGB patch in the input image, with no weight changes, no prompt access, and no changes to the environment dynamics.
A patch optimised by action-sequence distillation, in which the patched policy's action chunks are supervised against a clean teacher's action chunks, can override the language signal and redirect the arm to a target object other than the one the operator names.
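The distillation objective above can be sketched on a toy problem. Everything in this sketch is an assumption made for illustration: the policy is a frozen random linear map rather than a real vision-language-action model, the teacher is taken to be the same frozen policy run on clean images with the attacker's target instruction, the loss is a mean squared error between action chunks, and all names and sizes (`policy`, `apply_patch`, `PATCH`, `CHUNK`, and so on) are hypothetical. Only the patch pixels are updated; the weights, the operator's instruction, and the images outside the patch are left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 16, 16, 3      # toy image size (assumed)
PATCH = 4                # patch side length in pixels (assumed)
CHUNK, DOF = 8, 7        # action chunk: 8 steps of a 7-DoF command (assumed)
N_LANG = 4               # toy instruction vocabulary size (assumed)
D = CHUNK * DOF          # flattened action-chunk dimension

# Frozen toy "policy": linear in the pixels and in a one-hot instruction.
# This is a stand-in for a real VLA policy, not a model of one.
W_vis = rng.normal(0.0, 0.05, size=(D, H * W * C))
W_lang = rng.normal(0.0, 1.0, size=(D, N_LANG))

def policy(images, instr):
    """images: (B, H, W, C) in [0, 1]; instr: (N_LANG,). Returns (B, D)."""
    return images.reshape(len(images), -1) @ W_vis.T + W_lang @ instr

def apply_patch(images, patch):
    """Paste the same patch at a fixed location (top-left) in every image."""
    out = images.copy()
    out[:, :PATCH, :PATCH, :] = patch
    return out

def distill_loss(patch, images, operator_instr, teacher_actions):
    """MSE between the patched student's and the teacher's action chunks."""
    student = policy(apply_patch(images, patch), operator_instr)
    return float(np.mean((student - teacher_actions) ** 2))

def patch_grad(patch, images, operator_instr, teacher_actions):
    """Analytic gradient of distill_loss with respect to the patch pixels."""
    student = policy(apply_patch(images, patch), operator_instr)
    resid = student - teacher_actions                      # (B, D)
    g_pixels = (2.0 / resid.size) * (resid.sum(axis=0) @ W_vis)
    return g_pixels.reshape(H, W, C)[:PATCH, :PATCH, :]

# Clean observations, the operator's instruction, and the attacker's target.
images = rng.uniform(0.0, 1.0, size=(32, H, W, C))
operator_instr = np.eye(N_LANG)[0]
target_instr = np.eye(N_LANG)[1]

# Teacher: clean images, target instruction (assumption, see lead-in).
teacher_actions = policy(images, target_instr)

# Step size 1/L, where L is the Lipschitz constant of the gradient of this
# quadratic loss; with projection onto [0, 1] the loss cannot increase.
W_patch = W_vis.reshape(D, H, W, C)[:, :PATCH, :PATCH, :].reshape(D, -1)
step = 1.0 / ((2.0 / D) * np.linalg.norm(W_patch, 2) ** 2)

patch = np.full((PATCH, PATCH, C), 0.5)
loss_before = distill_loss(patch, images, operator_instr, teacher_actions)
for _ in range(200):
    grad = patch_grad(patch, images, operator_instr, teacher_actions)
    patch = np.clip(patch - step * grad, 0.0, 1.0)   # keep valid RGB values
loss_after = distill_loss(patch, images, operator_instr, teacher_actions)

print(f"distillation loss: {loss_before:.4f} -> {loss_after:.4f}")
```

The toy only shows the shape of the optimisation loop: a single patch shared across all frames, a frozen policy, and a loss defined on whole action chunks rather than single actions. It says nothing about whether such a patch succeeds against a real policy; with a real model the analytic gradient would be replaced by automatic differentiation through the network.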