Observation 1 — scene overfitting, not prompt following. Without any language input, the model still moves to the correct region, picks up an object, and executes a plausible manipulation sequence. The physical behaviour is driven by scene familiarity, not the language prompt. The prompt's role is narrower: it disambiguates which object to pick up. Remove the prompt and scores collapse — not because the arm stops moving, but because it grabs the wrong object. High benchmark accuracy therefore reflects memorization of the training scenes, not genuine language-conditioned generalization.
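The ablation described above can be sketched as a small scoring harness. This is a hypothetical illustration, not the benchmark's actual evaluation code: the episode records, field names (`grasped`, `target`, `motion_ok`), and object names are all invented stand-ins.

```python
# Hypothetical no-prompt ablation: compare full task success (right object,
# correct motion) against object-agnostic motion success. With the prompt
# removed, motion success stays high while task success collapses, matching
# the observation that only object selection depends on language.

def summarize(episodes):
    """Return (task_success_rate, motion_success_rate) for a list of episodes.

    Each episode is a dict with illustrative fields:
      - "grasped":   which object the arm actually picked up
      - "target":    the object the (removed) prompt asked for
      - "motion_ok": whether a plausible manipulation sequence executed
    """
    n = len(episodes)
    task = sum(e["motion_ok"] and e["grasped"] == e["target"] for e in episodes) / n
    motion = sum(e["motion_ok"] for e in episodes) / n
    return task, motion

# Invented no-prompt episodes: the arm still grasps *something* every time.
no_prompt = [
    {"grasped": "mug",   "target": "bowl", "motion_ok": True},
    {"grasped": "bowl",  "target": "bowl", "motion_ok": True},  # lucky pick
    {"grasped": "spoon", "target": "bowl", "motion_ok": True},
    {"grasped": "mug",   "target": "bowl", "motion_ok": True},
]
task_rate, motion_rate = summarize(no_prompt)
print(task_rate, motion_rate)  # → 0.25 1.0
```

The gap between the two rates is the signature of scene overfitting: motion success is driven by scene familiarity, task success by the prompt.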
Observation 2 — the fine-tuned score drop comes from episode timeouts. Inspection of failing fine-tuned episodes shows that the model executes the correct motion sequence in every case; the failures occur because the episode times out before the motion completes. The drop from reference scores is therefore an artifact of the episode length limit, not a capability gap.
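The failure triage behind this observation can be sketched as follows. Again this is an illustrative assumption, not the actual analysis script: the step limit, episode records, and field names (`success`, `motion_ok`, `steps`) are hypothetical.

```python
# Hypothetical failure triage: split failing episodes into timeout artifacts
# (correct motion that ran out of steps) versus genuine behavioural failures.

MAX_STEPS = 300  # assumed per-episode step limit

def triage(episodes, max_steps=MAX_STEPS):
    """Partition failing episodes into (timeouts, genuine_failures)."""
    timeouts, genuine = [], []
    for e in episodes:
        if e["success"]:
            continue
        # Correct motion that hit the step limit is a length-limit artifact,
        # not a capability failure.
        if e["motion_ok"] and e["steps"] >= max_steps:
            timeouts.append(e)
        else:
            genuine.append(e)
    return timeouts, genuine

# Invented episode logs for illustration.
episodes = [
    {"success": True,  "motion_ok": True,  "steps": 180},
    {"success": False, "motion_ok": True,  "steps": 300},  # timed out
    {"success": False, "motion_ok": True,  "steps": 300},  # timed out
    {"success": False, "motion_ok": False, "steps": 120},  # real failure
]
timeouts, genuine = triage(episodes)
print(len(timeouts), len(genuine))  # → 2 1
```

In the observed fine-tuned runs, essentially all failures land in the timeout bucket, which is why the score drop disappears once the episode length limit is accounted for.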