Evaluating model responses. Language style depending on the weather, the time of day, and maybe other weird context. Bright sun melting in white clouds, and the sound of a train horn on the tracks nearby. Different understandings of containers and traffic. And an increasing number of strange questions that feel unsolvable.