"Precisely locating specific objects in an image based on verbal instructions"—this task, known as visual grounding, is seeing increasing demand in fields such as robotics, automated GUI operation, ...