A comprehensive benchmark for assessing LLM agents' ability to perform multi-step, compositional tool-use reasoning in novel environments, framed as procedurally generated "escape room" challenges.