A comprehensive benchmark for assessing LLM agents' ability to perform multi-step, compositional tool-use reasoning in novel environments, framed as procedurally generated "escape room" challenges.