ServiceNow Research has introduced EnterpriseOps-Gym, a new benchmark for assessing autonomous agents in professional environments. The benchmark addresses the lack of standards for evaluating long-term planning and complex workflows.

EnterpriseOps-Gym: New Benchmark for LLMs
EnterpriseOps-Gym is a containerized Docker environment that simulates eight critical business domains, including customer service and IT services. It comprises 164 relational databases, 512 functional tools, and 1,150 expert-curated tasks.
The benchmark is designed to evaluate how large language models (LLMs) perform in complex professional settings. Results show that existing models succeed on fewer than 40% of tasks, indicating a need for improvement in strategic planning and task execution.
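To make the success-rate figure concrete, here is a minimal sketch of how a task success rate could be computed, overall and per domain. The function names, the domain labels, and the sample outcomes are hypothetical and for illustration only; they are not taken from EnterpriseOps-Gym or its results.

```python
from collections import defaultdict


def overall_success_rate(results):
    """Fraction of tasks that succeeded; `results` holds (domain, succeeded) pairs."""
    results = list(results)
    if not results:
        return 0.0
    return sum(1 for _, succeeded in results if succeeded) / len(results)


def success_rate_by_domain(results):
    """Per-domain fraction of tasks that succeeded."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for domain, succeeded in results:
        totals[domain] += 1
        if succeeded:
            passed[domain] += 1
    return {domain: passed[domain] / totals[domain] for domain in totals}


# Hypothetical task outcomes, not real benchmark data.
sample = [
    ("customer_service", True),
    ("customer_service", False),
    ("it_services", False),
    ("it_services", False),
    ("it_services", True),
]

print(overall_success_rate(sample))
print(success_rate_by_domain(sample))
```

Reporting the rate per domain alongside the overall figure is one way to see whether failures cluster in particular workflows, assuming each task is labeled with a single domain.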
Source: MarkTechPost