τ-bench simulates real-world interactions through dynamic conversations between a language agent and a simulated human user, incorporating domain-specific APIs and policy guidelines. It evaluates an agent's ability to interact consistently and reliably by comparing the final database state after a conversation to the expected goal state. The framework includes diverse databases, APIs, and user simulations to test agents' capabilities in the retail and airline domains, emphasizing complex, open-ended tasks and consistent rule-following.
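To make the goal-state comparison concrete, here is a minimal sketch assuming the database is a JSON-serializable dict; the names (`db_hash`, `task_reward`, and the sample order data) are illustrative, not τ-bench's actual API.

```python
import json

def db_hash(db: dict) -> str:
    """Canonicalize the database so comparison ignores key ordering."""
    return json.dumps(db, sort_keys=True)

def task_reward(final_db: dict, expected_db: dict) -> int:
    """Score 1 only if the final state exactly matches the goal state."""
    return int(db_hash(final_db) == db_hash(expected_db))

# Example: the agent's final database matches the expected goal state.
print(task_reward({"orders": {"o1": "refunded"}},
                  {"orders": {"o1": "refunded"}}))  # 1
```

An exact-match comparison like this is strict by design: partial progress (e.g., refunding the wrong order) earns no credit, which is what lets the benchmark measure reliability rather than near misses.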
τ-bench tests language agents in two domains, retail and airline, chosen because they balance ease of data synthesis and policy specification with the potential for diverse, realistic applications.
Evaluated on these tasks, GPT-4 was the best-performing model, yet it succeeded on less than 50% of tasks and behaved inconsistently across repeated trials. Its failures centered on complex database reasoning, following domain-specific rules, and handling compound requests.
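One simple way to quantify the trial-to-trial inconsistency described above is to run each task several times and count it as consistently solved only if every trial succeeds. This is a hedged sketch, not τ-bench's exact metric; the task names and trial results are illustrative.

```python
from typing import Dict, List

def consistent_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved in all of their independent trials."""
    return sum(all(trials) for trials in results.values()) / len(results)

# Example: "t1" succeeds in every trial; "t2" is inconsistent,
# so only half the tasks count as reliably solved.
print(consistent_pass_rate({"t1": [True, True, True],
                            "t2": [True, False, True]}))  # 0.5
```

Under a metric like this, an agent that solves a task two times out of three scores zero on it, which is why average success rates can overstate how dependable an agent would be in deployment.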