TAU-bench is designed to evaluate the performance and reliability of conversational AI agents in real-world settings. It tests agents on completing complex tasks while interacting with simulated users and tools to gather required information, focusing on their ability to follow rules, reason, retain information, and communicate effectively in realistic conversations [4].
Sierra, an AI startup that builds conversational AI chatbots for businesses, was co-founded by Bret Taylor and Clay Bavor [4]. Taylor is known for his work at Facebook, Salesforce, and OpenAI, while Bavor is a Google veteran who led Google Labs and initiated Google's AR/VR effort, Project Starline, and Google Lens [5].
The three requirements identified for TAU-bench are: 1) agents must interact seamlessly with both humans and programmatic APIs over long interactions to gather information and solve complex problems, 2) agents must accurately follow complex policies or rules specific to the task, and 3) agents must be consistent and reliable at scale [4].
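The third requirement, consistency at scale, can be made concrete by running the same task several times and asking how often every run succeeds. The sketch below is illustrative only and is not taken from the TAU-bench code: the function names and the example data are assumptions. It uses the combinatorial estimate C(c, k) / C(n, k), where c is the number of successful runs out of n, which corresponds to the pass^k style of reliability metric described for the benchmark.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k runs drawn from `trials` all succeeded.

    Hypothetical helper: computes C(successes, k) / C(trials, k).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # comb(successes, k) is 0 when successes < k, so the estimate is 0.0.
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks.

    `results` maps a task id to the success flag of each repeated run.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    scores = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(scores) / len(scores)


# Made-up example: two tasks, four runs each.
example = {
    "task_a": [True, True, True, True],   # always succeeds
    "task_b": [True, True, True, False],  # fails once
}
print(mean_pass_hat_k(example, 1))  # average single-run success rate
print(mean_pass_hat_k(example, 4))  # fraction of tasks solved in all four runs
```

An agent that succeeds most of the time on a single run can still score poorly once k grows, which is why a repeated-trial measure speaks to reliability rather than average performance.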