Models struggle to predict step order on CAT-BENCH. In the zero-shot answer-only setting, GPT-3.5-turbo achieves the highest overall F1 score at 0.59. GPT-4-turbo is close behind at 0.58, but surprisingly GPT-4o fares significantly worse at 0.49 F1. Asking models to explain their answers improves performance, with the best result of 0.73 F1 coming from Gemini-1.5-pro. However, models are biased towards predicting dependence, which skews the balance between their precision and recall.
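To see why a bias towards answering "dependent" skews precision and recall, consider a minimal sketch with hypothetical counts (not figures from the paper): over-predicting dependence captures most true dependencies, boosting recall, but floods the positive class with false positives, dragging precision down.

```python
# Illustrative arithmetic with made-up counts, not results from CAT-BENCH.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Suppose 40 of 100 step pairs are truly dependent, and a model biased towards
# "dependent" answers yes 80 times, catching 38 of the 40 true dependencies:
#   tp=38, fp=42, fn=2
print(precision_recall_f1(tp=38, fp=42, fn=2))
# -> recall is high (0.95) but precision is low (~0.48), capping F1 around 0.63
```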
Common errors made by LLMs on CAT-BENCH include misunderstanding multi-hop dependencies and failing to identify causal relationships between steps. Models also struggle to predict the correct order of steps and give inconsistent answers across related questions. Human evaluation of model-generated explanations shows that there is substantial room for improvement in how well models understand step dependencies.
CAT-BENCH tests temporal relations by evaluating whether models can recognize the required order of steps in cooking recipes, using the causal dependencies in each recipe's directed acyclic graph (DAG) as ground truth. It poses questions about whether one step must occur before or after another, covering both "before" and "after" temporal relations. Models are assessed on precision, recall, and F1 for predicting these dependencies, as well as on their ability to provide valid explanations for their judgments.
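A minimal sketch of how such before/after questions could be derived from a recipe DAG and labeled by reachability; the toy graph, pair enumeration, and helper names below are illustrative assumptions, not CAT-BENCH's actual construction or code.

```python
from itertools import permutations

# Hypothetical recipe DAG: edge (a, b) means step a must happen before step b.
edges = {("boil water", "add pasta"), ("add pasta", "drain pasta"),
         ("chop garlic", "make sauce"), ("make sauce", "combine")}

def must_precede(a: str, b: str) -> bool:
    """True if a is an ancestor of b in the DAG, i.e. a must occur before b."""
    frontier, seen = [a], set()
    while frontier:
        node = frontier.pop()
        for x, y in edges:
            if x == node and y not in seen:
                if y == b:
                    return True
                seen.add(y)
                frontier.append(y)
    return False

steps = {s for edge in edges for s in edge}
# Each ordered pair yields a "must step A come before step B?" question,
# with the gold label given by transitive reachability in the DAG.
questions = [(a, b, must_precede(a, b)) for a, b in permutations(steps, 2)]

# A model's yes/no answers to these questions would then be scored against the
# gold labels with precision, recall, and F1 over the "dependent" class.
```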