Models struggle to predict step order on CAT-BENCH. In the zero-shot answer-only setting, GPT-3.5-turbo achieves the highest overall F1 score at 0.59. GPT-4-turbo is close behind at 0.58, but surprisingly GPT-4o fares significantly worse at 0.49 F1. Asking models to explain their answers improves performance, with the best result of 0.73 F1 coming from Gemini-1.5-pro. However, models are biased towards predicting dependence, which skews the balance between their precision and recall.
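To see why a bias towards answering "dependent" skews precision and recall, consider a minimal sketch with hypothetical counts (not figures from the paper): over-predicting dependence captures most true dependencies, boosting recall, but floods the positive class with false positives, dragging precision down.

```python
# Illustrative arithmetic with made-up counts, not results from CAT-BENCH.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Suppose 40 of 100 step pairs are truly dependent, and a model biased towards
# "dependent" answers yes 80 times, catching 38 of the 40 true dependencies:
#   tp=38, fp=42, fn=2
print(precision_recall_f1(tp=38, fp=42, fn=2))
# -> recall is high (0.95) but precision is low (~0.48), capping F1 around 0.63
```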
Common errors made by LLMs on CAT-BENCH include misunderstanding multi-hop dependencies and failing to identify causal relationships between steps. Models also struggle to predict the correct order of steps and give inconsistent answers across related questions. Human evaluation of model-generated explanations shows that there is substantial room for improvement in how well models understand step dependencies.
CAT-BENCH tests temporal relations by evaluating whether models can recognize the required order of steps in cooking recipes, using the causal dependencies in each recipe's directed acyclic graph (DAG) as ground truth. It poses questions about whether one step must occur before or after another, covering both "before" and "after" temporal relations. Models are assessed on precision, recall, and F1 for predicting these dependencies, as well as on their ability to provide valid explanations for their judgments.
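A minimal sketch of how such before/after questions could be derived from a recipe DAG and labeled by reachability; the toy graph, pair enumeration, and helper names below are illustrative assumptions, not CAT-BENCH's actual construction or code.

```python
from itertools import permutations

# Hypothetical recipe DAG: edge (a, b) means step a must happen before step b.
edges = {("boil water", "add pasta"), ("add pasta", "drain pasta"),
         ("chop garlic", "make sauce"), ("make sauce", "combine")}

def must_precede(a: str, b: str) -> bool:
    """True if a is an ancestor of b in the DAG, i.e. a must occur before b."""
    frontier, seen = [a], set()
    while frontier:
        node = frontier.pop()
        for x, y in edges:
            if x == node and y not in seen:
                if y == b:
                    return True
                seen.add(y)
                frontier.append(y)
    return False

steps = {s for edge in edges for s in edge}
# Each ordered pair yields a "must step A come before step B?" question,
# with the gold label given by transitive reachability in the DAG.
questions = [(a, b, must_precede(a, b)) for a, b in permutations(steps, 2)]

# A model's yes/no answers to these questions would then be scored against the
# gold labels with precision, recall, and F1 over the "dependent" class.
```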