Large language models (LLMs) such as GPT-4 struggle with implicit reasoning: they often make inaccurate comparisons and have difficulty inducing structured representations of rules and facts [3], which limits their ability to generalize knowledge systematically. While transformers can learn implicit reasoning through a process called grokking, that ability does not extend equally to all types of reasoning.
Recent research from Ohio State University and Carnegie Mellon University shows that grokked transformers generalize strongly on comparison tasks, even on out-of-distribution examples, but fail to generalize on composition tasks, where answering a query requires chaining two or more learned facts, once the test examples fall outside the training distribution.
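To make the composition setting concrete, the sketch below shows one way a synthetic two-hop composition task can be constructed: atomic facts map a (head, relation) pair to a tail entity, and a composed query requires chaining two such facts. The entity counts, the OOD split strategy, and all names here are illustrative assumptions, not the exact data pipeline from the cited study.

```python
import random

random.seed(0)
NUM_ENTITIES = 20
NUM_RELATIONS = 5

entities = list(range(NUM_ENTITIES))
relations = list(range(NUM_RELATIONS))

# Atomic facts: each (head, relation) pair maps to a single tail entity.
atomic = {(h, r): random.choice(entities) for h in entities for r in relations}

# A composed (two-hop) query (h, r1, r2) is answered by chaining two atomic
# facts: (h, r1) -> bridge and (bridge, r2) -> tail.
def compose(h, r1, r2):
    bridge = atomic[(h, r1)]
    return atomic[(bridge, r2)]

# Hold out all composed queries whose first hop starts from an "OOD" head
# entity, so answering them requires genuine rule composition rather than
# memorization of seen two-hop pairs.
ood_heads = set(random.sample(entities, 5))
train_queries, ood_queries = [], []
for h in entities:
    for r1 in relations:
        for r2 in relations:
            example = ((h, r1, r2), compose(h, r1, r2))
            (ood_queries if h in ood_heads else train_queries).append(example)

print(f"{len(train_queries)} in-distribution queries, {len(ood_queries)} OOD queries")
```

A comparison task, by contrast, asks the model to relate two entities along a single learned attribute, which is the regime where the grokked models in the study generalized well.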
Grokking is a phenomenon in deep learning in which a model's generalization performance improves sharply long after it has already fit, and seemingly overfit, the training data. With continued training, the model shifts from memorizing individual examples to learning the underlying patterns and structure of the task, yielding much better generalization and robustness on held-out data.
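The sketch below illustrates the kind of setup in which grokking is typically observed: a small network trained on modular addition with strong weight decay and only a fraction of all possible examples. This is a minimal illustration in the spirit of the original grokking experiments (which used a small transformer), not the setup from the cited study; the architecture, hyperparameters, and training length are assumptions, and whether the delayed jump in validation accuracy appears depends heavily on them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
P = 97  # modulus for the task (a + b) mod P
pairs = [(a, b) for a in range(P) for b in range(P)]
perm = torch.randperm(len(pairs))
split = int(0.4 * len(pairs))  # train on 40% of all (a, b) pairs

def encode(batch):
    a = torch.tensor([p[0] for p in batch])
    b = torch.tensor([p[1] for p in batch])
    x = torch.cat([nn.functional.one_hot(a, P),
                   nn.functional.one_hot(b, P)], dim=1).float()
    y = (a + b) % P
    return x, y

train_x, train_y = encode([pairs[i] for i in perm[:split]])
val_x, val_y = encode([pairs[i] for i in perm[split:]])

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(1, 50_001):
    opt.zero_grad()
    loss = loss_fn(model(train_x), train_y)
    loss.backward()
    opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            train_acc = (model(train_x).argmax(1) == train_y).float().mean()
            val_acc = (model(val_x).argmax(1) == val_y).float().mean()
        # Under grokking, train accuracy saturates early while val accuracy
        # stays near chance for a long stretch, then climbs sharply.
        print(f"step {step}: train acc {train_acc:.2f}, val acc {val_acc:.2f}")
```

The signature of grokking in a run like this is the gap between the two printed curves: training accuracy reaches 100% early, while validation accuracy lingers near chance for many steps before rising abruptly once the model has internalized the modular-addition rule.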