A new study finds that large language models such as GPT-4 perform much worse on counterfactual task variations than on their standard counterparts. This suggests that the models often recall memorized solutions rather than reasoning about the task itself.
ORIGINAL LINK: https://the-decoder.com/language-models-like-gpt-4-memorize-more-than-they-reason-study-finds/
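To make the idea of a counterfactual task variation concrete: one variant of this kind studied in the literature is arithmetic in an unusual base. A model that truly reasons about place-value addition should handle base 9 about as well as base 10; a model that has memorized base-10 sums will not. The sketch below is an illustrative Python example of such a task pair, not code from the study itself; the function name `add_in_base` is hypothetical.

```python
def to_base(n: int, base: int) -> str:
    """Render a non-negative integer as a digit string in the given base."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))

def add_in_base(a: str, b: str, base: int) -> str:
    """Add two digit strings interpreted in `base`, returning the sum in `base`."""
    # int(s, base) parses the string under the given radix.
    return to_base(int(a, base) + int(b, base), base)

# Standard task: base-10 addition, heavily represented in training data.
print(add_in_base("24", "15", 10))  # → 39

# Counterfactual variant: same procedure, but the digits are base-9.
# "24" means 2*9 + 4 = 22 and "15" means 1*9 + 5 = 14, so the sum is 36,
# which is written "40" in base 9.
print(add_in_base("24", "15", 9))   # → 40
```

Models that drop sharply in accuracy on the base-9 variant, despite the identical underlying algorithm, are the kind of evidence the study interprets as memorization rather than reasoning.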