Nate Gaylinn

I distrust claims about LLMs' reasoning and math abilities.

I'm not just skeptical; the way we measure and report these things is majorly broken.

I just read a paper (http://arxiv.org/abs/2410.05229) that discusses a popular math-skills dataset (GSM8K), why it's inadequate, and how LLM performance tanks on a more robust test.

Two big problems here:

Evaluating "mathematical reasoning" should include things like: an equation works the same way regardless of what numbers you plug in. These models tend to memorize patterns of number tokens without generalizing, but GSM8K can't detect that. It's embarrassing that we proudly report success without considering whether the benchmark *actually tests* the thing we care about. (A sketch of what that kind of test looks like is below.)

Worse, this whole math test has leaked into the models' training data. We *know* this, and we can *demonstrate* that the models are memorizing the answers. Yet folks still report steady gains as if that means something. It's either willfully ignorant or deceitful.

#ai #llm #machinelearning #math
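
Here's a minimal sketch of the kind of robustness check the paper argues for, in the spirit of its GSM-Symbolic benchmark: instantiate one word-problem template with many different numbers and see whether accuracy holds up. `ask_model` is a hypothetical stand-in for whatever LLM call you're evaluating; the template and helper names are mine, not from the paper.

```python
# Sketch of a GSM-Symbolic-style test: one problem, many numeric variants.
# A model that learned the procedure should score the same on all of them;
# one that memorized number-token patterns from leaked training data won't.
import random

TEMPLATE = (
    "Sam has {a} apples. He buys {b} more bags with {c} apples each. "
    "How many apples does Sam have now?"
)

def ground_truth(a: int, b: int, c: int) -> int:
    # The reasoning is identical for any values; only the arithmetic changes.
    return a + b * c

def make_variants(n: int, seed: int = 0):
    # Yield (question, answer) pairs with freshly sampled numbers.
    rng = random.Random(seed)
    for _ in range(n):
        a, b, c = rng.randint(2, 50), rng.randint(2, 9), rng.randint(2, 12)
        yield TEMPLATE.format(a=a, b=b, c=c), ground_truth(a, b, c)

def accuracy(ask_model, n: int = 100) -> float:
    # ask_model: hypothetical callable that takes a question string and
    # returns an integer answer. Compare its score here against its score
    # on the single canonical version of the problem.
    correct = sum(
        1 for question, answer in make_variants(n) if ask_model(question) == answer
    )
    return correct / n
```

The point of sampling fresh numbers per variant is exactly the property described above: if performance drops when only the numbers change, the model was pattern-matching tokens, not reasoning.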