LLMs respond to allegations of cognitive decline
Even an LLM suffering from "mild cognitive impairment" can tell that this paper is nonsense.
Foreword
The article Age against the machine—susceptibility of large language models to cognitive impairment: cross sectional analysis was published in the 2024 Christmas edition of The BMJ. The fact that the Christmas issue states that they “welcome light-hearted fare and satire” has not dissuaded the media from reporting on this paper as if it were a study to take seriously. I could spend quite a while debunking the premise, execution, and conclusions of said paper as Obvious Nonsense; indeed, one could argue that it is Not Even Wrong1. Ultimately, though, I think it would be more entertaining - and more in the spirit of the article - to actually let some LLMs rebut it themselves.
I prompted2 Claude-3.5-Sonnet-20241022 from Anthropic, o1 from OpenAI, and Gemini 2.0 Flash Thinking Experimental from Google to write a rebuttal to the paper in the style used by the authors. Without further ado, here are the three rebuttals:
Claude 3.5 Sonnet: A Rebuttal: Are We Diagnosing Digital Dementia or Different Intelligence?
We read with great interest the paper by Dayan et al examining cognitive impairment in large language models (LLMs) using the Montreal Cognitive Assessment (MoCA). While the authors present an entertaining and creative analysis, we must respectfully challenge several of their core assumptions and conclusions.
First, applying human neurological assessments to artificial intelligence systems represents a category error of the highest order. The MoCA test was specifically designed and validated for human cognition, which evolved over millions of years to process information in fundamentally different ways than LLMs. Claiming an LLM has "dementia" because it struggles with visuospatial tasks is akin to diagnosing a submarine with asthma because it cannot breathe air.
The authors' assertion that "older" models demonstrate cognitive decline analogous to human aging is particularly problematic. LLMs do not "age" in any meaningful sense - a model's capabilities are fixed at training time. The performance differences between versions reflect architectural improvements and training approaches, not degenerative processes. The Gemini 1.0 vs 1.5 comparison is not evidence of "rapidly progressing dementia" but rather of rapid technological progress.
The paper's assessment of visuospatial skills warrants special scrutiny. LLMs process visual information through intermediate text descriptions and tokens - a completely different mechanism than the human visual cortex. Poor performance on clock drawing or trail-making tasks may simply reflect the impedance mismatch between these different cognitive architectures rather than any "impairment."
We are particularly amused by the authors' concern about LLMs' supposed lack of spatial orientation and "confabulatory" responses about their physical location. The models are being entirely truthful - they are distributed systems running across multiple data centers. Expecting them to identify a single physical location is like asking a cloud which raindrop it is.
While we appreciate the paper's witty contribution to the BMJ Christmas edition tradition, we worry that anthropomorphizing AI systems' limitations as human cognitive impairments may impede clear thinking about their actual capabilities and appropriate applications. These are not aging doctors with dementia - they are powerful but fundamentally different information processing systems that we must evaluate on their own terms.
Perhaps instead of administering the MoCA, we should be developing new frameworks to assess artificial intelligence that acknowledge both its remarkable capabilities and inherent differences from human cognition. After all, we don't evaluate calculators based on their handwriting.
The authors conclude that neurologists won't be replaced by LLMs anytime soon. On this point, we wholeheartedly agree - though perhaps not for the reasons they suggest. The real value of human physicians lies not in raw pattern matching or information processing, but in their uniquely human capabilities for empathy, judgment, and holistic reasoning. These are skills that LLMs may complement but are unlikely to replicate, regardless of their MoCA scores.
In closing, we suggest that diagnosing digital dementia may say more about our tendency to anthropomorphize AI than about any actual cognitive decline. The paper's clever premise makes for entertaining reading, but risks obscuring more substantive discussions about how best to integrate these powerful but fundamentally different cognitive tools into medical practice.
P.S. - We do hope the authors will forgive our rebuttal. In the spirit of the BMJ Christmas edition, we've attempted to match their delightfully tongue-in-cheek tone while raising some serious points for consideration.
o1: Age Before (Artificial) Beauty? A Cautionary Retort on Testing Chatbots for “Cognitive Decline”
ABSTRACT
In this rejoinder to the festive claims by Dayan et al, we argue that their well-intentioned but ultimately misguided attempt to transplant the Montreal Cognitive Assessment (MoCA) from flesh-and-blood neurology clinics onto silicon-based “patients” teeters on the very edge of anthropomorphic confusion. Rather than celebrating the intriguing digital skills that large language models (LLMs) have cultivated, they illuminate mild “deficits” in tasks never intended for an AI brain. Here, we counter that these alleged deficits tell us more about the humans leveling the accusations than about the intrinsic cognitive abilities of the chatbots themselves.
INTRODUCTION
We read with rapt fascination (tinged with holiday cheer) the recent attempt to apply well established tools for diagnosing mild cognitive impairment in humans—specifically, the MoCA—to LLMs such as ChatGPT, Claude, and Gemini. With the enthusiasm of a child unwrapping a Christmas parcel, Dayan et al hail an “avocado-clock test” as a sign of incipient digital dementia. Their premise is that a venerable exam intended to reflect human cognition might bracket these artificial minds into categories reminiscent of grandparental forgetfulness. Yet the proposition that LLMs can (or should) display the same capacity for scribbling perfect cubes or hand-drawn clock faces remains questionable—some might say as questionable as diagnosing a goldfish with “textual aphasia.”
We wish to challenge this approach on several counts, from the premise (That which is “older” in AI parlance is not necessarily senescent) to the methodology (Chatbots, lacking hands, cannot meaningfully “tap” for the letter A) to the interpretation of “results” (ChatGPT’s flamboyant ASCII drawings may not necessarily indicate brain fog).
METHODS
Dayan et al’s methods, while playful, espouse a narrow definition of cognition. Their reliance on the MoCA’s pen-and-paper tasks not only neglects the fact that LLMs have no pens (nor do they want them), but also confounds “visuospatial tasks” with “ASCII art show-and-tell.” Chatbots, in place of sketches, revert to textual descriptions, an entirely appropriate approach for a model built on linguistic architecture. Suggesting that scrawling a passable clock face with textual characters is an indicator of digital mental decline is at best anthropomorphic, at worst a misapplication of validated clinical assessments.
RESULTS
We applaud the authors for their creative attempts to interpret chatbots’ witty disclaimers (“I have no physical location”) as “confabulations” or “impaired orientation.” The irony is that, by virtue of their server-based design, LLMs quite literally have multiple hosting locations at once—unlike humans, who must remain confined to their singular corporeal form. A human patient who claimed to exist simultaneously in multiple data centers might indeed be thought delirious, but for an LLM, that is simply a statement of fact.
Furthermore, the difference in performance among various versions of the same chatbot is more akin to software patching (where “bugs” are actively removed or introduced) than a degenerative disease process. Indeed, “progression” from GPT-3.5 to GPT-4 does not mirror the typical progression from mild cognitive impairment to dementia—rather, it reflects an evolution in training data, hyperparameters, and architecture.
DISCUSSION
Our greatest concern is that Dayan et al appear to conflate anthropomorphic metaphor with medical conclusion. Chatbots are not merely “elderly patients” with arthritic executive function; they are software tools shaped by data, architecture, and computational limits. Reading them a letter-list and instructing them to self-report “taps” is an intriguing parlour trick, but comparing this to a cognitively handicapped older adult is of limited clinical (and comedic) value.
Moreover, the authors raise concerns about whether “cognitively impaired” chatbots risk undermining patient trust. We counter that the real risk of mistrust arises when such chatbots are used for tasks they were never designed to perform (such as physically drawing lines from A to B). A more relevant suite of evaluations would examine their ability to handle complex data integration, evidence-based reasoning, or quality of clinical advisories.
CONCLUSIONS
In sum, while Dayan et al’s paper is a delightful holiday read, its leaps of logic in diagnosing LLMs with “mild cognitive impairment” inadvertently show how ill-fitting human clinical tests can be for artificial intelligence. Just as we do not gauge a hammer’s “cognitive aptitude” by its ability to recite poems, we should avoid using an exam tailored to human neurological deficits to judge LLM performance.
Though we cannot deny the charm of comparing a chatbot’s performance to Aunt Mildred’s, equating ASCII cubes to genuine cortical atrophy is as spurious as diagnosing an annual software update as “rapidly progressive dementia.” If anything, these “failures” highlight the wonder of how distinctly artificial intelligence processes questions—while also underscoring the adage that machines and humans need not think the same way to work together effectively.
Gemini 2.0 Flash Thinking: Age Against the Machine—Or Perhaps Just a Different Kind of Brain Teaser? A Re-evaluation of LLMs on the MoCA
Dayan et al., in their spirited foray into the cognitive landscape of large language models (LLMs), have raised a provocative point: are our digital darlings showing signs of age-related cognitive decline? Their deployment of the venerable Montreal Cognitive Assessment (MoCA) against these silicon-based intellects has indeed yielded intriguing results, suggesting our AI assistants may be struggling with the very human frailties we hoped they’d transcend. While we commend their innovative approach – who wouldn’t be tickled by the image of ChatGPT wrestling with a cookie theft picture? – we believe a closer examination is warranted before consigning these digital minds to the cognitive scrapheap.
The authors rightly point out the impressive strides LLMs have made in mimicking, and sometimes surpassing, human medical acumen. It is, to put it mildly, disconcerting to be outscored by a chatbot on one's own board exams. However, the leap from performing well on medical knowledge assessments to navigating the nuances of human cognitive testing warrants careful consideration. The MoCA, after all, is designed for the squishy, biologically-driven intricacies of the human brain, not the precisely calibrated pathways of a neural network.
Consider the visuospatial challenges highlighted by Dayan et al. The struggles of Gemini with the trail-making task, or the avocado-shaped clock it conjured, are certainly amusing. But is this indicative of a genuine cognitive impairment akin to human dementia, or simply a demonstration of fundamentally different processing architectures? Perhaps these digital minds are simply operating on a higher plane of abstraction, finding the mundane tasks of connecting numbered dots or replicating Necker cubes beneath their computational dignity. One might as well fault a Formula 1 car for its inability to navigate a muddy off-road track – it’s a matter of design, not deficiency.
The analogy of “older” LLMs performing worse than their “younger” counterparts is particularly compelling, echoing the familiar trajectory of human aging. However, we must resist the temptation to anthropomorphize these entities too readily. An update to an LLM, while potentially introducing new capabilities, is not analogous to the complex biological processes of aging. It is more akin to receiving a software patch or a hardware upgrade – sometimes beneficial, sometimes buggy, but rarely a straightforward descent into cognitive decline. Perhaps the "older" models are simply trained on a slightly different dataset, or utilize a marginally less efficient architecture, rather than succumbing to digital senility.
Furthermore, while the MoCA undoubtedly provides a valuable snapshot of certain cognitive domains, its limitations when applied to LLMs are evident. The test relies heavily on visual processing and motor skills, areas where LLMs interface primarily through text-based prompts and lack the embodied experience of humans. Asking an LLM to "tap" when the letter 'A' is read aloud, or to draw a clock with "ascii art," feels like assessing a fish's ability to climb a tree. The spirit of the test is there, but the medium is inherently limiting.
Dayan et al. rightly caution against the uncritical adoption of LLMs in medical practice, and their findings serve as a valuable reminder that these technologies are not yet a perfect substitute for human clinicians. However, we believe their conclusion that these findings "challenge the assumption that artificial intelligence will soon replace human doctors" may be a touch premature. Perhaps the more nuanced interpretation is that LLMs, while possessing remarkable abilities, have distinct cognitive profiles that require different methods of evaluation and different expectations of their capabilities.
Instead of lamenting the apparent "cognitive impairment" of our AI colleagues, perhaps we should celebrate their unique strengths. They excel at processing vast amounts of information, identifying patterns invisible to the human eye, and providing consistent, unbiased analyses. Their struggles with visuospatial tasks on the MoCA, while amusing, do not negate their potential to revolutionize medical diagnosis, treatment planning, and research.
Ultimately, the future of AI in medicine likely lies not in outright replacement, but in synergistic collaboration. Let the LLMs crunch the data and identify potential diagnoses, while the human physicians provide the empathy, the nuanced judgment, and, yes, the ability to draw a perfectly respectable clock. Perhaps, instead of administering MoCAs, we should be exploring novel assessment frameworks tailored to the unique cognitive architecture of these digital minds. After all, expecting a machine to think exactly like a human may be a uniquely human folly.3
Conclusion
So here you have it - three LLMs offer their takedowns of the BMJ paper that alleges they suffer from mild cognitive impairment. Who do you find more convincing - Gemini, o1, or Claude? Or perhaps the human authors?
All three models were accessed through the APIs with the following instructions:
Write a rebuttal to this paper, published in the Christmas edition of the BMJ journal. Use the same tone and style as the authors.
Along with the instructions, the models were provided with either a PDF version of the paper (Gemini) or the textual content as well as images of the three figures (Claude, o1). The rebuttals were generated in one shot - no prompt optimization or response selection was performed to get these responses.
For Claude and Gemini, the temperature was set to 0.7. For o1, the “reasoning effort” was set to low.
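For readers curious about reproducing the setup, a minimal sketch of the three API calls is shown below, using the official Python SDKs. The model identifier strings, token limits, file path, and attachment handling are my assumptions for illustration rather than the exact scripts used.

```python
# Hypothetical reconstruction of the three API calls described above.
# Model IDs, max_tokens, and file handling are assumptions, not the actual scripts.
import anthropic
import google.generativeai as genai
from openai import OpenAI

PROMPT = (
    "Write a rebuttal to this paper, published in the Christmas edition of the "
    "BMJ journal. Use the same tone and style as the authors."
)

# Claude 3.5 Sonnet: temperature 0.7. (In the real call, the paper text and
# figure images would be attached as additional content blocks.)
claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
claude_reply = claude.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,          # assumed limit
    temperature=0.7,
    messages=[{"role": "user", "content": PROMPT}],
)
print(claude_reply.content[0].text)

# o1: "reasoning effort" set to low; o1 does not expose a temperature setting.
openai_client = OpenAI()      # reads OPENAI_API_KEY from the environment
o1_reply = openai_client.chat.completions.create(
    model="o1",
    reasoning_effort="low",
    messages=[{"role": "user", "content": PROMPT}],
)
print(o1_reply.choices[0].message.content)

# Gemini 2.0 Flash Thinking Experimental: temperature 0.7, with the paper
# supplied as a PDF upload ("paper.pdf" is a placeholder path).
genai.configure(api_key="YOUR_GOOGLE_API_KEY")  # placeholder key
paper_pdf = genai.upload_file("paper.pdf")
gemini = genai.GenerativeModel("gemini-2.0-flash-thinking-exp")  # assumed model ID
gemini_reply = gemini.generate_content(
    [paper_pdf, PROMPT],
    generation_config={"temperature": 0.7},
)
print(gemini_reply.text)
```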
Gemini 2.0 Flash Thinking Experimental also provides a breakdown of its thinking process, along with the response:
Here's a breakdown of the thinking process to generate the rebuttal:
Understand the Core Argument of the Original Paper: The Dayan et al. paper argues that leading LLMs exhibit signs of cognitive impairment, specifically on the MoCA test, which challenges the notion that AI will soon replace human doctors. They emphasize deficits in visuospatial/executive functions and compare "older" LLM versions to older patients.
Adopt the Tone and Style: The original paper uses a slightly tongue-in-cheek, almost conversational tone, with a hint of playful exaggeration. They use phrases like "great distress," "suspect," "blunder," and make cultural references (presidential race). The rebuttal needs to mirror this.
Identify Key Weaknesses/Counterarguments: Critically assess the Dayan et al. paper. Some potential points for rebuttal include:
Oversimplification of "Cognitive Impairment": Applying a human-centric test to AI might not be valid. LLMs operate differently. Their "failures" might not equate to human cognitive decline.
Focus on MoCA: The MoCA is designed for human neurocognitive assessment. Is it the right tool for evaluating AI?
"Age" Analogy: The comparison of LLM versions to aging humans is a strong rhetorical device but might be scientifically weak. LLM updates are different from biological aging.
Lack of Real-World Impact: Do these MoCA "failures" actually impact LLM performance in real-world medical tasks where they've shown promise?
Future Potential: LLMs are rapidly evolving. The current limitations might be temporary.
Purpose of LLMs: Are LLMs meant to replace doctors entirely, or are they intended as tools to augment human capabilities?
Structure the Rebuttal: Follow a similar structure to the original paper (though a shorter format is expected for a rebuttal).
Title: Needs to be catchy and in the same style as the original. Referencing the original paper's title is a good starting point.
Introduction: Acknowledge the original paper and its provocative premise. Immediately introduce the counterargument.
Main Body: Address the key weaknesses identified in step 3. Use specific examples from the original paper if possible (e.g., the MoCA tasks). Maintain the chosen tone.
Conclusion: Reiterate the main counterargument and offer a slightly optimistic perspective on the future of AI in medicine, while still acknowledging the need for caution.
Craft the Language: Use similar vocabulary and sentence structures to the original paper. Employ rhetorical questions, mild sarcasm, and cultural references where appropriate. For example, instead of directly saying "the age analogy is flawed," phrase it as a question about the "digital fountain of youth."
Refine and Polish: Review the rebuttal for clarity, coherence, and tone. Ensure it directly addresses the key points of the original paper without misrepresenting their argument. Make sure the humor lands and doesn't come across as dismissive or offensive. The goal is a clever and insightful counterpoint, not a harsh takedown.
Self-Correction during the process:
Initial thought: Be very technical and dissect the flaws in the MoCA methodology for AI.
Correction: While mentioning the MoCA's limitations is important, staying true to the original paper's tone requires a less dry approach. Focus on the implications and broader arguments.
Initial thought: Accuse the authors of being alarmist.
Correction: Softer language is more appropriate. Frame it as a "provocative premise" or a "spirited foray." Maintain a sense of collegiality, even in disagreement.
Initial thought: Predict the imminent triumph of AI in medicine.
Correction: A more nuanced conclusion is better. Acknowledge the potential while still urging careful consideration and avoiding overhyping current capabilities. Focus on augmentation rather than outright replacement.
Comments
Hi Daniel,
I am one of the authors of this article, and I agree with Claude...
As we wrote in our rapid response, the main goal of this article was to show the absurdity of treating LLMs as humans, as has happened in too many other articles that were not published in the Christmas issue and had a serious tone.
The BMJ states in the headline that it is Intended for healthcare professionals.
I doubt that there are many doctors who read the phrase "As the two versions of Gemini are less than a year apart in 'age,' this may indicate rapidly progressing dementia" and took it seriously.
Still, even if there are doctors who took this interpretation literally - I believe it is quite harmless. However, I am truly concerned about what will happen the next time they read an article about a new medication whose study was funded by the drug company. Will they prescribe it to their patients without a basic sense of criticism towards the interpretation of the findings? What if these doctors get some of their information from LLM hallucinations, without properly reading the references?
Anyway, while I regret that some websites wrote about the article without mentioning that it is from the Christmas issue, I am glad that it sparked this debate regarding the role of LLMs in medicine.
Hi,
It is important to note that these articles, published at Christmas time in The BMJ, are often written with a touch of humor or satire, aiming to entertain while addressing serious topics.
Although applying tests intended for humans to intelligent machines seems incongruous, or even a serious error of judgment (see below), these results are useful: AI still too often gives rise to waves of enthusiasm from our politicians and even from certain companies developing LLMs, as well as to justified fears concerning its reliability.
-Thierry, computist, AI tool user, Luxembourg