AI models are improving every day, and if you’ve played around with large language models like ChatGPT or Claude, you’ve seen these improvements firsthand over the last few months. Given that steady progress, it should come as no surprise that they now do really well on certain tests, like medical board exams or the bar exam.
However, you probably knew people in school who scored well on tests but didn’t do as well in the real world, and vice versa. Can the same be said about these AI models? They may be able to ace a medical board exam with relative ease, but can they empathize or reason like a doctor in the real world? Can these models actually showcase clinical reasoning when presented with individual cases that are more than just a summary of words in an exam question?
AI Models And Clinical Reasoning
AI models may be great at answering exam questions, but how does that translate to the real world? After all, doctors rely on a variety of factors to provide the right care for their patients. Ways an AI model could fail in a real clinical setting include:
- Misinterpreting context
- Basing advice on incomplete patient information
- Failing to seek out relevant information with follow-up questions
- Being overly certain when admitting uncertainty would be more appropriate
- Basing recommendations on what is most likely to work for the masses instead of the individual
Interestingly, researchers recently tried to put AI models to the test in a clinical setting to see how they might perform in the “real world.” The study authors tested 21 LLMs—including models from OpenAI, Anthropic, xAI, DeepSeek and Google DeepMind—across 29 standardized clinical examples drawn from the MSD Manual (Merck Sharp & Dohme), a widely used clinical reference that publishes peer-reviewed, structured case presentations developed by independent clinical experts. AI model performance was assessed across five specific domains:
- Differential diagnosis
- Diagnostic testing
- Final diagnosis
- Management
- Miscellaneous clinical reasoning
After looking at the data, researchers came to some interesting conclusions. AI models performed quite well when given a relatively complete clinical picture and asked to name the condition or recommend a treatment protocol, but they struggled with differential diagnosis, which essentially asks the question, “what else could it be?” This is the stage where additional testing is ordered, specific symptoms are reconsidered and rarer diagnoses are reviewed. In this arena, failure rates exceeded 80% across all 21 models. Researchers said the reason for this failure rate is that AI models far too often collapse prematurely onto a single diagnosis rather than preserving uncertainty and considering additional or unlikely-but-possible conditions.
In their conclusion, researchers wrote:
“In this cross-sectional study of 21 LLMs, frontier LLMs achieved high accuracy on final diagnoses but performed poorly in generating differential diagnoses and navigating uncertainty relative to other reasoning stages. Despite version-based improvements and advantages in reasoning-optimized models, off-the-shelf LLMs have not yet achieved the intelligence required for safe deployment and remain limited in demonstrating advanced clinical reasoning.”
So while LLMs are improving, they still can’t reason, adjust on the fly and pivot as well as a real doctor who understands that what works for one person may not work for another. For more stories about how artificial intelligence may shape the future of healthcare, connect with Dr. Silverman and his team today!