New LMU study shows: How artificial intelligence really makes doctors better | Ludwig Maximilian University of Munich

Published on: May 26, 2026 / Updated on: May 26, 2026 – Author: Konrad Wolfenstein

Lifesaver or risk? How "thinking" AI is completely changing everyday hospital life

EU law forces a rethink: AI in hospitals will have to “think out loud” in the future

Artificial intelligence has long been hailed as a savior in healthcare, combating chronic time pressure and acute staff shortages. However, a groundbreaking new study from Germany reveals that whether an algorithm saves lives or, in the worst-case scenario, even provokes misdiagnoses, depends on a crucial detail that has received little attention until now. It is simply not enough for an AI to deliver accurate results – it must also be able to explain its reasoning process to the physician step by step. A fascinating experiment with over 100 radiologists reveals why so-called "chain-of-thought" models drastically reduce the diagnostic error rate, why classic differential diagnoses suddenly become cognitive traps, and why these findings could radically transform not only medical practice but also the global AI market and future EU regulations.

Related to this:

The effect of medical explanations from large language models on diagnostic accuracy in radiology

When AI thinks for itself: How explainable artificial intelligence is changing medical diagnostics

A plausible answer is not enough – those who blindly trust AI endanger patients' lives

Large language models are no longer confined to laboratory experiments. They can be found in law firms, newsrooms, management consultancies – and increasingly in hospitals. But while public debate often revolves around the question of whether artificial intelligence will one day replace doctors, researchers at LMU Munich, the LMU University Hospital, the Karlsruhe Institute of Technology, and the University of Bayreuth are asking a far more nuanced question that is directly relevant to everyday clinical practice: Under what conditions does AI support actually improve diagnostic quality – and when, in the worst-case scenario, is it even harmful?

The answer, published in the journal npj Digital Medicine by the research team led by Stefan Feuerriegel, Professor at the LMU Munich School of Management, and Boj Friedrich Hoppe from the LMU University Hospital, is as clear as it is sobering: The primary concern is not whether an AI provides a correct diagnosis. It is how it explains that diagnosis. This finding is significant because it elevates the entire debate about AI in healthcare to a new level – moving away from the binary question of "AI yes or no?" towards the more nuanced question of how to design human-machine interaction.

The experiment: 101 radiologists and four conditions

The study is methodologically remarkable. In a randomized experiment, 101 radiologists were presented with real clinical cases involving radiological imaging – including findings from computed tomography and magnetic resonance imaging. Participants were asked to formulate a diagnosis in free text, which is significantly more challenging than simply selecting a multiple-choice option and reflects clinical reality much more accurately.

The participants were randomly assigned to one of four groups. The first group worked entirely without AI support and served as the control group. The second group received only a single diagnostic recommendation from the multimodal language model. The third group received a differential diagnosis, i.e., a list of possible diseases with graded probabilities. Finally, the fourth group received a so-called chain-of-thought explanation: The model revealed its reasoning step by step—it named relevant image features, explained clinical indications, discussed exclusion criteria, and made its line of reasoning comprehensible to the physician.

The result: A twelve percentage point difference and what's behind it

The results are clear. Radiologists who used the step-by-step chain-of-thought explanation achieved a diagnostic accuracy rate 12.2 percentage points higher than the control group without AI. This is not a marginal effect. In the context of everyday clinical practice, where thousands of reports are generated daily, this difference corresponds to a significant number of misdiagnoses that could be avoided.
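
To make the scale concrete, the percentage-point gain can be converted into a rough count of additional correct reports. The following is a back-of-the-envelope sketch, not a result from the study: the daily report volume is a hypothetical figure, and the calculation assumes the experimental effect would carry over unchanged to routine practice, which the study does not claim.

```python
def additional_correct_reports(n_reports: int, gain_pp: float) -> float:
    """Expected number of extra correct reports for an accuracy gain
    expressed in percentage points (pp)."""
    return n_reports * gain_pp / 100.0


GAIN_PP = 12.2                     # reported gain versus the no-AI control group
HYPOTHETICAL_DAILY_REPORTS = 1000  # assumption for illustration only

extra = additional_correct_reports(HYPOTHETICAL_DAILY_REPORTS, GAIN_PP)
print(f"About {extra:.0f} additional correct reports per "
      f"{HYPOTHETICAL_DAILY_REPORTS} reports")
```

Note that a percentage-point gain is an absolute difference in accuracy. Expressing it as a relative improvement would require the baseline accuracy of the control group, which is not given here.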

Simple diagnostic outputs and differential diagnoses, on the other hand, fared significantly worse. The finding regarding differential diagnosis is particularly revealing: In cases where the AI model delivered an incorrect assessment, physicians followed the list more frequently than they would have with a simple single diagnosis. The differential diagnosis conveys an impression of completeness. It presents multiple possibilities and thus creates the feeling that the diagnostic space has already been fully covered. This leads physicians to reduce their own critical thinking – especially in the case of rare or complex conditions that do not even appear in the presented list.

Automation bias: The underestimated risk in everyday clinical practice

The phenomenon that the LMU study so impressively illustrates is known in the research literature as automation bias. It describes the tendency of people to follow the recommendations of automated systems even when their own perception or expertise contradicts them. Automation bias is not a sign of incompetence. It is a deeply human cognitive pattern that stems from evolutionary heuristics: those who trust efficient systems conserve cognitive resources. In most everyday situations, this is functional. In medicine, however, it can be fatal.

Previous studies suggest that time pressure aggravates automation bias. A study on AI-supported clinical decision support in pathology found that AI integration led to a statistically significant overall improvement in performance, but also produced an automation bias rate of 7 percent – meaning cases in which initially correct assessments were overturned by erroneous AI recommendations. Time pressure did not increase the frequency of the bias, but it did increase its severity. The parallels to radiological practice, where radiologists in some hospitals have to produce more than one hundred reports per shift, are obvious.
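
The automation bias rate described above can be illustrated with a small calculation. This is a minimal sketch under stated assumptions: the record fields, the function name, and the choice of denominator (all reviewed cases) are hypothetical and not taken from the pathology study, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass
class CaseRecord:
    initial_correct: bool  # clinician's assessment before seeing the AI output
    ai_correct: bool       # whether the AI recommendation was correct
    final_correct: bool    # clinician's assessment after seeing the AI output


def automation_bias_rate(cases: list[CaseRecord]) -> float:
    """Share of all cases in which an initially correct assessment became
    incorrect after an erroneous AI recommendation.

    Assumption: the denominator is the total number of cases."""
    if not cases:
        raise ValueError("at least one case is required")
    overturned = sum(
        1
        for c in cases
        if c.initial_correct and not c.ai_correct and not c.final_correct
    )
    return overturned / len(cases)


# Invented example data: one of four cases shows the bias pattern.
sample = [
    CaseRecord(initial_correct=True, ai_correct=True, final_correct=True),
    CaseRecord(initial_correct=True, ai_correct=False, final_correct=False),
    CaseRecord(initial_correct=True, ai_correct=False, final_correct=True),
    CaseRecord(initial_correct=False, ai_correct=True, final_correct=True),
]
print(f"Automation bias rate: {automation_bias_rate(sample):.0%}")
```

The third record shows the opposite of the bias: the clinician kept a correct assessment despite a wrong recommendation, so it does not count toward the rate.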

The LMU study now shows that the way AI is explained is a crucial factor in moderating this risk. Step-by-step explanations make the model's line of reasoning transparent and allow the physician to compare it with their own expertise – a process that makes errors in the model easier to identify and simultaneously encourages active cognitive engagement rather than passive acceptance.

The economics of explainability: What good AI really costs

From an economic perspective, the LMU study opens up an important debate that is often overlooked in market-driven growth forecasts for AI in healthcare. The global market for artificial intelligence in healthcare was estimated at around 28 to 39 billion US dollars for 2025 and is projected to grow to over 500 billion US dollars by 2034, with annual growth rates exceeding 34 percent. However, these figures primarily describe the market for AI products – not the actual economic value these products generate in clinical use.
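
As a plausibility check, the growth rate implied by these market figures can be computed directly. The sketch below assumes the 2034 projection is exactly 500 billion US dollars (the text says "over 500 billion") and uses the two ends of the 2025 estimate as starting values; the function name is chosen for illustration.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate that takes start_value to end_value
    over the given number of years."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


YEARS = 2034 - 2025      # nine years of growth
END_BN_USD = 500.0       # assumption: "over 500 billion" treated as 500

cagr_high_base = implied_cagr(39.0, END_BN_USD, YEARS)  # upper 2025 estimate
cagr_low_base = implied_cagr(28.0, END_BN_USD, YEARS)   # lower 2025 estimate
print(f"Implied annual growth: {cagr_high_base:.1%} to {cagr_low_base:.1%}")
```

Under these assumptions the implied annual growth comes out at roughly 33 to 38 percent, so the quoted growth rate and the 2034 projection are broadly consistent with each other.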

This is precisely where the problem lies. A systematic review published in 2025 on the economic evaluation of AI in radiology analyzed more than 1,800 publications and found only 21 studies that actually quantified the costs, savings, or cost-effectiveness of AI tools. The vast majority of the evidence is based on modeled scenarios, not on real clinical implementations. Even more seriously, the real data shows that AI in radiology does not automatically save costs. The economic value is highly context-dependent: it tends to be positive with high volume, a shortage of radiologists, or resource-intensive tasks. However, it can also be negative—if insufficient specificity leads to more follow-up examinations, or if usage-based licensing models negate the efficiency gains achieved with high case volumes.

The explainability of AI outputs is not merely an academic luxury – it is a tangible economic variable. An AI with which physicians achieve 12.2 percentage points higher diagnostic accuracy when its outputs are explained using a chain-of-thought approach generates significantly higher clinical and economic value than an AI that simply provides a diagnosis, assuming the same model quality. Translated into cost terms, this means: avoided misdiagnoses, reduced follow-up examinations, shorter treatment durations, and a lower error rate. The benefits are real, even if they are difficult to quantify in euros – because misdiagnoses have direct medical costs as well as indirect costs due to extended hospital stays, legal risks, and a loss of trust in the healthcare system.

Explainable AI as a strategic necessity within the regulatory framework

The EU AI Act, which has been in force since August 2024, classifies almost all clinical AI applications – diagnostic tools, therapy planning systems, and digital monitoring applications – as high-risk. This entails extensive obligations: technical documentation, risk and quality management, continuous monitoring, and explicit transparency requirements. From August 2028, following the updated Digital Omnibus Package, which the EU Council and Parliament provisionally agreed upon on May 7, 2026, the full requirements for medical device manufacturers will apply.

The regulatory core of these rules is clear: High-risk AI must be comprehensible to users. Decision-making processes must be transparent, and recommendations must be contestable. What the EU AI Act normatively requires is empirically supported by the LMU study: Explainability is not merely a compliance requirement. It is the prerequisite for the safe use of AI in high-risk clinical situations. The new regulation thus compels manufacturers of AI systems in healthcare to address the nature and quality of their output – not just the technical accuracy of their models.

From a strategic perspective, this creates an interesting market dynamic. Providers who take their explanatory power seriously and invest in transparent, chain-of-thought-like output formats will be better positioned from a regulatory standpoint. At the same time, they will demonstrably achieve better clinical outcomes. The competition for AI solutions in healthcare will therefore shift in the future from the question of technical model accuracy to the question of clinical usability – a paradigm shift with significant consequences for the entire industry.

When AI is convincing: How “plausible errors” can become dangerous for doctors

Skills shortage as a catalyst for uncritical AI adoption

The findings of the LMU study take on particular significance in light of the structural shortage of skilled professionals in the German healthcare system. Radiology is a specialty that, in Germany—as in many other European countries—is under considerable staffing pressure. At the same time, the volume of imaging findings is exploding due to the ever-increasing use of CT, MRI, and other imaging techniques. This pressure creates a context in which the temptation is great to quickly adopt AI recommendations instead of critically examining them.

Automation bias is particularly dangerous in this context. When a radiologist is under time pressure and the AI presents a list of plausible-sounding diagnoses, the path to uncritical acceptance is short. The LMU study shows that well-designed, explanatory AI output can counteract this – but only if physicians actively read and review the explanations. This requires that AI systems be integrated into clinical workflows in such a way that sufficient time remains for this critical evaluation. Those who introduce AI merely as a tool for acceleration, without considering the quality of the interaction, risk achieving the opposite of what is desired: faster, but more error-prone diagnoses.

The Bertelsmann Foundation estimates that Germany is missing out on productivity gains of up to 16 percent due to a lack of AI expertise – equivalent to billions in lost revenue. In the healthcare sector, this effect is even more complex to measure because the value is expressed not in revenue but in health outcomes. Nevertheless, the underlying logic is the same: the potential of AI can only be realized if users are competent enough to critically evaluate AI outputs – and if the AI systems themselves are designed in such a way that critical evaluation is both possible and encouraged.

Differential diagnoses and the deceptive sense of security

One of the most subtle findings of the LMU study deserves special attention because it contradicts clinical intuition. Differential diagnoses are considered a sign of clinical diligence in medicine. They demonstrate that a physician considers multiple possibilities and does not prematurely settle on a diagnosis. However, in interaction with an AI system, precisely this type of output can be problematic.

The underlying mechanism is easily explained psychologically: A list of differential diagnoses gives the impression that the problem has already been exhaustively considered. The information density of this output is high, which signals cognitive relief. Consequently, physicians tend to think less beyond the listed diagnoses and to rely less on their own independent judgment. If the model produces erroneous or incomplete differential diagnoses at this moment, which language models certainly do, the likelihood that the error is adopted is higher than with a single diagnosis clearly marked as preliminary.

Chain-of-thought explanations counteract this because they explicitly identify uncertainties, disclose exclusionary factors, and thus communicate the epistemic openness of the model. Physicians are invited to question the model – and are therefore better able to correct it where it is flawed.

Generalizability: What the finding means beyond radiology

Stefan Feuerriegel, corresponding author of the study, explicitly emphasizes that the findings extend far beyond radiology. Large language models are increasingly being used for decisions in everyday life and at work – in law, finance, management consulting, and education. Wherever people use AI output as the basis for consequential decisions, the same questions arise: Do I critically examine the recommendation, or do I adopt it for reasons of efficiency? Do I understand the reasoning, or do I rely on the AI because the result sounds plausible?

The warning against "convincing-sounding errors" is particularly important. Language models are capable of producing explanations that appear structurally correct and rhetorically persuasive—yet are factually incorrect. This is a well-known phenomenon, referred to in the research literature as "hallucination," and cannot be completely eliminated simply by optimizing the models' performance. While step-by-step explanations offer an improved opportunity for critical review, they do not entirely protect against this risk. The responsibility for the final decision always remains with the human.

From an economic perspective, this can be interpreted as an argument for differentiated user competence: Those who want to benefit sustainably from AI tools, be it in medicine, law, or management consulting, must not only know how to operate them, but also how to evaluate their outputs. This competence can be learned, but requires targeted training and professional development. Institutions that invest in this competence will utilize AI systems more effectively than those that treat AI as an autonomous decision-making tool.

Explainable AI and the Trust Problem: A Systemic Perspective

Trust is not a soft factor in medicine – it is a hard economic value. Patients who trust their doctors are more likely to follow treatment recommendations, report symptoms earlier, and demonstrably have better treatment outcomes. This trust has now been expanded to include another dimension: it increasingly encompasses trust in the AI systems involved in diagnosis and treatment planning.

The concept of explainable AI – referred to in the literature as XAI, Explainable Artificial Intelligence – addresses precisely this trust issue. It's not about making models less complex, but about making their decision-making processes understandable to relevant user groups. "Understandable" is not an absolute term: what is a helpful step-by-step explanation for an experienced radiologist may be too detailed or misleading for a general practitioner without a specialization in medical imaging. Therefore, XAI must be considered not only from a technical perspective, but also with user and context in mind.

From the manufacturers' perspective, this means that developing effective AI explanations is not trivial. It requires a deep understanding of clinical workflows and the cognitive demands of the respective user groups. Chain-of-thought explanations, which performed best in the study, are not merely a technical output format – they are the result of a carefully designed interaction. This design requires resources, but it demonstrably creates value – for patients, physicians, and society.

Regulatory obligations and clinical reality: A pragmatic outlook

The transitional periods of the EU AI Act give manufacturers and operators of AI systems in healthcare time to adapt. According to the new regulations of the Digital Omnibus Package, the final deadline for medical device manufacturers is August 2028. However, this period should not be misunderstood as a postponement, but rather as a structured transition in which the findings of clinical research – such as those of the LMU study – can be incorporated into product development.

Specifically, this means for hospitals and hospital technicians: The evaluation of AI systems should measure not only technical accuracy, but also the quality of the output in clinical use. Chain-of-thought explanations and similar transparent output formats should be considered selection criteria during procurement. Training for physicians using AI tools must explicitly address automation bias and the critical review of AI recommendations. Finally, clinical quality assurance systems should document the adoption of AI recommendations to identify systematic errors early on.
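
The documentation of adopted AI recommendations mentioned above could start from a very simple log. The following is a hedged sketch of one possible structure, not a description of any existing hospital system: the event fields, the format labels, and the metric are hypothetical, and the sample events are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AdoptionEvent:
    output_format: str    # hypothetical labels, e.g. "chain_of_thought"
    ai_was_correct: bool  # established later, e.g. by follow-up or peer review
    adopted: bool         # the final report followed the AI recommendation


def wrong_adoption_rates(events: list[AdoptionEvent]) -> dict[str, float]:
    """Per output format: share of incorrect AI recommendations that were
    nevertheless adopted in the final report."""
    wrong = defaultdict(int)
    wrong_adopted = defaultdict(int)
    for e in events:
        if not e.ai_was_correct:
            wrong[e.output_format] += 1
            if e.adopted:
                wrong_adopted[e.output_format] += 1
    return {fmt: wrong_adopted[fmt] / n for fmt, n in wrong.items()}


# Invented log entries for illustration only.
log = [
    AdoptionEvent("differential", ai_was_correct=False, adopted=True),
    AdoptionEvent("differential", ai_was_correct=False, adopted=False),
    AdoptionEvent("chain_of_thought", ai_was_correct=False, adopted=False),
    AdoptionEvent("chain_of_thought", ai_was_correct=True, adopted=True),
]
for fmt, rate in wrong_adoption_rates(log).items():
    print(f"{fmt}: {rate:.0%} of incorrect recommendations adopted")
```

A rising rate for one output format or one case type would be a signal to review the workflow or the training for that setting. Formats without any incorrect recommendation in the log are omitted from the result rather than reported as zero.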

For developers and providers of AI solutions in healthcare, the message is clear: Investing in explainability is not an optional add-on. It is the crucial lever that transforms a technically sound model into a clinically effective and regulatory-compliant tool.

The overarching theme: How humans and machines can become smarter together

The LMU study ultimately contributes to a larger question that extends far beyond radiology and medicine: How must AI systems be designed so that they augment human thinking instead of replacing it or – worse – undermining it? The answer is: through transparency, traceability, and actively encouraging critical examination.

This is not a techno-romantic ideal. It is an empirically supported, economically sound, and ethically imperative design principle. In a healthcare system under increasing performance pressure, reliant on digital tools, and simultaneously required to meet the highest quality standards, the question "How does your AI explain its recommendations?" could soon become the most important procurement question in clinical settings.

A good AI response is not only correct – it is verifiable. Those who consistently translate this principle into the development, procurement, and deployment of AI systems will not only achieve better medical outcomes. They will also gain the trust that the profound digitalization of healthcare urgently needs – the trust of physicians, patients, and society as a whole.
