Revolutionary ChatGPT Vs. Traditional Medicine: The Surprising Verdict

This article summarizes a recently published study comparing the diagnostic accuracy of physicians aided by AI with that of physicians using conventional resources alone.

The study also produced a surprising result when the researchers evaluated how well a large language model could make diagnoses on its own.

Introduction

Artificial intelligence (AI), particularly large language models (LLMs), has shown promise in diagnosing challenging clinical cases. Early research on LLMs indicates they can perform well on medical exams and reasoning tasks.

However, a key question remains: Does giving physicians access to an LLM improve their diagnostic reasoning compared with conventional resources alone?

A new study published in JAMA Network Open on October 28, 2024, entitled “Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial,” addresses this question.

The study, conducted across multiple academic medical institutions, evaluated physicians’ performance when aided by an LLM chatbot compared to standard references like medical guidelines or textbooks.

The surprising result: While the LLM alone outscored the physicians, giving physicians access to the AI did not provide a clear advantage over conventional resources in this trial.

A doctor collaborating with an AI assistant

Methods Overview

Study Design

  • Type of Study: Single-blind randomized clinical trial
  • Timeline: November 29 to December 29, 2023
  • Setting: Multiple academic medical institutions; participants joined remotely via video conferencing or in person at one site.
  • Participants: 50 physicians (26 attendings, 24 residents) with backgrounds in family medicine, internal medicine, or emergency medicine
  • Randomization:
    • LLM Group (Intervention): Physicians could use a large language model chatbot in addition to standard clinical references.
    • Conventional Group (Control): Physicians had access only to typical diagnostic resources (e.g., clinical guidelines, textbooks, online databases) but not the LLM.

Both groups solved up to six clinical cases within a 60-minute window. Their performance was measured using a structured rubric that assessed diagnostic reasoning skills, time spent per case, and final diagnosis accuracy.
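
For readers who like to see the mechanics, a 1:1 random allocation of this kind can be sketched in a few lines of code. This is purely illustrative; the participant IDs and the even 25/25 split are assumptions, not the trial’s actual allocation procedure.

```python
import random

random.seed(42)  # fixed seed so this illustration is reproducible

# Hypothetical participant IDs for the 50 enrolled physicians; an even
# 25/25 split is assumed here for simplicity.
physicians = [f"MD{i:02d}" for i in range(1, 51)]
random.shuffle(physicians)

llm_arm = physicians[:25]           # chatbot plus conventional resources
conventional_arm = physicians[25:]  # conventional resources only
```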

Structured Reflection Scoring

Rather than simply grading “correct diagnosis vs. incorrect,” researchers designed a structured reflection tool. Participants had to:

  1. List up to three differential diagnoses.
  2. Identify factors supporting and opposing each diagnosis.
  3. Provide a final diagnosis and next steps (e.g., additional tests).

Points were awarded for correct or partially correct entries in each category, and raters were blinded to participants’ group assignments.
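
The full rubric is not reproduced here, but its shape follows directly from the three steps above. As a minimal sketch (the point values and grading labels below are assumptions, not the study’s actual weights), per-case scoring might be tallied like this:

```python
from dataclasses import dataclass, field

# Hypothetical point values; the study's actual rubric weights are not shown here.
POINTS = {"correct": 2, "partial": 1, "incorrect": 0}

@dataclass
class DifferentialEntry:
    diagnosis_grade: str                                    # "correct" | "partial" | "incorrect"
    supporting_grades: list = field(default_factory=list)  # grades for supporting factors
    opposing_grades: list = field(default_factory=list)    # grades for opposing factors

@dataclass
class CaseResponse:
    differentials: list          # up to three DifferentialEntry items
    final_diagnosis_grade: str
    next_steps_grade: str

def score_case(response: CaseResponse) -> float:
    """Tally rubric points across all graded items, normalized to 0-100."""
    earned, possible = 0, 0
    for entry in response.differentials:
        graded = [entry.diagnosis_grade, *entry.supporting_grades, *entry.opposing_grades]
        earned += sum(POINTS[g] for g in graded)
        possible += len(graded) * POINTS["correct"]
    earned += POINTS[response.final_diagnosis_grade] + POINTS[response.next_steps_grade]
    possible += 2 * POINTS["correct"]
    return 100 * earned / possible

example = CaseResponse(
    differentials=[DifferentialEntry("correct", ["correct", "partial"], ["correct"])],
    final_diagnosis_grade="partial",
    next_steps_grade="correct",
)
print(f"case score: {score_case(example):.0f}%")  # -> case score: 83%
```

Scoring this way rewards sound reasoning even when the final diagnosis is missed, which is the point of the structured reflection design.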


Clinical Vignettes: Layman-Friendly Summary

The clinical vignettes were adapted from a landmark study on computer-based diagnostic systems. Although the exact cases remain confidential, here’s how they were selected and scored:

  • Case Selection:
    • Physicians selected a range of conditions, avoiding both trivial and exceedingly rare diseases.
    • Six final cases were included, each requiring thorough diagnostic reasoning.
    • The language was modernized (e.g., updated lab values), and specific telltale terms were replaced with more general descriptions to minimize “giveaway” clues.
  • Assessment Beyond Final Diagnosis:
    Each participant’s written rationale was scored for:
    • Correct diagnoses (including partial credit).
    • Accuracy of supporting/opposing factors for each diagnosis.
    • Specificity of the final chosen diagnosis.
    • Next diagnostic steps (e.g., recommended tests).

This holistic approach gave insights into how well the physician reasoned through the case, not just whether they arrived at the correct conclusion.


Results

Primary Findings

1. LLM-Assisted Physicians vs. Conventional Resources

    • Median Diagnostic Reasoning Scores:
      • LLM group: 76%
      • Conventional group: 74%
    • The 2-percentage-point difference was not statistically significant (P = .60).

2. Time Spent per Case

    • LLM group: 519 seconds (around 8.6 minutes)
    • Conventional group: 565 seconds (about 9.4 minutes)
    • Difference: 46 seconds (P = .20), also not significant.

3. LLM Alone Scores 90%

    • The LLM achieved a 90% diagnostic reasoning score in a secondary exploratory analysis—roughly 16 percentage points higher than the conventional resources group (P = .03).
    • Yet, this standalone high performance did not translate into significantly improved scores for physicians using the LLM.

Visual abstract: Large Language Model Influence on Diagnostic Reasoning (JAMA Network Open, October 2024)
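
For intuition about what “not statistically significant” means here, the toy comparison below uses made-up per-case scores, not the study’s data. A trial like this would typically analyze scores with a model that accounts for each physician answering multiple cases; the simple nonparametric test below is only for illustration.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Made-up per-case scores (percent) for illustration only; not the study's data.
llm_group = np.array([80, 71, 78, 76, 69, 83])
conventional_group = np.array([74, 70, 77, 72, 75, 73])

print(f"median, LLM group:          {np.median(llm_group):.0f}%")
print(f"median, conventional group: {np.median(conventional_group):.0f}%")

# Simple two-sided nonparametric comparison; a small median gap like this
# can easily yield a large P value with so few observations.
stat, p = mannwhitneyu(llm_group, conventional_group, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, P = {p:.2f}")
```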

Discussion

Physicians + LLM: Why No Gain?

According to the authors, merely providing physicians with an LLM interface did not translate into superior performance. Possible reasons include:

  • Workflow Integration: Using the LLM may have disrupted the natural diagnostic process, or physicians may have lacked structured training on how to incorporate AI suggestions effectively.
  • Trust and Familiarity Issues: Physicians may have been skeptical of AI-generated answers and chosen not to rely fully on the chatbot’s advice.
  • Contextual Limitations: The study was “acontextual,” evaluating only the reasoning step and not real-world nuances like patient interaction or dynamic follow-up questions.

LLM Alone Performing Better

Notably, the LLM itself—prompted with carefully curated instructions—outperformed both physician groups.

This mirrors findings from other AI research: well-tuned models can excel in controlled scenarios. However, implementing such technology in day-to-day practice is more complex.

No Significant Difference in Time Spent

The researchers found no clear evidence that LLM access altered how quickly physicians arrived at their conclusions. Larger studies might be needed to see whether more experienced AI users reduce their diagnostic time.

Future Implications

  1. Training in Prompt Engineering:
    Teaching clinicians how to ask AI the right questions (prompting) could enhance synergy; a hypothetical example follows this list.
  2. Integration Into Clinical Workflows:
    Embedding LLM suggestions in electronic health record (EHR) systems might streamline usage and reduce friction.
  3. Caution Against Autonomous Use:
    The authors stress that LLMs are not recommended for independent diagnosis. Physicians’ real-world clinical context is vital for safe and ethical care.
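
To make the first point concrete, here is a hypothetical prompt template of the kind clinicians might be taught to use; the trial’s actual instructions to the chatbot are not reproduced here. It deliberately mirrors the structured reflection format used for scoring.

```python
# Hypothetical prompt template; not the trial's actual chatbot instructions.
PROMPT_TEMPLATE = """\
You are assisting a physician with a diagnostic case.

Case summary:
{case_text}

Please provide:
1. Up to three differential diagnoses, ranked by likelihood.
2. For each diagnosis, the findings that support it and the findings that oppose it.
3. Your single most likely final diagnosis.
4. The next diagnostic steps (e.g., specific tests) that would confirm it.
"""

def build_prompt(case_text: str) -> str:
    """Fill the template with a de-identified case description."""
    return PROMPT_TEMPLATE.format(case_text=case_text)
```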

Can LLMs Replace Physicians One Day?

Although this study does not advocate replacing human doctors, some may argue that the LLM’s high standalone performance (90%) portends a future in which AI takes on more physician duties. Key considerations:

Argument For:

    • LLMs can diagnose accurately in controlled tests.
    • Rapidly evolving technology might integrate additional data streams (labs, imaging) for even higher accuracy.
    • Potentially addresses physician shortages and cost pressures.

Argument Against:

    • Medicine extends beyond diagnosis: empathy, ethical judgment, patient communication, and real-time decision-making are critical.
    • Current AI lacks the human touch, contextual awareness, and adaptability required for complex, dynamic care settings.
    • Liability and regulatory frameworks require human oversight.

In essence, LLMs show great promise but are more likely to augment physicians rather than replace them in the foreseeable future.


Conclusion

The JAMA Network Open trial reveals an interesting paradox: while the LLM achieved impressive standalone diagnostic accuracy (90%), giving physicians access to this same LLM did not significantly boost their performance compared with relying on standard references. These findings underscore that:

  1. AI Tools Are Powerful but must be thoughtfully integrated.
  2. Human-AI Collaboration requires training, trust, and workflow alignment.
  3. Clinical Context remains essential, as raw computational accuracy alone does not necessarily equate to better real-world decisions.

As AI advances, similar studies with larger sample sizes and real-world environments will help clarify how to harness LLMs to improve patient outcomes while preserving medicine’s invaluable human elements.


Reference 

Goh, E., Gallo, R., Hom, J., et al. (2024). Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial. JAMA Network Open, 7(10), e2440969. Published online October 28, 2024. https://doi.org/10.1001/jamanetworkopen.2024.40969

Let me know what you think!