Can AI Chatbots Reason Like Doctors?

Overview

A large language model from OpenAI outperformed physicians on several clinical reasoning tasks using real emergency room records, according to a study published 30 April in Science. The findings arrive amid mixed evidence about chatbot medical advice, with some studies showing strong diagnostic performance and others documenting fabricated citations and flawed advice. The study authors expressed optimism but stressed that the results should not be read as AI replacing doctors.

Key Takeaways

OpenAI's o1-preview model outperformed physicians on several clinical reasoning tasks built from real emergency room records.
The study compared two physicians against two large language models across multiple stages of emergency-room care.
Authors recommended further testing of LLMs in real cases, with physicians seeking second opinions at specific checkpoints.
Researchers warned that chatbots are equally convincing whether they are right or wrong, making errors hard to catch.
Products aimed at medical professionals, such as ChatGPT for Clinicians and ChatGPT for Healthcare, are already entering the market.

Stats & Key Facts

#Study published 30 April in Science
#Nearly half of responses from five popular chatbots to open-ended health questions were flawed in one study
#Two physicians compared against two large language models in the diagnostic tasks

What the study found

The research tested a general-purpose OpenAI model on diagnostic work drawn from actual patient records.

›The model used was o1-preview, since supplanted by newer models.
›Tasks drew on real emergency room records rather than constructed test cases.
›Performance compared physicians and models at multiple stages of emergency-room care.

Coauthor Arjun Manrai of Harvard Medical School said the findings do not mean AI replaces doctors. Coauthor Adam Rodman, a medical educator at Beth Israel Deaconess Medical Center in Boston, said he gets queasy about how some of the results might be used.

Why clinical reasoning is a natural target

Aiding clinical reasoning was one of the earliest stated goals for computing in medicine.

›Clinical reasoning covers the decision-making steps to reach a diagnosis and form a treatment plan.
›Earlier clinical decision support systems were purpose-built with hand-written rules about symptoms, test thresholds, and medication interactions.
›Large language models offer a more general approach as AI capabilities develop.
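
To make the contrast concrete, here is a minimal sketch of how a hand-written-rule system of the kind described above might be organized. Everything in it is hypothetical: the `Patient` and `Rule` types, the `evaluate` function, the placeholder names (`symptom_a`, `lab_y`, `drug_a`), and the threshold of 10.0 are illustrative assumptions, not taken from the study or from any real clinical tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Patient:
    symptoms: set[str]
    labs: dict[str, float]
    medications: set[str]


@dataclass
class Rule:
    name: str
    applies: Callable[[Patient], bool]  # hand-written condition
    advice: str                         # message shown when the rule fires


# Hypothetical rules with placeholder names and values (not clinical guidance).
RULES = [
    Rule(
        name="symptom_pair",
        applies=lambda p: {"symptom_a", "symptom_b"} <= p.symptoms,
        advice="Consider condition X",
    ),
    Rule(
        name="lab_threshold",
        applies=lambda p: p.labs.get("lab_y", 0.0) > 10.0,
        advice="Flag lab_y above threshold",
    ),
    Rule(
        name="drug_interaction",
        applies=lambda p: {"drug_a", "drug_b"} <= p.medications,
        advice="Warn: drug_a with drug_b",
    ),
]


def evaluate(patient: Patient, rules: list[Rule] = RULES) -> list[str]:
    """Return the advice of every rule whose condition matches the patient."""
    return [rule.advice for rule in rules if rule.applies(patient)]


patient = Patient(
    symptoms={"symptom_a", "symptom_b"},
    labs={"lab_y": 12.0},
    medications={"drug_a"},
)
print(evaluate(patient))
```

Each rule covers only the situation its author anticipated, which is why such systems had to be purpose-built; a large language model, by contrast, takes free-text input and is not limited to a fixed rule list.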

Concerns about chatbot reliability

Other research raises doubts about the trustworthiness of chatbot medical advice.

›In one study, nearly half of responses that five popular chatbots gave to open-ended health questions were flawed.
›Chatbots fabricated information and citations and answered confidently regardless of accuracy.
›Arya Rao, who studies AI in medical practice, noted a risk that is not being quantified or mitigated.

Doctor-facing tools versus consumer questions

Using an LLM as a clinical decision-support tool for doctors is a different task from answering everyday health questions.

›Physicians have a better sense of what information helps an LLM reach an accurate diagnosis.
›Physicians carry background knowledge to identify obvious mistakes.
›Detecting hallucinations could still be challenging because models are equally convincing when right or wrong.

Adam Rodman said the field needs to find workflows with a low rate of errors.

Calls for real-world trials

Researchers say the moment is right to test these systems in practice.

›Authors recommended further testing in real-life cases with physicians seeking second opinions at checkpoints.
›Mickael Tordjman of the Icahn School of Medicine called for more proof in prospective clinical trials.
›Newer or medically trained LLMs might perform even better, according to Tordjman.

Frequently Asked Questions

Which model outperformed physicians in the study?

OpenAI's o1-preview, a general-purpose model that has since been replaced by newer models, outperformed physicians on several clinical reasoning tasks.

Where was the study published?

It was published 30 April in the journal Science.

Do the authors believe AI will replace doctors?

No. Coauthor Arjun Manrai said the findings do not mean AI replaces doctors, and coauthor Adam Rodman raised concerns about how results might be misused.

What are the main risks the article identifies?

Chatbots can fabricate information and citations, give flawed advice, and appear equally convincing whether right or wrong, which makes hallucinations hard to detect.

Are AI products for clinicians already available?

Yes. OpenAI has introduced ChatGPT for Clinicians and ChatGPT for Healthcare, two products aimed at medical professionals.

The study suggests promise for AI in clinical reasoning while underlining the need for real-world trials and careful human oversight.