New Study Gives ChatGPT High Marks as a CDS Tool

Mass General Brigham researchers say the large language model AI chatbot is almost as good at making clinical decisions as a med school graduate

Healthcare executives looking for support in developing a ChatGPT tool for their clinicians should take a look at the latest research coming out of Boston.

Investigators from Mass General Brigham have found that a large language model (LLM) AI chatbot is 72% accurate in making clinical decisions across all medical specialties and phases of care, and the tool is 77% effective in making a final diagnosis.

Those results make a good case for using the technology as a clinical decision support tool for clinicians, but not, as some might fear, as a replacement for them.

“Our paper comprehensively assesses decision support via ChatGPT from the very beginning of working with a patient through the entire care scenario, from differential diagnosis all the way through testing, diagnosis, and management,” Marc Succi, MD, associate chair of innovation and commercialization and strategic innovation leader at Mass General Brigham and executive director of the MESH Incubator, said in a press release announcing the study's results.

“No real benchmarks exist, but we estimate this performance to be at the level of someone who has just graduated from medical school, such as an intern or resident,” he added. “This tells us that LLMs in general have the potential to be an augmenting tool for the practice of medicine and support clinical decision making with impressive accuracy.”

The study, recently published in the Journal of Medical Internet Research, is the latest step in the whirlwind romance between healthcare and AI, and LLMs like the ChatGPT tool in particular. While some fear the technology could someday supplant clinicians, those who've been in the arena for a while say it holds value in giving clinicians the information they need at their fingertips to make decisions.

And those study results subtly point out that while LLMs are good, they aren't good enough to replace anybody.

In the study, Succi noted that ChatGPT was only 60% effective in making differential diagnoses, and it was only 68% accurate in making clinical management decisions, such as deciding what medication to prescribe after making a correct diagnosis.

“ChatGPT struggled with differential diagnosis, which is the meat and potatoes of medicine when a physician has to figure out what to do,” Succi, who co-authored the study, said in the press release. “That is important because it tells us where physicians are truly experts and adding the most value—in the early stages of patient care with little presenting information, when a list of possible diagnoses is needed.”

AI in clinical care needs "to include clinician voices at the front end, not as an afterthought," American Medical Association President Jesse Ehrenfeld, MD, MPH, said during the AIMed Global Summit this past June in San Diego.

The AIMed conference, which saw attendance skyrocket to some 1,500 people this year, served as a forum to discuss how the technology (called "augmented intelligence" rather than "artificial intelligence") should be gradually adopted by healthcare. Ehrenfeld pointed out that the industry botched the roll-out of the electronic health record by rushing things and forcing clinicians to use the platform before they were comfortable with it.

"There is enthusiasm about this disruptive technology," he said, but "the existing regulatory framework is clearly not equipped to handle" AI governance.

That's why pilot projects and studies like the one done by Mass General Brigham are important. Healthcare leaders need to see how the technology can and should be used before they use it.

Hospital officials say they'll be doing more research on AI tools like ChatGPT, including studying whether the technology can improve patient care and outcomes, particularly in areas where access to information and resources is strained or limited.

“Mass General Brigham sees great promise for LLMs to help improve care delivery and clinician experience,” Adam Landman, MD, MS, MIS, MHS, chief information officer and senior vice president of digital at Mass General Brigham and the study's co-author, said in the press release. “We are currently evaluating LLM solutions that assist with clinical documentation and draft responses to patient messages with focus on understanding their accuracy, reliability, safety, and equity. Rigorous studies like this one are needed before we integrate LLM tools into clinical care.”

Eric Wicklund is the associate content manager and senior editor for Innovation at HealthLeaders.

KEY TAKEAWAYS

The Mass General Brigham study found that ChatGPT was 72% effective in making clinical decisions and 77% effective in making a final diagnosis, but it was only 60% accurate in making differential diagnoses and 68% accurate in clinical management decisions.

AI advocates say the technology can be effective as a clinical decision support tool, giving clinicians on-demand access to the information they need to treat patients.

Many worry that the technology is being used too quickly in healthcare, before enough research is done and governance is established.