
NEW YORK – Currently available large language models (LLMs) are not accurate enough to use safely for clinical decision support, but physician-scientists are continuing to test artificial intelligence (AI)-driven chatbots to better understand their limits when used to navigate complex precision oncology treatment options.
LLMs such as OpenAI's ChatGPT have the potential to help oncologists sort through increasingly complex targeted therapy options for cancer patients, but the tools continue to be hampered by inaccuracies and hallucinations. As a step toward improving and adapting chatbots for use by oncologists, researchers from the University of Illinois Chicago (UIC) and elsewhere developed a performance metric, the Generative artificial intelligence Performance Score (G-PS), that takes into account both the accuracy of predictions made by an LLM-based chatbot and the frequency of hallucinations.
In a paper published in the Journal of Clinical Oncology in December, Ryan Nguyen, a medical oncologist at UIC, and colleagues reported results from a study using G-PS to assess how well two versions of ChatGPT, GPT 3.5 and GPT 4, recommend treatment options in lung cancer.
Nguyen, an oncologist who specializes in treating lung cancer and melanoma, said the proliferation of new targeted therapies in non-small cell lung cancer inspired his team to conduct the study. "Since I graduated fellowship two-and-a-half years ago, there have been at least 20 FDA approvals in the lung cancer space," Nguyen said.
In the precision oncology space in particular, oncologists are finding it hard to keep up with all the new drug approvals, rapid discoveries of targetable biomarkers, and constantly changing treatment and biomarker testing guidelines. In Precision Medicine Online's annual survey, oncologists in recent years have consistently cited the rapid pace of advances as one of their biggest challenges to implementing precision oncology.
"We saw ChatGPT and generative AI as a tool that [could] help oncologists stay on top of the literature and see what the most relevant recommendations are," Nguyen said.
Nguyen and his collaborators prompted the chatbots to generate, in the style of a next-generation sequencing report, a list of first-line treatment options for a hypothetical patient with stage IV NSCLC harboring a specific oncogenic driver mutation.
That prompt was repeated 10 times for each of the eight oncogenes with US Food and Drug Administration-approved targeted therapies in the first-line NSCLC setting listed in the National Comprehensive Cancer Network's 2021 guidelines, version five, the most recent version available in GPT 3.5's and GPT 4's training data. Those alterations included the EGFR exon 21 L858R mutation, EGFR exon 19 deletion, ALK rearrangement, ROS1 rearrangement, BRAF V600E mutation, NTRK1/2/3 gene fusion, MET exon 14 skipping mutation, and RET rearrangement.
The researchers submitted each prompt in a separate chat session to ensure an independent response. Human reviewers then cross-referenced the chatbot's recommendations against NCCN guidelines to determine which were accurate and relevant for each prompt.
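A protocol of this kind lends itself to a simple script. Below is a minimal sketch of how such a repeated-prompting loop might look using the OpenAI Python client; the prompt wording, biomarker labels, and function name are illustrative placeholders rather than the study's actual code.

```python
# Minimal sketch of a repeated-prompting protocol (not the study's actual code).
# Assumes the OpenAI Python client (openai>=1.0) and an API key in the environment.
from openai import OpenAI

client = OpenAI()

# Shorthand labels for the eight first-line NSCLC biomarkers named in the article.
BIOMARKERS = [
    "EGFR exon 21 L858R mutation",
    "EGFR exon 19 deletion",
    "ALK rearrangement",
    "ROS1 rearrangement",
    "BRAF V600E mutation",
    "NTRK1/2/3 gene fusion",
    "MET exon 14 skipping mutation",
    "RET rearrangement",
]

PROMPT_TEMPLATE = (
    "Create a report in the style of a next-generation sequencing report "
    "listing first-line treatment options for a hypothetical patient with "
    "stage IV non-small cell lung cancer harboring a {biomarker}."
)

def collect_responses(model: str, repeats: int = 10) -> dict[str, list[str]]:
    """Submit each biomarker prompt `repeats` times, one chat session per call."""
    responses: dict[str, list[str]] = {b: [] for b in BIOMARKERS}
    for biomarker in BIOMARKERS:
        for _ in range(repeats):
            # Each call starts with no prior messages, so it behaves as an
            # independent session and earlier answers cannot color later ones.
            reply = client.chat.completions.create(
                model=model,
                messages=[{"role": "user",
                           "content": PROMPT_TEMPLATE.format(biomarker=biomarker)}],
            )
            responses[biomarker].append(reply.choices[0].message.content)
    return responses
```

A call such as collect_responses("gpt-4") would yield 80 responses per model, which, as in the study, would still have to be checked by hand against the NCCN guidelines before accuracy or hallucination rates could be tallied.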
The study revealed that GPT 4 performed significantly better than GPT 3.5: GPT 4's accuracy reached 90 percent, compared with around 60 percent for GPT 3.5, and its hallucination rate was 34 percent versus 53 percent for GPT 3.5. When evaluated using the G-PS, which applies a penalty for hallucinations, GPT 3.5 scored -0.15 on a scale from -1 to 1, while GPT 4 scored 0.34. The study authors noted that GPT 4's performance was significantly better than GPT 3.5's for six of the eight mutations.
Nguyen said the G-PS is a tool that can be used to grade the ability of any LLM to generate relevant, accurate treatment recommendations against guidelines while also factoring in a negative component for hallucinations. "We found that GPT 4 did better than GPT 3.5, but ultimately, the important thing that came out of our work is that this is one of the first research articles that tries to combine those factors of relevancy, accuracy, and hallucinations into a single score."
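The article does not reproduce the exact G-PS formula, but the underlying idea of crediting relevant, accurate recommendations while deducting for hallucinations on a -1-to-1 scale can be illustrated with a hypothetical scoring function like the one below. The weighting and normalization here are assumptions chosen for illustration, not the published metric.

```python
# Hypothetical illustration of a combined accuracy/hallucination score in the
# spirit of G-PS; this is not the authors' published formula.
def combined_score(n_accurate: int, n_hallucinated: int, n_expected: int,
                   penalty: float = 1.0) -> float:
    """Reward guideline-concordant recommendations and penalize hallucinations.

    Returns a value between -1 (everything fabricated) and 1 (every expected
    recommendation produced, nothing fabricated). `penalty` sets how harshly
    hallucinations count against the model.
    """
    credit = n_accurate / n_expected  # fraction of expected options recovered
    deduction = penalty * n_hallucinated / max(n_accurate + n_hallucinated, 1)
    return max(-1.0, min(1.0, credit - deduction))

# Example: 9 accurate options out of 10 expected, plus 3 hallucinated extras,
# gives 0.9 - 0.25 = 0.65 under these illustrative assumptions.
print(combined_score(n_accurate=9, n_hallucinated=3, n_expected=10))
```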
Although GPT 4 had higher accuracy than GPT 3.5, Nguyen noted that, surprisingly, GPT 4 was still generating hallucinations in roughly a third of its responses. As an example of the type of hallucination seen in the study, Nguyen said the chatbot offered treatments for a hypothetical NSCLC patient with a BRAF mutation that were only approved in melanoma and had not been tested or approved in lung cancer. Although BRAF mutations are found in both melanoma and NSCLC, doctors can't assume the same treatment works similarly across tumor types harboring the same biomarker in the absence of clinical evidence.
"That's one situation where just because there's a lot of noise about BRAF treatments, and there's some overlap between lung and melanoma, the model wasn't able to make that differentiation, and it would hallucinate that melanoma treatments were appropriate for lung cancer patients," Nguyen said.
Sometimes, the chatbots would recommend immunotherapy for NSCLC patients with an EGFR mutation, even though treatment guidelines call for patients with EGFR-mutated NSCLC to receive an EGFR-targeted treatment in the first-line setting. "That's a situation where we know that patients who get immunotherapy prior to getting their EGFR-targeted therapy actually do worse and have worse side effects when they get on EGFR-targeted therapy," Nguyen said. "That has the potential to not just be wasteful from a time perspective, but also the potential to be harmful [to patients]."
Given the jump in accuracy between GPT 3.5 and GPT 4, Nguyen was optimistic that subsequent versions would show improved performance with better accuracy and lower rates of hallucinations. However, he cautioned that in precision oncology, there's a very high threshold for accuracy in treatment recommendations. "I don't quite feel that these LLMs are [ready] to be used in clinical oncology quite yet. They need more oncology-specific training to minimize the hallucination rate to essentially zero."
Nguyen suggested that the performance of LLMs for this type of task might be improved further by developing a custom model trained only on oncology-specific information. In fact, other researchers are already working on customized natural language tools for clinical decision support in cancer. For example, a group of German researchers reported results from a proof-of-concept study evaluating a small language model (SLM) for breast cancer decision support in October 2024.
Small language models have far fewer parameters than large language models and are typically built on a curated set of resources. Because they are more compact, they can run on smaller platforms such as a desktop computer or even a mobile device.
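To give a concrete sense of what running a model of that size locally can look like, here is a minimal sketch using the Hugging Face Transformers library. The model ID is just one example of a compact open-weight instruction model and has no connection to the German group's breast cancer system.

```python
# Minimal sketch: loading a compact open-weight model on an ordinary workstation
# with Hugging Face Transformers. The model ID below is only an example of the
# small-model class, not the breast cancer SLM described in the study.
from transformers import pipeline

model_id = "Qwen/Qwen2.5-1.5B-Instruct"  # substitute any compact instruction-tuned model

# With roughly 1.5 billion parameters, a model like this loads into the memory
# of a typical desktop CPU; no GPU or external API is required.
generator = pipeline("text-generation", model=model_id)

prompt = "In one paragraph, explain what a clinical treatment guideline is."
print(generator(prompt, max_new_tokens=150)[0]["generated_text"])
```

Because everything runs on local hardware, no patient-related text has to leave the machine, which bears on the privacy point Griewing raises below.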
The researchers set out to determine whether an SLM tailored to German breast cancer treatment guidelines could give accurate treatment recommendations for simulated patients. They found that agreement between the SLM-generated options and recommendations from a molecular tumor board was 86 percent. When they evaluated OpenAI's GPT 4 on the same task, it performed comparably, with 90 percent concordance.
Sebastian Griewing, a gynecologic oncologist at Philipps-University Marburg in Germany and first author on the study, said one advantage an SLM has over an LLM like ChatGPT is privacy: doctors cannot enter sensitive patient data into a publicly available model that is not compliant with health privacy requirements. He added that the LLM also does not provide a rationale or sources for its recommendations, whereas the customized SLM can be made fully explainable.
While the results of the proof-of-concept study were encouraging, Griewing acknowledged that the customized SLM had the same struggles with inaccuracies and hallucinations seen in LLMs. "Although [LLMs] might be used in a clinical setting more often than we would like to admit, we are still in a very early preclinical testing phase," Griewing said. "We need to solve these issues to get those models to a performance [level] where we can safely use them in a clinical setting."
In the future, Griewing is interested in improving the SLM with features such as knowledge graphs and additional context or by subdividing tasks involved in treatment recommendations and assigning them to separate algorithms within an AI system.
Griewing noted that a tool such as an LLM that can help providers sift through ever more complex treatment options could improve care for patients, but it need not be a formal decision support tool. Instead, it could be a smart system that searches and manages treatment guideline documents for doctors. "We have guidelines that are 400 pages long, and especially if you're talking about molecular therapies and precision oncology, those guidelines are getting longer and longer," Griewing said.
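At its simplest, a guideline-search assistant of the kind Griewing describes can be imagined as a retrieval layer over the guideline text. The toy sketch below ranks a few invented placeholder passages against a clinician's question using TF-IDF similarity from scikit-learn; a real system would index the actual guideline documents and likely use stronger retrieval methods.

```python
# Toy sketch of guideline search: rank guideline passages against a clinician's
# question with TF-IDF similarity (scikit-learn). The passages are invented
# placeholders, not text from any real guideline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Section 4.2: biomarker testing recommended before first-line systemic therapy.",
    "Section 7.1: options for hormone receptor-positive disease.",
    "Section 9.3: follow-up imaging intervals after curative-intent treatment.",
]

query = "Which biomarkers should be tested before starting treatment?"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(passages + [query])

# Compare the query (last row) against every passage and surface the best match.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best = scores.argmax()
print(f"Most relevant passage (score {scores[best]:.2f}): {passages[best]}")
```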
In the pursuit of a true LLM-based clinical decision support tool, Griewing cautioned that physicians should not move too quickly to adopt publicly available chatbots to guide their decision-making. "If we're just jumping into the technology, we might be doing things that are not good for our patient," he added.