
NEW YORK – Although the life sciences industry is increasingly using artificial intelligence to develop drugs and diagnostic tests, many doctors and researchers remain skeptical that machine-generated predictions are reliable enough to direct patient care.
The use of machine learning in medical applications was once a novelty, but AI has now become commonplace in healthcare product development. It would be unusual at this point for a pharmaceutical or diagnostics firm not to employ AI somewhere along the pathway from discovery to market for its products. To many industry players, AI offers clear advantages with little downside in data-intensive applications such as target discovery or the design of novel molecular entities.
In the clinical setting, particularly in oncology, AI tools have the potential to personalize diagnosis and treatment for patients by drawing on a variety of data types, from genomics to electronic health records to environmental factors. But here, the shortcomings of AI, such as inadequate training datasets, lack of generalizability, "black box" functionality, and hallucinations, have prompted caution among healthcare providers. Doctors and physician-scientists, grappling with the risk that the present limitations of AI may lead to medical errors and harm patients, aren't ready to fully rely on these tools.
AI algorithms fall into three broad categories: machine learning, natural language processing, and computer vision. Computer vision has been one of the fastest-moving areas of AI development. In recent years, a multitude of new programs and commercial partnerships have cropped up around algorithms that read images, such as radiology scans or histology slides, and detect markers of disease presence or progression with precision equal to or greater than that of human experts.
For example, digital pathology firm PathAI boasts a long list of biotech and pharma partnerships, including one with Roche Tissue Diagnostics inked in February to develop AI-driven companion diagnostics for identifying subsets of patients most likely to benefit from various treatments. Other companies developing AI-based digital pathology tools for precision medicine include firms such as Paige and Tempus, as well as startups like Picture Health and Valar Labs.
In 2021, Paige became the first company to receive US Food and Drug Administration authorization for an AI-based cancer diagnostic tool. The tool, Paige Prostate, analyzes biopsied prostate tissue and identifies areas likely to contain cancer cells, which a human pathologist then evaluates and confirms.
The COVID-19 pandemic expedited the adoption of digital pathology in healthcare. Many pathologists were suddenly working from home, which prompted widespread digitization of slides they could review remotely. These digitized slides were then available to train AI algorithms. Having seen the AI boom in digital pathology, researchers are now hoping to use machine learning and large language models that incorporate other types of patient information, particularly genomic and electronic health record data, to tease out prognostic and predictive patterns that can be used to guide treatments.
Not changing the practice of medicine … yet
One of the algorithms furthest along in development for personalizing treatment comes from GE Healthcare, which is advancing a machine learning model that can identify cancer patients most likely to respond to immunotherapy and those at risk of toxicities such as pneumonitis, hepatitis, and colitis. The algorithm, invented at Vanderbilt University Medical Center, has been trained on data curated from the medical records of about 3,000 cancer patients who received anti-PD-1, anti-PD-L1, or anti-CTLA-4 therapies at the medical center.
The model estimates the probability that a patient will experience those toxicities and predicts their overall survival, factoring in results from medical tests the patient received and other information gathered during routine care. The algorithm had an overall accuracy between 75 percent and 80 percent in a validation analysis, which Jan Wolber, global digital product leader at GE Healthcare, said is "actually a really good result," given that the model is based exclusively on routinely collected patient information. The model performed similarly when applied to a dataset of patients at University Medicine Essen in Germany. GE Healthcare hopes to validate the model in additional datasets and commercialize it as a tool to help doctors guide treatment decisions for patients.
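The general shape of such a workflow, training a classifier on routinely collected tabular features and then checking it against an independent cohort, can be sketched in a few lines of code. The features, cohorts, and model choice below are illustrative assumptions using synthetic data, not details of the GE Healthcare or Vanderbilt work.

```python
# Minimal sketch: fit a toxicity classifier on routine-care features from a
# development cohort, then check accuracy on an internal split and an
# external cohort. All data are synthetic; feature names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_cohort(n, shift=0.0):
    """Generate a synthetic cohort of routine-care features and a toxicity label."""
    df = pd.DataFrame({
        "age": rng.normal(65 + shift, 10, n),
        "baseline_alt": rng.normal(30 + shift, 12, n),       # liver enzyme
        "baseline_creatinine": rng.normal(1.0, 0.3, n),
        "prior_lines_of_therapy": rng.integers(0, 4, n),
    })
    # Synthetic outcome loosely tied to the features
    logit = 0.03 * (df["age"] - 65) + 0.02 * (df["baseline_alt"] - 30)
    df["toxicity"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return df

dev = make_cohort(3000)              # development cohort
external = make_cohort(800, shift=3) # independent cohort with a different case mix

features = ["age", "baseline_alt", "baseline_creatinine", "prior_lines_of_therapy"]
X_train, X_test, y_train, y_test = train_test_split(
    dev[features], dev["toxicity"], test_size=0.2, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("internal validation accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("external cohort accuracy:    ",
      accuracy_score(external["toxicity"], model.predict(external[features])))
```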
Travis Osterman, a medical oncologist at Vanderbilt and a lead researcher on the study, said his team is receiving inquiries from doctors asking how AI technology may affect their clinical practice. While the algorithm he and his colleagues developed has shown promise within clinical trials and may eventually be integrated into patient care, Osterman is cautious in his outlook on the field of AI-driven precision medicine more broadly. "When I look at projects [focused on] predicting treatment for oncology patients, for example, there are a lot of reasons why, currently, artificial intelligence and this space aren't quite ready, and it's not necessarily the best investment," Osterman said.
At GE Healthcare, Wolber sees growing interest in AI tools among doctors but also recognizes the need to build trust. "We have been working at Vanderbilt with clinicians who are very open to AI becoming a companion that provides clinical decision support and informs the oncologist," Wolber said. "We have to make sure that we build trust and make these models as transparent and as explainable as possible, so the clinician feels secure and empowered when they are using them."
Hesitancy about AI in the healthcare marketplace largely stems from a combination of unfamiliarity with the technology, its well-documented weaknesses, and past failures in implementing AI for clinical decision support.
Ronald Razmi, a cardiologist, cofounder of Zoi Capital, and author of AI Doctor: The Rise of Artificial Intelligence in Healthcare, said that in some use cases, such as subtyping certain types of tumors, AI technology is showing promise and potentially near-term utility, but that utility remains to be demonstrated in large, prospective clinical trials. "Anything we talk about right now in terms of the initial promise of AI in some of these areas is speculative," Razmi said. "[AI technologies] haven't changed the practice of medicine yet."
It'll take a couple of decades, he suspects, before some of the AI applications being developed today make their way into real-world use. In the meantime, Razmi worries that without a "sober-minded" discussion of AI's limitations, the hype and large investments surrounding these technologies could push products onto the market before they are ready for prime time.
One example of a "spectacular failure" in AI for clinical decision support, according to Razmi, is IBM Watson for Oncology.
IBM partnered with Memorial Sloan Kettering Cancer Center in 2012 to develop a tool to help doctors individualize cancer diagnoses and treatment recommendations, dubbed IBM Watson for Oncology. The tool was designed to make treatment recommendations by harnessing the computational power and natural language processing ability of the IBM Watson platform to analyze the clinical experience, molecular and genomic data, and cancer case histories at MSKCC, plus all of the latest relevant research.
In a 2021 meta-analysis of nine studies comprising data from 2,463 patients, there was an 80 percent concordance rate between the treatment recommendations made by IBM Watson for Oncology and those by a multidisciplinary team of physicians. But by then, IBM had already discontinued this project. An investigation into internal company documents by Stat News several years earlier had revealed that the supercomputer often made erroneous or even unsafe treatment recommendations. In one example, the algorithm suggested a combination of chemotherapy and Genentech's VEGF inhibitor Avastin (bevacizumab) for a 65-year-old man with lung cancer who was experiencing severe bleeding, despite Avastin's label containing a black box warning against administering it to patients with serious hemorrhaging.
Transparency and explainability
Against that backdrop, oncologists are understandably skeptical of new AI tools being developed to assist with cancer diagnosis and treatment. When GE Healthcare recently surveyed clinicians on their attitudes about the future of healthcare, 61 percent said that AI can support clinical decision-making, but only 42 percent felt that AI data can be trusted. And when considering the views of just US clinicians, only 26 percent said they could trust AI data.
In another survey, focused on US oncologists' attitudes and published in March in JAMA Network Open, 89 percent of respondents said they believed AI would improve cancer treatment decisions, but 84.8 percent also said that AI models need to be explainable if they're used to make clinical decisions. As opposed to an AI tool that operates as a "black box," giving users little or no insight into the inputs and algorithm used to arrive at its results, oncologists largely prefer a tool built on an explainable model that provides a rationale for its output.
Although explainability may seem a straightforward feature to build into an AI tool, the design of many algorithms makes it difficult to identify the reason for a particular output. Machine learning algorithms that work with labeled data are more readily explainable than deep learning algorithms that use unlabeled data, for example. This leads to a trade-off between performance and interpretability: AI excels at processing large and complex datasets, and the types of algorithms best suited to handling them also tend to be deep learning models that rely on unlabeled data. To meet the demand for explainability, researchers are attempting to develop models that balance these two opposing goals.
Tim Rattay, a breast surgeon at the University of Leicester, said his team prioritized explainability in developing an AI tool to predict which breast cancer patients may be at risk of side effects after surgery and radiotherapy. Their model classifies patients into high or low risk of arm lymphedema and also produces a list of features that contributed to the risk assessment for each patient.
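One common way to pair a risk score with a per-patient list of contributing features is to use an inherently interpretable model, such as a logistic regression, where each feature's contribution to the prediction can be read off directly. The sketch below illustrates that general idea with synthetic data and hypothetical feature names; it is not the Leicester group's model.

```python
# Minimal sketch of an explainable risk model: a logistic regression whose
# per-patient explanation is each feature's coefficient times its (scaled) value.
# Synthetic data; feature names are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.2, 0.8, 0.5]) + rng.normal(size=500) > 0).astype(int)
feature_names = ["nodes_removed", "bmi", "axillary_dose"]

pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
scaler = pipe.named_steps["standardscaler"]
clf = pipe.named_steps["logisticregression"]

patient = X[0]
z = scaler.transform(patient.reshape(1, -1))[0]
contributions = clf.coef_[0] * z                    # per-feature contribution to the log-odds
risk = pipe.predict_proba(patient.reshape(1, -1))[0, 1]

print(f"predicted lymphedema risk: {risk:.2f} ({'high' if risk > 0.5 else 'low'})")
for name, c in sorted(zip(feature_names, contributions), key=lambda t: -abs(t[1])):
    print(f"  {name:15s} contribution to log-odds: {c:+.2f}")
```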
Rattay's group is now planning a prospective clinical trial to gauge whether having this additional information from the explainable AI model allows doctors to provide supportive or preventive care that improves patients' outcomes. Patients in the study will first be given a high- or low-risk rating by the AI model. The investigators will then compare outcomes among high-risk patients who receive information about their risk factors from the AI model plus a therapeutic arm sleeve against those who receive neither the information nor the arm sleeve.
The researchers developed their AI model to be explainable to foster confidence in its predictions, Rattay said, and "to address … the mistrust of black box AI among physicians and … the general public."
Bias and other data challenges
Doctors have additional concerns about AI's use in medicine beyond explainability. Although the JAMA Network Open survey did not explore all the potential reasons for oncologists' hesitancy toward AI, "the majority were concerned with biases of AI outputs and their ability to recognize them," said Andrew Hantel, lead author and an oncologist at the Dana-Farber Cancer Institute. He further noted that less than 30 percent of surveyed oncologists felt confident in their ability to protect patients from biased AI predictions or recommendations.
Bias is present in an algorithm when its predictions are systematically less accurate for some subgroups than for others. As an example, in a 2021 Nature Medicine study, researchers reported that classifiers developed using computer vision algorithms consistently underdiagnosed pulmonary abnormalities on chest X-rays in patients from underrepresented groups, including female, Black, and Hispanic patients and those with Medicaid insurance.
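Auditing for this kind of bias can start with something as simple as comparing error rates across groups. The sketch below illustrates the idea with synthetic predictions and labels; the group names and data are invented for illustration.

```python
# Minimal sketch of a subgroup bias audit: compare the underdiagnosis rate
# (missed cases among truly positive patients) across groups.
import pandas as pd

df = pd.DataFrame({
    "group":       ["A", "A", "A", "B", "B", "B", "B", "A"],
    "has_finding": [1,    1,   0,   1,   1,   1,   0,   1],   # ground-truth label
    "predicted":   [1,    1,   0,   0,   1,   0,   0,   1],   # model output
})

positives = df[df["has_finding"] == 1]
underdiagnosis = (positives
                  .assign(missed=lambda d: (d["predicted"] == 0).astype(int))
                  .groupby("group")["missed"]
                  .mean())
print(underdiagnosis)   # a gap between groups flags potential bias worth investigating
```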
Bias in an AI algorithm can occur for many reasons, including unconscious bias on the part of developers, disparate representations of groups in datasets, and biases in selected training features. "Unfortunately, the bias we see in training data is only the tip of the iceberg," Razmi wrote in his book AI Doctor. "All of the data that we use are biased to some degree because they represent specific geographies and demographics, and certain diseases may be over- or underrepresented."
The problem of bias is part of a greater issue often cited by AI tool developers, which has to do with the quantity, quality, and timeliness of the data used to train the models. Researchers have been hampered in obtaining sufficiently large and inclusive datasets to train precision medicine algorithms for many reasons, including a lack of standardized data formats, difficulty obtaining data from multiple institutions, and the complications involved in anonymizing and de-identifying patient data. To protect patient privacy, identifying information must be carefully removed from records, either manually or using automated tools. Either process is prone to error and could expose protected health information. Additionally, researchers must confirm that the patients have given informed consent allowing their data to be used for the development of such algorithms.
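Automated de-identification tools typically work by finding and masking identifier patterns in records. The sketch below shows the general idea with a few hypothetical patterns; real pipelines are far more thorough, and, as noted above, both manual and automated approaches can still miss identifiers.

```python
# Minimal sketch of pattern-based de-identification of free text.
# Patterns and the sample note are illustrative only.
import re

PATTERNS = {
    "DATE":  r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "PHONE": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
    "MRN":   r"\bMRN[:\s]*\d+\b",
}

def redact(note: str) -> str:
    for label, pattern in PATTERNS.items():
        note = re.sub(pattern, f"[{label}]", note, flags=re.IGNORECASE)
    return note

print(redact("Seen on 04/12/2023, MRN: 8675309, call 555-867-5309 with results."))
# -> "Seen on [DATE], [MRN], call [PHONE] with results."
```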
Vanderbilt's Osterman noted that even when the data used to build AI models in healthcare are robust, they are typically already out of date by a year or more by the time the model is deployed. "If your training data aren't [current] up to the month and your system is recommending treatments that are now antiquated because of the results of a recent clinical trial, that's a really hard setup that we haven't been able to crack, yet," he said.
For example, if a model encounters a treatment regimen for a particular indication that it has never seen before, its recommendations might not be accurate because it has not been trained on that regimen. And designing an algorithm that can ingest new data in real time reintroduces the challenges of data formatting and harmonization at every iteration, since real-time patient data arrive fragmented and unstandardized.
Wolber said data curation when building GE Healthcare's AI model for predicting immunotherapy toxicities and outcomes was challenging because most of the data came from electronic health records, which often contain missing data or have data recorded in different units or formats, depending on the institution. "Before you can actually start building models, there is a lot of hard work making sure that the dataset is of as good quality as it can be," Wolber said.
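That kind of cleanup often comes down to reconciling units and formats across institutions and deciding how to handle missing values before any modeling starts. The sketch below illustrates the idea; the column names, sites, and unit conversion are assumptions for illustration, not details of the GE Healthcare dataset.

```python
# Minimal sketch of EHR harmonization: reconcile units across sites and
# impute missing values before modeling. Values are synthetic.
import pandas as pd

records = pd.DataFrame({
    "site":       ["A", "A", "B", "B"],
    "creatinine": [88.4, None, 1.1, 0.9],   # site A reports umol/L, site B mg/dL
})

# Convert site A's values to mg/dL (1 mg/dL is roughly 88.4 umol/L for creatinine),
# then fill missing values with the cohort median.
records.loc[records["site"] == "A", "creatinine"] /= 88.4
records["creatinine"] = records["creatinine"].fillna(records["creatinine"].median())
print(records)
```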
Wolber's team has also put a lot of effort into curating data on toxicities, which are not well recorded in electronic health records as so-called structured data that AI tools can easily analyze; instead, they are more likely to appear in unstructured doctors' notes or in PDFs uploaded into patients' records. "There was a lot of detective work to draw those toxicity pieces out of the unstructured data," Wolber said.
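A first pass at that detective work often looks like keyword or pattern matching over note text, as in the sketch below. The terms and the sample note are illustrative, and the example's own limitation, flagging a negated mention, shows why the real curation effort was far more involved.

```python
# Minimal sketch of pulling toxicity mentions out of unstructured clinical notes
# with naive keyword matching. Terms and note text are illustrative only.
import re

TOXICITY_TERMS = {
    "pneumonitis": r"pneumonitis",
    "colitis":     r"colitis",
    "hepatitis":   r"hepatitis|transaminitis",
}

def find_toxicities(note: str) -> list[str]:
    note = note.lower()
    return [name for name, pattern in TOXICITY_TERMS.items() if re.search(pattern, note)]

note = "Pt developed grade 2 colitis after cycle 3; no evidence of pneumonitis."
print(find_toxicities(note))
# -> ['pneumonitis', 'colitis']: the negated pneumonitis mention is flagged too,
#    one reason real extraction needs more than keyword matching.
```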
Lastly, to model efficacy, Wolber's team also had to ascertain whether patients were still alive or had died after beginning treatment. "We had to then go to death records and reconcile those with patients that we had in the Vanderbilt cohort," Wolber said.
The number of data points that had to be correctly entered before modeling could begin numbered in the tens of thousands. Although the Vanderbilt team initially used a deep learning approach, allowing the algorithm to select which features were most important, Wolber said those models became "unwieldy," and the researchers turned to a simpler machine learning approach using a limited number of features they knew were clinically meaningful.
When starting to develop their model for predicting side effects of radiotherapy, Rattay's team at the University of Leicester also had to confront huge quantities of messy data. For example, one of the datasets from France had a lot of descriptive textual data, Rattay recalled. Because the algorithm could not read the descriptive text, his team worked with collaborators in the Netherlands to convert the text-based content into numeric values.
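Converting descriptive text into something a model can use often means mapping ordered categories to numeric codes. The sketch below shows a simple ordinal encoding; the category labels and column names are hypothetical, not the contents of the French dataset.

```python
# Minimal sketch of encoding descriptive text categories as numbers.
# Labels and mapping are illustrative only.
import pandas as pd

df = pd.DataFrame({"fibrosis_description": ["none", "mild", "moderate", "severe", "mild"]})

# Ordinal encoding preserves the clinical ordering of the descriptions
severity_map = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}
df["fibrosis_grade"] = df["fibrosis_description"].map(severity_map)
print(df)
```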
The complexity of the data harmonization also dictated the approach Rattay's team used in developing the AI model. It was simpler, Rattay said, to initially use the most common types of clinical data in building the model because those were also the easiest to harmonize.
Assume non-generalizability until proven otherwise
Generalizability is another concern surrounding predictions made by AI in healthcare. Predictive models typically perform well on the training and validation datasets they are developed on, but in most cases there is little data available on how well they generalize to truly independent patient samples.
In a study published in Science in January, researchers evaluated the performance of a model designed to predict whether a schizophrenia patient's clinical symptoms would significantly improve over four weeks of antipsychotic treatment. The researchers developed and tested a machine learning model using data from one large, multisite trial and then evaluated its performance on datasets from other, independent trials.
"We showed that machine learning models routinely achieve perfect performance in one dataset, even when that dataset is a large, international multisite clinical trial," said Adam Chekroud, an assistant professor of psychiatry at Yale University. "However, when that exact model was tested in truly independent clinical trials, performance immediately fell to chance levels." The predictive performance of the model remained poor in independent datasets even when the researchers built what should have been a more robust model by aggregating data across a group of similar multisite trials.
This generalizability challenge with predictive models in medicine is "highly concerning for clinicians, researchers, and patients alike," Chekroud said. "This study shifts our prior expectation to be non-generalizability rather than assuming generalizability." Chekroud added that, going forward, models should be proven to generalize in large, independent samples before assuming that they do. If a model can be shown to generalize to independent samples in a real-world context, Chekroud said, that would be a "great signal" that it can be used in clinical practice.
Can generative AI make a difference?
Generative AI, as the name suggests, is a newer, content-generating class of AI, which experts believe could resolve the tension between performance and explainability. Built on AI systems that work with language, called large language models (LLMs), generative AI tools can ingest large quantities of unlabeled, unstructured data and produce explainable output. The trade-off with generative AI, however, is accuracy. Tools like OpenAI's ChatGPT, which use LLMs, can be unreliable in healthcare settings and are vulnerable to "hallucinations," or information that appears to be made up out of whole cloth.
Ryan Nguyen, a medical oncologist at the University of Illinois Chicago, is an early adopter of ChatGPT for personal use and has been experimenting with it to see how accurate it might be for recommending precision treatment options.
"One challenge in current oncology practice is that there's a rush of new technology coming out in terms of biomarkers," Nguyen said. Although he finished fellowship training in 2022, he said he had no dedicated training on how to interpret biomarkers that can help identify the treatments his patients should receive. Nguyen said his experience seems to be typical among his colleagues. He is seeing oncologists turn to online resources for help, including to tools like generative AI, because the science is changing so rapidly that it's difficult to stay on top of the latest advances.
But even for an AI-curious oncologist like Nguyen, patient privacy is a big reason to proceed with caution. Under the Health Insurance Portability and Accountability Act in the US, healthcare organizations must protect patients' health data within their information technology infrastructure, and ChatGPT is not a HIPAA-compliant environment. As such, Nguyen has been testing the tool's treatment prediction capabilities with hypothetical patient cases.
Nguyen presented some of those hypothetical scenarios during a virtual molecular tumor board webinar hosted by Precision Oncology News in November 2023. During the session, Nguyen asked GPT-4, a more recent version of the model underlying ChatGPT, for treatment recommendations for a hypothetical prostate cancer patient with rising prostate-specific antigen levels on hormone therapy and a BRCA mutation. He also asked it about potential medication interactions for another made-up patient starting an NTRK inhibitor.
Nguyen said the chatbot seemed to get the gist of what he was asking, but that the accuracy of its responses fell short of what would be needed for it to be used as a clinical decision support tool. "Any time you're guiding patients on what the best treatments are for them, there's zero room for hallucinations or mistakes," Nguyen said. "And that's one challenge we've seen with ChatGPT."
Because of the tendency toward mistakes and hallucinations, in Nguyen's view, any use of generative AI technology in patient care would need "strong guardrails" to ensure its results are accurate. Nguyen is now working with other colleagues to develop a system, dubbed the GPT Performance scale (G-PS), that scores the relevancy and accuracy of ChatGPT's outputs in oncology and applies a penalty for hallucinations. Nguyen's group is in the process of submitting a manuscript describing that work for publication.
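The details of G-PS have not yet been published, so any concrete scoring logic is speculative. Purely as an illustration of what a rubric that rewards relevance and accuracy while penalizing hallucinations could look like, consider the following sketch; the fields, weights, and scale are invented and are not the actual G-PS.

```python
# Hypothetical illustration of a grading rubric with a hallucination penalty.
# Not the published G-PS: fields, weights, and scale are invented for illustration.
from dataclasses import dataclass

@dataclass
class GradedResponse:
    relevance: int        # 0-2: did the answer address the clinical question?
    accuracy: int         # 0-2: were the stated facts and recommendations correct?
    hallucinations: int   # count of fabricated drugs, trials, or guidelines

def score(r: GradedResponse, hallucination_penalty: int = 2) -> int:
    return r.relevance + r.accuracy - hallucination_penalty * r.hallucinations

# A relevant, mostly accurate answer that invents one citation still scores poorly
print(score(GradedResponse(relevance=2, accuracy=1, hallucinations=1)))  # -> 1
```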
Although the use of AI in precision medicine and healthcare presents some challenges, AI Doctor author Razmi believes it's worth the effort to try to refine its use for this purpose. Given the billions of data points generated by interactions between the genome, proteome, microbiome, immune system, and more, Razmi said, "AI is the only technology we have to start going after the secrets of the human body. The trajectory long term is phenomenal."
To address the concerns about AI that have surfaced in recent years, the Coalition for Health AI, a group of hospitals, health systems, and industry partners developing guidelines for safe implementation of AI in healthcare, put out a report last year that identified issues needing attention and proposed solutions to enable trustworthy AI. In the report, the coalition discusses the need to ensure the usefulness, safety, accountability and transparency, and explainability of AI, as well as to mitigate harmful bias, secure data, and protect patient privacy. Implementing those values will require collaboration among stakeholders across the healthcare system.
Although advances like explainability, more robust generalizability, and improved accuracy in AI models will go a long way toward reassuring clinicians and building trust, the ultimate test will be the performance of these models in large, randomized, prospective clinical trials.
When it comes to AI, "[doctors] have concerns about whether they're going to lose their autonomy. They have concerns about whether it's going to affect their jobs. They have concerns about whether it's going to lower their incomes because some of what they're doing is going to be done by AI," Razmi said. "But if and when those concerns are addressed, they're going to be concerned about whether the output can be trusted. And for that, you need to do clinical trials, prospective, large-scale clinical trials, which haven't been done yet for most technologies in health AI."