NEW YORK – Artificial intelligence tools based on natural language processing could ease the workload for clinical trial investigators by sifting through unstructured data in patient records and finding likely matches for clinical trials.
Trials can have dozens of criteria that patients must meet to be eligible, including biomarkers, disease stage, disease subtype, age, general health status, and treatment history. For trial sponsors and investigators, identifying patients who meet all of those criteria can be challenging, leading to slow accrual and missed opportunities for patients. Although an estimated 50 percent of cancer patients are eligible for a clinical trial, only about 7 percent end up enrolled in one, according to Aaron Brauser, founder and CEO of Realyze Intelligence, a developer of software designed to read and understand unstructured data from patients' electronic medical records.
Realyze, which was recently acquired by Carta Healthcare, is on a mission to extract data that is otherwise "trapped" in those records. Unfortunately, from a clinical trial perspective, even though healthcare "went digital … all of the detail about the patient that's not billing related gets trapped in the narrative notes," said Brauser. That means researchers must comb through those notes manually to find patients eligible for their trials.
Brauser recalled that for one breast cancer study conducted by the University of Pittsburgh Medical Center's Hillman Cancer Center, researchers spent one to two hours per patient screening for eligibility, work that typically must be done before a patient's first visit with their oncologist. "The reason [for that] is, once they get put on a therapy, you're not going to pull someone off that therapy," Brauser said.
The other challenge, Brauser said, is that as precision medicine and genomics have taken off, clinical trial matching requirements have become increasingly precise, too. For example, rather than simply looking for patients with a certain type of breast cancer, investigators must find patients with the correct subtype and stage and on a particular line of therapy, who have tested positive for certain genomic biomarkers, and who have an overall treatment history compatible with the trial requirements.
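To make the layering of criteria concrete, here is a minimal Python sketch of a structured eligibility check. It is not drawn from any vendor's product: the `Patient` and `TrialCriteria` classes, their field names, and the example values are all hypothetical, and a real trial protocol would carry many more inclusion and exclusion rules.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    # Hypothetical, simplified patient summary (illustrative fields only).
    cancer_type: str
    subtype: str
    stage: str
    line_of_therapy: int
    biomarkers: set[str] = field(default_factory=set)
    prior_treatments: set[str] = field(default_factory=set)


@dataclass
class TrialCriteria:
    # Hypothetical, simplified trial requirements.
    cancer_type: str
    subtypes: set[str]
    stages: set[str]
    line_of_therapy: int
    required_biomarkers: set[str]
    excluded_prior_treatments: set[str] = field(default_factory=set)


def unmet_criteria(patient: Patient, trial: TrialCriteria) -> list[str]:
    """Return the names of criteria the patient fails; empty means a candidate match."""
    unmet = []
    if patient.cancer_type != trial.cancer_type:
        unmet.append("cancer_type")
    if patient.subtype not in trial.subtypes:
        unmet.append("subtype")
    if patient.stage not in trial.stages:
        unmet.append("stage")
    if patient.line_of_therapy != trial.line_of_therapy:
        unmet.append("line_of_therapy")
    if not trial.required_biomarkers <= patient.biomarkers:
        unmet.append("biomarkers")
    if patient.prior_treatments & trial.excluded_prior_treatments:
        unmet.append("prior_treatments")
    return unmet
```

Returning the list of unmet criteria, rather than a bare yes or no, lets a reviewer see why a patient was excluded. Every added criterion is another condition that can fail, which is one way to see why increasingly precise requirements shrink the pool of matches.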
The Realyze platform combines a clinical mechanistic model with a large language model and natural language processing to analyze structured and unstructured clinical notes from patient electronic health records. It uses standard medical nomenclatures including the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), Logical Observation Identifiers Names and Codes (LOINC), and RxNorm to organize the data it ingests.
Brauser said the company didn't want to rely on an LLM alone to match patients. "We put an extra clinical intelligence layer on top that understands … what are the guardrails you have to put on this to prevent hallucinations," Brauser said. Then, the developers applied that clinical intelligence capability to building an individualized model of the patient that predicts the likelihood of a match. "Our differentiator is more the infrastructure and this patient modeling versus the core LLM," he added.
Realyze has validated its platform on manually curated data from 7,000 patients across nine cancer types. In a case study, the platform's matching capabilities were tested against manual matching at one comprehensive cancer center in 2024, and the platform showed a specificity of 93 percent and a sensitivity of 100 percent and was 85 percent to 93 percent accurate on a per-patient basis. Over a six-week period, twice as many patients converted from matches to bona fide accruals, compared to manual procedures.
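For readers less familiar with the metrics, sensitivity is the share of truly eligible patients the tool flags, and specificity is the share of ineligible patients it correctly rules out. The Python sketch below shows the arithmetic. The counts in the usage note are invented purely to illustrate the calculation and are not the case study's data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of truly eligible patients that were flagged as matches."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no eligible patients in the sample")
    return true_pos / total


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of ineligible patients that were correctly ruled out."""
    total = true_neg + false_pos
    if total == 0:
        raise ValueError("no ineligible patients in the sample")
    return true_neg / total
```

With hypothetical counts of 20 eligible patients all flagged (`sensitivity(20, 0)`) and 93 of 100 ineligible patients ruled out (`specificity(93, 7)`), the functions return 1.0 and 0.93. A sensitivity of 100 percent means no eligible patient was missed, while a specificity below 100 percent means some flagged patients still need to be ruled out by staff.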
Norstella, a company that provides pharmaceutical software and other services, introduced another AI platform, NorstellaLinq, in October 2024 to address the need for expedited patient screening and enrollment in clinical trials. NorstellaLinq combines real-world patient data, including insurance claims, lab results, and electronic medical records data, with Norstella's proprietary forecasting, clinical, regulatory, payor, and commercial intelligence data using generative AI, large language modeling, and machine learning. The platform also brings together resources from Norstella's other brands, including Citeline, Evaluate, MMIT, Panalgo, and the Dedham Group.
Among its many capabilities for supporting drug discovery and development, NorstellaLinq is able to ingest structured and unstructured data from electronic health records and identify patients likely to qualify for a clinical trial.
Suzanne Caruso, general manager of clinical and regulatory for Norstella and Citeline, said that with the newly launched platform, they can scan anonymized patient data within an individual investigator's remit to identify patients who might qualify for a study. "[Investigators] do not have a ton of time and staff to be hunting through Epic and Cerner [electronic medical record systems] every single day for hours to [find out whether] someone has presented within the hospital that might qualify for a study," Caruso said, adding that the system has access to the investigator's direct patients as well as those of affiliated professionals within the hospital system and can scan data down to the biomarker level.
According to Caruso, the unstructured records are particularly crucial for gleaning biomarker information, which is often captured in notes dictated by the physician. To improve the system's accuracy in interpreting those notes, Norstella has built custom ontologies in collaboration with subject matter experts to organize the data.
"One of the challenges around hallucinations in LLMs is that there's no underlying ontology," Caruso said. For example, NorstellaLinq has a custom ontology for the KRAS G12C biomarker that includes all the different ways that the term might appear in text including different uses of spaces and dashes. Using this ontology, the biomarker can then be linked to relevant clinical frameworks such as non-small cell lung cancer or breast cancer, and the user can ask questions, Caruso said, such as: "Show me every patient with this [biomarker]."
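The idea of catching every spelling of a biomarker can be illustrated with a small pattern-matching sketch in Python. This is not Norstella's ontology, whose contents are not described beyond the spaces-and-dashes example. The regular expression and the `mentions_kras_g12c` helper are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: tolerates optional spaces, hyphens, or underscores
# between the parts, an optional "p." prefix, and the long-form residue
# names (Gly12Cys). It does not handle negation such as "no KRAS G12C".
KRAS_G12C = re.compile(
    r"\bKRAS[\s\-_]*(?:p\.)?\s*(?:G|Gly)[\s\-]*12[\s\-]*(?:C|Cys)\b",
    re.IGNORECASE,
)


def mentions_kras_g12c(note: str) -> bool:
    """Return True if the note text contains a recognizable KRAS G12C mention."""
    return KRAS_G12C.search(note) is not None
```

Under these assumptions, `mentions_kras_g12c("Tumor is KRAS-G12C positive")` and `mentions_kras_g12c("KRAS p.Gly12Cys detected")` both return `True`, while a note mentioning only KRAS G12D returns `False`. A production ontology would also need to handle negation, misspellings, and context, which is part of why such vocabularies are curated with subject matter experts.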
Norstella has so far built 50 of these ontologies and aims to add many more. "We're going to chug along biomarker by biomarker," Caruso said, noting that eventually the system will be able to pull out combinations of biomarkers to find even more specific patient populations.
Although unfamiliar with the offerings of Realyze and Norstella, Mitch Schnall, a professor of radiology at Penn Medicine and co-chair of ECOG-ACRIN, said that working with a commercial technology partner to identify patients for clinical trials can be a challenging and complex undertaking, particularly if the trial is taking place at multiple sites across different health systems. ECOG-ACRIN has coordinated large umbrella trials such as the National Cancer Institute's MATCH and ComboMATCH, and for trials across multiple institutions, he said, it has not been feasible to use a single automated system to match patients. Instead, the group has relied on individual centers to set up their own systems.
One workaround the group found during the MATCH trial was partnering with sequencing vendors to identify patients who at least meet the biomarker criteria and then, if there was a potential match, notifying the physician who ordered the test. "That's a way we can get automated systems to help us in these trials, but it's a complex environment," Schnall said. "Even though [automation] sounds like it would be an easy solution, there are tons of technology as well as policy issues for local institutions that make it difficult."
Peter O'Dwyer, a medical oncologist at Penn Medicine and ECOG-ACRIN co-chair, noted that improvements in technology and interoperability will continue to facilitate the development of automated AI systems for clinical trial matching, but that "patients have to be at the table, too," due to concerns about the use of AI. One worry patients have, he said, is that these platforms will "dangle potential options that aren't options in front of them."
"We're still in the early days of advanced analytics being applied to medical record information," Schnall said, noting that developers must navigate complicated information structures across healthcare organizations that lack clear standards for sharing. "We're going to continue to see progress. … I'm optimistic that we're going to get there, but I think it's going to take a bit because it's a pretty complicated space to navigate."