Pediatric Cancer Big Data Initiatives Could Help Set Best Practices for Sequencing, Data Sharing


NEW YORK – Though it has become apparent that combining DNA and RNA sequencing is beneficial for researchers seeking to understand more about the biology of pediatric cancers and for clinicians making therapeutic decisions for their patients, not all researchers and clinicians have access to the resources that would allow them to do that sequencing themselves. 

St. Jude Children's Research Hospital is aiming to help, not only for the sake of its own patients, but also to fulfill its mandate of generating and sharing as much information about pediatric cancer as possible. To that end, the hospital's bioinformatics group has recently launched the St. Jude Cloud, a repository of pediatric clinical genome sequencing data available in real time. The initiative is aiming to provide researchers with high-quality whole-genome, exome, and transcriptome data from consenting St. Jude patients. Data will be uploaded in a private, secure environment on a monthly basis.

But to have a cloud, you need to start with data, and that's where St. Jude's mass sequencing of its patients comes in handy. For Jinghui Zhang, chair of the department of computational biology at St. Jude, whole-genome sequencing is an integral part of pediatric cancer research, and it's why all St. Jude patients are offered WGS testing. 

In a September 2018 paper in Nature Communications, Zhang and her colleagues evaluated the efficacy of combining whole-genome, whole-exome, and transcriptome sequencing on tumors and normal tissue from 78 pediatric cancer patients. They found that the three-platform sequencing approach had a positive predictive value of 97 to 99 percent, 99 percent, and 91 percent for somatic SNVs, indels, and structural variations, respectively. They also reported 240 pathogenic variants across all cases with 98 percent sensitivity, while combined WES and RNA-seq testing achieved only 78 percent sensitivity. These results, the authors noted, emphasized the need for incorporating WGS in pediatric oncology testing.
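For readers less familiar with the two metrics quoted above, the following minimal sketch shows how positive predictive value and sensitivity are conventionally computed from validation counts. The function names and the counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def positive_predictive_value(true_pos: int, false_pos: int) -> float:
    """Fraction of reported variant calls that are confirmed as real."""
    return true_pos / (true_pos + false_pos)


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real variants that the assay actually detected."""
    return true_pos / (true_pos + false_neg)


# Hypothetical validation counts, for illustration only (not from the paper).
ppv = positive_predictive_value(true_pos=97, false_pos=3)
sens = sensitivity(true_pos=98, false_neg=2)
print(f"PPV = {ppv:.2f}, sensitivity = {sens:.2f}")
```

Positive predictive value penalizes false calls, while sensitivity penalizes missed variants, which is why the article reports them separately.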

Indeed, Zhang said, this has been a driving principle behind the way St. Jude approaches pediatric cancer research since 2010, when the hospital began the Pediatric Cancer Genome Project in collaboration with Washington University School of Medicine in St. Louis in an effort to identify the genetic changes underlying some of the deadliest childhood cancers. The teams sequenced more than 600 childhood cancer patients as part of that initiative.

"If you don't perform whole-genome sequencing, you're not getting [the whole picture], because structural variation breakpoints occur in intronic regions. We gather intronic sequences to obtain this information," Zhang said, highlighting some of the varied genomic information to be had from WGS analysis of pediatric cancer patients.

Zhang and several collaborators published in Nature in February 2018 a pan-cancer genome and transcriptome analysis of 1,699 pediatric leukemias and solid tumors. They found that only 45 percent of the driver genes in the pediatric cancers matched those found in adult pan-cancer studies, and that copy-number alterations and structural variants constituted about 62 percent of driver events in the pediatric cancers. 

"These studies help us reaffirm that the whole-genome-based approach is really essential for understanding the entire genomic profile of pediatric cancer," Zhang said. In real-time research and clinical applications, she added, WGS is useful for performing mutation signature analysis. Because the overall mutation burden in pediatric cancers is very low compared with adult cancers, exome data is insufficient for performing those types of investigations. But as WGS data offers a look at factors like intronic variations, SNVs, and indels, mutation signature-based analysis becomes possible. 

She also noted that WGS can sometimes help researchers discover new features in cancers that may be considered to be fairly common among children, such as acute lymphoblastic leukemia. 

"We've discovered some subtypes of leukemia that have rearrangements involving the IGH locus," Zhang said. "Because these rearrangements actually involve enhancer hijacking, without the DNA data you actually do not know if that elevated expression was caused by an activation upstream or a downstream event. So, it's only after you analyze the case that you know this is a standard case."

However, she noted, some researchers may choose to perform enriched exome- or capture-based assays first in order to see if what they find is sufficient for their purposes, before proceeding on to WGS, in a tiered testing approach. But this approach requires a longer timeline, which cancer patients may not have. It may also require more sample material, which may not be an issue with blood cancers, but is a major hurdle in solid tumors. "If you have utilized the sample for this [exome] assay, do you still have enough left for other tests?" Zhang posited. "Each institution really has to adjust for their own needs."

At St. Jude, however, it's not just about helping the hospital's clinicians make treatment decisions for patients. Though directly serving existing clinical needs is critical, Zhang and her colleagues must also keep in mind the hospital's mission for data sharing and advancing research into the biology of pediatric cancers and developing new treatments.

Sequencing thousands of whole genomes, exomes, and transcriptomes is only the first step. Researchers and clinicians must have a way to easily access the data and tools to analyze it. That's where the St. Jude Cloud comes in.

St. Jude researchers unveiled the resource at the annual meeting of the American Society of Clinical Oncology in Chicago earlier this month, noting that prospective data from 685 patients who have undergone clinical genomic sequencing is currently available in the database, and that data from an additional 273 people will be made available in July. The team anticipates adding data from 500 patients to the cloud each year. Retrospective whole-genome sequence data from 10,000 study participants is also available in the repository. 

The original pediatric genome sequencing project between WashU and St. Jude collected massive amounts of data, but downloading 600 whole genomes from a database can take anywhere from two to six months, depending on internet connection speed, said Alexander Gout, a bioinformatician at St. Jude who worked on building the St. Jude Cloud.
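A rough back-of-the-envelope calculation shows why bulk downloads of this kind stretch into weeks or months. The sketch below assumes a hypothetical file size of 100 GB per whole genome and a few illustrative sustained connection speeds; neither figure comes from St. Jude, and real transfers would also be slowed by overhead and interruptions.

```python
def transfer_days(n_genomes: int, gb_per_genome: float, mbit_per_s: float) -> float:
    """Idealized transfer time in days at a constant sustained bandwidth."""
    total_bits = n_genomes * gb_per_genome * 8e9  # decimal gigabytes -> bits
    seconds = total_bits / (mbit_per_s * 1e6)     # megabits per second -> bits per second
    return seconds / 86_400                       # seconds per day


# Assumed values: 600 genomes (from the article), 100 GB each (hypothetical).
for speed in (50, 100, 1000):
    days = transfer_days(n_genomes=600, gb_per_genome=100, mbit_per_s=speed)
    print(f"{speed:>5} Mbit/s: about {days:.0f} days")
```

Under these assumptions, even an uninterrupted 100 Mbit/s link needs close to two months, which is consistent in scale with the range Gout describes.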

"The question became, what's the best way to handle, store, disseminate, and share this information with the international pediatric research community," Gout said. "Jinghui Zhang came up with the idea to basically create a cloud platform to host all of this genomic data, and not only just create this cloud platform, but also provide analytical capabilities within this platform."

St. Jude partnered with Microsoft, which has provided storage and compute capability for the cloud through Microsoft Azure. The hospital also partnered with DNAnexus to build bioinformatic analysis pipelines and visualization tools directly into the cloud platform.

The resulting infrastructure is one that allows users to not only access and download the raw pediatric genome, exome, and transcriptome sequence data, but also work within the ecosystem to analyze it, Gout explained. "We've created a number of end-to-end bioinformatics workflows within the St. Jude Cloud whereby users can select particular samples or patient cohorts from the large data repository and then process it through the bioinformatics workflow," he said. "And this is what we're working towards in the sense that we're wanting to support not only bioinformatics power users who just really want to download data and analyze it themselves, but also postdocs and clinicians."

In addition to the bioinformatics pipelines, the cloud has a number of visualization tools built into it to help researchers and clinicians see the data in any number of ways. One such tool is called GenomePaint. It allows users to visualize somatic coding and noncoding alterations from about 3,800 pediatric tumors alongside multi-omics information, revealing oncogene activation by noncoding alterations, enhancer hijacking events, aberrant splicing, mutual-exclusivity events, mutation signatures, and other oncogenic drivers. The sister tool, called ProteinPaint, allows users to look at the spectrum of mutations within any user-defined protein or gene within the genome. 

"You can essentially [use these tools to] traverse the genome and look at [multiple] cohorts of different types of cancers and all the genomic features they involve," Gout said. "You can look at that from a sample view, where you look at individual patients as individual rows, or you can have a condensed cohort view and collapse all of the patient sample information into different kind of tracks. You can look at lollipop plots for SNV calls or lollipop plots for structural variant calls. You can have a beautiful overview of what's going on in any different region within the genome, then drill down specifically in particular places, genes, or intergenic regions."

Importantly, Gout noted, the data in the St. Jude Cloud isn't being withheld until St. Jude's researchers are done using it. It's being uploaded for the wider research community almost as soon as it can be generated. Additionally, he said that researchers who use St. Jude's data are free to publish their own discoveries based on it. Though there may be a few datasets that are embargoed for a short period of time, the large majority of the data is unrestricted for research purposes.

"Pediatric cancer is so rare that we really needed to create a critical mass of pediatric cancer genomics data so that it can be collated and aggregated in a way that allows discoveries to take place," Gout said. "So, we've decided that we need to release this clinical genomics data in real time, as soon as we get it."

Although there aren't currently any tools built into the St. Jude Cloud for the purposes of helping clinicians make therapeutic decisions, Gout plans to develop a feature that would function essentially like an online tumor board, empowering clinicians who come to the site to go from discovery to decision-making. However, he added, this hasn't been approved or started yet.

But St. Jude isn't the only pediatric cancer research center that has begun to generate large amounts of sequencing data, leading some in the research community to ask whether standards need to be implemented for how the sequencing is done, and then how the data is harmonized and uploaded to the internet. 

Gout believes that St. Jude will eventually play a role in setting data-sharing standards for pediatric cancer research in partnership with other initiatives such as the Gabriella Miller Kids First Pediatric Research Program and the Treehouse Childhood Cancer Initiative. Having one centralized data repository for the entire research community's pediatric sequencing data may be unrealistic, he said, adding that the focus should be on what he called "federated analysis" — how to connect these different databases and resources together so that the information contained in them can be shared as needed in an efficient manner. 

"How can they talk to each other? How can we develop methods, protocols, bioinformatic analysis routines that can be set in motion and access data released within each of these different location sites?" Gout asked. "Federated analysis is really what we're looking at right now."
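One way to picture the federated analysis Gout describes is as a query that runs separately at each site and returns only summary results to a coordinator, so that patient-level records never leave their home repository. The sketch below is a toy illustration under that assumption; the site names, record fields, and gene labels are hypothetical placeholders and do not describe any existing St. Jude, Kids First, or Treehouse interface.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical per-site records; in a real deployment these would remain
# behind each repository's own access controls.
SITE_DATA: Dict[str, List[dict]] = {
    "site_a": [{"sample": "a1", "gene": "GENE_X"}, {"sample": "a2", "gene": "GENE_Y"}],
    "site_b": [{"sample": "b1", "gene": "GENE_X"}],
    "site_c": [{"sample": "c1", "gene": "GENE_X"}, {"sample": "c2", "gene": "GENE_Z"}],
}


def local_gene_counts(records: List[dict]) -> Counter:
    """Runs at a single site and returns only aggregate counts per gene."""
    return Counter(record["gene"] for record in records)


def federated_gene_counts(sites: Dict[str, List[dict]]) -> Counter:
    """Coordinator step: merge the per-site summaries into one result."""
    combined: Counter = Counter()
    for records in sites.values():
        combined.update(local_gene_counts(records))  # only the summary is shared
    return combined


totals = federated_gene_counts(SITE_DATA)
print(dict(totals))
```

The design choice illustrated here is that the analysis travels to the data instead of the data travelling to the analyst, which also sidesteps the bulk-download problem described earlier in the article.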