NEW YORK – Investigators from the University of Cambridge, the UK's Genes & Health study, and the University of Tartu in Estonia have been working on a project to benchmark polygenic risk scores in the Polygenic Score Catalog in order to assess their performance in other population cohorts. The researchers also recently released a tool to factor genetic ancestry into polygenic score calculations.
Samuel Lambert, an assistant professor of health data science at the University of Cambridge, has been involved with the PGS Catalog since its inception in 2019. The open database currently includes 3,245 published scores for 570 traits drawn from 389 publications. The PGS Catalog is annotated with metadata, including scoring files, notes on how the scores were developed and applied, and assessments of their predictive performance. The researchers published a paper describing the catalog in Nature Genetics in March 2021.
According to Lambert, the PGS Catalog was established in part because of the growing number of published scores, plus a desire to use them clinically as opposed to purely for research. "The problem was that everyone uses different polygenic scores to make claims about how well they work or don't work," Lambert said.
He said the issue is compounded by multiple factors, as even scores developed with the same data can perform differently. The catalog therefore organizes information about which scores are in use and makes them available so that researchers can use them and build out a knowledge base of relevant data.
The University of Cambridge hosts the PGS Catalog and since the end of 2020 has been working with Genes & Health and the University of Tartu, which hosts the Estonian Biobank, to benchmark the polygenic scores in the catalog in other population cohorts. Lambert said this is necessary as scores are sometimes developed in European cohorts and validated in a single separate cohort, which raises questions about their reliability in other populations.
Researchers continue to discuss issues around the transferability of polygenic scores across populations of different genetic ancestries. A study published in Cell Genomics in April, for example, reported that scores that worked robustly in European populations were moderately transferable to South Asians and performed poorly in individuals with African ancestry.
"If you have a score that has only been evaluated in European individuals, maybe you don't know how well it works in other ancestry groups or in European ancestry individuals in another cohort," said Lambert. "The goal of benchmarking is to take scores in the catalog and fill in the gaps in the data."
The benchmarking project is ongoing, and the researchers expect to publish a preprint about the work next year. While Lambert said that "the best scores may often be the best scores across all ancestries," he declined to elaborate on any other conclusions while the work continues. Instead, he said that questions remain about polygenic scores and their applicability in different ancestries. "Will there be one score for a trait for all ancestries, or will one ancestry have different or separate, ancestry-optimized polygenic scores, or might there be more complex methods needed for both? I think this is an open question that has not been solved," he said.
The collaboration with Genes & Health is particularly helpful for the effort, Lambert noted, as that study has collected data on a cohort of Bangladeshi and Pakistani origin. Hilary Martin, a group leader at the Wellcome Trust Sanger Institute who is involved in Genes & Health, said the effort involves participants from East London and Bradford, a city in northern England.
Genes & Health is focusing on South Asian populations, as they have higher rates of diabetes compared to the rest of the UK and suffer from other health issues, including cardiovascular disease and mental health conditions.
With about £40 million ($48 million) in funding from various sources, Genes & Health has recruited 50,000 volunteers since it commenced in 2015, with the aim of including 100,000 people.
According to Martin, Genes & Health has genotyped its cohorts with microarrays and aims to sequence their exomes, as well.
The Estonian Biobank is providing access to another large, well-characterized European cohort of more than 200,000 participants, all genotyped with a variety of microarrays, mostly the Illumina Global Screening Array.
Reedik Mägi, a research associate at the Estonian Genome Center, described the work with Lambert and colleagues as a "good and fruitful collaboration." He said that researchers from the center have validated several polygenic risk score models of different diseases in the catalog using Estonian Biobank data. In the future, the Estonian researchers aim to use some of the disease risk prediction models to give feedback to Estonian Biobank participants about their genetic risk, Mägi said.
Meanwhile, the Estonian Genome Center has been providing genetic counseling to participants for years. In October, its researchers discussed some of their experiences in the European Journal of Human Genetics.
According to Mägi, new scores investigated as part of the benchmarking project with the PGS Catalog still require validation before being used to report risk to Estonian Biobank participants. "These scores are being validated in multiple biobanks to see how well these predict diseases in different populations," he said.
Lambert discussed the PGS Catalog at Genomics England's Research for Genomic Equity Conference, held in October. The meeting also marked the launch of Link23, a new effort backed by Genomics England and Data Science for Health Equity that aims both to foster a community of researchers and to provide them with access to analytical tools to improve equity in genomics.
During his talk, Lambert mentioned that the PGS Catalog recently developed a calculator to streamline the calculation of polygenic scores using scoring files found in the catalog or in custom files. It also automates polygenic score downloads from the catalog, variant matching between scoring files and target genotyping sample sets, and the parallel calculation of multiple polygenic scores.
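To illustrate what such a calculation involves, the following is a minimal sketch rather than the PGS Catalog calculator's actual code. It assumes a polygenic score is the weighted sum of effect-allele counts over the variants that can be matched between a scoring file and a target genotype set. The variant IDs, weights, genotypes, and the `polygenic_score` function are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical scoring file: variant ID -> (effect allele, effect weight).
scoring_file: Dict[str, Tuple[str, float]] = {
    "rs1": ("A", 0.20),
    "rs2": ("G", -0.10),
    "rs3": ("T", 0.05),
}

# Hypothetical target genotypes: variant ID -> the two observed alleles.
# rs3 is absent from the target data, and rs4 is absent from the scoring file.
genotypes: Dict[str, Tuple[str, str]] = {
    "rs1": ("A", "A"),
    "rs2": ("G", "C"),
    "rs4": ("C", "C"),
}


def polygenic_score(
    scoring: Dict[str, Tuple[str, float]],
    genotype: Dict[str, Tuple[str, str]],
) -> Tuple[float, float]:
    """Return (score, match rate) for one individual.

    The score is the sum of weight * effect-allele count over matched
    variants; the match rate is the fraction of scoring-file variants
    that were found in the target genotypes.
    """
    total = 0.0
    matched = 0
    for variant_id, (effect_allele, weight) in scoring.items():
        alleles = genotype.get(variant_id)
        if alleles is None:
            continue  # variant not genotyped in the target set
        dosage = sum(1 for allele in alleles if allele == effect_allele)
        total += weight * dosage
        matched += 1
    return total, matched / len(scoring)


score, match_rate = polygenic_score(scoring_file, genotypes)
print(f"score={score:.2f}, matched {match_rate:.0%} of scoring-file variants")
```

Reporting the match rate alongside the score is a design choice in this sketch: a score computed from only a fraction of its variants may not be comparable to the published one.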
New features in development relate to genetic ancestry and score normalization, Lambert said. The genetic ancestry component will allow users to calculate the similarity of target samples to populations in a reference dataset using principal components analysis. The normalization tool will use reference population data and PCA projections to report individual-level polygenic score predictions that account for genetic ancestry.
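As a rough illustration of ancestry similarity in principal component space, the sketch below assumes the PCA projection has already been done and simply finds the reference population whose centroid lies closest to a target sample. The article does not describe how the calculator measures similarity, so the nearest-centroid rule, the population labels, and the coordinates are assumptions made for the example.

```python
import math
from statistics import mean
from typing import Dict, List, Tuple

PCPoint = Tuple[float, float]

# Hypothetical reference panel: population label -> (PC1, PC2) per sample.
reference_pcs: Dict[str, List[PCPoint]] = {
    "POP_A": [(0.9, 0.1), (1.1, -0.1), (1.0, 0.0)],
    "POP_B": [(-1.0, 0.5), (-0.8, 0.7), (-1.2, 0.6)],
}


def centroids(ref: Dict[str, List[PCPoint]]) -> Dict[str, Tuple[float, ...]]:
    """Average the PC coordinates of each reference population."""
    return {
        pop: tuple(mean(axis) for axis in zip(*coords))
        for pop, coords in ref.items()
    }


def most_similar_population(
    target_pc: PCPoint, ref: Dict[str, List[PCPoint]]
) -> Tuple[str, Dict[str, float]]:
    """Return the closest population and the distance to every centroid."""
    distances = {
        pop: math.dist(target_pc, centre)
        for pop, centre in centroids(ref).items()
    }
    closest = min(distances, key=distances.get)
    return closest, distances


# A hypothetical target sample already projected onto the same two PCs.
label, distances = most_similar_population((0.8, 0.2), reference_pcs)
print(label, {pop: round(d, 3) for pop, d in distances.items()})
```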
"When you calculate a polygenic score, you just get a number," said Lambert. "We want to know in a population of people similar to you, what their values would be, and then we would know if you are at greater risk" for a particular trait, he said. By using reference population data relevant for a particular individual, the calculator will help to remove variation determined by ancestry.
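The idea Lambert describes can be sketched by expressing a raw score relative to the distribution of scores in a matching reference population. The example below uses a plain z-score as a simplified stand-in; the article says the tool will also use PCA projections, which this sketch does not model. The reference scores, population labels, and `normalized_score` function are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List

# Hypothetical raw polygenic scores for two reference populations.
reference_scores: Dict[str, List[float]] = {
    "POP_A": [0.10, 0.20, 0.30, 0.40, 0.50],
    "POP_B": [0.60, 0.70, 0.80, 0.90, 1.00],
}


def normalized_score(
    raw: float, population: str, ref: Dict[str, List[float]]
) -> float:
    """Express a raw score in standard deviations from the population mean."""
    scores = ref[population]
    return (raw - mean(scores)) / stdev(scores)


# The same raw score sits above average in one population and below in the other.
z_a = normalized_score(0.50, "POP_A", reference_scores)
z_b = normalized_score(0.50, "POP_B", reference_scores)
print(f"POP_A: {z_a:+.2f} SD, POP_B: {z_b:+.2f} SD")
```

The same raw value of 0.50 lands above the mean in one hypothetical population and below it in the other, which is why a bare number is hard to interpret without a relevant reference distribution.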
Lambert said that the new features are in development and that investigators are testing them in other biobanks. In general, he remarked, computational methods for calculating polygenic scores continue to improve, as do the scores themselves, powered by more genome-wide association studies and the inclusion of more diverse population cohorts.
"Methods are getting better and more robust, and computational methods are becoming more sophisticated," said Lambert. "The methods you use do make a difference."