News Release

Compressed data technique enables pangenomics at scale

Peer-Reviewed Publication

University of California - San Diego

Image: An illustration depicting the vast amounts of genetic data the PanMAN technique is able to handle with very small data storage requirements. Credit: Alice Grishchenko

Engineers at the University of California have developed a new data structure and compression technique that enables the field of pangenomics to handle unprecedented scales of genetic information. The team, led by UC San Diego electrical and computer engineering professor Yatish Turakhia, described their compressive pangenomics approach in Nature Genetics on Jan. 12, 2026.

Pangenomics, a subset of bioinformatics, is the study of many different genomes from a single species. Compared with a single reference genome, a pangenome gives a more complete picture of the natural variation and mutations that occur within a species. Practical applications include studying how genomic mutations lead to increased transmissibility or drug resistance in pathogens.

Although advances in genome sequencing technologies have reduced the cost and increased the speed of sequencing, building the data structures and analysis tools needed to study and graphically represent the relationships between millions of sequenced genomes remains a challenge. Graph-based data formats for pangenomes have been widely adopted, but they represent only the genetic variation in a collection of genomes, not the genomes' shared evolutionary and mutational histories. They also have large storage requirements that do not scale well.

“The data structures used for pangenomics research are critical because they determine not only how efficiently genetic data is represented, but also what the data can represent,” said Sumit Walia, an electrical engineering PhD candidate at the Jacobs School of Engineering and co-first author of the study. 

The research team, which includes engineers from the Genomics Institute at UC Santa Cruz, pioneered a new data structure and file format called the Pangenome Mutation-Annotated Network (PanMAN). PanMAN not only provides unmatched compression for pangenomes but also significantly advances their representational power by encoding additional biologically relevant information, including phylogenies, mutations, and whole-genome alignments. The team's compressive pangenomics approach can perform analysis on compressed pangenomic data, allowing researchers to handle vastly larger scales of genetic data than currently possible.

“Our compressive technique with PanMANs allows doing more with less, greatly improving the scale and scope of current pangenomic analysis,” said Turakhia, the study’s corresponding author.

PanMANs are composed of mutation-annotated trees, called PanMATs. Each PanMAT stores a single ancestral genome sequence at the root and annotates mutations, such as substitutions, insertions, and deletions, on its branches. Multiple PanMATs are connected by edges into a network to form a PanMAN. These edges store complex mutations, such as recombination and horizontal gene transfer, which produce sequences with multiple parent sequences and therefore violate the vertical inheritance assumption of a single tree. The representation is compact because it exploits the shared ancestry among genomes: each mutation is recorded only once, on the branch where it arose, instead of being duplicated across individual sequences.
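The idea of storing a root sequence plus per-branch mutations can be illustrated with a toy sketch. This is not the PanMAN file format or the authors' implementation: the names `Mutation`, `Node`, `apply_mutations`, and `reconstruct`, the 0-based coordinate convention, and the example sequences are all assumptions made for illustration, and the network edges for recombination or horizontal gene transfer are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Mutation:
    """One branch annotation (hypothetical encoding for this sketch)."""
    kind: str         # "sub", "ins", or "del"
    pos: int          # 0-based offset in the parent's sequence
    bases: str = ""   # replacement base for "sub", inserted bases for "ins"
    length: int = 0   # number of bases removed for "del"


@dataclass
class Node:
    name: str
    mutations: List[Mutation] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def apply_mutations(parent_seq: str, mutations: List[Mutation]) -> str:
    """Return the child sequence produced by the mutations on one branch."""
    seq = list(parent_seq)
    # Apply from the highest position down so that insertions and deletions
    # do not shift the coordinates of mutations still to be applied.
    for m in sorted(mutations, key=lambda m: m.pos, reverse=True):
        if m.kind == "sub":
            seq[m.pos] = m.bases
        elif m.kind == "ins":
            seq[m.pos:m.pos] = list(m.bases)
        elif m.kind == "del":
            del seq[m.pos:m.pos + m.length]
        else:
            raise ValueError(f"unknown mutation kind: {m.kind}")
    return "".join(seq)


def reconstruct(root: Node, root_seq: str) -> Dict[str, str]:
    """Derive every node's sequence by walking the tree from the root."""
    sequences: Dict[str, str] = {}
    stack = [(root, root_seq)]
    while stack:
        node, parent_seq = stack.pop()
        seq = apply_mutations(parent_seq, node.mutations)
        sequences[node.name] = seq
        stack.extend((child, seq) for child in node.children)
    return sequences


# Only the root sequence is stored in full; everything else is a mutation.
root_seq = "ACGTACGT"
root = Node("root", children=[
    Node("A", [Mutation("sub", 2, "T")], children=[
        Node("A1", [Mutation("ins", 4, "GG")]),
        Node("A2", [Mutation("del", 0, length=2)]),
    ]),
    Node("B"),
])

sequences = reconstruct(root, root_seq)
for name in ("root", "A", "A1", "A2", "B"):
    print(name, sequences[name])
```

In this toy tree the substitution at position 2 is stored once, on the branch leading to `A`, yet it appears in the reconstructed sequences of `A`, `A1`, and `A2`. That sharing is the source of the compactness described above.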

In addition, PanMAN was designed to represent a rich set of biologically meaningful information that current pangenome formats lack. Some information in PanMAN is stored explicitly, such as mutations, phylogeny, annotations, and the root sequence, whereas other information can be derived, such as ancestral sequences, multiple whole-genome alignments, and genetic variation.

So far, the researchers have used PanMAN to study microbial genomes. They have found that it is the most compressible of the variation-preserving pangenomic formats, providing hundreds or even thousands of times more compression. For example, the team built the largest pangenome for SARS-CoV-2, using more than 8 million separate genomes of the virus. Stored as a PanMAN, this vast amount of genetic data required only 366 MB of file storage space, roughly 3,000 times less than the corresponding whole-genome alignment that the PanMAN encodes. Constructing an alignment for SARS-CoV-2 genomes at this scale was itself a formidable challenge, which was addressed by another computational tool developed in Turakhia’s lab, called TWILIGHT.
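As a rough back-of-the-envelope check of what these figures imply, the short calculation below uses only the numbers quoted in this release. It assumes decimal units (1 MB = 10^6 bytes), treats "roughly 3,000 times" as exactly 3,000, and treats "more than 8 million" as exactly 8 million, so the results are approximate.

```python
# Figures quoted in the release (both ratios are approximate in the source).
panman_mb = 366              # PanMAN file size in MB
compression_ratio = 3_000    # "roughly 3,000 times" less storage
n_genomes = 8_000_000        # "more than 8 million" genomes, used as a floor

# Implied size of the whole-genome alignment, assuming 1 TB = 10^6 MB.
alignment_tb = panman_mb * compression_ratio / 1_000_000

# Amortized PanMAN storage per genome, assuming 1 MB = 10^6 bytes.
bytes_per_genome = panman_mb * 1_000_000 / n_genomes

print(f"implied alignment size: ~{alignment_tb:.1f} TB")
print(f"amortized PanMAN storage: ~{bytes_per_genome:.0f} bytes per genome")
```

Under these assumptions, the alignment would occupy on the order of 1 TB, and the PanMAN averages under 50 bytes per genome. Because the genome count is a lower bound, the true per-genome figure would be somewhat smaller.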

Now, the researchers are expanding their use of TWILIGHT and PanMANs from microbes to human genomes. Turakhia and Melissa Gymrek, a professor of computer science and engineering at UC San Diego, received a Jacobs School Early Career Faculty Development Award to advance this effort. 

“Extending compressive pangenomics to human genomes can fundamentally transform how we store, analyze, and share large-scale human genetic data,” said Turakhia. “Besides enabling studies of human genetic diversity, disease, and evolution at unprecedented scale and speed, it can depict detailed evolutionary and mutational histories which shape diverse human populations, something that current representations do not capture.”

Full study: Compressive pangenomics using mutation-annotated networks

