Article Highlight | 31-Oct-2025

An effective encoding of human medical conditions in disease space provides a versatile framework for deciphering disease associations

Higher Education Press

Uncovering complex disease patterns from large-scale, heterogeneous health data remains a significant challenge. Traditional statistical methods and conventional machine learning algorithms often struggle to integrate and analyze such diverse data effectively, limiting both the accuracy and depth of insights. Moreover, these approaches typically treat diseases as independent, discrete entities, overlooking critical interconnections such as comorbidities, pathological pathways, and phenotypic overlaps.

A research team led by Gengjie Jia at the Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, in collaboration with Yu Li at The Chinese University of Hong Kong, Xin Gao at King Abdullah University of Science and Technology, and Andrey Rzhetsky at The University of Chicago, has published an article in Quantitative Biology titled “An effective encoding of human medical conditions in disease space provides a versatile framework for deciphering disease associations.” The study presents an embedding-based approach that encodes medical records into a high-dimensional disease space. The resulting framework can be used to uncover disease associations, facilitate genetic parameter estimation, and enable data-driven disease classification. The authors also discuss the key challenges and future prospects of this emerging paradigm.

As shown in Figure 1A, the approach encodes human diseases as vectors in a high-dimensional space. It maps sparse, large-scale, multimodal health data, such as electronic health records, into continuous vector representations, so that similarity between diseases can be measured quantitatively and used in downstream analyses, including disease association studies and genetic analyses. The study also discusses key challenges related to medical text input, online model training, result validation, and the construction of multimodal foundation models.

Disease embedding workflow: Electronic health record (EHR) data—including demographic information, medical history, laboratory results, medication records, genetic sequences, and medical imaging—are collected and preprocessed before being fed into an embedding model (e.g., a neural network). The model generates high-dimensional disease vectors that can be applied to various downstream analyses, such as disease association studies, disease classification, genetic parameter estimation, and comorbidity analysis.
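
To make the idea of a disease space concrete, the following minimal Python sketch embeds four diseases from a toy patient-by-diagnosis matrix and compares them by cosine similarity. It is an illustration under stated assumptions, not the authors' model: the release mentions only "an embedding model (e.g., a neural network)", so a truncated singular value decomposition is swapped in here as a simple stand-in, and the records, disease names, and function names are all invented for the example.

```python
import numpy as np

def embed_diseases(records, n_dims=2):
    """Return one vector per disease from a binary patient-by-disease matrix.

    Stand-in technique (an assumption, not the published method):
    truncated SVD of the disease-by-patient matrix, so diseases that
    occur in the same patients end up with similar vectors.
    """
    disease_by_patient = np.asarray(records, dtype=float).T
    u, s, _ = np.linalg.svd(disease_by_patient, full_matrices=False)
    # Keep the leading n_dims components, scaled by their singular values.
    return u[:, :n_dims] * s[:n_dims]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical toy data: rows are patients, columns are diagnoses (1 = present).
diseases = ["type 2 diabetes", "hypertension", "asthma", "allergic rhinitis"]
records = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]

emb = embed_diseases(records, n_dims=2)

# Print the pairwise similarity of every disease pair in the toy disease space.
for i in range(len(diseases)):
    for j in range(i + 1, len(diseases)):
        sim = cosine_similarity(emb[i], emb[j])
        print(f"{diseases[i]} vs {diseases[j]}: {sim:.2f}")
```

In this toy setting, diseases that tend to be recorded in the same patients receive vectors pointing in similar directions, which is the property that downstream analyses such as comorbidity or association studies would rely on. Real electronic health record data would need the preprocessing steps described above and a far richer model.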
