News Release 22-Dec-2025

Breakthroughs in optical image processing powered by vision-language models

Peer-Reviewed Publication

KeAi Communications Co., Ltd.

**image:**
**Evolution of optical image processing techniques.**

Credit: Xuelong Li, et al

The field of optical image processing is undergoing a transformation driven by the rapid development of vision-language models (VLMs). A new review article published in iOptics details how these models are overcoming challenges such as scarce high-quality expert annotations, weak cross-modal association, and poor task generalization. This shift is moving the field from perceptual computation towards cognitive understanding, opening new pathways for intelligent analysis.

The review notes that optical images, formed by modulating the amplitude, phase, wavelength, and polarization of light, are crucial in specialized fields including medicine, remote sensing, and industrial inspection. Unlike natural images, they contain high-dimensional physical information and fine structural details but often lack rich semantic expression. The integration of VLMs is now enabling a more unified, intelligent approach to processing these complex images.

The review outlines technological milestones enabling this progress:

Vision Transformer (ViT) established a new paradigm by using global attention mechanisms for comprehensive image feature extraction, surpassing previous convolutional methods.

CLIP demonstrated powerful cross-modal contrastive learning, achieving zero-shot recognition by aligning images and text in a shared semantic space.

BLIP and similar models bridged visual understanding and language generation, enabling high-quality image captioning and interactive question-answering.

The LLaVA series effectively connected visual encoders with large language models, creating robust multimodal dialogue systems for tasks like visual question answering.

The Kosmos series introduced a unified architecture where visual and language tokens are processed together within a single Transformer, enabling deeper multimodal fusion and reasoning.
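
To make the idea of zero-shot recognition in a shared semantic space more concrete, the following is a minimal sketch in the spirit of CLIP, not code from the review. It assumes an image encoder and a text encoder have already mapped their inputs to vectors of the same dimension. The three-dimensional embeddings, the label prompts, the function name `zero_shot_classify`, and the temperature of 0.07 are all hypothetical values chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, temperature=0.07):
    """Pick the text prompt whose embedding is closest to the image embedding.

    No classifier is trained for these labels: the labels are supplied
    only as text at inference time, which is what makes it "zero-shot".
    The temperature value is an illustrative assumption.
    """
    labels = list(text_embeddings)
    logits = [
        cosine(image_embedding, text_embeddings[name]) / temperature
        for name in labels
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], dict(zip(labels, probs))


# Hypothetical embeddings standing in for real encoder outputs.
TEXT_EMBEDDINGS = {
    "a CT scan": [0.9, 0.1, 0.0],
    "a satellite image": [0.1, 0.9, 0.1],
    "a photo of a weld defect": [0.0, 0.2, 0.9],
}
IMAGE_EMBEDDING = [0.2, 0.8, 0.1]

label, probs = zero_shot_classify(IMAGE_EMBEDDING, TEXT_EMBEDDINGS)
print(label)  # a satellite image
```

In a real system the embeddings would come from trained encoders with hundreds of dimensions. The sketch only shows the final matching step, where new categories can be added by writing a new text prompt rather than retraining.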

The review also surveys applications in three fields:

Medical Imaging: Models are achieving 3D understanding of CT/MRI data, supporting diagnostic localization and automated report generation.

Remote Sensing Monitoring: Systems enable integrated analysis of optical, synthetic aperture radar (SAR), and infrared data for unified scene recognition, land cover classification, and interactive query-answering.

Industrial Inspection: Tools allow for conversational anomaly detection, few-shot defect identification, and interpretable semantic analysis with pixel-level localization.

The authors note that the future trajectory points toward systems with enhanced autonomous decision-making, real-time response capabilities, and sophisticated multi-source fusion understanding. “Continued progress is expected from upgrades in model architectures, the systematic construction of high-quality multimodal datasets, and stronger cross-modal reasoning abilities,” says corresponding author Prof. Xuelong Li, CTO and Chief Scientist of China Telecom, and Director of the Institute of Artificial Intelligence of China Telecom (TeleAI). “These developments are set to provide revolutionary technical support across scientific research and industrial applications, steering optical image processing toward a more general and intelligent future.”

###

About the Author

The review is led by Prof. Xuelong Li, CTO and Chief Scientist of China Telecom, and Director of the Institute of Artificial Intelligence of China Telecom (TeleAI). Prof. Li has long been engaged in optical imaging and image processing, with notable original contributions to deep-sea cameras and intelligent processing that have been applied to deep-sea exploration missions.

He is a Fellow of SPIE, OSA, IEEE, AAAI, AAAS, and ACM, among others. He is also a member of the European Academy of Sciences. He has previously served as Deputy Director of the Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, and has founded several key laboratories. Professor Li has received numerous national awards, including the National Technological Invention Award, the National Natural Science Award, and the Ho Leung Ho Lee Foundation Science and Technology Innovation Award, among others.

The publisher KeAi was established by Elsevier and China Science Publishing & Media Ltd to bring quality research to a global audience. In 2013, its focus shifted to open access publishing. It now publishes more than 200 world-class, open access, English-language journals spanning all scientific disciplines. Many of these titles are published in partnership with prestigious societies and academic institutions, such as the National Natural Science Foundation of China (NSFC).

DOI

10.1016/j.iopt.2025.100003

Article Title

Optical image processing and applications empowered by vision-language models
