Article Highlight | 5-Jan-2026

MCANet: Medical image segmentation with multi-scale cross-axis attention

Beijing Zhongke Journal Publishing Co. Ltd.

Image: Detailed visualization of the proposed method compared with recently popular medical segmentation methods (e.g., MISSFormer and Swin-UNet) on the Synapse dataset. The segmentation details produced by the different methods are highlighted in the blue rectangular boxes. The proposed method performs better than the other methods.

Credit: Beijing Zhongke Journal Publishing Co. Ltd.

Medical image segmentation is a critical and challenging research problem in medical image processing and computer vision. The task is to delineate clinically significant regions within medical images, providing a reliable foundation for clinical diagnosis and pathological studies and ultimately helping clinicians make precise assessments. It has a wide range of applications in clinical diagnosis, computer-aided surgery, pathology analysis, and other medical fields.

Over the past decade, rapid advancements in deep learning have established neural network-based approaches as the dominant paradigm in medical image segmentation. These approaches are primarily based on convolutional neural networks (CNNs). In particular, U-Net and its variants have achieved remarkable success in recent years. Their success can be attributed to the encoder-decoder architecture, in which skip connections efficiently combine the features extracted by the encoder at different scales with the semantic features produced by the decoder. However, the limited receptive field of convolutions prevents them from establishing long-range dependencies among pixels, which have been shown to be essential, especially for segmentation-like tasks.
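The receptive-field limitation can be made concrete with a little arithmetic. The sketch below is an illustration only, not taken from the paper: the helper name `receptive_field` and the example layer stacks are assumptions. It uses the standard recurrence in which each layer widens the receptive field by (kernel size - 1) times the product of the strides of the layers before it.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) tuples ordered from input to output.
    """
    rf = 1      # a single output unit initially "sees" one input pixel
    jump = 1    # distance in input pixels between neighbouring units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Four stacked 3x3 convolutions with stride 1 see only a 9-pixel-wide window.
print(receptive_field([(3, 1)] * 4))

# Downsampling helps, but the window still grows only gradually with depth.
print(receptive_field([(3, 1), (2, 2), (3, 1)]))
```

Under these assumptions, a pixel at one edge of a 512-pixel-wide scan cannot influence the prediction at the opposite edge unless the network is very deep or downsamples aggressively, which is the gap that attention mechanisms aim to close.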

As an alternative to CNNs in segmentation, vision transformers have recently attracted significant attention. Since transformers can model long-range dependencies, many works have introduced them into medical image segmentation. For example, TransUNet combines the advantages of ViT and U-Net in a new network. Later works, such as PMTrans, TransBTS, and UNet Transformers (UNETR), also applied different types of transformers to medical image segmentation. These methods have greatly improved on the performance of earlier CNN-based methods and have obtained state-of-the-art results on many benchmarks.

Although transformer-based approaches achieve good results, they still have flaws. First, transformers treat all elements equally, which may cause them to overlook locally important features. Yet local information is crucial for accurate medical image segmentation, as organs and lesions are often concentrated in specific regions. Second, when training data are scarce, learning positional biases becomes inefficient, which makes long-range interactions hard to learn and hinders the capture of spatial structure. Alternative methods such as axial attention consume fewer resources, but they still overlook local information and do not address the challenge of learning positional biases on small datasets.
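To see why axial attention consumes fewer resources, it helps to count query-key pairs. The sketch below is a back-of-the-envelope illustration, not a measurement from the paper; the function names and the 64 x 64 feature-map size are assumptions. Full self-attention compares every pixel with every other pixel, whereas axial attention compares each pixel only with the pixels in its own row and its own column.

```python
def full_attention_pairs(height, width):
    """Query-key pairs when every pixel attends to every pixel."""
    tokens = height * width
    return tokens * tokens


def axial_attention_pairs(height, width):
    """Query-key pairs when each pixel attends along its row and then its column."""
    tokens = height * width
    return tokens * (width + height)


h, w = 64, 64  # hypothetical feature-map size
full = full_attention_pairs(h, w)
axial = axial_attention_pairs(h, w)
print(full, axial, full // axial)
```

For a square map of side n, the ratio works out to n / 2, so the saving grows with resolution. The count says nothing about local detail, which is the shortcoming the release goes on to describe.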

To address these issues, researchers propose a multi-scale cross-axis attention (MCA) decoder in a paper published in Machine Intelligence Research. They modify the design of axial attention in two respects to better suit medical image segmentation. First, to mitigate the loss of local information, they introduce multi-scale local features into each axial attention path using strip-shaped kernels of varying sizes. This enables the decoder to better locate target regions and capture local details. Second, to address the difficulty of learning positional bias, they create a dual interaction between horizontal and vertical axial attention, rather than computing axial attention sequentially along the horizontal and vertical dimensions. Compared with previous methods, the decoder is extremely lightweight, with fewer than 1 million parameters, making it more suitable for practical use in clinical diagnosis.
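The two modifications can be sketched in a few lines of NumPy. This is a simplified illustration and not the authors' implementation. It assumes a single attention head, no learned projections, fixed box filters standing in for learned strip convolutions, illustrative kernel sizes (3, 5, 7), a residual sum at the output, and a particular choice of which path supplies the queries. All function names are hypothetical.

```python
import numpy as np


def strip_filter(x, k, axis):
    """Box filter of odd length k along one spatial axis (stand-in for a learned strip kernel)."""
    pad = [(0, 0)] * x.ndim
    pad[axis] = (k // 2, k // 2)
    xp = np.pad(x, pad)  # zero padding keeps the output size equal to the input size
    n = x.shape[axis]
    out = np.zeros_like(x, dtype=float)
    for s in range(k):
        out += np.take(xp, np.arange(s, s + n), axis=axis)
    return out / k


def multi_scale(x, axis, sizes=(3, 5, 7)):
    """Sum of strip-filtered copies of x at several (assumed) kernel sizes."""
    return sum(strip_filter(x, k, axis) for k in sizes)


def attend(q, k, v):
    """Scaled dot-product attention over the middle axis of (lines, length, channels) arrays."""
    scores = q @ np.swapaxes(k, 1, 2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def cross_axis_attention(feat):
    """feat: (C, H, W). Each axial path takes its queries from the other path."""
    fx = multi_scale(feat, axis=2)  # horizontal strips
    fy = multi_scale(feat, axis=1)  # vertical strips
    rows = lambda t: t.transpose(1, 2, 0)  # (H, W, C): attention along the width
    cols = lambda t: t.transpose(2, 1, 0)  # (W, H, C): attention along the height
    along_w = attend(rows(fy), rows(fx), rows(fx))  # queries from the vertical path
    along_h = attend(cols(fx), cols(fy), cols(fy))  # queries from the horizontal path
    return feat + along_w.transpose(2, 0, 1) + along_h.transpose(2, 1, 0)


x = np.random.default_rng(0).normal(size=(4, 6, 8))
print(cross_axis_attention(x).shape)
```

The point of the sketch is the wiring: the strip filters inject local context at several scales before attention is computed, and the two axial paths exchange information in parallel instead of running one after the other.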

Connecting the MCA decoder to the MSCAN backbone yields the researchers' network, named MCANet. The figures in the paper show that the proposed method, despite its lower computational complexity, achieves the best results on a series of widely used benchmark datasets covering skin lesion segmentation, nuclei segmentation, abdominal multi-organ segmentation, and polyp segmentation. The method also copes better with variations in the size and shape of individual targets. The researchers' contributions can be summarized as follows:

1) Researchers propose multi-scale cross-axis attention (MCA), which can capture long-range dependencies and encode multi-scale local information simultaneously without introducing much computational complexity.

2) Researchers design MCANet on the basis of MCA, which achieves excellent segmentation performance. Designing such networks is crucial to accommodate the trend of medical imaging shifting from the laboratory to the bedside.

3) Experiments on four typical tasks show that MCANet outperforms previous state-of-the-art methods with fewer parameters and lower computational cost.

Section 2 describes the network architectures and several multi-scale feature aggregation methods for medical image segmentation. Researchers also review some attention mechanisms that are related to their work.

Section 3 introduces the method. Researchers take the MSCAN network proposed in SegNeXt as their encoder because of its capability of capturing multi-scale features. The feature maps from the last three stages of the encoder are upsampled to a common resolution and then concatenated to form the input of the decoder. The decoder is based on multi-scale cross-axis attention, which takes advantage of both multi-scale convolutional features and axial attention.
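The fusion step described for Section 3 can be illustrated as follows. This is a minimal sketch under stated assumptions, not the paper's code: the channel counts and spatial sizes are hypothetical, nearest-neighbour upsampling stands in for whatever interpolation the authors use, and the helper names are invented for the example.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_stages(feats):
    """Upsample every map to the resolution of the first one and concatenate along channels.

    `feats` is ordered from highest to lowest spatial resolution.
    """
    target_h = feats[0].shape[1]
    upsampled = [upsample_nearest(f, target_h // f.shape[1]) for f in feats]
    return np.concatenate(upsampled, axis=0)


# Hypothetical outputs of the last three encoder stages for one image.
stage2 = np.zeros((8, 32, 32))
stage3 = np.zeros((16, 16, 16))
stage4 = np.zeros((32, 8, 8))

decoder_input = fuse_stages([stage2, stage3, stage4])
print(decoder_input.shape)
```

Concatenating along the channel axis keeps the coarse, semantically rich maps and the finer, more detailed map side by side, so the decoder can draw on all three scales at every spatial position.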

Section 4 evaluates MCANet on four challenging tasks: skin lesion segmentation, nuclei segmentation, abdominal multi-organ segmentation, and polyp segmentation on widely used datasets. Researchers also provide ablation analysis to help readers understand how each component in their method contributes to segmentation performance.

Section 5 summarizes the full paper. Researchers propose MCANet for medical image segmentation. The core component of MCANet is the multi-scale cross-axis attention, a new method that combines multi-scale features and cross-axis attention to better segment organs or lesions with different sizes and shapes. Extensive experiments on four typical medical image segmentation tasks show that the proposed MCANet outperforms previous state-of-the-art methods with fewer parameters and lower computational complexity.

See the article:

MCANet: Medical Image Segmentation with Multi-scale Cross-axis Attention

http://doi.org/10.1007/s11633-025-1552-6
