EFFICIENT HYBRID CNN-VISION TRANSFORMER FOR MEDICAL IMAGE CLASSIFICATION WITH LIMITED ANNOTATIONS

Authors

  • San Sudirman, Universitas Teknologi Mataram
  • Ahmad Yani, Universitas Teknologi Mataram
  • Lalu Darmawan Bakti, Universitas Teknologi Mataram

DOI:

https://doi.org/10.69916/jkbti.v4i3.453

Keywords:

medical image classification, hybrid CNN–Vision Transformer, limited annotation, OrganAMNIST, computational efficiency

Abstract

Medical image classification is a critical component of computer-aided diagnosis systems, yet its performance is often hindered by the scarcity of annotated data, a common situation in the medical domain due to ethical, cost, and labeling constraints. Convolutional Neural Networks (CNNs) are effective at extracting local features but suboptimal at capturing global context. Conversely, Vision Transformers (ViTs) excel at modeling long-range dependencies but require large amounts of training data. To address these limitations, this study proposes a hybrid CNN–Vision Transformer model that integrates the strengths of both to improve classification performance under limited annotation conditions. The model was evaluated on the OrganAMNIST dataset, consisting of 53,339 two-dimensional abdominal CT images spanning 11 organ classes. Experimental results show that the model achieves an accuracy of 92.3%, an F1-score of 91.8%, and an AUC of 99.5% with only 3.67 million parameters. Compared to ResNet50, the model reduces the parameter count by 84% and increases inference speed by up to 2.4 times, while also demonstrating better training stability than baselines such as ResNet50 and ViT-Small. These findings indicate that integrating local and global features in a hybrid architecture can simultaneously improve accuracy and efficiency, making the approach well suited to medical diagnosis systems with limited data and computational resources.
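The division of labor the abstract describes (convolutions for local features, self-attention for global context) can be sketched end to end with random weights in NumPy. This is a minimal illustrative sketch only: the 3×3 filters, 4×4 patch size, single attention head, and all layer widths are assumptions for the example, not the architecture proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """Valid cross-correlation of a 2-D image with one k x k filter."""
    k = w.shape[0]
    out = np.zeros((x.shape[0] - k + 1, x.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a token sequence."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (n_tokens, n_tokens)
    return weights @ V

# Toy 28x28 "CT slice" (OrganAMNIST images are 28x28 grayscale).
img = rng.standard_normal((28, 28))

# CNN stage: 4 random 3x3 filters + ReLU -> local feature maps (4, 26, 26).
filters = rng.standard_normal((4, 3, 3)) * 0.1
feats = np.stack([np.maximum(conv2d(img, f), 0.0) for f in filters])

# Tokenize: crop to 24x24 and cut each map into a 6x6 grid of 4x4 patches,
# giving 36 tokens of dimension 4 channels * 4 * 4 = 64.
f = feats[:, :24, :24].reshape(4, 6, 4, 6, 4)
tokens = f.transpose(1, 3, 0, 2, 4).reshape(36, 64)

# Transformer stage: global interaction among all 36 patch tokens.
d = 16
Wq, Wk, Wv = (rng.standard_normal((64, d)) * 0.1 for _ in range(3))
attended = self_attention(tokens, Wq, Wk, Wv)  # -> (36, 16)

# Classifier head: mean-pool the tokens, project to 11 organ-class logits.
W_head = rng.standard_normal((d, 11)) * 0.1
probs = softmax(attended.mean(axis=0) @ W_head)
print(probs.shape)
```

With random weights the output distribution over the 11 classes is of course meaningless; the point of the sketch is the data flow in which every patch token can attend to every other, giving the global context that a purely convolutional stack of this depth would lack.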




Published

2025-09-29


How to Cite

[1]
S. Sudirman, A. Yani, and L. Darmawan Bakti, “EFFICIENT HYBRID CNN-VISION TRANSFORMER FOR MEDICAL IMAGE CLASSIFICATION WITH LIMITED ANNOTATIONS”, JKBTI, vol. 4, no. 3, pp. 341–348, Sep. 2025.