EFFICIENT HYBRID CNN-VISION TRANSFORMER FOR MEDICAL IMAGE CLASSIFICATION WITH LIMITED ANNOTATIONS
DOI: https://doi.org/10.69916/jkbti.v4i3.453

Keywords: medical image classification, hybrid CNN–Vision Transformer, limited annotation, OrganAMNIST, computational efficiency

Abstract
Medical image classification is a critical component of computer-aided diagnosis systems, yet its performance is often hindered by the scarcity of annotated data, a common situation in the medical domain due to ethical, cost, and labeling constraints. Convolutional Neural Networks (CNNs) are effective at extracting local features but suboptimal at capturing global context. Conversely, Vision Transformers (ViTs) excel at modeling long-range dependencies but require large amounts of training data. To address these limitations, this study proposes a hybrid CNN–Vision Transformer model that integrates the strengths of both to improve classification performance under limited annotation conditions. The model was evaluated on the OrganAMNIST dataset, consisting of 53,339 two-dimensional abdominal CT images spanning 11 organ classes. Experimental results show that the model achieves an accuracy of 92.3%, an F1-score of 91.8%, and an AUC of 99.5%, with only 3.67 million parameters. Compared to ResNet50, the model reduces the parameter count by 84% and increases inference speed by up to 2.4 times, while also demonstrating better training stability than baselines such as ResNet50 and ViT-Small. These results show that integrating local and global features in a hybrid architecture can improve accuracy and efficiency simultaneously, suggesting the approach is applicable to medical diagnosis systems with limited data and computational resources.
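The abstract does not specify the architecture's internals, but the general hybrid pattern it describes — a convolutional stem for local feature extraction feeding a Transformer encoder for global context — can be sketched as below. All hyperparameters (channel widths, depth, head count) are hypothetical illustrations, not the paper's configuration; the input shape assumes the 28×28 grayscale images of the MedMNIST OrganAMNIST benchmark.

```python
import torch
import torch.nn as nn


class HybridCNNViT(nn.Module):
    """Illustrative hybrid CNN-Transformer classifier (hypothetical
    hyperparameters, not the paper's exact architecture).

    A small CNN stem extracts local features and downsamples the input;
    the resulting feature map is flattened into tokens that a Transformer
    encoder processes for global context, followed by a linear head.
    """

    def __init__(self, in_ch=1, num_classes=11, dim=128, depth=4, heads=4):
        super().__init__()
        # CNN stem: two stride-2 convolutions, 28x28 -> 7x7
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, dim, 3, stride=2, padding=1),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )
        # Learnable positional embedding for the 7*7 = 49 tokens
        self.pos = nn.Parameter(torch.zeros(1, 7 * 7, dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 2, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        f = self.stem(x)                        # (B, dim, 7, 7)
        tokens = f.flatten(2).transpose(1, 2)   # (B, 49, dim)
        tokens = self.encoder(tokens + self.pos)
        return self.head(tokens.mean(dim=1))    # mean-pool tokens, classify


model = HybridCNNViT()
logits = model(torch.randn(2, 1, 28, 28))
print(logits.shape)  # (batch, 11 organ classes)
```

The design intent matches the abstract's argument: the convolutional stem keeps the token count small (49 tokens instead of hundreds of raw patches), which reduces attention cost and makes the Transformer trainable on modest amounts of annotated data.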
Copyright (c) 2025 San Sudirman, Ahmad Yani, Lalu Darmawan Bakti

This work is licensed under a Creative Commons Attribution 4.0 International License.