Year: 2023 Volume: 31 Issue: SI-1 (6) Page Range: 1079-1098 Text Language: English DOI: 10.55730/1300-0632.4035 Index Date: 22-11-2023

TRCaptionNet: A novel and accurate deep Turkish image captioning model with vision transformer based image encoders and deep linguistic text decoders

Abstract:
Image captioning is a fundamental computer vision task that aims to understand and describe what is happening in an image or image region. Through an image captioning process, the actions and relations of the objects within the images are described, so the contents of the images can be understood and interpreted automatically by visual computing systems. In this paper, we propose TRCaptionNet, a novel deep learning-based Turkish image captioning (TIC) model for the automatic generation of Turkish captions. The proposed model essentially consists of a basic image encoder, a feature projection module based on vision transformers, and a text decoder. In the first stage, the system encodes the input images via the CLIP (contrastive language–image pretraining) image encoder. The CLIP image features are then passed through a vision transformer to obtain the final image features that are linked with the textual features. In the last stage, a deep text decoder built on a BERT (bidirectional encoder representations from transformers) based model is used to generate the image captions. Furthermore, unlike the related works, the NLLB (No Language Left Behind) machine translation model was employed to produce Turkish captions from the original English captions. Extensive performance evaluations were carried out, and widely known image captioning metrics such as BLEU, METEOR, ROUGE-L, and CIDEr were measured for the proposed model. Within the scope of the experiments, highly successful results were observed on MS COCO and Flickr30K, two prominent datasets in this field. A comparative performance analysis against the existing reports in the TIC literature shows that the proposed model outperforms the related works on TIC so far. Project details and demo links of TRCaptionNet will also be available on the project’s GitHub page (https://github.com/serdaryildiz/TRCaptionNet).
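The abstract describes a three-stage architecture: a CLIP image encoder, a vision-transformer feature projection module, and a BERT-based text decoder. Purely as an illustration of how such a pipeline could be wired together, the PyTorch sketch below connects a stand-in for CLIP patch features, a small transformer projection module, and an autoregressive caption decoder. All module names, dimensions, and the dummy vocabulary are hypothetical assumptions for the sketch and do not reproduce the authors' released implementation.

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    # Transformer encoder layers that refine patch-level image features
    # (a stand-in for the vision-transformer feature projection module).
    def __init__(self, dim=768, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, image_feats):                      # (B, P, dim)
        return self.encoder(image_feats)

class CaptionDecoder(nn.Module):
    # Autoregressive transformer decoder that cross-attends to the projected
    # image features and predicts caption tokens (stand-in for the BERT-based decoder).
    def __init__(self, vocab_size=32000, dim=768, depth=4, heads=8, max_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids, image_feats):           # (B, T), (B, P, dim)
        T = token_ids.size(1)
        x = self.embed(token_ids) + self.pos[:, :T]
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        x = self.decoder(x, image_feats, tgt_mask=causal_mask)
        return self.lm_head(x)                           # (B, T, vocab_size)

# Dummy forward pass: 2 images with 50 patch features each, 12 caption tokens.
clip_feats = torch.randn(2, 50, 768)            # placeholder for frozen CLIP image features
token_ids = torch.randint(0, 32000, (2, 12))    # placeholder for Turkish subword token ids
logits = CaptionDecoder()(token_ids, VisionProjector()(clip_feats))
print(logits.shape)                             # torch.Size([2, 12, 32000])

In the actual system, the image features would come from a pretrained CLIP image encoder and the token ids from a Turkish BERT-style tokenizer; the Turkish training captions would be produced beforehand by translating the English MS COCO and Flickr30K captions with NLLB, and the BLEU, METEOR, ROUGE-L, and CIDEr scores would be computed offline on the generated captions.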
Keywords: Image captioning, image understanding, Turkish image captioning, contrastive language–image pretraining, bidirectional encoder representations from transformers, image and natural language processing

Document Type: Article Article Type: Research Article Access Type: Open Access
APA Yıldız S, Memiş A, Varlı S (2023). TRCaptionNet: A novel and accurate deep Turkish image captioning model with vision transformer based image encoders and deep linguistic text decoders. Turkish Journal of Electrical Engineering and Computer Sciences, 31(SI-1 (6)), 1079 - 1098. 10.55730/1300-0632.4035
Chicago Yıldız Serdar,Memiş Abbas,Varlı Songül TRCaptionNet: A novel and accurate deep Turkish image captioning model with vision transformer based image encoders and deep linguistic text decoders. Turkish Journal of Electrical Engineering and Computer Sciences 31, no.SI-1 (6) (2023): 1079 - 1098. 10.55730/1300-0632.4035
MLA Yıldız Serdar,Memiş Abbas,Varlı Songül TRCaptionNet: A novel and accurate deep Turkish image captioning model with vision transformer based image encoders and deep linguistic text decoders. Turkish Journal of Electrical Engineering and Computer Sciences, vol. 31, no. SI-1 (6), 2023, pp. 1079-1098. 10.55730/1300-0632.4035
AMA Yıldız S,Memiş A,Varlı S TRCaptionNet: A novel and accurate deep Turkish image captioning model with vision transformer based image encoders and deep linguistic text decoders. Turkish Journal of Electrical Engineering and Computer Sciences. 2023; 31(SI-1 (6)): 1079 - 1098. 10.55730/1300-0632.4035
Vancouver Yıldız S,Memiş A,Varlı S TRCaptionNet: A novel and accurate deep Turkish image captioning model with vision transformer based image encoders and deep linguistic text decoders. Turkish Journal of Electrical Engineering and Computer Sciences. 2023; 31(SI-1 (6)): 1079 - 1098. 10.55730/1300-0632.4035
IEEE Yıldız S,Memiş A,Varlı S "TRCaptionNet: A novel and accurate deep Turkish image captioning model with vision transformer based image encoders and deep linguistic text decoders." Turkish Journal of Electrical Engineering and Computer Sciences, 31, pp. 1079-1098, 2023. 10.55730/1300-0632.4035
ISNAD Yıldız, Serdar vd. "TRCaptionNet: A novel and accurate deep Turkish image captioning model with vision transformer based image encoders and deep linguistic text decoders". Turkish Journal of Electrical Engineering and Computer Sciences 31/SI-1 (6) (2023), 1079-1098. https://doi.org/10.55730/1300-0632.4035