A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs

Dabbrata Das; Md Rishadul Bayesh; Mohammad Amanul Islam; Md. Asifur Rahman

Files

ImageCaptioning.pdf (1.1 MB)

Date

2024-02-05

Authors

Dabbrata Das

Md Rishadul Bayesh

Mohammad Amanul Islam

Md. Asifur Rahman

Publisher

2025 IEEE International Conference on Quantum Photonics, Artificial Intelligence, and Networking, QPAIN 2025 Conference Paper

Abstract

Automatic image captioning generates natural language descriptions of images by extracting meaningful features and modeling the contextual relationships among them. Traditional CNN-RNN models struggle with computational complexity and spatial awareness, while multimodal Large Language Models (LLMs) offer an alternative but often lack object-level precision and are computationally expensive. In this work, we propose a novel image captioning framework that combines a fine-tuned YOLOv8 model with an LLM for efficient and accurate caption generation. YOLOv8 detects objects and extracts their class names, confidence scores, and bounding box coordinates; only detections with confidence above 0.5 are passed to the LLM. This integration yields richer, more contextually accurate captions with lower inference time than existing methods. We evaluate our approach against multimodal LLMs and CNN-RNN models and show that it significantly improves efficiency while maintaining high caption quality. The method is a promising option for real-time applications, offering faster and more reliable image captioning for systems such as autonomous technologies and content generation.
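
The detection-to-prompt step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Detection` structure, the helper names `filter_detections` and `build_prompt`, the sample detections, and the prompt wording are all assumptions introduced here. Only the 0.5 confidence threshold and the three extracted fields (name, confidence score, bounding box) come from the abstract. The detector and the LLM call are deliberately left out, so the sketch runs without any model weights.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detector output: class name, confidence, and box (x1, y1, x2, y2)."""
    name: str
    confidence: float
    box: Tuple[float, float, float, float]


def filter_detections(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep only detections whose confidence is strictly above the threshold."""
    return [d for d in detections if d.confidence > threshold]


def build_prompt(detections: List[Detection]) -> str:
    """Serialize the kept detections into a text prompt for a caption-generating LLM."""
    lines = [
        f"- {d.name} (confidence {d.confidence:.2f}, box {tuple(round(v) for v in d.box)})"
        for d in detections
    ]
    # Hypothetical prompt wording; the paper's actual prompt is not given in this record.
    return (
        "Write one concise caption for an image containing these detected objects:\n"
        + "\n".join(lines)
    )


# Hypothetical detector output for a single image (illustrative values only).
raw = [
    Detection("person", 0.91, (34.0, 50.0, 210.0, 400.0)),
    Detection("bicycle", 0.78, (180.0, 220.0, 420.0, 410.0)),
    Detection("dog", 0.32, (5.0, 300.0, 60.0, 380.0)),
]
kept = filter_detections(raw)
prompt = build_prompt(kept)
print(prompt)
```

In this sketch the low-confidence "dog" detection is dropped before the prompt is built, so the LLM sees only the objects the detector is reasonably sure about. Passing a short text summary instead of the image itself is one plausible reason for the lower inference time the abstract reports.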

Keywords

Image Captioning, Large Language Models (LLMs), Object Detection, Multimodal Learning

Citation

Das, Dabbrata, et al. "A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs." 2025 International Conference on Quantum Photonics, Artificial Intelligence, and Networking (QPAIN). IEEE, 2025.

URI

http://dspace.uttarauniversity.edu.bd:4000/handle/123456789/1390

Collections

Journal Articles
