A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs

Dabbrata Das; Md Rishadul Bayesh; Mohammad Amanul Islam; Md. Asifur Rahman

dc.contributor.author	Dabbrata Das
dc.contributor.author	Md Rishadul Bayesh
dc.contributor.author	Mohammad Amanul Islam
dc.contributor.author	Md. Asifur Rahman
dc.date.accessioned	2026-04-06T09:55:41Z
dc.date.issued	2024-02-05
dc.description.abstract	Automatic image captioning generates natural language descriptions of images by extracting meaningful features and modeling the contextual relationships among them. Traditional methods such as CNN-RNN models struggle with computational complexity and spatial awareness, while multimodal Large Language Models (LLMs) offer an alternative but often lack object precision and are computationally expensive. In this work, we propose a novel image captioning framework that combines a fine-tuned YOLOv8 model with an LLM for efficient and accurate caption generation. YOLOv8 detects objects and extracts their names, confidence scores, and bounding box coordinates; only detections with a confidence score above 0.5 are kept and passed to the LLM. This integration yields richer, more contextually accurate captions with lower inference time than existing methods. We evaluate our approach against multimodal LLMs and CNN-RNN models, demonstrating that it significantly improves efficiency while maintaining high caption quality. The method is a promising solution for real-time applications, offering faster and more reliable image captioning for systems such as autonomous technologies and content generation.
dc.identifier.citation	Das, Dabbrata, et al. "A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs." 2025 International Conference on Quantum Photonics, Artificial Intelligence, and Networking (QPAIN). IEEE, 2025.
dc.identifier.issn	019062651
dc.identifier.uri	http://dspace.uttarauniversity.edu.bd:4000/handle/123456789/1390
dc.language.iso	en_US
dc.publisher	2025 IEEE International Conference on Quantum Photonics, Artificial Intelligence, and Networking, QPAIN 2025 Conference Paper
dc.subject	Image Captioning
dc.subject	Large Language Models (LLMs)
dc.subject	Object Detection
dc.subject	Multimodal Learning
dc.title	A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs
dc.type	Article
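
The abstract describes a pipeline in which detections (object name, confidence score, bounding box) are filtered at a confidence threshold of 0.5 and then handed to an LLM. The following is a minimal sketch of that filtering and hand-off step only, not the authors' implementation. The `Detection` type, the `filter_detections` and `build_prompt` helpers, the prompt wording, and the sample detections are all hypothetical and introduced here for illustration; the detector and the LLM call themselves are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical record for one detected object: name, score, box (x1, y1, x2, y2)."""
    label: str
    confidence: float
    box: Tuple[float, float, float, float]


def filter_detections(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep only detections whose confidence is strictly above the threshold."""
    return [d for d in detections if d.confidence > threshold]


def build_prompt(detections: List[Detection]) -> str:
    """Serialize the kept detections into a text prompt for a language model.

    The prompt wording is an assumption; the source does not specify it.
    """
    lines = [
        f"- {d.label} (confidence {d.confidence:.2f}, box {tuple(round(v) for v in d.box)})"
        for d in detections
    ]
    return (
        "Write one caption for an image containing these detected objects:\n"
        + "\n".join(lines)
    )


# Invented sample detections, standing in for real detector output.
raw = [
    Detection("dog", 0.91, (34.0, 50.0, 210.0, 300.0)),
    Detection("frisbee", 0.62, (220.0, 80.0, 270.0, 120.0)),
    Detection("bench", 0.31, (0.0, 200.0, 120.0, 320.0)),
]
kept = filter_detections(raw)   # "bench" is dropped because 0.31 is below 0.5
prompt = build_prompt(kept)
print(prompt)
```

In this sketch the prompt string would be sent to whichever LLM is used; keeping the filter as a separate function makes the 0.5 threshold easy to adjust.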

Files

Original bundle

Name: ImageCaptioning.pdf
Size: 1.1 MB
Format: Adobe Portable Document Format

Collections

Journal Articles