A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs

dc.contributor.author: Dabbrata Das
dc.contributor.author: Md Rishadul Bayesh
dc.contributor.author: Mohammad Amanul Islam
dc.contributor.author: Md. Asifur Rahman
dc.date.accessioned: 2026-04-06T09:55:41Z
dc.date.issued: 2024-02-05
dc.description.abstract: Automatic image captioning generates natural language descriptions of images by extracting meaningful features and modeling the contextual relationships within them. Traditional CNN-RNN models struggle with computational complexity and spatial awareness, while multimodal Large Language Models (LLMs) offer an alternative but often lack object precision and are computationally expensive. In this work, we propose a novel image captioning framework that combines a fine-tuned YOLOv8 model with an LLM for efficient and accurate caption generation. YOLOv8 detects objects and extracts their names, confidence scores, and bounding box coordinates; only detections with a confidence above 0.5 are passed to the LLM. This integration yields richer, more contextually accurate captions with lower inference time than existing methods. We evaluate our approach against multimodal LLMs and CNN-RNN models and demonstrate that it significantly improves efficiency while maintaining high caption quality. The method is a promising solution for real-time applications, offering faster and more reliable image captioning for systems such as autonomous technologies and content generation.
dc.identifier.citation: Das, Dabbrata, et al. "A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs." 2025 International Conference on Quantum Photonics, Artificial Intelligence, and Networking (QPAIN). IEEE, 2025.
dc.identifier.issn: 019062651
dc.identifier.uri: http://dspace.uttarauniversity.edu.bd:4000/handle/123456789/1390
dc.language.iso: en_US
dc.publisher: 2025 IEEE International Conference on Quantum Photonics, Artificial Intelligence, and Networking, QPAIN 2025 Conference Paper
dc.subject: Image Captioning
dc.subject: Large Language Models (LLMs)
dc.subject: Object Detection
dc.subject: Multimodal Learning
dc.title: A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs
dc.type: Article
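
The abstract describes a filtering step in which detections with a confidence above 0.5 are passed to the LLM as object names, confidence scores, and bounding box coordinates. The following is a minimal sketch of that step, not the authors' implementation. The Detection class, the function names, the sample detections, and the prompt wording are all hypothetical; the detector and the LLM call are left out so the example runs on its own.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical container, not taken from the paper)."""
    name: str
    confidence: float
    box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels


def filter_detections(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep only detections whose confidence is strictly above the threshold."""
    return [d for d in detections if d.confidence > threshold]


def build_prompt(detections: List[Detection]) -> str:
    """Format the kept detections as a text prompt (the wording is illustrative)."""
    lines = [
        f"- {d.name} (confidence {d.confidence:.2f}, box {tuple(round(v) for v in d.box)})"
        for d in detections
    ]
    return "Write one caption for an image containing these objects:\n" + "\n".join(lines)


# Made-up detections standing in for the output of an object detector.
detections = [
    Detection("dog", 0.91, (34.0, 50.0, 210.0, 300.0)),
    Detection("frisbee", 0.62, (220.0, 80.0, 270.0, 120.0)),
    Detection("bench", 0.31, (0.0, 200.0, 120.0, 320.0)),
]

kept = filter_detections(detections)
prompt = build_prompt(kept)
print(prompt)
```

In this sketch the low-confidence "bench" detection is dropped before the prompt is built, so the text sent to the LLM mentions only the dog and the frisbee.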

Files

Original bundle

Name: ImageCaptioning.pdf
Size: 1.1 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed to upon submission
