A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs

Publisher

2025 IEEE International Conference on Quantum Photonics, Artificial Intelligence, and Networking (QPAIN 2025), Conference Paper

Abstract

Automatic image captioning generates natural language descriptions of images by extracting meaningful features and modeling the contextual relationships among them. Traditional CNN-RNN models struggle with computational complexity and limited spatial awareness; multimodal Large Language Models (LLMs) offer an alternative but often lack object-level precision and are computationally expensive. In this work, we propose a novel image captioning framework that combines a fine-tuned YOLOv8 model with an LLM for efficient and accurate caption generation. YOLOv8 detects the objects in an image and outputs their names, confidence scores, and bounding box coordinates; only detections with a confidence score above 0.5 are passed to the LLM. This integration yields richer, more contextually accurate captions with lower inference time than existing methods. We evaluate our approach against multimodal LLMs and CNN-RNN models and show that it significantly improves efficiency while maintaining high caption quality. The method is a promising solution for real-time applications, offering faster and more reliable image captioning for systems such as autonomous technologies and content generation.
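
The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Detection` container, the `filter_detections` and `build_prompt` helpers, the prompt wording, and the sample values are all hypothetical, and the detector and LLM calls are left out entirely. Only the 0.5 confidence threshold and the three fields passed on (name, confidence score, bounding box) come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object (hypothetical container, not the paper's data structure)."""
    name: str
    confidence: float
    box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in pixels


def filter_detections(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep only detections whose confidence is strictly above the threshold."""
    return [d for d in detections if d.confidence > threshold]


def build_prompt(detections: List[Detection]) -> str:
    """Serialize the kept detections into a text prompt for an LLM (wording is an assumption)."""
    lines = [
        f"- {d.name} (confidence {d.confidence:.2f}) at box "
        f"({d.box[0]:.0f}, {d.box[1]:.0f}, {d.box[2]:.0f}, {d.box[3]:.0f})"
        for d in detections
    ]
    header = "Describe the image in one sentence given these detected objects:\n"
    return header + "\n".join(lines)


# Made-up values standing in for the output of an object detector.
raw = [
    Detection("dog", 0.91, (34, 50, 210, 300)),
    Detection("frisbee", 0.62, (220, 40, 280, 90)),
    Detection("bench", 0.31, (0, 200, 100, 320)),
]
kept = filter_detections(raw)
prompt = build_prompt(kept)
print(prompt)
```

In this sketch the low-confidence "bench" detection is dropped before the prompt is built, so the LLM only sees objects the detector is reasonably sure about. How the paper actually formats the detections for the LLM is not stated in the abstract.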

Citation

Das, Dabbrata, et al. "A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs." 2025 International Conference on Quantum Photonics, Artificial Intelligence, and Networking (QPAIN). IEEE, 2025.
