A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs
2025 IEEE International Conference on Quantum Photonics, Artificial Intelligence, and Networking (QPAIN 2025). Conference Paper.
Abstract
Automatic image captioning generates natural language descriptions by extracting meaningful features and understanding contextual relationships in images. While traditional methods such as CNN-RNN models struggle with computational complexity and spatial awareness, multimodal Large Language Models (LLMs) offer an alternative but often lack object precision and are computationally expensive. In this work, we propose a novel image captioning framework that combines a fine-tuned YOLOv8 model with an LLM for efficient and accurate caption generation. YOLOv8 detects objects and extracts their class names, confidence scores, and bounding box coordinates; detections with confidence above 0.5 are retained and passed to the LLM. This integration yields richer, more contextually accurate captions with lower inference time than existing methods. We evaluate our approach against multimodal LLMs and CNN-RNN models, demonstrating that it significantly improves efficiency while maintaining high caption quality. This method provides a promising solution for real-time applications, offering faster and more reliable image captioning for systems such as autonomous technologies and content generation.
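The detection-filtering and prompt-construction steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, detection tuple layout, and prompt wording are assumptions; in the paper the detections come from a fine-tuned YOLOv8 model rather than a hard-coded list.

```python
def filter_detections(detections, threshold=0.5):
    """Keep detections whose confidence exceeds the threshold.

    Each detection is a tuple: (class_name, confidence, (x1, y1, x2, y2)).
    """
    return [d for d in detections if d[1] > threshold]


def build_caption_prompt(detections):
    """Format filtered detections into a text prompt for the LLM."""
    lines = [
        f"- {name} (confidence {conf:.2f}, box {box})"
        for name, conf, box in detections
    ]
    return (
        "Describe this image in one sentence, given the detected objects:\n"
        + "\n".join(lines)
    )


# Example input in YOLOv8-style form (hypothetical values; in practice these
# would be read from the fine-tuned model's prediction results).
raw = [
    ("dog", 0.91, (34, 50, 220, 310)),
    ("ball", 0.42, (200, 280, 240, 320)),  # below 0.5, filtered out
    ("person", 0.78, (300, 20, 460, 400)),
]
kept = filter_detections(raw)
print(build_caption_prompt(kept))
```

The prompt string would then be sent to the chosen LLM, which composes the final caption from the object names and spatial cues.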
Citation
Das, Dabbrata, et al. "A Framework for Accurate and Efficient Image Captioning by Fusing Fine-Tuned YOLOv8 and LLMs." 2025 International Conference on Quantum Photonics, Artificial Intelligence, and Networking (QPAIN). IEEE, 2025.
