Volume 14, Issue 6, June 2026
This work is licensed under a Creative Commons Attribution 4.0 International License.
ZooVision: AI-Powered Animal Captioning And Question Answering
Sarah Jose, Goutham Krishna L U
Abstract: This paper presents ZooVision, a domain-specific Visual Question Answering (VQA) system developed to support zoo animal identification and interactive educational applications. The framework combines vision and language understanding by fine-tuning a Vision-and-Language Transformer (ViLT) on a custom dataset of 212 animal images covering 20 species. Through this specialized training, the model acquires knowledge of animal classification, dietary habits, habitats, and behavioral characteristics, enabling more accurate and context-aware responses to user queries. To further enhance visual understanding, a BLIP-based image captioning model generates descriptive captions from input images. These captions are supplied as additional context through a prompt augmentation strategy inspired by the Caption-Conditioned Visual Question Answering (CC-VQA) framework. The caption-derived semantic context helps the system align visual features with natural language questions, improving reasoning and answer accuracy. Fine-tuning also expands the model's domain knowledge by introducing 292 specialized biological terms that are not commonly represented in general-purpose VQA datasets. This enriched vocabulary enables more detailed and informative responses within the zoological domain. Experimental observations indicate that the caption-conditioned approach contributes to stronger contextual understanding, particularly for species recognition and attribute-based questions. The modular architecture of ZooVision also allows future integration of larger datasets and more advanced vision-language models. To facilitate practical use, the complete framework is deployed as a responsive web application in which users can upload animal images, view automatically generated captions, and ask questions interactively. By combining image captioning and visual question answering within a unified platform, ZooVision demonstrates the potential of multimodal artificial intelligence to enhance zoological learning, public engagement, and wildlife-related educational experiences.
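The prompt augmentation step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact template, so the function name `build_augmented_prompt`, the "caption. question" layout, and the `max_words` budget below are all assumptions made for illustration, not the authors' implementation. The idea shown is only that a generated caption is prepended to the user's question before the combined text is passed to the VQA model.

```python
def build_augmented_prompt(caption: str, question: str, max_words: int = 40) -> str:
    """Prepend a generated caption to a question (hypothetical template).

    The "caption. question" layout and the word budget are illustrative
    assumptions; the paper's abstract does not specify either.
    """
    # Normalise whitespace and drop a trailing period from the caption.
    caption = " ".join(caption.split()).rstrip(".")
    question = " ".join(question.split())
    if not question:
        raise ValueError("question must not be empty")
    if not caption:
        return question

    # Keep the question intact and trim the caption to fit the budget.
    budget = max(max_words - len(question.split()), 0)
    caption_words = caption.split()[:budget]
    if not caption_words:
        return question
    return f"{' '.join(caption_words)}. {question}"


if __name__ == "__main__":
    # In a full pipeline the caption would come from a captioning model.
    print(build_augmented_prompt("a tiger lying on the grass",
                                 "What does this animal eat?"))
```

Trimming the caption rather than the question is a design choice in this sketch: the question carries the user's intent, while the caption is supporting context that can be shortened when the text encoder has a limited input length.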
Keywords: Visual Question Answering, Image Captioning, BLIP, ViLT, Deep Learning, Prompt Augmentation, Wildlife Education.
How to Cite:
[1] Sarah Jose, Goutham Krishna L U, "ZooVision: AI-Powered Animal Captioning And Question Answering," International Journal of Innovative Research in Electrical, Electronics, Instrumentation and Control Engineering (IJIREEICE), DOI: 10.17148/IJIREEICE.2026.14621
