Visual Question Answering (VQA) is a multimodal artificial intelligence task where a model must answer a natural language question about the content of an image or video. It requires a joint understanding of both visual information (objects, attributes, spatial relationships) and textual semantics, moving beyond simple object detection to complex reasoning, counting, and inference. The task is a benchmark for evaluating cross-modal integration and is foundational for applications in assistive technology, autonomous systems, and interactive AI agents.
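The cross-modal integration described above is often formulated as classification over a fixed answer vocabulary: encode the image, encode the question, fuse the two representations, and score each candidate answer. The following is a minimal NumPy sketch of that fusion step, not a real system; the dimensions, the random stand-in weights, and the `answer_logits` helper are all hypothetical, and real models would use a trained vision encoder and language model in place of the random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, chosen only for this sketch).
IMG_DIM, TXT_DIM, HIDDEN, NUM_ANSWERS = 512, 300, 256, 10

# Toy stand-ins for encoder outputs: in a real system these would come
# from a vision backbone (e.g. a CNN or ViT) and a text encoder.
image_features = rng.standard_normal(IMG_DIM)      # pooled visual features
question_embedding = rng.standard_normal(TXT_DIM)  # pooled question embedding

# Randomly initialized weights stand in for trained fusion parameters.
W_fuse = rng.standard_normal((HIDDEN, IMG_DIM + TXT_DIM)) * 0.01
W_cls = rng.standard_normal((NUM_ANSWERS, HIDDEN)) * 0.01

def answer_logits(img_vec, txt_vec):
    """Fuse the two modalities by concatenation, then score a fixed
    answer vocabulary -- the classification-style formulation of VQA."""
    fused = np.tanh(W_fuse @ np.concatenate([img_vec, txt_vec]))
    return W_cls @ fused

logits = answer_logits(image_features, question_embedding)
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over answers
predicted_answer = int(np.argmax(probs))       # index into the answer vocabulary
```

Concatenation is the simplest fusion strategy; production systems typically use richer mechanisms such as cross-attention between image regions and question tokens, but the overall shape of the computation, two encoders feeding a joint representation that is scored against answers, is the same.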
