Speciality vs. Generality

A special-purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires manipulating the architecture, such as adding an output head for each new task or dataset. In this work, we propose a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text. The system supports a wide range of vision tasks, including classification, localization, question answering, and captioning. We evaluate the system's ability to learn multiple skills simultaneously, to perform tasks with novel skill-concept combinations, and to learn new skills efficiently and without forgetting.
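For concreteness, a minimal sketch of this task-agnostic interface is shown below; the names (`GPVOutput`, `run_gpv`) are hypothetical illustrations of the inputs and outputs described above, not the released API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GPVOutput:
    boxes: List[List[float]]  # one [x1, y1, x2, y2] box per predicted region
    relevance: List[float]    # per-box confidence that the region matters for the task
    text: str                 # free-form text: class label, answer, caption, ...

def run_gpv(model, image, task_description: str) -> GPVOutput:
    """Every task uses the same call: an image plus a natural language
    task description. No per-task output heads are added."""
    return model(image, task_description)

# The same model and the same signature serve very different tasks:
#   run_gpv(model, img, "What species is this bird?")  -> text answer
#   run_gpv(model, img, "Locate the muskrat.")         -> boxes + relevance
#   run_gpv(model, img, "Describe this image.")        -> caption
```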

GPV Teaser Figure

GPV-I is a general purpose vision-language architecture that can learn and perform any task requiring bounding-box or text output. We demonstrate the effectiveness of GPV-I by jointly training it on VQA, Captioning, Localization, and Classification tasks, achieving favorable performance compared to specialized single-task models.
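To make the joint training concrete, the sketch below mixes the four tasks in a single training loop. It assumes a model with the interface described above (image features and task text in; boxes, relevance scores, and text logits out); the loader structure, loss choices, uniform task sampling, and the omission of DETR-style box matching are simplifying assumptions for this sketch, not the paper's exact recipe.

```python
import random
import torch.nn.functional as F

TASKS = ["vqa", "captioning", "localization", "classification"]

def joint_step(model, loaders, optimizer):
    # Sample one task per step so all four skills are learned simultaneously.
    task = random.choice(TASKS)
    batch = next(loaders[task])
    boxes, relevance, logits = model(
        batch["regions"], batch["task_text"], batch["text_in"])
    if task == "localization":
        # Localization supervises the box and relevance outputs
        # (Hungarian matching between predicted and ground-truth boxes
        # is omitted here for brevity).
        loss = F.l1_loss(boxes, batch["gt_boxes"]) \
             + F.binary_cross_entropy(relevance, batch["gt_relevance"])
    else:
        # VQA, captioning, and classification are all supervised as text.
        loss = F.cross_entropy(logits.flatten(0, 1), batch["text_out"].flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task, loss.item()
```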

Tenets of General Purpose Learning

  1. Generality of architecture: The system can learn and perform any task within a broad domain without changes to the network structure (e.g. learn to classify bird species, without adding new output heads, by re-using its abilities to encode images, interpret the task from text, and produce words)
  2. Generality of concepts across skills: The system can perform tasks in skill-concept combinations not seen during training (e.g. localize "muskrat" after learning to answer questions about "muskrats")
  3. Generality of learning: The system can learn new tasks sample-efficiently, with minimal loss in performance on previously learned tasks

Architecture of GPV-I

GPV-I consists of a visual encoder, a language encoder, a vision-language co-attention module, and output heads for the supported output modalities: boxes, relevance scores, and text. We use the CNN backbone and the transformer encoder-decoder from DETR, an end-to-end trainable object detector. The natural language task description is encoded with BERT. To cross-contextualize representations from the visual and language encoders, we use ViLBERT's co-attention module. The box and objectness heads predict task-agnostic bounding boxes and objectness scores. The relatedness head predicts a task-specific score for each output box, which is combined with the objectness score to obtain a relevance score. The text decoder is a transformer decoder that autoregressively generates the text output, using the relevance-conditioned representations produced by the cross-modal module as memory.
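The forward pass described above can be sketched roughly as follows in PyTorch. This is a schematic under stated assumptions: the DETR visual encoder and BERT language encoder are taken as given (their outputs arrive as `region_feats` and `task_feats`), the co-attention stand-in is a single pair of cross-attention layers rather than ViLBERT's full stack, the combination of objectness and relatedness is shown as a simple product (one plausible choice), and all names and dimensions are illustrative rather than the released implementation.

```python
import torch.nn as nn

D = 256  # shared hidden size; illustrative, not the paper's exact value

class CoAttentionSketch(nn.Module):
    """Stand-in for ViLBERT-style co-attention: each stream attends to the other."""
    def __init__(self, d=D, nhead=8):
        super().__init__()
        self.v2l = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.l2v = nn.MultiheadAttention(d, nhead, batch_first=True)

    def forward(self, v, l):
        v_ctx, _ = self.v2l(v, l, l)  # visual tokens attend to language
        l_ctx, _ = self.l2v(l, v, v)  # language tokens attend to vision
        return v_ctx, l_ctx

class GPV1Sketch(nn.Module):
    def __init__(self, d=D, vocab_size=30522):
        super().__init__()
        self.co_attention = CoAttentionSketch(d)
        self.box_head = nn.Linear(d, 4)          # task-agnostic boxes
        self.objectness_head = nn.Linear(d, 1)   # task-agnostic objectness
        self.relatedness_head = nn.Linear(d, 1)  # task-specific relatedness
        layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.word_head = nn.Linear(d, vocab_size)

    def forward(self, region_feats, task_feats, text_embeds):
        # region_feats: (B, Q, d) DETR query outputs; task_feats: (B, T, d)
        # BERT outputs; text_embeds: (B, S, d) shifted target-text embeddings.
        v_ctx, _ = self.co_attention(region_feats, task_feats)

        # Boxes and objectness are task-agnostic (from DETR features);
        # relatedness is task-specific (from cross-contextualized features).
        boxes = self.box_head(region_feats).sigmoid()              # (B, Q, 4)
        objectness = self.objectness_head(region_feats).sigmoid()  # (B, Q, 1)
        relatedness = self.relatedness_head(v_ctx).sigmoid()       # (B, Q, 1)
        relevance = objectness * relatedness                       # combined score

        # Relevance-conditioned representations serve as decoder memory
        # for autoregressive text generation.
        memory = v_ctx * relevance
        hidden = self.text_decoder(tgt=text_embeds, memory=memory)
        return boxes, relevance.squeeze(-1), self.word_head(hidden)
```

A causal mask over `text_embeds` (omitted here for brevity) would be needed for proper autoregressive training of the text decoder.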

GPV-I Architecture Figure

Qualitative Results

Paper

Towards General Purpose Vision Systems

Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem. CVPR 2022.

Video