VISPROG is a modular and interpretable neuro-symbolic system for compositional visual reasoning. Given a few examples of natural language instructions and the desired high-level programs, VISPROG generates a program for any new instruction using in-context learning in GPT-3 and then executes the program on the input image(s) to obtain the prediction. VISPROG also summarizes the intermediate outputs into an interpretable visual rationale (Fig. 4). We demonstrate VISPROG on tasks that require composing a diverse set of modules for image understanding and manipulation, knowledge retrieval, and arithmetic and logical operations.

VisProg Teaser Figure


VisProg currently supports 20 modules that enable capabilities such as image understanding, image manipulation (including generation), knowledge retrieval, and arithmetic and logical operations. The red modules shown here are implemented using trained state-of-the-art neural models. The blue modules are non-neural Python functions implemented using a range of Python libraries such as PIL, OpenCV, and AugLy.
VisProg Module Figure
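
To make the idea of composing modules concrete, here is a minimal sketch of a program interpreter. It is not the VisProg implementation: the module names (COUNT, ADD), the one-step-per-line program syntax, and the helper functions are assumptions chosen for illustration, and the stand-in modules are plain Python functions rather than neural models.

```python
import ast
import re

# Assumed step syntax for this sketch: OUTPUT=MODULE(key=value,key=value)
STEP_PATTERN = re.compile(r"^(\w+)=(\w+)\((.*)\)$")


def count_items(items):
    """Hypothetical non-neural module: count the entries of a list."""
    return len(items)


def add(a, b):
    """Hypothetical arithmetic module."""
    return a + b


# Registry mapping module names used in programs to Python callables.
MODULES = {"COUNT": count_items, "ADD": add}


def parse_step(line):
    """Split one program line into (output name, module name, raw arguments)."""
    match = STEP_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"cannot parse step: {line!r}")
    output_name, module_name, arg_text = match.groups()
    args = {}
    # Limitation of this sketch: argument values must not contain commas.
    for pair in filter(None, (p.strip() for p in arg_text.split(","))):
        key, value = pair.split("=", 1)
        args[key.strip()] = value.strip()
    return output_name, module_name, args


def resolve(value, state):
    """Look up a variable in the program state, else parse a Python literal."""
    if value in state:
        return state[value]
    return ast.literal_eval(value)


def execute(program, state):
    """Run each step in order, storing every output back into the state."""
    state = dict(state)
    for line in program.strip().splitlines():
        output_name, module_name, args = parse_step(line)
        kwargs = {key: resolve(value, state) for key, value in args.items()}
        state[output_name] = MODULES[module_name](**kwargs)
    return state


program = """
N0=COUNT(items=LEFT)
N1=COUNT(items=RIGHT)
TOTAL=ADD(a=N0,b=N1)
"""
result = execute(program, {"LEFT": ["cat", "dog"], "RIGHT": ["bird"]})
print(result["TOTAL"])
```

Because every step writes its output to a named variable, later steps can consume earlier results, which is what lets small modules be chained into a larger computation.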

Program Generation in VisProg

VisProg generates programs by prompting an LLM, GPT-3, with an instruction along with in-context examples consisting of instructions and corresponding programs. Unlike previous approaches like Neural Module Network

Program Generation
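
As a rough illustration of the prompting step, the sketch below assembles an in-context prompt from instruction and program pairs followed by the new instruction. The prompt layout, the function name `build_prompt`, and the example programs are assumptions made for illustration; they are not taken from the VisProg prompts, and the call to the language model itself is left out.

```python
def build_prompt(examples, instruction):
    """Concatenate in-context (instruction, program) pairs and a new instruction.

    The "Instruction:" / "Program:" layout is an assumed format for this sketch.
    """
    parts = []
    for example_instruction, example_program in examples:
        parts.append(
            f"Instruction: {example_instruction}\nProgram:\n{example_program}\n"
        )
    # The prompt ends with an open "Program:" slot for the model to complete.
    parts.append(f"Instruction: {instruction}\nProgram:\n")
    return "\n".join(parts)


# Hypothetical in-context examples; module names are illustrative only.
examples = [
    ("Count the items on the left.", "ANSWER=COUNT(items=LEFT)"),
    (
        "How many items are there in total?",
        "N0=COUNT(items=LEFT)\nN1=COUNT(items=RIGHT)\nANSWER=ADD(a=N0,b=N1)",
    ),
]

prompt = build_prompt(examples, "Count the items on the right.")
print(prompt)
```

The resulting string would be sent to a text-completion model, and the returned completion treated as the program for the new instruction.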

Visual Rationales

Not only does VisProg produce a highly interpretable program, it also generates a visual rationale by stitching together a summary of the inputs and outputs of each step, which helps in understanding and debugging the information flow while executing the program. Below are two such visual rationales.

Image Editing with Natural Language

Paper Qual Fig

Reasoning about image pairs (NLVR)

Paper Qual Fig
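
The idea of stitching per-step inputs and outputs into a rationale can be sketched in a few lines. This is a text-only stand-in under stated assumptions: the step format, the `StepRecord` class, and the function names are hypothetical, and the rationales shown above summarize images rather than printed values.

```python
from dataclasses import dataclass


@dataclass
class StepRecord:
    """What one executed step consumed and produced (hypothetical structure)."""
    output_name: str
    module: str
    inputs: dict
    output: object


def run_with_trace(steps, state):
    """Execute steps in order and record the inputs and output of each one.

    Each step is (output name, module name, callable, input variable names).
    """
    state = dict(state)
    trace = []
    for output_name, module_name, function, input_names in steps:
        inputs = {name: state[name] for name in input_names}
        output = function(*inputs.values())
        state[output_name] = output
        trace.append(StepRecord(output_name, module_name, inputs, output))
    return state, trace


def stitch_rationale(trace):
    """Join the per-step records into one readable summary."""
    lines = []
    for index, record in enumerate(trace, start=1):
        shown_inputs = ", ".join(
            f"{name}={value!r}" for name, value in record.inputs.items()
        )
        lines.append(
            f"Step {index}: {record.output_name} = "
            f"{record.module}({shown_inputs}) -> {record.output!r}"
        )
    return "\n".join(lines)


# Hypothetical three-step program comparing object counts in two images.
steps = [
    ("N_LEFT", "COUNT", len, ["LEFT"]),
    ("N_RIGHT", "COUNT", len, ["RIGHT"]),
    ("ANSWER", "GREATER", lambda a, b: a > b, ["N_LEFT", "N_RIGHT"]),
]
final_state, trace = run_with_trace(steps, {"LEFT": ["dog", "dog"], "RIGHT": ["dog"]})
rationale = stitch_rationale(trace)
print(rationale)
```

Keeping the trace separate from the execution state means a wrong answer can be traced back to the first step whose recorded output looks incorrect.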

Qualitative Results

We demonstrate VisProg on a range of complex visual tasks such as compositional visual question answering, zero-shot reasoning about image pairs (using only a single-image VQA model), factual knowledge object tagging, and language-guided image editing. The following figure shows the capabilities of VisProg on object tagging and image editing tasks.

Paper Qual Fig

Here are many more qualitative results along with visual rationales for all tasks. We also include failure cases due to logical errors in the generated program or due to errors in a module's prediction.


Visual Programming: Compositional visual reasoning without training

Tanmay Gupta and Aniruddha Kembhavi. arXiv, 2022.