Can we transfer skills learned from textual data to visual tasks?

Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether this makes it possible to learn those skills from text data and then use them to complete vision tasks without ever training on visual data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between the embedding spaces of different modalities in contrastive models, and we analyze how these differences affect our approach and study a variety of strategies to mitigate this concern. We produce models using only text training data for three tasks: image captioning, visual entailment, and visual question answering, and evaluate them on standard benchmarks using images. We find that this kind of transfer is possible and results in only a small drop in performance relative to models trained on images. We also showcase a variety of stylistic image captioning models that were trained using no image data and no human-curated language data, but instead text data from books, the web, or language models.

Cross modaL transfer On Semantic Embeddings

We use vision and language encoders trained with a contrastive loss. These models learn to embed text and images into vectors such that the vectors for matching images and captions are close together, and vectors for unrelated images and captions are far apart. We propose a method called Cross modaL transfer On Semantic Embeddings (CLOSE) to take advantage of these encoders. During training, the text input is encoded into a vector with the (frozen) text encoder, which is then used as an input to a task model. During testing, the visual input is embedded with the (frozen) image encoder and used in place of the text embedding. Because these encoders were explicitly trained to produce embeddings that encode semantics in similar ways, learning to read and process the text vector should naturally translate to the ability to read and process the image vector.

CLOSE Model Figure

Figure 1. Overview of CLOSE.
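As a concrete illustration, the training and inference loop can be sketched as follows. This is a minimal sketch assuming the HuggingFace CLIP implementation; the task model, the noise scale, and the batch variables are hypothetical placeholders rather than the exact components used in the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def embed_text(captions):
    """Frozen CLIP text embedding, L2-normalized (used during training)."""
    inputs = processor(text=captions, return_tensors="pt", padding=True, truncation=True)
    feats = clip.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def embed_image(images):
    """Frozen CLIP image embedding, L2-normalized (used at test time)."""
    inputs = processor(images=images, return_tensors="pt")
    feats = clip.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def add_gaussian_noise(vec, scale=0.1):
    """Add isotropic Gaussian noise to the text vector and re-normalize;
    the scale here is a hypothetical hyperparameter."""
    noisy = vec + scale * torch.randn_like(vec)
    return noisy / noisy.norm(dim=-1, keepdim=True)

# Training step (text only): the task model conditions on the noisy text vector.
#   v = add_gaussian_noise(embed_text(batch_captions))
#   loss = task_model(v, batch_targets)
#
# Test step (images): swap in the image embedding; nothing else changes.
#   v = embed_image(batch_images)
#   prediction = task_model.generate(v)
```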

Although we focus on text-to-image transfer, our approach is applicable to contrastive models for other modalities, such as video, point clouds, and audio, potentially allowing transfer between many other modality pairs. One potential difficulty with this approach is that, while contrastive embeddings do share some structure between modalities, there can still be significant differences between the image and text vectors in practice. To mitigate this, we propose to additionally use adapters that modify the text vectors during training. We find adding Gaussian noise to be very effective in boosting performance, but consider other approaches as well:

Linear Adapter

We learn the modality shift by training a linear model to minimize the Euclidean distance between paired text and image vectors. We continue to add Gaussian noise after applying this model.
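A minimal sketch of this adapter, assuming `text_vecs` and `image_vecs` hold L2-normalized CLIP embeddings of paired captions and images from an auxiliary corpus; fitting the map with ordinary least squares is one illustrative choice, not necessarily the exact procedure used in the paper.

```python
import torch

def fit_linear_adapter(text_vecs, image_vecs):
    """Least-squares fit of a linear map (with bias) from text embeddings to
    image embeddings, minimizing the Euclidean distance between mapped text
    vectors and their paired image vectors."""
    ones = torch.ones(text_vecs.shape[0], 1)
    X = torch.cat([text_vecs, ones], dim=1)           # [N, D+1] with bias column
    W = torch.linalg.lstsq(X, image_vecs).solution    # [D+1, D]
    return W

def apply_adapter(text_vec, W, noise_scale=0.1):
    """Map a batch of text vectors toward image space, then add Gaussian
    noise as in the base method and re-normalize."""
    x = torch.cat([text_vec, torch.ones(text_vec.shape[0], 1)], dim=1)
    mapped = x @ W
    noisy = mapped + noise_scale * torch.randn_like(mapped)
    return noisy / noisy.norm(dim=-1, keepdim=True)
```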

Structured Noise

Even in principle, we do not expect a perfect one-to-one mapping between text and image vectors, because an image vector can be similar to many different texts that describe different parts or details of the image. We conduct a small case study by selecting four image/caption pairs that represent two different semantic changes, and then examining how the image and text vectors shift with these changes in semantics. We observe that the text vectors change in a consistent manner when the species or position of the animal is changed, while the image vectors shift in more inconsistent directions. As a result, a shift in the text vectors does not correspond to a consistent shift in the image vectors, which makes perfectly aligning image and text vectors inherently challenging.

Example of Vector Shifts

Figure 2. An example of how image/text feature vectors shift with a specific change in species (vertically) or position (horizontally). Text adjacent to each arrow shows any significant changes in the text (purple) or image (red) vector that occurred because of the shift.

This motivates us to approach the problem by better understanding how text vectors are distributed around their corresponding image vectors, instead of just trying to learn a simple mapping function. To account for this structured relationship during training, we add noise to the text vectors sampled from a Gaussian whose mean and covariance are estimated from the differences between paired image and text vectors in an auxiliary corpus. This noise is expected to better simulate the text-to-image shift that will occur during evaluation.
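A sketch of this structured-noise variant is shown below; the estimation from paired differences follows the description above, while the variable names and the numerical jitter are illustrative assumptions.

```python
import torch

def fit_structured_noise(text_vecs, image_vecs):
    """Estimate the distribution of image-minus-text differences over the
    paired auxiliary corpus and return a Gaussian to sample noise from."""
    diffs = image_vecs - text_vecs                 # [N, D] paired differences
    mean = diffs.mean(dim=0)
    cov = torch.cov(diffs.T)                       # [D, D] covariance
    cov = cov + 1e-6 * torch.eye(cov.shape[0])     # small jitter for stability
    return torch.distributions.MultivariateNormal(mean, covariance_matrix=cov)

def add_structured_noise(text_vec, noise_dist):
    """Shift each training-time text vector by a sample of the estimated
    text-to-image difference distribution, then re-normalize."""
    noisy = text_vec + noise_dist.sample((text_vec.shape[0],))
    return noisy / noisy.norm(dim=-1, keepdim=True)
```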

Experiments

We perform several experiments to better understand the effectiveness of our approach. First, we train models for image captioning, visual question answering (VQA), and visual entailment using only text data -- typically using a caption describing a scene as a stand-in for an image -- and then test the models on real images. We find these models only slightly underperform versions trained directly on images, demonstrating a high degree of transfer between the two modalities. Next, we show that captioning models can also be trained on data generated by a language model, requiring almost no human-annotated data, and still acquire strong captioning ability. We demonstrate one application by training several stylistic captioning models using only text training data: we collect text with various styles from a diverse set of sources, including internet reviews, books, and generations from GPT-3, and show that models trained on this text with CLOSE can produce accurate and stylistically correct captions for images.

Stylistic Captioning Teaser

Figure 3. Using CLOSE to learn stylistic captioning without image data.

Image Captioning

We consider two image captioning settings. The first is the multiple-caption setting, where multiple captions describing the same scene are available; in this case, we find it beneficial to use one of those captions as the input and a different caption as the target during training. Because multiple captions are not always available (e.g., in our text-only stylistic captioning datasets), we also consider a single-caption setting where captions are not grouped, in which case the same caption is used as both the input and the target output. We validate our model by comparing it to ClipCap, another CLIP-based captioning model, and find it performs similarly when trained on images. Trained on text alone, CLOSE achieves 98.4 and 95.4 CIDEr in the multiple- and single-caption settings respectively, showing high captioning competency despite not using images, and substantially outperforming zero-shot methods. Removing the Gaussian noise results in significantly worse performance, particularly in the single-caption setting, likely because simply repeating the input string becomes too easy a task.
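The pairing logic for the two settings can be sketched as follows; `captions_per_image` is a hypothetical grouping of captions by image, not an actual data structure from the paper's code.

```python
import random

def make_caption_pairs(captions_per_image, multiple_caption=True):
    """Build (input, target) caption pairs for the two settings."""
    pairs = []
    for captions in captions_per_image:
        for target in captions:
            others = [c for c in captions if c != target]
            if multiple_caption and others:
                # Multiple-caption setting: condition on a *different* caption
                # of the same scene than the one being generated.
                source = random.choice(others)
            else:
                # Single-caption setting: input and target are the same caption
                # (the Gaussian noise keeps this from collapsing into copying).
                source = target
            pairs.append((source, target))
    return pairs
```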

Model                       BLEU-4   METEOR   CIDEr   SPICE
ClipCap                     33.5     27.5     113.1   21.1
CLOSE w/Images              34.4     27.8     113.2   20.4
ZeroCap                     2.6      11.5     14.6    5.5
Socratic Models ZS          6.9      15.0     44.5    10.1
MAGIC                       12.9     17.4     49.3    11.3
CLOSE w/o Noise (Single)    4.2      12.2     16.4    6.5
CLOSE w/o Noise (Mult.)     21.9     20.6     68.7    13.5
CLOSE (Single)              28.6     25.2     95.4    18.1
CLOSE (Mult.)               29.5     25.6     98.4    18.3

Table 1. Results on the captioning test set. (Single) indicates the single-caption setting and (Mult.) the multiple-caption setting.

Stylistic Captioning

We demonstrate an application of our method by applying it to the task of stylistic captioning -- constructing captions with a specific writing style. Our general approach is to gather text-only training data that exemplifies the target style, train on it as if it were caption data, and then apply the model to images.

Stylistic Captioning Examples

Figure 4. Examples of stylistic captions produced by captioning models trained on text with CLOSE, and then applied zero-shot to images.

Training with Data from a Language Model

Next, we use CLOSE to train a captioning model on synthetic data generated by a language model. We first construct a prompt that includes a natural language instruction and some example captions, following an in-context learning approach.

Prompt Example

Figure 5. Prompt used to generate a synthetic caption from a language model. The language model's continuation (highlighted text) is used as a synthetic caption.

To generate a diverse set of captions, we prefix each example caption in the prompt with two keywords that occur in that caption, and end the prompt with two new keywords to be used in the caption to be generated. Diverse captions can then be produced by varying the final keyword pair. To reduce the chance of caption style affecting the quantitative evaluation, we take steps to better match the style of the COCO captions.
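A sketch of the keyword-conditioned prompt construction is shown below; the instruction wording, the example caption, and the keywords are illustrative, not the exact prompt used in the paper.

```python
def build_prompt(examples, new_keywords):
    """`examples` is a list of (keywords, caption) pairs; `new_keywords` is the
    keyword pair the model should use in the caption it generates."""
    lines = ["Write a one-sentence description of a scene that mentions both keywords."]
    for keywords, caption in examples:
        lines.append(f"Keywords: {', '.join(keywords)}")
        lines.append(f"Caption: {caption}")
    lines.append(f"Keywords: {', '.join(new_keywords)}")
    lines.append("Caption:")
    return "\n".join(lines)

# Varying `new_keywords` across calls yields a diverse set of synthetic captions.
prompt = build_prompt(
    examples=[(("dog", "frisbee"), "A dog leaps to catch a frisbee in a grassy park.")],
    new_keywords=("kitchen", "kettle"),
)
```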

Visual Entailment

Visual entailment requires determining whether a premise image entails, contradicts, or is neutral with respect to a hypothesis sentence. During training, a text premise is used in place of the premise image. The hypothesis sentence is always text and is encoded with T5. We train on SNLI (a language-only dataset) and evaluate on SNLI-VE (a vision-and-language dataset). Despite not using images, CLOSE achieves performance similar to the image-trained model.
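One way such a model could be wired up is sketched below; this is a simplified, hypothetical head (a mean-pooled T5 encoding of the hypothesis concatenated with the CLIP premise vector, followed by a small classifier), not necessarily the exact architecture used by CLOSE.

```python
import torch
import torch.nn as nn
from transformers import T5EncoderModel, T5Tokenizer

class EntailmentHead(nn.Module):
    """The premise arrives as a CLIP embedding (noisy text at train time,
    image at test time); the hypothesis is encoded with T5; a small MLP
    predicts entailment / neutral / contradiction."""

    def __init__(self, clip_dim=512, t5_name="t5-base"):
        super().__init__()
        self.tokenizer = T5Tokenizer.from_pretrained(t5_name)
        self.t5 = T5EncoderModel.from_pretrained(t5_name)
        hidden = self.t5.config.d_model
        self.classifier = nn.Sequential(
            nn.Linear(clip_dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, 3)
        )

    def forward(self, premise_vec, hypotheses):
        # premise_vec: [B, clip_dim] CLIP embedding of the premise
        enc = self.tokenizer(hypotheses, return_tensors="pt", padding=True, truncation=True)
        hyp = self.t5(**enc).last_hidden_state.mean(dim=1)   # mean-pooled hypothesis
        return self.classifier(torch.cat([premise_vec, hyp], dim=-1))
```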

Model Val Test
CLOSE w/Images 77.0 77.7
CLIP Classifier 67.2 66.6
CLOSE w/o Noise 68.7 68.2
CLOSE 75.9 75.9

Table 2. Results on the SNLI-VE validation and test sets.

VQA

To train a VQA model we use data that contains a short sentence describing a scene (encoded with the text encoder), a question (encoded with T5), and a target answer. We consider two sources of training data. First, we pair COCO captions with questions about the same image from VQA 2.0; however, these questions can ask about details of the image that are not included in the caption, and thus cannot be answered by a text-only model. Hence we also train and evaluate on VQA-E, which contains a subset of the VQA 2.0 questions paired with COCO captions that have been verified to contain the answer. These training sets have significantly different question distributions due to the filtering done in VQA-E (e.g., VQA-E does not include questions where the answer is "zero"), so we evaluate models either on the VQA 2.0 test-dev set or the VQA-E validation set, depending on which training set was used. For VQA-E we observe only a 3.5 point drop relative to image training while surpassing the baselines. The gap is larger on VQA 2.0, which we attribute to the sometimes poor alignment between the captions and questions, although our method is still within 5 points of the model trained on images.
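A sketch of how such text-only VQA examples might be assembled is shown below; `captions_by_image` and `vqa_questions` are hypothetical input structures used only for illustration.

```python
def build_text_vqa_examples(captions_by_image, vqa_questions):
    """Each example pairs a caption (standing in for the image) with a question
    about the same image and its target answer."""
    examples = []
    for q in vqa_questions:                        # dicts with image_id, question, answer
        for caption in captions_by_image.get(q["image_id"], []):
            examples.append({
                "context": caption,                # encoded with the CLIP text encoder
                "question": q["question"],         # encoded with T5
                "answer": q["answer"],             # target output
            })
    return examples
```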

Model Yes/No Num. Other All
CLOSE w/Images 80.4 48.4 64.1 67.9
CLOSE w/o Noise 76.8 36.8 53.9 59.8
CLOSE 78.2 46.0 59.5 64.3

Table 3. Results on the VQA-E validation set.

Model Yes/No Num. Other All
CLOSE w/Images 83.2 44.8 54.9 65.4
AP-C (ViT-B/16) 71.4 20.9 18.6 38.7
CLOSE w/o Noise 78.6 40.6 49.0 60.2
CLOSE 79.4 43.4 51.1 61.9

Table 4. Results on the VQA 2.0 test-dev set.

Paper

I Can't Believe There's No Images! Learning Visual Tasks Using Only Language Data

Sophia Gu, Christopher Clark, and Aniruddha Kembhavi. arXiv, 2022.