
Summary

Recent multi-modal transformers have achieved state-of-the-art performance on a variety of multimodal discriminative tasks, like visual question answering, and generative tasks, like image captioning. This raises an interesting question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative of this model family, LXMERT, finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements. X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT's.

LXMERT to X-LXMERT

To make LXMERT paint, we first change its input image representation to N x N grid-based features and then pass that set of features to an image generator. We then introduce three key training refinements for LXMERT:

Discrete visual representations
Uniform masking
Dropping short and local text for visual prediction pre-training objective
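As an illustration of the second refinement, here is a minimal sketch. It assumes that "uniform masking" means drawing the masking ratio itself from a uniform distribution on [0, 1] for each example, instead of using one fixed ratio. The function name `uniform_mask` and the choice to always mask at least one position are illustrative assumptions, not details taken from this page.

```python
import random

def uniform_mask(num_positions, rng):
    """Return a list of booleans marking which grid positions are masked.

    Assumption: the masking ratio is drawn from U(0, 1) per example.
    """
    ratio = rng.random()  # masking ratio in [0, 1)
    # Illustrative choice: always mask at least one position.
    num_masked = max(1, round(ratio * num_positions))
    masked = set(rng.sample(range(num_positions), num_masked))
    return [i in masked for i in range(num_positions)]

rng = random.Random(0)
mask = uniform_mask(8 * 8, rng)  # one mask over an 8 x 8 grid, flattened
```

Under this assumption the model sees examples ranging from almost fully visible grids to almost fully masked ones during training, which is closer to the situation at generation time, where every grid position starts out unknown.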

**Top**: Overview of the proposed X-LXMERT model. Blocks in blue are the modifications we make to the LXMERT model to enable it to paint. **Bottom**: Overview of the image generation architecture. The input to the model is a natural image that is compressed to a quantized latent map of size 8 x 8 by RoI Pooling. We use a generator consisting of multiple residual blocks with SPADE layers that encode the 8 x 8 grid features.
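To make the idea of a quantized latent map concrete, the sketch below assigns each grid cell's feature vector to the index of its nearest entry in a codebook. This is only one plausible reading: the nearest-neighbor rule, the squared L2 distance, the tiny 2 x 2 grid, and the names `nearest_code` and `quantize_grid` are all assumptions made for illustration. How the codebook is actually built is not described on this page.

```python
def nearest_code(feature, codebook):
    """Return the index of the codebook vector closest to `feature` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))

def quantize_grid(grid, codebook):
    """Map a grid of feature vectors to a grid of discrete code indices."""
    return [[nearest_code(cell, codebook) for cell in row] for row in grid]

# Toy data: a 3-entry codebook and a 2 x 2 grid of 2-d features (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
grid = [
    [[0.1, -0.1], [0.9, 1.2]],
    [[0.2, 0.8], [0.6, 0.6]],
]
codes = quantize_grid(grid, codebook)
print(codes)  # [[0, 1], [2, 1]]
```

With discrete codes, predicting a masked grid cell becomes a classification problem over codebook indices rather than a regression over continuous features.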

Iterative image sampling

We employ Gibbs sampling to iteratively sample features at different spatial locations. In contrast to text generation, where left-to-right is considered a natural order, there is no natural order for generating images. We explore different sampling strategies for X-LXMERT.
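The role of the visiting order can be sketched as follows. This is a simplified, single-sweep illustration, not the sampling procedure used by X-LXMERT: `toy_predictor` is a hypothetical stand-in that returns a uniform distribution, whereas a real model would condition on the text and on the grid positions filled in so far. The code count of 16 is also an arbitrary assumption.

```python
import random

MASK = -1  # placeholder for a grid position that has not been sampled yet

def toy_predictor(grid, position, num_codes):
    """Hypothetical stand-in for the model: a distribution over codes."""
    # Uniform here; a real model would use the text and the current grid.
    return [1.0 / num_codes] * num_codes

def sample_grid(num_positions, num_codes, order, predictor, rng):
    """Fill in one position at a time, each conditioned on the current grid."""
    grid = [MASK] * num_positions
    for pos in order:
        probs = predictor(grid, pos, num_codes)
        grid[pos] = rng.choices(range(num_codes), weights=probs, k=1)[0]
    return grid

rng = random.Random(0)
n = 8 * 8                               # flattened 8 x 8 grid
raster_order = list(range(n))           # left-to-right, top-to-bottom
random_order = rng.sample(range(n), n)  # a random permutation of positions
grid = sample_grid(n, 16, random_order, toy_predictor, rng)
```

Swapping `random_order` for `raster_order`, or running several sweeps that resample already-filled positions, gives other strategies in the same family.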

Quantitative Results

X-LXMERT’s image generation capabilities rival models specialized in image generation. In fact, human annotators prefer images from X-LXMERT over those from DM-GAN, a state-of-the-art competitor. At the same time, X-LXMERT retains very competitive performance on visual question answering and reasoning benchmarks, with minimal drop.

Quantitative table: Comparing X-LXMERT, LXMERT and baselines on image generation, visual question answering and visual reasoning tasks. The pairwise metric compares LXMERT and DM-GAN; numbers do not sum to 100 due to the TIE option provided to annotators. Note that X-LXMERT and LXMERT*+Grid are the only models that are able to produce results for all tasks. *: Our re-implementation of LXMERT.

Qualitative Results

The figures below show qualitative analysis of text-to-image generation with X-LXMERT. The top and middle rows demonstrate how the spatial visual features are gradually updated as the sampling steps proceed. The bottom row compares X-LXMERT equipped with different sampling strategies against baselines.

intermediate random-140 — Images generated by X-LXMERT at intermediate stages of random position sampling.

intermediate gif — Animated generation process for the images above. Images are gradually improved as sampling steps proceed.

Qualitative results — Comparing images generated from baselines and X-LXMERT with different sampling strategies.

Live Demo


Explore text-to-image generation with X-LXMERT in the AI2 Computer Vision Explorer.

Paper

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, and Aniruddha Kembhavi • EMNLP • 2020

View PDF
View and cite on Semantic Scholar

CODE

Coming soon.