Recent multi-modal transformers have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and generative tasks like image captioning. This begs an interesting question: can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family, LXMERT, finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements. X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT.


In order to make LXMERT paint, we first modify its input image representation to an N x N grid of features and then pass this set to an image generator. We then introduce three key training refinements for LXMERT:

  1. Discrete visual representations
  2. Uniform masking
  3. Dropping short and local text for the visual-prediction pre-training objective
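The first two refinements above can be sketched in a few lines. Discrete visual representations replace each continuous grid feature with the index of its nearest cluster centroid, and uniform masking samples the masking ratio itself uniformly rather than fixing it at a small constant. This is a minimal NumPy sketch with hypothetical function names and toy dimensions, not the paper's implementation:

```python
import numpy as np

def quantize_features(grid_feats, codebook):
    """Discrete visual representations: map each continuous grid
    feature to the index of its nearest codebook (cluster) entry."""
    # grid_feats: (N*N, d), codebook: (K, d)
    dists = np.linalg.norm(grid_feats[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=-1)  # (N*N,) cluster indices

def uniform_mask(num_positions, rng):
    """Uniform masking: sample a masking ratio uniformly from [0, 1],
    then mask that fraction of grid positions (instead of a fixed
    small ratio), so the model sees heavily-masked grids too."""
    ratio = rng.uniform(0.0, 1.0)
    num_masked = max(1, int(round(ratio * num_positions)))
    return rng.choice(num_positions, size=num_masked, replace=False)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(10, 4))         # toy codebook: K=10 clusters, d=4
feats = rng.normal(size=(64, 4))            # 8 x 8 grid of toy features
codes = quantize_features(feats, codebook)  # one cluster id per grid cell
masked = uniform_mask(len(codes), rng)      # positions to mask this step
```

Seeing high masking ratios during pre-training matters because at generation time every grid position starts out masked.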

Top: overview of the proposed X-LXMERT model. Blocks in blue are the modifications we make to the LXMERT model to enable it to paint. Bottom: overview of the image generation architecture. The input to the model is a natural image that is compressed into a quantized latent map of size 8 x 8 by RoI pooling. We use a generator consisting of multiple residual blocks with SPADE layers, which encode the 8 x 8 grid features.
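A SPADE layer conditions the generator on the grid by normalizing activations and then modulating them with per-pixel scale and shift maps predicted from the spatial conditioning input. The sketch below is a simplified NumPy stand-in (real SPADE predicts the modulation maps with small convolutions; here a linear map over upsampled grid features plays that role, and all names and sizes are illustrative):

```python
import numpy as np

def spade_modulate(x, grid_map, w_gamma, w_beta, eps=1e-5):
    """Simplified SPADE-style layer: normalize activations, then apply
    a per-pixel scale (gamma) and shift (beta) predicted from the
    8 x 8 grid features, upsampled to the activation resolution."""
    C, H, W = x.shape
    # normalize each channel of the activations
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    # nearest-neighbor upsample the 8 x 8 grid map to H x W
    up = grid_map.repeat(H // 8, axis=0).repeat(W // 8, axis=1)  # (H, W, d)
    gamma = up @ w_gamma  # (H, W, C): per-pixel scale from grid features
    beta = up @ w_beta    # (H, W, C): per-pixel shift from grid features
    return x_norm * (1 + gamma.transpose(2, 0, 1)) + beta.transpose(2, 0, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32, 32))  # toy activations inside a residual block
grid = rng.normal(size=(8, 8, 4))  # 8 x 8 grid features, toy dim d=4
out = spade_modulate(x, grid, rng.normal(size=(4, 16)), rng.normal(size=(4, 16)))
```

Because the modulation is spatial, each region of the generated image is steered by the grid cell that covers it, which is what lets the transformer's 8 x 8 predictions control image content.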

Quantitative Results

X-LXMERT’s image generation capabilities rival models specialized in image generation. In fact, human annotators prefer the images from X-LXMERT over those from DM-GAN, a state-of-the-art competitor. At the same time, X-LXMERT retains very competitive performance on visual question answering and reasoning benchmarks, with only a minimal drop.

quantitative table
Comparing X-LXMERT, LXMERT, and baselines on image generation, visual question answering, and visual reasoning tasks. The pairwise metric compares X-LXMERT and DM-GAN; numbers do not sum to 100 due to the TIE option provided to annotators. Note that X-LXMERT and LXMERT*+Grid are the only models that are able to produce results for all tasks. *: our re-implementation of LXMERT.

Qualitative Results

The figures below show qualitative analysis of text-to-image generation with X-LXMERT. The top and middle rows demonstrate how the spatial visual features are gradually updated as the sampling steps proceed. The bottom row compares X-LXMERT equipped with different sampling strategies against baselines.

intermediate random-140
Images generated by X-LXMERT at intermediate stages of random position sampling.
intermediate gif
Animated generation process for the images above. Images are gradually improved as sampling steps proceed.
Qualitative results
Comparing images generated from baselines and X-LXMERT with different sampling strategies.
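The random position sampling shown above can be sketched as an iterative fill-in loop: start from a fully masked grid, and at each step pick a random batch of still-masked positions and predict their discrete codes conditioned on the text and the grid filled so far. This is a toy sketch where `predict_fn` is a hypothetical stand-in for the transformer's masked-position predictions:

```python
import numpy as np

def random_position_sampling(predict_fn, num_cells=64, num_steps=8, seed=0):
    """Sketch of grid generation: begin with every 8 x 8 cell masked,
    then over several steps fill a random subset of still-masked
    positions with codes from `predict_fn` (which stands in for the
    transformer conditioned on the text and the partially filled grid)."""
    rng = np.random.default_rng(seed)
    MASK = -1
    codes = np.full(num_cells, MASK)
    order = rng.permutation(num_cells)          # random fill order
    per_step = num_cells // num_steps
    for step in range(num_steps):
        positions = order[step * per_step:(step + 1) * per_step]
        codes[positions] = predict_fn(codes, positions)
    return codes

# dummy predictor: samples uniformly from a 10-entry toy codebook
dummy = lambda codes, pos: np.random.default_rng(1).integers(0, 10, size=len(pos))
final = random_position_sampling(dummy)
```

Each pass re-conditions on everything predicted so far, which is why the intermediate images above sharpen gradually rather than appearing all at once.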

Live Demo

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers Screenshot

Explore text-to-image generation with X-LXMERT in the AI2 Computer Vision Explorer.


X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, and Aniruddha Kembhavi. EMNLP 2020.


Coming soon.