Deep Generative Models with Discrete Latent Codes for Text-to-Image Generation

Perustieteiden korkeakoulu | Master's thesis

Date

2021-10-18

Major/Subject

Machine Learning, Data Science and Artificial Intelligence

Mcode

SCI3044

Degree programme

Master’s Programme in Computer, Communication and Information Sciences

Language

en

Pages

49 + 0

Abstract

Text-to-image generation remains a challenging and comprehensive task in the area of generative models. In this work, we considered a recently proposed approach called DALL-E. This model is based on two popular neural network architectures: the Discrete Variational AutoEncoder and the Transformer. The variety of possible Transformer configurations opens up an opportunity to explore how different architectural choices influence the model's performance. We concentrated on how DALL-E processes the input sequence of text and image tokens. More specifically, we checked whether there is any systematic advantage to using a separate text encoder instead of processing both data modalities (text and image) with the same autoregressive component (Transformer encoder). Additionally, we analyzed different types of Discrete Variational AutoEncoders. To compare different Transformer components of the DALL-E approach, we created a dedicated dataset that we call Multi-Descriptive MNIST. It consists of descriptions and corresponding images of sequences of digits with varying characteristics, such as color, size, and location on the canvas. We also conducted experiments on the CUB dataset, which consists of bird images with corresponding textual descriptions. Finally, we developed a set of metrics to compare the quality of the generated images. Since the text-to-image task implies that the text gives us control over the generation process, we concentrated on measuring the consistency between a text and the images generated from it. The proposed metrics rely on the Contrastive Language–Image Pre-training (CLIP) model, which was recently introduced as a way to perform image classification based on the relevance between images and texts. Based on the conducted experiments, we found a statistically significant advantage of a separate text encoder over the original DALL-E method on the specifically prepared artificial dataset. In addition, the model with a separate text encoder was trained from scratch on the CUB dataset to generate images of birds consistent with the given text.
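
The CLIP-based consistency metrics mentioned in the abstract can be illustrated with a minimal sketch (not the thesis code): the snippet below scores a caption against a generated image by comparing their CLIP embeddings, assuming the publicly available openai/clip-vit-base-patch32 checkpoint and the Hugging Face transformers wrappers. The helper name clip_consistency and the example caption are illustrative assumptions, not names from the thesis.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP checkpoint usable with transformers would serve the same purpose.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_consistency(caption: str, image: Image.Image) -> float:
    # Embed the caption and the image with CLIP and return their cosine similarity;
    # a higher value means the generated image is more consistent with the text.
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return float((text_emb * image_emb).sum())

# Hypothetical usage with a Multi-Descriptive-MNIST-style caption and a generated sample:
# score = clip_consistency("a small red digit three in the top left corner",
#                          Image.open("generated_sample.png"))

One plausible way to use such a score is to average it over many caption/image pairs per model configuration, which gives a single text-image consistency figure of the kind the abstract refers to.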

Supervisor

Ilin, Alexander

Thesis advisor

Ilin, Alexander

Keywords

deep learning, transformer, VQ-VAE, dVAE, CLIP, variational autoencoder
