Deep Generative Models with Discrete Latent Codes for Text-to-Image Generation
School of Science | Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for your own personal use. Commercial use is prohibited.
Machine Learning, Data Science and Artificial Intelligence
Master’s Programme in Computer, Communication and Information Sciences
49 + 0
Abstract
Text-to-image generation remains a challenging and broad task in the area of generative models. In this work, we considered a recently proposed approach called DALL-E. This model is based on two popular neural network architectures: the Discrete Variational AutoEncoder and the Transformer. The variety of possible Transformer configurations opens up an opportunity to explore the influence of different architectural choices on the model's performance. We concentrated on how DALL-E processes the input sequence of text and image tokens. More specifically, we checked whether there is any systematic advantage in using a separate text encoder instead of processing both data modalities (text and image) with the same autoregressive component (Transformer encoder). Additionally, we analyzed different types of Discrete Variational AutoEncoders. To compare different Transformer components of the DALL-E approach, we created a dedicated dataset, which we called Multi-Descriptive MNIST. This dataset consists of textual descriptions and corresponding images containing sequences of digits with different characteristics, such as color, size, or location on the canvas. We also conducted experiments on the CUB dataset, which consists of bird images with corresponding textual descriptions. Finally, we developed a set of metrics to compare the quality of the generated images. Since the text-to-image task implies that the text gives control over the generation process, we concentrated on measuring the consistency between the text and the images generated from it. The proposed metrics rely on the Contrastive Language–Image Pre-training (CLIP) model, which was recently introduced as a way to perform image classification based on the relevance between images and texts.
Based on the conducted experiments, we found a statistically significant advantage of using a separate text encoder in the DALL-E approach over the original method on the specifically prepared artificial dataset. In addition, the model with a separate text encoder was trained from scratch on the CUB dataset to generate images of birds consistent with the given text.
Thesis advisor: Ilin, Alexander
deep learning, transformer, VQ-VAE, dVAE, CLIP, variational AutoEncoder