## Generative Art {#c03-04-usecase}
*Author: Nadja Sauter*
*Supervisor: Jann Goschenhofer*
```{r Logo, echo=FALSE, out.width="90%", fig.cap="(ref:Logo)", fig.align="center"}
knitr::include_graphics("./figures/03-chapter3/Logo.png")
```
(ref:Logo) LMU seal in the style of Van Gogh's Sunflowers painting
As we have seen in subsection \@ref(c02-02-text2img), computers can create images based on nothing but a text prompt via multimodal deep learning. This capability is also used in digital arts, in the field of ‘generative art’, also known as ‘computer art’. This movement comprises all artwork in which the human artist cedes control to an autonomous system [@galanter2016generative]. In this way everyone, even artistically untrained people, can easily create pictures, as the computer takes over the image generation. In some sense the computer becomes the artist, showing a kind of creativity that used to be considered a distinctly human ability. In this chapter, we give an overview of how computers have improved at generating images over time and how this is used in the contemporary art scene. For instance, for Figure \@ref(fig:Logo) we took the seal of the Ludwig Maximilians University and transferred the style of Van Gogh's [Sunflowers painting](https://wallpaperaccess.com/full/787825.jpg) to it using the [Neural Style Transfer algorithm](https://www.tensorflow.org/tutorials/generative/style_transfer) and the method [CLIP + VQGAN](https://colab.research.google.com/drive/1ZAus_gn2RhTZWzOWUpPERNC0Q8OhZRTZ#scrollTo=FhhdWrSxQhwg), which fuses the seal with sunflowers in Van Gogh's style.
### Historical Overview
The first attempt to use AI to generate pictures was made by the engineer Alexander @mordvintsev_2015 with his "DeepDream" software. He used Convolutional Neural Networks to generate abstract, dream-like images by amplifying the activations of a chosen layer, thereby visualizing the patterns the network has learned. Below you can see a picture of a Labrador after it was processed by the DeepDream algorithm.
```{r DeepDream, echo=FALSE, out.width="50%", fig.cap="(ref:DeepDream)", fig.align="center"}
knitr::include_graphics("./figures/03-chapter3/DeepDream.png")
```
(ref:DeepDream) Picture of a Labrador processed by DeepDream ([Google Colab](https://www.tensorflow.org/tutorials/generative/deepdream))
In the following year, @StyleTransfer investigated how to transfer the artistic style of one picture onto another. Their method was used to transfer the style of Van Gogh's Sunflowers painting to the LMU seal at the beginning of this chapter (see Figure \@ref(fig:Logo)). In addition, Figure \@ref(fig:StyleTransfer2) below shows the same Labrador picture from Figure \@ref(fig:DeepDream) in [Kandinsky style](https://storage.googleapis.com/download.tensorflow.org/example_images/Vassily_Kandinsky%2C_1913_-_Composition_7.jpg).
```{r StyleTransfer2, echo=FALSE, out.width="50%", fig.cap="(ref:StyleTransfer2)", fig.align="center"}
knitr::include_graphics("./figures/03-chapter3/Kandinsky.png")
```
(ref:StyleTransfer2) Picture of a Labrador in Kandinsky style [(Google Colab)](https://www.tensorflow.org/tutorials/generative/style_transfer)
Furthermore, Generative Adversarial Networks (GANs), first introduced by @NIPS2014_5ca3e9b1, were used by @karras2019style to create very realistic fake images with their StyleGAN architecture. For instance, one can create pictures of people who do not exist but look completely realistic (see Figure \@ref(fig:GAN)).
```{r GAN, echo=FALSE, out.width="50%", fig.cap="(ref:GAN)", fig.align="center"}
knitr::include_graphics("./figures/03-chapter3/StyleGAN.jpeg")
```
(ref:GAN) Fake face generated by [StyleGAN](https://thispersondoesnotexist.com/)
Nevertheless, it was almost impossible to control the exact output of these early forms of AI art. There was no way to specify in detail what the result should look like. For instance, the StyleGAN application mentioned above always produces a human face, but you cannot request a blond girl with green eyes. Such control can be achieved with the artist-critic paradigm [@8477754]: the computer, acting as the artist, generates a picture based on what the neural network learned during training (e.g. StyleGAN learns to generate pictures of human faces), while a critic tells the artist whether the output satisfies the concrete idea of the human user. For this reason, multimodal deep learning models emerged in the field of generative art: they allow the output to be controlled via a text prompt by checking whether the generated picture matches the initial text description. In the StyleGAN example, the multimodal critic would supervise whether the output picture indeed shows a blond girl with green eyes. In this way, a new class of models for generating pictures evolved.
This idea was taken up by OpenAI with their models DALL-E [@DALLE] and CLIP [@CLIP], both released in January 2021. DALL-E generates images directly from text, while CLIP scores how well an image matches a caption and can therefore serve as a critic. Only a few days after the release, Ryan Murdock combined CLIP (critic) with the already existing neural network BigGAN (artist) in his "The Big Sleep" software. Furthermore, @StyleGAN developed StyleCLIP, a combination of StyleGAN (artist) and CLIP (critic), to edit parts of images via text instructions. In the following months, Katherine Crowson combined CLIP as critic with the existing VQGAN algorithm as artist. She also hooked up CLIP with guided diffusion models as artists to obtain more fine-grained results. This approach was further investigated by OpenAI, which published a paper on guided diffusion models [@DiffusionModels] in May 2021. Moreover, in December 2021 they introduced GLIDE [@GLIDE], a model with CLIP or classifier-free guidance as critic and a diffusion model as artist. For more technical details about text2img methods like DALL-E and GLIDE, refer to subsection \@ref(c02-02-text2img); for text-supporting computer vision models like CLIP, see subsection \@ref(c02-04-text-support-img).
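To make the artist-critic interplay more concrete, the following is a minimal sketch of such a loop in Python, assuming PyTorch and OpenAI's `clip` package are installed. It is a simplified illustration, not the actual code of any of the tools above: the "artist" is just a directly optimized pixel tensor instead of a full generator like VQGAN or BigGAN, and CLIP's usual input normalization is omitted.

```{python artist-critic-sketch, eval=FALSE}
# Minimal sketch of the artist-critic loop, not the exact code of any of
# the notebooks mentioned above. For simplicity, the "artist" is a directly
# optimized pixel tensor; in practice one optimizes the latent code of a
# generator such as VQGAN or BigGAN instead.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in float32 for this toy example

prompt = "a blond girl with green eyes"
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize([prompt]).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Artist: a learnable image initialized from noise (ViT-B/32 expects 224x224 input).
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(300):
    optimizer.zero_grad()
    # Critic: CLIP scores how well the current image matches the prompt.
    image_features = model.encode_image(image.clamp(0, 1))
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    loss = -(image_features * text_features).sum()  # maximize cosine similarity
    loss.backward()
    optimizer.step()
```

Pure pixel-space optimization like this tends to produce noisy, adversarial-looking images, which is exactly why the tools above pair CLIP with a strong generator as the artist and add augmentations and regularization.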
### How to use these models?
A lot of different notebooks are publicly available for applying the various pre-trained models. In general, all notebooks work quite similarly: one only needs to enter a text prompt in the code, and after running the notebook the computer generates a picture based on this instruction. It is relatively easy and requires no prior coding knowledge. Moreover, there are also API and GUI applications (e.g. [MindsEye beta](https://multimodal.art/mindseye)) where no programming knowledge is needed at all. When using these models, it is important to think about how exactly one phrases the text prompt, since small changes in the short text instruction can steer the output in a desired direction. This is also known as "prompt engineering". For instance, at the beginning of this chapter we entered the prompt "in the style of Van Gogh" to change the style of the LMU seal. In this context, one of the most popular tricks is to append "unreal engine" [@unrealEngine], which makes the resulting pictures more realistic and of higher quality. This seems surprising at first, but the models were trained on data from the internet, which includes many high-quality renderings associated with "Unreal Engine", the popular 3D video game engine by the software company Epic Games.
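As a small illustration of prompt engineering, the sketch below builds several variants of the same prompt with common style modifiers; the list of modifiers is only an example, not an official or exhaustive collection.

```{python prompt-variants, eval=FALSE}
# Illustrative prompt engineering: one scene, several style modifiers.
base = "a fall landscape with a small cottage next to a lake"
modifiers = ["in the style of Van Gogh", "unreal engine", "trending on artstation"]
prompts = [base] + [f"{base}, {m}" for m in modifiers]
for p in prompts:
    print(p)  # each variant can be fed to one of the notebooks above
```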
Unfortunately, OpenAI has never released DALL-E. There is only an open-source version called ruDALL-E [@ruDALLE] that was trained on Russian language data. Besides, Hugging Face hosts DALL-E mini [@DALLEmini], where one can generate pictures but does not have access to the model itself. A PyTorch replication of the DALL-E code exists as well [@DALLEpytorch], but without a trained model. Furthermore, CLIP was released without the training data used; however, there exists an open-source dataset of image-text pairs with CLIP embeddings called LAION-400M [@LAION]. In the following, we used different publicly available notebooks to try out the models [CLIP + BigGAN](https://colab.research.google.com/drive/1NCceX2mbiKOSlAd_o7IU7nA9UskKN5WR?usp=sharing),
[CLIP + VQGAN](https://colab.research.google.com/drive/1ZAus_gn2RhTZWzOWUpPERNC0Q8OhZRTZ#scrollTo=FhhdWrSxQhwg),
[CLIP + Guided Diffusion](https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj#scrollTo=X5gODNAMEUCR),
[GLIDE](https://colab.research.google.com/github/openai/glide-text2im/blob/main/notebooks/text2im.ipynb)
with the text prompts *"a fall landscape with a small cottage next to a lake"* (see Figure \@ref(fig:comparison1)) and *"panda mad scientist mixing sparkling chemicals, artstation"* (see Figure \@ref(fig:comparison2)). The first prompt yields pretty realistic results, whereas the second prompt produces more varied and "crazy" outputs. This is because the panda prompt is more abstract than the first one and hence more difficult to illustrate. In addition, some of the notebooks run at lower resolution due to computational limitations. GLIDE is also downsized by the publisher: the released smaller model has about 300 million parameters, whereas the unreleased model has about 3.5 billion parameters [@GLIDE]. Better results are therefore possible with more computational power and other implementations of the models.
```{r comparison1, echo=FALSE, out.width="100%", fig.cap="(ref:comparison1)", fig.align="center"}
knitr::include_graphics("./figures/03-chapter3/fall_landscape.png")
```
(ref:comparison1) Comparison of different models with prompt "fall landscape with a small cottage next to a lake"
```{r comparison2, echo=FALSE, out.width="100%", fig.cap="(ref:comparison2)", fig.align="center"}
knitr::include_graphics("./figures/03-chapter3/panda.png")
```
(ref:comparison2) Comparison of different models with prompt "panda mad scientist mixing sparkling chemicals, artstation"
### Different tasks and modalities
So far, we have concentrated on the two modalities text and image. Combining both of them, one can tackle different tasks with the models mentioned above. The main usage is to generate images based on a text prompt. Here, one can start from random noise, but it is also possible to choose a real image as the starting point [@qiao2022initial]. This is what we did at the beginning with the LMU seal and CLIP + VQGAN (see Figure \@ref(fig:Logo)): instead of starting from noise, the model started from the LMU seal as initialization and was then guided by the prompt "in style of Van Gogh". The video captures how the picture develops during fitting. In the end, the typical Van Gogh sunflowers emerge, as well as what could be a part of Van Gogh's face.
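In code, starting from a real image instead of noise only changes the initialization of the image in the artist-critic sketch above; a hedged example (the file name is a placeholder):

```{python init-from-image, eval=FALSE}
# Initialize the optimized image from an existing picture (e.g. the LMU seal)
# instead of random noise; the rest of the artist-critic loop stays the same.
import torch
from PIL import Image
import torchvision.transforms.functional as TF

device = "cuda" if torch.cuda.is_available() else "cpu"
init = Image.open("lmu_seal.png").convert("RGB")  # placeholder file name
image = TF.to_tensor(TF.resize(init, [224, 224])).unsqueeze(0).to(device)
image.requires_grad_(True)
# ... then optimize `image` as before, e.g. with the prompt "in style of Van Gogh".
```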
Furthermore, one can edit, extend, crop and search images with models like GLIDE [@GLIDE]. For instance, @GLIDE fine-tuned the model for text-conditional image inpainting (see Figure \@ref(fig:inpainting)). By marking some area in the picture, here in green, and adding a text prompt, one can edit pictures easily and precisely. This is quite impressive, as the model needs to understand from the text prompt which object should be filled in and then render it in the style of the surroundings to produce a realistic outcome. Another idea is to use a rough sketch of a drawing and let the model fill in the details based on a text caption (see Figure \@ref(fig:sketch) below). This allows controlled changes of parts of pictures with relatively little effort. In this way, GLIDE can be used not only to generate pictures from random noise, but also to edit pictures in a specific way. Furthermore, it is also possible to combine other modalities (see more details in subsection \@ref(c03-01-further-modalities)). For instance, @WZRD accompanies custom videos with suitable audio. It is even imaginable to create sculptures with 3D printers [@3D].
```{r inpainting, echo=FALSE, out.width="90%", fig.cap="(ref:inpainting)", fig.align="center"}
knitr::include_graphics("./figures/03-chapter3/Impainting_GLIDE.PNG")
```
(ref:inpainting) Text-conditional image inpainting examples with GLIDE [@GLIDE]
```{r sketch, echo=FALSE, out.width="90%", fig.cap="(ref:sketch)", fig.align="center"}
knitr::include_graphics("./figures/03-chapter3/GLIDE_sketch.PNG")
```
(ref:sketch) Text-conditional editing of a user sketch with GLIDE [@GLIDE]
### Discussion and prospects
In the last years, methods to generate images via text prompting have improved tremendously and a new field of art has arisen. It is surprising how these models are able to create images based only on a short text instruction, and one may argue that the AI thereby achieves some level of creativity. It is up for discussion to which extent the computer is becoming the artist in generative art and in this way replacing the human artist. However, there is still no direct loss function that can calculate how aesthetically pleasing a picture is [@bias]. Aesthetics is probably also quite subjective and cannot be judged in the same way by everyone. Most of the time the computer works as an aid to the creative process by generating multiple images, from which the human artist can pick the best outcome or vary the text prompt to improve the output in a desired way. However, the better the AI becomes, the less the human artist needs to intervene in this process.
Furthermore, as the output becomes more and more realistic, there is the risk that these methods are abused to facilitate plagiarism, to create fake content and to spread misleading information [@misconduct]. After all, the outputs look completely realistic but are entirely made up and generated by the computer. For this reason, some organisations like OpenAI do not release all of their models (e.g. DALL-E) or the corresponding training data (e.g. for CLIP). On the other hand, from a scientific point of view, it is important to have access to such models to continue research.
Moreover, similarly to most deep learning algorithms, these models are affected by biases in the training data [@bias_ML]. For instance, @bias points out that CLIP text embeddings associate a human being more with a man than with a woman. As a consequence, such models might be more likely to generate a man than a woman for the text prompt "human being". This effect needs to be investigated further and should be mitigated.
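One rough, hedged way to probe such associations is to compare CLIP text embeddings directly, as in the sketch below; the chosen phrasings are arbitrary and the resulting numbers should be interpreted with care rather than taken as a rigorous bias measurement.

```{python clip-bias-probe, eval=FALSE}
# Rough probe of gender associations in CLIP's text embeddings.
import torch
import clip  # https://github.com/openai/CLIP

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    tokens = clip.tokenize(["a human being", "a man", "a woman"]).to(device)
    emb = model.encode_text(tokens)
    emb = emb / emb.norm(dim=-1, keepdim=True)
# Cosine similarity of "a human being" to "a man" and to "a woman".
print(emb[0] @ emb[1:].T)
```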
Finally, generative art can be used to create Non-Fungible Tokens (NFTs) relatively easily. NFTs are digital artworks to which a special digital signature is added, making them unique and in this way non-fungible [@NFT]. Such digital artworks are bought and sold online, often by means of cryptocurrency, which is why this field is also called crypto art. This provides a natural platform to sell generative art. However, this trading market is quite new and controversial, similar to cryptocurrency trading in general.
In conclusion, generative art is a new and impressive field that combines technology and art, two rather opposite domains. The methods are already very impressive and keep getting better. For instance, this year OpenAI already published DALL-E 2 [@DALLE2], which outperforms the first version of DALL-E. It remains highly interesting to follow the developments in this field.