DALL-E 2, AI-Generated Art, and the Future of Creativity

Check out a transcript of the conversation

Paul Davenport: Welcome to AI in Plainsight, the Computer Vision podcast where we chat with the vision AI pros to discuss the latest use cases, nerd out on potential new applications, and expand our own understanding of the potential for computer vision in production. I’m Paul Davenport, Plainsight’s Director of Communications, and joining me as co-host is Bennett Glace, Plainsight’s Content Marketing Manager. How’s it going, Bennett?

Bennett Glace: It’s going great, Paul. Happy to be here.

Paul Davenport: Awesome. All right, so today we’ve invited Logan Spears, Plainsight’s Co-Founder and CTO, to join us in a conversation on one of the hottest topics in tech this year: AI-generated imagery. With solutions like DALL-E 2 and Midjourney composing impressive scenes that have won awards and attracted no shortage of controversy, Logan joins us to discuss the data science behind popular solutions, a few remaining limitations and challenges, and some of the big questions posed by the rise of AI-generated artwork. Enjoy the show.

BG and PD: Logan, welcome to the show.

Logan Spears: Hey, happy to be here.

BG: So, Logan, off mic you mentioned that DALL-E 2 is one subject that tends to come up in a lot of casual conversation. I think you said it’s the one thing that tends to come up the most at dinner parties. What sort of questions do people tend to ask?

LS: Well, I think the initial reaction is ‘how does that work?’ Right? It’s not something that people expect AI to be able to do today, but they’re seeing it. I think people have accepted that AI will beat them at chess, perform some automation work and detect, let’s say, the number of people in an image or something like that. But, I don’t think people were ready, this early, for it to be more creative than they are. I think that caught people off guard and I think there’s been a significant wow factor.

PD: Yeah, so it’s very much that we understand AI as a practical tool, but not as something that might compete with our own imaginations. Is that something that you’re seeing with folks?

LS: I think so. I think that’s part of why it’s captured the public’s imagination and caught people off guard.

PD: All right, so we’ve obviously seen some examples of DALL-E 2’s work, but what exactly is going on there? How does this technology work? Could you unpack that a little bit for us in detail?

LS: Sure. It’s easiest to conceptually break it up into a couple of different parts. First, you’re taking a text prompt – a Corgi playing a trumpet, for example – and generating an image from that. And there are a couple of steps in between that help the system understand what a Corgi playing a trumpet means semantically, not just as words in a specific order. There’s been a lot of development in the natural language processing (NLP) space, and I like to think of the big advances as really a couple of tricks for what’s called semi-supervised learning at massive scale. In the NLP space, you take a sentence or a paragraph and you try to predict the next word or a blanked-out word within that paragraph. Doing that requires semantic context, and you can do it on the scale of the entire internet – think of all of Wikipedia, for example. If you do that enough, you gain language understanding. So you can create a large model, let that run, and eventually you’ll get high-fidelity language understanding on a semantic level. That’s part one: understanding what’s in the text. Step two is matching that to what’s in an image. So, really, reversing the problem – going from an image to text, like image captioning, describing what’s in an image. You can see that with models like OpenAI’s CLIP. They’re also scouring the internet, but using text and image pairs. Think of Google Photos or image search, where you have a text description or a caption. You see it in news articles, for example – it’ll describe ‘person on the left doing X, Y, and Z,’ that kind of thing. If you scour the internet for enough of that data, you can build a dataset and predict the text from an image. That bridges the language and vision divide: you can create what are called shared embeddings, which are essentially neurons in a neural network that have a shared understanding of what a concept actually is across modalities.
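
To make the shared-embedding idea concrete, here is a minimal sketch of CLIP-style text-image matching using the open-source CLIP weights via Hugging Face’s transformers library. The image path and captions are placeholder assumptions for illustration; this only shows how captions are scored against an image in a shared embedding space, not how DALL-E 2 is wired internally.

```python
# Minimal sketch: scoring text/image similarity with CLIP-style shared embeddings.
# Assumes `pip install torch transformers pillow`; "corgi.jpg" is a placeholder image.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("corgi.jpg")  # hypothetical local image
captions = ["a corgi playing a trumpet", "a cat sleeping on a couch"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image: how well the image matches each caption in the shared embedding space
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```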

Then, what DALL-E 2 did on top of that is a diffusion model technique, where they actually start from noise and work backwards to generate an image. To train this, you also reverse the problem: they took a lot of images, gradually added noise to them, and then trained the model to make predictions going the other way. So, if you stack all of that together, you have a text prompt, you start with noise, and you can generate an image that shares semantic similarity with that text prompt. You get a picture of a Corgi playing the trumpet, for example – and you get a couple of them, because you can start from different random noise each time. You generate, say, five or six examples and the user can select which ones they like. At a conceptual level, that’s how DALL-E 2 works.
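
As a rough illustration of the “add noise, then learn to reverse it” training idea Logan describes, here is a toy PyTorch sketch of the forward noising step and a single denoiser training step. The noise schedule, the `denoiser` model, and the data are placeholder assumptions; real systems like DALL-E 2 also condition the denoiser on text embeddings and add many engineering details beyond this.

```python
# Toy sketch of diffusion training: corrupt images with noise, then train a model to predict that noise.
# `denoiser` is a hypothetical network taking (noisy_image, timestep); real models also take text conditioning.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # simple linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Forward process: blend clean images x0 with Gaussian noise at timesteps t."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    b = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + b * noise, noise

def training_step(denoiser, x0, optimizer):
    """Learn the reverse direction: predict the noise that was injected."""
    t = torch.randint(0, T, (x0.shape[0],))
    noisy, noise = add_noise(x0, t)
    pred = denoiser(noisy, t)                      # model guesses the added noise
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```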

BG: In the US and several other countries around the world, patent law has held that AI can’t be credited with an invention. To me, that seems to suggest that, in the eyes of the law, at least, AI is incapable of creativity or truly independent thought. I’m wondering where you land on this issue and if you think what DALL-E 2 does is an example of creativity or how it compares to true creativity.

LS: An example in this space, from a legal precedent perspective, is the Authors Guild versus Google case over the Google Books initiative – I believe the final ruling came down around 2015, with the Supreme Court declining to hear a further appeal. The courts ruled that derivative use of copyrighted works was fair use and that it was legal for Google Books to extract copyrighted material and build things like AI models, et cetera, from that derivative use. So that’s kind of the most interesting precedent in the legal space, at least for AI use. And that’s how things like DALL-E 2, for example, are able to scrape the internet legally, right? We can see that as kind of a shared consciousness, and it makes intuitive sense that it all falls within fair use. As far as attribution of those models goes – to me, and I’ve been on a couple of patents, it seems like a bureaucratic issue. They have to list inventors, even if those inventors assign the patent to companies and the companies own it. To me, it seems like an intellectual property issue more than about setting precedent as to who’s really the creator. Obviously, government bureaucracy is not ready for AI to be treated like people today. We hear about corporations as people, and that’s kind of a legal simplification. I don’t think the government is ready to say that neural networks are people today.

PD: That’s super interesting. You hit on the point that it’s not much of a philosophical question when it gets to the legal realm. It’s very much, ‘there needs to be accountability and ownership here.’ We’re not great about agreeing on things like that, so it makes sense there’s some back and forth. Another question building off of that – where is AI-generated art already seen today, beyond the DALL-E 2s? Is it in other scenarios, out in production in the world?

LS: You’re already seeing people use these models for actual production use cases. One of the first areas that I noticed was YouTube thumbnails. Creators who got an invite to DALL-E 2 and the OpenAI beta started using it for their thumbnails because they could generate more creative work faster than if they were using stock imagery or hiring another artist to create something novel. Those options mean a significant time delay, not measured in seconds. There are a number of use cases where it might be more expedient to use AI-generated art. Think of video games and 2D assets – a game like GTA V has to generate tons and tons of assets, for example. Or Magic: The Gathering – the card art, the character on it, the card’s whole design. You have certain companies that have to generate a vast amount of unique art, and do it in situations where margins matter. And that’s where AI-generated art is particularly interesting.

PD: Gotcha. So, expediency and scale are really kind of the big benefits when working with AI-generated art. That said, should artists themselves be worried, in your opinion?

LS: Well, I think what we’ve seen over the past couple hundred years is that everybody should be worried eventually, no matter what profession they’re in. Professions and jobs are constantly being supplanted by technology and people are having to be retrained and adapt to the new world order. That being said, I think some of the first areas that might be in trouble are things like stock imagery, for example, where they have to acquire content from photographers. That’s one area where it could be cannibalized. There are some other creative tasks that look like good candidates as well. Maybe logo generation, something that’s simplified or kind of a well-defined problem like generating iOS app icons or something. If that’s your job, that’s something I could see being automated pretty quickly and people might prefer that with full control and no additional cost of a human in the loop. So there definitely are areas that I think are vulnerable to this technology.

BG: Are there specific types of imagery that have been especially challenging for DALL-E 2 and similar solutions so far? And if so, why? Why these types of images?

LS: If you think about what’s challenging for AI, and DALL-E specifically, today, it’s things that aren’t represented well on the internet. There’s a systemic bias in what’s training these models: it’s whatever people take pictures of and put on the internet, and that includes creative works as well – historical paintings that have been scanned and uploaded, for example. Obviously, DALL-E 2 has the ability to fill in the gaps, because it has that conceptual understanding and mapping between text and image. It can generate novel images like a Corgi playing a trumpet, even if no such picture exists – but it has to know what a Corgi is and what a trumpet is to be able to do that. At Plainsight, we see a lot of examples of data that is not on the internet. Let’s say QA for manufacturing, where the camera is looking at a manufacturing line all day. A manufacturer can’t use images from DALL-E 2 to simulate their assembly line today. It’s unique to them, it’s not represented on the internet, and DALL-E has no priors to generate something close. It could maybe generate, you know, microchips on a factory floor or something like that, but the results won’t be specific to the customer and what they care about. So I think that’s the current limit of this technology. It can also get confused by things that seem obvious to humans – eyes tend to be a problem, for example. We’re hardwired, even as infants, to detect the pattern of a face and what eyes should look like. Humans notice pretty quickly when DALL-E makes a mistake like that, so there are maybe certain advantages ingrained in our biology.

PD: That’s always been kind of my uncanny valley take. Whenever I’ve seen an image of ‘so-and-so celebrity doing something in space,’ I’m like, ‘okay, the space stuff actually looks right, but the person’s face? That, I’m lost with.’ So that makes a ton of sense.

LS: And that’s kind of the spooky factor to it. It doesn’t understand us completely – we’re just another object to it – so there’s still a key difference between man and machine. One other contributing factor to some of the limitations is that they’ve blacklisted proper nouns. Images of specific people could be too politically charged, and there are ethical issues, so OpenAI chose to restrict their model in that way. It doesn’t know who Barack Obama is, for example. Some solutions choose not to make these restrictions, but that’s been a contributing factor for DALL-E 2. It has to rely on generic people, and that reduces the scope of its available training data.

PD: So no Nic Cage riding a Corgi playing a trumpet?

LS: Not for DALL-E 2, at least not yet. This technology will get there though.

BG: On the more philosophical side of things, pertaining to AI and whether or not it can be truly creative, would you characterize AI-supported research as creative in its ability to reach novel conclusions or detect what people might have missed?

LS: Yeah, I mean, I think creativity is the application of human potential to solve problems in novel ways – not just completely linear, connect-the-dots thinking. AI research is producing some of the most interesting new developments in humanity right now, in my opinion. I think there’s gonna be a Nobel Prize for some of these novel model architectures and developments. So I would definitely categorize the research side as creativity. An interesting segue from there is: are the models themselves creative? Is AI creative? That’s kind of a more interesting question. They’re running what is basically a deterministic process, through what can be characterized as equations. So, is that creativity? I’ll probably leave that to the listener, but it’s not really the same. We share some biological similarity with how these models are architected – we’ve all heard the term neural network, and that structure is biologically inspired. These models share our method of taking sensory input, generating mappings with semantic detail and specificity, and then being able to regurgitate that through other modalities, like language. But even with that biological similarity in structure, one is deterministic. You can argue about whether the universe is deterministic as well – it has the laws of physics that underlie it, just as the neural networks have the math that generates their results. I would probably say models aren’t creative. I think they’re just processes; I can run them on my computer. But they definitely generate the equivalent of what a human would be credited with creativity for, so it’s difficult to say.

PD: In this larger conversation, we keep alluding to the human in the loop, which is definitely necessary in any computer vision model. In a novel application like this, the whole point is that we wanna see something silly, or something beautiful, or something that our senses react to. But in a production setting, or something that’s constantly improving, we need humans to actually inform what that biological replication – the neural network – is trying to execute on. It’s a lot more important that there’s a person there, because there are limitations on the capabilities. Does that pass muster?

LS: Right, if you just train DALL-E 2 on the internet it’ll hit 4Chan, it’ll hit Reddit. It’ll probably be exposed to a lot of racist content. You can’t let these things run wild. They need guardrails and they need some kind of oversight with humans in the loop to predefine those ethical constraints. So definitely humans have to be in the loop. You know, you see that in self-driving cars. What do you do when the car’s speeding and it comes down to a decision between saving the driver and saving the pedestrian? Is there a situation where it can mitigate the impact altogether? Both of those things have to be set by a human. We can’t just trust machine learning because models don’t have ethics. They don’t have feelings yet, so humans need to fill that role.

PD: Makes sense. And I think it’s funny that the example you used brought it to a material scenario. The driver versus the pedestrian is well beyond the philosophical kind of guardrails I was hoping to put on this, but a great example. Thank you so much for joining us Logan. We hope to have you back again soon.

LS: Hey, thanks for having me.

PD: Awesome, and thank you listeners for coming along for our latest episode. Stay tuned for more conversations with the AI pros about what the future looks like for computer vision in production here on AI in Plainsight.