AI Whimsy: An Image Is a Story Is an Image

Scout Burghardt
Published in Cogapp
7 min read · Oct 11, 2023


The best paintings and photos stimulate our brains: they make us ask what’s going on here, what’s the context, and what’s going to happen next. They tell stories and let our brains fill in the gaps. In our latest Cogapp hackday, we asked AI this question: what will happen next in this image? And to make it more fun, we had it generate images based on these predictions.

1911 painting Nonchaloir (Repose) by John Singer Sargent, National Gallery of Art, Washington. (This image is in the public domain)

Our team wanted to create an app that could take an image as input and generate a new image to show what happens next in the story. We also wanted to give the user some options to choose from different genres, like romantic comedy, action adventure, horror, etc. What if The Scream was part of a horror movie? Or what if Nighthawks was a Film Noir? Or mix it all up: Let the Mona Lisa loose as a spy in an action flick!

AI-generated images showing us what might happen next in Nonchaloir in the genres romantic comedy, horror film, and as an action adventure.

How did we make it happen?

Story and image generation

To make our app, we communicated with a number of AI APIs to send prompts and receive replies in the form of text and images:

  1. Input image: The user uploads an image of their choice to the app. The image can be anything: a painting, a photograph, a drawing, etc.
  2. Image to text: The app sends the image to Replicate’s MiniGPT-4 model, which returns a description of the image. This model is powerful because it lets you add a text prompt along with the image. To generate something that reads a bit like a story rather than just a static image description, we pass the following prompt: “Describe what is happening in this image. Do not mention that it is an image.”
  3. Text to story: Next, we send the generated text to OpenAI’s GPT-3 model to generate a short text describing what happens next in the story. To get a creative and coherent story continuation for a number of different genres, we give it a prompt that specifies both the style and the length of the story, for example: “Given the following image description, generate another image description that describes what happens next in the story. The second image description should be no longer than 400 characters and should be coherent and creative. The style of the story should be [genre name].” We keep a dictionary of genres and pass each one to the prompt in turn.
  4. Story to image: Finally, we send the newly generated stories of each genre to Replicate’s stable-diffusion model to generate a new image for each one of them.
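In outline, the four steps can be sketched in Python. This is a minimal sketch, not our exact code: the three API calls (MiniGPT-4, GPT-3, Stable Diffusion) are injected as plain callables, and all helper names are hypothetical.

```python
# Sketch of the four-step pipeline. The API calls are passed in as
# callables so the orchestration logic stays readable and testable.
from typing import Callable, Dict, List

DESCRIBE_PROMPT = (
    "Describe what is happening in this image. "
    "Do not mention that it is an image."
)

def build_story_prompt(genre: str, description: str) -> str:
    """Prompt asking GPT-3 to continue the story in a given genre."""
    return (
        "Given the following image description, generate another image "
        "description that describes what happens next in the story. "
        "The second image description should be no longer than 400 "
        "characters and should be coherent and creative. "
        f"The style of the story should be {genre}.\n\n{description}"
    )

def what_happens_next(
    image_path: str,
    genres: List[str],
    describe: Callable[[str, str], str],   # step 2: image -> description
    continue_story: Callable[[str], str],  # step 3: description -> story
    render: Callable[[str], str],          # step 4: story -> image
) -> Dict[str, dict]:
    description = describe(image_path, DESCRIBE_PROMPT)
    results = {}
    for genre in genres:
        story = continue_story(build_story_prompt(genre, description))
        results[genre] = {"story": story, "image": render(story)}
    return results
```

In the real app, `describe` and `render` wrap `replicate.run(...)` calls and `continue_story` wraps an OpenAI completion call.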

Our app

The user interface is a simple website. Since we were working in Python, we chose Flask to generate it. The homepage shows all the images that have already been added to the site. Users can click on any of them and then pick one of the listed genres. When they click a genre button, the original image fades into the new image while the story appears next to it letter by letter, mimicking the look and sound of a typewriter. We felt it was an appropriate design for our storytelling app. Users can also choose to upload their own image. If they do, they must provide some information about the source image (at least a title and a source), which is used as the image title and to display credit information.

For each upload, our app saves the credits file (as simple markdown) and the original image alongside the generated images and texts inside a subdirectory. It takes a minute or two for a newly uploaded image to be processed: it is run through all the steps, with calls to all three APIs. The image is then added to our database and to the homepage.
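The on-disk layout for one upload can be sketched like this (a minimal sketch; the directory and file names are illustrative, not our exact ones):

```python
# Save one upload: the original image, a markdown credits file, and the
# generated story texts, each in their own per-image subdirectory.
import re
from pathlib import Path

def slugify(title: str) -> str:
    """Turn an image title into a safe directory name."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def save_upload(base_dir: str, title: str, source: str,
                image_bytes: bytes, stories: dict) -> Path:
    folder = Path(base_dir) / slugify(title)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "original.jpg").write_bytes(image_bytes)
    (folder / "credits.md").write_text(
        f"# {title}\n\nSource: {source}\n", encoding="utf-8"
    )
    for genre, story in stories.items():
        (folder / f"{slugify(genre)}.txt").write_text(story, encoding="utf-8")
    return folder
```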

What if this 16th century Field Armor (as seen in the Art Institute Chicago) was thrown into a 20th century noir adventure? This is what!

What did we learn?

We built this all in one frenzied hackday in our Brighton office. Our team consisted of a senior and a junior staff member, and both of us made generous use of our trusty GPT assistants. Neither of us had worked with Flask before, but ChatGPT-4 was very helpful in giving us the code snippets and explanations we needed to set up our app quickly and easily. In the process, we learned about Flask but also about prompting! With GPT’s assistance, it was amazingly easy to create the website and to communicate with the APIs, much faster than reading the docs and building everything from scratch.

It turned out, though, that refining our prototype in the following weeks (fixing issues and adding functionality, again with ChatGPT-4’s help) was more challenging, as the code grew more and more complex; it now feels a bit like an overgrown backyard garden rather than a well-organised vegetable patch. Still, the app works and does most of what we wanted.

Successes and challenges

We were very impressed by the quality and diversity of the outputs. GPT-3 generated creative and humorous stories. It was perfect for the scope of the project.

For example, we fed it this beautiful quilt, and below you can see some of the stories it generated:

Quilt of squares on a beige base, each with a unique flower or small scene with dark-skinned figures and animals.
Album Quilt from 1854 as seen at the Art Institute of Chicago

Film Noir: In the dimly lit room, Detective Malone’s sharp eyes fixated on the patchwork quilt with suspicion. As he studied the intricate design, he grew certain that hidden within its colorful threads lay the key to the mysterious heist. With gloved hands, he lifted a corner, revealing a hidden compartment beneath. Tightly concealed within, he discovered a small, folded piece of paper — an address that would lead him one step closer to unraveling the enigma.

Surrealist French Movie from the 60s: As the sun sets, the quilt begins to shimmer and pulsate with a mystic energy. Suddenly, the fabric ripples and folds, transforming into a living tapestry. The trees grow leaves of laughter, birds sing kaleidoscopic melodies, and animals weave through the vibrant threads. The quilt comes alive, inviting the viewer into a whimsical world filled with color and enchantment.

Romantic Comedy: As the camera pans out, we see a couple sitting on a picnic blanket, marveling at the quilt. They start playfully pointing out their favorite patterns and colors. Laughing, they make a pact to embark on a road trip to find fabrics that match the quilt’s vibrant shades. Little do they know, this colorful adventure will weave them together in a love story as beautiful as the patchwork masterpiece before them.

The quality of the images returned by Stable Diffusion varied. It is not bad at creating images, but we did encounter its infamous shortcomings: distorted faces and missing or extra limbs. A detailed, if cumbersome, list of negative prompts helped improve the image quality significantly:

“disfigured, oversaturated, grainy, low-res, deformed, blurry, poorly drawn face, mutation, mutated, calligraphy, sign, writing, watermark, text, extra limb, poorly drawn hands, missing limb, floating limbs, extra limb, out of focus, ugly, disconnected limbs, malformed hands, blur, out of focus, long neck, disgusting, poorly drawn, childish, mutilated, mangled, old”.
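Passing that list to the model is just one extra field in the request payload. A minimal sketch, assuming the Stable Diffusion version on Replicate accepts a `negative_prompt` input (check the model’s schema before relying on it); the constant below is the list above, lightly deduplicated:

```python
# Build the input payload for a text-to-image call with a negative
# prompt. We assume the model accepts a "negative_prompt" field.
NEGATIVE_PROMPT = (
    "disfigured, oversaturated, grainy, low-res, deformed, blurry, "
    "poorly drawn face, mutation, mutated, calligraphy, sign, writing, "
    "watermark, text, extra limb, poorly drawn hands, missing limb, "
    "floating limbs, out of focus, ugly, disconnected limbs, "
    "malformed hands, blur, long neck, disgusting, poorly drawn, "
    "childish, mutilated, mangled, old"
)

def sd_input(story: str, width: int = 512, height: int = 512) -> dict:
    """Input dict to pass as `input=` when calling the model."""
    return {
        "prompt": story,
        "negative_prompt": NEGATIVE_PROMPT,
        "width": width,
        "height": height,
    }
```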

One of the biggest challenges was getting MiniGPT-4 to return replies in the correct format. Consequently, a few features of the app had to be left by the roadside. For example, in our initial image description generation, we would have liked to also ask MiniGPT-4 for the medium and artistic style of the uploaded image so that we could create matching images: photographs would generate other photographic-looking images, the Mona Lisa would generate oil paintings in Da Vinci’s style, and so on. We tried using MiniGPT-4 for this, but the results it produced, while correctly identifying the art styles and mediums of images, were simply not reliably formatted. We tried telling it to give us the result in the form:

medium: oil on canvas, art style: impressionism

But unfortunately this only worked 80 or 90 percent of the time, and had we tried to use this information programmatically, it would have created errors or some very unforeseen results.
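A strict parser with a fallback illustrates the problem. This is a sketch, not the app’s actual code: the off-format replies are exactly why the `None` branch would be needed.

```python
# Parse "medium: ..., art style: ..." strictly; return None whenever
# the model goes off-script, so the caller can fall back to a default.
import re
from typing import Optional, Tuple

PATTERN = re.compile(
    r"^medium:\s*(?P<medium>[^,]+),\s*art style:\s*(?P<style>.+)$",
    re.IGNORECASE,
)

def parse_medium_and_style(reply: str) -> Optional[Tuple[str, str]]:
    """Return (medium, style), or None for an off-format reply."""
    match = PATTERN.match(reply.strip())
    if match is None:
        return None
    return match.group("medium").strip(), match.group("style").strip()
```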

Nevertheless, we are pleased with the results and had a lot of fun along the way. We have made the results available for you to view, although unfortunately you won’t be able to upload your own images, as the costs of opening this experiment to the world would be unpredictable. But you are welcome to visit our overgrown garden and have a look at the strange and wonderful images and stories we have grown: https://whatnext.pythonanywhere.com/

The team consisted of two members: Tristan is Technical Director and Scout is QA Lead at Cogapp. Cogapp is a digital agency specialising in the cultural sector. Please get in touch if you’d like to hear more about our other hack-day projects or any of our other work.

If you’re interested in joining us for a hack day, there’s more information on our website.

We’re on Twitter or you can contact us via our website.
