Test Driving ChatGPT-4o (Part 5)

Convert Image-to-Text, Then Recreate Image From Text

In this post, I ask multimodal ChatGPT-4o to convert a photo to text, then recreate the photo using only its text description.

I’m interested to see how lossy the text description will be, since it acts, in a sense, as a hidden representation layer between the original and recreated images.

I’m also curious what prompt engineering techniques will have the most impact.

This is the 5th installment in my series, Test Driving ChatGPT-4o.

Part 1 — D&D story, real-time data, SAT problem.

Part 2 — multimodal + CoT to solve math problem.

Part 3 — image generation of conceptual opposites.

Part 4 — spatial reasoning IQ test.

I share this experiment in real time on YouTube if you prefer watching:

I give ChatGPT a photo of myself with my two Samoyed puppies, Hugs and Bubble, in our backyard with the beautiful mountains of SLC behind us:

Sabrina Ramonov @ sabrina.dev

Experiment 1: Naive Approach

I start with a straightforward naive approach.

I simply ask ChatGPT to create a detailed description of the image and then generate an image from the description.
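If you want to reproduce this naive pipeline outside the ChatGPT UI, here’s a minimal sketch using the OpenAI Python SDK. The file name, model choices, and prompt wording are my assumptions for illustration, not the exact setup in this post (I ran everything in the ChatGPT interface):

```python
# Minimal sketch of the naive image -> text -> image pipeline (assumptions noted above).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: image -> text. Send the photo to GPT-4o and ask for a detailed description.
with open("backyard_photo.jpg", "rb") as f:  # hypothetical file name
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

describe = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Create a detailed description of this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
description = describe.choices[0].message.content

# Step 2: text -> image. Feed the description to DALL-E 3.
result = client.images.generate(model="dall-e-3", prompt=description, size="1024x1024")
print(result.data[0].url)  # URL of the generated image
```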

Sabrina Ramonov @ sabrina.dev

Not bad for a first try!

The generated image includes a mountain range, picnic table, two dogs, Santa Claus pajamas, and my Oakley sunglasses.

However, the image missed many details: the black metal fence, me sitting on top of the picnic table, my hair color, the colors of the scenery (predominantly green rather than yellow), the tree behind the gate, and so on.

Experiment 2: Meta-Prompting

Next, I try something more advanced: meta-prompting.

Instead of a naive prompt, I ask ChatGPT to act as an expert prompt engineer and write a detailed prompt to accomplish this image-to-text-to-image task.

Given this brief and simple meta-prompt, ChatGPT creates a comprehensive prompt covering the scene, setting, foreground, background, lighting, people, animals, and additional details.
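The meta-prompt itself can be just a couple of sentences. Here’s a rough sketch of the pattern; the wording is my paraphrase, not the exact prompt I used:

```python
# Rough sketch of meta-prompting: ask the model to write the prompt for you.
# Wording is a paraphrase; `client` is the OpenAI client from the earlier sketch.
meta_prompt = (
    "You are an expert prompt engineer. Write a detailed prompt I can give to a "
    "multimodal model so that it produces a thorough text description of a photo, "
    "covering the scene, setting, foreground, background, lighting, people, "
    "animals, and any additional details. The description will later be used to "
    "recreate the photo with an image generation model."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": meta_prompt}],
)
generated_prompt = response.choices[0].message.content
print(generated_prompt)  # copy this into a fresh session, as I do below
```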

I personally love meta-prompting, especially when I’m feeling lazy or stuck, unsure how to further improve a prompt to get what I want.

Sabrina Ramonov @ sabrina.dev

I copy the prompt generated by ChatGPT and start a new session.

Using the generated prompt, ChatGPT’s resulting description of the photo is very thorough!

It mentions details like my age (“in her late 20s” 🤣 I love you ChatGPT), my clothing, and the breed of my puppies (Samoyed):

Sabrina Ramonov @ sabrina.dev

Sabrina Ramonov @ sabrina.dev

There are many improvements over the previous generated image:

  • tree behind me

  • black metal fence

  • 2 white fluffball Samoyeds

  • me sitting on top of the picnic table

  • color of scenery: green landscape and blue skies

But I’m disappointed the person doesn’t look like me!

If only I could fix that…

Experiment 3: Chain-of-Thought

For my 3rd experiment, I combine meta-prompting from the last experiment with a chain-of-thought technique.

I ask ChatGPT to take a deep breath and proceed step by step.
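Mechanically, this is just a chain-of-thought nudge layered on top of the meta-generated prompt. A sketch, reusing `client`, `image_b64`, and `generated_prompt` from the earlier sketches (the exact wording is an assumption):

```python
# Chain-of-thought: prepend the "step by step" instruction to the meta-generated prompt.
cot_prompt = "Take a deep breath and proceed step by step.\n\n" + generated_prompt

cot_description = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": cot_prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
).choices[0].message.content
```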

Sabrina Ramonov @ sabrina.dev

Sabrina Ramonov @ sabrina.dev

Hmm… I like this image, it’s appealing, and the person looks more like me.

It shows details the previous image did not, such as:

  • my yellow socks

  • my black sandals

  • tree behind the gate

  • word “SURF” on my shirt

However, the generated image still missed the mark:

  • only 1 fluff cloud instead of 2

  • weird 2-tiered black metal fence

  • lots of backyard furniture that shouldn’t be there

  • strange picnic table, looks really annoying to get in/out of

ChatGPT got many precise features correct in its text description

… yet failed to generate an image containing these features.

For example:

She is wearing a long-sleeved dark blue shirt with the word “SURF” visible on her left sleeve.

Sabrina Ramonov @ sabrina.dev

ChatGPT’s description is correct, but in the generated image the word “SURF” appears somewhere other than my sleeve. Meanwhile, the symbols ChatGPT did put on my sleeve don’t appear to form real words.

Experiment 4: Interactive Prompting

Next, I try to iteratively refine specific details by giving ChatGPT direct feedback on what aspects of the image I want to modify.

I update the prompt from my last experiment, adding the instructions below (sketched in code right after them), hoping to get ChatGPT to include more descriptive detail about me (the subject) and the background:

First, identify the subject of the image.

Second, identify the background of the image.

Be sure to include a lot of detail about the subject and the background separately.
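In code, this update is just a prefix on the prompt from the previous experiment. A small sketch, with `generated_prompt` carried over from the meta-prompting sketch:

```python
# Sketch of the refined prompt: decompose into subject and background first.
structured_prompt = (
    "First, identify the subject of the image.\n"
    "Second, identify the background of the image.\n"
    "Be sure to include a lot of detail about the subject and the background "
    "separately.\n\n"
    + generated_prompt
)
```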

Sabrina Ramonov @ sabrina.dev

Sabrina Ramonov @ sabrina.dev

I love the background here!

I also love her hair… my hair does NOT look like that, many thanks ChatGPT!

The biggest issue: doggos are all wrong, adorable yes, but wrong.

I continue the same chat session and engage in interactive prompting.

I tell ChatGPT (see the sketch after this list):

  • Keep the background and scenery exactly the same

  • Only replace the 2 dogs with 2 Samoyed dogs

  • The dogs shouldn't be wearing a sweater

  • Both dogs are white and fluffy
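In API terms, this kind of feedback is simply another user turn appended to the same conversation. Here’s a hypothetical sketch; `history` stands for the messages exchanged so far in the session, and the wording is a paraphrase of my feedback:

```python
# Interactive prompting sketch: append feedback as a new user turn, then regenerate.
feedback = (
    "Keep the background and scenery exactly the same. "
    "Only replace the 2 dogs with 2 Samoyed dogs. "
    "The dogs shouldn't be wearing a sweater. "
    "Both dogs are white and fluffy."
)

history.append({"role": "user", "content": feedback})
revised = client.chat.completions.create(model="gpt-4o", messages=history)
revised_description = revised.choices[0].message.content

# Re-run image generation with the revised description (same DALL-E call as before).
result = client.images.generate(
    model="dall-e-3", prompt=revised_description, size="1024x1024"
)
```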

The result?

Sabrina Ramonov @ sabrina.dev

The dogs are more accurate, but the background changed way too much.

The picnic table is all wrong again; we don’t have a stone base below our fence, and we don’t have trees or bushes in our backyard within the fence…

I have no idea how that Samoyed isn’t tumbling over. Physics?!

Most importantly — what happened to my shirt?!?

NSFW ChatGPT!

It seems ChatGPT translated “a moment of leisure and happiness” in its generated text description to “half her shirt is gone” 😆 

Makes me wonder where that training data came from…

Sabrina Ramonov @ sabrina.dev

This experiment highlights a major challenge I’ve encountered multiple times with Gen AI and image generation:

It’s hard to maintain consistency while refining specific elements.

Experiment 5: 3×3 Grid

Finally, I try having ChatGPT analyze the image piecewise (the tiling idea is sketched in code after the steps below). Perhaps if ChatGPT creates a detailed description of each piece of the original photo, it will be better able to generate precise details.

First, break the image into a grid of nine equally sized sections.

Then, analyze and describe each section in detail.

Last, piece the 9 sections together.
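For reference, the tiling itself is easy to do locally rather than asking ChatGPT to imagine the split. A sketch using Pillow (the file name and 3×3 layout are assumptions; in the experiment I asked ChatGPT to do the splitting, which is where things went sideways):

```python
# Sketch of the grid idea: crop the photo into a 3x3 grid of nine tiles,
# describe each tile, then stitch the nine descriptions into one generation prompt.
from PIL import Image

img = Image.open("backyard_photo.jpg")  # hypothetical file name
w, h = img.size

tiles = []
for row in range(3):
    for col in range(3):
        box = (col * w // 3, row * h // 3, (col + 1) * w // 3, (row + 1) * h // 3)
        tiles.append(img.crop(box))

# Each tile would then go through the same "describe" step as the naive sketch,
# and the nine descriptions would be concatenated into a single prompt for DALL-E.
```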

The result?

Sabrina Ramonov @ sabrina.dev

Epic fail.

This created a 3×3 grid of nine images, where each tile emphasized a different aspect of the original photo.

Not what I’m looking for, but this would make cute canvas art home decor!

Bonus: Updated Meta-Prompt with Example

@billyrubano8268 on YouTube accurately pointed out that I forgot to copy and paste the “example” in the prompt generated by ChatGPT.

So I re-ran the meta-prompting experiment a few more times and finally got my favorite image so far!

Yay, ChatGPT, meta-prompting for the win!

Sabrina Ramonov @ sabrina.dev

Conclusion

Through these experiments, I discovered that converting an image to a text description and then recreating it is fairly complex.

The text description layer can be quite lossy, making it difficult to capture every detail accurately.

To be clear, the newest model, ChatGPT-4o, is the one analyzing the image and converting it to a text description.

But the last step, generating an image from text, presumably uses DALL-E, as OpenAI has not yet released GPT-4o-powered image generation.

Out of all tests, the meta-prompting technique seemed to work best.

Thanks for following along with my journey!

This concludes part 5 of my series Test Driving ChatGPT-4o!

Part 1 — D&D story, real-time data, SAT problem.

Part 2 — multimodal + CoT to solve math problem.

Part 3 — image generation of conceptual opposites.

Part 4 — spatial reasoning IQ test.