Test Driving ChatGPT-4o (Part 5)

Convert Image-to-Text, Then Recreate Image From Text

In this post, I ask multimodal ChatGPT-4o to convert a photo to text, then recreate the photo using only its text description.

I’m interested to see how lossy the text description will be, since it acts, in a sense, as a hidden representation layer between the original and recreated images.

I’m also curious what prompt engineering techniques will have the most impact.

This is the 5th installment in my series, Test Driving ChatGPT-4o.

Part 1 — D&D story, real-time data, SAT problem.

Part 2 — multimodal + CoT to solve math problem.

Part 3 — image generation of conceptual opposites.

Part 4 — spatial reasoning IQ test.

I share this experiment in real time on YouTube if you prefer watching:

I give ChatGPT a photo of myself with my two Samoyed puppies, Hugs and Bubble, in our backyard with the beautiful mountains of SLC behind us:

Sabrina Ramonov @ sabrina.dev

Experiment 1: Naive Approach

I start with a straightforward naive approach.

I simply ask ChatGPT to create a detailed description of the image and then generate an image from the description.
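If you want to reproduce this naive pipeline outside the ChatGPT UI, here’s a minimal sketch using the OpenAI Python SDK. The file name, model choices, and prompt wording are my assumptions for illustration, not the exact setup in this post (I ran everything in the ChatGPT interface):

```python
# Minimal sketch of the naive image -> text -> image pipeline (assumptions noted above).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1: image -> text. Send the photo to GPT-4o and ask for a detailed description.
with open("backyard_photo.jpg", "rb") as f:  # hypothetical file name
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

describe = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Create a detailed description of this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
description = describe.choices[0].message.content

# Step 2: text -> image. Feed the description to DALL-E 3.
result = client.images.generate(model="dall-e-3", prompt=description, size="1024x1024")
print(result.data[0].url)  # URL of the generated image
```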

Sabrina Ramonov @ sabrina.dev

Not bad for a first try!

The generated image includes a mountain range, picnic table, two dogs, Santa Claus pajamas, and my Oakley sunglasses.

However, the image missed many details: the black metal fence, me sitting on top of the picnic table, my hair color, the colors of the scenery (predominantly green rather than yellow), the tree behind the gate, and so on.

Experiment 2: Meta-Prompting

Next, I try something more advanced: meta-prompting.

Instead of a naive prompt, I ask ChatGPT to act as an expert prompt engineer and write a detailed prompt to accomplish this image-to-text-to-image task.

Given this brief and simple meta-prompt, ChatGPT creates a comprehensive prompt covering the scene, setting, foreground, background, lighting, people, animals, and additional details.
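The meta-prompt itself can be just a couple of sentences. Here’s a rough sketch of the pattern; the wording is my paraphrase, not the exact prompt I used:

```python
# Rough sketch of meta-prompting: ask the model to write the prompt for you.
# Wording is a paraphrase; `client` is the OpenAI client from the earlier sketch.
meta_prompt = (
    "You are an expert prompt engineer. Write a detailed prompt I can give to a "
    "multimodal model so that it produces a thorough text description of a photo, "
    "covering the scene, setting, foreground, background, lighting, people, "
    "animals, and any additional details. The description will later be used to "
    "recreate the photo with an image generation model."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": meta_prompt}],
)
generated_prompt = response.choices[0].message.content
print(generated_prompt)  # copy this into a fresh session, as I do below
```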

I personally love meta-prompting, especially when I’m feeling lazy or stuck, unsure how to further improve a prompt to get what I want.

Sabrina Ramonov @ sabrina.dev

I copy the prompt generated by ChatGPT and start a new session.

Using the generated prompt, ChatGPT’s resulting description of the photo is very thorough!

It mentions details like my age (“in her late 20s” 🤣 I love you ChatGPT), my clothing, and the breed of my puppies (Samoyed):

Sabrina Ramonov @ sabrina.dev

Sabrina Ramonov @ sabrina.dev

There are many improvements over the previous generated image:

  • tree behind me

  • black metal fence

  • 2 white fluffball Samoyeds

  • me sitting on top of the picnic table

  • color of scenery: green landscape and blue skies

But I’m disappointed the person doesn’t look like me!

If only I could fix that…

Experiment 3: Chain-of-Thought

For my 3rd experiment, I combine meta-prompting from the last experiment with a chain-of-thought technique.

I ask ChatGPT to take a deep breath and proceed step by step.
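Mechanically, this is just a chain-of-thought nudge layered on top of the meta-generated prompt. A sketch, reusing `client`, `image_b64`, and `generated_prompt` from the earlier sketches (the exact wording is an assumption):

```python
# Chain-of-thought: prepend the "step by step" instruction to the meta-generated prompt.
cot_prompt = "Take a deep breath and proceed step by step.\n\n" + generated_prompt

cot_description = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": cot_prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
).choices[0].message.content
```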

Sabrina Ramonov @ sabrina.dev

Sabrina Ramonov @ sabrina.dev

Hmm… I like this image, it’s appealing, and the person looks more like me.

It shows details the previous image did not, such as:

  • my yellow socks

  • my black sandals

  • tree behind the gate

  • word “SURF” on my shirt

However, the generated image still missed the mark:

  • only 1 fluff cloud instead of 2

  • weird 2-tiered black metal fence

  • lots of backyard furniture that shouldn’t be there

  • strange picnic table, looks really annoying to get in/out of

ChatGPT got many precise features correct in its text description

… yet failed to generate an image containing these features.

For example:

She is wearing a long-sleeved dark blue shirt with the word “SURF” visible on her left sleeve.

Sabrina Ramonov @ sabrina.dev

ChatGPT’s description is correct, but in the generated image the word “SURF” appears somewhere other than my sleeve. Meanwhile, the symbols ChatGPT did put on my sleeve don’t appear to form real words.

Experiment 4: Interactive Prompting

Next, I try to iteratively refine specific details by giving ChatGPT direct feedback on what aspects of the image I want to modify.

I update the prompt from my last experiment, adding the instructions below (sketched in code right after them), hoping to get ChatGPT to include more descriptive detail about me (the subject) and the background:

First, identify the subject of the image.

Second, identify the background of the image.

Be sure to include a lot of detail about the subject and the background separately.
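In code, this update is just a prefix on the prompt from the previous experiment. A small sketch, with `generated_prompt` carried over from the meta-prompting sketch:

```python
# Sketch of the refined prompt: decompose into subject and background first.
structured_prompt = (
    "First, identify the subject of the image.\n"
    "Second, identify the background of the image.\n"
    "Be sure to include a lot of detail about the subject and the background "
    "separately.\n\n"
    + generated_prompt
)
```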

Sabrina Ramonov @ sabrina.dev

Sabrina Ramonov @ sabrina.dev

I love the background here!

I also love her hair… my hair does NOT look like that, many thanks ChatGPT!

The biggest issue: doggos are all wrong, adorable yes, but wrong.

I continue the same chat session and engage in interactive prompting.

I tell ChatGPT (see the sketch after this list):

  • Keep the background and scenery exactly the same

  • Only replace the 2 dogs with 2 Samoyed dogs

  • The dogs shouldn't be wearing a sweater

  • Both dogs are white and fluffy
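In API terms, this kind of feedback is simply another user turn appended to the same conversation. Here’s a hypothetical sketch; `history` stands for the messages exchanged so far in the session, and the wording is a paraphrase of my feedback:

```python
# Interactive prompting sketch: append feedback as a new user turn, then regenerate.
feedback = (
    "Keep the background and scenery exactly the same. "
    "Only replace the 2 dogs with 2 Samoyed dogs. "
    "The dogs shouldn't be wearing a sweater. "
    "Both dogs are white and fluffy."
)

history.append({"role": "user", "content": feedback})
revised = client.chat.completions.create(model="gpt-4o", messages=history)
revised_description = revised.choices[0].message.content

# Re-run image generation with the revised description (same DALL-E call as before).
result = client.images.generate(
    model="dall-e-3", prompt=revised_description, size="1024x1024"
)
```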

The result?

Sabrina Ramonov @ sabrina.dev

The dogs are more accurate, but the background changed way too much.

The picnic table is all wrong again; we don’t have a stone base below our fence, and we don’t have trees or bushes in our backyard within the fence…

I have no idea how that Samoyed isn’t tumbling over. Physics?!

Most importantly — what happened to my shirt?!?

NSFW ChatGPT!

It seems ChatGPT translated “a moment of leisure and happiness” in its generated text description to “half her shirt is gone” 😆 

Makes me wonder where that training data came from…

Sabrina Ramonov @ sabrina.dev

This experiment highlights a major challenge I’ve encountered multiple times with Gen AI and image generation:

It’s hard to maintain consistency while refining specific elements.

Experiment 5: 3×3 Grid

Finally, I try having ChatGPT analyze the image piecewise (the tiling idea is sketched in code after the steps below). Perhaps if ChatGPT creates a detailed description of each piece of the original photo, it will be better able to generate precise details.

First, break the image into a grid of nine equally sized sections.

Then, analyze and describe each section in detail.

Last, piece the 9 sections together.
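For reference, the tiling itself is easy to do locally rather than asking ChatGPT to imagine the split. A sketch using Pillow (the file name and 3×3 layout are assumptions; in the experiment I asked ChatGPT to do the splitting, which is where things went sideways):

```python
# Sketch of the grid idea: crop the photo into a 3x3 grid of nine tiles,
# describe each tile, then stitch the nine descriptions into one generation prompt.
from PIL import Image

img = Image.open("backyard_photo.jpg")  # hypothetical file name
w, h = img.size

tiles = []
for row in range(3):
    for col in range(3):
        box = (col * w // 3, row * h // 3, (col + 1) * w // 3, (row + 1) * h // 3)
        tiles.append(img.crop(box))

# Each tile would then go through the same "describe" step as the naive sketch,
# and the nine descriptions would be concatenated into a single prompt for DALL-E.
```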

The result?

Sabrina Ramonov @ sabrina.dev

Epic fail.

This created a 3×3 grid of nine images, where each tile emphasized a different aspect of the original photo.

Not what I’m looking for, but this would make cute canvas art home decor!

Bonus: Updated Meta-Prompt with Example

@billyrubano8268 on YouTube accurately pointed out that I forgot to copy and paste the “example” in the prompt generated by ChatGPT.

So I re-ran the meta-prompting experiment a few more times and finally got my favorite image so far!

Yay, ChatGPT, meta-prompting for the win!

Sabrina Ramonov @ sabrina.dev

Conclusion

Through these experiments, I discovered that converting an image to a text description and then recreating it is fairly complex.

The text description layer can be quite lossy, making it difficult to capture every detail accurately.

To be clear, the newest model, ChatGPT-4o, is the one analyzing the image and converting it to a text description.

But the last step, generating an image from text, presumably uses DALL-E, as OpenAI has not yet released GPT-4o-powered image generation.

Out of all tests, the meta-prompting technique seemed to work best.

Thanks for following along with my journey!

This concludes part 5 of my series Test Driving ChatGPT-4o!

Part 1 — D&D story, real-time data, SAT problem.

Part 2 — multimodal + CoT to solve math problem.

Part 3 — image generation of conceptual opposites.

Part 4 — spatial reasoning IQ test.