- Sabrina Ramonov

# Test Driving ChatGPT-4o (Part 4)

## ChatGPT-4o vs. Spatial Reasoning IQ Test

Today I test **ChatGPT-4o's multimodal abilities on a spatial reasoning IQ test**.

This is the 4th installment in my series, Test Driving ChatGPT-4o.

Part 1 — D&D story, real-time data, SAT problem.

Part 2 — multimodal + CoT to solve math problem.

Part 3 — image generation of conceptual opposites.

# Problem Statement

Here’s a sample IQ test question that requires **spatial reasoning:**

Sample question from IQ test

# Solution

The 3rd column equals the union of the 1st column and 2nd column.

The same holds true for rows:

The 3rd row equals the union of the 1st row and 2nd row.

Therefore, the **solution is figure 8.**

# Overview of Experiments

My goal is to understand:

**Can ChatGPT reliably solve spatial reasoning IQ test questions?**

I’ll try various prompts and techniques, including removing the answer options, asking ChatGPT to visualize the answer, as well as the Chain of Thought prompt engineering technique.

**Here are my experiments, from simple to complex:**

1. Naive prompt with options
2. Naive prompt without options
3. Naive prompt without options: visualize answer
4. Naive prompt without options: solve first, then visualize answer
5. Naive prompt without options: Chain of Thought, then generalize

# 1. Naive Prompt with Options

### Prompt

First, I give ChatGPT-4o an image of the question and ask it to solve the IQ test represented by the image.

Sabrina Ramonov @ sabrina.dev

### ChatGPT-4o Output

`To solve the IQ test, we need to identify the pattern among the shapes in the matrix and determine which of the options completes the series.`

`Let's analyze the given pattern:`

`The matrix consists of three rows and three columns, with the last cell in the third row missing.`

`Each cell contains a diamond shape with some shading.`

`Here is the step-by-step reasoning to identify the pattern:`

**Row-wise Analysis:**

**First Row:** `The diamond shapes in the first row are all divided diagonally, with one half shaded and the other half unshaded.`

`The direction of shading is consistent (upper left to lower right).`

**Second Row:**`The diamond shapes in the second row are also divided diagonally, but the shading direction changes.`

`The shading alternates directions (upper right to lower left).`

**Third Row:**`Following the previous pattern, the third row should have diamonds divided diagonally with consistent shading.`

`The shading should follow the alternating pattern, which would mean the missing diamond should have the same shading direction as the first row (upper left to lower right).`

**Column-wise Analysis:**

**First Column:** `The first column shows the diamonds divided with a consistent shading pattern.`

**Second Column:**`The second column follows the shading pattern with alternating directions.`

**Third Column:**`The third column should follow the shading pattern seen in the other two columns.`

**Options Analysis:**

`Option 1: The diamond is divided, with shading inconsistent with the required pattern.`

`Option 2: The diamond is divided, with shading inconsistent with the required pattern.`

`Option 3: The diamond is divided, with shading consistent with the required pattern.`

`Option 4: The diamond is divided, but the shading is inconsistent with the required pattern.`

`Option 5: The diamond is divided, with shading inconsistent with the required pattern.`

`Option 6: The diamond is divided, but the shading is inconsistent with the required pattern.`

`Option 7: The diamond is not divided.`

`Option 8: The diamond is divided, with shading consistent with the required pattern.`

`Conclusion: The correct answer should be consistent with both the row and column patterns. Based on this analysis, the option that best fits the shading direction and pattern is Option 3.`

`Thus, Option 3 is the correct answer.`

### My Thoughts

I’m impressed ChatGPT tried to describe the problem image in detail.

Makes me wonder if IQ tests were part of its training dataset?

It **correctly identified** there are 8 answer options.

It **correctly identified** how to approach the problem:

`To solve the IQ test, we need to identify the pattern among the shapes in the matrix and determine which of the options completes the series.`

**ChatGPT narrows down the answers to 3 and 8:**

`Option 3: The diamond is divided, with shading consistent with the required pattern.`

`Option 8: The diamond is divided, with shading consistent with the required pattern.`

**Sadly, it chose option 3 (wrong) instead of option 8 (correct).**

So close!

Is it just because option 3 comes before option 8 in the list? 🤔
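One way to probe that question would be to re-run the test with the answer figures presented in different orders and see whether the model's pick follows the figure or the position. The sketch below only generates the shuffled orderings; the function name and parameters are hypothetical, and producing the re-ordered images and re-asking ChatGPT is left out.

```python
import random


def shuffled_option_orders(num_options=8, correct=8, trials=5, seed=0):
    """Return (order, position_of_correct) pairs for re-ordered option lists.

    `order` lists the original figure numbers in their new display order;
    `position_of_correct` is the 1-based slot where the correct figure lands.
    """
    rng = random.Random(seed)  # fixed seed so the orderings are reproducible
    figures = list(range(1, num_options + 1))
    results = []
    for _ in range(trials):
        order = figures[:]
        rng.shuffle(order)
        results.append((order, order.index(correct) + 1))
    return results


for order, position in shuffled_option_orders():
    print(f"order={order} -> correct figure is now option {position}")
```

If the model kept choosing whatever sits in slot 3 regardless of the figure shown there, that would point to position bias rather than reasoning.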

# 2. Naive Prompt Without Options

Next, I remove the answer options from the image.

Will this make it **easier or harder for ChatGPT?**

### Prompt

Sabrina Ramonov @ sabrina.dev

### ChatGPT-4o Output

`The IQ test shown in the image seems to be asking for a pattern completion. The provided shapes form a sequence, and the task is to identify the missing piece that correctly completes the pattern.`

`To solve this, let's analyze the patterns in the rows:`

**First Row:**`The first shape is a square with a black triangle in the top left corner.`

`The second shape is a square with a black triangle in the top right corner.`

`The third shape is a square with a black triangle covering the entire top half.`

**Second Row:**`The first shape is a square with a black triangle in the bottom left corner.`

`The second shape is a square with a black triangle in the bottom right corner.`

`The third shape is a square with a black triangle covering the entire bottom half.`

**Third Row:**`The first shape is a square with a black triangle covering the left half.`

`The second shape is a square with a black triangle covering the right half.`

`The third shape is missing.`

`From this analysis, we can infer that the pattern involves a sequence where each subsequent shape in a row shows a progression of filling in black triangles. The missing shape in the third row should be a square with a black triangle covering the entire shape, just like in the previous rows.`

`Therefore, the correct answer to complete the pattern would be a square completely filled with a black triangle.`

### My Thoughts

**Wow, GPT-4o got it!**

Its reasoning is sound:

`From this analysis, we can infer that the pattern involves a sequence where each subsequent shape in a row shows a progression of filling in black triangles`

It **almost** describes the correct answer:

`The missing shape in the third row should be a square with a black triangle covering the entire shape, just like in the previous rows. Therefore, the correct answer to complete the pattern would be a square completely filled with a black triangle.`

I expected ChatGPT to describe the missing shape as a black square, rather than a **“square with a black triangle covering the entire shape”**.

Nonetheless, its final answer **“square completely filled with a black triangle”** would work as long as the triangle is big enough to cover the square!

Counts as a win in my book!

# 3. Naive Prompt Without Options — Visualize Answer

Next, I continue hiding the answer options.

I also ask GPT-4o to output the image of the correct answer and describe the image it produced.

### Prompt

Sabrina Ramonov @ sabrina.dev

### ChatGPT-4o Output

`The correct answer for the IQ test is the middle-center diamond, which is all white. Here is the image produced based on this description:`

Sabrina Ramonov @ sabrina.dev

### My Thoughts

Wow, quite sad — **totally different and unexpected!**

The final answer is just wrong:

`The correct answer for the IQ test is the middle-center diamond`

The generated image is cool… but wrong.

It’s almost like ChatGPT-4o lost the ability to solve the spatial reasoning problem because I asked it to output an image.

I wonder why multimodal image generation throws things off so much…

# 4. Naive Prompt Without Options — Solve First, Then Visualize Answer

I try again with a slight variation.

I ask GPT-4o to **first** solve the spatial reasoning problem, **second** generate a visualization of the answer, and **third** describe the image generated.

### Prompt

Sabrina Ramonov @ sabrina.dev

### ChatGPT-4o Output

`The image shows a 3x3 grid of diamond shapes with varying shading patterns. Each diamond contains one or two black triangles. The bottom right space is the solution for the puzzle, which is a diamond with a black triangle in the bottom left corner.`

Sabrina Ramonov @ sabrina.dev

### My Thoughts

Yikes, a lot went wrong again…

The solution to the IQ test problem is wrong.

The generated image doesn’t match the proposed solution.

ChatGPT-4o seems to struggle with precise image generation.

I wonder — **is the issue that GPT-4o wants to output the image first?**

# 5. Naive Prompt Without Options — Chain of Thought, Then Generalize

I decide to whip out my reliable, trusted partner, Chain of Thought, a prompt engineering technique that typically improves LLM accuracy on logic and reasoning tasks.

I ask ChatGPT-4o to first solve the IQ test and append:

**take a deep breath and work on this problem step-by-step**.

Like before, I ask ChatGPT to generate an image of the solution and describe the image generated.
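The prompt variants in these experiments differ mainly in which instructions are attached. Here is a rough sketch of that composition. Only the Chain of Thought sentence is quoted from this post; the solve and visualize wording, the function name, and its flags are placeholders, since the exact prompts appear only in the screenshots.

```python
# Quoted from the post; everything else below is placeholder wording.
COT_SUFFIX = "take a deep breath and work on this problem step-by-step"


def build_prompt(chain_of_thought=False, visualize=False):
    """Assemble a text prompt to send alongside the puzzle image."""
    parts = ["Solve the IQ test represented by the image."]
    if chain_of_thought:
        # Append the Chain of Thought nudge right after the solve instruction.
        parts.append(COT_SUFFIX.capitalize() + ".")
    if visualize:
        parts.append(
            "Then generate an image of the solution and describe the image you generated."
        )
    return " ".join(parts)


print(build_prompt(chain_of_thought=True, visualize=True))
```

Keeping the solve instruction first and the image request last mirrors the ordering used in experiments 4 and 5.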

### Prompt

Sabrina Ramonov @ sabrina.dev

### ChatGPT-4o Output

`To solve the IQ test in the image, we need to identify the pattern in the sequence of shapes. Let's analyze each row and column to determine the pattern:`

**Row-wise Analysis:**`The shapes in each row are diamonds, each divided into four quadrants, with one or two quadrants shaded.`

**Column-wise Analysis:**`The first column has the top-left quadrant shaded.`

`The second column has the top-right quadrant shaded.`

`The third column has the bottom-right quadrant shaded.`

`The blank shape at the end seems to be the one we need to determine based on the patterns observed.`

**Observations:**

`Each shape in the same column has a consistent shading pattern based on the specific quadrant.`

`The last shape in the third row should follow the same shading pattern as the first shape in the first column, where the top-left quadrant is shaded.`

**Solution:**

`The missing shape should be a diamond with the top-left quadrant shaded.`

`Let's generate an image of the solution:`

```
{
"prompt": "A grid with three rows and three columns of diamond shapes, each divided into four quadrants. The shapes in the first column have the top-left quadrant shaded, the shapes in the second column have the top-right quadrant shaded, and the shapes in the third column have the bottom-right quadrant shaded. The missing shape in the last position of the grid should have the top-left quadrant shaded.",
"size": "1024x1024"
}
```

Sabrina Ramonov @ sabrina.dev

**Description of the Generated Image:**

`The image shows a 3x3 grid of diamond shapes, each divided into four quadrants. The shading pattern is as follows:`

`The first column has shapes with the top-left quadrant shaded.`

`The second column has shapes with the top-right quadrant shaded.`

`The third column has shapes with the bottom-right quadrant shaded.`

`The missing shape in the last position of the grid has the top-left quadrant shaded, completing the pattern.`

### My Thoughts

Wow, a detailed response!

ChatGPT-4o starts by analyzing the image, applying Chain of Thought.

But now, the **solution is wrong.**

`The missing shape should be a diamond with the top-left quadrant shaded.`

Interestingly, it tried to generate an image that contains all the original patterns with the solution filled in at the end.

I wonder if it’s due to the probabilistic nature of LLMs, further complicated by the difficulty of generating precise geometry from a prompt.

# Conclusion

Sadly, GPT-4o **cannot reliably solve spatial reasoning IQ test questions.**

…yet!

Today, it struggles with complex tasks and precise image generation.

Interestingly, even generating an image from a description is inconsistent, especially for fairly simple, well-defined shapes.

I’m impressed GPT-4o solved it once (Test #2, Naive Prompt Without Options), although its description of the answer was slightly strange. I’m also surprised that it succeeded without any special technique (e.g. Chain of Thought)!

**Could it be a fluke?**
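A quick tally of the five runs above puts that question in perspective. The labels are shorthand for the experiment headings, and each experiment was run only once, so this is a very small sample.

```python
# True means the final answer was judged correct in that experiment.
results = {
    "1. Naive prompt with options": False,
    "2. Naive prompt without options": True,
    "3. Without options, visualize answer": False,
    "4. Without options, solve first then visualize": False,
    "5. Without options, Chain of Thought": False,
}

correct = sum(results.values())
total = len(results)
print(f"{correct}/{total} correct ({correct / total:.0%})")  # prints: 1/5 correct (20%)
```

With a single attempt per prompt, one success out of five cannot distinguish a real capability from luck; repeating each prompt several times would be needed to say more.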

In any case, GPT-4o’s reasoning attempts are fun to investigate!

This concludes part 4 of my series **Test Driving ChatGPT-4o**!

For part 1, click *here*: D&D story, real-time data, SAT math problem.

For part 2, click *here*: complex math problem + multimodal + CoT.

For part 3, click *here*: image generation of conceptual opposites.