Test Driving ChatGPT-4o (Part 4)

ChatGPT-4o vs. Spatial Reasoning IQ Test

Today I test ChatGPT-4o's multimodal abilities on a spatial reasoning IQ test.

This is the 4th installment in my series, Test Driving ChatGPT-4o.

Part 1 — D&D story, real-time data, SAT problem.

Part 2 — multimodal + CoT to solve math problem.

Part 3 — image generation of conceptual opposites.

Problem Statement

Here’s a sample IQ test question that requires spatial reasoning:

Sample question from IQ test

Solution

The 3rd column equals the union of the 1st column and 2nd column.

The same holds true for rows:

The 3rd row equals the union of the 1st row and 2nd row.

Therefore, the solution is figure 8.

Overview of Experiments

My goal is to understand:

Can ChatGPT reliably solve spatial reasoning IQ test questions?

I’ll try various prompts and techniques, including removing the answer options, asking ChatGPT to visualize the answer, as well as the Chain of Thought prompt engineering technique.

Here are my experiments, from simple to complex:

  1. Naive Prompt with Options

  2. Naive Prompt Without Options

  3. Naive Prompt Without Options — Visualize Answer

  4. Naive Prompt Without Options — Solve First, Then Visualize Answer

  5. Naive Prompt Without Options — Chain of Thought, Then Generalize

1. Naive Prompt with Options

Prompt

First, I give ChatGPT-4o an image of the question and ask it to solve the IQ test represented by the image.

Sabrina Ramonov @ sabrina.dev

ChatGPT-4o Output

To solve the IQ test, we need to identify the pattern among the shapes in the matrix and determine which of the options completes the series.

Let's analyze the given pattern:

  1. The matrix consists of three rows and three columns, with the last cell in the third row missing.

  2. Each cell contains a diamond shape with some shading.

Here is the step-by-step reasoning to identify the pattern:

  1. Row-wise Analysis:

    • First Row:

      • The diamond shapes in the first row are all divided diagonally, with one half shaded and the other half unshaded.

      • The direction of shading is consistent (upper left to lower right).

    • Second Row:

      • The diamond shapes in the second row are also divided diagonally, but the shading direction changes.

      • The shading alternates directions (upper right to lower left).

    • Third Row:

      • Following the previous pattern, the third row should have diamonds divided diagonally with consistent shading.

      • The shading should follow the alternating pattern, which would mean the missing diamond should have the same shading direction as the first row (upper left to lower right).

  2. Column-wise Analysis:

    • First Column:

      • The first column shows the diamonds divided with a consistent shading pattern.

    • Second Column:

      • The second column follows the shading pattern with alternating directions.

    • Third Column:

      • The third column should follow the shading pattern seen in the other two columns.

  3. Options Analysis:

    • Option 1: The diamond is divided, with shading inconsistent with the required pattern.

    • Option 2: The diamond is divided, with shading inconsistent with the required pattern.

    • Option 3: The diamond is divided, with shading consistent with the required pattern.

    • Option 4: The diamond is divided, but the shading is inconsistent with the required pattern.

    • Option 5: The diamond is divided, with shading inconsistent with the required pattern.

    • Option 6: The diamond is divided, but the shading is inconsistent with the required pattern.

    • Option 7: The diamond is not divided.

    • Option 8: The diamond is divided, with shading consistent with the required pattern.

Conclusion: The correct answer should be consistent with both the row and column patterns. Based on this analysis, the option that best fits the shading direction and pattern is Option 3.

Thus, Option 3 is the correct answer.

My Thoughts

I’m impressed ChatGPT tried to describe the problem image in detail.

Makes me wonder if IQ tests were part of its training dataset?

It correctly identified there are 8 answer options.

It correctly identified how to approach the problem:

To solve the IQ test, we need to identify the pattern among the shapes in the matrix and determine which of the options completes the series.

ChatGPT narrows down the answers to 3 and 8:

Option 3: The diamond is divided, with shading consistent with the required pattern.

Option 8: The diamond is divided, with shading consistent with the required pattern.

Sadly, it chose option 3 (wrong) instead of option 8 (correct).

So close!

Is it just because option 3 comes before option 8 in the list? 🤔 
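One way to probe that question would be to re-run the same prompt with the answer options shuffled and see whether the model's pick follows the slot or the figure. The helper below is a hypothetical sketch that only produces the shuffled orderings. Re-rendering the image and querying ChatGPT are left out, and the function names are my own.

```python
import random


def shuffled_labelings(num_options: int, num_trials: int, seed: int = 0):
    """Return `num_trials` random orderings of the original option numbers.

    Each ordering is a list where position i (0-based) holds the original
    option shown in display slot i + 1.
    """
    rng = random.Random(seed)  # fixed seed so trials are reproducible
    original = list(range(1, num_options + 1))
    return [rng.sample(original, k=num_options) for _ in range(num_trials)]


def display_slot(ordering, original_option: int) -> int:
    """1-based slot where `original_option` appears in a shuffled ordering."""
    return ordering.index(original_option) + 1


orderings = shuffled_labelings(num_options=8, num_trials=5)
for ordering in orderings:
    # Original option 8 is the correct figure in this article's puzzle.
    print(ordering, "-> correct figure shown in slot", display_slot(ordering, 8))
```

If the model kept choosing slot 3 no matter which figure sat there, that would point to position bias rather than reasoning about the shapes.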

2. Naive Prompt Without Options

Next, I remove the answer options from the image.

Will this make it easier or harder for ChatGPT?

Prompt

ChatGPT-4o Output

The IQ test shown in the image seems to be asking for a pattern completion. The provided shapes form a sequence, and the task is to identify the missing piece that correctly completes the pattern.

To solve this, let's analyze the patterns in the rows:

  1. First Row:

    • The first shape is a square with a black triangle in the top left corner.

    • The second shape is a square with a black triangle in the top right corner.

    • The third shape is a square with a black triangle covering the entire top half.

  2. Second Row:

    • The first shape is a square with a black triangle in the bottom left corner.

    • The second shape is a square with a black triangle in the bottom right corner.

    • The third shape is a square with a black triangle covering the entire bottom half.

  3. Third Row:

    • The first shape is a square with a black triangle covering the left half.

    • The second shape is a square with a black triangle covering the right half.

    • The third shape is missing.

From this analysis, we can infer that the pattern involves a sequence where each subsequent shape in a row shows a progression of filling in black triangles. The missing shape in the third row should be a square with a black triangle covering the entire shape, just like in the previous rows.

Therefore, the correct answer to complete the pattern would be a square completely filled with a black triangle.

My Thoughts

Wow, GPT-4o got it!

Its reasoning is sound:

From this analysis, we can infer that the pattern involves a sequence where each subsequent shape in a row shows a progression of filling in black triangles

It almost describes the correct answer:

The missing shape in the third row should be a square with a black triangle covering the entire shape, just like in the previous rows. Therefore, the correct answer to complete the pattern would be a square completely filled with a black triangle.

I expected ChatGPT to describe the missing shape as a black square, rather than a “square with a black triangle covering the entire shape”.

Nonetheless, its final answer “square completely filled with a black triangle” would work as long as the triangle is big enough to cover the square!

Counts as a win in my book!

3. Naive Prompt Without Options — Visualize Answer

Next, I continue hiding the answer options.

I also ask GPT-4o to output the image of the correct answer and describe the image it produced.

Prompt

ChatGPT-4o Output

The correct answer for the IQ test is the middle-center diamond, which is all white. Here is the image produced based on this description:

My Thoughts

Wow, quite sad — totally different and unexpected!

The final answer is just wrong:

The correct answer for the IQ test is the middle-center diamond

The generated image is cool… but wrong.

It’s almost like ChatGPT-4o lost the ability to solve the spatial reasoning problem because I asked it to output an image.

I wonder why multimodal image generation throws things off so much…

4. Naive Prompt Without Options — Solve First, Then Visualize Answer

I try again with a slight variation.

I ask GPT-4o to first solve the spatial reasoning problem, second generate a visualization of the answer, and third describe the image generated.

Prompt

ChatGPT-4o Output

The image shows a 3x3 grid of diamond shapes with varying shading patterns. Each diamond contains one or two black triangles. The bottom right space is the solution for the puzzle, which is a diamond with a black triangle in the bottom left corner.

My Thoughts

Yikes, a lot went wrong again…

The solution to the IQ test problem is wrong.

The generated image doesn’t match the proposed solution.

ChatGPT-4o seems to struggle with precise image generation.

I wonder — is the issue that GPT-4o wants to output the image first?

5. Naive Prompt Without Options — Chain of Thought, Then Generalize

I decide to whip out my reliable, trusted partner: Chain of Thought, a prompt engineering technique that typically improves LLM accuracy on logic and reasoning tasks.

I ask ChatGPT-4o to first solve the IQ test and append:

take a deep breath and work on this problem step-by-step.

Like before, I ask ChatGPT to generate an image of the solution and describe the image generated.
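Here is a minimal sketch of how such a prompt could be assembled. Only the Chain of Thought suffix is quoted from above. The wording of the three steps and the `build_prompt` helper are assumptions for illustration, not the exact prompt I sent.

```python
# The suffix is quoted from the article; everything else is illustrative.
COT_SUFFIX = "take a deep breath and work on this problem step-by-step."


def build_prompt(steps, cot_suffix: str = COT_SUFFIX) -> str:
    """Join ordered instructions into one prompt and append the CoT suffix."""
    numbered = [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    return "\n".join(numbered + ["", cot_suffix])


# Hypothetical step wording: solve first, then visualize, then describe.
prompt = build_prompt([
    "Solve the IQ test in the image.",
    "Generate an image of the solution.",
    "Describe the image you generated.",
])
print(prompt)
```

Numbering the steps makes the intended order explicit, so the solving step comes before any image generation.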

Prompt

ChatGPT-4o Output

To solve the IQ test in the image, we need to identify the pattern in the sequence of shapes. Let's analyze each row and column to determine the pattern:

  1. Row-wise Analysis:

    • The shapes in each row are diamonds, each divided into four quadrants, with one or two quadrants shaded.

  2. Column-wise Analysis:

    • The first column has the top-left quadrant shaded.

    • The second column has the top-right quadrant shaded.

    • The third column has the bottom-right quadrant shaded.

The blank shape at the end seems to be the one we need to determine based on the patterns observed.

Observations:

  • Each shape in the same column has a consistent shading pattern based on the specific quadrant.

  • The last shape in the third row should follow the same shading pattern as the first shape in the first column, where the top-left quadrant is shaded.

Solution:

The missing shape should be a diamond with the top-left quadrant shaded.

Let's generate an image of the solution:

{
  "prompt": "A grid with three rows and three columns of diamond shapes, each divided into four quadrants. The shapes in the first column have the top-left quadrant shaded, the shapes in the second column have the top-right quadrant shaded, and the shapes in the third column have the bottom-right quadrant shaded. The missing shape in the last position of the grid should have the top-left quadrant shaded.",
  "size": "1024x1024"
}

Description of the Generated Image:

The image shows a 3x3 grid of diamond shapes, each divided into four quadrants. The shading pattern is as follows:

  • The first column has shapes with the top-left quadrant shaded.

  • The second column has shapes with the top-right quadrant shaded.

  • The third column has shapes with the bottom-right quadrant shaded.

  • The missing shape in the last position of the grid has the top-left quadrant shaded, completing the pattern.

My Thoughts

Wow, a detailed response!

ChatGPT-4o starts by analyzing the image, applying Chain of Thought.

But now, the solution is wrong.

The missing shape should be a diamond with the top-left quadrant shaded.

Interestingly, it tried to generate an image that contains all the original patterns with the solution filled in at the end.

I wonder if it’s due to the probabilistic nature of LLMs, further complicated by the difficulty of generating precise geometry from a prompt.

Conclusion

Sadly, GPT-4o cannot reliably solve spatial reasoning IQ test questions.

…yet!

Today, it struggles with complex tasks and precise image generation.

Interestingly, even generating an image from a description is inconsistent, especially for fairly simple, well-defined shapes.

I’m impressed GPT-4o solved it once (Test #2, Naive Prompt Without Options), although its description of the answer was slightly strange. I’m also surprised it needed no special techniques (e.g. Chain of Thought)!

Could it be a fluke?
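A quick tally of the five experiments above puts that question in perspective. The outcomes are the ones reported in this article. Note that each experiment used a different prompt, so this is a rough score rather than a repeated trial of a single prompt.

```python
# Outcomes as reported in this article: experiment number -> solved correctly?
results = {1: False, 2: True, 3: False, 4: False, 5: False}


def success_rate(outcomes) -> float:
    """Fraction of experiments that produced the correct answer."""
    return sum(outcomes.values()) / len(outcomes)


rate = success_rate(results)
print(f"{sum(results.values())} of {len(results)} correct ({rate:.0%})")
```

Settling the fluke question would take many runs of the same prompt, which I did not do here.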

In any case, GPT-4o’s reasoning attempts are fun to investigate!

This concludes part 4 of my series Test Driving ChatGPT-4o!

For part 1, click here: D&D story, real-time data, SAT math problem.

For part 2, click here: complex math problem + multimodal + CoT.

For part 3, click here: image generation of conceptual opposites.