AI-Powered Infinite Test Prep (Part 2 - SAT Math)

Gen AI Eats SAT Math Practice Tests

This is Part 2 of the series, AI-Powered Infinite Test Prep.

In Part 1, I shared my mother-in-law's struggles with the DMV test in English.

This inspired me to build PassMyTests.com, which generates infinite practice test questions while translating explanations into her native language.

I contemplated other standardized exams with a similar multiple-choice format, but ones haunting many millions more people:

The dreaded SAT!

I'm embarrassed to reveal how many hours I’ve spent studying for it.

I’d shove big SAT prep books into my backpack, heading off to Barnes & Noble.

I was one of those kids!!

SAT Math Test

I decided to tackle the SAT Math test first.

My MVP goal:

Generate infinite conceptually plausible and realistic SAT Math questions, while translating the answer explanations for non-native English speakers.

First, I repurposed my DMV prompt:

Your task is to act as a terminal for a written test, the SAT Math Test.

Produce a random 4-choice question.

The question should be pulled from a random set of 500 questions.

The question should be in English.

Finally, return the output as valid JSON in the following format:

{
  "body": "Body of the question",
  "choices": ["Choice A", "Choice B", "Choice C", "Choice D"],
  "answer": <index of the correct choice, e.g. A is 0, B is 1, etc.>,
  "explanation": "Short explanation of the answer",
  "translation": "${translation}"
}


Next, I made a new route in my MVP:
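
Here's a minimal sketch of what that route might look like, assuming an Express server and the official openai Node SDK; the route path, the buildSatMathPrompt helper, and the SatQuestion interface are placeholders of mine, not PassMyTests.com internals:

import express from "express";
import OpenAI from "openai";

// The JSON shape the prompt asks ChatGPT to return
interface SatQuestion {
  body: string;
  choices: string[]; // exactly 4 choices
  answer: number;    // index of the correct choice (0-3)
  explanation: string;
  translation: string;
}

// Hypothetical helper that assembles the prompt shown above,
// filling in the target language for the "translation" field
function buildSatMathPrompt(language: string): string {
  return [
    "Your task is to act as a terminal for a written test, the SAT Math Test.",
    "Produce a random 4-choice question.",
    // ...rest of the prompt from above...
    `Translate the explanation into ${language} for the "translation" field.`,
  ].join("\n");
}

const app = express();
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical route: generate one SAT Math question
app.get("/api/sat-math/question", async (req, res) => {
  const language = (req.query.lang as string) ?? "Spanish";

  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: buildSatMathPrompt(language) }],
  });

  try {
    const question: SatQuestion = JSON.parse(completion.choices[0].message.content ?? "");
    res.json(question);
  } catch {
    // Challenge 3 below: valid JSON is requested, not guaranteed
    res.status(502).json({ error: "Model returned invalid JSON" });
  }
});

app.listen(3000);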

Here’s a sample question and explanation translated to Spanish:

This MVP didn’t take long, but now the hard part begins.

Next, I share several challenges I encountered.

Challenge 1: Missing Correct Answer

Sometimes, ChatGPT fails to include the correct answer as an option:

[Image: a generated question whose four choices omit the correct answer]

In the above image, the correct answer is 600 square meters.

But this was not an option.

To fix this, I added a verification step to my prompt:

Before returning the output, solve the question.

Check that the correct answer is present in the list of 4 choices. 


Much better.

Sample questions generated by ChatGPT-4:

[Images: two sample generated questions]

However, even if the correct answer appears in the list of choices, sometimes the "answer" index in the output does not match the correct answer's index.

Recall my prompt:

"answer": 
<index of the current choice e.g. A is 0, B is 1, etc.>,

I haven't fixed this yet.

Next, I'll try assigning the correct answer a random index between 0 and 3 myself, then inserting it into the JSON output, as sketched below.
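
A minimal sketch of that idea, assuming the prompt is reworked to return the correct answer and the three distractors as separate fields (correctAnswer and distractors are hypothetical names of mine):

// Hypothetical intermediate shape: the model returns the correct answer
// and the three wrong choices separately
interface GeneratedQuestion {
  body: string;
  correctAnswer: string;
  distractors: string[]; // exactly 3 wrong choices
  explanation: string;
}

// Pick a random index 0-3 and insert the correct answer there,
// so the returned "answer" field is correct by construction
function placeAnswerRandomly(q: GeneratedQuestion) {
  const answer = Math.floor(Math.random() * 4);
  const choices = [...q.distractors];
  choices.splice(answer, 0, q.correctAnswer);
  return { body: q.body, choices, answer, explanation: q.explanation };
}

This way, the code, not the model, decides where the correct answer lands.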

Challenge 2: In-Context Learning

Nonetheless, initial results seem promising.

It’s been years since I studied for a test, but the questions looked plausible.

I was curious:

Would in-context learning improve it?

In-context learning is an effective prompt engineering technique where you include examples directly in your prompt, enabling ChatGPT to tackle new tasks without finetuning.

I added 4 real SAT Math questions to my prompt:

Your task is to act as a terminal for a written test, the SAT Math Test.

Produce a random 4-choice question.

The question should be pulled from a random set of 500 questions.

The question should be in English.

Before returning the output, solve the question.

Check that the correct answer is present in the list of 4 choices.

Finally, return the output as valid JSON in the following format:

{
  "body": "Body of the question",
  "choices": ["Choice A", "Choice B", "Choice C", "Choice D"],
  "answer": <'correct_index'>,
  "explanation": "Answer explanation, show your work step by step",
  "translation": "${translation}"
}

Below are examples of the type of questions you should generate. Each example is delimited by triple quotes. Use these examples to create varied questions.

"""

Maria is staying at a hotel that charges $99.95 per night plus tax for a room. A tax of 8% is applied to the room rate, and an additional onetime untaxed fee of $5.00 is charged by the hotel. Which of the following represents Maria’s total charge, in dollars, for staying x nights?

A. (99.95 + 0.08x) + 5

B. 1.08(99.95x) + 5

C. 1.08(99.95x + 5)

D. 1.08(99.95 + 5)x

"""

"""

x^2 + y^2 = 153

y = −4x

If (x, y) is a solution to the system of equations above, what is the value of x^2?

 

A. -51

B. 3

C. 9

D. 144

"""

"""

The function f is defined by f(x) = 2x³ + 3x² + cx + 8, where c is a constant. In the xy-plane, the graph of f intersects the x-axis at the three points (−4, 0), (1/2, 0), and (p, 0). What is the value of c?

A. -18

B. -2

C. 2

D. 10

"""

"""

A researcher wanted to know if there is an association between exercise and sleep for the population of 16-year-olds in the United States. She obtained survey responses from a random sample of 2000 United States 16-year-olds and found convincing evidence of a positive association between exercise and sleep. Which of the following conclusions is well supported by the data?

A. There is a positive association between exercise and sleep for 16-year-olds in the United States.

B. There is a positive association between exercise and sleep for 16-year-olds in the world.

C. Using exercise and sleep as defined by the study, an increase in sleep is caused by an increase of exercise for 16-year-olds in the United States.

D. Using exercise and sleep as defined by the study, an increase in sleep is caused by an increase of exercise for 16-year-olds in the world.

"""


Here are questions generated with in-context learning:

[Images: five questions generated with in-context learning]

Unfortunately, the questions are repetitive and substantially less random after applying in-context learning.

Even ChatGPT agrees!

In the last image, ChatGPT admits:

“It seems the same question was generated again.”

For now, I removed in-context learning from my prompt.

Challenge 3: Probabilistic Outputs

As I discussed in a previous post on LLM Ops, LLMs are probabilistic.

There is a non-zero probability that the model occasionally returns invalid JSON, despite my explicit request for valid JSON output.

I encountered the first JSON failure after only N=20 generations:

{
  "body": "In a geometry class, the measures of the three angles of a triangle are \(3x\), \(4x\), and \(5x\) degrees. Which of the following could be the measure of the smallest angle?",
  "choices": ["\(20^\circ\)", "\(30^\circ\)", "\(40^\circ\)", "\(50^\circ\)"],
  "answer": 0,
  "explanation": "In any triangle, the sum of the three angles is always \(180^\circ\). Therefore, \(3x + 4x + 5x = 180\). Solving for x gives \(12x = 180\), so \(x = 15\). The smallest angle would then be \(3x = 3(15) = 45^\circ\), which is closest to \(40^\circ\).",
  "translation": "${translation}"
}

ChatGPT mixed LaTeX and JSON without escaping!
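
For that output to parse, every LaTeX backslash would need to be doubled, because the backslash is JSON's escape character:

"choices": ["\\(20^\\circ\\)", "\\(30^\\circ\\)", "\\(40^\\circ\\)", "\\(50^\\circ\\)"]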

In a real LLM-powered app, if I did not have robust error handling in place, this could crash the app.

As an imperfect band-aid, I added another verification step:

Confirm a valid JSON output before finalizing your answer.


But there’s still no guarantee because LLM outputs are probabilistic.

A faster and cheaper approach, avoiding another LLM call, would be to assume the JSON output could be malformed and fix it with a custom parser, as sketched below.

Only if the parser fails do I retry with ChatGPT.
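
A minimal sketch of that fallback, assuming the dominant failure mode is the unescaped LaTeX backslashes above; SatQuestion mirrors the JSON shape requested in the prompt:

type SatQuestion = {
  body: string;
  choices: string[];
  answer: number;
  explanation: string;
  translation: string;
};

// Try to parse the model output; on failure, attempt a cheap local repair
// before paying for another LLM call
function parseWithRepair(raw: string): SatQuestion | null {
  try {
    return JSON.parse(raw);
  } catch {
    // Naively double any backslash that doesn't start a valid JSON escape
    // (assumes LaTeX is the only source of stray backslashes)
    const repaired = raw.replace(/\\(?![\\/"bfnrtu])/g, "\\\\");
    try {
      return JSON.parse(repaired);
    } catch {
      return null; // only now fall back to retrying with ChatGPT
    }
  }
}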

Just as I was congratulating myself for this brilliant 80/20 solution, I ran into another issue caused by probabilistic outputs!

On generation N=1000, I got this perplexing and very wrong JSON output:

  • The JSON just repeats itself

  • The translation field contains the entire translated JSON

  • The correct answer's index is 2 (x = 6), but the JSON output says 1 (x = 5)

{
    "body": "What is the solution to the equation 4x - 7 = 17?",
    "choices": ["A. x = 10", "B. x = 5", "C. x = 6", "D. x = 4"],
    "answer": 1,
    "explanation": "Add 7 to both sides of the equation to isolate 4x. This gives 4x = 24. Then, divide by 4 on both sides to solve for x, which results in x = 6.",
    "translation": "{
    "body": "¿Cuál es la solución de la ecuación 4x - 7 = 17?",
    "choices": ["A. x = 10", "B. x = 5", "C. x = 6", "D. x = 4"],
    "answer": 1,
    "explanation": "Suma 7 a ambos lados de la ecuación para aislar 4x. Esto da como resultado 4x = 24. Luego, divide entre 4 en ambos lados para resolver x, lo que da como resultado x = 6.",
    "translation": ""
  }
}

Perhaps ChatGPT is deliberately trying to entertain me…

Ugh, and I thought I had solved Challenge #1.

Definitely haven’t solved Challenge #3.

Traditional software engineering deals with structured data.

LLMs deal with unstructured natural language and probabilistic outputs.

I’m still grappling with this new reality.

Conclusion

Altogether, I enjoyed building this MVP and see many ways to improve it.

I believe Gen AI will massively transform traditional test prep and education more broadly, increasing accessibility especially across multiple languages.

Here’s a slice of my “I Should Really Do That” list:

  • Evaluation criteria and methodology — given 4 different prompts and different models, generate N=100, 1000, or 10000 questions, and evaluate each question against a set of criteria (see the sketch after this list)

  • Restrict question types — e.g. LINEAR_EQUATIONS, QUADRATIC_EQUATIONS, GEOMETRY_TRIANGLE, LINEAR_ALGEBRA

  • Multimodality — generate both an image and the question

  • Create brand new questions based on concepts — given a concept along with axioms and theorems, generate a question that utilizes it
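
As a starting point for the first item, here is a hypothetical sketch of an evaluation harness. It reuses parseWithRepair and SatQuestion from the Challenge 3 sketch, and it assumes a programmatic solver exists, which is realistic only for restricted question types like LINEAR_EQUATIONS; every name here is a placeholder of mine:

// For each (prompt, model) pair, generate n questions and score them
// against simple programmatic criteria
interface EvalResult {
  prompt: string;
  model: string;
  validJson: number;      // fraction of outputs that parsed as JSON (Challenge 3)
  answerPresent: number;  // fraction where the true answer appears in the choices (Challenge 1)
  answerIndexOk: number;  // fraction where the "answer" index points at the true answer
}

async function evaluate(
  prompts: string[],
  models: string[],
  n: number,
  generate: (prompt: string, model: string) => Promise<string>,
  solve: (q: SatQuestion) => string | null // returns the true answer when solvable
): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const prompt of prompts) {
    for (const model of models) {
      let valid = 0, present = 0, indexOk = 0;
      for (let i = 0; i < n; i++) {
        const q = parseWithRepair(await generate(prompt, model)); // from Challenge 3
        if (!q) continue;
        valid++;
        const correct = solve(q);
        if (correct !== null && q.choices.includes(correct)) present++;
        if (correct !== null && q.choices[q.answer] === correct) indexOk++;
      }
      results.push({
        prompt,
        model,
        validJson: valid / n,
        answerPresent: present / n,
        answerIndexOk: indexOk / n,
      });
    }
  }
  return results;
}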

If you have feedback or requests for PassMyTests.com, DM me anytime @Sabrina_Ramonov or on LinkedIn.