Running a gpt-oss eval suite against LM Studio on a Mac

OpenAI's gpt-oss models come with an eval suite, which is described in their Verifying gpt-oss implementations cookbook. I figured out how to run it on my Mac against their gpt-oss-20b model, hosted locally with LM Studio, using uv.

TLDR: Here's the recipe that worked for me:

mkdir /tmp/aime25_openai
OPENAI_API_KEY=x \
  uv run --python 3.13 --with 'gpt-oss[eval]' \
  python -m gpt_oss.evals \
  --base-url http://localhost:1234/v1 \
  --eval aime25 \
  --sampler chat_completions \
  --model openai/gpt-oss-20b \
  --reasoning-effort low \
  --n-threads 2

You'll need to install the model in LM Studio first - you can start by clicking the "Use Model in LM Studio" button on their openai/gpt-oss-20b page.

When you load the model make sure to increase the context length available to it - the default, 4096, is too short for some of the exercises.

The above command runs 240 prompts and can take several hours. Add --debug to run just 5 prompts instead, which is a lot faster.
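Here's what that quicker variant looks like - it's the same recipe with --debug appended at the end:

OPENAI_API_KEY=x \
  uv run --python 3.13 --with 'gpt-oss[eval]' \
  python -m gpt_oss.evals \
  --base-url http://localhost:1234/v1 \
  --eval aime25 \
  --sampler chat_completions \
  --model openai/gpt-oss-20b \
  --reasoning-effort low \
  --n-threads 2 \
  --debug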

The options

Let's break this down a little.

The OPENAI_API_KEY=x is a placeholder - the eval harness talks to the server through the OpenAI client library, which insists on an API key being set even though LM Studio's local server doesn't need one. --base-url http://localhost:1234/v1 points it at that local server, --eval aime25 selects the AIME 2025 benchmark and --sampler chat_completions tells it to use the OpenAI-compatible chat completions API.

I'm using uv run for the benchmark. This means I get all of the dependencies - the gpt-oss package with its eval extra - installed automatically without having to worry about setting up a virtual environment myself.
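For comparison, here's roughly what uv run is saving me from doing by hand - a sketch, assuming you already have a Python 3.13 interpreter installed and that the PyPI package installs cleanly with pip:

python3.13 -m venv venv
source venv/bin/activate
pip install 'gpt-oss[eval]'
OPENAI_API_KEY=x python -m gpt_oss.evals \
  --base-url http://localhost:1234/v1 \
  --eval aime25 \
  --sampler chat_completions \
  --model openai/gpt-oss-20b \
  --reasoning-effort low \
  --n-threads 2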

The results

The script wrote out three files to my /tmp/aime25_openai directory, including an HTML document of the questions and answers and a summary JSON file.

Here's that summary JSON in full:

{
  "chars": 7282.25,
  "chars:std": 4356.9365532638485,
  "score": 0.45416666666666666,
  "score:std": 0.4978948740000801
}

That score is the most important number: the eval suite assigns a 1 for each correct answer and a 0 for each incorrect answer, then reports the average across all questions, which here is 45.4%. chars is the average length of the responses in characters, whitespace included. The keys ending in :std are the standard deviations of those values.
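As a quick sanity check on that number: 240 prompts at a score of 0.4541666… works out to 109 correct answers, since 109 / 240 = 0.4541666….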

An example benchmark question

Here's one example of a shorter question and answer from the HTML document:

user
Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.
Please reason step by step, and put your final answer within \boxed{}.
assistant
We need in base b: numbers: 17_b = 1*b+7 = b+7. 97_b=9*b+7.

Condition: (b+7) divides (9b+7). Compute division: 9b+7 = 9(b+7)-56? Let's compute: 9(b+7)=9b+63 so difference: (9b+7)-(9b+63)= -56. So 9b+7 ≡ -56 mod (b+7). Therefore divisibility requires b+7 divides 56.

Because remainder is -56; so need b+7 | 56.

Also base>9, and digits must be less than base: in 97_b digit 9 < b => b≥10. Already >9.

So let d = b+7 positive divisor of 56. Divisors of 56: 1,2,4,7,8,14,28,56. Then b = d-7. Need b>9 so d-7>9 → d>16. So d in divisors >16: 28 and 56.

Check b=21 and b=49. Both base≥10. Also digits 1,7,9 okay (since b>9). So bases 21 and 49.

Sum = 70.In base \(b\) we have

\[ 17_{b}=1\cdot b+7=b+7,\qquad 97_{b}=9\cdot b+7=9b+7 . \]

For \(17_b\) to divide \(97_b\),

\[ (b+7)\mid (9b+7). \]

Compute the remainder of \(9b+7\) upon division by \(b+7\):

\[ 9b+7 = 9(b+7)-56 . \]

Hence

\[ 9b+7 \equiv -56 \pmod{\,b+7}. \]

Thus we require

\[ b+7 \mid 56. \]

Let \(d=b+7\).
\(d\) must be a positive divisor of \(56\):

\[ 1,\;2,\;4,\;7,\;8,\;14,\;28,\;56 . \]

The base must satisfy \(b>9\), i.e. \(d-7>9 \Rightarrow d>16\).
From the list, the eligible divisors are \(28\) and \(56\).

Therefore

\[ b = 28-7 = 21,\qquad b = 56-7 = 49. \]

Both bases satisfy the digit condition \((9<b)\).

The sum of all such integer bases is

\[ 21+49=70 . \]

\[ \boxed{70} \]

Correct Answer: 70

Extracted Answer: 70

Score: 1.0

All of the questions end with the exact same text, which is added by this prompt template:

Please reason step by step, and put your final answer within \boxed{}.

(Love that we're saying "please" to these things!)

Each question in the benchmark has an expected correct answer, and the test harness compares that to the extracted \boxed{X} value.
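Conceptually that comparison is simple: pull the value out of the \boxed{...} and check it against the expected answer. Here's a rough sketch of the idea in Python - my own illustration, not the actual code from the harness:

import re

# Illustrative sketch only - not the real gpt_oss.evals implementation.
# Pull out the contents of the last \boxed{...} in a response and
# compare it to the expected answer.
BOXED_RE = re.compile(r"\\boxed\{([^}]*)\}")

def extract_answer(response_text):
    matches = BOXED_RE.findall(response_text)
    return matches[-1].strip() if matches else None

def score(response_text, expected):
    return 1.0 if extract_answer(response_text) == expected else 0.0

print(score(r"... so the answer is \boxed{70}.", "70"))  # prints 1.0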

The implementation of the eval

The gpt_oss/evals/aime_eval.py Python file is less than 100 lines long.

It loads two separate newline-delimited JSON files and combines them into one collection of questions.
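Here's a rough sketch of that loading pattern - the file names below are placeholders I made up, not the paths the real aime_eval.py uses:

import json

# Illustrative sketch only - combine two newline-delimited JSON files
# into a single list of question dictionaries.
FILES = ["aime25_part1.jsonl", "aime25_part2.jsonl"]

questions = []
for path in FILES:
    with open(path) as f:
        for line in f:
            if line.strip():
                # each line is a JSON object with "question" and "answer" keys
                questions.append(json.loads(line))

print(len(questions))  # combined question count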

Yes, it turns out this eval is just 30 questions total! Each line in those files looks something like this:

{"question": "Find the sum of all integer bases $b>9$ for which $17_{b}$ is a divisor of $97_{b}$.", "answer": "70"}

So why did the eval take 3.5 hours to run? Because for this specific eval each of the 30 questions is run eight times, as defined in the eval's code, for a total of 240 prompts.

Problems I had getting this to work

Unfortunately the eval only writes its results to disk at the end of the run, so you may not find out about problems until it has been chugging away for hours!

Created 2025-08-16T20:38:16-07:00, updated 2025-08-16T21:15:00-07:00