Here's a more concrete example where GPT-OSS 20B performed very well, IMHO. I tested it against Gemma 3 12B, Phi 4 Reasoning 14B, and Qwen 2.5-coder 14B.
The prompt is modeled as a part of an agent of sorts, and the "human" question is intentionally ill-posed to emulate people saying the wrong thing.
The prompt begins by asking the model to convert a question into MATLAB code and add any assumptions as comments at the start of the code, or, if that's not possible, to output four hash marks followed by a reason why.
The (ill-posed) question is "What's the cutoff frequency for an LC circuit with R equals 500 ohm and C equals 10 nanofarad?"
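Paraphrasing from the description above (not the exact wording), the instruction part of the prompt is roughly:

    Convert the question below into MATLAB code. Add any assumptions you
    make as comments at the start of the code. If this is not possible,
    output four hash marks (####) followed by a reason why.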
Gemma 3 took the bait: it treated R as L and proceeded to calculate the cutoff frequency of an LC circuit[1], completely ignoring the resulting mismatch of units. It did not add any comment at all. Completely wrong answer.
Qwen 2.5-coder detected the ill-posed nature, but decided to substitute a dummy value for L and calculate the LC circuit answer anyway. On the upside, it did add comments saying so, which is acceptable in that regard.
Phi 4 Reasoning reasoned for about 3 minutes before deciding to assume the question was about an RC circuit. It added this as a comment and correctly generated the code for an RC circuit. So a good answer, but slow.
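For reference, a correct RC-assumption answer would look roughly like this (my own sketch, not Phi 4's verbatim output):

    % Assumption: treating this as an RC low-pass filter, since R and C
    % were given but no inductance L.
    R = 500;            % resistance in ohms
    C = 10e-9;          % capacitance in farads
    fc = 1/(2*pi*R*C);  % cutoff frequency in Hz, about 31.8 kHz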
GPT-OSS reasoned for 14 seconds and determined the question was ill-posed, thus outputting the hash marks followed by "The cutoff frequency of an LC circuit cannot be determined with only R and C provided; the inductance L is required." A good answer, and fast.
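So its full output was essentially just:

    ####
    The cutoff frequency of an LC circuit cannot be determined with only R
    and C provided; the inductance L is required.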
Mostly because I had it downloaded already, and I'm mainly interested in models that fit on my 16 GB GPU. But since you asked, I ran the same questions through both 30B models in the q4_k_m variant, as GPT-OSS 20B is also quantized to about q4.
First the ill-posed question:
Qwen 3 Coder gave a very similar answer to Phi 4, though it included a more long-winded explanation in the comments. So not bad, but not great either.
Qwen 3 Thinking thought for a good minute before deciding the question was ill-posed and returning the hash marks. However, the explanation that followed was not as good as GPT-OSS's, IMHO: The question is unclear because an LC circuit (without resistance) does not have a "cutoff frequency"; cutoff frequency applies to filter circuits like RC or RLC. Additionally, the inductance (L) value is missing for calculating resonant frequency in an RLC circuit. The given R and C values are insufficient without L.
Sure, an unloaded LC filter doesn't have a cutoff frequency, but in all normal cases a load is implied[1], and so the LC filter does have a cutoff frequency. So more thinking to get to a worse answer.
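To spell out what I mean: the cutoff of a (loaded) LC low-pass is just the usual resonance formula, which still requires L. Sketch with a made-up inductance, since the question doesn't give one:

    % The question only gives C; L below is an example value, not from the question.
    L = 1e-3;                  % 1 mH (assumed for illustration)
    C = 10e-9;                 % 10 nF (from the question)
    f0 = 1/(2*pi*sqrt(L*C));   % ~50.3 kHz for these example values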
The SQL question:
Qwen 3 Coder did identify the same pitfall as GPT-OSS, but didn't flag it as clearly, mostly because it also flagged some unnecessary stuff, so the key point got drowned out. It did make the same assumption about evenly dividing, and overall the answer was about as good. However, on my computer it ran at roughly half the tokens per second of GPT-OSS, at just ~9 tokens/second.
Qwen 3 Thinking thought for 3 minutes, yet managed to miss the key aspect, thus giving everyone the pizza. And it did so at the same slow pace as Qwen 3 Coder.
The SQL question requires a somewhat large context due to the large table definitions, and since these are larger models, more layers had to be pushed to the CPU, which I assume is the major factor in the speed drop.
So overall Qwen 3 Coder was a solid contender, but much slower on my PC. If it could run entirely on the GPU I'd certainly try it a lot more. Interestingly, Qwen 3 Thinking was just plain worse. Perhaps it's not tuned for tasks other than coding?
Thank you for testing; I will test GPT-OSS for my use case as well. In case you're interested: I have 8 GB VRAM and 32 GB RAM and get around 21 tokens/s with tensor offloading, so I would assume your setup should be even faster than mine with those optimizations. I use the IQ4_KSS quant (by ubergarm on hf) with ik_llama.cpp with this command:
To determine the cutoff frequency (fc) for an RC circuit (since you've provided resistance R and capacitance C, but not inductance L), we can use the following formula:
[.... calculation]
So, the cutoff frequency is approximately 31.83 kHz.
Note:
If you intended to ask about an RLC circuit (with both R, L, and C), please provide the inductance L value, and I can calculate the cutoff frequency for that case as well. The formula would then involve both L and C.
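For what it's worth, that number checks out for the RC interpretation:

    fc = 1/(2*pi*500*10e-9)   % = 3.1831e+04 Hz, i.e. ~31.83 kHz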
[1]: https://en.wikipedia.org/wiki/LC_circuit#Resonance_effect