I can attest that the distribution looks odd, based on the test set that we sampled.
We've already spent the compute to run the zero-shot GPT model on every datapoint in the provided test set. We're now grading the outputs manually (our whole fraternity is chipping in!) and should have the results out relatively soon.
I can say that, so far, it's not looking good for that 90% correct zero-shot claim either.
Since you are here: when I was reading the paper I wondered -- when they show the "zero-shot solve rates", does that mean they are running essentially the same experiment code, just without the prompts that call `few_shot_response` (i.e. they still try each question with every expert prefix and every critique)? It wasn't clear to me at a glance.
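To make the question concrete, here's a rough sketch of the two readings I have in mind. The prefix and critique names are made-up placeholders, not anything from the paper's code, and I'm only assuming the loop is shaped roughly like this.

```python
from itertools import product

# Hypothetical placeholders; the real expert prefixes and critiques are in the paper's code.
EXPERT_PREFIXES = ["expert_a", "expert_b"]
CRITIQUES = ["critique_a", "critique_b", "critique_c"]


def attempts_per_question(full_sweep: bool) -> list[tuple[str, str]]:
    """List the (expert prefix, critique) combinations tried for one question."""
    if full_sweep:
        # Reading 1: same loop as the full experiment, with only the
        # few_shot_response prompts skipped.
        return list(product(EXPERT_PREFIXES, CRITIQUES))
    # Reading 2: a single plain attempt, with no prefix and no critique.
    return [("", "")]


print(len(attempts_per_question(full_sweep=True)))   # 6
print(len(attempts_per_question(full_sweep=False)))  # 1
```

Under the first reading each question gets several tries, so which one they mean changes how the "zero-shot" number should be read.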