
>wouldn't we expect it to have worse quality than o1?

That's tricky; you can optimize a model to do really well on synthetic benchmarks.

That said, DeepSeek performs a bit worse than GPT-4 in general, and substantially worse on benchmarks like ARC, which are designed with this in mind.



Are you sure you checked R1 and not V3? By default, R1 is disabled in their UI.

  Prompt: Find an English word that contains 4 'S' letters and 3 'T' letters.

  Deepseek-R1: stethoscopists (correct, thought for 207 seconds)

  ChatGPT-o1: substantialists (correct, thought for 188 seconds)

  ChatGPT-4o: statistics (wrong) (even with "let's think step by step")
In almost every example I provide, it's on par with o1 and better than 4o.
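The letter-count claims above are easy to check mechanically. A minimal sketch (the word list is taken from the answers quoted in this thread):

```python
# Verify the 4-S / 3-T counts for the words the models produced.
words = ["stethoscopists", "substantialists", "statistics"]

for word in words:
    s, t = word.count("s"), word.count("t")
    verdict = "OK" if (s, t) == (4, 3) else "fails"
    print(f"{word}: {s} s's, {t} t's -> {verdict}")
```

"statistics" fails because it contains only three 's' letters, matching the "(wrong)" note above.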

>substantially wrong on benchmarks like ARC which is designed with this in mind.

Wasn't it revealed OpenAI trained their model on that benchmark specifically? And had access to the entire dataset?


That prompt means nothing. Check out the benchmarks.

Also, compare V3 to 4o and R1 to o1; that's the right comparison.



