Doing taxes with a few short forms designed by the same agency to work together is not as impressive as you think it is. The instructions are literally printed on the form, in plain English, for exactly the kind of people you consider dumber than ChatGPT.
It quickly breaks down even at 8k with legislation that is even remotely nontrivial.
> The instructions are printed, yet I, and many other people, hire an accountant to do our taxes.
I can mow my lawn, yet I still hire landscapers. That says nothing about the difficulty of cutting grass or the intelligence of a DeWalt mower; it's about specialization and economic tradeoffs, like the liability insurance accountants carry for their client work.
> What if someone finds a good practical way to expand the context length to 10M tokens? Do you think such a model won't be able to do your task?
Not based on the current architecture (i.e., next-token prediction). It already fails at most of my use cases at 32K by default, unless I go to great lengths tuning the prompt.
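For a sense of scale, here's a minimal sketch of the kind of budget check that shows how quickly statute text outruns an 8K or even 32K window. It uses tiktoken; the input file and the reserve-for-output number are hypothetical placeholders, not something from this thread:

```python
# Rough sketch: check whether a statute even fits a context budget
# before spending a call on it. File path and reserve are illustrative.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def fits_window(text: str, window: int = 8192, reserve_for_output: int = 1024) -> bool:
    """True if the prompt leaves enough room for the model's answer."""
    prompt_tokens = len(enc.encode(text))
    return prompt_tokens <= window - reserve_for_output

statute = open("some_statute.txt").read()    # hypothetical input file
print(fits_window(statute, window=8192))     # nontrivial legislation usually fails this
print(fits_window(statute, window=32768))
```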
> It seems like you have an opportunity to compare 8k and 32k GPT-4 variants (I don't) - do you notice the difference?
32K works better for my use case but requires much more careful prompt "engineering" to keep it from going off the rails. In practice, actually using the full 32K is a disaster: the connection drops mid-response and I have to resend the entire context with a "continue" message, so a call that should cost $2-4 ends up costing upwards of $10. I haven't tried 32K on anything as large as a whole USC Title because that would cost thousands.
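To make the cost blowup concrete, here's a minimal sketch of that resend-with-"continue" loop, written against the 2023-era openai Python client (openai<1.0). The retry cap and pricing constants are my assumptions based on the published gpt-4-32k list prices, not anything official:

```python
# Sketch of the resend-with-"continue" workaround described above.
# Model name, retry cap, and pricing are assumptions, not official.
import openai

PRICE_PER_1K_PROMPT = 0.06      # gpt-4-32k list price, USD (mid-2023)
PRICE_PER_1K_COMPLETION = 0.12

def complete_with_continue(messages, model="gpt-4-32k", max_attempts=5):
    """Resend the whole (growing) context with a 'continue' turn until
    the model finishes cleanly or we give up."""
    parts = []
    for _ in range(max_attempts):
        try:
            resp = openai.ChatCompletion.create(model=model, messages=messages)
        except openai.error.APIConnectionError:
            continue  # connection dropped; resend the entire context as-is
        chunk = resp.choices[0].message.content
        parts.append(chunk)
        if resp.choices[0].finish_reason != "length":
            break  # finished cleanly
        # Output was truncated: append what we got and ask for the rest.
        # Every loop iteration pays for the full context again.
        messages = messages + [
            {"role": "assistant", "content": chunk},
            {"role": "user", "content": "continue"},
        ]
    return "".join(parts)

def call_cost(prompt_tokens, completion_tokens):
    # A full 32K prompt is ~32_000 / 1000 * $0.06 ≈ $1.92 per send, so
    # ~5 resends of the whole context lands near the $10 figure above.
    return (prompt_tokens / 1000) * PRICE_PER_1K_PROMPT \
         + (completion_tokens / 1000) * PRICE_PER_1K_COMPLETION
```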