I'm unconvinced by the article's criticisms, given that they also rely on gut feeling and few citations.
> I appreciate that research has to be done on small models, but we know that reasoning is an emergent capability! (...) Even if you grant that what they’re measuring is reasoning, I am profoundly unconvinced that their results will generalize to a 1B, 10B or 100B model.
A fundamental part of applied research is simplifying a real-world phenomenon to better understand it. Dismissing the finding that, at this many parameters and on such a simple problem, the LLM can't perform out of distribution, just because the model isn't big enough, undermines the very value of independent research. Tomorrow another model with double the parameters may or may not show the same behavior, but that finding will be built on top of this one.
Also, how do _you_ know that reasoning is emergent, and not rationalising on top of a compressed version of the web stored in 100B parameters?
Feels like running a psychology experiment on fruit flies because it's cheaper, then extrapolating the results to humans because they're almost the same thing, only smaller.
I'm sorry, but the only hallucination here is the authors'. Does it really need to be said again that interesting results only happen when you scale up?
This whole effort would be interesting if they had run the experiment while scaling something up and plotted the results.
Autotools looks like a great idea: try to compile some small programs and see whether they work, to learn the specifics of your environment. Yet I share Julia's feeling that I hope never to have to learn how it works.
Nobody else knows how it works either, which is why every configure script checks for a working Fortran compiler. That is also why you can never be sure things like cross-compiling work, even though they can.
I use CMake - not a great system, but at least it isn't hard and it always works. I've heard good things about a few others as well. There is no reason to use autotools.
My configure scripts never checked for a FORTRAN compiler. I'm not going to claim that using autotools is a pleasant experience, but it's not that bad, and it's very well documented.
Here is an expert saying there is a problem and how it killed their research effort, and yet you say that things are the same as ever and nothing was killed.
1. I am not discrediting the expert in any way; if anything, I think their decision to quit is understandable - a challenge arose during their research that is not in their interest to pursue (information pollution is not research in corpus linguistics / NLP).
2. I never said that things are the same as ever; quite the opposite, actually. I am saying the world evolves constantly. It's naive to say company X/Y/Z killed something or made it unusable when there is constant, inevitable change. We should focus on how to move forward given this constraint, and not dwell on times when the web was so much 'cleaner', 'nicer', and more manageable.
This is probably the most liberating and insightful paper I've ever read on programming (close 2nd: "Teach Yourself Programming in 10 years"). So many things I already knew from experience, but had never seen addressed so clearly and _embraced_ at the same time.
"Broken" is a sliding scale, and it's infeasible to refuse to engage at all times.
If you are a multi-billion dollar company creating a new integration, you can demand that your small supplier provide an RFC-4180 compliant file, and even refuse to process it if its schema or encoding is not conformant.
If you are the small supplier of a multi-billion dollar company, you will absolutely process whatever it is that they send you. If it changes, you will even adapt your processes around it.
TFA proposes a nice format that is efficient to parse and in some ways better than CSV, though in other ways it is not. Use it if you can and it makes sense.
I agree up to a point. It is a kind of tug-o-war, and yes, the weight of each side plays an important role there.
Nevertheless, even in projects where my services talk to something bigger, I will at the very least ask, "Why can't it be RFC compliant? Is there a reason?" Without blowing my own horn too much, quite a few systems larger than mine have changed because someone asked that question.
I read a comment here some years ago from someone who had discovered ASCII field delimiters and was excited to use them. They then discovered that those characters are used in only three places: the ASCII spec, their own code, and the data from the first client where they tried to use this solution.
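For what it's worth, the appeal is real: ASCII reserves the unit separator (0x1F) and record separator (0x1E) precisely for delimiting fields and rows, so no quoting is needed as long as the payload never contains those bytes. A minimal sketch (the `encode`/`decode` names are my own, not from any library):

```python
# Sketch: tabular data delimited by the ASCII control characters
# reserved for exactly this purpose. Assumes field values never
# contain the separator bytes themselves.
US = "\x1f"  # unit separator: between fields
RS = "\x1e"  # record separator: between rows

def encode(rows):
    """Join fields with US and records with RS -- no quoting needed."""
    return RS.join(US.join(fields) for fields in rows)

def decode(blob):
    """Invert encode(): split records on RS, then fields on US."""
    return [record.split(US) for record in blob.split(RS)]

rows = [["name", "note"], ["Alice", 'likes, commas and "quotes"']]
assert decode(encode(rows)) == rows  # round-trips with no escaping
```

The catch, as the comment above illustrates, is that no mainstream tool produces or displays these characters, so the first real-world client breaks the scheme.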
Any file format needs a well-specified escape strategy, because every file format is binary and may contain binary data. CSV is kinda bad not only because, in practice, there's no consensus on escaping, but also because we don't communicate what the chosen escaping is!
I think a standard meta header would do wonders for interchangeability, without having to communicate the serialization format out-of-band.
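One hypothetical shape for such a header is a first line that declares the dialect in-band; the `#csv` tag and the key=value syntax below are invented for illustration, not any existing standard:

```python
# Sketch: parse a hypothetical in-band meta header of the form
#   #csv delimiter=; quote="
# followed by the data itself, so the reader needs no out-of-band info.
import csv
import io

def read_with_meta(text):
    first, _, body = text.partition("\n")
    # Turn 'delimiter=; quote="' into {"delimiter": ";", "quote": '"'}
    opts = dict(kv.split("=", 1) for kv in first.removeprefix("#csv ").split())
    reader = csv.reader(io.StringIO(body),
                        delimiter=opts.get("delimiter", ","),
                        quotechar=opts.get("quote", '"'))
    return list(reader)

sample = '#csv delimiter=; quote="\na;"b;c"\n'
print(read_with_meta(sample))  # -> [['a', 'b;c']]
```

The point is not this particular syntax but that the escaping and delimiter choices travel with the file instead of being tribal knowledge.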
To me it's wild that the problem was solved back in the early 1960s (and really, well before that) but everyone just ignored it because of reasons and now we're stuck with a sub-optimal solution.
God created electromagnetism as a way to transmit power between the fusion plant and its simulated planet, and is now delighted that we also use it to trade Pokémon back and forth.