With every big new model release, we see scores on benchmarks like ARC and Humanity's Last Exam climbing higher and higher. My question is: how do we know these benchmarks are not part of the training set used for these models? A model could easily have been trained to memorize the answers. Even if the datasets haven't been copy-pasted directly, I'm sure they've leaked onto the internet to some extent.
But I am looking forward to trying it out. I find Gemini to be great at handling large-context tasks, and Google's inference costs seem to be among the cheapest.
Even if the benchmarks themselves are kept secret, the process for creating them is not that difficult, and anyone with a small team of engineers could build a replica in their own lab to train their models on.
Given the nature of how those models work, you don't need exact replicas.
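As a rough sketch of why: decontamination filters typically look for exact n-gram overlap between training documents and benchmark items (the GPT-3 paper describes a 13-gram check along these lines), and a near-replica slips right past them. Hypothetical code, simple whitespace tokens:

    # Simplified n-gram overlap check of the sort used to
    # "decontaminate" training data against benchmarks.
    def ngrams(text: str, n: int = 13) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def overlaps(benchmark_item: str, training_doc: str) -> bool:
        return bool(ngrams(benchmark_item) & ngrams(training_doc))

    # A light paraphrase of a benchmark question shares no 13-grams
    # with the original, so it passes this filter untouched: a
    # near-replica leaks the signal without tripping exact matching.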
There's a checkbox in the settings page for whether you want to use it or not; does that not change these settings?
I don't feel opposed to them changing the browser in principle--certainly there have been many improvements to web browsers over the years. Is privacy the concern here?
If the checkbox you're referring to is the "Use AI to suggest tabs and a name for tab groups" one, then I can't see what setting it changes. It's not the browser.ml.enable flag. I tried unchecking it, restarting the browser, and that flag was unaffected. This is in version 144.0.2.
Searching for "AI" shows one other setting: "Quickly access bookmarks, tabs from your phone, AI chatbots, and more without leaving your main view." But apparently I'd already disabled that. Despite that, plenty of the flags mentioned in the article were still enabled.
Last I checked there wasn't, and you still had to fiddle with a few about:config options to actually turn off all the AI stuff. I would be fine with it if it were all on a settings page rather than in hidden settings.
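The prefs I remember fiddling with were roughly these (names may have changed between releases, and I'm less sure about the tab-grouping one, so treat this as a sketch rather than a complete list):

    browser.ml.enable                    false
    browser.ml.chat.enabled              false
    browser.tabs.groups.smart.enabled    false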
One thing I think this article overlooks is that Argentina was a superpower, at least before the Panama Canal was built. Before that, pretty much all shipping between the Atlantic and the Pacific had to go south around Argentina and Chile. Buenos Aires was one of the best stops along that route, and so it became one of the richest places on earth. After the Panama Canal was built, most of this traffic dropped off, and so did Argentina's fortunes. It's just so far away from everywhere that it has never been as geographically significant since.
Seems like Argentina was wealthy until the 1940s, but the Panama Canal was completed in 1914. I visited Buenos Aires twenty years ago and it reminded me of Paris: grand old architecture, big buildings, wide avenues. Something happened in the latter half of the 20th century that caused it to decline and stagnate. I always thought it was dictatorships, civil unrest, and hyperinflation, but maybe those are symptoms and not causes.
Militarily they were powerful, but they bought that power rather than building it (the UK was the primary supplier of their battleships during their arms races with Chile and Brazil), so it was a bit of a glass-hammer situation.
I have a theory: all these people reporting degrading model quality over time aren't actually seeing model quality deteriorate. What they are actually doing is discovering that these models aren't as powerful as they initially thought (i.e., expanding their sample size for judging how good the model is). The probabilistic nature of LLMs produces a lot of confused thinking about how good a model is; just because a model produces nine excellent responses doesn't mean the tenth won't be garbage.
They test specific prompts with temperature 0. It is of course possible that all their test prompts were lucky, but even then, shouldn't you see an immediate drop followed by a flat or increasing line?
Also, from what I understand from the article, it's not a difficult task but an easily machine-checkable one, i.e. whether the output conforms to a specific format.
If it was random luck, wouldn't you expect about half the answers to be better? At chance, the odds of getting all n questions wrong on a T/F test are (1/2)^n, which is vanishingly small for any test of meaningful length. Assuming the OP isn't lying, I don't think there's much room for luck.
With T=0 on the same model you should get the same exact output text. If they are not getting it, other environmental factors invalidate the test result.
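A minimal sketch of that kind of pinned check, assuming the OpenAI Python SDK (the model name and prompt are placeholders; note that even at T=0, serving-side batching and hardware can still introduce some noise):

    # Pinned regression check: same model snapshot, same prompt,
    # temperature 0, compared against a stored baseline.
    from openai import OpenAI

    client = OpenAI()

    def run_once(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-2024-08-06",  # pin a dated snapshot, not an alias
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            seed=42,  # best-effort determinism
        )
        return resp.choices[0].message.content

    # Any diff against the baseline means the model or the serving
    # stack changed underneath you.
    prompt = 'Return exactly the JSON {"ok": true}'
    baseline = run_once(prompt)
    assert run_once(prompt) == baseline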
TFA is about someone running the same test suite, with 0 temperature and fixed inputs and fixtures, on the same model for months on end.
What's missing is the actual evidence, which I would love to see, of course. But assuming they're not actively lying, this is not as subjective as you suggest.
Yes, exactly. My theory is that the novelty of a new generation of LLMs tends to inflate people's perception of the model, with a reversion to a better-calibrated expectation over time. If the developer reported numerical evaluations that drifted over time, I'd be more convinced of model change.
Your theory does not hold up for this specific article: they carefully explained that they send identical inputs to the model each time and observe progressively worse results, with other variables unchanged. (Though to be fair, others have noted they provided no replication details as to how they arrived at these results.)
I see your point, but no, it's getting objectively worse. I have a similar experience from casually using ChatGPT for various use cases: when 5 dropped, I noticed it was very fast but oddly got some details off. As time went on, it became slower and the output deteriorated.
But I use local models, sometimes the same ones for years already, and the consistency there is noteworthy, while I also have doubts about the quality consistency I get from closed models in the cloud. I don't see these kinds of complaints from people using local models, which undermines the idea that people were just wowed three months ago and are less impressed now.
So perhaps it's just a matter of transparency.
But I think there is constant fine-tuning occurring, alongside filters being added and removed in an opaque way in front of the model.
I've been reading a lot of (human-written) books lately, and one thing this has made abundantly clear to me is that AI writing just doesn't stack up. For one, AI writing is often completely wrong about the details. It also just tends to be bland and superficial. If you want a 5-minute summary of something, sure, it can do a passable job. But if I want something substantial and carefully thought out, I'll choose a book written by a human expert every time.
Maybe this will change at some point in the future, but for now there's no way I would substitute AI slop for a well-written book on a subject. These models are trained on human-written material anyway; why not just go straight to the source?
Certainly we are not perfect, but I think overall Canada has done more for the world to uphold human rights and freedoms than otherwise. When the government does act against "individual freedom", it is usually for the good of the larger society. For instance, because of firearm restrictions, Canadian citizens are (or used to be?) largely free from getting shot on the street. Is it a perfectly free society? No, but for the most part people here have it pretty good. I'd wager most of the immigrants moving here are much freer than they were in their home countries.
> When the government does act against "individual freedom", it is usually for the good of larger society.
This line of thinking can be used to justify anything. That's why it's important to protect the individual and their rights, even in the face of what a majority, which can be unjust, wants. And speech in particular is so fundamental to the idea of freedom that it should be almost absolutely protected. A constitutional guarantee of free speech and privacy is critical.
You're posting that reply on a message board that, strictly speaking, does not have free speech. If I started flaming you, my post would get removed pretty quickly. This forum is heavily moderated. Does that make it a better, or worse, place for discussion?
It's one among a large number of forums you can choose from, not a monopoly, with no restriction imposed by the government, which has a monopoly on violence and the ability to take away your time or money.
By that line of thinking, Canada is just one country you can choose to live in. Certainly people here have the choice to move to other countries that they think have more freedom. I have a hard time thinking of any other country that entirely fits that criterion at the moment.
You are arguing in bad faith. People as individuals, including Canadians, deserve freedom of speech without threat of fines or jail time. Most people can't just move to another country. They can, however, move to another website as an alternative to HN.
I have never once received threats of fines or jail time for my speech, nor have any of the Canadians I know. Are you aware that freedom of opinion and expression is very clearly spelled out in the Canadian Charter of Rights and Freedoms?
Canadians have Freedom of Expression, which is a stronger protection than Freedom of Speech, but the Canadian legal interpretation of that freedom allows for limits based on hate and obscenity, plus a few legal constraints that are common across most "Free Speech" jurisdictions (libel, defamation, etc.).
There are cases where people have been charged, fined, and even jailed for "expression", but those are largely limited to cases where people were promoting violence against specific groups (including hate speech, for example teaching Holocaust denialism or promoting antisemitic or racist ideologies that call for violence).
There are certainly cases of government overreach, but that is why we have courts, and in general the courts in Canada tend toward a broad interpretation of Freedom of Expression. Are there specific cases in Canada you can cite where people haven't enjoyed Freedom of Expression (including freedom of speech, which is protected under that broader umbrella)?
This is a nice "just so" explanation, but I don't think it tells the full story, or even most of it. Sure, tax policy probably has an impact, but so do interest rates, AI, tariffs, inflation, geopolitical turmoil, rampant speculation, hype cycles, etc. If this tax policy is so important, why didn't it prevent the dot-com crash? Why are tech industries outside the US seeing similar hiring downturns? It's a boom-and-bust industry, we're in the bust, and it seems unlikely that one bad tax policy is the culprit.
Interesting that there is no mention of how the training data for this was collected. It does sound quite a bit better than Meta's MusicGen, but then again, that model was also trained on only a small licensed dataset.