Hacker News | Turskarama's comments

I think ideally you need to practice both slow AND fast. You need to practice slow so you can notice and work on small details that can be skipped over with speed, and you need to practice fast because some things are legitimately different at speed and you won't learn how to deal with them only going slow.

As my guitar teacher used to say, “Slow is fast”. Mastering the techniques slowly and increasing speed until you’re “at speed” is the way to go.

Like riding a bike, you start slow with training wheels (or a helicopter parent) and work your way up to Yolo no-hander off that kicker ramp at 40 kph.


Ironically, training wheels are actually a bad way to teach a kid how to ride a bike. They teach kids bad habits (like turning the handlebars to steer rather than leaning).

Balance bikes are a better first step and are actually really fun compared to training wheels.


Shame you can't do this with something like juggling. :D

I suppose you can somewhat metaphorically replace speed with numbers there. In that juggling four balls is a lot like three, but faster. Getting the initial three going, though... Grrr.


You can practice with two balls, doing the same motions you would with three. And if you really want to focus, you can practice making a consistent toss with one ball. Two is probably better bang for the buck.

Right, I was just pushing the idea that you can't always literally slow things down. That said, no reason you couldn't pantomime juggling really slowly. To be honest, I wouldn't be surprised if that is a legit path to getting going?

> Shame you can't do this with something like juggling

You need not limit yourself to a single gravitational constant.

I look forward to video clips of Elon juggling on Mars!


You can also use light handkerchiefs that fall slower to the ground than balls, pins, or flaming chainsaws.

I forgot about the handkerchief trick to slow things down.

Funny. I can do 3 balls. I can do 3 clubs. I can do 3.

4? My brain revolts.


Even dumber, for me, is that I can easily juggle two in either hand. Try to do the same in both hands at the same time? Brain basically recoils in horror.

The metronome of music practice is the idea, here. You don't just do something slowly. You deliberately constrain yourself to a controlled speed and ratchet it up as you go.

The problem with publicly disclosing these is that if lots of people adopt them they will become targeted to be in the model and will no longer be a good benchmark.

Yeah, that's part of why I don't disclose.

Obviously, the fact that I've done Google searches and tested the models on these means that their systems may have picked up on them; I'm sure that Google uses its huge dataset of Google searches and its search index as inputs to its training, so Google has an advantage here. But, well, that might be why Google's new models are so much better: they're actually taking advantage of this massive dataset they've had for years.


This thought process is pretty baffling to me, and this is at least the second time I've encountered it on HN.

What's the value of a secret benchmark to anyone but the secret holder? Does your niche benchmark even influence which model you use for unrelated queries? If LLM authors care enough about your niche (they don't) and fake the response somehow, you will learn on the very next query that something is amiss. Now that query is your secret benchmark.

Even for niche topics it's rare that I need to provide more than 1 correction or knowledge update.


I have a bunch of private benchmarks I run against new models I'm evaluating.

The reason I don't disclose isn't generally that I think an individual person is going to read my post and update the model to include it. Instead it is because if I write "I ask the question X and expect Y" then that data ends up in the train corpus of new LLMs.

However, one set of my benchmarks is a more generalized type of test (think a parlor-game type thing) that actually works quite well. That set is the kind of thing that could be learnt via reinforcement learning very well, and just mentioning it could be enough for a training company or data provider company to try it. You can generate thousands of verifiable tests - potentially with verifiable reasoning traces - quite easily.


Ok, but then your "post" isn't scientific by definition, since it cannot be verified. "Post" is in quotes because I don't know what you're trying to do, but you're implying some sort of public discourse.

For fun: https://chatgpt.com/s/t_694361c12cec819185e9850d0cf0c629


I didn't see anyone claiming any 'science'? Did I miss something?

I guess there are two things I'm still stuck on:

1. What is the purpose of the benchmark?

2. What is the purpose of publicly discussing a benchmark's results but keeping the methodology secret?

To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.


1. The purpose of the benchmark is to choose what models I use for my own system(s). This is extremely common practice in AI - I think every company I've worked with doing LLM work in the last 2 years has done this in some form.

2. I discussed that up-thread, but https://github.com/microsoft/private-benchmarking and https://arxiv.org/abs/2403.00393 discuss some further motivation for this if you are interested.

> To me it's in the same spirit as claiming to have defeated alpha zero but refusing to share the game.

This is an odd way of looking at it. There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.


I see the potential value of private evaluations. They aren't scientific but you can certainly beat a "vibe test".

I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.

> There is no "winning" at benchmarks, it's simply that it is a better and more repeatable evaluation than the old "vibe test" that people did in 2024.

Then you must not be working in an environment where a better benchmark yields a competitive advantage.


> I don't understand the value of a public post discussing their results beyond maybe entertainment. We have to trust you implicitly and have no way to validate your claims.

In principle, we have ways: if nl's reports consistently predict how public benchmarks will turn out later, they can build up a reputation. Of course, that requires that we follow nl around for a while.


As ChatGPT said to you:

> A secret benchmark is: Useful for internal model selection

That's what I'm doing.


My question was "What's the value of a secret benchmark to anyone but the secret holder?"

The root of this whole discussion was a post about how Gemini 3 outperformed other models on some presumably informal question benchmark (a "vibe test"?). When asked for the benchmark, the response from the op and someone else was that secrecy was needed to protect the benchmark from contamination. I'm skeptical of the need in the op's cases and I'm skeptical of the effectiveness of the secrecy in general. In a case where secrecy has actual value, why even discuss the benchmark publicly at all?


The point is that it's a litmus test for how well the models do with niche knowledge _in general_. The point isn't really to know how well the model works for that specific niche. Ideally of course you would use a few of them and aggregate the results.

I actually think "concealing the question" is not only a good idea, but a rather general and powerful idea that should be much more widely deployed (but often won't be, for what I consider "emotional reasons").

Example: You are probably already aware that almost any metric that you try to use to measure code quality can be easily gamed. One possible strategy is to choose a weighted mixture of metrics and conceal the weights. The weights can even change over time. Is it perfect? No. But it's at least correlated with code quality -- and it's not trivially gameable, which puts it above most individual public metrics.
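As a toy sketch of that idea (every metric name and weight here is made up, and each metric is assumed to be normalized to [0, 1]):

```python
# Toy sketch of a concealed weighted metric. All metric names and
# weights are hypothetical; the weights live server-side and can drift.
def score(metrics: dict, weights: dict) -> float:
    """Weighted mixture of normalized metrics."""
    total = sum(weights.values())
    return sum(metrics[k] * weights[k] for k in weights) / total

# The concealed weights, known only to the evaluator.
weights = {"test_coverage": 0.5, "lint_score": 0.2, "review_rating": 0.3}

submission = {"test_coverage": 0.9, "lint_score": 0.6, "review_rating": 0.8}
print(round(score(submission, weights), 2))  # 0.81
```

Gaming any single metric moves the combined score less than it would alone, and since the weights can change over time, overfitting to last quarter's mixture doesn't pay off.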


It's hard to have any certainty around concealment unless you are only testing local LLMs. As a matter of principle I assume the input and output of any query I run in a remote LLM is permanently public information (same with search queries).

Will someone (or some system) see my query and think "we ought to improve this"? I have no idea since I don't work on these systems. In some instances involving random sampling... probably yes!

This is the second reason I find the idea of publicly discussing secret benchmarks silly.


I learned in another thread there is some work being done to avoid contamination of training data during evaluation of remote models using trusted execution environments (https://arxiv.org/pdf/2403.00393). It requires participation of the model owner.

Because it encompasses the very specific way I like to do things. It's not of use to the general public.

The problem with integrating a chat bot is that what you are effectively doing is the same thing as adding a single bookmark, except now it's taking up extra space. There IS no advantage here, it's unnecessary bloat.

The computer chips used for AI generate significantly more heat than the chips on the JWST. The JWST weighs 6.5 tons in total and uses a mere 2 kW of power, which is the same as three H100 GPUs under load, each of which will weigh what, 1 kg?

So in terms of power density you're looking at about 3 orders of magnitude difference. Heating and cooling is going to be a significant part of the total weight.
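A rough back-of-envelope check of that claim (the H100 power and mass figures below are approximate assumptions, not datasheet values):

```python
import math

# Back-of-envelope power density comparison.
jwst_power_w = 2_000   # ~2 kW total spacecraft power
jwst_mass_kg = 6_500   # ~6.5 tonnes

h100_power_w = 700     # one GPU under load, roughly
h100_mass_kg = 1       # order-of-magnitude guess for the module

jwst_wpkg = jwst_power_w / jwst_mass_kg   # ~0.3 W/kg
h100_wpkg = h100_power_w / h100_mass_kg   # ~700 W/kg

# Orders of magnitude between the two power densities:
print(round(math.log10(h100_wpkg / jwst_wpkg), 1))  # ~3.4
```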


They didn't actually fix this until a couple of months after they publicly revealed that this was the reason the game was so big, and a lot of people pointed out how dumb it was. I saw quite a few comments saying that people put it on their storage HDD specifically because it was too big to fit on their SSD. Ironic. They could have gotten their own data quite a bit earlier during development, not nearly two years after release!


It actually is somewhat a limit of the technology. LLMs can't go back and modify their own output; later tokens always depend on earlier tokens, and they can't do anything out of order. "Thinking" helps somewhat by allowing some iteration before they give the user actual output, but that requires them to write it the long way and THEN refactor it without being asked, which is both very expensive and something they have to recognize the user wants.


Coding agents can edit their own output, because their output is tool calls to read and write files: an agent can write a file, run some check on it, modify the file to try to make it pass, run the check again, and so on.
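A minimal sketch of that loop (everything here is hypothetical; `propose_fix` stands in for a model call that returns a revised file):

```python
import pathlib
import subprocess

def edit_until_passing(path, propose_fix, check_cmd, max_rounds=5):
    """Write-check-revise loop: run a check on the file, feed any failure
    back to the model (propose_fix), and write its revised version."""
    for _ in range(max_rounds):
        result = subprocess.run(check_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True  # the check passed
        source = pathlib.Path(path).read_text()
        pathlib.Path(path).write_text(propose_fix(source, result.stderr))
    return False  # gave up after max_rounds
```

The model never rewrites tokens it already emitted; it emits fresh tool calls that overwrite the file, which amounts to the same thing.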


There is one absolutely massive one, and that's that for the first time the problem is truly global. Other famines have been caused either by war or local droughts, both of which affect only a population in a limited area and crucially, can be somewhat mitigated by importing food from elsewhere. You can't import food if there are global food shortages.


It's the same issue, if you have a higher voltage then you can get more power without increasing current.

For example in Australia a standard house circuit is 10 Amps, but because it's at 240V we can get 2400 Watts (realistically more like 2300) out of a _standard_ wall outlet that is in every room of your house.
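The arithmetic behind that is just P = V × I:

```python
# Power available at an outlet: P = V * I
def outlet_watts(volts, amps):
    return volts * amps

print(outlet_watts(120, 15))  # US 15 A outlet: 1800 W
print(outlet_watts(240, 10))  # Australian 10 A outlet: 2400 W
```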


It's not the same issue. The vast majority of kitchens in the US have 20 amp circuits (so 2,400 watts peak, 1,920 watts continuous) exclusively. It's a bog standard receptacle (NEMA 5-20R instead of 5-15R) that's backwards compatible with 15 amp plugs. In fact these days most 5-15R receptacles have identical guts to their 20 amp counterparts save for the additional provision for a horizontal blade.

The electrical code (NEC) has started moving towards requiring 20 amp circuits in other rooms and more 20 amp circuits in kitchens.


But they're staying shy of the amp limit on purpose. So designing for 20 amps would be somewhat of a boost but not enough. While doubling voltage would actually fix the problem.


You're going to stay below the circuit breaker rating no matter the voltage. Nobody's going to put a 2,400 watt heater in a dishwasher designed to be used on a circuit that tops out at 2,400 watts because:

a.) I'm going to go out on a limb and suggest that most countries will place limits similar to the NEC's 80% rule.

b.) There are other high current draw devices in a dishwasher that will have to run concurrently like the water pumps.

Same with things like electric kettles. You're not going to find 1,800 watt kettles in the US even though they're designed for circuits rated at that. A quick peek at the kettles available in Australia shows that most top out at 2,200 watts for the same reasons.

In the context of a dishwasher 240V would only get you more powerful heaters than you could run in the US if the circuits were rated at more than 10 amps. Voltage isn't the issue.


You know what, I didn't read the middle comment in this thread closely enough before my first reply. You're right that an Australian circuit doesn't help much, and the voltage on such a circuit is useless.

A UK circuit on the other hand would fix everything. It has the same number of amps (or maybe more), but double the voltage.

The problem isn't purely amps or volts, but in general home circuits tend to have a similar number of amps, and higher power usually goes hand in hand with higher voltage. That's the sense in which voltage fixes the problem. A US appliance staying well within amp limits has a lot less power than a UK appliance staying well within amp limits.


Meanwhile, here in Germany, we have 230V, but every standard wall outlet is rated for 16A continuous load over 1 hour so you can get 3.6 kW on each circuit.

Your standard home has a supply of 3 phase power @ 35A (southern Germany) or 63A (northern and western Germany), I think only the former GDR is at standard 3x25A, because like in many former Communist countries they had to save on expensive copper and aluminium, and since a lot of the GDR was heated by steam-based central district heating systems, you didn't need that much power anyway.


Lots of old homes and flats here are limited to 5 A or 3 A at 220 V. If you don't use electric heating, your power demands go down substantially, though 3 A is a bit small these days.


Everything? No. But routine stuff will NEVER be denied. If your doctor thinks you need a scan, you're getting the scan. I have quite literally NEVER heard of someone in my country (Australia) going bankrupt from medical bills. It can happen but the rate is so low it's not something anyone ever worries about happening to them.


Routine stuff is never denied in the US either. I've never had one thing denied ever, and I even have a weird condition that requires expensive testing to diagnose and even more expensive treatment (narcolepsy). The insurance companies will throw up annoying bureaucracy like prior authorizations, and they made me switch medication to the generic when it came out (reasonable) and then back from the generic to another brand name when it came out (WTF??), but never an actual denial.


Never denied eh? Interesting.

I had an MRI denied for a partial pectoral rupture. Which was a routine diagnostic as a precursor to open shoulder surgery to determine the extent and location of the rupture to figure out if surgery was absolutely necessary and to prep a viable surgical plan.

I had to fight the insurance company with the assistance of both my surgical and non-surgical sports medicine doctors.

The good news though appears to be that I imagined the entire thing, because denials for routine things never happen.


Gemini states that an ultrasound is just as good for diagnosis as an MRI and is much cheaper. Could it be that the denial was correct and the mistake was from your doctor ordering a less cost effective diagnostic test, not from the insurance company?

And from my personal experience with narcolepsy, AI is a much better doctor than most human doctors.


Odd that your experience would be so different from mine. I routinely experience denials.

To give an example, about 60 to 80% of the time, when I visit the dentist for a regular cleaning the charge is denied and I have to submit additional paperwork to convince them to pay it. I can't think of any more simple and basic procedure than that.

I have no idea why your experience with healthcare in the US is so much better, but I can assure you that there are many people whose experience is more like mine.


Dental is an entirely different system: dental insurance isn't actually insurance.


I am in 100% agreement with you about dental insurance, but that's a completely separate system.


Let me blow your mind for a second: this problem is not insurmountable in language design, C is not a perfect language, and nulls are bad design.

