Hacker News | tgtweak's comments

Was honestly thinking "yeah nice Google, now do it for Android" since the worst offenders are apps (looking at you, TikTok)

I actually don't think this is true -

Traditional processors, even highly dedicated units like the SFUs in GPUs, still require substantial preconfiguration to switch between sin/cos/exp2/log2 function calls, whereas a silicon implementation of an 8-layer EML machine could do that by passing a single config byte along with the inputs. If you had a 512-wide pipeline of EML logic blocks in modern silicon (say 5nm), you could get around 1 trillion elementary function evaluations per second on a 2.5 GHz chip. Compare this with a 96-core Zen 5 server CPU with AVX-512, which can do about 50-100 billion scalar-equivalent evaluations per second across all cores, and only for one specific unchanging function.
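As a back-of-envelope check on that throughput figure (assuming a fully pipelined array that retires one result per lane per cycle, which is the best case):

```python
# Purely illustrative arithmetic, not a hardware model: a 512-wide
# EML array retiring one result per lane per cycle at 2.5 GHz.
lanes = 512
clock_hz = 2.5e9
evals_per_sec = lanes * clock_hz
print(f"{evals_per_sec:.2e} evaluations/s")  # 1.28e+12, i.e. ~1 trillion/s
```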

Take the fastest current math processors, the SFUs on a modern GPU: they can calculate sin OR cos OR exp2 OR log2 in 1 cycle per shader unit... but that is ONLY for those elementary functions and ONLY if they don't change - changing the function being called incurs a huge cycle hit, and chaining the calculations also incurs latency hits. An EML coprocessor could do arcsinh(x² + ln(y)) in the same hardware block, with the same latency as a modern CPU doing a single FMA instruction.


So to clarify, you think that replacing every multiplication with 24 transcendental function evaluations (12 eml(x, y), each of which evaluates exp(x) and ln(y) plus the subtraction; see the paper's Fig 2) is somehow a win?
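A sketch of the overhead being described, assuming (per this comment's reading of the paper, not the exact Fig 2 circuit) that eml(x, y) amounts to exp(x) minus ln(y):

```python
import math

# Assumed form of the primitive, per the description above
# (exp(x) and ln(y) plus a subtraction); not the paper's exact circuit.
def eml(x: float, y: float) -> float:
    return math.exp(x) - math.log(y)

# For contrast: rebuilding one ordinary multiply out of exp/ln
# (positive operands only) already costs three transcendental
# evaluations where conventional hardware needs zero.
def mul_via_logs(a: float, b: float) -> float:
    return math.exp(math.log(a) + math.log(b))
```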

The fact that addition, subtraction, and multiplication run quickly on typical processors isn't arbitrary: those operations map well onto hardware, for roughly the same reasons that elementary school students can easily hand-calculate them. General transcendental functions are fundamentally more expensive in time, die area, and/or power, for the same reasons that elementary school students can't easily hand-calculate them. A primitive where all arithmetic (including addition, subtraction, or negation) involves multiple transcendental function evaluations is not computationally faster, lower-power, lower-area, or better in any other practical way.

The comments here are filled with people who seem to be unaware of this, and it's pretty weird. Do CS programs not teach computer arithmetic anymore?


For basic arithmetic, this is not required, nor would it be faster - it's not likely advantageous for bulk static transcendental functions either. Where this becomes interesting is when combining or chaining them, where today the intermediate results must come back out to the main processor for reconfiguration and then be re-issued.

In practical terms, take the Jacobian (heavily used in weather and combustion simulation): the transcendental calls, mostly exp(-E_a/RT), are the actual clock-cycle bottleneck. The GPU's SFU computes one exp2 at a time per SM. The ALU then has to convert it (exp(x) = exp2(x × log2(e))), multiply by the pre-exponential factor, and accumulate partial derivatives. It's a long serial chain for each reaction rate.
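The exp-to-exp2 conversion mentioned above, as a quick sketch (LOG2E is just log2(e)):

```python
import math

LOG2E = math.log2(math.e)

# What the comment describes: the SFU natively evaluates exp2, so
# exp(x) gets issued as exp2(x * log2(e)) plus an ALU multiply.
def exp_via_exp2(x: float) -> float:
    return 2.0 ** (x * LOG2E)
```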

The core of this is the Arrhenius rate, (A × T^n × exp(-E_a/(R×T))), which involves an exponentiation, a division, a multiplication, and an exponential. On a GPU, that's multiple SFU calls chained with ALU ops. In an EML tree, the whole expression compiles to a single tree that flows through the pipeline in one pass.
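For reference, the rate expression exactly as decomposed above, with R as the gas constant (parameter values are up to the caller; this is just the structure):

```python
import math

R = 8.314  # J/(mol*K), universal gas constant

# The Arrhenius rate as decomposed above: one power, one division,
# one multiply, one exponential.
def arrhenius(A: float, n: float, E_a: float, T: float) -> float:
    return A * T**n * math.exp(-E_a / (R * T))
```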

GPU (PreJacGPU) is currently the state of the art for speed on these simulations - a moderate-width, 8-depth EML machine could process a very complex Jacobian as fast as the GPU can evaluate one exp(). Even on a sub-optimal 250 MHz FPGA, an entire 50x50 Jacobian would take about 3.5 microseconds vs 50 microseconds PER Jacobian on an A100.

If you put that same logic path into an ASIC, you'd be at about 20x the FPGA's speed - in the nanoseconds per round. And it's not like you're building one function into an ASIC; it's general purpose. You just feed it a compiled tree configuration and run your data through it.

For anything like linear algebra, which is also used here, you'd delegate to the dedicated math units on the processor - it wouldn't make sense to do those operations in this.


> The core of this is the Arrhenius rate, (A × T^n × exp(-E_a/(R×T))), which involves an exponentiation, a division, a multiplication, and an exponential. On a GPU, that's multiple SFU calls chained with ALU ops. In an EML tree, the whole expression compiles to a single tree that flows through the pipeline in one pass.

I think you're missing the reason why the GPU kicks you out of the fast path when you need that special function. The special function evaluation is fundamentally more expensive in energy, whether that cost is paid in area or time. Evaluation of the special functions with throughput similar to the arithmetic throughput would require much more area for the special functions, which for most computation isn't a good tradeoff. That's why the GPU's designers chose to make your exp2 slow.

Replacing everything with dozens of cascaded special functions makes everything uniform, but it's uniformly much worse. I feel like you're assuming that by parallelizing your "EML tree" in dedicated hardware that problem goes away; but area isn't free in either dollars or power, so it doesn't.


There is a huge market for "it's faster" at the cost of efficiency, but I don't buy your claim that an EML hardware block would be inherently less efficient than the same workload running on a GPU. If you think it would be, back it up with some numbers.

A 10-stage EML pipeline would be about the size of an AVX-512 block on a modern CPU - in the realm of ~0.1 mm² on a 5nm process node (including the FMA units behind it), in its entirety about 1% of the CPU die. None of this suggests that even a ~500-wide, 10-stage EML pipeline would consume anywhere near the power of a modern datacenter GPU (which wastes a lot of its energy moving data from memory to ALU to shader core...).

Not sure if you're arguing from a hypothetical position or a practical one, but you seem to be narrowing your argument to "well, for simple math it's less efficient" - and that's not the argument being made at all.


> you seem to be narrowing your argument to "well for simple math it's less efficient" but that's not the argument being made at all.

What? Unless the thing you want to compute happens to be exactly that eml() function (no multiplication, no addition, no subtraction unless it's an exponential minus a log, etc.) or almost so, it is unquestionably less efficient. If you believe otherwise, then please provide the eml() implementation of a practically useful function of your choice (e.g. that Arrhenius rate). Then we can count the superfluous transcendental function evaluations vs. a conventional implementation, and try to understand what benefit could outweigh them.

> A 10-stage EML pipeline would be about the size of an avx-512 instruction block on a modern CPU

Can you explain where you got that conclusion? And what do you think a "10-stage EML pipeline" would be useful for? Remember that the multiply embedded in your Arrhenius rate is already 8 layers and 12 operations.

Also, can you confirm whether you're working with an LLM here? You're making a lot of unsupported and oddly specific claims that don't make sense to me, and I'm trying to understand where they're coming from.


This could have some interesting hardware implications as well - it suggests that a large dedicated silicon instruction set could accelerate any mathematical algorithm provided it can be mapped to this primitive. It also suggests a compiler/translation layer should be possible as well as some novel visualization methods for functions and methods.

I'm not too familiar with the hardware world, but does EML look like the kind of computation that's hardware-friendly? Would love for someone with more expertise to chime in here.

Yes, actually - it is very regular, which usually lends itself to silicon implementation; the paper even talks about this briefly.

I think the bigger question is whether it will be more energy-optimal or silicon density-optimal than math libraries that are currently baked into these processors (FPUs).

There are also some edge cases ("exp(exp(x))" and infinities) that seem to result in something akin to division by zero, where you need more than standard floating-point representations to compute - but these seem like compiler workarounds rather than silicon issues.
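A quick illustration of the exp(exp(x)) range problem in ordinary doubles (the overflow threshold for exp on binary64 is about 709.78):

```python
import math

# exp(exp(x)) escapes IEEE-754 binary64 range for quite modest x:
# exp(7) is roughly 1096.6, and exp(1096.6) overflows a double.
x = 7.0
inner = math.exp(x)
try:
    outer = math.exp(inner)
except OverflowError:  # CPython raises rather than returning inf here
    outer = math.inf
```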


A similar function operating on the real domain for powers and logs of 2 would be extremely hardware friendly. You can build it directly out of the floating point format. First K significand bits index a LUT. Do that for each argument and subtract them.
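A minimal sketch of that construction in Python, assuming binary64 inputs and an illustrative K of 8 (real hardware would pick K and the LUT contents differently, and likely interpolate):

```python
import math
import struct

K = 8  # top significand bits used as the LUT index (illustrative choice)
# LUT entry i approximates log2(mantissa) at the midpoint of its interval
LUT = [math.log2(1.0 + (i + 0.5) / (1 << K)) for i in range(1 << K)]

def approx_log2(x: float) -> float:
    # Pull the fields straight out of the IEEE-754 binary64 bit pattern
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    exponent = ((bits >> 52) & 0x7FF) - 1023     # unbiased exponent
    idx = (bits >> (52 - K)) & ((1 << K) - 1)    # top K significand bits
    return exponent + LUT[idx]

# "Do that for each argument and subtract them": approximates log2(x/y)
def log2_ratio(x: float, y: float) -> float:
    return approx_log2(x) - approx_log2(y)
```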

It gets a bit more difficult for the complex domain because you need rotation.


This paper seems to suggest that a chip with 10 pipeline stages of EML units could evaluate any elementary function (table 4) in a single pass.

I'm curious how this would compare to the dedicated SSE or XMX instructions currently inside most processors' instruction sets.

Lastly, you could also create a 5-depth or 6-depth EML tree in hardware (FPGA most likely) and use it in lieu of the Rust implementation to discover weight-optimal EML formulas for input functions much more quickly; those could then feed into a "compiler" that would allow it to run on a similar-scale interpreter on the same silicon.

In simple terms: you can imagine an EML coprocessor sitting alongside a CPU's standard math units. XMX, SSE, and AMX would do the multiplication/tile math they're optimized for, then call the EML coprocessor for exp/sin/log, which it processes by reconfiguring its EML trees internally to handle them at single-cycle speed instead of relaying them back to the main CPU to do that math with generalized instructions - likely something that takes many cycles.


You could also make an analog EML circuit in theory, using electrical primitives that have been around since the '60s. You could build a simple EML evaluator on a breadboard. Things like trig functions would be hard to reproduce, but you could technically evaluate output in electrical realtime (the time it takes the electrical signal to travel through these 8-10 analog amplifier stages).

Has Back not produced any C++ code from his thesis or his days in university? That would be more useful for Satoshi-profiling than his written prose, I would think.

Also, whether he primarily used Windows, as Satoshi seems to have done - Back feels like more of a GNU/Linux kind of guy, but I could be wrong.

Share the harness for that browser linux OS task :)

Adless monetization is a very difficult challenge - something I've worked on many times over the years (compute monetization, shopping-commission monetization, payment-interchange monetization) and it's always been very difficult to compete with ads. I think the Waterfox approach of opt-in "ads if you're OK with it" with sane defaults is the better one, but it's very difficult to make ends meet compared to competitors with full monetization on by default, when you're only getting revenue (and less per search) from users who opt in.

Compute/resource monetization is the one that, after all these years, has done the best at replacing ads as a means of monetizing users, and it requires a very intelligent scheduling system plus an ethical ecosystem to work (most have just tried running crypto miners that cost users more in electricity than they earn).


Also, the fact that Waterfox has almost zero telemetry out of the box - and as a result doesn't even have a bona fide number for active users - is a good hint at how sincerely the "anti-tracking" ethos is applied. Most open source applications have some form of telemetry, even "anonymous" telemetry, so this really stands out.


"Save time by changing your default browser to edge and enabling onedrive"

"just tips bro"


I know about four friends who have left their parent company, built a killer product that same company didn't think to build or didn't believe in, only to get acquired by that same company after a few years... some have done it multiple times.

I think this falls into exactly that situation. You see how jankily these national companies are doing things, plot a disruptive course, then disrupt them in a particular region so you can extrapolate how much that will hurt at national scale and force a buyout way beyond the multiple you paid for those small operators.


How did they deal with non-competes? Are they in CA or somewhere those aren't enforceable?


Sounds like they didn't build competing products, they built products that'd have been very valuable to the parent company.


Ask for forgiveness, not permission!


codex is basically a ripgrep wrapper at this point :)


Ludes in this version... definitely 1984


Just the word ludes took me straight back to playing this in high school!

What a throwback


I played some Web version in the late '90s or very early '00s, which is my main experience with the game, and it had 'ludes.

