GistNoesis's comments | Hacker News

I like the game up until 7 or 8 notes, but it keeps adding notes.

I couldn't find a setting to freeze the difficulty where it's comfortable and where the melody can still be construed to make sense.

When it adds more notes, it breaks the flow and turns pitch-training practice into a memory game for Rain Man, even more so when you make a mistake and must partially redo the melody.


Hmm - I was hoping that the Practice Mode would cover this, but I think there's probably some room to add an option where you can freeze the difficulty level - I'll see if I can add this later in the evening.

The "construed melody" is a harder problem. I've been playing around with the idea of using a markov model or even borrowing what CPU Bach did to try to create more coherent melodies over time.

Thanks for the feedback!


Transformers are nice. You can train a very minimal network that can output reasonable sequences very easily. They won't be high quality, or too pretty, but they will "make sense" way more than randomness and (usually) change keys in a coherent way.


Hey viraptor! I haven't even thought about using transformers but that sounds like a great idea. The current generator is just a standard random walk across the major/minor intervals and could definitely use some TLC!


Have you checked his physical keyboard?

My laptop is getting old, and some keys need to be pressed with more insistence and more accuracy for them to register properly. It breaks the flow and the muscle memory for things like passwords. It also leads to letter inversions, because the tonic accent needs to be put on the letters that need to be pressed harder, rather than on the first letter of the word. It's driving me crazy, but unfortunately computers are too expensive for now (and it's probably only getting worse).



Lasers in space are fun! We[1] are actually doing this for real, but automated and inverted -- launching a satellite with a laser to beam data down to Earth. Like these searchlights, but from orbit!

[1] A bunch of students at https://satlab.agh.edu.pl


If it's not done properly, and you happen at any point in the chain to put black blocks on a compressed image (and PDFs do compress internal images), you are leaking some bits of information in the shadow cast by the compression algorithm. (Self-plug: https://github.com/unrealwill/jpguncrop )
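
A toy illustration of the leak (my own setup, not the repo's algorithm; Pillow with 4:2:0 chroma subsampling assumed): the pixels just outside the redaction shared DCT blocks and chroma blocks with the hidden pixels when the image was first compressed, so they still carry a faint trace of them.

    from PIL import Image
    import numpy as np

    # Neutral gray next to a "secret" red region, compressed once BEFORE redaction
    img = np.full((16, 16, 3), 128, dtype=np.uint8)
    img[:, 7:] = [255, 0, 0]                      # the part we will later black out
    Image.fromarray(img).save("orig.jpg", quality=75, subsampling=2)   # 4:2:0

    dec = np.asarray(Image.open("orig.jpg")).copy()
    dec[:, 7:] = 0                                # redact after compression
    # Column 6 shared an 8x8 DCT block and a 2x2 chroma block with the secret pixels,
    # so its decoded value is pulled toward red instead of staying neutral gray.
    print(dec[0, 6])                              # no longer (128, 128, 128)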


And that's just in the non-adversarial simple case.

If you don't know the provenance of the images you are putting black boxes on (for example because a rogue employee intentionally wants to leak them, or because the image sensor of your target had been compromised by another team to leak some info), your redaction can be rendered ineffective, as some images can be made uncroppable by construction.

(Self-plug: https://github.com/unrealwill/uncroppable )

And also be aware that compression is hiding everywhere: https://en.wikipedia.org/wiki/Compressed_sensing


>Let's crop it anyway

That is not cropping.

https://en.wikipedia.org/wiki/Cropping_(image)

>Cropping is the removal of unwanted _outer_ areas from a photographic or illustrated image.


Please forgive my outside-the-box use of the word.

I used it at the time as a reference to the "PNG aCropalypse" ( https://news.ycombinator.com/item?id=35208721 where I originally shared it in a comment).

The algorithm also works if you remove the outer areas of the photo.


Right, using stenography to encode some parity bits into an image so that lost information can be reconstructed seems like an obvious approach - all sorts of approaches you could use, akin to FEC. Haven't looked at your site yet, will be interested to see what you've built :)

Edit: I checked it out, nice, I like the lower res stenography approach, can work very nicely with good upscaling filters - gave it a star :)
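
A toy illustration of the FEC idea with a single XOR parity block (nothing image-specific; same principle as RAID-5): any one missing block can be rebuilt from the survivors.

    import numpy as np

    blocks = [np.random.randint(0, 256, 16, dtype=np.uint8) for _ in range(4)]
    parity = np.bitwise_xor.reduce(blocks)            # stash this, e.g. steganographically

    lost = blocks[2]                                  # pretend this block was cropped away
    survivors = blocks[:2] + blocks[3:]
    rebuilt = np.bitwise_xor.reduce(survivors + [parity])
    assert np.array_equal(rebuilt, lost)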


steganography — stenography is courtroom transcription


People protect their secrets from stenographers with steganography.


Somewhat related, I once sent a FOI request to a government agency that decided the most secure way to redact documents was to print them, use a permanent marker, and then scan them. Unfortunately they used dye based markers over laser print, so simply throwing the document into Photoshop and turning up the contrast made it readable.


I remember noticing that a teacher in high school had used white-out to hide the marks for the correct multiple-choice answers on final exam practice questions before copying them. Then she literally cut-and-pasted questions from the practice questions for the final. I did a mediocre job on the essay, but got the highest score in the class on the multiple-choice questions, because I could see little black dots where the white-out was used.


I thought I understood what was going on, but then I came to the image showing the diff, and I don't understand at all how that diff can unredact anything.


It's not that you can unredact them from scratch (you could never get the blue circle back from this software). It's that you can tell which of the redacted images is which of the origin images. Investigative teams often find themselves in a situation where they have all four images, but need to work out which redacted files correspond to which of the origins. Take, for example, a case where headed paper is otherwise entirely redacted.

So with this technique, you can definitively say "Redacted-file-A is definitely a redacted version of Origin-file-A". Super useful for identifying forgeries in a stack of otherwise legitimate files.

Also good for saying "the date on origin-file-B is 1993, and the file you've presented as evidence is provably origin-file-B, so you definitely knew of [whatever event] in 1993".


Ok thanks. That sounds reasonable.

>... and therefore you can unredact them

from that readme is just not true then I guess?


I mean, even the "crop" isn't used at all correctly, is it?

I think the word should be "redact".


I'm trying to understand this because it sounds fascinating, but I don't get it. I don't have an advanced understanding of compression, so that might be part of why.

If you compare an image to another image, you could guess by compression what is under the blocked part; that makes some sense to me conceptually. What I don't get is, for the PDF specifically, why does compressing the black boxes I put have any risk? Is it compressing the internal image which is just the black box part? Or are you saying the whole screenshot is an internal image?


The problem of computers is the problem of time: how to obtain a consistent causal chain!

The classical naive way of obtaining a consistent causal chain is to put the links one after the other, following the order defined by the simulation time.

The more interesting question is: can it be done another way? With the advance of generative AI and things like diffusion models, it's proven theoretically possible (universal distribution approximation). It's not so much simulating a timeline as sampling the whole timeline while enforcing its physics-law self-consistency from both directions of the causal graph.

In toy models like the Game of Life, we can even have recursive simulation ( https://news.ycombinator.com/item?id=33978978 ), unlike section 7.3 of this paper, where the computers of the lower simulations are started in ordered time.

In other toy models, you can use a diffusion model to learn and map the chaotic distribution of all possible three-body-problem trajectories.

Although sampling can be simulated, doing it efficiently requires exploring all the possible universes simultaneously, like in QM (which we can do by exploring only a finite number of them while bounding the neighboring-universe region according to the question we are trying to answer, using the Lipschitz continuity property).

Sampling allows you to bound the maximal computational usage and be sure to reach your end-time target, but at the risk of not being perfectly physically consistent. Simulating, on the other hand, carries the risk of the lower simulations siphoning off the computational resources and preventing the simulation time from reaching its end-time target, but what you do manage to compute is guaranteed consistent.

Sampled bottled universes are ideal for answering questions like how many years a universe must have before life can emerge, while simulated bottled universes are like a box of chocolates: you never know what you are going to get.

The question being: can you tell which bottle you are currently in, and which bottle would you rather get?


Causality also is not a universal thing. Some things just coexist and obey some laws.

Does the potential cause current? No, they coexist.


I’m not sure Einstein would allow your concept of “simulation time”. Events are only partially ordered.


What they need is not so much memory as memory bandwidth.

For training, their models need a certain amount of memory to store the parameters, and this memory is touched for every example of every iteration. Big models have 10^12 (>1T) parameters, and with typical values of 10^3 examples per batch and 10^6 iterations, they need ~10^21 memory accesses per run. And they want to do multiple runs.

DDR5 RAM bandwidth is 100GB/s = 10^11 B/s, graphics RAM (HBM) is 1TB/s = 10^12 B/s. By buying the wafers, they get to choose which type of memory they get.

10^21 / 10^12 = 10^9 s = ~30 years of memory access (just to update the model weights; you also need to add a factor of 10^1-10^3 to account for the memory accesses needed for the model computation).

But the good news is that it parallelizes extremely well. If you replicate your 1T parameters 10^3 times, your run time is brought down to 10^6 s = ~12 days. But you then need 10^3 * 10^12 = 10^15 bytes of RAM per run for the weight updates and 10^18 for the computation (your 120 billion gigabytes is 10^20, so not so far off).
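
The same back-of-the-envelope arithmetic in code (all constants are the rough orders of magnitude from above, not measured figures):

    params     = 1e12                             # ~1T parameters
    batch      = 1e3                              # examples per batch
    iterations = 1e6
    accesses   = params * batch * iterations      # ~1e21 memory touches per run

    hbm_bw   = 1e12                               # ~1 TB/s per device
    seconds  = accesses / hbm_bw                  # ~1e9 s on a single device
    replicas = 1e3
    print(seconds / 3.15e7, "years single-device;",
          seconds / replicas / 86400, "days across", int(replicas), "replicas")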

Are all these memory accesses technically required? No, if you use other algorithms, but more compute and memory is better if money is not a problem.

Is it strategically good to deprive your competitors of access to memory? In a very short-sighted way, yes.

It's a textbook cornering of the computing market to prevent the emergence of local models, because customers won't be able to buy the minimal RAM necessary to run the models locally, even just for the inference part (not the training). Basically a war on people, where little Timmy won't be able to get a RAM stick to play computer games at Xmas.


Thanks - but this seems like fairly extreme speculation.

> if money is not a problem.

Money is a problem, even for them.


In the video, in the continuous version, the game never ends and highlights the "loser" strategy.

When you are behind, the optimal play is to make a gamble, which most likely will leave you even worse off. From the naive winning side, it seems the loser is just following a stupid strategy of not using the optimal dichotomy, and that's why they are losing. But in fact they are a "player" doing not only their best, but the best that can be done.

The infinite sum of ever-smaller probabilities, like in Zeno's paradox, converges towards a finite value. The inevitable outcome is that, a large fraction of the time, you are playing catch-up and will never escape.

You are losing while playing optimally, but slowly realising the probability that you are a loser, as evidenced by the score, which will most likely go down even more next round. Most likely the entire future is an endless sequence of more and more desperate-looking losing bets, just hoping to strike it big once, which will most likely never happen.

In economics such things are called "traps"; the poverty trap, for example, exhibits similar mechanics, where even though you display incredible ingenuity by playing the optimal strategy, most of the time you will never escape, and you will need to take even more desperate measures in the future. That's separating the wheat from the chaff, from the chaff's perspective, or how you make good villains: because, like Bane in Batman, there are some times (the probability is slim but finite) where the gamble pays off and you escape the hellhole you were born in and become legend.

If you don't play this optimal strategy, you will lose more slowly but even more surely. The optimal strategy is to bet just enough to go from your current situation to the winning side. It's also important not to overshoot: it's not about always taking moonshots, but betting just enough to escape the hole, because once out, the probabilities play in your favor.
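
A toy Monte Carlo sketch of that "bet just enough" point, using a generic sub-fair coin-flip game rather than the game from the video (all numbers made up): with a losing edge, timid unit bets almost never reach the target, while bold bets sized to just reach it keep a small but real chance.

    import random

    def run(policy, p=0.45, bankroll=20, target=100, trials=10000):
        """Monte Carlo estimate of reaching `target` before ruin for a betting policy."""
        wins = 0
        for _ in range(trials):
            x = bankroll
            while 0 < x < target:
                stake = min(policy(x, target), x)      # can't bet more than you have
                x += stake if random.random() < p else -stake
            wins += (x >= target)
        return wins / trials

    timid = lambda x, target: 1                        # always bet one unit
    bold  = lambda x, target: min(x, target - x)       # bet just enough to reach the winning side

    print("timid:", run(timid))   # ~0.0 for a sub-fair game
    print("bold :", run(bold))    # ~0.15, orders of magnitude better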


TLDR: 1B particles ~ 3s per iteration

For examples like particle simulations, on a single node with a 4090 GPU, with everything running on the GPU and no memory transfers to the CPU:

- The main bottleneck is memory usage: 24GB available; storing each particle's 3 position coordinates + 3 velocity coordinates at 4 bytes per number (float32) = max ~1B particles.

- Then GPU memory bandwidth: if everything stays on the GPU, you get between 1000GB/s for global memory access and 10000GB/s when the shared memory caches are hit. The number of memory accesses is roughly proportional to the number of effective collisions between your particles, which is proportional to the number of particles, so around 12-30 times that (see the optimal sphere-packing number of neighbors in 3D, and multiply by your overlap factor). All in all, for 1B particles you can collide them all and move them in 1 to 10s.

If you have to transfer things to the CPU, you become limited by the PCI-express 4.0 bandwidth of 16GB/s. So you can at most move 1B particles to and from the GPU, 0.7 times per second.

Then if you want to store the particles on disk instead of RAM, because your system is bigger, you can either use an M.2 SSD (but you will burn it out quickly), which has a theoretical bandwidth of 20GB/s so it's not a bottleneck, or use network storage over 100Gb/s (= 12.5GB/s) Ethernet, via two interfaces, to your parameter server, which can be as big as you can afford.

So to summarize so far: 1B particles take 1 to 10s per iteration per GPU. If you want to do smarter integration schemes like RK4, divide that throughput by 6. If you need 64-bit precision, divide by 2. If you only need 16-bit precision, you can multiply by 2.
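
Rough back-of-the-envelope version of the numbers above (the constants are the assumed orders of magnitude from this comment, not measurements):

    bytes_per_particle = 6 * 4                        # 3 pos + 3 vel, float32
    vram = 24e9                                       # 4090: 24 GB
    max_particles = vram / bytes_per_particle         # ~1e9
    bandwidth = 1e12                                  # ~1 TB/s global memory
    neighbors = 20                                    # ~12-30 effective collisions per particle
    bytes_per_iter = max_particles * neighbors * bytes_per_particle
    print(max_particles, bytes_per_iter / bandwidth)  # ~1e9 particles, ~0.5 s/iter (1-10s with cache misses)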

The number of particles you need: volume of the box / h^3, with h the diameter of a particle = the finest detail you want to be able to resolve.

If you use an adaptive scheme, most of your particles are close to the surfaces of objects, so O(surface of objects / h^2), with h = the average resolution of the surface mesh. But an adaptive scheme is 10 times slower.

The precision of the approximation can be bounded by Taylor's formula. SPH is typically order 2, but has issues with boundaries, so to represent a sharp boundary h must be small.

If you want higher order and sharp boundaries, you can use the Finite Element Method instead. But you'll need to tessellate the space with things like Delaunay/Voronoi and update the tessellation as the particles move.


Might be worth starting with a baseline where there's no collision, only advection, and assume higher than 1fps just because this gives higher particles per second but still fits in 24GB? I wouldn't be too surprised if you can advect 100M particles at interactive rates.


The theoretical maximum rate for 1B-particle advection (just doing p[] += v[]*dt) is 1000GB/s / 24GB = 42 iterations per second. If you only have 100M particles, you can do 10 times more iterations.

But that's without any rendering, and with non-interacting particles, which are extremely boring unless you like fireworks. (You can add a term like v[] += g*dt for free.) And you don't need to store colors for your particles if you can compute the colors from the particle index with a function.
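
A minimal sketch of that bandwidth-bound update in PyTorch, assuming a CUDA build and enough VRAM (100M particles here; the measured rate should land in the ballpark of bytes_touched / memory_bandwidth):

    import torch, time

    n = 100_000_000                                     # 100M particles (~2.4 GB for p and v)
    p = torch.rand(n, 3, device="cuda")                 # positions
    v = torch.randn(n, 3, device="cuda")                # velocities
    g = torch.tensor([0.0, -9.81, 0.0], device="cuda")  # gravity
    dt = 1e-3

    torch.cuda.synchronize(); t0 = time.time()
    steps = 100
    for _ in range(steps):
        v += g * dt          # the "free" gravity term
        p += v * dt          # the bandwidth-bound advection update
    torch.cuda.synchronize()
    print((time.time() - t0) / steps, "s per step")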

Rasterizing is slower, because each pixel of the image might get touched by multiple particles (which means concurrent accesses on the GPU to the same memory address, which GPUs don't like).

Obtaining the screen coordinates is just a matrix multiply, but rendering the particles in the correct depth order requires multiple passes, atomic operations, or z-sorting. Alternatively, you can slice your point cloud by weighting it with a peak-shaped function around the desired depth value and use an order-independent reduction like sum, but the memory accesses are still concurrent.

For the rasterizing, you can also use the space-partitioning indices of the particles to render each part of the screen independently, without concurrent-access problems. That's called "tile rendering": each tile renders the subset of particles which may fall in it. (There is plenty of literature in the Gaussian Splatting community.)


> The theoretical maximum rate for 1B-particle advection (just doing p[] += v[]*dt) is 1000GB/s / 24GB = 41.667 ≈ 42 iterations per second.

Just to clarify, the 24GB comes from multiplying 1B particles by 24 bytes? Why 24 bytes? If we used float3 particle positions, the rate would presumably be mem_bandwidth / particle_footprint. If we use a 5090, then the rate would be 1790GB/s / 12B = 146B particles / second (or 146fps of 1B particles).

> non interacting particles which are extremely boring

You assumed particle-particle collision above, which is expensive and might be over-kill. The top comment asked simply about the maximum rate of moving particles. Since interesting things take time & space, the correct accurate answer to that question is likely to be less interesting than trading away some time to get the features you proposed; your first answer is definitely interesting, but didn’t quite answer the question asked, right?

Anyway, I’m talking about other possibilities, for example interaction with a field, or collision against large objects. Those are still physically interesting, and when you have a field or large objects (as long as they’re significantly smaller footprint than the particle data) they can be engineered to have high cache coherency, and thus not count significantly against your bandwidth budget. You can get significantly more interesting than pure advection for a small fraction of the cost of particle-particle collisions.

Yes, if you need rendering, that will take time out of your budget, true and good point. Getting into the billions of primitives is where ray tracing can sometimes pay off over raster. The BVH update is an O(N) algorithm that replaces the O(N) raster algorithm, but the BVH update is simpler than the rasterization process you described, and BVH update doesn't have the scatter problem (write to multiple pixels) that you mentioned; it's write-once. BVH update on clustered triangles can now be done at pretty close to memory bandwidth. Particles aren't quite as fast yet, AFAIK, but we might get there soon.


PyTorch's choice of a dynamic computation graph [1] made it easier to debug and implement, leading to higher adoption, even though running speed was initially slower (and therefore training cost was higher).
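
A minimal sketch of what "dynamic" (define-by-run) means in practice; the model and shapes are made up:

    import torch

    def forward(x, w):
        h = x @ w
        if h.norm() > 1.0:       # data-dependent branch: the graph is rebuilt on every call
            h = torch.relu(h)
        return h.sum()           # intermediate tensors can be printed or inspected in a debugger

    x = torch.randn(4, 8)
    w = torch.randn(8, 2, requires_grad=True)
    loss = forward(x, w)
    loss.backward()              # autograd traces whatever path actually ran
    print(w.grad.shape)          # torch.Size([8, 2])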

Other decisions follow from this one.

TensorFlow started with static graphs and had to move to dynamic ones in version 2.0, which broke everything. Hence the fragmentation between TensorFlow 1, TensorFlow 2, Keras, and JAX.

PyTorch's compilation of this computation graph then erased TensorFlow's remaining edge.

Is the battle over? From a purely computational point of view, PyTorch's solution is very far from optimal, and billions of dollars of electricity and GPUs are burned every year, but the major players are happy with circular deals to entrench their positions. So, at the pace of current AI code development, it's probably one or two years before PyTorch is old history.

[1] https://www.geeksforgeeks.org/deep-learning/dynamic-vs-stati...


Someone’s got to prototype the next generation of architectures.


> at the pace of current AI code development, probably one or two years before Pytorch is old history.

Ehhh, I don’t know about that.

Sure, new AI techniques and new models are coming out pretty fast, but when I go to work with a new AI project, they’re often using a version of PyTorch or CUDA from when the project began a year or two ago. It’s been super annoying having to update projects to PyTorch 2.7.0 and CUDA 12.8 so I can run them on RTX 5000 series GPUs.

All this to say: If PyTorch was going to be replaced in a year or two, we’d know the name of its killer by now, and they’d be the talk of HN. Not to mention that at this point all of the PhDs flooding into AI startups wrote their grad work in PyTorch, it has a lot of network lock-in that an upstart would have to overcome by being way better at something PyTorch can never be good at. I don’t even know what that would be.

Bear in mind that it took a few years for Tensorflow to die out due to lock in, and we all knew about PyTorch that whole time.


> a lot of network lock-in that an upstart would have to overcome by being way better at something PyTorch can never be good at

The cost of migrating higher-level code to a newer framework is going to 0. You ask your favorite agent (or intern) to port it and check that the migration is exact. We already see this in the multitude of deep-learning frameworks.

The day there is one optimization trick that PyTorch can't do but another framework can, which reduces your training cost 10x, PyTorch is going the way of the dodo.

The day an architecture which can't be implemented in PyTorch gets superior performance, it's bye-bye Python.

We already see this with architectures which require real-time rendering, like Gaussian Splatting or Instant NeRF, or with the caching strategies for LLM sequence generation.

PyTorch has 3 main selling points:

- Abstracting away the GPU- (or device-) specific code, which is needed because of Nvidia's mess of custom optimized kernels, which you are forced to adopt if you don't want to write custom kernels yourself.

This advantage disappears if you don't mind writing optimized kernels because the machine writes them, or if you don't need CUDA because you can't use Nvidia hardware (for example because you are in China), or if you use custom silicon, like Groq, and need your own kernels anyway.

- Automatic differentiation. This is also one of its weak points, because they went for easy instead of optimal, and shut themselves off from some architectures. A language like Julia, because of its dynamic low-level compilation, can do things PyTorch won't even dream about (but Julia has its own problems, mainly related to memory allocations). With PyTorch's introduction of the "scan" function [2], we have come full circle back to Theano, TensorFlow's/Keras's ancestor; scan is usually where the suboptimal automatic-differentiation strategy chosen by PyTorch becomes painful.

The optimal solution, as all physics PhDs who have written simulations know, is writing custom adjoint code via "source code transformation" or symbolically: it's not hard but very tedious, so it's now a great fit for your LLM (or intern, or PhD candidate running "student gradient descent"), provided you prove or check that your gradient calculation is correct (see the sketch below).

- Cluster orchestration and serialization: a model can be shared with fewer security risks than arbitrary source code, because you only share the weights, and a model can be split between machines dynamically. But this is also a big weakness, because your code rusts as you become dependent on versioning: you are locked to the specific version your model was trained with.

[2] "https://docs.pytorch.org/xla/master/features/scan.html


What would stop PyTorch from implementing whatever optimization trick becomes important? Even if it requires a different API.


There are two types of stops: soft stops and hard stops.

- A soft stop is when the dynamic-graph computation overhead is too much, which means you can still calculate, but if you were to write the function manually or with a better framework, you could be 10x faster.

Typical examples involve manually unrolling a loop, or doing kernel fusion. Other typical examples are when you have lots of small objects or need to do loops in Python because the problem doesn't vectorize well, or when you want to exploit sparsity efficiently by ignoring the zeros.

- A hard stop is when computing the function becomes impossible, because the memory needed to do the computation in a non-optimal way explodes. Sometimes you can get away with just writing customized kernels.

The typical example where you can get away with it is custom attention layers.

A typical example where you can't get away with it is physics simulations (see the sketch below). For example, the force is the gradient of the energy, but you have n^2 interactions between the particles, so if you preserve anything more than zero memory per interaction during the forward pass, your memory consumption explodes. And typically, with things like Lagrangian or Hamiltonian neural networks, where you try to discover the dynamics of an energy-conserving system, you need to be able to differentiate at least three times in a row.

There are also energy-expending stops, where you need to find workarounds to make it work, like when you want your parameters to change shape during the optimization process (e.g. learning point clouds of growing size); these spread you thin, so they won't be standardized.
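
A minimal sketch of the n^2 memory issue mentioned above, on a toy pairwise potential (my own example, not from a real simulation): autograd has to keep the whole (n, n) intermediate alive for the backward pass, while a hand-written adjoint can be evaluated block by block with only O(n) persistent memory (here done densely for brevity).

    import torch

    n = 2048
    x = torch.randn(n, 3, dtype=torch.float64, requires_grad=True)

    # Pairwise energy E = sum_{i<j} 1/r_ij
    diff = x[:, None, :] - x[None, :, :]                    # (n, n, 3) intermediate
    dist = (diff.pow(2).sum(-1) + torch.eye(n)).sqrt()      # eye avoids sqrt(0) on the diagonal
    energy = (1.0 / dist).triu(1).sum()
    (grad_autograd,) = torch.autograd.grad(energy, x)       # the n^2 intermediates stay resident

    # Hand-written adjoint: same result, streamable in O(n)-sized blocks
    with torch.no_grad():
        grad_manual = -(diff / dist[..., None] ** 3).sum(dim=1)
    print(torch.allclose(grad_autograd, grad_manual))       # True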


>computing the exact gradient using calculus

First of all, gradient computation with backprop (aka reverse-mode automatic differentiation) is exact to numerical precision (except for edge cases that are not relevant here), so it's not about the way the gradient is computed.

What Andrej is trying to say is that when you create a model, you have freedom of design in the shape of the loss function, and that in this design what matters for learning is not so much the value of the loss function but its slopes and curvature (peaks and valleys).

The problematic case is flat valleys surrounded by sheer cliffs (picture the Grand Canyon).

Advanced optimizers in deep learning like Adam are still first-order, with a diagonal approximation of the curvature, which means that in addition to the gradient the optimizer keeps an estimate of the scale sensitivity of each parameter independently. So the cheap thing it can reasonably do is modulate the gradient with this scale.
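
A stripped-down sketch of that per-parameter scaling (simplified: no bias correction or weight decay, so not the full Adam update):

    import torch

    def adam_like_step(p, grad, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        m.mul_(b1).add_(grad, alpha=1 - b1)             # running mean of the gradient
        v.mul_(b2).addcmul_(grad, grad, value=1 - b2)   # running mean of grad^2: the diagonal "curvature" scale
        p.sub_(lr * m / (v.sqrt() + eps))               # each parameter gets its own effective step size

    w = torch.randn(10)
    m, v = torch.zeros_like(w), torch.zeros_like(w)
    adam_like_step(w, torch.randn(10), m, v)            # called once per iteration in a training loop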

The length of the gradient vector often being problematic, what optimizers traditionally did is something called "line search": determining the optimal step size along this direction. But the cost of doing that is usually between 10 and 100 evaluations of the cost function, which is often not worth the effort in the noisy stochastic setting compared to just taking a smaller step multiple times.

Higher-order optimizers require the loss function to be twice differentiable, so non-linearities like ReLU, which are cheap to compute, can't be used.

Lower-order global optimizers don't even need the gradient, which is useful when the energy-function landscape has lots of local minima (picture an egg box).

