I know this subject quite well and I will later publish a detailed article.
The real run-time cost of memory management done well in a modern game engine written without OOP features is extremely low.
We usually use a few very simple specialized memory allocators; you'd probably be surprised by how simple memory management can be.
The trick is to not use the same allocator for resources with different lifetimes.
Some resources are allocated once and basically never freed.
Some resources are allocated per level and freed all at once at the end.
Some resources are allocated during a frame and freed all at once when a new frame starts.
And lastly, a few resources are allocated and freed randomly, and here the cost of fragmentation is manageable because we're talking about a few small chunks (like network packets).
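A minimal sketch of the per-frame case in Rust (the `FrameArena` name and sizes are illustrative, not from any particular engine):

```rust
/// A fixed-size frame arena: each allocation is an O(1) bump of an index,
/// and everything is "freed" at once by resetting that index.
struct FrameArena {
    buf: Vec<u8>, // backing storage, allocated once at startup
    used: usize,  // bytes handed out so far this frame
}

impl FrameArena {
    fn with_capacity(bytes: usize) -> Self {
        Self { buf: vec![0; bytes], used: 0 }
    }

    /// Hand out `size` bytes; no syscall, no free list, no per-object free.
    fn alloc(&mut self, size: usize) -> Option<&mut [u8]> {
        let start = self.used;
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None; // arena exhausted for this frame
        }
        self.used = end;
        Some(&mut self.buf[start..end])
    }

    /// "Free" every allocation made this frame in O(1).
    fn reset(&mut self) {
        self.used = 0;
    }
}

fn main() {
    let mut frame = FrameArena::with_capacity(1 << 20); // 1 MiB, allocated once
    for _ in 0..3 {
        let scratch = frame.alloc(256).expect("frame arena exhausted");
        scratch[0] = 42;
        // ... end of frame: everything allocated above goes away at once
        frame.reset();
    }
}
```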
+1. We have a large Rust code base, and we forbid Vec and the other collections.
Instead, we have different types of global arenas, bump allocators, etc. that you can use. These all pre-allocate memory once at startup, and... that's it.
When you have well-defined allocation patterns, allocating a new "object" is just a `last += 1;`, and once you are done you deallocate thousands of objects by just doing `last -= size();`.
That's ~0.3 nanoseconds per allocation, and a fraction of a nanosecond to "free" a lot of memory.
For comparison, using jemalloc instead puts you at 15-25 ns per allocation and per deallocation, with "spikes" that go up to 200 ns depending on size and alignment requirements. So we are talking about a 100-1000x improvement here, and very often the improvement is larger because these custom allocators are more predictable, smaller, etc. than a general-purpose malloc, so you get better branch prediction, fewer I-cache misses, etc.
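As a purely hypothetical sketch of that pattern (not the parent's actual code): a typed arena over pre-allocated storage, where allocating bumps an index and a whole batch is freed by rolling it back:

```rust
/// A typed arena over storage that is pre-allocated once at startup.
/// Allocating an "object" is `last += 1`; freeing a whole batch is `last -= n`.
struct TypedArena<T> {
    slots: Vec<T>, // pre-allocated once, never grows afterwards
    last: usize,   // index of the first free slot
}

impl<T: Default + Clone> TypedArena<T> {
    fn new(capacity: usize) -> Self {
        Self { slots: vec![T::default(); capacity], last: 0 }
    }

    /// O(1) allocation: hand out the next slot and bump the index.
    fn alloc(&mut self) -> Option<&mut T> {
        if self.last == self.slots.len() {
            return None; // arena full
        }
        let slot = &mut self.slots[self.last];
        self.last += 1;
        Some(slot)
    }

    /// O(1) bulk free: roll the index back over the last `n` objects.
    fn free_last(&mut self, n: usize) {
        self.last = self.last.saturating_sub(n);
    }
}

fn main() {
    let mut arena: TypedArena<[f32; 4]> = TypedArena::new(10_000);
    for _ in 0..1_000 {
        *arena.alloc().expect("arena full") = [0.0; 4];
    }
    arena.free_last(1_000); // "deallocate" a thousand objects in one subtraction
}
```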
Do you use any publicly available crate for those allocators? Would love to take a look. I'm currently trying to write a library for no-std which requires something like that. I currently have written a pool of bump allocators: for each transaction you grab an allocator from the pool, allocate as many objects from it as necessary, and then everything gets freed back to the pool. However, it's a bit hacky right now, so I'm wondering whether there is already something better out there.
> Do you use any publicly available crate for those allocators?
Not really; our bump allocator is ~50 LOC. It just allocates a `Box<[u8]>` with a fixed size on initialization and stores the index of the currently used memory, and that's it.
We then have a `BumpVec<T>` type that uses this allocator (`ptr`, `len`, `cap`). This type has a fixed capacity, it cannot be moved or cloned, etc., so it ends up being much simpler than `Vec`.
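For anyone curious what such a fixed-capacity vector can look like, here is a simplified, self-contained stand-in (unlike the `BumpVec<T>` described above, it owns its own boxed slice instead of borrowing from the bump allocator, and it can be moved; purely illustrative, not the parent's code):

```rust
use std::mem::MaybeUninit;

/// A simplified stand-in for a fixed-capacity, non-growing vector.
struct FixedVec<T> {
    storage: Box<[MaybeUninit<T>]>, // capacity is fixed at construction
    len: usize,
}

impl<T> FixedVec<T> {
    fn with_capacity(cap: usize) -> Self {
        let storage = (0..cap).map(|_| MaybeUninit::uninit()).collect();
        Self { storage, len: 0 }
    }

    /// Push without ever reallocating; fails once the fixed capacity is hit.
    fn push(&mut self, value: T) -> Result<(), T> {
        if self.len == self.storage.len() {
            return Err(value);
        }
        self.storage[self.len].write(value);
        self.len += 1;
        Ok(())
    }

    fn as_slice(&self) -> &[T] {
        // SAFETY: the first `len` slots were initialized by `push`.
        unsafe { std::slice::from_raw_parts(self.storage.as_ptr() as *const T, self.len) }
    }
}

impl<T> Drop for FixedVec<T> {
    fn drop(&mut self) {
        // Only the initialized prefix needs to be dropped.
        for slot in &mut self.storage[..self.len] {
            unsafe { slot.assume_init_drop() };
        }
    }
}

fn main() {
    let mut v: FixedVec<u32> = FixedVec::with_capacity(4);
    for i in 0..4 {
        v.push(i).unwrap();
    }
    assert!(v.push(99).is_err()); // fixed capacity: no growth, no realloc
    println!("{:?}", v.as_slice());
}
```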
Exactly this. Writing a basic allocator can give you a significant bang for your buck. Hard-coding an application-specific arena allocator is trivial. The harder part is being diligent enough to avoid use-after-free bugs and dangling pointers.
I'm a huge Java nerd. I love me some G1/Shenandoah/ZGC/Zing goodness. But once you're writing a program to the point that you're tuning memory latency, as you are in many games anyway, baking in your application's generational hypothesis is pretty easy. Even in Java services you'll often want to pool objects that have odd lifetimes.
This is exactly what's continually drawing me to Zig for this sort of thing. Any function that requires memory allocation has to either create a new allocator for itself or explicitly accept an existing one as an argument, which seems to make it a lot easier/clearer to manage multiple allocators for this exact use case.
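For comparison, here is roughly what that explicit-allocator style looks like transplanted into Rust; the `Arena` trait, `Bump`, and `build_packet` are made-up names for illustration:

```rust
/// The "pass the allocator explicitly" style, sketched in Rust: nothing in
/// `build_packet` can allocate behind the caller's back.
trait Arena {
    fn alloc_bytes(&mut self, size: usize) -> Option<&mut [u8]>;
}

/// A trivial bump allocator implementing that interface.
struct Bump {
    buf: Vec<u8>,
    used: usize,
}

impl Arena for Bump {
    fn alloc_bytes(&mut self, size: usize) -> Option<&mut [u8]> {
        let end = self.used.checked_add(size)?;
        if end > self.buf.len() {
            return None; // out of memory in this arena
        }
        let start = self.used;
        self.used = end;
        Some(&mut self.buf[start..end])
    }
}

/// The caller decides which allocator this function gets to use.
fn build_packet(arena: &mut dyn Arena, payload: &[u8]) -> Option<()> {
    let buf = arena.alloc_bytes(payload.len())?;
    buf.copy_from_slice(payload);
    Some(())
}

fn main() {
    let mut per_frame = Bump { buf: vec![0; 4096], used: 0 };
    build_packet(&mut per_frame, b"hello").expect("out of frame memory");
}
```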
This sounds very interesting. Please do write an article about the topic! Seems like you would introduce specific allocators, maybe different ones for different games.
I buy this - specialized memory allocators make sense; after all, I've used mbufs. So which languages would be good or recommended for coercing types when using memory allocators?
It would be interesting to see what the impact of adding hints about allocation lifetime to malloc would be. I have to suspect someone has already written that paper and I just don't know the terminology to look for it.
I'm reminded of an ancient 8031 C compiler. It didn't support reentrancy. Which seems like a bad thing until you realize that the program could be analyzed as an acyclic graph. The compiler then statically allocated and packed variables based on the call tree. Programs used tiny amounts of memory.
Given the problems people have with Rust and the borrow checker, I would guess that people will not accurately predict how long an allocation will live (it's a similar issue with profiling: programmers usually guess wrong about where the time is being spent).
As an example point, the Go garbage collector clears heaps of 18 GB with sub-millisecond latencies. If I'm understanding the problem at hand (maybe I'm not!), given an engine running at a target framerate of 144 frames per second, you're working with a latency budget of about 7 ms per frame. Do you always use all 7 ms, or do you sometimes sleep or spin until the next frame to await user input?
We can also look at it from the other direction: if your engine is adjusting its framerate dynamically based on the time it takes to process each frame, and you can do the entire work for a frame in 10ms, does that give you a target of 100 fps? If you tack on another half millisecond to run a GC pause, would your target framerate just be 95 fps?
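A quick back-of-the-envelope check of those numbers, purely for illustration:

```rust
// Frame-budget arithmetic for the figures above.
fn main() {
    let budget_at_144 = 1000.0_f64 / 144.0; // ~6.94 ms per frame at 144 fps
    let fps_at_10ms = 1000.0_f64 / 10.0;    // 10.0 ms of work per frame -> 100 fps
    let fps_with_gc = 1000.0_f64 / 10.5;    // plus a 0.5 ms GC pause -> ~95.2 fps
    println!("{budget_at_144:.2} ms, {fps_at_10ms:.1} fps, {fps_with_gc:.1} fps");
}
```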
And what do you do when the set of assets to be displayed isn't deterministic? E.g., an open world game with no loading times, or a game with user-defined assets?
There is no general answer to this question.
Frame latency, timing and synchronization is a difficult subject.
Some games are double or triple buffered.
Rendering is not always running at the same frequency as the game update.
The game update is sometimes fixed, but not always.
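A minimal sketch of one common arrangement (a fixed-rate update decoupled from rendering via an accumulator; the 60 Hz tick and the names are illustrative):

```rust
use std::time::{Duration, Instant};

fn main() {
    let tick = Duration::from_micros(16_667); // fixed 60 Hz update step
    let mut previous = Instant::now();
    let mut accumulator = Duration::ZERO;

    for _ in 0..3 { // stand-in for the real "until quit" loop
        let now = Instant::now();
        accumulator += now - previous;
        previous = now;

        // Run as many fixed updates as the elapsed time allows.
        while accumulator >= tick {
            update_game(tick);
            accumulator -= tick;
        }

        // Render once per loop iteration, at whatever rate the display allows.
        render();
    }
}

fn update_game(dt: Duration) {
    // The simulation advances by exactly `dt` each call.
    let _ = dt;
}

fn render() {
    // Draw the current state; could interpolate using the leftover accumulator.
}
```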
I've had a very awful experience with GC in the past, on Android: the game code was full C/C++ with a bit of Java to talk to the system APIs, and I had to use the camera stream.
At the time (2010) Android was still a hot mess full of badly over-engineered code.
The Camera API was simply triggering a garbage collection about every 3-4 frames, which locked the camera stream for 100 ms (not nanoseconds, milliseconds!).
The Android port of this game was cancelled because of that; it was simply impossible to disable the "feature".
You didn't have a bad experience with GC in the past, you had a bad experience with a single GC implementation, one which was almost certainly optimized for throughput and not latency and in a language that pushes you toward GC pressure by default. :)
I never worked with Unity myself, but I worked with people using Unity as their game engine, and they all had problems with stuttering caused by the GC at some point.
You can try searching the Unity forums about this subject; you'll find hundreds or thousands of topics.
What really bothers me with GC is that it solves a pain I never felt, and creates a lot of problems that are usually more difficult to solve.
This is a typical case of the cure being (much) worse than the illness.
What is Unity’s STW time? Is it optimized for latency? If not, you’re using the wrong collector. The pain point it solves is time spent debugging memory issues and a generally slower pace of development, but of course if you’re using a collector optimized for throughput, you’ll have a bad time. Further, a low-latency GC will pay off more for people who haven’t been using C or C++ for 10 years than for those who have.
It is very nice to say that, in theory, a GC could work very well for performance-demanding games. But until someone builds that GC, in an environment where everything else is also suitable for games, it is academic. We can't actually build a game with a theoretical ecosystem.
Of course, this is a theoretical conversation. My point is that you don’t refute the claim, “low latency GCs are suitable for game dev” with “I used a high latency GC and had a bad time”. Nor do you conclude that GCs inherently introduce latency issues based on a bad experience with a high latency GC. There is neither theory nor experiment that supports the claim that GCs are inherently unsuitable for game dev.
"Clearing an 18GB heap" that's full of 100MB objects that are 99% dead is different than clearing an 18GB heap of 1KB objects that are 30% (not co-allocated, but randomly distributed across the whole heap).
This is exactly the kind of dismissive hand-waving that is frustrating. The 18 GB heap example is from data collected on production servers at Twitter. Go servers routinely juggle tens of thousands of connections, tens of thousands of simultaneous user sessions, or hundreds of thousands of concurrently running call stacks. We're essentially never talking about 100 MB objects, since the vast majority of what Go applications are doing is handling huge numbers of small, short-lived requests.
I'm not a game developer, just a programming language enthusiast with a probably above-average understanding of how difficult this problem is.
Can you point out where in the post they expand on my point? The only thing I see is this:
> Again we kept knocking off these O(heap size) stop the world processes. We are talking about an 18Gbyte heap here.
which is exactly my point - even if you remove all O(heap size) locks, depending on the exact algorithm it might still be O(number of objects) or O(number of live objects). E.g. arena allocators are O(1) (instant), generational copying collectors are O(live objects), while mark-and-sweep GCs (including Go's, if I understand correctly after skimming your link) are O(dead objects) (the sweeping part). Go's GC seems to push most of that out of the Stop-The-World pause, offloading it to the mutator threads instead... Also, a "server with short-lived requests" is AFAIK a fairly good use case for a GC - most objects are very short-lived, so the heap would be mostly garbage with a simple live object graph...
Still, a commendable effort. It could probably be applied to games as well, though likely different specific optimisations would be required for their particular use case. I think communication would be better if you expanded on this (or at least included the link) in your original post.
There are plenty of known GC algorithms with arbitrarily bounded pause times. They do limit the speed at which the mutator threads can allocate, but per-frame allocations can be done the same way they are done in a non-GC'd environment (just allocate a pool and return everything to it at the end of each frame), and any allocations that are less frequent will likely be sufficiently rare for such algorithms.
I know people who have written games in common-lisp despite no implementations having anything near a low-latency GC. They do so by not using the allocator at all during the middle of levels (or using it with the GC disabled, then manually GCing during level-load times).
Best comment here. I have spent some time evaluating Go for making games. Considering the fact that for 95% of games it doesn't really matter if the language has a GC, it could have been a very nice option.
Currently cgo just sucks; it adds a lot of overhead. The next problem is target platforms: I don't have access to console SDKs, but compiling for those targets with Go should be a major concern.
Go could be a nice language for making games, but in the state it is in, the only thing it is good for is hobby desktop games.
Go's GC has latencies that would be good enough for a great many games, but its throughput might become problematic, and in general, as a language it is problematic for the kinds of games that would worry about GC pauses, since FFI with C libraries is very expensive. If that could be solved, Go might be very appealing for many, many games.