You might be correct, but it's not obvious to me that Unsafe is really faster. I recall reading that using Unsafe invalidates most of the other optimizations, because what you are doing is opaque to the JIT. Also, if there are BCE opportunities, I feel pretty confident OpenJDK will add them, rather than being unsympathetic.
In terms of performance: I realize that this is a somewhat "toy" issue, and it's a sample size of 1, but for the currently ongoing "One Billion Row Challenge"[1] (a Java performance competition centered on parsing and aggregating a 13 GB file), all of the current top performers are using Unsafe. More specifically, the use of Unsafe appears to have been the change that allowed a few entries to get below the 3-second barrier in the test.
Submissions to this challenge are also hampered by the lack of prefetch intrinsic, vpshufb intrinsic, aesenc intrinsic, and graal's complete lack of vector intrinsics. As a total outsider to the Java ecosystem, it makes it seem like nobody in a place to make changes really knows or cares about enabling high-throughput code to run in the JVM.
Also, graal can do some vectorization, and definitely has some libraries with vector intrinsics, e.g. truffle’s regex implementation uses similar algorithms to simdjson.
I think I was fairly specific? There's no way for a user to do vpshufb, aesenc, or a prefetch instruction. One would expect the former two things to be in jdk.incubator.vector, where they are not present. The last thing was explicitly removed, here's a link to one part of the process of it being removed: https://bugs.openjdk.org/browse/JDK-8068977
Code that uses jdk.incubator.vector does not actually get compiled to vector instructions under graal. This is why the top submission that uses jdk.incubator.vector is the only top solution that does not use graal.
I don't doubt that "similar algorithms to simdjson" may be used, but simdjson uses instruction sequences that are impossible to generate using the tools provided.
I mention these instructions because vpshufb comes up a lot in string parsing and formatting, because prefetch is useful for hiding latency when doing bulk accesses to a hash table, and because aesenc is useful for building cheap hash functions such as ahash.
For my solution (https://github.com/dzaima/1brc, uses jdk.incubator.vector significantly), switching all array reads/writes to identical Unsafe ones results in the solution running ~100x slower, so there's certainly truth to Unsafe messing with optimization - I suppose it means that the VM must consider all live objects to potentially have mutated. (as for the sibling comment, indeed, lack of vpshufb among other things can be felt significantly while attempting to use jdk.incubator.vector)
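To make that concrete, here is a minimal sketch of the shape of that change (an assumed benchmark shape, not the actual 1brc solution code): the same reduction written once with plain array indexing and once with sun.misc.Unsafe reads, which the JIT has to treat as much more opaque memory accesses.

import java.lang.reflect.Field;
import sun.misc.Unsafe;

class UnsafeVsArray {
    static final Unsafe U;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Plain indexed loop: the JIT can hoist the bounds checks and typically auto-vectorizes it.
    static long sumArray(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    // The "identical" Unsafe version: no bounds checks, but the raw loads give the
    // optimizer much less to work with (aliasing, vectorization, etc.).
    static long sumUnsafe(int[] a) {
        long s = 0;
        long base = Unsafe.ARRAY_INT_BASE_OFFSET;
        long scale = Unsafe.ARRAY_INT_INDEX_SCALE;
        for (int i = 0; i < a.length; i++) s += U.getInt(a, base + (long) i * scale);
        return s;
    }
}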
As a non-java developer, this is surprising to me. I always assumed that bounds checks were "almost free" because the branch predictor will get it right 99% of the time. I can see that being not-true for interpreted runtimes (because the branch predictor gets bypassed, essentially), but I thought Java was JITed by default?
Perhaps the paper answers my question, but I'll admit I'm being lazy here and would much appreciate a tl;dr.
(with bounds checks) "Telling the Rust allocator to avoid zeroing the memory when allocating for Brotli improves the speed to 224 MB/s."
(without bounds checks) "Activating unsafe mode results in another gain, bringing the total speed up to 249MB/s, bringing Brotli to within 82% of the C code."
224MB/s -> 249MB/s (11% Brotli compression perf difference just by eliminating bounds checks)
A couple of issues with the linked article: the Rust compiler has matured a lot since it was written. I didn't look at the benchmark or the code, but many bounds checks are elided if you use iterators instead of array indexing. And lastly, I would want to see a zstd benchmark written in idiomatic Rust.
I don’t doubt that all of those things in the article were true back then, but that was eight years ago. Wow.
Sure, you're not wrong, but my focus was that the benchmark is indicative of the raw performance impact of bounds checks (when they cannot be elided) in a real world algorithm (as opposed to a micro-benchmark).
With that said, convincing a compiler to elide bounds checks (especially Java's JIT compiler) is a hugely frustrating (and for some algorithms futile) task.
It could be an argument that bounds checks make up a small percentage of total application performance. However, I've profiled production Java servers where >50% of the CPU was encryption/compression. JDK implementations of those algorithms are heavily impacted by (and commonly fail to elide) bounds checks.
>With that said, convincing a compiler to elide bounds checks (especially Java's JIT compiler) is a hugely frustrating (and for some algorithms futile) task.
Adding explicit checks does work to a certain degree, but it can change with the compiler, and it requires keeping an eye on the generated assembly - not fun (no unsafe either, but still).
Some 12-13 years back (time does fly), Cliff Click (HotSpot architect) had a series of blog posts on optimizations, including the lattice checks (i.e. within bounds). It was quite insightful. The series was called "Too Much Theory" [0]
>What is optimized today may not be tomorrow.
Exactly. (Also, most developers will have an exceptionally hard time maintaining such code.)
I understand where you are coming from, but without comparing the generated assembly, we are comparing an implementation of bounds checks. I think we should have a bounds check instruction that operates concurrently.
I should play around with this using a couple of RISC-V cores.
Bounds checks are not 'free', but predicted ones are cheap. There are ways to convince the JIT to remove them - in cases where it can prove the index doesn't exceed the byte array's length - so explicit checks against the length of the array may improve performance.
Things get worse with DirectByteBuffers, as the JIT has to work harder. Unsafe allows you to 'remove' all bounds checks, but it may prevent some other optimizations.
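For what it's worth, a minimal sketch of that explicit-check idea (a hypothetical method, not from any particular codebase):

static int sum(byte[] a, int from, int len) {
    // One explicit range check up front gives the JIT a chance to prove that
    // every a[from + i] below is in bounds and drop the implicit checks.
    // Whether it actually does varies by JVM version, so verify the assembly.
    java.util.Objects.checkFromIndexSize(from, len, a.length);
    int s = 0;
    for (int i = 0; i < len; i++) {
        s += a[from + i];
    }
    return s;
}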
I see this mentioned a few times in this thread, but I haven't experienced this in practice (and I've written a lot of unsafe in Java). Are there any examples of this?
There are tricks you can do, but it introduces load on the TLB. You basically restrict accesses to a 32-bit address space and isolate that space in virtual memory with its own prefix.
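If I've understood that trick correctly, a rough sketch of the shape (an assumed layout, not any specific runtime's implementation): reserve a 4 GiB region and mask offsets to 32 bits, so every access lands inside the region without a branch, at the cost of the large reservation and the TLB pressure mentioned above.

import sun.misc.Unsafe;

class MaskedRegion {
    private static final long MASK = 0xFFFF_FFFFL;   // forces offsets into [0, 2^32)
    private final Unsafe unsafe;                     // obtained via the usual theUnsafe reflection
    private final long base;                         // start of the isolated region (its "prefix")

    MaskedRegion(Unsafe unsafe) {
        this.unsafe = unsafe;
        // Real implementations reserve address space lazily; allocateMemory is
        // just the simplest stand-in for a sketch.
        this.base = unsafe.allocateMemory(1L << 32);
    }

    byte get(long offset) {
        // The mask replaces the bounds check: any offset stays inside the region.
        return unsafe.getByte(base + (offset & MASK));
    }
}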
Perf impact due to bounds checks can be experienced in statically compiled languages as well (like Go/Rust). Although branch predictor improvements likely narrow the gap, they do not eliminate it.
Code cache is not completely relevant afaik - you can easily replicate in micro-benchmarks.
I don’t think it’s relevant here, the JIT compiler can do the same optimizations here.
If the branch predictor can basically guess correctly ~100% of the time (which will be the case in any correct program), it should not have any additional cost besides taking up space in the i$, so I would assume that is responsible for the difference.
> the story hasn't changed an awful lot since 2004
Um... yeah it has. For starters, hotspot wasn't even a part of the JVM at that point. But further newer JVM additions like the enhanced for loop eliminate a ton of conditions where someone would run into bounds checking. Doing a naked `a[i]` is simply not common java code.
The JVM is far more likely today to remove the bounds check altogether than it ever was in 2004.
> Doing a naked `a[i]` is simply not common java code.
It is extremely common in performance-sensitive code:
1) graphics & rendering
2) networking
3) buffers
> But further newer JVM additions like the enhanced for loop eliminate a ton of conditions
> The JVM is far more likely today to remove the bounds check all together than it ever was in 2004.
There are more comments in this thread that clarify further, but Java is very commonly unable to eliminate bounds checks. You can test all of these things yourself with a quick benchmark - don't take my word for it! The JIT is not as great at this as common rhetoric claims it is.
It's a writing error on my end (clear since I qualify the percentage afterwards), so I think focusing on it distracts from the point (which is why I say it's pedantic).
2004 had java 1.4 + hotspot. 1.5 (w/ generics and stuff) was about to be released. Hotspot was there, so were the Direct Buffers.
Also, accessing byte arrays and direct buffers is extremely common - if you just do "business logic" jazz it does not happen much, though. However, every hashmap needs it; pretty much every hash lookup is a direct a[hash].
This problem is not going to go away so easily. Numerous core Java classes (like BufferedInputStream) use synchronized. I count 1600+ usages in java.base. The blocking issue means it's _much_ easier to accidentally run into this, rather than waving it away as an unlikely edge case.
I personally ran into this using the built-in com.sun webserver with a virtual thread executor. My VPS only has two CPUs, which means the FJP that virtual threads run on only has 2 active threads at a time. I ran into this hang when some of the connections hung, blocking any further requests from being processed.
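Not the actual server code, but a minimal sketch of the failure mode (the scheduler-parallelism property is only set to mimic a 2-CPU box): a virtual thread that blocks inside a synchronized block pins its carrier, so a couple of stuck connections can stall everything else.

class PinningSketch {
    public static void main(String[] args) throws InterruptedException {
        // e.g. run with -Djdk.virtualThreadScheduler.parallelism=2
        for (int i = 0; i < 2; i++) {
            Object lock = new Object();
            Thread.startVirtualThread(() -> {
                synchronized (lock) {        // inside synchronized...
                    sleepQuietly(10_000);    // ...a blocking call pins the carrier thread
                }
            });
        }
        Thread.sleep(100);
        // With both carriers pinned, this virtual thread isn't scheduled until a sleeper wakes.
        Thread t = Thread.startVirtualThread(() -> System.out.println("handled"));
        t.join();
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
    }
}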
As the JEP states, pinning due to synchronized is a temporary issue. We didn't want to hold off releasing virtual threads until that matter is resolved (because users can resolve it themselves with additional work), but a fix already exists in the Loom repository, EA builds will be offered shortly for testing, and it will be delivered in a GA release soon.
Those who run into this issue and are unable or unwilling to do the work to avoid it (replacing synchronized with j.u.c locks) as explained in the adoption guide [1] may want to wait until the issue is resolved in the JDK.
I would strongly recommend that anyone adopting virtual threads read the adoption guide.
The problem is that it's rare to write code which uses no third-party libraries, and these third-party libraries (most written before Java virtual threads ever existed) have a good chance of using "synchronized" instead of other kinds of locks; and "synchronized" can be more robust than other kinds of locks (no risk of forgetting to release the lock, and on older JVMs, no risk of an out-of-memory while within the lock implementation breaking things), so people can prefer to use it whenever possible.
To me, this is a deal breaker; it makes it too risky to use virtual threads in most cases. It's better to wait for a newer Java LTS which can unmount virtual threads on "synchronized" blocks before starting to use it.
> have a good chance of using "synchronized" instead of other kinds of locks; and "synchronized" can be more robust than other kinds of locks (no risk of forgetting to release the lock, and on older JVMs, no risk of an out-of-memory while within the lock implementation breaking things),
I haven't professionally written Java in years; however, from what I remember, synchronized was considered evil from day one. You can't forget to release it, but you had better go out of your way to allocate an internal object just for locking, because you have no control over who else might synchronize on your object - and at that point you are only a bit of syntactic sugar away from a try { lock.lock(); } finally { lock.unlock(); }.
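For illustration, the two patterns side by side (a toy Counter, not anyone's real code):

class Counter {
    private final Object lock = new Object();   // internal object: nobody outside can synchronize on it
    private int n;

    void increment() {
        synchronized (lock) { n++; }
    }
}

// ...which really is only a bit of syntactic sugar away from:
class CounterWithLock {
    private final java.util.concurrent.locks.ReentrantLock lock =
            new java.util.concurrent.locks.ReentrantLock();
    private int n;

    void increment() {
        lock.lock();
        try { n++; } finally { lock.unlock(); }
    }
}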
The fact that the monitor is public rarely causes issues, and in those cases where it's used on internal objects, it's not really public anyhow.
There's an additional benefit to using the built in monitors, and that has to do with heap allocation. The data structure for managing it is allocated lazily, only when contention is actually encountered. This means that "synchronized" can be used as a relatively low cost defensive coding practice in case an object which isn't intended to be used by multiple threads actually is.
Is there a similarly low-level synchronization mechanism that doesn't work this way? .NET's does the same thing.
I guess I might have preferred if both Java and .NET had chosen to use a dedicated mutex object instead of hanging the whole thing off of just any old instance of Object. But that would have its own downsides, and the designers might have good reason to decide that they were worse. Not being able to just reuse an existing object, for example, would increase heap allocations and the number of pointers to juggle, which might seriously limit the performance of multithreaded code that uses a very fine-grained locking scheme.
In .NET, async won where lock and Mutex do not work (lock is like synchronized, though not exactly the same). That's why most libraries use SemaphoreSlim, which works with green threads. But that's more because of the ecosystem. I've barely stumbled upon locks, and Mutex is mostly used in the Main method, since it acquires a real OS mutex - not really a cheap thing, but for GUIs it's clever for checking whether the app is already running. Most libs that use System.Threading.Tasks use SemaphoreSlim, though.
Yeah, definitely. But for a fair comparison I think you have to look at how .NET did things before async/await hit the scene. And, for that, the aspect of the design in question is quite similar between the two.
Early .Net is hardly an independent data point from early Java. Not only was .Net directly influenced by Java, it also had to support a direct migration from the Microsoft JVM specific Visual J++ to J#.
The handful of languages I know either do not have a top level object class that supports a randomized set of features ( C++ ) or prioritize a completely different way of concurrent execution ( Python, JavaScript ).
Hi Ron. Thanks a lot for the amazing work you are doing on Loom and the whole JVM platform. Can the EA builds and GA release you mentioned make it into 22, or did you mean an EA build for 23?
We make scalable graphics rendering servers to stream things like videogames across the web. When we started the project to switch to virtual threads we had that as number one on the big board. "Rewrite for reentrant locks."
Maybe we have more fastidious engineers than a normal company would since we are in the medical space? But even the juniors were reading and familiarizing themselves on how to properly lock in loom's infancy.
All that only to point out that, yes, they had communicated the proper use of reentrant locks long ago.
I do understand what you're saying from an engineering management perspective though. That effort cost a fortune. Especially when you have the FDA to deal with.
It was more than worth it though! In the world of cloud providers, efficiency is money.
We use the same technologies to deliver, say, remote CT review capability, that you would use to stream a videogame. It's just far more likely that the audience I'm communicating with, HN, is familiar with the requirements of videogame streaming, than it is that they are familiar with remote medical dataset viewing. Obviously the requirements or our use case are far more stringent, but no need to go into all that to illustrate the point made.
1 - Use virtual threads with reentrant locks if you need to do "true heavy" scaling.
2 - Kind of implied, but since you gave the opportunity to make it explicit with your comment =D, there is no need to waste your life on earning no money in videogames when the medical industry is right there willing to pay you 10x as much for the same skills. (Provided your skill is in the hard backend engine and physics work. They pay more for the ML too, if I'm being honest.)
In the "Virtual Threads: An Adoption Guide" part there is:
When using virtual threads, if you want to limit the concurrency of accessing some service, you should use a construct designed specifically for that purpose: the Semaphore class.
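A small sketch of what that suggestion looks like in practice (callBackend is a hypothetical blocking call, not an API from the guide): keep spawning virtual threads freely and gate the scarce resource with a Semaphore instead of bounding a thread pool.

import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

class LimitedConcurrency {
    static final Semaphore PERMITS = new Semaphore(10);   // at most 10 concurrent backend calls

    static String callBackend(String request) throws InterruptedException {
        PERMITS.acquire();
        try {
            return "response to " + request;   // stand-in for the real blocking call
        } finally {
            PERMITS.release();
        }
    }

    public static void main(String[] args) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 1_000; i++) {
                int id = i;
                executor.submit(() -> {
                    try {
                        callBackend("req-" + id);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }   // close() waits for all submitted tasks to finish
    }
}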
That language only obliquely mentions the issue. It is nowhere near clear and direct enough for someone who is just, for example, using a third-party library that is affected. And then it's stuck inside detailed documentation that anyone who wasn't personally planning on adopting virtual threads is unlikely to read.
This seems like it's at least vaguely headed in the direction of that famous scene from early in The Hitchhiker's Guide to the Galaxy:
“But the plans were on display…”
“On display? I eventually had to go down to the cellar to find them.”
“That’s the display department.”
“With a flashlight.”
“Ah, well, the lights had probably gone.”
“So had the stairs.”
“But look, you found the notice, didn’t you?”
“Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard.’”
I would like to take this opportunity to thank pron and the amazing jdk developers for working on a state of the art runtime and language ecosystem and providing it for free. Please ignore the entitled, there are many many happy Dev's who can't thank you all enough.
People always forget that things that only happen every few million times, can happen fairly frequently on a busy server. This has bitten me numerous times. The nature of a lot of these types of issues is that they are hard to detect and hard to reproduce.
Virtual threads are nice for unblocking legacy code, but they aren't without issues. There are better options for new code with fewer trade-offs on the JVM as well. I've recently been experimenting with jasync-postgresql (there's a MySQL variant as well) as an alternative to JDBC in Kotlin. It's a nice library. It does have some limitations and is a bit on the primitive side. But it appears to be somewhat widely used in various database frameworks for Scala, Java, and Kotlin.
Databases and database frameworks are an area on the JVM where there just is a huge amount of legacy code built on threads and blocking IO. It's probably one of the reasons Oracle worked on virtual threads as migrating away from these frameworks is unlikely to ever happen in a lot of code bases. So, waving a magic wand and making all that code non blocking is very attractive. But of course that magic has some hard limitations and synchronize blocks are one of those. I imagine they are working on improving that further.
> Virtual threads are nice for unblocking legacy code but they aren't without issues. There are better options for new code with less trade offs on the jvm as well.
The designers of Project Loom would say the exact opposite. The whole push behind Project Loom and similar models (Go's oft-praised "goroutines" runtimes being another one) is motivated by Threads being a much better fit for async behavior in a fundamentally procedural language like Java or Go than promise-based frameworks like async/await.
The whole motivation of Project Loom is to make the simple thing (spawning threads to handle blocking IO) the fast thing as well (by actually replacing the blocking IO with efficient async IO OS calls and managing the threads internally). Project Loom will be considered a full success if the next-generation Java web server does something akin to "new Thread(() -> { executeHandlerFunc(conn); }).start();" for each incoming connection, just like the Go built-in web server.
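A hedged sketch of that shape with virtual threads (handleConnection is a stand-in, not a real framework API): the naive thread-per-connection structure is also the scalable one.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

class ThreadPerConnectionServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket conn = server.accept();
                Thread.ofVirtual().start(() -> handleConnection(conn));
            }
        }
    }

    static void handleConnection(Socket conn) {
        try (conn) {
            // Blocking reads/writes here park the virtual thread, not an OS thread.
            conn.getOutputStream().write("HTTP/1.0 200 OK\r\n\r\nhello\r\n".getBytes());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}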
I think it's not that black and white. Clearly they made a choice to be backwards compatible. Not because Java Threads have a nice API (not even close) but because a lot of legacy code that will never be changed uses it. Including all the ugly bits that you shouldn't be using. Like a lot of the low level synchronization primitives that date back to the early days of Java. It's an impressive bit of work but they made some compromises to make things work. A new API would have been easier, would have had less overhead, and be nicer to use. But backwards compatibility with legacy code was a big goal.
It mostly works fine and it's an impressive bit of engineering. But it has some really ugly failure modes in combination with hacky legacy code designed for real threads. So, you can't blindly assume things to just work. Hence the deadlocks.
Many Java servers already work the way you outline. It's just that they are a bit tedious to use with the traditional Java frameworks. Which is one reason I like using Spring's webflux with Kotlin instead. Just way nicer when it's all exposed via co-routines.
There are two separate choices. One is the choice of whether to implement green threads in the JVM at all, or whether to use async/await, or some other type of concurrency primitive. The other is whether to expose the new concurrency primitive using a new API or an existing one.
You could say the second choice, the specific API, was done, at least to some extent, for backwards compatibility reasons. I wouldn't agree, but I think there is at least some argument to be made. Here is one of the designer's explanation [0]:
> We also realized that implementing the existing thread API, so turning it into an abstraction with two different implementations won't add any runtime overhead. I also found that when talking about Java's new user mode threads back when this feature was in development, and back when we still called them fibers, every time I talked about them at conferences, I kept repeating myself and explaining that fibers are just like threads. After trying a few early access releases of the JDK with a fiber API, and then a thread API, we decided to go with the thread API.
However, the choice of adding a new concurrency primitive to Java in the form of green threads instead of others was very very clearly not done for backwards compatibility's sake. Ron Pressler (who is active here as 'pron') has several talks on the advantages of green threads over async/await that you can look at [0][1]. The designers of Go also had the same belief, and also chose to add green threads as the fundamental built-in concurrency primitive in Go, obviously not for backwards compatibility reasons in their case.
There might be some justification for comparing any one particular thing to the worst possible particular thing if those things have something in common. The only feature the two things you picked have in common is the word 'java'.
Appeal to expertise.
Appeal to authority is a fallacy when the authority is not an expert in the requisite domain.
eg: we don't care what a policeman thinks about astrophysics, we do care what the astrophysicist says.
My understanding is that the highest-performance webserver is nginx. And it uses async internally.
IMO, virtual threads is a better general purpose language feature because it avoids function coloring and is generally easier to reason about, but it may not result in the highest performance Java webserver.
NGINX is a native C implementation, so it has to be carefully written to use the OS's native high-performance IO and native OS threads.
The purpose of project Loom is to abstract that away from Java application code. The runtime can use the most efficient IO for the given platform (ideally io_uring on Linux or IOCP on Windows, for example) even if the application code calls the old blocking File.Write(). The application can then use simple APIs and code patterns, but still get massive performance.
With Loom, you can easily have 20,000 virtual threads servicing 20,000 concurrent HTTP requests and each "blocked" in IO, while only using, say, 100 OS threads that are polling an IOCP. A normal Linux box can typically only handle around maybe 1000 threads across all running processes.
Most application webservers (by default) handle one request per thread. For mostly IO bound stuff (which many projects are), it makes sense to me that threads become a bottleneck in relatively ordinary scenarios.
The scenario where your IO could handle way more than a thousand concurrent requests if only the thread overhead was reduced? When does that ever happen?
Each OS thread costs memory. With the version of Java I have, the default is to allocate 1MB of stack for each thread. So, 10,000 threads would require 10,000 MB of RAM even if we configured ulimit to allow that many threads. In contrast, asking the kernel to do buffered reads of 10,000 files in parallel requires much less memory - especially if most of those are actually the same physical file. Of course, they won't be read fully in parallel.
For example, this program:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.FileSystems;

public class Test {
    public static void main(String[] args) throws InterruptedException {
        // Start 20,000 virtual threads, each copying abc.txt to stdout.
        var threads = new Thread[20000];
        for (int i = 0; i < 20000; i++) {
            threads[i] = Thread.ofVirtual().start(() -> {
                try {
                    Files.copy(FileSystems.getDefault().getPath("abc.txt"), System.out);
                } catch (IOException e) {
                    System.err.println("Error writing file");
                    e.printStackTrace();
                }
            });
        }
        // Wait for all of them to finish.
        for (int i = 0; i < 20000; i++) {
            threads[i].join();
        }
    }
}
Run as `java Test > ./cde.txt` takes about 4.5s to run on my WSL2 system with 2 cores, writing a 2 GB file (with abc.txt having 100KB); even this would be within the HTTP timeout, though users would certainly not be happy. Pretty sure a native Linux system on a machine beefy enough to be used as a web server would have no problem serving even larger files over a network like this.
1. You are not solving a real problem. The use case you describe (basically a CDN) is already exotic, the scenario where such a system would have already been implemented with Java and its basic IO seems implausible.
2. You did not compare against fewer threads to see if threads are actually the bottleneck rather than IO. Also, all your threads are competing for stdout.
The lack of support for synchronized isn't a fundamental or hard limit, it's just that the HotSpot implementation is complicated for performance reasons and they put off rewriting that code until later. They're indeed working on that now and in some future version I guess wait/notify and synchronized blocks will start to work. After all, you can easily transform such code into an equivalent that does work.
The system property jdk.tracePinnedThreads triggers a stack trace when a thread blocks while pinned. Running with -Djdk.tracePinnedThreads=full prints a complete stack trace when a thread blocks while pinned, highlighting native frames and frames holding monitors. Running with -Djdk.tracePinnedThreads=short limits the output to just the problematic frames.
After exploring a few constant access serialization formats, I had to pass on Capn Proto in favor of Apache Avro. Capn has a great experience for C++ users, but Java codegen ended up being too annoying to get started with. If Capn Proto improved the developer experience for the other languages people write, I think it would really help a lot.
The StreamObserver API came at a time (2015) when it seemed like RxJava was going to take over. That didn't end up happening, but the API is still around. While it is more cumbersome, some things are /impossible/ to do with the Go-style blocking. For example, try cancelling out of a Recv() call. The only way is to tear the entire Stream down. Goroutines never successfully married select {} and sync.Cond, or context cancellation. These are needed to successfully back out of a blocking statement. Unfortunately, that can't be done, and a goroutine that blocks is really stuck there. The only saving grace is that goroutines are relatively cheap (2-4K of memory?), and it's okay if a few O(100K) of them get stuck.
This is impractical a lot of the time. There is a class of problems where you want to put 2^N + 1 possible values in an N-bit container, and it won't ever fit cleanly. Null is that 1 extra value that won't fit cleanly.
Another case is an array based queue. It can be implemented with head+tail pointers, or size+offset. However, there will always be an ambiguity with either, because two words of memory aren't enough to represent all possible states of the queue.
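A toy version of that queue (my own sketch, not from any library) showing the ambiguity: with only head and tail, head == tail could mean either empty or full, so implementations either sacrifice a slot (as here) or carry a separate count.

class IntRingQueue {
    private final int[] buf;
    private int head, tail;           // next slot to read / next slot to write

    IntRingQueue(int capacity) {
        buf = new int[capacity + 1];  // one slot sacrificed to break the empty/full ambiguity
    }

    boolean isEmpty() { return head == tail; }
    boolean isFull()  { return (tail + 1) % buf.length == head; }

    boolean offer(int x) {
        if (isFull()) return false;
        buf[tail] = x;
        tail = (tail + 1) % buf.length;
        return true;
    }

    int poll() {
        if (isEmpty()) throw new IllegalStateException("empty");
        int x = buf[head];
        head = (head + 1) % buf.length;
        return x;
    }
}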
What isn't clear to me is why tail calls need to be implemented in WASM, rather than in the compiler? The post linked to Josh Haberman's post on tail calls, which shows how tail calls can help the compiler decide where to inline (cool!). But that was needed for the C++ text, not the LLVM code. It feels like tail calls are too high-level a concept to be in an "assembly" language.
If a function calls itself using tail-recursion, a compiler can turn that into a loop without too much trouble. However, if it's tail-calling a different function then that becomes more difficult; it would have to merge the two functions into a single WASM function. And if the tail-call is indirect (through a function pointer) then it is impossible to turn into a loop.
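To illustrate the easy case, a sketch (sumTo is an arbitrary example of mine): a self tail call is just "update the arguments and jump back to the top", which is exactly a loop.

class TailToLoop {
    // Tail-recursive form: the recursive call is the last thing the function does.
    static long sumTo(long n, long acc) {
        if (n == 0) return acc;
        return sumTo(n - 1, acc + n);
    }

    // Equivalent loop a compiler could emit instead; no stack growth.
    static long sumToLoop(long n, long acc) {
        while (n != 0) {
            acc = acc + n;
            n = n - 1;
        }
        return acc;
    }
}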
> It feels like tail calls are too high level of a concept to be in an "assembly" language.
WASM is a bit higher level than typical assembly languages. It doesn't have unrestricted "goto", so there's no way to implement tail calls optimization the "hard way".
> It feels like tail calls are too high level of a concept to be in an "assembly" language.
In x86, the difference is between `call some_fn` and `jmp some_fn` (perhaps after a `call _guard_check_icall` to implement Control Flow Guard)
In WASM, the difference is between `call[_indirect] some_fn` and `return_call[_indirect] some_fn`, which is just `jmp some_fn` with a funny name - and similar control flow integrity constraints as imposed by CFG. Why not just use `jmp some_fn`? Because WASM lacks a generic `jmp some_fn`, as they've baked https://webassembly.org/docs/security/#control-flow-integrit... into the arch, which seems fair enough.
To add to other reasons, my opinion is: WebAssembly right now is an extremely easy target for compilation. You can create your own compiler absolutely easily and get plenty of V8 optimizations for free. It's like LLVM, but probably much more accessible for hobby projects. You just parse text, build an AST, and dump the AST to WebAssembly.
If WebAssembly implemented tail calls, implementing them in a source language would require zero effort; they would just work.
But implementing tail calls using some kind of optimization, like rewriting a function into a loop, is far from easy.
So my personal hope is that WebAssembly will include features that are possible but very hard to develop in a hobby project. GC is another example. Even a primitive toy GC is a serious project.
Because the general tail call problem isn't self recursion - that is indeed trivial to flatten - but co-recursion, or recursion where you don't know the target.
Co-recursion (f calls g calls h calls f, etc.) requires you to create a single function containing the bodies of all functions that will be called via tail recursion, put them all in a loop, and have a single stack frame capable of holding all the independent frames concurrently. It's doable, but taking my dumb example, you'd need to do this for each of f, g, and h, or you'd have to convert those functions into wrappers for your one giant mega-function. The mega-function then has issues if it has non-tail-recursive calls that re-enter it, as its own frame is large. The stack frame is large because WASM is structured: the bytecode can't just treat the stack as a general-purpose blob of memory.
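A toy illustration of that flattening (isEven/isOdd are stand-ins for f, g, h): the mutually tail-recursive pair becomes one function with a state variable and a loop.

class CoRecursionFlattened {
    // Mutually recursive form: each call is a tail call into the other function.
    static boolean isEven(long n) { return n == 0 || isOdd(n - 1); }
    static boolean isOdd(long n)  { return n != 0 && isEven(n - 1); }

    // Flattened form: one frame, one loop, and a variable saying which "body" we are in.
    static boolean isEvenFlat(long n) {
        int state = 0;                    // 0 = isEven body, 1 = isOdd body
        while (true) {
            if (state == 0) {
                if (n == 0) return true;
                n = n - 1; state = 1;     // "tail call" isOdd(n - 1)
            } else {
                if (n == 0) return false;
                n = n - 1; state = 0;     // "tail call" isEven(n - 1)
            }
        }
    }
}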
Dynamic recursion is a case where you quite simply cannot do compile-time flattening ahead of time, e.g.
int f(int (*g)()) {
return g();
}
Languages that directly target machine code can do this because they just have to rebuild the stack frame at the point of the call and perform a jump rather than a call/bl instruction. WASM bytecode can't do that, as it needs to be safe: the stack is structured and unrestricted jumps aren't an option.
Now it's not uncommon for the various JS/WASM engines to perform tail call optimizations anyway, but the important things that many languages need is guaranteed tail calls, e.g. a way to say "hello runtime, please put the work into making this a tail call even if you haven't decided the function is hot enough to warrant optimizations, as I require TCO for correctness".
For example, let's imagine this silly factorial function:
function factorial(n, accumulator = 1) {
if (n <= 1) return accumulator;
return factorial(n - 1, accumulator * n);
}
in the absence of guaranteed TCO, this might fail for a large enough argument n. If the function is called enough, a JIT might choose to run some optimization passes that perform TCO, and then this works for any value of n. So for correctness it is necessary to be able to guarantee TCO will occur. In WASM that's apparently [going to be?] an annotated call instruction, which is what .NET's VM does as well. The JS TCO proposal a few years back simply said something like "if a return's expression is a function call, the call must be tail recursive".
WASM is not really an assembly language. Before this, WASM didn't have jump at all, and so tail calls are adding a form of jump (jump with arguments). This makes WASM a much better compilation target.
It doesn't and I hope whoever is in charge of web assembly doesn't ruin what they have by giving in to nonsense trends that are part of a silver bullet syndrome and don't offer real utility.
For example, it's done by all of the Schemes, all of the MLs (SML, Ocaml, Haskell, Rust, Scala, Idris, etc.), scripting languages (Lua, Elm, Perl, Tcl, etc.), and many others.
> that are part of a silver bullet syndrome and don't offer real utility
Their "real utility" is right there in the subtitle of that original paper:
Debunking the "Expensive Procedure Call" Myth
or, Procedure Call Implementations Considered Harmful
or, Lambda: The Ultimate GOTO
In other words, any implementation of procedure/function/method-calls which doesn't eliminate tail-calls is defective (slow, memory-hungry, stack-unsafe, etc.)
Not standard, since almost no programming is actually done using tail calls. Programming is done with loops, tail calls are an exotic and niche way of working.
Even in Lua and Rust tail calls are rarely used, and the other languages you listed are extremely niche.
Debunking the "Expensive Procedure Call" Myth
or, Procedure Call Implementations Considered Harmful
or, Lambda: The Ultimate GOTO
These are not explanations of utility, these are titles that you are using to claim something without backing it up.
In other words, any implementation of procedure/function/method-calls which doesn't eliminate tail-calls is defective (slow, memory-hungry, stack-unsafe, etc.)
This is a very bold claim with no evidence for it and pretty much the entire history of programming against it.
Also, it was literally the title of a paper. If citing the literature with a hyperlink to wikisource doesn't count as "backing it up", then I have no idea where you put the goalposts.
> almost no programming is actually done using tail calls
Literally every function/procedure/method/subroutine/etc. has at least one tail position (branching allows more than one). It's pretty bold to claim that there are 'almost no' function calls in those positions. I wouldn't believe this claim without seeing some sort of statistics.
> Programming is done with loops, tail calls are an exotic and niche way of working.
Loops have limited expressiveness; e.g. they don't compose, they break encapsulation, etc. Hence most (all?) programs utilise some form of function/method/procedure/subroutine/GOTO. Tail-calls are simply a sub-set of the latter which, it turns out, are more powerful and expressive than loops (as an obvious example: machine-code doesn't have loops, since it's enough to have GOTO (AKA tail-calls)).
That wasn't 'me claiming something', it was Guy L Steele Jr.
You still just copied and pasted titles; this isn't evidence of anything.
If citing the literature
Then put in the part you think is evidence or significant. This is the classic "prove my point for me" routine. You're the one who wants to change a standard.
Literally every function/procedure/method/subroutine/etc. has at least one tail position
Are you conflating general functions with tail call elimination?
Loops have limited expressiveness; e.g. they don't compose, they break encapsulation,
Why would that be true? How would looping through recursion change this?
Hence most (all?) programs utilise some form of function/method/procedure/subroutine/GOTO
What does this have to do with tail call optimizations? Web assembly has functions.
machine-code doesn't have loops,
Web assembly is not the same as machine code
I think overall you are thinking that making claims is the same as evidence. You haven't explained any core idea why tail call optimizations have any benefit in programming or web assembly. You basically just said a well known language creator put them in some languages. There is no explanation of what problem is being solved.
Indeed; if web assembly allowed unrestrained GOTOs (like machine code) then compilers would already be able to do tail-call elimination.
---
> Are you conflating general functions with tail call elimination?
> What does this have to do with tail call optimizations?
> You haven't explained any core idea why tail call optimizations have any benefit
Sorry, I think there have been some crossed wires: I was mostly pointing out the absurdity of your statement that "almost no programming is actually done using tail calls" (when in fact, almost all programs will contain many tail-calls).
That's separate to the question of how tail-calls should be implemented: in particular, whether they should perform a GOTO (AKA tail-call elimination/optimisation); or, alternatively, whether they should allocate a stack frame, store a return address, then perform a GOTO, then later perform another GOTO to jump back, then deallocate that stack frame, etc.
> You haven't explained any core idea why tail call optimizations have any benefit in programming or web assembly
Based on my previous sentence, I would turn the question around: what benefit is gained from allocating unnecessary stack frames (which waste memory and cause stack overflows), performing redundant jumps (slowing down programs), etc.?
almost no programming is actually done using tail calls
I'm talking about tail call optimization, which is what this whole thread was about, what you are you talking about?
Based on my previous sentence, I would turn the question around:
That's not how it works, since you're the one wanting a standard to change.
what benefit is gained from allocating unnecessary stack frames
Where are you getting the idea that the webasm JIT has to allocate 'unnecessary stack frames'?
stack frames (which waste memory and cause stack overflows)
This makes me think you don't understand how the stack even works. The memory is already there and stack memory is very small. You can't both 'waste memory' and overflow the stack at the same time. This stuff is fundamental to how computers work.
It seems like you don't even know or understand what problem you are trying to solve. Where did you get this absurd idea that the stack is a problem? Also do you think that rust doesn't use a stack?
Reputation systems should be based on /abuse/, not on automation. I also ended up on the naughty list for running an archival scraping program. Trying to preserve part of the Internet is apparently against the rules. It's really a shame because my code honors rate limits, doesn't spam, and is completely docile.
Cloudflare has mixed up the definitions of "bot" and "abuse". Tor users may or may not be bots, but as long as they don't abuse (spamming or DoS), they ought to be treated the same.
It wasn’t framed as an opinion. And even if it was, I’m saying I think it is wrong and I want to know why I should change my mind.
The fact is that CloudFlare handles abuse (DDoS at IP layers 3 and 4) completely separately from bot detection. And it gives domain owners controls to allow some bots, like the Google Search Crawler.
So my statement stands: I want to see a citation of evidence that CloudFlare doesn’t have the ability to distinguish abuse.