You might be correct, but it's not obvious to me that Unsafe is really faster. I recall reading that using Unsafe invalidates most of the other optimizations, because what you are doing is opaque to the JIT. Also, if there are BCE opportunities, I feel pretty confident OpenJDK will add them, rather than being unsympathetic.
In terms of performance: I realize that this is a somewhat "toy" issue, and it's a sample size of 1, but for the currently ongoing "One Billion Row Challenge"[1] (a Java performance competition centered on parsing and aggregating a 13 GB file), all of the current top performers are using Unsafe. More specifically, the use of Unsafe appears to have been the change that allowed a few entries to get below the 3-second barrier in the test.
Submissions to this challenge are also hampered by the lack of prefetch intrinsic, vpshufb intrinsic, aesenc intrinsic, and graal's complete lack of vector intrinsics. As a total outsider to the Java ecosystem, it makes it seem like nobody in a place to make changes really knows or cares about enabling high-throughput code to run in the JVM.
Also, graal can do some vectorization, and definitely has some libraries with vector intrinsics, e.g. truffle’s regex implementation uses similar algorithms to simdjson.
I think I was fairly specific? There's no way for a user to do vpshufb, aesenc, or a prefetch instruction. One would expect the former two things to be in jdk.incubator.vector, where they are not present. The last thing was explicitly removed, here's a link to one part of the process of it being removed: https://bugs.openjdk.org/browse/JDK-8068977
Code that uses jdk.incubator.vector does not actually get compiled to vector instructions under graal. This is why the top submission that uses jdk.incubator.vector is the only top solution that does not use graal.
I don't doubt that "similar algorithms to simdjson" may be used, but simdjson uses instruction sequences that are impossible to generate using the tools provided.
I mention these instructions because vpshufb comes up a lot in string parsing and formatting, because prefetch is useful for hiding latency when doing bulk accesses to a hash table, and because aesenc is useful for building cheap hash functions such as ahash.
For my solution (https://github.com/dzaima/1brc, uses jdk.incubator.vector significantly), switching all array reads/writes to identical Unsafe ones results in the solution running ~100x slower, so there's certainly truth to Unsafe messing with optimization - I suppose it means that the VM must consider all live objects to potentially have mutated. (as for the sibling comment, indeed, lack of vpshufb among other things can be felt significantly while attempting to use jdk.incubator.vector)
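To make that concrete, here is a minimal sketch of the shape of that change (an assumed benchmark shape, not the actual 1brc solution code): the same reduction written once with plain array indexing and once with sun.misc.Unsafe reads, which the JIT has to treat as much more opaque memory accesses.

import java.lang.reflect.Field;
import sun.misc.Unsafe;

class UnsafeVsArray {
    static final Unsafe U;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Plain indexed loop: the JIT can hoist the bounds checks and typically auto-vectorizes it.
    static long sumArray(int[] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    // The "identical" Unsafe version: no bounds checks, but the raw loads give the
    // optimizer much less to work with (aliasing, vectorization, etc.).
    static long sumUnsafe(int[] a) {
        long s = 0;
        long base = Unsafe.ARRAY_INT_BASE_OFFSET;
        long scale = Unsafe.ARRAY_INT_INDEX_SCALE;
        for (int i = 0; i < a.length; i++) s += U.getInt(a, base + (long) i * scale);
        return s;
    }
}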
As a non-java developer, this is surprising to me. I always assumed that bounds checks were "almost free" because the branch predictor will get it right 99% of the time. I can see that being not-true for interpreted runtimes (because the branch predictor gets bypassed, essentially), but I thought Java was JITed by default?
Perhaps the paper answers my question, but I'll admit I'm being lazy here and would much appreciate a tl;dr.
(with bounds checks) "Telling the Rust allocator to avoid zeroing the memory when allocating for Brotli improves the speed to 224 MB/s."
(without bounds checks) "Activating unsafe mode results in another gain, bringing the total speed up to 249MB/s, bringing Brotli to within 82% of the C code."
224MB/s -> 249MB/s (11% Brotli compression perf difference just by eliminating bounds checks)
A couple of issues with the linked article: the Rust compiler has matured a lot since it was written. I didn't look at the benchmark or the code, but many bounds checks are elided if you use iterators instead of array indexing. And lastly, I would want to see a zstd benchmark written in idiomatic Rust.
I don’t doubt that all of those things in the article were true back then, but that was eight years ago. Wow.
Sure, you're not wrong, but my focus was that the benchmark is indicative of the raw performance impact of bounds checks (when they cannot be elided) in a real world algorithm (as opposed to a micro-benchmark).
With that said, convincing a compiler to elide bounds checks (especially Java's JIT compiler) is a hugely frustrating (and for some algorithms futile) task.
It could be an argument that bounds checks make up a small percentage of total application performance. However, I've profiled production Java servers where >50% of the CPU was encryption/compression. JDK implementations of those algorithms are heavily impacted by (and commonly fail to elide) bounds checks.
>With that said, convincing a compiler to elide bounds checks (especially Java's JIT compiler) is a hugely frustrating (and for some algorithms futile) task.
Adding explicit checks does work to a certain degree, but it can change with the compiler, and it requires keeping an eye on the generated assembly - not fun (no unsafe either, but still).
Some 12-13 years back (time does fly), Cliff Click (HotSpot architect) had a series of blog posts on optimizations, including the lattice checks (i.e. within bounds). It was quite insightful. The series was called "Too Much Theory" [0]
>What is optimized today may not be tomorrow.
Exactly. (Also, most developers will have an exceptionally hard time maintaining such code.)
I understand where you are coming from, but without comparing the generated assembly, we are comparing an implementation of bounds checks. I think we should have a bounds check instruction that operates concurrently.
I should play around with this using a couple of RISC-V cores.
Bounds checks are not 'free', but predicted ones are cheap. There are ways to convince the JIT to remove them - in cases where it can prove the index doesn't exceed the byte array's length - so explicit checks against the length of the array may improve performance.
Things get worse with DirectByteBuffers, as the JIT has to work harder. Unsafe allows you to 'remove' all bounds checks, but it may prevent some other optimizations.
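For what it's worth, a minimal sketch of that explicit-check idea (a hypothetical method, not from any particular codebase):

static int sum(byte[] a, int from, int len) {
    // One explicit range check up front gives the JIT a chance to prove that
    // every a[from + i] below is in bounds and drop the implicit checks.
    // Whether it actually does varies by JVM version, so verify the assembly.
    java.util.Objects.checkFromIndexSize(from, len, a.length);
    int s = 0;
    for (int i = 0; i < len; i++) {
        s += a[from + i];
    }
    return s;
}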
I see this mentioned a few times in this thread, but I haven't experienced this in practice (and I've written a lot of unsafe in Java). Are there any examples of this?
There are tricks you can do, but it introduces load on the TLB. You basically restrict accesses to a 32-bit address space and isolate that space in virtual memory with its own prefix.
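If I've understood that trick correctly, a rough sketch of the shape (an assumed layout, not any specific runtime's implementation): reserve a 4 GiB region and mask offsets to 32 bits, so every access lands inside the region without a branch, at the cost of the large reservation and the TLB pressure mentioned above.

import sun.misc.Unsafe;

class MaskedRegion {
    private static final long MASK = 0xFFFF_FFFFL;   // forces offsets into [0, 2^32)
    private final Unsafe unsafe;                     // obtained via the usual theUnsafe reflection
    private final long base;                         // start of the isolated region (its "prefix")

    MaskedRegion(Unsafe unsafe) {
        this.unsafe = unsafe;
        // Real implementations reserve address space lazily; allocateMemory is
        // just the simplest stand-in for a sketch.
        this.base = unsafe.allocateMemory(1L << 32);
    }

    byte get(long offset) {
        // The mask replaces the bounds check: any offset stays inside the region.
        return unsafe.getByte(base + (offset & MASK));
    }
}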
Perf impact due to bounds checks can be experienced in statically compiled languages as well (like Go/Rust). Although branch predictor improvements likely narrow the gap, they do not eliminate it.
Code cache is not completely relevant afaik - you can easily replicate in micro-benchmarks.
I don’t think it’s relevant here, the JIT compiler can do the same optimizations here.
If the branch predictor can basically guess correctly ~100% of the time (which will be the case in any correct program), it should not have any additional cost besides taking up space in the i$, so I would assume that is responsible for the difference.
> the story hasn't changed an awful lot since 2004
Um... yeah it has. For starters, hotspot wasn't even a part of the JVM at that point. But further newer JVM additions like the enhanced for loop eliminate a ton of conditions where someone would run into bounds checking. Doing a naked `a[i]` is simply not common java code.
The JVM is far more likely today to remove the bounds check altogether than it ever was in 2004.
> Doing a naked `a[i]` is simply not common java code.
It is extremely common in performance-sensitive code:
1) graphics & rendering
2) networking
3) buffers
> But further newer JVM additions like the enhanced for loop eliminate a ton of conditions
> The JVM is far more likely today to remove the bounds check all together than it ever was in 2004.
There are more comments in this thread that clarify further, but Java is very commonly unable to eliminate bounds checks. You can test all of these things yourself with a quick benchmark - don't take my word for it! The JIT is not as great at this as common rhetoric claims it is.
It's a writing error on my end (clear since I qualify the percentage afterwards), so I think focusing on it distracts from the point (which is why I say it's pedantic).
2004 had java 1.4 + hotspot. 1.5 (w/ generics and stuff) was about to be released. Hotspot was there, so were the Direct Buffers.
Also, accessing byte arrays and direct buffers is extremely common - if you just do "business logic" jazz it does not happen much, though. However, every hashmap needs it; pretty much every hash lookup is a direct a[hash].
This problem is not going to go away so easily. Numerous core Java classes (like BufferedInputStream) use synchronized. I count 1600+ usages in java.base. The blocking issue means it's _much_ easier to accidentally run into this, rather than waving it away as an unlikely edge case.
I personally ran into this using the built-in com.sun webserver with a virtual thread executor. My VPS only has two CPUs, which means the FJP that virtual threads run on only has 2 active threads at a time. I ran into this hang when some of the connections hung, blocking any further requests from being processed.
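Not the actual server code, but a minimal sketch of the failure mode (the scheduler-parallelism property is only set to mimic a 2-CPU box): a virtual thread that blocks inside a synchronized block pins its carrier, so a couple of stuck connections can stall everything else.

class PinningSketch {
    public static void main(String[] args) throws InterruptedException {
        // e.g. run with -Djdk.virtualThreadScheduler.parallelism=2
        for (int i = 0; i < 2; i++) {
            Object lock = new Object();
            Thread.startVirtualThread(() -> {
                synchronized (lock) {        // inside synchronized...
                    sleepQuietly(10_000);    // ...a blocking call pins the carrier thread
                }
            });
        }
        Thread.sleep(100);
        // With both carriers pinned, this virtual thread isn't scheduled until a sleeper wakes.
        Thread t = Thread.startVirtualThread(() -> System.out.println("handled"));
        t.join();
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
    }
}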
As the JEP states, pinning due to synchronized is a temporary issue. We didn't want to hold off releasing virtual threads until that matter is resolved (because users can resolve it themselves with additional work), but a fix already exists in the Loom repository, EA builds will be offered shortly for testing, and it will be delivered in a GA release soon.
Those who run into this issue and are unable or unwilling to do the work to avoid it (replacing synchronized with j.u.c locks) as explained in the adoption guide [1] may want to wait until the issue is resolved in the JDK.
I would strongly recommend that anyone adopting virtual threads read the adoption guide.
The problem is that it's rare to write code which uses no third-party libraries, and these third-party libraries (most written before Java virtual threads ever existed) have a good chance of using "synchronized" instead of other kinds of locks; and "synchronized" can be more robust than other kinds of locks (no risk of forgetting to release the lock, and on older JVMs, no risk of an out-of-memory while within the lock implementation breaking things), so people can prefer to use it whenever possible.
To me, this is a deal breaker; it makes it too risky to use virtual threads in most cases. It's better to wait for a newer Java LTS which can unmount virtual threads on "synchronized" blocks before starting to use it.
> have a good chance of using "synchronized" instead of other kinds of locks; and "synchronized" can be more robust than other kinds of locks (no risk of forgetting to release the lock, and on older JVMs, no risk of an out-of-memory while within the lock implementation breaking things),
I haven't professionally written Java in years; however, from what I remember, synchronized was considered evil from day one. You can't forget to release it, but you had better go out of your way to allocate an internal object just for locking, because you have no control over who else might synchronize on your object - and at that point you are only a bit of syntactic sugar away from a try { lock.lock(); } finally { lock.unlock(); }.
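For illustration, the two patterns side by side (a toy Counter, not anyone's real code):

class Counter {
    private final Object lock = new Object();   // internal object: nobody outside can synchronize on it
    private int n;

    void increment() {
        synchronized (lock) { n++; }
    }
}

// ...which really is only a bit of syntactic sugar away from:
class CounterWithLock {
    private final java.util.concurrent.locks.ReentrantLock lock =
            new java.util.concurrent.locks.ReentrantLock();
    private int n;

    void increment() {
        lock.lock();
        try { n++; } finally { lock.unlock(); }
    }
}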
The fact that the monitor is public rarely causes issues, and in those cases where it's used on internal objects, it's not really public anyhow.
There's an additional benefit to using the built in monitors, and that has to do with heap allocation. The data structure for managing it is allocated lazily, only when contention is actually encountered. This means that "synchronized" can be used as a relatively low cost defensive coding practice in case an object which isn't intended to be used by multiple threads actually is.
Is there a similarly low-level synchronization mechanism that doesn't work this way? .NET's does the same thing.
I guess I might have preferred if both Java and .NET had chosen to use a dedicated mutex object instead of hanging the whole thing off of just any old instance of Object. But that would have its own downsides, and the designers might have good reason to decide that they were worse. Not being able to just reuse an existing object, for example, would increase heap allocations and the number of pointers to juggle, which might seriously limit the performance of multithreaded code that uses a very fine-grained locking scheme.
In .NET, async won where lock and Mutex do not work (lock is like synchronized, though not exactly the same). That's why most libraries use SemaphoreSlim, which works with green threads. But that's more because of the ecosystem. I've barely stumbled upon locks, and Mutex is mostly used in the Main method, since it acquires a real OS mutex - not really a cheap thing, but for GUIs it's clever for checking whether the app is already running. Most libs that use System.Threading.Tasks use SemaphoreSlim, though.
Yeah, definitely. But for a fair comparison I think you have to look at how .NET did things before async/await hit the scene. And, for that, the aspect of the design in question is quite similar between the two.
Early .Net is hardly an independent data point from early Java. Not only was .Net directly influenced by Java, it also had to support a direct migration from the Microsoft JVM specific Visual J++ to J#.
The handful of languages I know either do not have a top level object class that supports a randomized set of features ( C++ ) or prioritize a completely different way of concurrent execution ( Python, JavaScript ).
Hi Ron. Thanks a lot for the amazing work you are doing on Loom and the whole JVM platform. Can the EA builds and GA release you mentioned make it into 22, or did you mean an EA build for 23?
We make scalable graphics rendering servers to stream things like videogames across the web. When we started the project to switch to virtual threads we had that as number one on the big board. "Rewrite for reentrant locks."
Maybe we have more fastidious engineers than a normal company would since we are in the medical space? But even the juniors were reading and familiarizing themselves on how to properly lock in loom's infancy.
All that only to point out that, yes, they had communicated the proper use of reentrant locks long ago.
I do understand what you're saying from an engineering management perspective though. That effort cost a fortune. Especially when you have the FDA to deal with.
It was more than worth it though! In the world of cloud providers, efficiency is money.
We use the same technologies to deliver, say, remote CT review capability, that you would use to stream a videogame. It's just far more likely that the audience I'm communicating with, HN, is familiar with the requirements of videogame streaming, than it is that they are familiar with remote medical dataset viewing. Obviously the requirements or our use case are far more stringent, but no need to go into all that to illustrate the point made.
1 - Use virtual threads with reentrant locks if you need to do "true heavy" scaling.
2 - Kind of implied, but since you gave the opportunity to make it explicit with your comment =D, there is no need to waste your life on earning no money in videogames when the medical industry is right there willing to pay you 10x as much for the same skills. (Provided your skill is in the hard backend engine and physics work. They pay more for the ML too, if I'm being honest.)
In the "Virtual Threads: An Adoption Guide" part there is:
When using virtual threads, if you want to limit the concurrency of accessing some service, you should use a construct designed specifically for that purpose: the Semaphore class.
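A small sketch of what that suggestion looks like in practice (callBackend is a hypothetical blocking call, not an API from the guide): keep spawning virtual threads freely and gate the scarce resource with a Semaphore instead of bounding a thread pool.

import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

class LimitedConcurrency {
    static final Semaphore PERMITS = new Semaphore(10);   // at most 10 concurrent backend calls

    static String callBackend(String request) throws InterruptedException {
        PERMITS.acquire();
        try {
            return "response to " + request;   // stand-in for the real blocking call
        } finally {
            PERMITS.release();
        }
    }

    public static void main(String[] args) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 1_000; i++) {
                int id = i;
                executor.submit(() -> {
                    try {
                        callBackend("req-" + id);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }   // close() waits for all submitted tasks to finish
    }
}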
That language only obliquely mentions the issue. It is nowhere near clear and direct enough for someone who is just, for example, using a third-party library that is affected. And then it's stuck inside detailed documentation that anyone who wasn't personally planning on adopting virtual threads is unlikely to read.
This seems like it's at least vaguely headed in the direction of that famous scene from early in The Hitchhiker's Guide to the Galaxy:
“But the plans were on display…”
“On display? I eventually had to go down to the cellar to find them.”
“That’s the display department.”
“With a flashlight.”
“Ah, well, the lights had probably gone.”
“So had the stairs.”
“But look, you found the notice, didn’t you?”
“Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard.’”
I would like to take this opportunity to thank pron and the amazing jdk developers for working on a state of the art runtime and language ecosystem and providing it for free. Please ignore the entitled, there are many many happy Dev's who can't thank you all enough.
People always forget that things that only happen every few million times, can happen fairly frequently on a busy server. This has bitten me numerous times. The nature of a lot of these types of issues is that they are hard to detect and hard to reproduce.
Virtual threads are nice for unblocking legacy code, but they aren't without issues. There are better options for new code with fewer trade-offs on the JVM as well. I've recently been experimenting with jasync-postgresql (there's a MySQL variant as well) as an alternative to JDBC in Kotlin. It's a nice library. It does have some limitations and is a bit on the primitive side. But it appears to be somewhat widely used in various database frameworks for Scala, Java, and Kotlin.
Databases and database frameworks are an area on the JVM where there just is a huge amount of legacy code built on threads and blocking IO. It's probably one of the reasons Oracle worked on virtual threads as migrating away from these frameworks is unlikely to ever happen in a lot of code bases. So, waving a magic wand and making all that code non blocking is very attractive. But of course that magic has some hard limitations and synchronize blocks are one of those. I imagine they are working on improving that further.
> Virtual threads are nice for unblocking legacy code but they aren't without issues. There are better options for new code with less trade offs on the jvm as well.
The designers of Project Loom would say the exact opposite. The whole push behind Project Loom and similar models (Go's oft-praised "goroutines" runtimes being another one) is motivated by Threads being a much better fit for async behavior in a fundamentally procedural language like Java or Go than promise-based frameworks like async/await.
The whole motivation of Project Loom is to make the simple thing (spawning threads to handle blocking IO) the fast thing as well (by actually replacing the blocking IO with efficient async IO OS calls and managing the threads internally). Project Loom will be considered a full success if the next-generation Java web server does something akin to "new Thread(() -> { executeHandlerFunc(conn); }).start();" for each incoming connection, just like the Go built-in web server.
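A hedged sketch of that shape with virtual threads (handleConnection is a stand-in, not a real framework API): the naive thread-per-connection structure is also the scalable one.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

class ThreadPerConnectionServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                Socket conn = server.accept();
                Thread.ofVirtual().start(() -> handleConnection(conn));
            }
        }
    }

    static void handleConnection(Socket conn) {
        try (conn) {
            // Blocking reads/writes here park the virtual thread, not an OS thread.
            conn.getOutputStream().write("HTTP/1.0 200 OK\r\n\r\nhello\r\n".getBytes());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}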
I think it's not that black and white. Clearly they made a choice to be backwards compatible. Not because Java Threads have a nice API (not even close) but because a lot of legacy code that will never be changed uses it. Including all the ugly bits that you shouldn't be using. Like a lot of the low level synchronization primitives that date back to the early days of Java. It's an impressive bit of work but they made some compromises to make things work. A new API would have been easier, would have had less overhead, and be nicer to use. But backwards compatibility with legacy code was a big goal.
It mostly works fine and it's an impressive bit of engineering. But it has some really ugly failure modes in combination with hacky legacy code designed for real threads. So, you can't blindly assume things to just work. Hence the deadlocks.
Many Java servers already work the way you outline. It's just that they are a bit tedious to use with the traditional Java frameworks. Which is one reason I like using Spring's webflux with Kotlin instead. Just way nicer when it's all exposed via co-routines.
There are two separate choices. One is the choice of whether to implement green threads in the JVM at all, or whether to use async/await, or some other type of concurrency primitive. The other is whether to expose the new concurrency primitive using a new API or an existing one.
You could say the second choice, the specific API, was done, at least to some extent, for backwards compatibility reasons. I wouldn't agree, but I think there is at least some argument to be made. Here is one of the designer's explanation [0]:
> We also realized that implementing the existing thread API, so turning it into an abstraction with two different implementations won't add any runtime overhead. I also found that when talking about Java's new user mode threads back when this feature was in development, and back when we still called them fibers, every time I talked about them at conferences, I kept repeating myself and explaining that fibers are just like threads. After trying a few early access releases of the JDK with a fiber API, and then a thread API, we decided to go with the thread API.
However, the choice of adding a new concurrency primitive to Java in the form of green threads instead of others was very very clearly not done for backwards compatibility's sake. Ron Pressler (who is active here as 'pron') has several talks on the advantages of green threads over async/await that you can look at [0][1]. The designers of Go also had the same belief, and also chose to add green threads as the fundamental built-in concurrency primitive in Go, obviously not for backwards compatibility reasons in their case.
There might be some justification for comparing any one particular thing to the worst possible particular thing if those things have something in common. The only feature the two things you picked have in common is the word 'java'.
Appeal to expertise.
Appeal to authority is a fallacy when the authority is not an expert in the requisite domain.
eg: we don't care what a policeman thinks about astrophysics, we do care what the astrophysicist says.
My understanding is that the highest-performance webserver is nginx. And it uses async internally.
IMO, virtual threads is a better general purpose language feature because it avoids function coloring and is generally easier to reason about, but it may not result in the highest performance Java webserver.
NGINX is a native C implementation, so it has to be carefully written to use the OS's native high-performance IO and native OS threads.
The purpose of project Loom is to abstract that away from Java application code. The runtime can use the most efficient IO for the given platform (ideally io_uring on Linux or IOCP on Windows, for example) even if the application code calls the old blocking File.Write(). The application can then use simple APIs and code patterns, but still get massive performance.
With Loom, you can easily have 20,000 virtual threads servicing 20,000 concurrent HTTP requests and each "blocked" in IO, while only using, say, 100 OS threads that are polling an IOCP. A normal Linux box can typically only handle around maybe 1000 threads across all running processes.
Most application webservers (by default) handle one request per thread. For mostly IO bound stuff (which many projects are), it makes sense to me that threads become a bottleneck in relatively ordinary scenarios.
The scenario where your IO could handle way more than a thousand concurrent requests if only the thread overhead was reduced? When does that ever happen?
Each OS thread costs memory. With the version of Java I have, the default is to allocate 1MB of stack for each thread. So, 10,000 threads would require 10,000 MB of RAM even if we configured ulimit to allow that many threads. In contrast, asking the kernel to do buffered reads of 10,000 files in parallel requires much less memory - especially if most of those are actually the same physical file. Of course, they won't be read fully in parallel.
For example, this program:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.FileSystems;

public class Test {
    public static void main(String[] args) throws InterruptedException {
        // Start 20,000 virtual threads, each copying abc.txt to stdout.
        var threads = new Thread[20000];
        for (int i = 0; i < 20000; i++) {
            threads[i] = Thread.ofVirtual().start(() -> {
                try {
                    Files.copy(FileSystems.getDefault().getPath("abc.txt"), System.out);
                } catch (IOException e) {
                    System.err.println("Error writing file");
                    e.printStackTrace();
                }
            });
        }
        // Wait for all of them to finish.
        for (int i = 0; i < 20000; i++) {
            threads[i].join();
        }
    }
}
Run as `java Test > ./cde.txt` takes about 4.5s to run on my WSL2 system with 2 cores, writing a 2 GB file (with abc.txt having 100KB); even this would be within the HTTP timeout, though users would certainly not be happy. Pretty sure a native Linux system on a machine beefy enough to be used as a web server would have no problem serving even larger files over a network like this.
1. You are not solving a real problem. The use case you describe (basically a CDN) is already exotic, the scenario where such a system would have already been implemented with Java and its basic IO seems implausible.
2. You did not compare against fewer threads to see if threads are actually the bottleneck rather than IO. Also, all your threads are competing for stdout.
The lack of support for synchronized isn't a fundamental or hard limit, it's just that the HotSpot implementation is complicated for performance reasons and they put off rewriting that code until later. They're indeed working on that now and in some future version I guess wait/notify and synchronized blocks will start to work. After all, you can easily transform such code into an equivalent that does work.
The system property jdk.tracePinnedThreads triggers a stack trace when a thread blocks while pinned. Running with -Djdk.tracePinnedThreads=full prints a complete stack trace when a thread blocks while pinned, highlighting native frames and frames holding monitors. Running with -Djdk.tracePinnedThreads=short limits the output to just the problematic frames.
After exploring a few constant access serialization formats, I had to pass on Capn Proto in favor of Apache Avro. Capn has a great experience for C++ users, but Java codegen ended up being too annoying to get started with. If Capn Proto improved the developer experience for the other languages people write, I think it would really help a lot.
The StreamObserver API came at a time (2015) when it seemed like RxJava was going to take over. That didn't end up happening, but the API is still around. While it is more cumbersome, some things are /impossible/ to do with the Go-style blocking. For example, try cancelling out of a Recv() call. The only way is to tear the entire Stream down. Goroutines never successfully married select {} and sync.Cond, or context cancellation. These are needed to successfully back out of a blocking statement. Unfortunately, that can't be done, and a goroutine that blocks is really stuck there. The only saving grace is that goroutines are relatively cheap (2-4K of memory?), and it's okay if a few O(100K) of them get stuck.
This is impractical a lot of the time. There is a class of problems where you want to put 2^N + 1 possible values in an N-bit container, and it won't ever fit cleanly. Null is that 1 extra value that won't fit cleanly.
Another case is an array based queue. It can be implemented with head+tail pointers, or size+offset. However, there will always be an ambiguity with either, because two words of memory aren't enough to represent all possible states of the queue.
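A toy version of that queue (my own sketch, not from any library) showing the ambiguity: with only head and tail, head == tail could mean either empty or full, so implementations either sacrifice a slot (as here) or carry a separate count.

class IntRingQueue {
    private final int[] buf;
    private int head, tail;           // next slot to read / next slot to write

    IntRingQueue(int capacity) {
        buf = new int[capacity + 1];  // one slot sacrificed to break the empty/full ambiguity
    }

    boolean isEmpty() { return head == tail; }
    boolean isFull()  { return (tail + 1) % buf.length == head; }

    boolean offer(int x) {
        if (isFull()) return false;
        buf[tail] = x;
        tail = (tail + 1) % buf.length;
        return true;
    }

    int poll() {
        if (isEmpty()) throw new IllegalStateException("empty");
        int x = buf[head];
        head = (head + 1) % buf.length;
        return x;
    }
}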
What isn't clear to me is why tail calls need to be implemented in WASM, rather than in the compiler? The post linked to Josh Haberman's post on tail calls, which shows how tail calls can help the compiler decide where to inline (cool!). But that was needed for the C++ text, not the LLVM code. It feels like tail calls are too high-level a concept to be in an "assembly" language.
If a function calls itself using tail-recursion, a compiler can turn that into a loop without too much trouble. However, if it's tail-calling a different function then that becomes more difficult; it would have to merge the two functions into a single WASM function. And if the tail-call is indirect (through a function pointer) then it is impossible to turn into a loop.
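To illustrate the easy case, a sketch (sumTo is an arbitrary example of mine): a self tail call is just "update the arguments and jump back to the top", which is exactly a loop.

class TailToLoop {
    // Tail-recursive form: the recursive call is the last thing the function does.
    static long sumTo(long n, long acc) {
        if (n == 0) return acc;
        return sumTo(n - 1, acc + n);
    }

    // Equivalent loop a compiler could emit instead; no stack growth.
    static long sumToLoop(long n, long acc) {
        while (n != 0) {
            acc = acc + n;
            n = n - 1;
        }
        return acc;
    }
}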
> It feels like tail calls are too high level of a concept to be in an "assembly" language.
WASM is a bit higher level than typical assembly languages. It doesn't have unrestricted "goto", so there's no way to implement tail calls optimization the "hard way".
> It feels like tail calls are too high level of a concept to be in an "assembly" language.
In x86, the difference is between `call some_fn` and `jmp some_fn` (perhaps after a `call _guard_check_icall` to implement Control Flow Guard)
In WASM, the difference is between `call[_indirect] some_fn` and `return_call[_indirect] some_fn`, which is just `jmp some_fn` with a funny name - and similar control flow integrity constraints as imposed by CFG. Why not just use `jmp some_fn`? Because WASM lacks a generic `jmp some_fn`, as they've baked https://webassembly.org/docs/security/#control-flow-integrit... into the arch, which seems fair enough.
To add to other reasons, my opinion is: WebAssembly right now is an extremely easy target for compilation. You can create your own compiler absolutely easily and get plenty of V8 optimizations for free. It's like LLVM, but probably much more accessible for hobby projects. You just parse text, build an AST, and dump the AST to WebAssembly.
If WebAssembly implemented tail calls, implementing them in a source language would require zero effort; they would just work.
But implementing tail calls using some kind of optimization, like rewriting a function into a loop, is far from easy.
So my personal hope is that WebAssembly will include features that are possible but very hard to develop in a hobby project. GC is another example. Even a primitive toy GC is a serious project.
Because the general tail call problem isn't self recursion - that is indeed trivial to flatten - but co-recursion, or recursion where you don't know the target.
Co-recursion (f calls g calls h calls f, etc.) requires you to create a single function containing the bodies of all functions that will be called via tail recursion, put them all in a loop, and have a single stack frame capable of holding all the independent frames concurrently. It's doable, but taking my dumb example, you'd need to do this for each of f, g, and h, or you'd have to convert those functions into wrappers for your one giant mega-function. The mega-function then has issues if it has non-tail-recursive calls that re-enter it, as its own frame is large. The stack frame is large because WASM is structured: the bytecode can't just treat the stack as a general-purpose blob of memory.
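A toy illustration of that flattening (isEven/isOdd are stand-ins for f, g, h): the mutually tail-recursive pair becomes one function with a state variable and a loop.

class CoRecursionFlattened {
    // Mutually recursive form: each call is a tail call into the other function.
    static boolean isEven(long n) { return n == 0 || isOdd(n - 1); }
    static boolean isOdd(long n)  { return n != 0 && isEven(n - 1); }

    // Flattened form: one frame, one loop, and a variable saying which "body" we are in.
    static boolean isEvenFlat(long n) {
        int state = 0;                    // 0 = isEven body, 1 = isOdd body
        while (true) {
            if (state == 0) {
                if (n == 0) return true;
                n = n - 1; state = 1;     // "tail call" isOdd(n - 1)
            } else {
                if (n == 0) return false;
                n = n - 1; state = 0;     // "tail call" isEven(n - 1)
            }
        }
    }
}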
Dynamic recursion is a case where you quite simply cannot do compile-time flattening ahead of time, e.g.
int f(int (*g)()) {
return g();
}
Languages that directly target machine code can do this because they just have to rebuild the stack frame at the point of the call and perform a jump rather than a call/bl instruction. WASM bytecode can't do that, as it needs to be safe: the stack is structured and unrestricted jumps aren't an option.
Now it's not uncommon for the various JS/WASM engines to perform tail call optimizations anyway, but the important things that many languages need is guaranteed tail calls, e.g. a way to say "hello runtime, please put the work into making this a tail call even if you haven't decided the function is hot enough to warrant optimizations, as I require TCO for correctness".
For example, let's imagine this silly factorial function:
function factorial(n, accumulator = 1) {
if (n <= 1) return accumulator;
return factorial(n - 1, accumulator * n);
}
in the absence of guaranteed TCO, this might fail for a large enough argument n. If the function is called enough, a JIT might choose to run some optimization passes that perform TCO, and then this works for any value of n. So for correctness it is necessary to be able to guarantee TCO will occur. In WASM that's apparently [going to be?] an annotated call instruction, which is what .NET's VM does as well. The JS TCO proposal a few years back simply said something like "if a return's expression is a function call, the call must be tail recursive".
WASM is not really an assembly language. Before this, WASM didn't have jump at all, and so tail calls are adding a form of jump (jump with arguments). This makes WASM a much better compilation target.
It doesn't and I hope whoever is in charge of web assembly doesn't ruin what they have by giving in to nonsense trends that are part of a silver bullet syndrome and don't offer real utility.
For example, it's done by all of the Schemes, all of the MLs (SML, Ocaml, Haskell, Rust, Scala, Idris, etc.), scripting languages (Lua, Elm, Perl, Tcl, etc.), and many others.
> that are part of a silver bullet syndrome and don't offer real utility
Their "real utility" is right there in the subtitle of that original paper:
Debunking the "Expensive Procedure Call" Myth
or, Procedure Call Implementations Considered Harmful
or, Lambda: The Ultimate GOTO
In other words, any implementation of procedure/function/method-calls which doesn't eliminate tail-calls is defective (slow, memory-hungry, stack-unsafe, etc.)
Not standard, since almost no programming is actually done using tail calls. Programming is done with loops, tail calls are an exotic and niche way of working.
Even in Lua and Rust tail calls are rarely used, and the other languages you listed are extremely niche.
Debunking the "Expensive Procedure Call" Myth
or, Procedure Call Implementations Considered Harmful
or, Lambda: The Ultimate GOTO
These are not explanations of utility, these are titles that you are using to claim something without backing it up.
In other words, any implementation of procedure/function/method-calls which doesn't eliminate tail-calls is defective (slow, memory-hungry, stack-unsafe, etc.)
This is a very bold claim with no evidence for it and pretty much the entire history of programming against it.
Also, it was literally the title of a paper. If citing the literature with a hyperlink to wikisource doesn't count as "backing it up", then I have no idea where you put the goalposts.
> almost no programming is actually done using tail calls
Literally every function/procedure/method/subroutine/etc. has at least one tail position (branching allows more than one). It's pretty bold to claim that there are 'almost no' function calls in those positions. I wouldn't believe this claim without seeing some sort of statistics.
> Programming is done with loops, tail calls are an exotic and niche way of working.
Loops have limited expressiveness; e.g. they don't compose, they break encapsulation, etc. Hence most (all?) programs utilise some form of function/method/procedure/subroutine/GOTO. Tail-calls are simply a sub-set of the latter which, it turns out, are more powerful and expressive than loops (as an obvious example: machine-code doesn't have loops, since it's enough to have GOTO (AKA tail-calls)).
That wasn't 'me claiming something', it was Guy L Steele Jr.
You still just copied and pasted titles; this isn't evidence of anything.
If citing the literature
Then put in the part you think is evidence or significant. This is the classic "prove my point for me" routine. You're the one who wants to change a standard.
Literally every function/procedure/method/subroutine/etc. has at least one tail position
Are you conflating general functions with tail call elimination?
Loops have limited expressiveness; e.g. they don't compose, they break encapsulation,
Why would that be true? How would looping through recursion change this?
Hence most (all?) programs utilise some form of function/method/procedure/subroutine/GOTO
What does this have to do with tail call optimizations? Web assembly has functions.
machine-code doesn't have loops,
Web assembly is not the same as machine code
I think overall you are thinking that making claims is the same as evidence. You haven't explained any core idea why tail call optimizations have any benefit in programming or web assembly. You basically just said a well known language creator put them in some languages. There is no explanation of what problem is being solved.
Indeed; if web assembly allowed unrestrained GOTOs (like machine code) then compilers would already be able to do tail-call elimination.
---
> Are you conflating general functions with tail call elimination?
> What does this have to do with tail call optimizations?
> You haven't explained any core idea why tail call optimizations have any benefit
Sorry, I think there have been some crossed wires: I was mostly pointing out the absurdity of your statement that "almost no programming is actually done using tail calls" (when in fact, almost all programs will contain many tail-calls).
That's separate to the question of how tail-calls should be implemented: in particular, whether they should perform a GOTO (AKA tail-call elimination/optimisation); or, alternatively, whether they should allocate a stack frame, store a return address, then perform a GOTO, then later perform another GOTO to jump back, then deallocate that stack frame, etc.
> You haven't explained any core idea why tail call optimizations have any benefit in programming or web assembly
Based on my previous sentence, I would turn the question around: what benefit is gained from allocating unnecessary stack frames (which waste memory and cause stack overflows), performing redundant jumps (slowing down programs), etc.?
almost no programming is actually done using tail calls
I'm talking about tail call optimization, which is what this whole thread was about, what you are you talking about?
Based on my previous sentence, I would turn the question around:
That's not how it works, since you're the one wanting a standard to change.
what benefit is gained from allocating unnecessary stack frames
Where are you getting the idea that the webasm JIT has to allocate 'unnecessary stack frames'?
stack frames (which waste memory and cause stack overflows)
This makes me think you don't understand how the stack even works. The memory is already there and stack memory is very small. You can't both 'waste memory' and overflow the stack at the same time. This stuff is fundamental to how computers work.
It seems like you don't even know or understand what problem you are trying to solve. Where did you get this absurd idea that the stack is a problem? Also do you think that rust doesn't use a stack?
Reputation systems should be based on /abuse/, not on automation. I also ended up on the naughty list for running an archival scraping program. Trying to preserve part of the Internet is apparently against the rules. It's really a shame because my code honors rate limits, doesn't spam, and is completely docile.
Cloudflare has mixed up the definitions of "bot" and "abuse". Tor users may or may not be bots, but as long as they don't abuse (spamming or DoS), they ought to be treated the same.
It wasn’t framed as an opinion. And even if it was, I’m saying I think it is wrong and I want to know why I should change my mind.
The fact is that CloudFlare handles abuse (DDoS at IP layers 3 and 4) completely separately from bot detection. And it gives domain owners controls to allow some bots, like the Google Search Crawler.
So my statement stands: I want to see a citation of evidence that CloudFlare doesn’t have the ability to distinguish abuse.