It's TCP vs. RPC All over Again (systemsapproach.substack.com)
154 points by zdw on Feb 21, 2023 | 112 comments


This paragraph is just wrong:

"Another explanation is that the Internet has unnecessarily coupled the transport protocol with the rest of the RPC framework. Conflating the two naturally follows from the purpose-built examples I just gave: SMTP is bundled with MIME; SNMP is bundled with MIB; and HTTP is bundled with HTML. "

I think at this point he's suffering from what old-timers used to call recto-cranial inversion.


Agreed. MIME came after SMTP, for example, and HTTP is used for all sorts of not-HTML-or-hyper-anything stuff. That statement smacks of "wow, HTTP is a pretty good RPC, but too bad it's only/mainly for transporting HTML, shucks!". The protocol's name is no longer representative of how it's used or how anyone should use it. If HTTP fits the bill, then use it and move on.


MIME not only arrived after SMTP (the 'E' in MIME stands for 'extensions'); it doesn't infect the SMTP part of the stack. There's nothing in SMTP that would have to change if MIME were abolished. SMTP provides no special support for MIME. They are disjoint.

Since HTTP/1.1, HTTP hasn't been premised on HTML, and vice-versa; again they are disjoint.

I don't know why the author makes these claims.


And I keep saying that SMTP is a perfectly good message-queue broker message submission protocol (and POP a perfectly good message-queue broker:consumer protocol), but nobody ever listens. :P


I've not just used SMTP the protocol as a submission protocol, but used actual mail servers as the broker, in a production system.

Back in '99 I co-founded a mail provider, and we got so used to abusing qmail (+ rewritten replacement components - any qmail installation becomes a Ship of Theseus thanks to the clear API between small components) that when we needed a message broker we used our internal DNS server for service discovery and our mail infra as the broker (and we had a DNS server where the zones could be updated via SMTP). It worked well, and let us reuse a lot of tooling.


> and we had a DNS server where the zones could be updated via SMTP

Interesting — what was the message format (MIME type) you chose to use to describe the zone changes? Actually, were they even patch-files at all, or did you just fetch the zone data file from your DNS server, update it client-side, and then send an updated complete copy back in the body of the email message, to get de-enveloped directly into /var/named?

With the benefit of hindsight, I'm guessing you'd agree that it would have been simpler to use an LDAP directory for service discovery; then you'd at least have a specced-out message format for updates (LDIF) to be sending to the queue. But then, OpenLDAP as a project only got started in 1998, so maybe you weren't even aware of its existence back then, and were stuck on YP/NIS for Intranet directory management (which has no similar useful extensibility.)


The DNS for service discovery was fairly static, and not what we updated via SMTP.

We treated the DNS server in question simply as a key-value store where a message would set/replace a specific record. The use case was that we were setting a handful of records when people registered a domain, so those DNS servers were user-visible, unlike the service discovery.

The main reason for not considering LDAP at the time for the user-facing DNS server, besides the lack of maturity of OpenLDAP, was that using SMTP gave us one single unified mechanism for dispatching messages to all parts of the system, which also meant it was trivial to proxy, intercept, and log the full message flow with the same mechanism as needed.


To be clear, I'm not saying you should have used LDAP as a changeset submission protocol instead of SMTP; I'm saying you should have used LDIF as a changeset format for sending via SMTP, instead of inventing your own changeset description format to send over SMTP. :)
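
For reference, an LDIF changeset for that kind of single-record replacement is tiny. Something like this (the DN layout and the `aRecord` attribute are illustrative, assuming a dNSZone-style LDAP schema for DNS data, not GP's actual setup):

```ldif
dn: relativeDomainName=www,zoneName=example.com,dc=dns,dc=example,dc=com
changetype: modify
replace: aRecord
aRecord: 192.0.2.1
```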

> We treated the DNS server in question simply as a key-value store where a message would set/replace a specific record.

Interesting! That doesn't sound like any DNS server I'm aware of. In BIND et al, a zone was a file, and to update a zone you needed to modify a zone file and SIGHUP the daemon; which in turn meant that whatever you were using to mechanistically modify zones, needed to be able to parse and regenerate an entire zone file at a time. (It's actually a lot like what's required to mechanistically update-patch Kubernetes YAML resource manifests, now that I think about it.)

Or do you mean that you had exactly one record per zone in this key-value store? If so, how did that work exactly? Zones need at least an SOA record, no? Or, I guess, maybe not, if this nameserver was only exposed within the Intranet, without the requirement of being an authoritative source that other DNS servers could fetch and cache from.


The DNS server was our own.

Writing an authoritative-only DNS server is trivial (a few hundred lines), and since it knew it was always authoritative, it'd synthesise a SOA record for every zone it answered for.
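
As a rough illustration of how little an authoritative-only responder needs (a toy sketch, not their code: it answers every A query with one fixed address and always sets the AA bit):

```python
import struct

def build_a_response(query: bytes, ip: str) -> bytes:
    """Answer any A query authoritatively with a fixed address (toy sketch)."""
    txid = query[:2]
    # QR=1 (response), AA=1 (always authoritative), RCODE=0
    flags = struct.pack("!H", 0x8400)
    # 1 question, 1 answer, 0 authority, 0 additional
    header = txid + flags + struct.pack("!HHHH", 1, 1, 0, 0)
    # Copy the question section verbatim: walk the length-prefixed labels
    i = 12
    while query[i] != 0:
        i += 1 + query[i]
    question = query[12:i + 5]  # labels + root byte + QTYPE + QCLASS
    # Answer: compression pointer to the name at offset 12, type A, class IN
    answer = struct.pack("!HHHIH", 0xC00C, 1, 1, 300, 4)
    answer += bytes(int(octet) for octet in ip.split("."))
    return header + question + answer
```

Wrap that in a UDP recvfrom/sendto loop and you have the skeleton of the server being described; the real one obviously consulted its key-value store instead of returning a constant.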

The zones had a very minimal set of records: Either they had NS records pointing to where the user wanted the domain delegated, or to us. If they pointed to us they had an A and MX record which might point to us or to wherever the customer wanted.

It was not exposed only within the intranet (the service discovery DNS was exposed only via the intranet; separate thing).


In hindsight everything should be HTTP, and there should be JSON schemas as an option for everything. Not needing the bloat of an LDAP implementation is very nice. Schema differences between sites and schema metastasis into code in ways that complicate schema evolution is a big deal too, so "stored procedures" are really a good idea for containing schema metastasis.

I'm not sure why GP would need service discovery for DNS zone updates via SMTP though.


Mail servers have a bunch of tooling for routing, retries, queue management etc. that already exists. If you do that via HTTP you end up having to reinvent a full message broker stack just for the sake of using a different protocol that lacks the abundance of components you can just drop in to do things like reflection (mailing lists), forwarding, filtering, retrieval, deletion, access control that you get for free with a mail server.

Maybe if that entire ecosystem existed on top of HTTP it'd have been ok, but the actual transport is the tiniest little uninteresting sliver of what we got by using a mail server for queuing.

With respect to the DNS servers, the service discovery was for addressing, so we/a client component could e.g. e-mail "update@dns.live.local" or similar. The SMTP updated one was not the service discovery one, but feeding our customer facing DNS infrastructure.


There are several ways of doing service discovery for HTTP services like this in an enterprise, using 3xx replies to redirect if you move the service. The simplest is to just use a URI authority that you can change as needed. The next simplest is to publish URI RRs in DNS. The next simplest is to use the .well-known URI namespace to include information for client-side routing. Granted, the last two are recent developments.


Or we could just use DNS A or CNAME records and not have to do any of those, the way e.g. CoreDNS with k8s, or ldapdns on top of OpenLDAP, and similar do.

There are use cases for more complicated service discovery, especially discovery of external services, but for the scenario I described, which involved discovery within our own network, this all would just have added extra complexity for no benefit that would have made any difference to us.

Note that service discovery was not listed as one of the reasons I think HTTP doesn't solve the issues that made us choose a mail server back then; the exact same service discovery mechanism would have worked just fine for HTTP too. But we'd still have had to build all the message broker code from scratch instead of picking and choosing from pre-existing components.


I'm not saying that you shouldn't have used SMTP. It really is appropriate for this sort of thing. If you wanted an HTTP interface you might still make it inject messages into SMTP, with HTTP being only used for a browser UI, say, or with the HTTP server as a proxy for some reason.

My objection had been to LDAP. LDAP is too heavy-weight by comparison to HTTP w/ a JSON API.


I listened, derefr. I listened. ;)


HTTP might now be used for other things but right from the beginning (check the RFCs) HTTP stood for "Hypertext Transfer Protocol", the RFC was written by Berners-Lee. HTTP and HTML did come together, and from the same person.


So? They aren’t bundled together as the article says they are. Even on the web HTTP is used for all sorts of other content types, from images to CSS, loading javascript, JSON RPC calls, websocket connections, server sent events and so on. Outside of the web? DoH, JSON RPC, git over https, npm, video games, elasticsearch queries, … the list goes on for a long time.

A very small percent of http requests transmit html.

Claiming http “bundles” html is like saying XmlHttpRequest is bundled with xml, or shortening javascript to Java. It’s a strong signal that you don’t know what you’re talking about.


The HTTP verbs don’t support RPC unless you violate their specified operations. Add DO to GET, PUT, POST, PATCH and DELETE along with expanded response codes and I’m good.


All those verbs are just constants that provide some behavior.

POST is the constant that means I'm going to probably send you a body and probably nobody is going to try to add caching, which is usually what you want from an RPC. You can just #define DO POST, and you're set.
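
In that spirit, a minimal sketch of POST-as-the-universal-verb: method name and arguments ride in a JSON body, and POST carries everything (the `PROCEDURES` table and envelope shape are made up for illustration, not any particular RPC standard):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy RPC server: every call is a POST whose JSON body names the procedure.
PROCEDURES = {"add": lambda a, b: a + b}

class RpcHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        call = json.loads(self.rfile.read(length))
        result = PROCEDURES[call["method"]](*call["params"])
        body = json.dumps({"result": result}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def rpc_call(url, method, params):
    """Client side: POST is the one verb; no caching layer will touch it."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"method": method, "params": params}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["result"]

server = HTTPServer(("127.0.0.1", 0), RpcHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"
```

Every load balancer, proxy, and TLS terminator already deployed works on this unchanged, which is largely the point.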


And POST can come with a response body. And it can either create a new entity or not. It's what you want for an RPC, yes. That's why SOAP used it. Of course, it's not "RESTful", but hey.


RPC usually means stateful client so REST goes out the window.


Browser clients of RESTful APIs can be stateless in a very stateful way because the DOM and JS allow the client to keep state. The statelessness of HTTP is supposed to be about the HTTP layer itself, not the application, and this is as true of the client as of the server.

Consider a database application where the client is a browser and the server is just something like PostgREST, and then you see that the client stores state impliedly in the DOM (what rows it's displaying, which affects what the user might do next) and that the server keeps state in the database even though PostgREST itself is stateless. The whole thing is RESTful (because REST is in the name of PostgREST, duh; just kidding, it really is RESTful as designed).

I think this RPC-means-stateful / REST-means-stateless dichotomy is just not quite right, and it leads to thinking like that in TFA. All RESTful really means is that you use appropriate HTTP verbs for the actions you're doing (especially HEAD and GET for reading so you can cache where caching makes sense), use HTTP status codes as much as possible, and give entities URIs (possibly derived on the fly) -- that's really it. If you have an RPC called `CreateWidgetyThing()` you can just as well have a POST to `/widgety-things` to create them -- it really is that simple.
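
That mapping is mechanical enough to write down as a table (the widgety-thing names are the hypothetical from above, not a real API):

```python
# RPC-style operation name -> (HTTP verb, URI template), derived mechanically.
ROUTES = {
    "CreateWidgetyThing": ("POST",   "/widgety-things"),
    "GetWidgetyThing":    ("GET",    "/widgety-things/{id}"),
    "UpdateWidgetyThing": ("PUT",    "/widgety-things/{id}"),
    "DeleteWidgetyThing": ("DELETE", "/widgety-things/{id}"),
}

def route(rpc_name: str, **args) -> tuple:
    """Resolve an RPC-style call to the verb and concrete URI it maps onto."""
    verb, template = ROUTES[rpc_name]
    return verb, template.format(**args)
```

The GET rows are the ones that buy you something: they are the calls caches and conditional requests can act on without knowing anything about widgety things.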


Why not use POST?


You can add verbs.


I think I feel dumber for reading this. (The referenced paper by Ousterhout is quite good though, and doesn’t confuse the terminology. Maybe that one’s what HN should link and discuss.)


To someone only weakly aware in this area, can you explain what is wrong with the statement? From a cursory view, these protocols indeed seem to normally carry the mentioned payloads.


It's not like HTTP stops working if you choose to send something besides HTML. Most of its features still work and make sense for JSON payloads and the usual building blocks still work (eg HTTPS, load balancers, proxies, status codes, redirects...)

There are a few bits that don't make sense eg cookies where the server sends something and the client is expected to remember it but they can just be ignored.


Exactly. The HTTP standard doesn't really mention HTML, except tangentially in examples. And regarding cookies, there are even people who use cookies in RPC contexts, but this is rather rare, and as you said, entirely optional.

Also, as a minor point, even when browsing the web, a minority of the actual requests deliver HTML -- most of them are for various kinds of media referenced by the HTML document.


There have been lots of RPC protocols. Here are some still in use.

Transport level:

* Sun RPC [1]. QNX still uses this. It can run over UDP or over raw Ethernet. It just transfers an array of bytes and gets an array of bytes back - marshalling is a higher level problem. It handles messages bigger than one packet, and retransmission. It's simple and performance is good, but there is no security.

* Stream Control Transmission Protocol. Telcos still use this. It's how Signalling System 7 is sent over IP.

Although many implementations exist for both, they've never been popular outside their niches.

Marshalling level:

* CORBA - you define interfaces in a special language and compile them. Data is not self-describing.

* SOAP, from the XML era.

* Google Protocol Buffers - another system where you compile definitions.

* HTTP/JSON - where we are now.

Plus a lot of Microsoft-specific stuff.

Either you have a system where both ends have to exactly agree on format, you have a verbose format with unneeded description data in every message, or you have a very complicated negotiation at connection time. This leads to much disagreement.
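
The size side of that trade-off is easy to demonstrate with a toy record (assuming a `!Id` wire layout both ends have agreed on in advance):

```python
import json
import struct

record = {"id": 7, "temp_c": 21.5}

# Self-describing: the field names travel in every single message.
self_describing = json.dumps(record).encode()

# Agreed format: 12 bytes on the wire, but both ends must
# compile in the exact same layout (network-order u32 + f64).
agreed = struct.pack("!Id", record["id"], record["temp_c"])
```

Even for this tiny record the self-describing form is roughly twice the size of the agreed one, and the gap widens with field-name length and nesting; that is the tax you pay to avoid the both-ends-must-agree problem.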

[1] https://en.wikipedia.org/wiki/Sun_RPC


> CORBA - you define interfaces in a special language and compile them. Data is not self-describing.

OMG the Common Object Request Broker Architecture (which coincidentally was created by OMG). I never thought I'd see that acronym again, and all these years later, I still don't understand what it is.


But is it not obvious? It’s the architecture for brokering requests of objects most common.


No no no, it's a Common broker architecture for object requests


Yeah, but the question at hand is why layer it over TCP.

SunRPC over UDP is close, but then you push reliability and congestion management into every application.


SCTP, from the parent's list, is a "over IP" protocol, not over TCP.

(Though it can be tunneled through UDP, to get around Internet ossification.)


> HTTP/JSON - where we are now.

That’s not a protocol either, you can use HTTP as an RPC (with a JSON payload), but you can also define a transport-independent JSON RPC protocol (e.g. JSON-RPC2, which can be used over HTTP but can also be used over simpler sockets, for instance chrome’s dev tools protocol is JSON-RPC2 over websocket or a pipe)


> Stream Control Transmission Protocol. Telcos still use this. It's how Signalling System 7 is sent over IP.

WebRTC is also a relatively thin wrapper over SCTP.


Yep, WebRTC data channels use SCTP (tunneled in UDP and DTLS), audio/video stuff is just SRTP (in DTLS) I think.


> SOAP, from the XML era.

Don't forget XML-RPC :-)


Remembered and now intentionally forgotten :-)


Ah, yes. I had to do XML-RPC from Rust recently. The login handshake for Second Life is XML-RPC. No other part of the system is.


If you'd like to watch Ousterhoust explain Homa and his vision of replacing TCP:

https://www.youtube.com/watch?v=o2HBHckrdQc

Spoiler: Ousterhout is right, and the data shows he is right. Maybe not everyone knows it yet, or it's the kind of "right" that not everyone will ever get to (and we'll settle for "good enough" for much longer), but he is almost certainly right.

Not to disparage the poster of this article of course -- I always love a good hot take or contrarian piece, but if you can't come with the same amount of data to support your thesis...

Homa is probably dismissable on UX grounds (are people really going to switch to it?), but it's clear that long, hard thinking has gone into the case for Homa (and its implementation).


I think you're right that he's "right", but the writing is shooting itself in the foot here. I agree that there's a mostly-unexplored middle between TCP & UDP, but you need to not utter things like "HTTP is bundled with HTML". I also think RPC is a poor label for "request/response" protocol (HTTP is request/response, and many of the problems here apply to HTTP too … but it isn't RPC).

I'd like more discussion of the contenders in the space: why do they not work? QUIC is barely mentioned beyond "we tried it and nope". SCTP isn't mentioned in the OP at all, but if you click through the Homa stuff a bit, it's written off as "a WAN protocol" — whatever that is? I get that his focus is datacenters, but "WAN protocol" is what I feel most of us are interested in, and I don't immediately see why WAN vs. DC matters¹.

What I'm also not getting from the Homa page is … what is the protocol, what are its packet structures? Is there no documentation? (I've not hit all the third-degree links because I'm not a spider, but nothing seemed relevant, or it seemed a little relevant but didn't hold the answers.) The old adage about data structures over algorithms seems to apply:

> "Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." -- Fred Brooks, The Mythical Man Month (1975)

¹I'd also want to know why a good "WAN" protocol that solves the stated problems wouldn't equally apply to the datacenter, modulo needing some adjustments around assumptions made about the network's latency and maybe reliability: a DC network, I'd hope, is going to have much lower latency. But like Raft timers, it seems like a by-use-case adjustable constant.


On the topic of WAN vs DC it's about reliability. A lot of the engineering behind TCP assumes an unreliable transport, but in a data center context a lot of the unreliability doesn't exist, or can be engineered out of the network infrastructure you use. A good example of this in my space is video over IP (ST 2110), which is over RTP/UDP, but RTCP (for retransmission and rate control) is much more of a vestigial component, and if I remember correctly is entirely optional in the specs. A lot of the SDN solutions even default to not carrying the RTCP back channel.

Now, in the video-over-IP space there is essentially zero time available for retransmission, so such packet loss is better handled as a dropped frame or audio packet; but it's not like this is an expected outcome of using UDP in such a controlled environment as a dedicated video transport network.


> Spoiler: Ousterhout is right, and the data shows he is right. Maybe not everyone knows it yet, or it's the kind of "right" that not everyone will ever get to (and we'll settle for "good enough" for much longer), but he is almost certainly right.

No, Ousterhout is not "right"; Ousterhout has put forward a hypothesis, backed up by _some_ data, that's not really borne out by real-world experience in the datacentre.

Infiniband, despite his claims to the contrary, solves a whole bunch of the problems he's describing. Moreover it's a proven solution that deploys at scale. Not only that, but you can run ethernet/IP over the top.

Having proper flow control inside your switching fabric is bloody amazing if you want real-time with QoS over the top. After all, Fibre Channel was designed for speed, guaranteed delivery, and low latency. The problem is, ethernet was cheaper and good enough to cover 80% of use cases.


A bit late, but for the average (small, let's say) data center that isn't interested, he's offering a valid (if unlikely) solution, right?


I'm not sure. Having run a smaller datacenter (~2 megawatts), we got far more bang for our buck by upgrading the core switch fabric and moving to dual 40-gig (from dual 10-gig) ethernet for the "heavy" nodes.

Sure it cost more in terms of capex, but once it was installed we didn't need to change anything else. The cost of having an engineering team porting mission critical stuff to a new transport mechanism would have been astronomical, not only in opex, but in lost capacity to do other business critical tasks.


I for one think we should finally replace DNS with JPEG. That makes exactly as much sense as this article.


jxl so the requests can be lossless, transparent, and multi-channeled


You could just use UDP instead of TCP and implement some subset of TCP features that you need in-band, like sequence numbers. This is not a new concept. The Facebook Memcache paper [1] in 2013 described using UDP for intra-datacenter requests.

[1] https://research.facebook.com/publications/scaling-memcache-...


Sure, but what is UDP bringing to the party then? The pseudoheader?

A reasonable RPC layer could just layer on IP.


Well, for one, UDP brings the U. It's called the User Datagram Protocol because (at least on UNIX-like systems) unprivileged programs can almost always send and receive UDP datagrams, but in general cannot send and receive raw IP datagrams.


UDP gives you port numbers supported by every piece of networking equipment your datacenter already has, instead of having to deal with raw IP frames or somehow extending how your firewalls and servers handle your special boy packets. Especially useful considering you probably want port numbers anyway, and checksums are just a nice bonus. You'll probably reimplement part of TCP on top of it too.


> Sure, but what is UDP bringing to the party then?

Interoperability. It's been how many years since SCTP was published? How's that going?


You're likely interacting with SCTP every time you use a mobile phone, as it's in heavy use in the control plane of telecom networks. Outside that, not so much though.


Also every Zoom/Google Meet call. WebRTC is on top of SCTP.


WebRTC uses SCTP in a weird way: on top of DTLS, which is itself on top of UDP. When most people talk about using SCTP, they mean using it instead of UDP.


Sort of. SCTP is used for data channels, the audio/video is RTP. Still an important piece though.


Yes but if my program on my laptop wants to talk to my program in my server somewhere out there, chances are the SCTP will be blocked by some piece of blinking furniture along the path.


> Outside that, not so much though.

Thanks for making my point.


The UDP header will give you proper ECMP hashing.


Yup - this is the correct answer


IIRC, UDP just adds port numbers and a checksum to IP. It's about as bare bones as it can get, unless you want to get rid of the checksum too.


Some implementations do skip the checksum (leaving it as zero); for instance, you can configure a Linux UDP socket to skip it if you can't afford the time it would take to calculate it.
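
Concretely, that's the Linux-only SO_NO_CHECK socket option (UAPI value 11, not exported by name in Python's socket module), which makes outgoing IPv4 UDP datagrams carry a zero checksum:

```python
import socket

SO_NO_CHECK = 11  # Linux asm-generic/socket.h constant; Linux-only

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# Outgoing datagrams on this socket are now sent with checksum = 0.
sock.setsockopt(socket.SOL_SOCKET, SO_NO_CHECK, 1)
```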


Ousterhout's response to Pepelnjak's criticism is solid, but he mentions not being familiar with SCTP. Homa is much more fairly compared to SCTP than to TCP, since there is a large intersection between the problems they solve. Ousterhout seems to think that ACK-based sender congestion avoidance is a significant problem in the datacenter.

SCTP uses ACK-based sender congestion avoidance, but lacks the head-of-line blocking that TCP has[1]. Comparing Homa to SCTP would be a good way to test this thesis.

1: Or rather it's optional.


If you change the stack at TCP level...

- How many hardware network components would need replacing/upgrading to support a layer not expected to change?

- How many network + application software libraries would need to support parsing at packet level instead of byte level?

- How many OSI diagrams showing that high level byte and string based protocols run above the transport layer need to be re-written to say that's no longer the case.

- How many versions of RPC would need to be supported... just look at the numerous revisions of CORBA/RMI/SOAP(Document or RPC Style!)/REST/GRPC to see how many times you can re-invent (and misinterpret, break standards of) a wheel.

- Look how long IPv6 needed buy in and refinement, and then adoption. That was a necessary change with no alternative too!

There are many applications that shouldn't need TCP, such as those only doing short network hops or needing real time traffic with expected packet loss. They should use UDP but often don't. TCP is ubiquitous for the very reason that it is the chameleon mentioned in the article - TCP is not ideal, but it is convenient and well understood.


> the Internet seems to be missing a standard protocol for the request/response paradigm, with repeated attempts to force-fit TCP leading to inevitable mismatches

HTTP is that. QUIC is that over UDP.

We've discussed Ousterhout's paper about replacing TCP in the data center. Last I looked the answer was "not with what is described in the paper".

> I have never understood why the Internet has worked so persistently to adapt TCP in support of request/reply workloads instead of standardizing an RPC transport protocol to complement TCP.

> ...

> But my original question was to ask why that hasn’t happened. The only answer I can come up with is that judgment often reflects biases.

I can explain this. Some of it is bias, as TFA surmises, but there's a lot more to it.

First off, there is an Internet RPC protocol, and it's the ONC RPC protocol, the one that NFS uses. ONC RPC does run over TCP and UDP, which, yes, TFA doesn't seem to want.

Second point: adding new upper layer protocols (ULPs) like TCP and UDP and SCTP is fraught with middlebox and over-design problems, and you can look at SCTP for details.

The third point is that the ability to implement and iterate quickly means that having to wait for kernel support for the new thing is a non-starter. This is the strongest reason why you won't see a new RPC ULP happen quickly and why there has been little interest in one.

So between the middlebox issues and iteration latency issues the only way to do a new ULP is as a new binding of the RPC which must also have TCP and UDP bindings.

I.e., all request-response protocols we have pretty much have to run on top of TCP or UDP.

This fact of Internet life may suck, but that's just how it is. Making it go away will require lots of time. Developers don't have lots of time.

Fourth point: running a request response protocol over UDP is quite reasonable, especially in the datacenter.

Fifth point: building a new ULP will risk yielding the same over-design issues as SCTP all over again unless it's a new binding of another protocol.

I'm beating a dead horse here; let's switch to the "bias" issue.

RPC got a really bad rap in the 90s and 00s. The reason is that the RPC frameworks that came out of the 80s, and the more modern versions that followed them in the 90s and 00s, were all [mostly] synchronous. There's no reason for that: it was the easiest way to implement back then.

Another aspect of synchrony is that RPC == remote procedure call and comes from an era when it was fashionable to build things like distributed remote shared memory (which is a performance disaster). The idea that RPC is like a function call instantly evokes synchrony even though RPCs are not condemned to be synchronous. That suggestion leads to aversion.

Another reason that RPC frameworks got a bad rap back then was the encoding systems used. It doesn't matter if the encoding system was trivial like XDR or complex like NDR or DER or PER or whatever -- the tooling for encoding rules from the 80s and 90s was just not remotely universal. As well we had the XML era (e.g., SOAP), which was very buzzword-rich and efficiency-poor.

That bad reputation is undeserved, really, but there it is. Nowadays RPC is making a comeback via gRPC and friends, but it shouldn't surprise that they typically run over HTTP.

Meanwhile the web folks iterated quickly. HTTP user-agent stacks quickly became async. HTTP APIs iterated from XML to JSON, and either way, with JavaScript and HTML, web apps got a great deal of flexibility and performed great by comparison to RPC apps.

Plus the web security model is atrocious mainly because of the most useful thing about HTML: pages can have cross-site links. Cross-site linking is so incredibly useful that it can't be done without. And now you need a fancy user-agent (browsers), so the web folks built one (browsers).

HTTP won out over RPC.

Why not use HTTP for everything, then? Well, HTTP/1.1 is... not efficient. H2 is much better, but still over TCP. H3/QUIC should be perfectly fine in the datacenter, and for some things on the web (and eventually maybe all things on the web?), and runs over UDP.

But HTTP is a request/response protocol. Just like any RPC!

The more things change, the more they stay the same.

My argument: HTTP is the protocol you're looking for. And if ever we need a new ULP and are confident that we can deploy it, then we should design one for HTTP because a) that will satisfy the RPC need, b) it will tend to quash over-design instincts, c) it will have the same APIs/semantics as a protocol we're all already very familiar with, d) having one protocol with bindings to all applicable ULPs will help us get over the kernel support issues. (d) is especially appealing: you get to target one API and you get the best ULP locally supported.

We do have a widely-deployed Internet RPC protocol called HTTP, and you get to run it over TCP or UDP, and its semantics are stable when we add new ULP support. That's a pretty good outcome. Though maybe it requires getting over one's own biases :)

> That we have since turned HTTP into the Internet’s de facto RPC protocol (and then now realizing that it is suboptimal, are trying to optimize it by collapsing all the layers into the new QUIC protocol), is only a testament to how small a role technical rationale plays in what happens in industry.

I object to the characterization of HTTP as suboptimal for RPC. TFA is clearly referring to TCP as being suboptimal for RPC, but not really covering how UDP is suboptimal for RPC. And TFA does not discuss why NFS switched from using UDP to TCP, say.

It feels like TFA is just emoting that we ought to have a ULP specifically for pure RPC. That seems like a very biased position.

Suppose we built a ULP for RPC though. We could easily expect HTTP to get adapted to run over it. What does HTTP add that an RPC doesn't need? Mostly headers, which the RPC can just not use, and URI components, which the RPC can also set to the smallest possible values, say.

There's not that much light between RPC and HTTP. Both are request-response protocols. Accept HTTP as an RPC and you might be happy.
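
To make "not much light" concrete: the entire HTTP/1.1 envelope an RPC caller has to pay for can be under fifty bytes (a sketch; real client stacks add more headers by default, but nothing forces them on you at the protocol level):

```python
def http_rpc_frame(payload: bytes) -> bytes:
    """Near-minimal HTTP/1.1 request framing around an RPC payload."""
    return (b"POST / HTTP/1.1\r\n"
            b"Host: s\r\n"                      # Host is the one mandatory header
            b"Content-Length: " + str(len(payload)).encode() + b"\r\n"
            b"\r\n" + payload)

frame = http_rpc_frame(b"hello")
overhead = len(frame) - len(b"hello")  # bytes of envelope around the payload
```

A purpose-built RPC framing would still need a method/length preamble of its own, so the real delta versus HTTP is a handful of bytes plus header parsing cost.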


Consider how Lustre RPC works. It's mainly based on RDMA-capable protocols, though it has a TCP binding as well. For the TCP binding Lustre uses separate connections for RPC headers and bulk data, and even in the TCP case it's built around RDMA as a concept. RDMA essentially means that before you do a bulk I/O you negotiate acquisition of buffers for that bulk I/O, which imposes discipline on resource usage, which is pretty great for HPC and... less great on the public Internet. For writes this means an extra round trip at least once to set up the buffers you will use.

Now, the thing about RDMA is that it greatly benefits from HW NIC support (think InfiniBand), but for that to be feasible the RDMA protocol needs to be its own ULP running over IP or (even better) a protocol at the same layer as IP. (If it's not run over IP, however, it can't be an Internet protocol.) This is where the real need for new protocols other than TCP or UDP should come in. And there are some such protocols, like RoCEv2.

It's totally possible to build an RDMA protocol over UDP that is performant. One might worry about the UDP header overhead, but it's minimal, and besides, you'll want cryptography, which will add more overhead for authentication tags. Don't allow fragmentation, meaning you'll need PMTUD, and you don't need your own length fields, so the only UDP overhead that could be avoided is the src/dst port numbers (4 bytes!)... which you might have anyways in a new ULP.


>d) having one protocol with bindings to all applicable ULPs will help us get over the kernel support issues.

Isn't the purpose of QUIC to hide all of this from the kernel, at the cost of statically binding the library to the app (as Go might)?

>Cross-site linking is so incredibly useful that it can't be done without. And now you need a fancy user-agent (browsers), so the web folks built one (browsers).

Just set the MIME to text/plain?


> Isn't the purpose of QUIC to hide all of this from the kernel, [...]

Yes, but the moment you want to use a new ULP (which is what TFA seems to be angling for) you'll basically need kernel support, except for kernel-mode / baremetal apps that aren't really relevant to this discussion.


Good to learn a bit of history about TCP.

It always felt weird developing RPC layers on top of TCP, because requests and responses end up tied to the underlying socket -- which never needs to be the case.


How else would the kernel know to which application it should route a response?


Destination port number should be sufficient.

For example, imagine if there are two sockets established between server and client processes (due to multiple ips or roaming, time lags, etc.), then in theory, request received from one socket could be responded through another socket. From application pov it doesn't need to care about the socket. We can do this over TCP, but connection-oriented nature of TCP makes this weird.
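The decoupling described above can be sketched by carrying a request ID in the message itself, so a reply can be matched to its request no matter which socket it arrives on. A toy over UDP loopback (the 4-byte header and all names are made up purely for illustration):

```python
import socket, struct

def make_message(req_id: int, payload: bytes) -> bytes:
    # 4-byte big-endian request ID, then the payload (made-up framing)
    return struct.pack("!I", req_id) + payload

def parse_message(data: bytes):
    (req_id,) = struct.unpack("!I", data[:4])
    return req_id, data[4:]

# Server and client on loopback; the server could reply from any socket,
# since the client matches responses by ID, not by connection.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

client.sendto(make_message(7, b"ping"), server.getsockname())
data, addr = server.recvfrom(1500)
req_id, body = parse_message(data)
server.sendto(make_message(req_id, body.upper()), addr)  # echo, uppercased

resp_id, resp = parse_message(client.recvfrom(1500)[0])
assert resp_id == 7 and resp == b"PING"
```

With this framing the application never cares which socket carried the reply; over TCP you'd have to fight the connection orientation to get the same property.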


Destinations can be described in the packet headers. That is how the current systems work.


So the question is why hasn't an optimized RPC implementation emerged for the data center that avoids the headaches that come with layering over TCP.

Turns out integrating with the control plane is hard at data center scale.

When everything is working, an optimized protocol stack is great and everyone is happy.

When congestion happens (fabric, kernel, memory bus, cache footprint, etc), and your optimized app causes a legacy production app to behave strangely, all hell breaks loose.

For example, a distributed file system SRE sees some strange tail latency effects and doesn't know that a cluster is shared with a non-TCP app and of course the optimized stack doesn't respond to congestion the way TCP would.

Worst case, the SREs notice this just after an update is pushed in their app, and they spend a random amount of time trying to figure out if the update is what changed the behavior.

Best case, they know there are non-TCP apps in the area, and they ping the SRE for that stack and say "my app is behaving strange, please hit your big red off button so I can see if that fixes the problem."

Someplace in between those options, the SRE is using TCP aware network debugging tools to try to figure out why this day is different from yesterday, or this cluster is different from the one where everything is working fine.

Regardless, you get an unhappy SRE. But of course you never get just one unhappy SRE.

So you need to generate a lot of value to justify their pain.


Is this a good time to bring up T/TCP?

https://www.rfc-editor.org/rfc/rfc1644

    This memo specifies T/TCP, an experimental TCP extension for
    efficient transaction-oriented (request/response) service.  This
    backwards-compatible extension could fill the gap between the current
    connection-oriented TCP and the datagram-based UDP.
I have no illusions about the likelihood it'll ever see use, but I've always thought T/TCP was cool and fun.


> and not limited to HTTP’s five operations

What is he talking about? HTTP methods are an arbitrary string.

> The method token is case-sensitive because it might be used as a gateway to object-based systems with case-sensitive method names. By convention, standardized methods are defined in all-uppercase US-ASCII letters.

Even if you talk about the standard methods, there's 8 listed right in the RFC:

https://httpwg.org/specs/rfc9110.html#methods


> SMTP was a purpose-built RPC for email

SMTP is a very chatty stateful back-and-forth protocol, where what message is legal to send when depends on the state of the connection. It's roughly the opposite of a singular request-response RPC.

> judgment often reflects biases

As seems to be true with this article, too.


Really interesting article. Wish there was a discussion of performance for various RPCs from the past.


RPC in general failed for two reasons: (1) synchronous in an async environment, and (2) brittle client/server coupling of RPC definitions.

HTTP is a remote procedure call, it's request/response.

Each request/response pair is independent and asynchronous.

The difference between HTTP and a more "specified" RPC is that the body of the request and the body of the response are defined "out of band" and are flexible enough to allow for the client and the server to evolve independently.

I could make HTTP look like ONC-RPC by using XDR to marshal the arguments in the request/response body and map the actual procedure name to the URL. The MIME type "application/protobuf" exists to do the same using protobufs. So does "application/vnd.google.protobuf".
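A minimal sketch of that idea: plain HTTP as the RPC transport, with the procedure name in the URL and the arguments marshalled in the body (JSON here instead of XDR or protobuf, purely for brevity; all names are illustrative):

```python
import json, threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from http.client import HTTPConnection

PROCEDURES = {"add": lambda a, b: a + b}   # the "exported procedures"

class RPCHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        proc = PROCEDURES[self.path.lstrip("/")]  # URL -> procedure name
        args = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        body = json.dumps(proc(*args)).encode()   # marshalled return value
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *a):                    # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), RPCHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = HTTPConnection("127.0.0.1", server.server_port)
conn.request("POST", "/add", json.dumps([2, 3]))
result = json.loads(conn.getresponse().read())
assert result == 5
server.shutdown()
```

Swap the JSON for XDR or protobuf encoding and you have essentially the shape of ONC-RPC or gRPC riding on an unmodified HTTP stack.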

HTTP was built using TCP as the reliable connection layer over the IP packet protocol, but QUIC replaces that using UDP over IP to remove the problems that a connection-based/reliable protocol like TCP causes.

Really not sure what a new protocol would provide other than shuffling the boundaries again.


There's so many...

Apollo Domain's.

ONC RPC.

DCE RPC (and MSFT RPC, which derives from DCE RPC).

SOAP.

LustreRPC.

And who knows how many others.

The biggest problem with RPC in general is that historically it was synchronous because that's what was easy to implement in the 80s. Fix that and an RPC sucks only as much as the encoding system it uses.

But RPC == remote procedure call, and that causes people to instantly think "synchronous", and that is a kiss of death.


Well of course once you say "procedure call", people think synchronous.

For the sender, asynchronous RPC is just a convenient marshalling and return interface. It is pretty clunky for the A-RPC caller to create an illusion in the code of the RPC being "just like" a local procedure call.

For the called procedure an Async RPC can look just like a local sync procedure call, except for that whole address space thing.

I don't think that sync RPC was popular because it was easy to implement.

I think it was popular because it was easy to code to. Multi-threaded coding is easy to get wrong, and was poorly understood at the time.


> I think it was popular because it was easy to code to.

That's not that unlike it being easy to implement.

> Multi-threaded coding is easy to get wrong, [...]

Yeah, but the better design is async/await and the like, not threading; that stuff just wasn't very popular in the 80s.

The approach to concurrency in the 80s was heavy on context switching because that's what was inherited from the 70s. It feels like the thinking was along the lines of "during the time you're waiting for a response some other process will run, and since we're used to heavy latency, who cares!", but that became less and less tenable. There's a reason all the distributed operating system research of the 80s died except for filesystems -- with filesystems you can consciously avoid hot ping pong by just not accessing shared resources concurrently, while the filesystem itself remains super useful.


There are some things that current RPC frameworks simply cannot touch.

I've been up to my elbows in TCP lately. Working on a low-latency, select-based socket server for streaming gaming applications. Each instance runs on 1 thread so there is absolutely no context switching delay when servicing player requests. The TCP streams are responsible for moving player inputs and game state as quickly as possible between system elements. There's also an externally facing websocket being serviced inside of these threads. The whole networking chunk of the loop usually completes within a few tens of µs. Socket selects hang out for up to 10µs.

I considered something like AspNetCore/kestrel, but that's a lot of machinery, GC liability, thread contention, and so on. gRPC also crossed my mind but I found some headache with the tooling.
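For anyone curious what a single-threaded select() loop looks like in miniature, here's a toy echo server in Python (the commenter's system is .NET and far more involved; this just illustrates the pattern of one thread multiplexing a listener and its connections):

```python
import select, socket

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen()
listener.setblocking(False)

# One client for demonstration; its data is buffered before accept()
client = socket.create_connection(listener.getsockname())
client.sendall(b"input")

sockets = [listener]
echoed = None
while echoed is None:
    readable, _, _ = select.select(sockets, [], [], 0.01)
    for s in readable:
        if s is listener:
            conn, _ = listener.accept()      # new connection is readable
            conn.setblocking(False)
            sockets.append(conn)
        else:
            data = s.recv(4096)
            if data:
                s.sendall(data)              # echo the "player input" back
                echoed = data

reply = client.recv(4096)
assert reply == b"input"
```

One thread, one readiness set, no locks and no context switches between connections; the real versions differ mainly in what happens once the bytes arrive.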


A well written C mux is absurdly performant. I was handling thousands of requests per second with efficient hand rolled binary protocols in the early oughts when I could make UDP work, which I could since it was an intra-datacenter app and a super lightweight exponential backoff retry scheme on the client side was entirely adequate. “Modern” network programming is hilariously inefficient. I think games are the only mass market exception since that’s the one area where optimal network performance affects pleasantness. Gamers used to speak of netcode almost reverently.


This.

The waste of doing RPC over TCP is simply astounding.


I am not even planning to push any sort of limits here. There are architectural elements which coalesce requests towards the edges into fewer, larger pipes. At most, any one of my socket servers will be responsible for 1k connections.


Do you mean the `select` call? libuv and friends can offer higher throughput than `select`.


Whatever OS primitive this API effectively calls is what I use:

https://learn.microsoft.com/en-us/dotnet/api/system.net.sock...

Edit: This is the source: https://source.dot.net/#System.Net.Sockets/System/Net/Socket...


I didn't dive past the links you dropped, but reading the doc comment makes it seem like the call is using `poll` (as an alternative to `select` for FD size reasons.) It's well recognized that the polling model of `select` and `poll` are slow and it was specifically this problem that led to the framing of the C10K problem. The solution is to use an async style of programming which is what libuv and other wrappers offer (piggybacking off `epoll` on Linux.)


Interestingly enough, polling is still preferred in some cases of low-latency networking, because the overhead of NIC interrupts can become significant under high utilization.

Of course in those cases you're running on bare-metal with kernel bypass networking, so there's no user<->kernel switch like in poll


The issue with select(2)/poll(2) (and for that matter WaitForMultipleObjects()) is that you are passing a huge data structure across the userspace/kernel boundary with each call. There are various alternative platform-specific implementations of what in the end is still polling that place the set of interesting FDs on the kernel side and thus make the whole thing significantly faster. That is what libev/libevent/libuv abstracts away in a portable manner.

The programming style is essentially the same as for a program that uses select(2) directly. (If you do not count various anti-patterns that are common in GUI applications which do networking.)
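In Python terms, that kernel-side interest set is what the selectors module wraps: the FD is registered once, and DefaultSelector picks epoll/kqueue where the platform has it. A tiny sketch:

```python
import selectors, socket

sel = selectors.DefaultSelector()        # EpollSelector on Linux
a, b = socket.socketpair()
sel.register(b, selectors.EVENT_READ)    # interest set registered once

a.sendall(b"ready")
events = sel.select(timeout=1)           # no FD list passed per call
assert events and events[0][0].fileobj is b
msg = b.recv(16)
assert msg == b"ready"
sel.close()
```

The calling code is shaped just like a select() loop, which is the parent's point: the speedup comes from where the FD set lives, not from a different programming style.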


Async style brings downsides with regard to real time latency. As noted in other comments here, I am not pushing more than 1k per socket server.


It’s been many many years since I’ve done low level multiplexing but I think epoll() is the no longer quite so new hotness?

Ah I see libuv is a cross platform lib. I’d probably use that for production code, but it’s still really valuable to work with the low level API enough to understand it.


Yeah epoll is what libuv calls on Linux. And agreed I think it's very valuable to try and use epoll or equivalent yourself at least once to understand what's happening.


I've used NFS over stunnel and I don't see any performance problems, but I do have the perception that it is not a panacea.

I've read that kerberos NFS encryption has performance problems that stunnel solves.

I've never sunk down to ONC RPC, but it does feel like it's a hack compared to QUIC.


What are the best recommendations for modern day low latency RPC? Google protocol buffers? Or some message queue such as ZeroMQ?


Part of what we're arguing in this article is that there is no good answer today because of the assumption that TCP was good enough. Homa is challenging that assumption (as many others have tried before). My view is that if you want something better than TCP for RPC today, QUIC is the best thing available, as discussed here: https://systemsapproach.substack.com/p/quic-is-not-a-tcp-rep...

What you run over that for your RPC is a matter of taste - gRPC is certainly popular, and can be run over QUIC.


You’re thinking too high level, this is talking about replacing TCP or UDP with a new transport protocol designed specifically with RPC in mind.

Instead of opening a TCP (or UDP) socket and writing data (or data grams) to it you’d open a (for want of a better term) RPCP socket and write data that represents the procedure call and parameters to it.

I assume the protocol would have assurances of some kind of reliability (unlike UDP) and procedure call data boundaries (unlike TCP) built in at the lowest levels along with request/response sequences to allow out of order completions.


Flatbuffers, Cap'n Proto, or SBE over RDMA.


I'm a huge fan of capnproto.org


Check out eRPC.io

It leverages DPDK and is perfect for low-latency RPC in lossless DCs


It would be nice to have something like that, but have libraries for most common languages.


Where would I find one of these lossless fabrics?


The third protocol the author is talking about sounds a lot like Content Centric Networking/Named Data Networking.


I'm a pragmatist and there are a couple of points in this article that give me an allergic reaction.

> You want individual messages instead of a byte-stream? TCP has an option for that.

That links to https://book.systemsapproach.org/e2e/tcp.html#record-boundar... In my opinion you can't seriously suggest using URG or PUSH as message boundaries. TCP does not work with records/messages. It works with streams. There is a reason for that.

The reason is that a record can be large. If it fits one packet - who cares. With small records maybe the transport protocol can concatenate/batch the records to save the network some load (Nagle's algo). With large records, you need to do retransmission if data is lost, you have head-of-line blocking, and you must deal with congestion backoff and flow control (what if the remote end is busy, or has scarce memory?).
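The stream-vs-record point is easy to see in code: over TCP the application has to rebuild record boundaries itself, typically with a length prefix. A minimal sketch (a socketpair stands in for a TCP connection; the framing is made up):

```python
import socket, struct

def send_record(sock, payload: bytes):
    # 4-byte big-endian length prefix, then the record body
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n: int) -> bytes:
    # recv() may return fewer bytes than asked; loop until we have n
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-record")
        buf += chunk
    return buf

def recv_record(sock) -> bytes:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)

a, b = socket.socketpair()
send_record(a, b"first")
send_record(a, b"second record")   # two records, one undifferentiated stream
r1 = recv_record(b)
r2 = recv_record(b)
assert r1 == b"first"
assert r2 == b"second record"
```

The recv_exact loop is exactly the boilerplate a record-oriented transport would make unnecessary, and also where head-of-line blocking lives: record two can't be delivered until every byte of record one has arrived.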

In traditional IP the solution to RPC would be to open a new stream for each request. This is how HTTP 1.0 worked. It had certain advantages.

> You want congestion control? TCP can give you one version tuned for the wide-area and another version tuned for the datacenter.

You can go quite far without tuning congestion control algos inside datacenter. Cubic and BBR are good enough for almost everyone. There is cost in using DCTCP. It's in the same ballpark as ECN. Great in theory, solves real issue. But requires heavy investment to get any value off it.

> You want a low-latency network stack? Well that’s a challenge TCP has a 40-year history of trying to optimize away, and when that falls short, ultimately looking to SmartNICs to solve.

Okay, yes, TCP was not intended to be low latency. With traditional APIs it's impossible to get zero-copy; gosh, even stuff like getting a transmission-completion signal is basically impossible. Normal NICs have offloads and they work; they save the CPU from some dumb work. If you want low latency then go for an RDMA-like approach.

> In contrast, RPC was designed from the start to optimize round-trip performance in low-latency networks.

Err... Okay, so by this definition HTTP is not RPC. Fine.

Okay, so we're talking about an intra-datacenter protocol: "trusted", "fast", "low latency", on a homogeneous network, tuned for RPC-style, request-response traffic. Stuff like this comes to mind: memcached (including the binary protocol), Redis, RDMA, gRPC, and QUIC.

Fine! TCP is indeed not optimal. Should you care? Nope. In real life, tricks like TCP connection pooling and raw UDP protocols (think: memcached over UDP) work just fine.

Furthermore, often you need encryption (even inside a datacenter). The power of TCP and other generic protocols is that they work fine on the lossy, untrusted public internet. I can connect to Redis over a 300ms lossy stream just fine (and I often do!).

Would we benefit if there were a protocol (like QUIC or SCTP) tuned for intra-datacenter RPC? Probably yes. Should you care? Nope. Look at how hard making QUIC was.

I'm a pragmatist and I'm allergic to theoretical discussions. There is a reason why memcached binary protocol and memcached UDP protocols are super obscure and (almost) nobody runs them (the reason is that latency is not the most important thing. Simple code in the client is often more important).

> But coming back to the specific question of RPC vs TCP in the datacenter, it still has me scratching my head about why it hasn’t happened

I can give one answer: the BSD sockets API. It is limiting; think about the case of a large response. Ideally the server would like to hand it to the kernel and move on to the next request. This is not how the APIs work. I think SCTP plus some kind of zero-copy would give you quite a decent starting point. However, SCTP didn't catch on, and its API has serious flaws. The next big thing is QUIC. But unless someone provides a kernel API for it, it won't be "fast" or "low latency".


I'd be happy if DNS and NTP would just get off their butts and switch to TCP. No, DoH doesn't count.


Why would NTP be better with retransmission of stale packets and go-back-N semantics on drops?


UDP can be spoofed for amplification attacks, and since ISPs won't implement reverse path filtering, the only other option is to get rid of UDP.


Normal DNS will switch to TCP for any large response and has done so for quite some time.


Unless you're using musl libc, which is in heavy use in containerized applications.


OK well if it does zone transfers those are over TCP only and always have been.


There is more to DNS than just zone transfers...



