In search of performance – how we shaved 200ms off every POST request (gocardless.com)
87 points by Sinjo on Aug 20, 2015 | 15 comments


This is a great write-up of a tough technical problem, and I definitely agree with their conclusion that "it's worth taking the time to understand your stack." I especially liked that it wasn't preachy. It just explained their problem and how they solved it. (It seems a lot of write-ups end up offering very specific technical advice that grossly overgeneralizes from their experience.)

My only question is how they got to haproxy as the root cause so quickly. In my experience, comparing what's different between production and staging is a long shot because there's so much that's different. Obviously workload can matter a lot, and so can uptime and time since upgrade. So I'm curious if haproxy was the first thing they saw that was different or if they just didn't write about the dead ends.


To pitch in some preachy advice: with some time and effort, staging and production environments should be essentially identical. Sure, production is generally going to have more servers handling each service (more web nodes, more db/storage nodes, multiple load balancers, etc.), but the setup and deployments should not be any different.

All too often, we think of staging as nothing more than a clone of the development environment on a remote server where QA can get at it. If you run your load balancer (haproxy), web server (apache, nginx), database (mysql, postgres, mongo, elasticsearch), and caching (memcached, redis) all on a single server for staging, you're eventually going to stumble on this kind of hard-to-diagnose problem.

One of the main disservices you are doing to yourself with a single-server staging is that all of your traffic is going to be travelling over localhost or, worse, over unix sockets. You're not even testing basic network latency or performance.

Staging should really be considered a first-level production. Its configuration and maintenance should be handled with the same attention to detail as production.


Thanks! Glad you enjoyed it.

The main thing that narrowed it down was thinking about the request path from the app server to Elasticsearch. That pointed us at the load balancers (we tried a request directly from app -> Elasticsearch in production to verify they were the problem).
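Something along these lines, as a rough sketch (the hostnames, index, and query here are placeholders, not our actual setup):

    require "net/http"
    require "benchmark"
    require "json"

    # Placeholder endpoints: one Elasticsearch node hit directly, one via the load balancer.
    direct = URI("http://es-node-1.internal:9200/index/_search")
    via_lb = URI("http://es-lb.internal:9200/index/_search")

    query = { query: { match_all: {} } }.to_json

    [direct, via_lb].each do |uri|
      elapsed = Benchmark.realtime do
        Net::HTTP.start(uri.host, uri.port) do |http|
          http.post(uri.path, query, "Content-Type" => "application/json")
        end
      end
      puts "#{uri.host}: #{(elapsed * 1000).round}ms"
    end

If the direct request is consistently ~200ms faster, the load balancer is the place to dig.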

You're right that HAProxy wasn't the first place we looked once we were there, though. If I remember right, we started by diffing `sysctl` output to see if Ubuntu had tweaked something between versions.


With proper tracking it's not difficult to see which versions of packages are installed. Checking run books/ops logs would also lead to a quick understanding that staging was updated and production wasn't. Either of those two things would make it relatively easy to find the discrepancy in versions. Then, as they noted, a review of changelogs would have led to the patch they found.

If you don't have processes and documentation to quickly point out those kinds of differences between environments, you're doing something wrong.


It's not that you can't find _some_ differences that way. As others have pointed out, there should be little difference between the software deployed in staging and production. But there can be differences in performance and hardware configuration (unless you can afford to mirror your production deployment exactly). Most unavoidably, there are problems you only hit after cumulative amounts of uptime or load[1], which is nearly impossible to simulate in a preproduction environment.

[1] For an example, see our recent outage related to having seen 200M PostgreSQL transactions: https://www.joyent.com/blog/manta-postmortem-7-27-2015


Fair enough. I guess my point should have been: With proper tracking of changes and work done between the environments, finding differing versions could be the lowest hanging fruit.

That was my response to your original question of how they got to haproxy as the root cause so quickly.

Sure, mirroring hardware and performance can help, if you can afford it (which is rare). However, you'd also probably agree it's pretty rare to find an issue caused by cumulative amounts of uptime or load. Mismatched software versions or configurations are a far more likely culprit in my experience.

Horses not zebras...


It might not be all that different if they're using Docker.


>> To make things worse, Net::HTTP doesn't set TCP_NODELAY on the TCP socket it opens, so it waits for acknowledgement of the first packet before sending the second. This behaviour is a consequence of Nagle's algorithm.

I think rather than a consequence of Nagle's algorithm, it's the situation the algorithm is intended to optimize: an app generating many small packets.
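For anyone curious, the fix the article alludes to is just a socket option. A minimal sketch with a placeholder host and request:

    require "socket"

    sock = Socket.tcp("example.com", 80)   # placeholder host/port
    sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)

    # With TCP_NODELAY set, the second small write goes out immediately
    # instead of waiting for the ACK of the first.
    sock.write("POST /things HTTP/1.1\r\nHost: example.com\r\nContent-Length: 15\r\n\r\n")
    sock.write('{"amount": 100}')
    sock.close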


What are the cases where you want Nagle's algorithm enabled? It seems like the primary use case is telnet and telnet-like services (like SSH), but even there, Mosh does much better at balancing performance against congestion. For non-interactive protocols, do you ever want Nagle's algorithm?


Nagle's algorithm would actually help in this case if there were a bit more data to send: since Ruby is passing data a little at a time, buffering it until you have a full packet would be nice. It's just that the first chunk is sent right away, the second chunk doesn't fill the packet, and there is no third chunk. If you were pipelining requests, the algorithm would be helpful.
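Something like this, as a rough sketch (placeholder host, plain sockets rather than Net::HTTP): with Nagle left on, the small writes after the first get buffered and coalesced into fuller packets rather than going out as one tiny segment per write.

    require "socket"

    sock = Socket.tcp("example.com", 80)   # placeholder host/port, Nagle on by default

    # Pipelined requests: each write is small, but once the first segment is
    # in flight, Nagle's algorithm holds the rest until an ACK or a full packet.
    10.times do |i|
      sock.write("GET /item/#{i} HTTP/1.1\r\nHost: example.com\r\n\r\n")
    end
    sock.close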


But the client code already has information about whether it's planning to write some more or not, so you could just depend on that, instead of using heuristics about the peer's behavior. Even regular stdio-style buffering plus an explicit flush would help more than Nagle's algorithm.
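E.g., a rough sketch of doing the buffering in the client itself (placeholder host and payload): build the whole request in memory and hand the kernel a single write, so there's never a second small segment for Nagle's algorithm to delay.

    require "socket"

    sock = Socket.tcp("example.com", 80)   # placeholder host/port

    body = '{"amount": 100}'
    request = "POST /things HTTP/1.1\r\n" \
              "Host: example.com\r\n" \
              "Content-Length: #{body.bytesize}\r\n" \
              "Connection: close\r\n\r\n" \
              "#{body}"

    # One write call: headers and body leave in the same packet.
    sock.write(request)
    puts sock.read
    sock.close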


The client doesn't always have that information (especially if you're doing Unixy things and piping through netcat), but when it knows, it should certainly indicate that, which is why there's TCP_NODELAY and friends; I couldn't find out exactly when TCP_NODELAY showed up, but old BSD man pages [1] have the option and a date in 1986. In the absence of explicit information from the client, I think Nagle's algorithm is a decent heuristic.

[1] https://www.freebsd.org/cgi/man.cgi?query=tcp&apropos=0&sekt...


"We use Ruby."


That was their first problem.


> That was their first problem.

Where do you think you are, on reddit?



