Yeah, I am always going to interpret that sort of thing as performative. There seems to be a whole corporate mythology that is absolutely sure there are a bunch of cheap, low-effort things managers can do to raise morale and get more productivity out of employees, like office birthday parties. I propose a name for adherents to this philosophy: the Pizza Party Cult.
Yes. I have a family member who has had many hospital stays over the last few years, and one of the most obnoxious things is that the staff just lets everything beep. The last time we were in the emergency room the blood pressure monitor did not work and the staff didn't notice for over an hour. Even when it does work, they're constantly in an alarm state because the patient has chronic high blood pressure. They either can't or won't silence the alarms, so every room is beeping, the nurse's station is beeping, their phones are beeping, and it's all being ignored. It's the very definition of alert fatigue.
In the regional hospital near me, they've begun actively fighting for fewer alarms. In part because they annoy everyone: patients, visitors, and hospital staff alike. But mostly because the alarm fatigue that the cacophony inevitably causes actively endangers patient safety.
The policy of this hospital is that all alarms, beeping, etc. should be disabled except in limited circumstances. Particularly at night.
I've noticed that for some reason developers really like using logs in place of actual metrics. We use Datadog, and multiple times now I have seen devs add additional logging to an application just so they can then create a monitor that counts those log events. I think it's a path of least resistance thing; emitting logs is very easy, and counting them is also very easy. Reporting actual metrics isn't really difficult either, but unless you're already familiar with the system it's more effort to determine how to do it than just emitting a log line, so yeah.
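To make the comparison concrete, here's a rough sketch of the two approaches (the function and field names are made up, and it assumes datadogpy's DogStatsD client talking to the agent on the default localhost:8125):

    import logging
    from datadog import statsd  # DogStatsD client from datadogpy

    log = logging.getLogger(__name__)

    def handle_order(order):
        # Path of least resistance: emit a log line, then build a Datadog
        # monitor that counts occurrences of this message.
        log.info("order processed", extra={"order_id": order.id})

        # Reporting an actual metric is barely more code, and you get a real
        # counter with tags instead of a log-derived one.
        statsd.increment("orders.processed", tags=[f"region:{order.region}"])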
Because when the application is breaking it's good to know why! Logs can be just as ephemeral as metrics -- in many cases, even more so. They're not even mutually exclusive.
Where exactly does this anti-logs sentiment come from? Is it because tools like Datadog can be lackluster for reading logs across bunches of hosts?
For me it's not that $ParticularTool doesn't work with logs (I don't use Datadog); it's all the stuff I put in my original post: it's a ton of samples, filtering puts heavy strain on the systems, and it's extremely brittle IME.
If you have good metrics, you can generally get much further without any log aggregation beyond tossing everything into STDOUT and checking on it when you have alerts.
My experience is that metrics may tell you something is wrong, but logs are required to tell you what went wrong and why.
A simple fixed-length rolling buffer can get you pretty far for logging, though it isn't something you necessarily need to ship off-device except when something bad has happened.
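A minimal sketch of that idea in Python (hand-rolled for illustration; the stdlib's logging.handlers.MemoryHandler covers similar ground):

    import collections
    import logging

    class RingBufferHandler(logging.Handler):
        """Keep only the last N formatted log records in memory."""

        def __init__(self, capacity=1000):
            super().__init__()
            self.records = collections.deque(maxlen=capacity)

        def emit(self, record):
            self.records.append(self.format(record))

        def dump(self, target):
            # Called only when something bad has happened, e.g. to attach the
            # recent history to an error report before shipping it off-device.
            for line in self.records:
                target.write(line + "\n")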
I like most of these patterns and have used them all before, but a word of caution using UUID primary keys: performance will suffer for large tables unless you take extra care. Truly random values such as UUIDv4 result in very inefficient indexing, because values don't "cluster." For databases, the best solution is to use a combination of a timestamp and a random value, and there are multiple implementations of UUID-like formats that do this. The one I'm familiar with is ULID. It's become a common enough pattern that UUIDv7 was created, which does exactly this. I don't know if it's possible to generate UUIDv7 in Postgres yet.
Yeah - in another place this was shared, a few people mentioned that. It is slated to be part of Postgres 18, but as of right now, to get the "good UUIDs" you need to homeroll something.
That's the only reason I didn't mention it; it seemed a bit of a rabbit hole.
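For reference, a homerolled UUIDv7-style generator isn't much code. A rough Python sketch following the RFC 9562 layout (48-bit millisecond timestamp, then version/variant bits, then random bits) -- not production-vetted:

    import os
    import time
    import uuid

    def uuid7():
        ts_ms = int(time.time() * 1000) & ((1 << 48) - 1)  # 48-bit Unix ms timestamp
        rand = int.from_bytes(os.urandom(10), "big")        # 80 random bits
        value = (ts_ms << 80) | rand
        value &= ~(0xF << 76); value |= 0x7 << 76           # version = 7
        value &= ~(0x3 << 62); value |= 0x2 << 62           # variant = 10
        return uuid.UUID(int=value)

Generating these client-side and inserting them into ordinary uuid columns sidesteps the missing Postgres function until 18 lands.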
Maybe some people don't remember anymore, but there was a time when Ruby was HN's favorite language. I miss those days. I kind of get why everybody leaned into Python instead, but I'm never going to be happy about it.
> The Fix: Use a full modern programming language, with its existing testing frameworks and tooling.
I was reading the article and thinking to myself "a lot of this is fixed if the pipeline is just a Python script." And really, if I were to start building a new CI/CD tool today, the "user facing" portion would be a Python library that contains helper functions for interfacing with the larger CI/CD system. Not because I like Python (I'd rather Ruby) but because it is ubiquitous and completely sufficient for describing a CI/CD pipeline.
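To make that concrete, here's roughly what I'm picturing -- a pipeline as a plain script calling a hypothetical helper library (the "ci" module below is made up, not an existing package):

    # .ci/pipeline.py -- the whole pipeline is just code
    import ci  # hypothetical helpers provided by the CI/CD system

    def main():
        with ci.stage("test"):
            ci.run("pytest", "-q")

        # Plain old conditionals and loops instead of templating syntax.
        if ci.branch() == "main":
            for region in ("us-east-1", "eu-west-1"):
                with ci.stage(f"deploy-{region}"):
                    ci.run("./deploy.sh", region)

    if __name__ == "__main__":
        main()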
I'm firmly of the opinion that once we start implementing "the power of real code: loops, conditionals, runtime logic, standard libraries, and more" in YAML then YAML was the wrong choice. I absolutely despise Ansible for the same reason and wish I could still write Chef cookbooks.
I don't think I agree.
I've now seen the 'language' approach in Jenkins and the static YAML file approach in GitLab and Drone.
A lot of value is to be gained if the whole script can be analysed statically, before execution. E.g. UI elements can be rendered and the whole pipeline is visible before it even starts.
It also serves as a natural sandbox for the "setup" part, so we always know the script is interpreted in a finite (and short) amount of time and no weird stuff can ever happen.
Of course, there are ways to combine them (e.g. GitLab can generate and then trigger downstream pipelines from within the running CI), but the default is the static file. It also has the side effect that pipeline setup can't ever do stuff that cannot be debugged (anything the setup did would run _before_ the pipeline).
But I concede that this is not that clear-cut. Both have advantages.
If you manage to avoid scope creep then sure, static YAML has advantages. But that's not usually what happens, is it? The minute you allow users to execute an outside program -- which is strictly necessary for a CI/CD system -- you've already lost. But even if we ignore that, the number of features always grows over time: you add variables so certain elements can be re-used, then you add loops and conditionals because some things need to happen multiple times, and then you add the ability to do math, string manipulation is always useful, and so on. Before you know it you're trying to solve the halting problem because your "declarative markup" is a poorly specified Turing-complete language that just happens to use a YAML parser as a tokenizer. This bespoke language will be strictly worse than Python in every way.
My argument is that we should acknowledge that any CI/CD system intended for wide usage will eventually arrive here, and it's better that we go into that intentionally rather than accidentally.
True. On the other hand, if you control the clients and can guarantee their behavior then DNS load balancing is highly effective. A place I used to work had internal DNS servers with hundreds of millions of records with 60 second TTLs for a bespoke internal routing system that connected incoming connections from customers with the correct resources inside our network. It was actually excellent. Changing routing was as simple as doing a DDNS update, and with NOTIFY to push changes to all child servers the average delay was less than 60 seconds for full effect. This made it easy to write more complicated tools, and I wrote a control panel that could take components from a single server to a whole data center out of service at the click of a button.
There were definitely some warts in that system but as those sorts of systems go it was fast, easy to introspect, and relatively bulletproof.
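To give a flavor of what "changing routing was as simple as doing a DDNS update" meant in practice, here's a rough sketch using dnspython (the zone, record name, and server address are placeholders, not the real system):

    import dns.query
    import dns.update

    # Repoint one customer's record at a different backend. With 60 second
    # TTLs and NOTIFY pushing the change to the secondaries, clients converge
    # in well under a minute.
    update = dns.update.Update("routing.internal.")
    update.replace("customer-1234", 60, "A", "10.20.30.40")
    response = dns.query.tcp(update, "10.0.0.2")  # the primary DNS server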
I've just spent the last month learning exactly why I definitely do want a TCP over TCP VPN. The short answer is that almost every cloud vendor assumes you're doing TCP, and they've taken the "unreliable" part of UDP to heart. It is practically impossible to run any modern VPN on most cloud providers anymore.
Over the last month, I've been attempting to set up a fast Wireguard VPN tunnel between AWS and OVH. AWS killed all internet access on the instance with zero warning and sent us an email indicating that they suspected the instance was compromised and being used as part of a DDOS attack. OVH randomly performs "DDOS mitigation" anytime the tunnel is under any load. In both cases we were able to talk to someone and have the issue addressed, but I wanna stress: this is one stream between two IPs -- there's nothing that makes this anything close to looking like a DDOS. Even after getting everything properly blessed, OVH drops all UDP traffic over 1 Gbps. It took them a month of back-and-forth troubleshooting to tell us this.
The really terrible part is "TCP over TCP is bad" is now so prevalent that there are basically no good VPN options if you need it. Wireguard won't do it directly, but there are hacks involving udp2raw. I tried it, and wasn't able to achieve more than 100 Mbps. OpenVPN can do it, but is single-threaded and won't reasonably do more than 1 Gbps without hardware acceleration, which didn't appear to work on EC2 instances. strongSwan cannot be configured to do unencapsulated ESP anymore -- they removed the option -- so it's UDP encapsulated only. Their reasoning is UDP is necessary for NAT traversal, and of course everybody needs that. It's also thread-per-SA, so also not fast. The only solution I've found that can do something other than UDP is Libreswan, which can still do unencapsulated ESP (IP Protocol 50) if you ask nicely. It's also thread-per-SA, but I've managed to wring 2 - 3 Gbps out of a single core after tinkering with the configuration.
For the love of all that's good in the world, just add performant TCP support to Wireguard. I do not care about what happens in non-optimal conditions.
The whole point of this article is that performant Wireguard-over-TCP support in Wireguard simply does not work. You're not fighting the prevalence of an idea, you're fighting an inherent behavior of the system as currently constituted.
In more detail, let's imagine we make a Wireguard-over-TCP tunnel. The "outer" TCP connection carrying the Wireguard tunnel is, well, a TCP connection. So Wireguard can't stop the connection from retransmitting. Likewise, any "inner" TCP connections routed through the Wireguard tunnel are plain-vanilla TCP connections; Wireguard cannot stop them from retransmitting, either. The retransmit-in-retransmit behavior is precisely the issue.
So, what could we possibly do about this? Well, Wireguard certainly cannot modify the inner TCP connections (because then it wouldn't be providing a tunnel).
Could it work with a modified outer TCP connection? Maybe -- perhaps Wireguard could implement a user-space "TCP" stack that sends syntactically valid TCP segments but never retransmits, then run that on both ends of the connection. In essence, UDP masquerading as TCP. But there's no guarantee that this faux-TCP connection wouldn't break in weird ways because the network (especially, as you've discovered, any cloud provider's network!) isn't just a dumb pipe: middleboxes, for example, expect TCP to behave like TCP.
Good news (and oops): it looks like I've just accidentally described phantun (and maybe other solutions): https://github.com/dndx/phantun
I'd be curious if this manages to sidestep the issues you're seeing with AWS and OVH.
> The retransmit-in-retransmit behavior is precisely the issue.
But you're concerned about an issue I do not have. In practice retransmits are rare between my endpoints, and if they did occur, poor performance would be acceptable for some period of time. I just need it to be fast most of the time. To reiterate: I do not care about what happens in non-optimal conditions.
> In practice retransmits are rare between my endpoints
You seem to be mistaken about how (most) TCP implementations work. They regularly trigger packet loss and retransmissions as part of their mechanism to determine the optimal transmission rate over an entire path (made up of potentially multiple point-to-point connections with dynamically varying capacity).
That mechanism breaks down horribly when using TCP-over-TCP.
Maybe, but packet loss isn't the only problem. You'll also want to preserve latency (TCP has a pretty sophisticated latency estimation mechanism), for example.
Some middleboxes will also do terrible things to your TCP streams (restrictive firewalls only allowing TCP are good candidates for that), and then all bets are off.
If you're really required to use TCP, the "fake TCP" approach that others in sibling threads have mentioned seems more promising (but again, beware of middleboxes).
But my connection speed is usually greater and my loss is much less to my VPN endpoint than to whatever services I am accessing through that endpoint. As a result it doesn't affect things much. Further, accessing it with UDP is not always possible.
Unless it's actually zero, any loss on the "outer" TCP stream will cause a retransmission, visible to the inner one as a sharp jump in latency of all data following the loss. Most TCP stacks don't handle that very well either.
But IP over TCP is in principle non-performant. There's no (non-evil) magic Wireguard could perform to get around that.
Adding TCP support to Wireguard would add a whole bunch of complexity that it doesn't need – for a very niche use case (i.e. where you absolutely have to get an IP VPN to work over a restrictive firewall).
> Wireguard won't do it directly, but there's hacks involving udp2raw.
Which, notably, does not do UDP over TCP in the problematic sense (it just masquerades UDP as TCP, without providing a second set of TCP control loops on top of the first one).
> AWS killed all internet access on the instance with zero warning and sent us an email indicating that they suspected the instance was compromised and being used as part of a DDOS attack.
It makes no sense for that to be due to Wireguard usage, though (not saying I don't believe you that it happened, just their explanation or your assumption of their motivation seems strange). Things like Tailscale use Wireguard and should be common enough for AWS to know about them by now, I'd assume?
It is very difficult to misconfigure Wireguard -- there's just not that much to tune aside from MTU. We've had a 1 Gbps tunnel between AWS and OVH for years and it worked mostly fine, except for the handful of times OVH's DDOS mitigation kicked in and killed the tunnel. The issue is when you start wanting to go beyond 1 Gbps.
I think AWS will do 5 Gbps with a capable peer -- which is their limit for a single flow [1] -- but you might need to tell them first so they don't kill public networking on the instance. I found that UDP iperf tests reliably got my instance's internet shut off, so keep that in mind. On the other hand, OVH will happily do 5-ish Gbps to/from my EC2 instance in a TCP iperf test, but won't tolerate more than 1 Gbps of inbound UDP. OVH support has indicated that this is expected, though they do not document that limitation, and it seemed that both their support and network engineering people were themselves unaware of that limit until we complained. They don't seem to have the same limits on ESP, which is why I developed an interest in ipsec arcana.
Worst case, can't you run a minimal TURN server and have TCP over Wireguard/UDP over TURN/TCP?
For a site to site VPN, something where you use transparent proxying at the routers to turn TCP into TCP over SOCKS (over TLS) might work. TCP proxying with 1:1 sockets avoids most of the issues with TCP over TCP, at the expense of needing to keep socket buffers at the proxy hosts.
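A toy version of that 1:1 proxying idea in Python asyncio (no TLS or SOCKS framing, purely to show the shape of it; addresses are placeholders):

    import asyncio

    async def pump(reader, writer):
        # Copy bytes in one direction. Each leg is its own TCP connection, so
        # loss on one leg is retransmitted only on that leg -- no nested
        # retransmission like TCP-over-TCP.
        try:
            while True:
                data = await reader.read(65536)
                if not data:
                    break
                writer.write(data)
                await writer.drain()
        finally:
            writer.close()

    async def handle(client_reader, client_writer):
        # One upstream socket per client socket: the 1:1 part. The proxy's
        # socket buffers absorb the mismatch between the two legs.
        up_reader, up_writer = await asyncio.open_connection("10.0.0.5", 443)
        await asyncio.gather(
            pump(client_reader, up_writer),
            pump(up_reader, client_writer),
        )

    async def main():
        server = await asyncio.start_server(handle, "0.0.0.0", 8443)
        async with server:
            await server.serve_forever()

    asyncio.run(main())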
We've run Wireguard tunnels that max out at 1 Gbps in AWS for years with no issues (on the AWS side, anyways). It seems like things get hairy once you want to do more than that.
I did not. I'm not terribly familiar with it, but it doesn't look like I can do general routing with it, right? My end goal is to route between two subnets.
Nope, shadowsocks is just a plain TCP-in-TCP (not TCP-over-TCP) proxy. If you cannot have performant routing between clouds due to UDP QoS, then the only sensible solution would be to set up proxy nodes on both sides and transparently redirect TCP traffic (if that's all you need) through the proxy.
> strongSwan cannot be configured to do unencapsulated ESP anymore -- they removed the option
wait, what? Pretty sure I still used unencapsulated ESP a few months ago… though I wouldn't necessarily notice if it negotiates UDP after some update I guess… starts looking at things
Edit: strongswan 6.0 Beta documentation still lists "<conn>.encap default: no" as config option — this wouldn't make any sense if UDP encapsulation was always on now. Are you sure about this?
Sorry, I misremembered the issue. Looking at my notes the issue is they don't allow disabling their NAT-T implementation, which detects NAT scenarios and automatically forces encapsulation on port 4500/udp. The issue is that every public IP on an EC2 instance is a 1:1 NAT IP. Every packet sent to the public IP is forwarded to the private IP -- including ESP -- but it is technically NAT and looks like NAT to strongSwan.
There's an issue open for years; it will probably never be fixed:
The documentation is atrocious, and usually won't say things like "label your program unconfined_t" because they don't want you to do that ever. Also, tutorials -- even RedHat's -- are always some variation of "here's how to use audit2allow." That is very much not what I want. I want to create a reusable policy that I can apply to many hosts via Ansible or as part of an RPM package I created. I've never been able to figure out how to do that because it is always drowned out by SEO spam that barely scratches the surface of practical usage.
It's painfully obvious to me that the people who create SELinux and its documentation live in some alternate universe where they don't do anything the way I do, so I just turn it off.
Not excusing that state of documentation by any means, but a good starting point for understanding the actual policy for me was "SELinux System Administration" (ISBN 978-1-80020-147-7).
It won't carry you all the way to applying policies via Ansible or RPM packages, but definitely took me from running random audit2allow commands to taking a more holistic view of my SELinux policies.
It also looks like a long read, but if you fast-forward through chapters that aren't relevant to you (looking at you, IPSEC) it isn't such a slog.
I've been considering switching from Fastmail to Proton for Mail/Calendar/Contacts, but I didn't realize their bridge didn't do CalDAV or CardDAV. Also, apparently the bridge is desktop-only -- no mobile? That's kind of a deal breaker.