This is a handy list. > 4:07pm The package install has failed as it can't resolv...

userbinator · on March 24, 2024

Dig into machine off the hot path.

Unfortunately, no one has the time to do that (or let someone do it) after the problem is "solved", so over time the "rebuild from scratch" approach just results in a loss of actual troubleshooting skills and acquired knowledge --- the software equivalent of a "parts swapper" in the physical world.

Spivak · on March 24, 2024

Y'all don't do post-mortem investigations / action items?

I get the desire to troubleshoot but priority 0 is make the system functional for users again, literally everything else can wait. I once had to deal with an outage that required we kill all our app servers every 20 minutes (staggered of course) because of a memory leak while it was being investigated.

Salgat · on March 24, 2024

Usually depends on the impact. If it's one of many instances behind a load balancer and was easily fixed with no obvious causes, then we move on. If it happens again, we have a known short-term fix and now we have a justified reason to devote man-hours to investigating and doing a post-mortem.

kqr · on March 24, 2024

> I get the desire to troubleshoot but priority 0 is make the system functional for users again, literally everything else can wait.

What numbers went into this calculation, to get such an extreme result as concluding that getting it up again is always the first priority?

When I tried to estimate the cost and benefit, I have been surprised to make the opposite conclusion multiple times. We ended up essentially in the situation of "Yeah, sure, you can reproduce the outage in production. Learn as much as you possibly can and restore service after an hour."

This is in fact the reason I prefer to keep some margin in the SLO budget -- it makes it easier to allow troubleshooting an outage in the hot path, and it frontloads some of that difficult decision.

Fatnino · on March 24, 2024

I was at a place where we had "worker" machines that would handle incoming data with fluctuating volume. If the queues got too long we would automatically spin up new worker instances and when it came time to spin down we would kill the older ones first.

You can probably see where this is going. The workers had some problem where they would bog down if left running too long. Causing the queues to back up and indirectly causing themselves to eventually be culled.

Never did figure out why they would bog down. We just ran herky jerky like this for a few years till I left. Might still be doing it for all I know.

aequitas · on March 24, 2024

> The workers had some problem where they would bog down if left running too long.

So you just automatically replace the instances after a certain amount of runtime and your problem is gone.

lucianbr · on March 24, 2024

Yeah, fixing a problem without understanding it has some disadvantages. It works sometimes, but the "with understanding" strategy works much more often.

Is this really a prevailing attitude now? Who cares what happened, as long as we can paper over it with some other maneuver/resources? For me it's both intellectually rewarding and skill-building to figure out what caused the problem in the first place.

I mean, I hear plenty of managers with this attitude. But I really expect better on a forum called hacker news.

FridgeSeal · on March 24, 2024

For me, there’s a certain threshold involved.

If it happens extremely rarely (like, once every 6 months) or it’s super transient and low impact, we kick it and move on.

If it starts happening a 3rd or 4th time, or the severity increases we start to dig in and actually fix it.

So we’re not giving up, and losing all diagnosis/bugfixing ability, just setting a threshold. There’ll always be issues, some of them will always be mystery issues, so you can’t solve everything, so you’ve got to triage appropriately.

account42 · on March 26, 2024

Plus just because the only visible symptom of the bug is a perfomance issue right now doesn't mean that there won't also be other consequences. If something is behaving contrary to expectations you should always figure out why.

aequitas · on March 24, 2024

I said nothing about not understanding the issue. Even with understanding just “turning it on and off again” might be the better solution at the moment. Because going for the “real” solution means making a trade-off somewhere else.

patrick451 · on March 24, 2024

The end state of a culture that embraces restart/reboot/clear-cache instead of real diagnoses and troubleshooting is a cohort of junior devs who just delete their git repo and reclone instead of figuring out what a detached HEAD is.

I don't really fault the junior dev who does that. They are just following the "I don't understand something, so just start over" paradigm set by seniors.

hnlmorg · on March 24, 2024

It’s not either / or.

If you have proper observability in place then you can do your diagnosis without affecting your customers.

fuzzfactor · on March 24, 2024

>diagnosis without affecting your customers.

Plus, at the same time successful diagnosis is also the kind that can have the most dramatic effect on your customers.

In a positive way.

hnlmorg · on March 24, 2024

Sure, but at risk of repeating myself: it’s not either /or. Nobody is suggesting analysis shouldn’t happen. Just that it doesn’t need to happen on a live system.

zettabomb · on March 24, 2024

Honestly, there's a certain cost-benefit analysis here. In both instances (rebooting and recloning), it's a pretty fast action with high chances of success. How much longer does it take to find the real, permanent solution? For that matter, how long does it take to even dig into the problem and familiarize yourself with its background? For a business, sometimes it's just more cost effective to accept that you don't really know what the problem is and won't figure it out in less time than it takes to cop-out. Personally, I'm all in favor of actually figuring out the issue too, I just don't believe it to be appropriate in every situation.

patrick451 · on March 24, 2024

There is a short term calculus and long term calculus. Restarting usually wins in the short term calculus. But if you double down on that strategy too much, your engineering team, and culture writ large, will lilt increasingly towards a technological mysticism.

yosefk · on March 24, 2024

To be fair, with git, specifically, it's a good idea to at least clone for backup before things like major merges. There are lots of horror stories from people losing work to git workflow issues and I'd rather be ridiculed as an idiot who is afraid of "his tools" (as if I have anything like a choice when using git) and won't learn them properly than lose work thanks to a belief that this thing behaves in a way which can actually be learned and followed safely.

A special case of this is git rebase after which you "can" access the original history in some obscure way until it's garbage-collected; or you could clone the repo before the merge and then you can access the original history straightforwardly and you decide when to garbage-collect it by deleting that repo.

theptip · on March 24, 2024

Git is a lot less scary when you understand the reflog; commit or stash your local changes and then you can rebase without fear of losing anything. (As a bonus tip, place “mybranch.bak” branches as pointers to your pre-rebase commit sha to avoid having to dig around in the reflog at all.)

I would never ridicule anyone for your approach, just gently encourage them to spend a few mins to grok the ‘git reflog’ command.

ansgri · on March 24, 2024

Then submodules enter the picture. I’m comfortable with reflog, but haven’t fully grokked submodules yet, easier to reclone.

jonathanlydall · on March 24, 2024

If you’re not super comfortable with Git, before rebasing, simply:

- Commit any pending changes.

- Make a git tag at your current head (any name is fine, even gibberish).

If anything “goes wrong” you can rollback by simply doing reset hard to the tagged commit.

Once done, delete the tag.

Making a complete “backup clone” is a complete waste of time and disk space.

ekimehtor · on March 24, 2024

Isn't the whole purpose of GIT Version Control? In other words to prevent work loss occurring from mergers and/or updates? Maybe I'm confusing GitHub with GIT? PS I want to set up a server for a couple of domain names I recently acquired, it has been many years so I'm not exactly sure if this is even practical anymore. Way back when I used to distribution based off of CENT OS called SME server, is it still common place to use a all in one distribution like that? Or is it better to just install my preferred flavour of Linux and each package separately?

jonathanlydall · on March 25, 2024

Git does source code management.

The two primary source code management activities developers use are versioning of source code (tracking changes which happened over time) and the other being synchronisation of code with other developers.

One of Git’s differentiating strengths is it being decentralised, allowing you to do many operations in isolation locally without a central server being involved. You can then synchronise your local repository with an arbitrary number of other copies of it which may be remote, but you may need to rebase or merge in order to integrate your changes with those of other developers.

Git is more like a local database (it even allows multiple local checkouts against a single common “database”) and it only occasionally “deletes” old “garbage”. Anything you do locally in Git is atomic and can always be rolled back (provided garbage collection hasn’t yet been performed).

Although I’m comfortable enough with using the reflog to rollback changes (I’m also skilled enough in git I haven’t needed to in many years), it’s not very user friendly, it’s essentially like sifting through trash, you’ll eventually be able to find what you lost (provided it wasn’t lost too long ago), but you may have to dig around a bit. Hence my suggestion of tagging first, makes it easy to find it again if needed.

I have very limited Linux experience and have no recommendations on your other question.

ekimehtor · on March 31, 2024

Thank you for the well detailed response to my question. I'm currently working on returning to the CS Field due to a devastating and career ending injury. The specific Field I'm interested in his programming the interface between hardware such as robotics and user interfaces. So much has changed over the past decade I feel like I'm having to start all over and relearn everything to do with programming! And on top of that I have to also relearn how to live as a quadriplegic! Thank goodness for the Internet and it's incredible amount of free knowledge available these days!

Salgat · on March 24, 2024

If it's happening so rarely that killing is a viable solution, then there's no reason to troubleshoot it to begin with. If it's happening often enough to warrant troubleshooting, then your concerns are addressed.

crabbone · on March 24, 2024

Here's a real-life example. We have a KVM server that has its storage on Ceph. It looks like KVM doesn't work well with Ceph, esp. when MD is involved, so, if a VM is powered off instead of an orderly shutdown, something bad is happening to MD metadata, and when the VM is turned on again, one MD replica can be missing. This happens infrequently, and I've never been in a situation when two replicas died at the same time (which would prevent a VM from booting), but it's obviously possible.

So... more generally, your idea with replacing VMs is rather naive when it comes to storage. Replacement incurs penalties, s.a. eg. RAID rebuilds. RAIDs don't have the promised resiliency during rebuild. And, in general, rebuilds are costly because they move a lot of data / wear the hardware by a lot. Worst yet, if you experience the same problem that caused you to start a rebuild in the first place during the rebuild, the whole system is a write-off.

In other words, it's a bad idea to fix problems without diagnosing them first if you want your system to be reliable. In extreme cases, this may start a domino effect, where the replacement will compound the problem, and, if running on rented hardware, may also be very financially damaging: there were stories about systems not coping with load-balancing and spawning more and more servers to try and mitigate the problem, where problem was, eg. a configuration that was copied to the newly spawned servers.

whirlwin · on March 24, 2024

That might work in some scenarios. If you're a "newer" company where each application is deployed onto individual nodes, you can do this.

But consider that the case for older companies, where it was more common to deploy several systems, often complex ones, onto the same node. You will also cause outages to system x, y and z too. Maybe some of them are inter-dependent? You have to outwhey the consequences and risks carefully in any situation before rebooting.

bschne · on March 24, 2024

> Cloud definitely has downsides, and isn’t a fit for all scenarios but in my experience it’s great for situations like this.

At least as I read it, this contains the assumption that that‘s not how you deploy your applications

FridgeSeal · on March 24, 2024

> it was more common to deploy several systems, often complex ones, onto the same node.

Yeah we do this? It doesn’t pose an issue though. Cordon the node (stop any new deployment going on), drain it to remove all current workloads (these either have replicas, or can be moved to another node, if we don’t have a suitable node, K8s spins up one automatically) and then remove the node. Most workloads either have replicas spare, or in the case of “singleton” workloads, have configs ensuring the cluster must always have 1 replica available, so it’s waits for the new one to come up before killing the old. Most machines deploy and join the cluster in a couple of minutes, and most of our containers take only like, 1 or 2 seconds to deploy and start serving on a machine, so rolling a node is a really low impact process.

eversincenpm · on March 24, 2024

One could argue that most devs these days are parts swappers with all the packages floating around.

throw5323446 · on March 24, 2024

> Instead of messing around trying to repair it, simply kill the machine, or take it out of the pool. Get a new one.

"4:10pm the new machine still has the same performance issue"

FridgeSeal · on March 24, 2024

Sure, but more often than not - esp in cloud scenarios, sometimes you just get a machine that is having a bad day and it’s quicker to just eject it, let the rest of the infra pick up the slack, and then debug from there. Additionally if you’ve axed a machine, and got the same issue, you know it’s not a machine issue, so either go look at your networking layer or whatever configs you’re using to boot your machines from…

tjoff · on March 24, 2024

> esp in cloud scenarios

... so the nice thing about the about the cloud is that you can workaround cloud-specific issues?

jandrese · on March 24, 2024

4:20pm Turns out it was DNS

Propelloni · on March 24, 2024

That made me laugh. Thank you. Of course, it is not DNS. DNS has become the new cabling. DNS is not especially complicated, but cabling is neither. Yet, during dot.com and subsequent years the cabling was causing a lot of the problems so that we get used to first check the cabling. But it only took a few more years to realize that it is not always cabling, actually failures are normally distributed.

Is it wrong to check DNS first? No, but please realize that DNS misconfiguration is not more common than other SNAFUS.

ninkendo · on March 24, 2024

    It’s not DNS
    There’s no way it’s DNS
    It was DNS

smackeyacky · on March 25, 2024

Certificates are the new DNS for service breakages

SerCe · on March 24, 2024

That's actually amazing, a reproducible problem is a 90% solved problem!

Jedd · on March 24, 2024

You're describing one of the benefits of virtualised cattle, not necessarily or exclusively 'cloud'.

KingOfCoders · on March 24, 2024

Kill the machine might destroy evidence. It might be the case you have everything logged outside, but most often there is something missing.

monkpit · on March 24, 2024

Take it out of the pool then.