Infrastructure as Code Should Feel (scalefactory.com)
79 points by tutunak on April 4, 2022 | 50 comments


Some folks seem to be taken aback by the word "feel" in the title, but it rings true to me.

I work at a place with lots of super smart people. Almost everything is automated, but a lot of that automation breaks and you have to constantly get the creator involved to turn that "one little knob" to get it working for your job. That's fine for the creator because he/she knows exactly what to do when the problem occurs. However, if you're an internal customer trying to use that system, it's infuriating.

If you're going to create IaC tooling for a team of 50 people, it needs to work without contacting the author. If not, you've just taken a 5-minute job and turned it into a 2-hour search-for-the-bug event.


I had this exact issue at my last company. We had a system that tore down and rebuilt our entire clusters to update QPS estimates. It was the right idea and the right approach, but it was a big gnarly ball of code that broke frequently, requiring the steps it got stuck on to be done by hand and the code fixed for next time. So anyone working on it still had to know and be comfortable executing any arbitrary manual steps, as well as understand the code enough to fix it (or get the expert involved). I’m doubtful it was a net positive in that state.


I am an IaC advocate. I see articles like this as a barrier to meaningful IaC adoption.

> In the tech industry, we can be guilty of the same crime sometimes. Certain practices get ingrained into the profession to the point where we forget exactly why we did it in the first place.

But in school he hadn't known why he did it in the first place.

And more often in tech I see cargo cult practices that people don't understand but follow some steps that may or may not "work." Often these are misunderstood "best practices."

He goes on to say:

> Infrastructure as Code (IaC) is a practice I really feel should be implemented everywhere it is relevant. It is good common practice, the popularity of which continues to grow.

Which is exactly how cargo cult practices spread. What do they do? How do they do it? I dunno but everyone seems to be doing something and I think it must be like FOO.

It's also incredibly dangerous to say we should follow a practice because of how it "feels." Does it "feel stable" or "feel understandable" or "feel fast?" Who cares. Show me the value, don't try to give me the tingles!


To be fair, the author does provide arguments for why Infrastructure as Code should be used, in the next few paragraphs.


> It's also incredibly dangerous to say we should follow a practice because of how it "feels." Does it "feel stable" or "feel understandable" or "feel fast?" Who cares. Show me the value, don't try to give me the tingles!

I disagree. First of all, how something feels says something about people's intuition which is a decent indicator of reality. More importantly though, how something feels affects how (or even if) people will use it. A system which 'feels' wrong will be under-used and generally worked around. This makes it a bad system in the real world, regardless of 'how great it would be if everyone used it'. It's similar with vaccines. The fact that to many people a vaccine 'felt wrong' made the vaccine actually less effective at blocking the spread of corona.

The value of something depends very much on 'how it feels'.


The real world doesn't care about feelings. Feelings are extremely subjective, making these kinds of statements quite vague.

As a person with alexithymia, this is so confusing, I have no idea what "feels good" is supposed to mean. I know it's not bad, but that's a big spectrum of meaning. Good doesn't mean great or brilliant either. Just confusing.

Emotional blindness: https://en.wikipedia.org/wiki/Alexithymia


Physical reality is barely changed by people's emotions (small shifts in their biochemistry).

Physical reality is very often changed significantly by the choices people make, which are rooted in their emotions.

So, while the universe as such may not care about emotions, they still matter as long as you're interacting with humans (including oneself).

FWIW, I'd not heard of alexithymia before, and I now am thinking I may have had it most of my life.

I've gotten much better at recognizing emotions in myself and others over the past two years through a combination of prayer, therapy, introspection (a skeptic might argue my prayer was just an obscured form of introspection), and reading (Non-violent Communication by Marshall Rosenberg helped me immensely).

I've seen the world become much more comprehensible as this has changed - other humans are a big part of the world, given how dependent we all are on civilization for survival and thriving, and as hinted above, I now understand my own actions and choices much better too. Not being able to recognize my emotions didn't mean I had none.


FWIW, "feel" and "good" as used in this article are not about emotions; they're more of a shorthand that means "Do I trust that this is doing what I expect it to regarding (safety | stability | understandability)?"

"Feel": "I trust."

"Good": "Does what I expect it to."


"Trust" is (or at least can be) an emotion.


Trust is the faith that the subject won’t betray you or your interests despite being in a position to do so.

It’s a belief, not an emotion.


Hmm, yeah, good point. I think you're right.

I guess the emotions associated with trust might be "peace" or "confidence".

Thanks for correcting my error!


I don’t really think IaC is there yet, as someone who has used terraform (and now pulumi) in production for some time.

My biggest gripe is that the feedback loop is fairly slow: planning often predicts valid config that fails in deployment. It doesn’t help that the process of planning is really slow (see https://github.com/pulumi/pulumi/issues/8872; it can take minutes for pulumi to plan changes if using the azblob backend), with little incentive for teams like pulumi to investigate or fix unless you use their cloud. I get that pulumi is in the business of promoting their own cloud, but there are few (if any that I’m aware of) IaC solutions not intrinsically tied to for-profit SaaS, because it’s incredibly labor intensive to build consistent interfaces to several constantly changing cloud providers.


out of curiosity, how did you like pulumi compared to terraform? I'm starting to hear about it more often


As someone starting with terraform literally this weekend, why did you move away?


Terraform as a language is a bit clunky and feels not quite complete. For example, to accomplish if/else logic, you have to hack something like: `count = var.is_foo ? 1 : 0`
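
A minimal sketch of that workaround, assuming Terraform 0.15 or later and the AWS provider. The variable, resource, and output names are hypothetical and only there for illustration:

```hcl
# Hypothetical flag; the name is_foo is taken from the comment above.
variable "is_foo" {
  type    = bool
  default = false
}

# "if is_foo": the queue is created once when the flag is true,
# and not at all when it is false.
resource "aws_sqs_queue" "foo" {
  count = var.is_foo ? 1 : 0
  name  = "foo-queue"
}

# count turns the resource into a list, so every reference has to cope
# with zero or one elements. one() returns the single element, or null
# when the list is empty.
output "foo_queue_arn" {
  value = one(aws_sqs_queue.foo[*].arn)
}
```

The clunky part is less the ternary itself than the fact that everything downstream now has to index into, or splat over, a list that may be empty.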

More than the language syntax, getting any response from the hashicorp team on their official providers is like rolling the dice. If your bug or feature PR is a bit on the fringes and not in their sights, it can sit for months/years without traction or even a non-automated response. With just the aws provider, there are over 3000 issues and nearly 400 open PRs. That team is understaffed or mismanaged. It has gotten marginally better over the last year or so, but generally speaking, Hashicorp doesn’t seem to care while they’re pouring all efforts into their own cloud.

1. https://github.com/hashicorp/terraform-provider-aws/issues/1...
2. https://github.com/hashicorp/terraform-provider-aws/issues/6...
3. https://github.com/hashicorp/terraform-provider-aws/issues?q...


Terraform is amazing and is also very thin. It’s amazing to be able to commit your infrastructure to source control, but it’s nothing without effective well-maintained APIs that can be driven by Terraform providers.

I’ve built custom Terraform providers to map to custom APIs, and the experience was just fantastic. I’ve also used scripts with AWS resources that were a chore to use because the underlying API was a trash fire.

Specifically with respect to Terraform the language, the uniform adoption of functional programming concepts would be nice— right now providers expose varying ways of dealing with sets of resources in more or less confusing and incompatible ways (count, and the various times that you cannot use count).

So the problems with TF are twofold: the language is not quite consistent, and the ecosystem depends entirely on the quality of contributed providers and/or the module shims built on top of poor providers.

It can be easy to get frustrated during IaC development, but an absolute relief when you can rock deployments confidently.


Asking my engineers to learn a completely new DSL (HCL) with its own quirks to change infrastructure felt like it ran against the spirit of IaC. If they were tweaking infrastructure every day, that’d be one thing, but it’s fairly hard to commit a language to your long term memory when you’re interacting with it only on occasion.


You can do Terraform in Java. Or Python. Or... https://www.terraform.io/cdktf


CDK is a reasonably new thing for terraform.

A lot of us have cut our teeth on the previous major version of the terraform DSL, when it was even less capable than it is now (no maps, no for each, etc.) and it wasn't that uncommon to be told to generate terraform JSON directly to work around DSL shortcomings, generating essentially all of the provider objects yourself.


If you have common changes like adding a new queue or storage bucket, you can pull the details out into yaml or json and have Terraform loop over them. As a (potentially) added bonus, you can have different Terraform projects point to the same yaml. I used to do this for maintaining subnets.
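
Roughly what that can look like, as a sketch rather than a recommended layout. The file name queues.yaml, its keys, and the resource name are all assumptions made up for this example:

```hcl
# queues.yaml (hypothetical file next to this module):
#   orders:
#     visibility_timeout_seconds: 30
#   invoices:
#     visibility_timeout_seconds: 120

locals {
  # Parse the YAML into a map of objects, keyed by queue name.
  queues = yamldecode(file("${path.module}/queues.yaml"))
}

# One queue per top-level key in the YAML file.
resource "aws_sqs_queue" "this" {
  for_each = local.queues

  name                       = each.key
  visibility_timeout_seconds = each.value.visibility_timeout_seconds
}
```

Adding a queue then means adding a few lines of YAML, with no HCL changes, and a second project could read the same file by pointing file() at a shared path.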


I think that terraform etc. are largely the wrong way of doing it. I think NixOS (or just the Nix package manager) is the right way of doing it but the Nix ecosystem has its own separate set of issues. Mostly these are around (lack of) documentation making it impenetrable to get on board with.

On the other hand, my feeling is that as your problems become harder and harder (your infrastructure gets more complex) Nix is exactly the same amount of hard as it was, but the other tools begin to struggle more and more. Sometimes that's from a performance point of view (the plan speed issue you're talking about), sometimes it's because edge cases of unexpected results begin to creep in.


Nix is wonderful, and very difficult.

But, if you're talking about IaC, you need to more precisely talk about NixOps and how that compares to tools like Terraform or Pulumi. Overall, Nix isn't really aimed at solving IaC for cloud resources.


I use nix on my personal machines, and I think I agree, but the nix tooling leaves a lot to be desired. A well-implemented language server would be enough to drive things forward: I’m constantly unsure as to what types various variables are, or what kind of flags or options I can set in derivations.


Why is the “how” cut off at the beginning of the article title? Makes it really confusing without it.


Removing "How" is one of those automatic edits to titles made by HN, and I think it's a bad idea, as it often changes the meaning. Fortunately, it's possible for the submitter to click "edit" and change it back to the original.


HN does this automatically when you submit a title starting with “How”. I guess it reduces the clickbaityness of certain types of titles, but fails in cases like these.

If it happens and you don’t like it you can edit the submission to add the word back.


I've been doing IaC in AWS since terraform 0.11 (now added Azure and AWS CDK to the mix).

Most cloud providers (and don't get me started with software platforms) support IaC as a complete afterthought. The day you decide to use IaC, you will have chosen to spend your days fighting against their APIs to make things nice and immutable.

Moreover, most languages and tooling are rather immature. Things like secrets, testing, CI/CD, multi repo infra, collaboration, describing stuff that changes itself (e.g. databases that update automatically) are not well understood, solved problems.

Consequently I'm quite diligent when writing my infra as code nowadays. I reserve it for things that really play nice with immutability and are not likely to change all the time. Also, as the author says, not coupling infra code with application code is gold advice.


Any pointers to what you use instead for managing those kinds of corner cases like things that modify themselves?


One solution is adding the attributes you expect to be modified to ignore_changes in a lifecycle block[0], which instructs terraform to disregard changes to those attributes when calculating diffs for plan/apply. That allows you to specify a value for that attribute at creation time and allow for its subsequent manual/non-terraform-managed modification without causing "drift".
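
For the self-updating database case mentioned upthread, a sketch might look like this. The identifiers and values are placeholders, and ignoring engine_version is just one assumed choice of attribute:

```hcl
# Hypothetical variable so the password is not hard-coded.
variable "db_password" {
  type      = string
  sensitive = true
}

resource "aws_db_instance" "example" {
  identifier        = "example-db"
  engine            = "postgres"
  engine_version    = "13.4" # value used at creation time only
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  username          = "example"
  password          = var.db_password

  # AWS may bump the minor version on its own schedule.
  auto_minor_version_upgrade = true
  skip_final_snapshot        = true

  lifecycle {
    # Don't report or revert version changes made outside Terraform.
    ignore_changes = [engine_version]
  }
}
```

The trade-off is that Terraform also stops applying your own later edits to any ignored attribute, so the list is best kept short.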

[0]: https://www.terraform.io/language/meta-arguments/lifecycle


Okayish article. It's more a marketing one than technical.


The "safe" section is directed primarily at the development environment but once you start collaborating with something like Terraform, there are a whole bunch of ways you can forget to plan or apply a change (for example applying in a PR but forgetting to merge or closing it because you don't want to go forward with the change but forgot you applied it or a part of it). Shameless plug but my co-founder and I started Terrateam to try to make it really easy to get going with Terraform on GitHub and stay safe doing it. https://www.terrateam.io/blog/posts/safety-first/


You shouldn't be able to apply in an MR tho, unless it's some playground account. Developers shouldn't have enough permissions to modify production infra from an MR.


Organizations have different rules around when to apply a change. Some prefer to do it pre-merge, that way if something goes wrong they can modify it and apply again, and others prefer to do it post-merge. We are not opinionated on when a customer does it, we just have rules to make sure a change is both merged and applied. Additionally, who can perform an apply is a different, but related, question to when.

The bet behind Terrateam is that a lot of developers and SREs don't want to leave their MR page when applying and planning so we are supporting as much functionality as possible there.


On the other hand, I can't imagine automating something like this:

https://news.ycombinator.com/item?id=30371604


IaC is unfortunately the wrong abstraction for most use cases. People have become religious about it and it’s slowing down development.

I’ve been doing IaC since terraform and k8s came around. We initially all thought it was great, now I think we’re covering up these wonderful UIs for a much worse interface that takes significantly longer, and in many cases gives us very little.

What we will hopefully converge on is UIs that basically do the same things as IaC.


Agreed. IaC has become another gated bottleneck; rather than waiting on an admin to change a server, we wait on an admin to write and test the IaC to change a server.

One reason it's this way is poor design of cloud tech. Cloud tech isn't immutable, idempotent, and versioned, most of the time. Often there are arbitrary limitations and conventions that are not obvious, which cause changes to break unexpectedly ("we don't know the result until apply" seems to happen every single apply, and success in dev is no guarantee of success in prod). And there are very few turn-key solutions, so we're left to keep rebuilding the same things and running into the same problems.

We need people creating SaaS/PaaS/IaaS to create systems that give us what we really want, without having to hire somebody to write really crap fake code to work around what those systems lack. We need GUIs that allow anyone to safely make changes without being an expert. We need turn-key solutions. We need advanced best practice deployment methods to be the first option, not implemented 5 years down the road. We need security to be easy. We need multitenancy to be the default. We need simple billing limits.

... Ok, a lot of that has nothing to do with IaC. But it's clear to me that a lot of what we do today is an unfortunate side effect of how these systems were designed, and what they still lack. Somebody built a railroad, and we're still trying to force the railroads to work like real roads.


> we wait on an admin to write and test the IaC to change a server

This is more about company culture, not IAC. I write the IAC, not a sysadmin. They don't even test it, that's what our test environment is for. There's a few things they don't let us touch, and a few enforced best practices but those are handled in code, and I can incorporate those checks as part of my usual unit test cycle.

> And there are very few turn-key solutions, so we're left to keep rebuilding the same things and running into the same problems.

That's what IAC can do for us. IAC is, at its heart, providing a desired state: Terraform (ansible, etc) checks whether the environment state matches what I requested, and if the answer is no, changes it to match the provided state.

If I was writing code to do this myself, it would do the exact same thing.

> We need GUIs that allow anyone to safely make changes without being an expert.

Amazon has a GUI that, for most operations, is perfectly acceptable. And I'd still rather use IAC than make changes via a GUI in production. And the reason is simple: I'll make mistakes. I'll forget or fat-finger one of dozens of required options for setting up a secure S3 bucket.

If everyone uses IAC against production, then it will work most of the time (the exceptions that I've run up against are very rare).


You can make mistakes in IAC just like in the GUI. If the difference is having a plan stage (which, again, quite often you still don't know what the change is because it can't be known until apply), the GUI could offer a plan stage too.

One of the things Terraform should have had from day 1 was automatic import of all existing cloud resources and generation of hcl. That would allow creation of infra in a GUI and then locking it down into a version controlled declarative configuration. Terraformer (written by a Google team) does exactly this, though it's rudimentary and lacks a lot. I still use it to quickly capture the state of legacy accounts and make changes later.

Amazon should have had this for all its services by default, because it would prevent the need to manually craft code. GUIs exist to prevent people from having to slowly craft code by hand. I think people today still forget the entire purpose of GUIs because there's this cargo cult that says the only way to do anything good is to manually write lines of code.

Personally I'd be happy if I never wrote another line of code in my life, if the GUI would just version the configuration. And even better, if the state were immutable and versioned and didn't need to be constantly "fixed" by a configuration management tool (Terraform).


fancy UIs usually don't even attempt to solve the primary problem that IaC does, which is change management. Systems where changes (rather than operations) are made through interactive UIs are a horror to manage; it works for extremely simple systems, but most systems don't qualify for that.


And even for simple systems like my pet project it's just great to press enter once and have all different AWS resources deployed/updated/replaced/destroyed.


yes but a UI could give you those properties with an ability to break glass into the underlying logic


I was about to disagree with you until I read the part where you are using terraform with k8s. IaC starts breaking down once you start using k8s - especially if you are using a flavor like EKS that is both kubernetes and its own IaC platform.


k8s has its own built in IaC with object definitions (usually yaml or some sort of yaml templates) and controllers to apply the changes. I don't think most people are managing k8s objects directly with edit/patch--they're just applying complete yaml/object definitions to override current state.

If you want drift/auto apply you can use a CD solution like Argo


Yes, that is what I was referring to.


I feel like infrastructure as code is one of those things that highlights your company's poor practices more than it does highlight the failures in say terraform or cloudformation.

A company that is willing to work at the code level to automate their systems will find tools like terraform and cloudformation useful, companies that have a LAMP stack from 2010 and no way to test their infrastructure changes in a safe way will not...


I think it's more about a culture that says you should have an "automate first" mindset.

Yes, changing 500 lines in a text file if you can write grep replace in 30 seconds is nice.

Automating a deployment that you do once a month and that takes 20 mins to prepare and run? I don't feel it. Developing such automation takes considerable resources, especially if you cannot test it right away on prod and have to make a test env, plus other overhead.

Maybe one should automate preparation parts of such deployment first and then see how it goes.


Automation is not only about saving time, it's also about guaranteeing consistency and quality.


I like Terraform but I've grown to not fully trust its plan output.

There are lots of cases where you can plan something successfully without errors but then when you go and apply it you'll run into errors and now your infrastructure is in a half working state where some resources applied successfully and others failed.


You’re not wrong on this.

Being able to give 100% guaranteed valid plans would require cloud providers to publish full specifications of their APIs, including (most importantly) the domain of each property relative to possible values of all the other properties of a resource.

Given that most cloud providers struggle to publish an API where even the data types for properties are correct, this does not seem realistic today.

Another way to achieve a good result would be an accurate “dry run” API from the provider, but these are also often inconsistent when they exist, and make planning painfully slow.


Yeah, it's not an easy problem to solve.

But it is worth pointing out because you can't trust plans for figuring out if something will work or not. However they are great for letting you know what's about to get CRUD'd, which in itself is very valuable to help prevent a set of issues but it doesn't guarantee victory in the end.



