There's a difference between permanent staging environments that need maintenance and disposable "staging" environments that are literally a clone of what's on your laptop that you trash once UAT/smoke is done.
The former costs money and can lie to you; the latter is literally prod, but smaller.
This makes it sound so easy, but in my experience, permanent staging environments exist because setting up disposable staging environments is too complex.
How do you deal with setting up complex infrastructure for your disposable staging environment when your system is more complex than a monolithic backend, some frontend, and a (small) database? If your system consists of multiple components with complex interactions, and you can only meaningfully test features when the staging database holds enough data and it's _the right_ data, then setting up disposable staging environments is not that easy.
Sibling here but I can talk a bit about how we do it.
Through infrastructure as code. We do not have a monolithic backend; we have a bunch of services, some smaller, some bigger. And yes, there's "some frontend," but it's not just one frontend: we have multiple "frontend services" serving different parts of it. As for databases, we use multiple database technologies depending on the service. Some services use only one of them, while others use whatever mix is best suited to a particular use case. For one of those we use sharding; a staging or dev environment doesn't need sharding, so those environments use the single shard we create there, but the same shard-lookup mechanism is used.

For data, it depends. We have a data generator that can be loaded with different scenarios: either generator parameters or full-fledged "db backup style" definitions that you can use but don't have to. We deploy to prod multiple times per day (basically shortly after something hits the main branch).
Through the exact same means we could also re-create prod at any time and in fact DR exercises are held for that regularly.
Absolutely. The answer is better integration boundaries but then you’re paying the abstraction cost which might be higher.
It’s particularly difficult when the system under test includes an application that isn’t designed to be set up ephemerally: application-level managed services with only ClickOps configuration, proprietary systems where such a request is atypical and blocked by egregious licensing costs, or those that contain a physical component (e.g. a POS with physical peripherals).
it's actually "pretty easy" to do when you start from first principles.
I usually ask "can I build your code on my laptop? is this the same as what's in prod?" usually the answer is no, so I work to turn that into a yes.
often times, I find that much of the complexity that you speak of is due to shared services that few have invested time into running locally precisely because of long-lived dev/staging envs, like access to data (databases, filesystems, secrets managers, etc) or tight dependencies (config services, databases, and other APIs come to mind).
example. i once worked with a team where we tried to get their app running locally in docker. (they used Pivotal Cloud Foundry back when it was called PCF; it's called TAS, Tanzu Application Service, now.) their app needed to use a dev instance of a db when it was not in a prod env. we asked if we could get a mocked schema. they said yes, but it would take three days.
it took three days because another team would manually produce the dataset from querying prod and modifying values. since they loaded it into the dev/staging environments, teams just used that. leadership also had no way of knowing whether devs were using data with real values on their workstations (because lack of automation and auditing), so politics were involved in producing a local schema that we could load into Postgres on Compose. (this was a financial company, so any environment with PII is fair game for auditors, which costs time and money.)
we ended up reverse-engineering the tables they needed so we could produce fake data good enough for integration to pass, but of course that introduces environment stratification of another kind, since this team didn't own the data.
honestly, now that i wrote this, if every CTO forced their teams to make their core applications 12-factor, then staging environments would go away naturally while improving code quality and platform safety.
If at all possible, your entire infrastructure should be defined as code.

At my workplace we use AWS CDK for infrastructure, and standing up a new environment is as easy as calling ‘cdk deploy’; then a script runs after the provision to copy in data.
Yeah, it sounds to me like OP had the former, which they've dropped, and haven't yet found a need for the latter.
I work for a tiny company that, when I joined, had a "pet" prod server and a "pet" staging server. The config between them varied in subtle but significant ways, since both had been running for 5 years.
I helped make the transition the article described and it was huge for our productivity. We went from releasing once a quarter to releasing multiple times a week. We used to plan on fixing bugs for weeks after a release, now they're rare.
We've since added staging back as a disposable system, but I understand where the author is coming from. "Pet" staging servers are nightmarish.