Here's a trick to building confidence in a BEAM system: if you get good at hot loading, you significantly reduce the cost of deployment, and you don't need as much pre-push confidence. You can do things like "I think this works, and if it crashes, I'll revert or fix forward right away" that just aren't a good fit for the more common deployment pattern where you build the software, then build a container, then start new instances, then move traffic, and so on.
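A minimal sketch of what "fix forward right away" looks like from a shell on the live node, assuming the patched source sits at /tmp/my_module.erl (a hypothetical path); c/1 compiles and hot-loads in one step:

    1> c("/tmp/my_module.erl").
    {ok, my_module}
    2> code:soft_purge(my_module).  %% drop the old version only if nothing still runs it
    true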
Of course, there are some changes that you need confidence in before you push, but for lots of things, a slightly crashy intermediate state is acceptable.
As for understanding the OTP stuff, I think you have to be willing to look at their code. Most of it fits the 'as simple as possible' mold, although there are some places where the use case is complex and it shows in the code, or where performance needs trumped simplicity.
There's also a lot of implicitness in the interaction between processes. That takes some getting used to, but I try to mentally model each process in isolation: what does it do when it receives a message, does that make sense, does it need to change; and not worry about the sender at that point. Typically, when every process is individually correct, the whole system is correct; of course, if that always worked, distributed systems would be very boring, and they're not.
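To make that mental model concrete, here's a toy process boiled down to just its message contract; the protocol is made up for illustration, and the sender never appears:

    %% Hypothetical counter process: reason about it one message at a time.
    loop(Count) ->
        receive
            {incr, N} when is_integer(N) ->
                loop(Count + N);
            {get, From} ->
                From ! {count, Count},
                loop(Count);
            stop ->
                ok
        end.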
Erlang's hot reload is a double-edged sword. (Yes, yes, everything is a tradeoff, but this is on another level.)
Because hot code reloading is possible, and because you can attach a REPL session to a running BEAM node, running 24/7 production Erlang systems can, rather counterintuitively, encourage somewhat questionable practices. It's too easy to hot-patch a live system during firefighting and then forget to retrofit the fix to the source repo. I _know_ that one of the outages at my previous job was caused by a missing retrofit patch after a deployment.
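For example, this is all it takes to hot-patch a live node during firefighting (node name, cookie, and path are all hypothetical), and nothing in it pushes the change back to the repo:

    $ erl -sname fixer -remsh prod@appserver1 -setcookie prodcookie
    (prod@appserver1)1> c("/tmp/hotfix/my_module.erl").
    {ok, my_module}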
The running joke is that there have been Ericsson switches that could not be power cycled, because their only correct state was the one actually running in the network: dozens of live hot patches had accumulated over the years without ever being committed back to the repository.
You certainly can forget to push fixes to the source repo. But if you do that enough times, it's not hard to build tools to help you detect it. You can get enough information out of loaded modules to figure out if they match what's supposed to be there.
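A minimal sketch of such a check, assuming the beam files on disk are the source of truth (loaded_mismatches/0 is a made-up name); beam_lib:md5/1 yields the same checksum as Mod:module_info(md5) when disk and memory agree:

    -module(loaded_check).
    -export([loaded_mismatches/0]).

    %% Modules whose loaded code differs from their beam file on disk.
    loaded_mismatches() ->
        [Mod || {Mod, _} <- code:all_loaded(),
                is_list(code:which(Mod)),  %% skip preloaded/cover-compiled
                on_disk_md5(Mod) =/= Mod:module_info(md5)].

    on_disk_md5(Mod) ->
        case beam_lib:md5(code:which(Mod)) of
            {ok, {Mod, MD5}} -> MD5;
            {error, _, _} -> undefined
        end.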
I had thought there was a way to get the currently loaded object code for a module, but code:get_object_code/1 looks like it pulls from the filesystem. I would think that in a situation where you a) don't know what's running and b) have the OTP team on staff, you could most likely write a new module to at least dump the loaded object code (or something similar), and then spend some time turning that back into source code. But it makes a nice story.
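That said, a loaded module does carry some breadcrumbs you can read without touching disk, which is where I'd start the forensics (the output shown is illustrative):

    1> proplists:get_value(source, my_module:module_info(compile)).
    "/build/src/my_module.erl"
    2> my_module:module_info(md5).
    <<91,12,...>>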
That's part of it, yeah. But, at least in my experience, that tells me you pushed code (to disk) and didn't load it. You could probably just notify at 4 am every day if code:modified_modules() /= [], assuming you don't typically do operations overnight. No big deal if you're doing emergency fixes at 4 am; you'll get an extra notification, but you're probably knee-deep in notifications already, so what's one more per node?
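Something like this, where notify/1 is a stand-in for whatever alerting you already have:

    %% Run on each node from a timer at 4 am.
    check_loaded_vs_disk() ->
        case code:modified_modules() of
            [] -> ok;
            Mods -> notify({loaded_code_differs_from_disk, node(), Mods})
        end.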
But that's not enough to tell you whether the code on disk matches what it's supposed to be; you'd need some infrastructure that keeps track of that too. If you package your code, though, your package system probably has a verification check, which you can also run at 4 am.
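As a sketch, assuming an RPM-based system and a made-up package name; rpm -V prints nothing when the files on disk match the package database:

    check_disk_vs_package() ->
        case os:cmd("rpm -V my-erlang-release") of
            "" -> ok;
            Diff -> notify({disk_differs_from_package, node(), Diff})
        end.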