This sandboxing for services provides similar isolation as various container run...

thu2111 · on April 27, 2020

If I understand Docker correctly, it's not actually intended to be a sandbox and wasn't designed as such (e.g. the daemon runs as root, or at least used to). It's not clear to me what the threat model for running untrusted Docker images is, or how you'd know what the expected set of permissions were except by reading a README.

Whereas this feature is explicitly a sandboxing feature, and the needed permissions are enumerated by the service file.
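To make "enumerated by the service file" concrete, here is a rough sketch that pulls the sandboxing directives out of a unit file. The unit text, service name, and binary path are hypothetical; the directive names are real ones from systemd.exec(5), but the list is not exhaustive. Python's `configparser` only approximates unit-file syntax, so treat this as an illustration rather than a real parser.

```python
import configparser

# Hypothetical unit file: the description and binary path are made up.
UNIT_TEXT = """\
[Unit]
Description=Example sandboxed service (hypothetical)

[Service]
ExecStart=/usr/local/bin/example-daemon
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
"""

# A handful of real systemd sandboxing directives; deliberately not exhaustive.
SANDBOX_KEYS = {
    "NoNewPrivileges", "PrivateTmp", "PrivateDevices", "PrivateNetwork",
    "ProtectSystem", "ProtectHome", "CapabilityBoundingSet", "SystemCallFilter",
}


def sandbox_settings(unit_text):
    """Return the sandboxing directives found in the [Service] section."""
    # Assumption: configparser is close enough to unit-file syntax for a demo.
    # It does not handle repeated keys or line continuations the way systemd does.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep directive names case-sensitive
    parser.read_string(unit_text)
    if not parser.has_section("Service"):
        return {}
    return {k: v for k, v in parser.items("Service") if k in SANDBOX_KEYS}


if __name__ == "__main__":
    for key, value in sandbox_settings(UNIT_TEXT).items():
        print(f"{key}={value}")
```

The point is only that the restrictions sit in one declarative file that can be read, diffed, and audited, which is harder to do with a README.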

speedgoose · on April 27, 2020

A Docker container cannot contact the daemon as far as I know. Unless you bind it, but then you know about the risks.

hiram112 · on April 27, 2020

Not that it's exactly relevant to this article, but on RHEL 8, at least, Docker isn't supported, and instead they use their own container runtime called Podman along with Buildah for building them.

Podman does not run as root, and thus neither do the containers.

I tested it out on my development backup laptop; I usually use Docker-CE on my main MBP. Podman and Buildah were able to deal with all my individual containers, but their replacement for Docker-Compose failed on all my compose environments, and the errors were not helpful. I ended up installing an unsupported version of Docker-CE, and everything worked fine.

* Podman: https://podman.io/

* Buildah: https://github.com/containers/buildah

* Podman-Compose: https://github.com/containers/podman-compose

colechristensen · on April 27, 2020

Cgroups limit the impact that anything inside the container can have on anything outside the container.

It doesn't matter that the daemon runs as root; it starts processes in a way that prevents them from interacting with other daemons, filesystems, and other resources.

You don't quite understand docker correctly :)

TheDong · on April 27, 2020

It's not cgroups, but rather namespaces and seccomp (and apparmor/selinux on some distros) that sandbox the processes inside the container.

cgroups are used mostly for resource limits, not for sandboxing (aka namespacing).

Docker by default does have a slightly more lax security posture than systemd or lxc (i.e. a default set of capabilities that isn't explicitly enumerated and a focus on UX over tweaking them, no user namespaces by default, etc.), though you're right that it is largely meant to be a secure sandbox for untrusted containers, as long as you know what you're doing.
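For anyone who wants to see which capabilities a process actually ends up with, here is a minimal sketch that decodes the `CapEff` bitmask from `/proc/<pid>/status`. The bit numbers follow capabilities(7); only the first 32 are listed, so higher bits are ignored. The sample mask `0xa80425fb` is the value commonly reported for a default Docker container, but that is an assumption to check on your own system, not something stated in this thread.

```python
# Capability names indexed by bit number, per capabilities(7). Bits above 31 omitted.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
]

# Assumption: mask often seen as CapEff inside a default Docker container.
SAMPLE_DOCKER_MASK = 0xA80425FB


def decode_caps(mask):
    """Return the names of the capabilities whose bits are set in mask."""
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]


def effective_caps_from_status(status_text):
    """Find the CapEff line in /proc/<pid>/status text and decode it."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(int(line.split()[1], 16))
    return []


if __name__ == "__main__":
    print("sample mask:", decode_caps(SAMPLE_DOCKER_MASK))
    try:
        with open("/proc/self/status") as fh:
            print("this process:", effective_caps_from_status(fh.read()))
    except OSError:
        print("no /proc/self/status here (probably not Linux)")
```

Running it inside and outside a container gives a quick way to compare the implicit default set against what a systemd unit spells out with `CapabilityBoundingSet=`.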

colechristensen · on April 27, 2020

Ah, I was under the impression that namespacing was a part of cgroups in general.

TheDong · on April 28, 2020

To quote Jessie's blog post [0]: "containers were not a top level design, they are something we build from Linux primitives [Linux namespaces and cgroups]".

cgroups can be used without namespaces, and the reverse is also true. Both of them are part of linux container implementations (like lxc and docker), but for an easy example, systemd uses cgroups for every service, and only uses namespaces for ones you very explicitly turn them on for.

Don't quote me on this, but I also think cgroups landed in the kernel many years before namespaces did.

[0]: https://blog.jessfraz.com/post/containers-zones-jails-vms/
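The independence of the two primitives is visible in `/proc`: cgroup membership and namespace membership are exposed through separate files. A rough sketch, assuming a Linux host with `/proc` mounted; the function names are made up for illustration.

```python
import os


def parse_cgroup(text):
    """Parse /proc/<pid>/cgroup lines: 'hierarchy-ID:controller-list:path'."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def namespace_ids(pid="self"):
    """Map namespace type to its link target, e.g. 'mnt' -> 'mnt:[4026531840]'."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # no /proc, or no permission to look at that pid
    for name in names:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            pass  # some entries may not be readable
    return result


if __name__ == "__main__":
    try:
        with open("/proc/self/cgroup") as fh:
            for entry in parse_cgroup(fh.read()):
                print("cgroup:", entry)
    except OSError:
        print("no /proc/self/cgroup here (probably not Linux)")
    for ns_type, ident in namespace_ids().items():
        print("namespace:", ns_type, ident)
```

A service started by systemd without any namespace options would be expected to show its own cgroup path but the same namespace identifiers as the rest of the host, while a container typically differs on both.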

eeZah7Ux · on April 27, 2020

It's way better than container runtimes. It's proper security sandboxing and it comes by default on most Linux distributions.

arianvanp · on April 27, 2020

It's the same kernel features. So no, it's identical.

lazyier · on April 27, 2020

It's not identical. Implementation matters.

It's not enough that a system has the capability to do something; ideally it needs to be well documented, easy to use correctly, difficult to use incorrectly, repeatable, and have its correct usage verifiable, with logging and monitoring available.

When you have a piece of software you want to sandbox, how exactly are you going to do it? What are the steps? Are they going to be easy for other people to follow and understand what is going on? How do you know it's working correctly?

These sorts of things matter. Not just in terms of usability, but also security. Having the same limitations and kernel hooks underneath doesn't make sandboxing implementations identical. It's still very much possible to have one that is objectively better than another.

I don't know if this is the best implementation, but it's certainly nice that if you are using Linux you probably have it available already. Out of the box.

acdha · on April 27, 2020

You're correct at a very narrow level, considering only the mechanism used to apply the sandbox, but think of the larger picture and especially how container runtimes are not created equal. For example, dockerd involves a running daemon with access control issues, which many people handle by handing out root access. Podman is better but far less common outside of the Kubernetes world. If you're trying to give generic advice, systemd avoids needing to drag in that extra discussion about which launcher you're using and how it's configured.

An interesting question would be integration with other features like SELinux or seccomp, since those are commonly punted on but make a huge difference in security.

dirtydroog · on April 27, 2020

Are you suggesting that a system's services should be run in individual docker containers?

nightfly · on April 27, 2020

No, they're saying that systemd provides many of the same benefits that running things in Docker provides.

MertsA · on April 27, 2020

No, he's saying that isolation via systemd is basically the same kind of isolation that you get with runtimes like docker. It's just Linux namespaces, which conventional wisdom treats as a relatively modest security boundary at best. The key takeaway here is that you can get this added security with minimal to no performance impact in a way that's simple and straightforward for the sysadmin to configure.