From ae21d322ec539852b64fbc86c7661f95500316c0 Mon Sep 17 00:00:00 2001 From: ctalledo Date: Sun, 20 Oct 2019 04:01:27 +0000 Subject: [PATCH] Updated and extended documentation as per latest Sysbox features. --- .remarkrc | 16 + README.md | 261 ++-- dockerfiles/alpine-docker/Dockerfile | 7 + .../alpine-supervisord-docker/Dockerfile | 34 + .../docker-entrypoint.sh | 20 + .../supervisord.conf | 17 + dockerfiles/debian-stretch-docker/Dockerfile | 6 +- .../syscont-with-inner-containers/Dockerfile | 13 + .../docker-pull.sh | 15 + .../ubuntu-bionic-systemd-docker/Dockerfile | 49 + dockerfiles/ubuntu-bionic-systemd/Dockerfile | 22 +- dockerfiles/ubuntu-disco-docker/Dockerfile | 6 +- docs/design.md | 329 +++-- docs/issue-guidelines.md | 12 +- docs/quickstart.md | 1157 +++++++++++++++++ docs/security.md | 201 +++ docs/troubleshoot.md | 113 +- docs/usage.md | 658 ++++++---- 18 files changed, 2377 insertions(+), 559 deletions(-) create mode 100644 .remarkrc create mode 100644 dockerfiles/alpine-docker/Dockerfile create mode 100644 dockerfiles/alpine-supervisord-docker/Dockerfile create mode 100644 dockerfiles/alpine-supervisord-docker/docker-entrypoint.sh create mode 100644 dockerfiles/alpine-supervisord-docker/supervisord.conf create mode 100644 dockerfiles/syscont-with-inner-containers/Dockerfile create mode 100755 dockerfiles/syscont-with-inner-containers/docker-pull.sh create mode 100644 dockerfiles/ubuntu-bionic-systemd-docker/Dockerfile create mode 100644 docs/quickstart.md create mode 100644 docs/security.md diff --git a/.remarkrc b/.remarkrc new file mode 100644 index 0000000..617917e --- /dev/null +++ b/.remarkrc @@ -0,0 +1,16 @@ +{ + "plugins": { + "remark-toc": { + "tight": true, + "heading": "contents" + }, + "remark-validate-links": { + }, + "remark-lint": { + "no-multiple-toplevel-headings": false, + "maximum-line-length": 90, + "table-pipe-alignment": true, + "table-cell-padding": false + } + } +} \ No newline at end of file diff --git a/README.md b/README.md 
index ebd0e5a..8215266 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,29 @@ -Sysbox: System Container Runtime -================================ +# Sysbox: System Container Runtime + +## Contents + +- [About Nestybox](#about-nestybox) +- [About Sysbox](#about-sysbox) +- [Features](#features) + - [System Container Deployment](#system-container-deployment) + - [System Container Software](#system-container-software) + - [System Container Image Creation](#system-container-image-creation) + - [Security and Isolation](#security-and-isolation) +- [Supported Linux Distros](#supported-linux-distros) +- [Host Requirements](#host-requirements) +- [Installation](#installation) +- [Usage](#usage) +- [Documentation](#documentation) +- [Software supported inside the System Container](#software-supported-inside-the-system-container) +- [Integration with Container Managers](#integration-with-container-managers) +- [Production Readiness](#production-readiness) +- [Troubleshooting](#troubleshooting) +- [Issues](#issues) +- [Roadmap](#roadmap) +- [We need your feedback](#we-need-your-feedback) +- [Uninstallation](#uninstallation) +- [Contact](#contact) +- [Thank You](#thank-you) ## About Nestybox @@ -10,13 +34,14 @@ with Docker (and soon Kubernetes). A Nestybox system container is a Linux container designed to run low-level system software, not just applications. See this [blog article](https://blog.nestybox.com/2019/09/13/system-containers.html) for more info on system -containers and the use cases we envision for them. +containers and some of the use cases we envision for them. Our mission is to make our system containers run as many system-level workload types as possible in order to provide users a fast, efficient, and easy-to-use alternative to virtual machines for -deploying virtual hosts on Linux. And for this work out-of-the-box and -securely, without complex configurations or hacks. +deploying virtual hosts on Linux. 
And for this to work out-of-the-box +and securely, without complex configurations and without resorting +to insecure privileged containers. ## About Sysbox @@ -33,48 +58,73 @@ support a reduced set of features and use-cases at this time. Below is a list of features currently supported by Sysbox. -### Deployment +### System Container Deployment -* Supports deployment of system containers with Docker. +- Supports deployment of system containers with Docker. -* The system containers can run concurrently with regular Docker - application containers, without conflict. +- The system containers can run concurrently with regular Docker + application containers, without conflict. ### System Container Software -* Supports running Docker inside the system container. +- Supports running Docker inside the system container. - - Cleanly & securely, with total isolation between the Docker inside - the container and the Docker on the host. No need to use insecure - privileged containers, or to bind-mount the host's Docker socket - into the container. + - Cleanly & securely, with total isolation between the Docker inside + the container and the Docker on the host. No need to use insecure + privileged containers or to bind-mount the host's Docker socket + into the container. - - The Docker inside the system container can build and run - containers as usual. + - The Docker inside the system container can build and run + containers as usual. - - This is useful for testing & CI/CD use cases. + - This is useful for Docker sandboxing, testing and CI/CD use cases. -### Security & Isolation +- Supports running Systemd inside the system container (preliminary support). -* Strong system container isolation + - Useful for system containers that are used as virtual hosts. - - System containers use the Linux user namespace and exclusive - user-ID and group-ID mappings for increased container-to-host and - container-to-container isolation. 
+ - Run Systemd securely (without resorting to privileged Docker containers). -* Resource isolation + - Super easy: simply launch a system container image with Systemd as + its entry point and Sysbox will ensure the system container is set up + to run Systemd without problems. - - Programs inside the system container (e.g., Docker) are limited - to using the resources given to the system container itself. +### System Container Image Creation -* Partially virtualized procfs +- Use Docker to build system container images, just like regular containers. - - Processes inside the system container see a partially virtualized `/proc`. +- In addition, Sysbox supports using `docker build` or `docker commit` to create + system container images with pre-packaged inner containers in them. - - This makes the system container more closely resemble a real host. + - This enables you to use the system container as a fully pre-configured + Docker sandbox environment. - - Prevents processes within the container from changing global - kernel settings. + - When you start the system container all inner Docker container images + are ready to run. No need to pull the inner Docker images from a + remote repository. + +### Security and Isolation + +- Enhanced system container isolation + + - System containers use the Linux user namespace and exclusive + user-ID and group-ID mappings for increased container-to-host and + container-to-container isolation. + +- Resource isolation + + - Programs inside the system container (e.g., Docker) are limited + to using the resources given to the system container itself. + +- Partially virtualized procfs + + - Processes inside the system container see a partially virtualized `/proc`. + + - This makes the system container more closely resemble a physical + host or VM. + + - Prevents processes within the container from changing global + kernel settings. Please see our [Roadmap](#roadmap) for a list of features we are working on. 
@@ -83,26 +133,37 @@ Please see our [Roadmap](#roadmap) for a list of features we are working on. Sysbox relies on functionality that is only present in very recent Ubuntu kernels: -* Ubuntu 19.04 "Disco" (kernel >= 5.0.0-21.22) -* Ubuntu 18.04 "Bionic" (with 5.0+ kernel upgrade) +- Ubuntu 19.04 "Disco" (kernel >= 5.0.0-21.22) +- Ubuntu 18.04 "Bionic" (with 5.0+ kernel upgrade) -If you need to upgrade your kernel the match requirements stated -above, see [here](docs/troubleshoot.md#upgrading-ubuntu-kernel) for -suggestions on how to do this. +If you need to upgrade your kernel in order to match requirements +stated above, see [here](docs/troubleshoot.md#upgrading-the-ubuntu-kernel) +for suggestions on how to do this. Alternatively it's possible to use Sysbox with slightly older Ubuntu -kernels, but doing so requires that the Docker daemon be configured -with [userns-remap](docs/usage.md#interaction-with-docker-userns-remap). - -In this case you can run Sysbox on the following distros (without the -need to upgrade the kernel): +kernels, but doing so requires configuring Sysbox in "Docker +userns-remap isolation mode". In this case you can run Sysbox on the +following distros (without needing to upgrade the kernel): -* Ubuntu 19.04 "Disco" -* Ubuntu 18.10 "Cosmic" -* Ubuntu 18.04 "Bionic" +- Ubuntu 19.04 "Disco" +- Ubuntu 18.10 "Cosmic" +- Ubuntu 18.04 "Bionic" We plan to add support for more distros in the future. +Here is a summary of the distro requirements: + +| System Container Isolation Mode | Required Distro & Kernel | +| ------------------------------- | ----------------------------------- | +| Exclusive userns-remap | Ubuntu 19.04 Disco (>= 5.0.0-21.22) | +| | Ubuntu 18.04 Bionic (5.0+ kernel) | +| Docker userns-remap | Ubuntu 19.04 Disco | +| | Ubuntu 18.10 Cosmic | +| | Ubuntu 18.04 Bionic | + +See [here](docs/usage.md#system-container-isolation-modes) for info on +Sysbox isolation modes and how to configure them. 
+ ## Host Requirements The Linux host on which Sysbox runs must meet the following requirements: @@ -111,24 +172,24 @@ The Linux host on which Sysbox runs must meet the following requirements: 2) Systemd must be the system's process-manager (the default in the supported distros). -3) Docker must be installed on the host machine. +3) Docker must be installed. ## Installation -1) Download the latest package from the [release](https://github.com/nestybox/sysbox-external/releases) page. +1) Download the latest Sysbox package from the [release](https://github.com/nestybox/sysbox-external/releases) page. 2) Verify that the checksum of the downloaded file fully matches the expected/published one. For example: ```console -$ sha256sum ~/sysbox_0.0.1-0~ubuntu-bionic_amd64.deb -2a02898dc53b4751cf413464b977f5b296d9aac3c5b477e05272bfa881d69cfc /home/user/sysbox_0.0.1-0~ubuntu-bionic_amd64.deb +$ sha256sum sysbox_0.1.2-0.ubuntu-disco_amd64.deb +23b99987bd0c5fb347f0231a47e4dc7c27bd082baaac942055ce6168adf6d9e2 sysbox_0.1.2-0.ubuntu-disco_amd64.deb ``` 3) Install the Sysbox package: ```console -$ sudo dpkg -i sysbox_0.0.1-0~ubuntu-bionic_amd64.deb +$ sudo dpkg -i sysbox_0.1.2-0.ubuntu-disco_amd64.deb ``` In case you hit an error with missing dependencies, fix this with: @@ -140,7 +201,6 @@ $ sudo apt-get install -f -y This will install the missing dependencies and automatically re-launch the Sysbox installation process. - 4) Verify that Sysbox's systemd units have been properly installed, and associated daemons are properly running: @@ -171,15 +231,42 @@ If you omit the `--runtime` option, Docker will use its default `runc` runtime to launch regular application containers (rather than system containers). 
-It's perfectly fine to run system containers along side with regular -Docker application containers on the host at the same time; they won't +It's perfectly fine to run system containers launched with Docker + +Sysbox alongside regular Docker application containers; they won't conflict. -Refer to the [Sysbox User's Guide](docs/usage.md) for other ways to -run system containers with Sysbox. +## Documentation + +We have several documents to help you use and get the best out of +system containers. + +- [Sysbox Quick Start Guide](docs/quickstart.md) -If you hit problems with the instructions above, see the -[Troubleshooting document](docs/troubleshoot.md). + - Provides many examples for using system containers. New users + should start here. + +- [Sysbox User's Guide](docs/usage.md) + + - Provides more detailed information on Sysbox features. + +- [Sysbox Design Notes](docs/design.md) + + - Provides information on Sysbox's design. + +- [Sysbox Security Guide](docs/security.md) + + - Provides information on system container security. + +- [Troubleshooting Guide](docs/troubleshoot.md) + + - Refer to this document if you hit problems. + +- [Issue Guidelines](docs/issue-guidelines.md) + + - Guidelines for filing issues in the Sysbox GitHub project site. + +Also, the [Nestybox blog site](https://blog.nestybox.com) has articles +on how to use system containers. ## Software supported inside the System Container @@ -188,41 +275,39 @@ application container, and thus should be able to run any application that runs in a regular Docker container. In addition, it runs system-level software that does not run in a regular Docker container. -For system-level software, we currently only support running Docker -inside the system container. This allows you to build and run Docker -application containers inside the system container, just as you would -on a physical host or in a VM. It's useful in CI/CD pipelines where -the need for a container to build another container arises often. 
+For system-level software, we currently support running the following +inside the system container: -See [here](docs/usage.md#running-software-inside-the-system-container) for more info on this. +- Systemd -## Integration with Container Managers + - Allows using the system container as a virtual host, much like you + would use a VM. -Sysbox is designed to work with Docker / containerd. +- Docker -We don't yet support other container managers (e.g., cri-o). + - Allows you to build and run Docker application containers inside + the system container, just as you would on a physical host or in a + VM. -## Design + - Allows you to use the system container as a Docker sandbox, or in + CI/CD pipelines where the need to deploy a container to build + another container arises often. -For more detailed info about Sysbox's design, refer to the -[Sysbox design document](docs/design.md). +See [here](docs/usage.md#running-software-inside-the-system-container) for more info on this. -## OCI Compatibility +## Integration with Container Managers -Sysbox is a fork of the [OCI runc](https://github.com/opencontainers/runc). It is mostly -(but not 100%) compatible with the OCI runtime specification. See [here](docs/design.md#oci-compatibility) -for a list of incompatibilities. +Sysbox is designed to work with Docker / containerd. -We believe these incompatibilities won't negatively affect users of -Sysbox and should mostly be transparent to them. +We don't yet support other container managers (e.g., cri-o). ## Production Readiness -Sysbox is still in an experimental stage. It's **not** production ready yet. +Sysbox is still in an alpha / experimental stage. It's **not production ready yet**. -Nestybox is actively enhancing its functionality and fixing issues at this stage. +Nestybox is actively enhancing its functionality and fixing issues at this time. -Your feedback is much appreciated! +Your feedback regarding issues or improvements is much appreciated! 
## Troubleshooting @@ -231,7 +316,7 @@ Refer to the [Troubleshooting document](docs/troubleshoot.md). ## Issues We apologize for any problems in the product or documentation, and we appreciate -customers filing issues that help us improve them. +customers filing issues that help us improve Sysbox. To file issues with Sysbox (e.g., bugs, feature requests, documentation changes, etc.), please refer to the [issue guidelines](docs/issue-guidelines.md) document. @@ -249,21 +334,19 @@ priorities. Here is the list: -* Support for more Linux distros. - -* Support for Docker volume plugins for use with system containers. +- Support for more Linux distros. -* Support for deploying system containers with Kubernetes. +- Support for deploying system containers with Kubernetes. -* Support for other container managers (e.g., cri-o) +- Support for other container managers (e.g., cri-o). -* Running Systemd inside the system container +- Running Kubernetes inside the system container. -* Running Kubernetes inside the system container +- Exposing host devices within the system container. -* Running window managers (e.g., X) inside the system container (for GUI apps & desktops). +- Running window managers (e.g., X) inside the system container (for GUI apps & desktops). -## Feedback +## We need your feedback We love feedback, as it helps us improve Sysbox and set its future direction. @@ -271,7 +354,7 @@ direction. We would much appreciate if you would take a couple of minutes to answer the following survey: -https://www.surveymonkey.com/r/SH8HMGY + ## Uninstallation @@ -302,10 +385,10 @@ $ sudo userdel sysbox Please contact us at `contact@nestybox.com` for any questions. We will be happy to help. -## Thank You! +## Thank You We thank you **very much** for using Sysbox. We hope you find it useful. Your trust in us is very much appreciated. 
--- *The Nestybox Team* +\-- _The Nestybox Team_ diff --git a/dockerfiles/alpine-docker/Dockerfile b/dockerfiles/alpine-docker/Dockerfile new file mode 100644 index 0000000..741ec75 --- /dev/null +++ b/dockerfiles/alpine-docker/Dockerfile @@ -0,0 +1,7 @@ +# +# Alpine + Docker +# + +FROM alpine:latest + +RUN apk update && apk add docker diff --git a/dockerfiles/alpine-supervisord-docker/Dockerfile b/dockerfiles/alpine-supervisord-docker/Dockerfile new file mode 100644 index 0000000..32a9a22 --- /dev/null +++ b/dockerfiles/alpine-supervisord-docker/Dockerfile @@ -0,0 +1,34 @@ +# +# Sample system container with alpine + supervisord + sshd + docker +# +# Run with: +# +# $ docker run --runtime=sysbox-runc -d -P +# + +FROM alpine:latest + +# docker +RUN apk add --update docker && rm -rf /tmp/* /var/cache/apk/* + +# supervisord +RUN apk add --update supervisor && rm -rf /tmp/* /var/cache/apk/* +RUN mkdir -p /var/log/supervisor +#COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf +COPY supervisord.conf /etc/ + +# sshd +RUN apk add --update openssh && rm -rf /tmp/* /var/cache/apk/* +RUN mkdir /var/run/sshd +RUN echo 'root:root' | chpasswd +RUN sed -ri 's/^#?PermitRootLogin\s+.*/PermitRootLogin yes/' /etc/ssh/sshd_config +RUN sed -ri 's/UsePAM yes/#UsePAM yes/g' /etc/ssh/sshd_config +RUN mkdir /root/.ssh +RUN ssh-keygen -f /etc/ssh/ssh_host_rsa_key -N '' -t rsa +EXPOSE 22 + +# entrypoint +COPY docker-entrypoint.sh /usr/bin/docker-entrypoint.sh +RUN chmod +x /usr/bin/docker-entrypoint.sh + +ENTRYPOINT ["/usr/bin/docker-entrypoint.sh"] diff --git a/dockerfiles/alpine-supervisord-docker/docker-entrypoint.sh b/dockerfiles/alpine-supervisord-docker/docker-entrypoint.sh new file mode 100644 index 0000000..e4dd262 --- /dev/null +++ b/dockerfiles/alpine-supervisord-docker/docker-entrypoint.sh @@ -0,0 +1,20 @@ +#!/bin/sh +set -e + +# sys container init: +# +# If no command is passed to the container, supervisord becomes init and +# starts all its configured programs (per 
/etc/supervisord.conf). +# +# If a command is passed to the container, it runs in the foreground; +# supervisord runs in the background and starts all its configured +# programs. +# +# In either case, supervisord always starts its configured programs. + +if [ "$#" -eq 0 ] || [ "${1#-}" != "$1" ]; then + exec supervisord -n "$@" +else + supervisord -c /etc/supervisord.conf & + exec "$@" +fi diff --git a/dockerfiles/alpine-supervisord-docker/supervisord.conf b/dockerfiles/alpine-supervisord-docker/supervisord.conf new file mode 100644 index 0000000..cddeb36 --- /dev/null +++ b/dockerfiles/alpine-supervisord-docker/supervisord.conf @@ -0,0 +1,17 @@ +[supervisord] +stdout_logfile=/dev/stdout +stdout_logfile_maxbytes=0 + +[program:dockerd] +command=/usr/bin/dockerd +priority=1 +autostart=true +autorestart=true +startsecs=0 + +[program:sshd] +command=/usr/sbin/sshd -D +priority=1 +autostart=true +autorestart=true +startsecs=0 diff --git a/dockerfiles/debian-stretch-docker/Dockerfile b/dockerfiles/debian-stretch-docker/Dockerfile index de0ff29..1ac17a2 100644 --- a/dockerfiles/debian-stretch-docker/Dockerfile +++ b/dockerfiles/debian-stretch-docker/Dockerfile @@ -6,10 +6,9 @@ # FROM debian:stretch -RUN apt-get update # Docker install -RUN apt-get install -y \ +RUN apt-get update && apt-get install --no-install-recommends -y \ apt-transport-https \ ca-certificates \ curl \ @@ -21,5 +20,4 @@ RUN add-apt-repository \ "deb [arch=amd64] https://download.docker.com/linux/debian \ $(lsb_release -cs) \ stable" -RUN apt-get update -RUN apt-get install -y docker-ce docker-ce-cli containerd.io +RUN apt-get update && apt-get install --no-install-recommends -y docker-ce docker-ce-cli containerd.io diff --git a/dockerfiles/syscont-with-inner-containers/Dockerfile b/dockerfiles/syscont-with-inner-containers/Dockerfile new file mode 100644 index 0000000..11b2b7f --- /dev/null +++ b/dockerfiles/syscont-with-inner-containers/Dockerfile @@ -0,0 +1,13 @@ 
+# +# Sample Dockerfile to build a system container image that includes inner container images. +# +# Build with: +# $ docker build -t nestybox/syscont-with-inner-containers:latest . +# +# Run with: +# $ docker run -it --runtime=sysbox-runc --hostname=syscont nestybox/syscont-with-inner-containers:latest + +FROM nestybox/alpine-docker + +COPY docker-pull.sh /usr/bin +RUN chmod +x /usr/bin/docker-pull.sh && docker-pull.sh && rm /usr/bin/docker-pull.sh diff --git a/dockerfiles/syscont-with-inner-containers/docker-pull.sh b/dockerfiles/syscont-with-inner-containers/docker-pull.sh new file mode 100755 index 0000000..3c050d2 --- /dev/null +++ b/dockerfiles/syscont-with-inner-containers/docker-pull.sh @@ -0,0 +1,15 @@ +#!/bin/sh + +# dockerd start +dockerd > /var/log/dockerd.log 2>&1 & +dockerd_pid=$! +sleep 2 + +# pull inner images +docker pull busybox:latest +docker pull alpine:latest + +# dockerd cleanup (remove the .pid file as otherwise it prevents +# dockerd from launching correctly inside sys container) +kill $dockerd_pid +rm -f /var/run/docker.pid diff --git a/dockerfiles/ubuntu-bionic-systemd-docker/Dockerfile b/dockerfiles/ubuntu-bionic-systemd-docker/Dockerfile new file mode 100644 index 0000000..4119926 --- /dev/null +++ b/dockerfiles/ubuntu-bionic-systemd-docker/Dockerfile @@ -0,0 +1,49 @@ +# +# Ubuntu Bionic + Systemd + sshd + Docker +# +# Usage: +# +# $ docker run --runtime=sysbox-runc -it --rm -P --name=syscont nestybox/ubuntu-bionic-systemd-docker +# +# This will run systemd and prompt for a user login; the default +# user/password in this image is "admin/admin". Once you log in you +# can run Docker inside as usual. You can also ssh into the image: +# +# $ ssh admin@ -p +# +# where is chosen by Docker and mapped into the system container's sshd port. 
+# + +FROM nestybox/ubuntu-bionic-systemd:latest + +# Docker install +RUN apt-get update && apt-get install --no-install-recommends -y \ + apt-transport-https \ + ca-certificates \ + curl \ + gnupg-agent \ + software-properties-common +RUN curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - +RUN apt-key fingerprint 0EBFCD88 + +RUN add-apt-repository \ + "deb [arch=amd64] https://download.docker.com/linux/ubuntu \ + $(lsb_release -cs) \ + stable" +RUN apt-get update && apt-get install --no-install-recommends -y docker-ce docker-ce-cli containerd.io + +# Add user "admin" to the Docker group +RUN usermod -a -G docker admin + +# sshd install +RUN apt-get update && apt-get install --no-install-recommends -y openssh-server +RUN mkdir /var/run/sshd +RUN echo 'admin:admin' | chpasswd +RUN sed -ri 's/^#?PermitRootLogin\s+.*/PermitRootLogin yes/' /etc/ssh/sshd_config +RUN sed -ri 's/UsePAM yes/#UsePAM yes/g' /etc/ssh/sshd_config +RUN mkdir /root/.ssh +RUN mkdir /home/admin/.ssh +EXPOSE 22 + +# Set systemd as entrypoint. +ENTRYPOINT [ "/sbin/init" ] diff --git a/dockerfiles/ubuntu-bionic-systemd/Dockerfile b/dockerfiles/ubuntu-bionic-systemd/Dockerfile index da1ab5a..d021dc1 100644 --- a/dockerfiles/ubuntu-bionic-systemd/Dockerfile +++ b/dockerfiles/ubuntu-bionic-systemd/Dockerfile @@ -1,18 +1,19 @@ -# Nestybox's systemd dockerfile. # -# Description: +# Ubuntu Bionic + Systemd # -# Image's goal is to serve as a basic building-block for users looking to -# run various background processes (daemons) in Nestybox's system containers. -# For this purpose we are installing systemd process-manager as part of this -# Dockerfile. +# Description: # +# This image serves as a basic reference example for users looking to +# run Systemd inside a system container in order to deploy various +# services within the system container, or use it as a virtual host +# environment. 
# -# Container initialization: +# Usage: # -# $ docker run --runtime=sysbox-runc \ -# -it --rm --name=sys-cont nestybox/ubuntu-bionic-systemd +# $ docker run --runtime=sysbox-runc -it --rm --name=syscont nestybox/ubuntu-bionic-systemd # +# This will run systemd and prompt for a user login; the default user/password +# in this image is "admin/admin". FROM ubuntu:bionic @@ -51,9 +52,8 @@ RUN apt-get update && \ # Create default 'admin/admin' user useradd --create-home --shell /bin/bash admin && echo "admin:admin" | chpasswd && adduser admin sudo - # Make use of stopsignal (instead of sigterm) to stop systemd containers. STOPSIGNAL SIGRTMIN+3 # Set systemd as entrypoint. -ENTRYPOINT [ "/sbin/init" ] \ No newline at end of file +ENTRYPOINT [ "/sbin/init" ] diff --git a/dockerfiles/ubuntu-disco-docker/Dockerfile b/dockerfiles/ubuntu-disco-docker/Dockerfile index cc6759f..78c117b 100644 --- a/dockerfiles/ubuntu-disco-docker/Dockerfile +++ b/dockerfiles/ubuntu-disco-docker/Dockerfile @@ -6,10 +6,9 @@ # FROM ubuntu:disco -RUN apt-get update # Docker install -RUN apt-get install -y \ +RUN apt-get update && apt-get install --no-install-recommends -y \ apt-transport-https \ ca-certificates \ curl \ @@ -22,5 +21,4 @@ RUN add-apt-repository \ "deb [arch=amd64] https://download.docker.com/linux/ubuntu \ $(lsb_release -cs) \ stable" -RUN apt-get update -RUN apt-get install -y docker-ce docker-ce-cli containerd.io +RUN apt-get update && apt-get install --no-install-recommends -y docker-ce docker-ce-cli containerd.io diff --git a/docs/design.md b/docs/design.md index 127c377..fa5625b 100644 --- a/docs/design.md +++ b/docs/design.md @@ -1,22 +1,45 @@ -Sysbox Design Notes -=================== +# Sysbox Design Notes This document briefly describes some aspects of Sysbox's design. 
+## Contents + +- [Sysbox Components](#sysbox-components) +- [Linux Namespace Usage](#linux-namespace-usage) + - [User Namespace](#user-namespace) + - [Cgroup Namespace](#cgroup-namespace) +- [Exclusive User Namespace Mappings](#exclusive-user-namespace-mappings) +- [Ubuntu Shiftfs Module](#ubuntu-shiftfs-module) + - [Shiftfs Security Precautions](#shiftfs-security-precautions) + - [Shiftfs Functional Limitations](#shiftfs-functional-limitations) +- [Procfs Virtualization](#procfs-virtualization) +- [OCI compatibility](#oci-compatibility) + - [Namespaces](#namespaces) + - [Process Capabilities](#process-capabilities) + - [Procfs](#procfs) + - [Cgroupfs Mount](#cgroupfs-mount) + - [Seccomp](#seccomp) + - [AppArmor](#apparmor) + - [Read-only and Masked Paths](#read-only-and-masked-paths) + - [Mounts](#mounts) +- [Sysbox Nesting](#sysbox-nesting) + ## Sysbox Components Sysbox is made up of the following components: -* sysbox-runc +- sysbox-runc -* sysbox-fs +- sysbox-fs -* sysbox-mgr +- sysbox-mgr sysbox-runc is a container runtime, the program that does the low level kernel setup for execution of system containers. It's the -"frontend" of sysbox: higher layers (e.g., Docker & containerd) invoke -sysbox-runc to launch system containers. +"front-end" of sysbox: higher layers (e.g., Docker & containerd) +invoke sysbox-runc to launch system containers. It's mostly (but not +100%) compatible with the OCI runtime specification (more on this +[here](#oci-compatibility)). sysbox-fs is a file-system-in-user-space (FUSE) daemon that emulates portions of the system container's filesystem, in particular portions @@ -28,67 +51,105 @@ sysbox-mgr is a daemon that provides services to sysbox-runc and sysbox-fs. For example, it manages assignment of exclusive user namespace user-ID and group-ID mappings to system containers. -Together, sysbox-fs and sysbox-mgr are the "backends" for +Together, sysbox-fs and sysbox-mgr are the "back-ends" for sysbox. 
Communication between the sysbox components is done via gRPC. -## Linux Namespace Usage +Users don't normally interact with the Sysbox components directly. +Instead, they use higher-level apps (e.g., Docker) that interact with +Sysbox to deploy system containers. -Sysbox always enables all Linux namespaces in the system containers -(including the Linux user namespace). +## Linux Namespace Usage -This is done to improve isolation of the container from the rest of -the system (i.e., root inside the container maps to a fully -unprivileged user on the host). +System containers deployed with Sysbox always use _all_ Linux +namespaces for enhanced isolation & security from the rest of the +system. -This also allows the system container to run more types of workloads, -in particular system-level workloads that require root inside the -container to have full privileges within the container. +That is, when you deploy a system container with Docker + Sysbox +(e.g., `docker run --runtime=sysbox-runc -it alpine:latest`), Sysbox +will always set up the container with all Linux namespaces enabled. This is one area where Sysbox deviates from the OCI specification, -which allows a higher layer (e.g., Docker + containerd) to choose the -namespaces that should be enabled for the container. - -## User Namespace and ID Mappings - -As mentioned in the prior section, Sysbox enables the Linux user -namespace in all system containers. - -The user namespace works by mapping user-IDs and group-IDs between the -container and the host (or more precisely between the container's -user namespace and its parent user namespace). For example, -root in the container maps to a non-root user in the host. - -When starting a system container, if the higher-layer (e.g., Docker + -containerd) provides these mappings to Sysbox via the container's -OCI `config.json` file, then Sysbox honors them. Docker does this -when the Docker daemon is configured with the userns-remap option. 
- -Otherwise (e.g., when Docker is configured without userns-remap), -Sysbox allocates these mappings for the system container. The mappings -remain allocated until the container is destroyed. +which leaves it to the higher layer (e.g., Docker + containerd) to +choose the namespaces that should be enabled for the container. + +In addition to providing enhanced isolation, using all Linux namespaces +(especially the user namespace) allows the system container to run more +types of workloads, in particular system-level workloads that require +root inside the container to have many capabilities within the +container. + +The table below shows a comparison of namespace usage between +Nestybox system containers and regular Docker containers. + +| Namespace | Docker + Sysbox | Docker + runC | +| --------- | --------------- | -------------------------------------------------------------------------- | +| mount | Yes | Yes | +| pid | Yes | Yes | +| uts | Yes | Yes | +| net | Yes | Yes | +| ipc | Yes | Yes | +| cgroup | Yes | No | +| user | Yes | No by default; Yes when Docker engine is configured with userns-remap mode | + +### User Namespace + +Nestybox system containers always use the Linux user namespace. This +is a key feature for enhanced isolation, as it confines the privileges +of processes running inside the container to resources assigned to the +container. + +For example, a process with full capabilities inside the container +(root inside the container) is only capable of using those +capabilities on resources assigned to the container itself. It can't +access resources not assigned to the container (e.g., global system +resources or resources assigned to other containers). + +### Cgroup Namespace + +In addition, system containers also use the cgroup namespace, which +virtualizes cgroup information exposed via the system container's +`/proc` filesystem. 
The end result is that it hides host paths related +to the system container's cgroups from processes inside the system +container. Refer to the kernel's +[cgroup_namespaces][cgroup-namespaces] manual page for more info. + +## Exclusive User Namespace Mappings + +The Linux user namespace works by mapping user-IDs and group-IDs +between the container and the host (or more precisely between the +container's user namespace and its parent user namespace). + +There are different ways this mapping can be done, which give rise to +the "system container isolation modes". + +As described in the [user guide](usage.md), Sysbox supports the +following system container isolation modes. + +- [Exclusive userns-remap mode](usage.md#exclusive-userns-remap-mode) +- [Docker userns-remap mode](usage.md#docker-userns-remap-mode) + +In this section, we will focus on exclusive userns-remap mode and the +manner in which Sysbox allocates user-ID mappings in this mode. + +In exclusive userns-remap mode, Sysbox ensures that all system +containers get **exclusive** user-ID and group-ID mappings on the +host. This has the benefit of hardening container-to-container +isolation (i.e., if a process escapes the container, it will find +itself without permissions to access any files). For example, Sysbox may allocate user-ID mappings for a system container as follows: - | User-ID Range on Host | User-ID range in System Container | - |-----------------------|------------------------------------| - | X -> X + 65535 | 0 (root) -> 65535 | - -where X is chosen by Sysbox from the corresponding range in -`/etc/subuid` as described below. The same mapping applies to Group ID -ranges. - -When allocating the mappings, Sysbox ensures all system containers -get *exclusive* user-ID and group-ID mappings on the host. 
This has -the benefit of hardening container-to-container isolation (i.e., if a -process escapes the container, it will find itself without permissions -to access files of the host and all other containers). +| Host User-IDs | System Container User-IDs | +| -------------- | ------------------------- | +| x to (x+65535) | 0 to 65535 | -The allocated mappings come from the range specified in the host files -`/etc/subuid` and `/etc/subgid`. These files are automatically -configured by Sysbox during installation (more specifically when the -sysbox-mgr component is started during installation). For example: +where 'x' is chosen from the range associated with user `sysbox` in +the host files `/etc/subuid` and `/etc/subgid`. These files are +automatically configured by Sysbox during installation (or more +specifically when the sysbox-mgr component is started during +installation). For example: ```console # more /etc/subuid @@ -106,30 +167,32 @@ If more than 4K containers are running at the same time, Sysbox will by default re-use user-ID mappings from the range specified in `/etc/subuid`. The same applies to group-ID mappings. In this scenario multiple system containers may share the same user-ID mapping, -reducing container-to-container isolation. +reducing container-to-container isolation a bit. It is possible to configure Sysbox to not re-use mappings and -instead fail to launch the container. But this requires restarting +instead fail to launch the system container. But this requires restarting Sysbox (in particular the sysbox-mgr component). See section [Sysbox Reconfiguration](usage.md#sysbox-reconfiguration) for details on this. -One final note: when Sysbox allocates user-ID mappings, the presence -of the Ubuntu shiftfs module in the kernel is required as described -below. +Note that when Sysbox allocates exclusive user-ID mappings, the +presence of the [Ubuntu shiftfs module](#ubuntu-shiftfs-module) in the +kernel is required. 
## Ubuntu Shiftfs Module -Sysbox makes use of the Ubuntu shiftfs module, which is included -in recent Ubuntu kernels (see the list of [supported Linux distros](../README.md#supported-linux-distros) -for more info on this). +When Sysbox is configured in exclusive userns-remap mode (its +default isolation mode), it makes use of the Ubuntu shiftfs module, +which is included in recent Ubuntu kernels (see the list of +[supported Linux distros](../README.md#supported-linux-distros) for more info on this). The purpose of this module is to perform filesystem user-ID and group-ID "shifting" between the a container's Linux user namespace and the host's initial user namespace. -Recall from the [prior section](#user-namespace-and-id-mappings) -that Sysbox uses the Linux user namespace and allocates an exclusive -user-ID range on the host for each system container. +Recall from the [prior section](#exclusive-user-namespace-mappings) +that in Exclusive userns-remap mode, Sysbox allocates exclusive +user-namespace user-ID and group-ID mappings for each system +container. Without the shiftfs module, the system container would see its root filesystem files (as well as any mounted directories and files) owned @@ -144,10 +207,10 @@ module.
By virtue of Sysbox mounting shiftfs on the system container's rootfs as well as mount sources, the ownership of files will be mapped as follows between the host and the system container: - | File Ownership on Host | File Ownership in System Container | - |------------------------|------------------------------------| - | 0 (root) -> 65535 | 0 (root) -> 65535 | - | Others | nobody:nogroup | +| File Ownership on Host | File Ownership in System Container | +| ---------------------- | ---------------------------------- | +| 0 to 65535 | 0 to 65535 | +| Others | nobody:nogroup | This means that the system container processes will now see files with the correct ownership (i.e., directories in the container's `/` @@ -157,7 +220,7 @@ This however also means that files written by the system container's root user will appear as `root:root` on the host (even though the root user in the system container is mapped to a non-root, fully unprivileged user on the host). Because of this, some security -precautions on the host are needed, as described in the next section. +precautions on the host are needed, as described in the [next section](#shiftfs-security-precautions). To verify the Ubuntu shiftfs module is loaded, type: @@ -166,16 +229,14 @@ To verify the Ubuntu shiftfs module is loaded, type: shiftfs 24576 0 ``` -The Ubuntu shiftfs module is required to be present in the kernel when -running Docker with Sysbox, specifically when Docker is configured -without userns-remap (the default and prefered configuration, as -described in the [Sysbox Usage Guide](usage.md#interaction-with-docker-userns-remap)). -If Docker is configured with userns-remap enabled, the Ubuntu shiftfs module -is not required. +The Ubuntu shiftfs module must be present in the kernel when +Sysbox is configured in [exclusive userns-remap mode](usage.md#exclusive-userns-remap-mode). + +The Ubuntu shiftfs module is not used in [docker userns-remap mode](usage.md#docker-userns-remap-mode). 
Sysbox will check for this. If the module is required but not present in the Linux kernel, Sysbox will fail to launch containers and issue -an [appropriate error](troubleshoot.md#ubuntu-shiftfs-module-not-present). +an error such as [this one](troubleshoot.md#ubuntu-shiftfs-module-not-present). ### Shiftfs Security Precautions @@ -183,7 +244,7 @@ When Sysbox uses shiftfs, some security precautions are recommended. These arise from the fact that while the root user in the system container is mapped to a non-root user on the host, files written by -the root user in the system container to mountpoints under shiftfs are +the root user in the system container to mount-points under shiftfs are mapped into `root:root` on the host. If the system container is compromised or runs untrusted workloads, this can cause problems. @@ -200,23 +261,23 @@ container have `root:root` ownership on the host. To reduce the attack surface, the following security precautions are recommended: -* The container's root filesystem should be in a directory accessible - to the host's root user only (e.g., 0700 permissions). +- The container's root filesystem should be in a directory accessible + to the host's root user only (e.g., 0700 permissions). - - This is always the case when using Docker with Sysbox, because the - Docker daemon makes `/var/lib/docker` accessible by the host's - root user only. + - This is always the case when using Docker with Sysbox, because the + Docker daemon makes `/var/lib/docker` accessible by the host's + root user only. -* The container's mount sources on the host should also be in a - directory only accessible to the host's root user. +- The container's mount sources on the host should also be in a + directory only accessible to the host's root user. - - This is always the case when using Docker volume and tmpfs mounts, - since the mount source is also under `/var/lib/docker`. 
+ - This is always the case when using Docker volume and tmpfs mounts, + since the mount source is also under `/var/lib/docker`. - - For bind mounts however this is not guaranteed because the user - chooses the bind mount source. Thus, the user performing the bind - mount should explictly ensure this or take alternative precautions - as described below. + - For bind mounts however this is not guaranteed because the user + chooses the bind mount source. Thus, the user performing the bind + mount should explicitly ensure this or take alternative precautions + as described below. For cases where the mount source (e.g., a bind mount source) is not in a directory accessible by the root user only, an alternative @@ -249,6 +310,23 @@ attribute as described above, a user can ensure that no user on the host can execute files within the mount-source directory even after the container is stopped. +### Shiftfs Functional Limitations + +The Ubuntu shiftfs module is very recent and therefore has some +functional limitations at this time. + +One such limitation is that overlayfs can't be mounted on top of shiftfs. + +This implies that when a system container is using [exclusive userns-remap mode](usage.md#exclusive-userns-remap-mode), +applications running inside the system container that use overlayfs +mounts may not work properly. + +Note that one such application is Docker, which mounts overlayfs over +portions of its `/var/lib/docker` directory. For this specific case +however, Sysbox sets up the system container such that the limitation +above is worked around, allowing Docker to operate properly within the +system container. + ## Procfs Virtualization Sysbox performs partial virtualization of the system container's @@ -263,16 +341,15 @@ The main goals for this are: Currently, Sysbox does virtualization of the following procfs resources: -* `/proc/uptime` - - - Shows the uptime of the system container, not the host.
+- `/proc/uptime` -* `/proc/sys/net/netfilter/nf_conntrack_max` + - Shows the uptime of the system container, not the host. - - Sysbox emulates this resource independently per system - container, and sets appropriate values in the host kernel's - `nf_conntrack_max`. +- `/proc/sys/net/netfilter/nf_conntrack_max` + - Sysbox emulates this resource independently per system + container, and sets appropriate values in the host kernel's + `nf_conntrack_max`. Note also that by virtue of enabling the Linux user namespace in all system containers, kernel resources under `/proc/sys` that are not @@ -283,7 +360,8 @@ system-wide settings. ## OCI compatibility -Sysbox is mostly (but not 100%) compatible with the [OCI runtime spec](https://github.com/opencontainers/runtime-spec). +Sysbox is a fork of the [OCI runc](https://github.com/opencontainers/runc). It is mostly +(but not 100%) compatible with the OCI runtime specification. The incompatibilities arise from Nestybox's desire to make deployment of system containers possible with Docker. @@ -298,18 +376,18 @@ Here is a list of OCI runtime incompatibilities: Sysbox requires that the system container's `config.json` file have a namespace array field with at least the following namespaces: -* pid -* ipc -* uts -* mount -* network +- pid +- ipc +- uts +- mount +- network This is normally the case for Docker containers. Sysbox adds the following namespaces to all system containers: -* user -* cgroup +- user +- cgroup ### Process Capabilities @@ -322,9 +400,9 @@ Sysbox always mounts `/proc/sys` read-write inside the system container. Note that by virtue of enabling the Linux user namespace, only -namespaced resources under `/proc/sys` will be writeable from within +namespaced resources under `/proc/sys` will be writable from within the system container. 
Non-namespaced resources (e.g., those under -`/proc/sys/kernel`) won't be writeable from within the system container, +`/proc/sys/kernel`) won't be writable from within the system container, unless they are virtualized by Sysbox (see [Procfs Virtualization](#procfs-virtualization)). ### Cgroupfs Mount @@ -356,26 +434,41 @@ Sysbox currently ignores the Docker AppArmor profile, as it's too restrictive (e.g., prevents mounts inside the container, prevents write access to `/proc/sys`, etc.) -### Read-only Paths +### Read-only and Masked Paths Sysbox honors read-only paths in the system container's -`config.json`, with the exception of `/proc`. - -### Masked paths +`config.json`, with the exception of paths at or under `/proc` +or under `/sys`. -Sysbox honors masked paths in the system container's `config.json`, -with the exception of `/proc`. +The same applies to masked paths. ### Mounts Sysbox honors the mounts specified in the system container's `config.json` -file. However, it adds the following mounts to the system container: +file, with a few exceptions such as: -* Read-only bind mount of the host's `/lib/modules/` - into a corresponding path within the system container. +- Mounts into the system container's `/var/lib/docker` when Sysbox + is configured in exclusive userns-remap mode (its default + operating mode). + +- Mounts into the system container's `/proc` and `/sys`. + +In addition, Sysbox adds the following mounts to the system container: + +- Read-only bind mount of the host's `/lib/modules/` + into a corresponding path within the system container. + +- For system containers whose init process is Systemd, Sysbox mounts + tmpfs inside the following directories in the system container: + `/run`, `/run/lock`, `/tmp`. + +- Select mounts under the system container's `/sys` and `/proc` + directories. ## Sysbox Nesting Sysbox must run at the host level; it does not support running inside a system container.
This implies that we don't support running a system container inside a system container at this time. + +[cgroup-namespaces]: http://man7.org/linux/man-pages/man7/cgroup_namespaces.7.html diff --git a/docs/issue-guidelines.md b/docs/issue-guidelines.md index 335c322..c80cdce 100644 --- a/docs/issue-guidelines.md +++ b/docs/issue-guidelines.md @@ -1,5 +1,4 @@ -Guidelines for Filing Issues -============================ +# Guidelines for Filing Issues Issues can be filed [here](https://github.com/nestybox/sysbox-external/issues) @@ -7,14 +6,13 @@ Please follow these guidelines when filing issues with Sysbox. 1) Create an issue with one of the following labels: -* `Bug`: for functional defects, performance issues, etc. +- `Bug`: for functional defects, performance issues, etc. -* `Documentation`: documentation errors or improvements +- `Documentation`: documentation errors or improvements -* `Enhancement`: Feature requests - -* `Question`: for questions related to usage, design, etc. +- `Enhancement`: Feature requests +- `Question`: for questions related to usage, design, etc. 2) Add a label corresponding to the Sysbox release (e.g. `v0.1.0`) diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 0000000..6bb18ca --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,1157 @@ +# Sysbox Quick Start Guide + +This document is a quick guide showing how to deploy system containers +with Sysbox and take advantage of their features. + +The document is primarily composed of examples, with quick explanatory +text in between. For a more thorough explanation, refer to the +[Sysbox User's Guide](usage.md). 
+ +## Contents + +- [Sysbox Installation](#sysbox-installation) +- [Deploy a System Container](#deploy-a-system-container) +- [Nestybox Image Repository](#nestybox-image-repository) +- [Deploy a System Container with Docker inside](#deploy-a-system-container-with-docker-inside) + - [Inner and Outer Containers](#inner-and-outer-containers) +- [Deploy a System Container with Systemd inside](#deploy-a-system-container-with-systemd-inside) +- [Deploy a System Container with Systemd, sshd, and Docker inside](#deploy-a-system-container-with-systemd-sshd-and-docker-inside) +- [Deploy a System Container with Supervisord and Docker inside](#deploy-a-system-container-with-supervisord-and-docker-inside) +- [Building A System Container That Includes Inner Container Images](#building-a-system-container-that-includes-inner-container-images) +- [Committing A System Container That Includes Inner Container Images](#committing-a-system-container-that-includes-inner-container-images) +- [Persistence of Inner Container Images with Docker Volumes](#persistence-of-inner-container-images-with-docker-volumes) +- [Persistence of Inner Container Images with Bind Mounts](#persistence-of-inner-container-images-with-bind-mounts) +- [Sharing Storage Among System Containers](#sharing-storage-among-system-containers) +- [System Container Isolation Features](#system-container-isolation-features) +- [Sysbox Uninstallation](#sysbox-uninstallation) +- [Further reading](#further-reading) + +## Sysbox Installation + +Refer to the [Sysbox README file](../README.md) for the supported +Linux distros, host requirements, and installation instructions. 
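Before deploying anything, it can help to confirm that Docker on the host knows about the Sysbox runtime. The sketch below is only an illustration and not part of Sysbox: `has_runtime` is a hypothetical helper that scans the JSON printed by `docker info --format '{{json .Runtimes}}'` for a runtime name, and it assumes the installer registered the runtime as `sysbox-runc` (the name passed to `--runtime` throughout this guide).

```shell
#!/bin/sh
# Sketch: report whether a runtime name appears as a key in the JSON
# printed by `docker info --format '{{json .Runtimes}}'`.
# has_runtime is a hypothetical helper, not a Docker or Sysbox command.
has_runtime() {
    # $1 = runtime name to look for; the JSON is read from stdin.
    # Match the name as a quoted JSON key followed by a colon, so that
    # a path value ending in the same name does not count as a match.
    grep -q "\"$1\"[[:space:]]*:"
}

# Usage on a host with Docker installed (not executed by this sketch):
#   docker info --format '{{json .Runtimes}}' | has_runtime sysbox-runc \
#       && echo "sysbox-runc is registered with Docker"
```

If the check fails, the installation instructions in the README are the place to look.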
+ +## Deploy a System Container + +The easiest way is to use Docker; simply add the `--runtime` flag to `docker run`: + +```console +$ docker run --runtime=sysbox-runc --rm -it --hostname syscont debian:latest +root@syscont:/# +``` + +Inside the system container you can now deploy system-level software +(e.g., Docker, systemd) just as you would on a real host or VM. + +Later sections in this document show examples of this. + +It's also possible to deploy a system container without Docker, using +the `sysbox-runc` command. See [this section](usage.md#running-system-containers-with-sysbox) +in the Sysbox User's Guide for more info. + +## Nestybox Image Repository + +The [Nestybox Docker Hub](https://hub.docker.com/u/nestybox) site has several reference system container images. + +The Dockerfiles for those can be found [here](../dockerfiles). + +Feel free to source the Nestybox sample images from your own Dockerfile, +or make a copy of a Nestybox Dockerfile and modify it per your needs. +Instructions for doing so are [here](../dockerfiles/README.md). + +## Deploy a System Container with Docker inside + +We will use a system container image that has Alpine + Docker inside. It's called +`nestybox/alpine-docker` and it's in the Nestybox DockerHub public repo. The +Dockerfile is [here](../dockerfiles/alpine-docker/Dockerfile). + +```console +$ docker run --runtime=sysbox-runc -it --hostname=syscont nestybox/alpine-docker:latest + +/ # which docker +/usr/bin/docker + +/ # dockerd > /var/log/dockerd.log 2>&1 & + +/ # tail /var/log/dockerd.log +time="2019-10-23T20:48:51.960846074Z" level=warning msg="Your kernel does not support cgroup rt runtime" +time="2019-10-23T20:48:51.960860148Z" level=warning msg="Your kernel does not support cgroup blkio weight" +time="2019-10-23T20:48:51.960872060Z" level=warning msg="Your kernel does not support cgroup blkio weight_device" +time="2019-10-23T20:48:52.146157113Z" level=info msg="Loading containers: start." 
+time="2019-10-23T20:48:52.235036055Z" level=info msg="Default bridge (docker0) is assigned with an IP address 172.18.0.0/16. Daemon option --bip can be used to set a preferred IP address" +time="2019-10-23T20:48:52.324207525Z" level=info msg="Loading containers: done." +time="2019-10-23T20:48:52.476235437Z" level=warning msg="Not using native diff for overlay2, this may cause degraded performance for building images: failed to set opaque flag on middle layer: operation not permitted" storage-driver=overlay2 +time="2019-10-23T20:48:52.476418516Z" level=info msg="Docker daemon" commit=0dd43dd87fd530113bf44c9bba9ad8b20ce4637f graphdriver(s)=overlay2 version=18.09.8-ce +time="2019-10-23T20:48:52.476533826Z" level=info msg="Daemon has completed initialization" +time="2019-10-23T20:48:52.489489309Z" level=info msg="API listen on /var/run/docker.sock" + +/ # docker ps +CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES + +/ # docker run -it busybox +Unable to find image 'busybox:latest' locally +latest: Pulling from library/busybox +7c9d20b9b6cd: Pull complete +Digest: sha256:fe301db49df08c384001ed752dff6d52b4305a73a7f608f21528048e8a08b51e +Status: Downloaded newer image for busybox:latest +/ # +``` + +As shown, Docker runs normally inside the secure system container and we +can deploy an inner container (busybox) without problem. + +The Sysbox runtime allows you to do this without resorting to an +insecure Docker privileged container. + +### Inner and Outer Containers + +When launching Docker inside a system container, terminology can +quickly get confusing due to container nesting. + +To prevent confusion we refer to the containers as the "outer" and +"inner" containers. + +- The outer container is a system container, created at the host + level; it's launched with Docker + Sysbox.
+ +- The inner container is an application container, created within the + outer container (i.e., it's created by the Docker instance running + inside the system container (aka the inner Docker)). + +## Deploy a System Container with Systemd inside + +Deploying systemd inside a system container is useful when you plan to +run multiple services inside the system container, or when you want to +use it as a virtual host environment. + +We will use a system container image that has Ubuntu Bionic + Systemd +inside. It's called `nestybox/ubuntu-bionic-systemd` and it's in the +Nestybox DockerHub public repo. The Dockerfile is [here](../dockerfiles/ubuntu-bionic-systemd/Dockerfile). + +```console +$ docker run --runtime=sysbox-runc --rm -it --hostname=syscont nestybox/ubuntu-bionic-systemd +systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) +Detected virtualization container-other. +Detected architecture x86-64. + +Welcome to Ubuntu 18.04.3 LTS! + +Set hostname to . +Failed to read AF_UNIX datagram queue length, ignoring: No such file or directory +Failed to install release agent, ignoring: No such file or directory +File /lib/systemd/system/systemd-journald.service:35 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling. +Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.) +[ OK ] Reached target Swap. + +... + +[ OK ] Reached target Login Prompts. +[ OK ] Started Login Service. +[ OK ] Reached target Multi-User System. +[ OK ] Reached target Graphical Interface. + Starting Update UTMP about System Runlevel Changes... +[ OK ] Started Update UTMP about System Runlevel Changes. 
+ +Ubuntu 18.04.3 LTS syscont console + +syscont login: +``` + +In the system container image we are using, we've configured the +default console login and password to be `admin/admin` (you can always +change this in the image's Dockerfile). Let's login with these +credentials: + +```console +syscont login: admin +Password: +Welcome to Ubuntu 18.04.3 LTS (GNU/Linux 5.0.0-31-generic x86_64) + + * Documentation: https://help.ubuntu.com + * Management: https://landscape.canonical.com + * Support: https://ubuntu.com/advantage +This system has been minimized by removing packages and content that are +not required on a system that users do not log into. + +To restore this content, you can run the 'unminimize' command. + +The programs included with the Ubuntu system are free software; +the exact distribution terms for each program are described in the +individual files in /usr/share/doc/*/copyright. + +Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by +applicable law. + +To run a command as administrator (user "root"), use "sudo ". +See "man sudo_root" for details. + +admin@syscont:~$ +``` + +And verify systemd is running correctly: + +```console +admin@syscont:~$ ps -fu root +UID PID PPID C STIME TTY TIME CMD +root 1 0 0 23:41 ? 00:00:00 /sbin/init +root 252 1 0 23:41 ? 00:00:00 /lib/systemd/systemd-journald +root 685 1 0 23:41 ? 00:00:00 /lib/systemd/systemd-logind +root 725 1 0 23:41 pts/0 00:00:00 /bin/login -p -- + +admin@syscont:~$ systemctl +UNIT LOAD ACTIVE SUB DESCRIPTION +-.mount loaded active mounted Root Mount +dev-full.mount loaded active mounted /dev/full +dev-kmsg.mount loaded active mounted /dev/kmsg +dev-mqueue.mount loaded active mounted POSIX Message Queue File System +dev-null.mount loaded active mounted /dev/null +dev-random.mount loaded active mounted /dev/random + +... 
+ +timers.target loaded active active Timers +apt-daily-upgrade.timer loaded active waiting Daily apt upgrade and clean activities +apt-daily.timer loaded active waiting Daily apt download activities +motd-news.timer loaded active waiting Message of the Day +systemd-tmpfiles-clean.timer loaded active waiting Daily Cleanup of Temporary Directories + +LOAD = Reflects whether the unit definition was properly loaded. +ACTIVE = The high-level unit activation state, i.e. generalization of SUB. +SUB = The low-level unit activation state, values depend on unit type. + +74 loaded units listed. Pass --all to see loaded but inactive units, too. +To show all installed unit files use 'systemctl list-unit-files'. +``` + +To exit the system container, you can break from it (by pressing +`ctrl-p ctrl-q`). You can then stop the system container by using the +`docker stop` command from the host. + +Alternatively, from another shell type: + +```console +$ docker ps +CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES +3236bcdd2313 nestybox/ubuntu-bionic-systemd "/sbin/init" 23 minutes ago Up 23 minutes zen_blackburn + +$ docker stop zen_blackburn +zen_blackburn +``` + +And back in the shell where the system container is running, you'll +see systemd shutting down all services in the system container: + +```console +[ OK ] Removed slice system-getty.slice. +[ OK ] Stopped target Host and Network Name Lookups. + Stopping Network Name Resolution... +[ OK ] Stopped target Graphical Interface. +[ OK ] Stopped target Multi-User System. + +... + +[ OK ] Reached target Shutdown. +[ OK ] Reached target Final Step. + Starting Halt... +``` + +## Deploy a System Container with Systemd, sshd, and Docker inside + +Earlier we showed an example of deploying a system container that has +Docker inside. In that earlier example we did not have Systemd (or any +other process manager) in the container, so we had to manually start +Docker inside the container. 
+ +This example improves on this by deploying a system container that +has both Systemd and Docker inside. You'll see that as soon as the +system container is started, the Docker daemon inside the system container +is ready to be used. + +Further, we've also added an SSH daemon into the system container image, +so that you can log in remotely into it, just as you would on a physical +host or VM. + +We will use a system container image called `nestybox/ubuntu-bionic-systemd-docker:latest` which is in the +Nestybox DockerHub public repo. The Dockerfile is [here](../dockerfiles/ubuntu-bionic-systemd-docker/Dockerfile). + +```console +$ docker run --runtime=sysbox-runc -it --rm -P --hostname=syscont nestybox/ubuntu-bionic-systemd-docker:latest +systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid) +Detected virtualization container-other. +Detected architecture x86-64. + +Welcome to Ubuntu 18.04.3 LTS! + +Set hostname to . + +... + +[ OK ] Started Docker Application Container Engine. +[ OK ] Reached target Multi-User System. +[ OK ] Reached target Graphical Interface. + Starting Update UTMP about System Runlevel Changes... +[ OK ] Started Update UTMP about System Runlevel Changes. + +Ubuntu 18.04.3 LTS syscont console + +syscont login: +``` + +In the system container image we are using, we've configured the +default console login and password to be `admin/admin`. You can always +change this in the image's Dockerfile. + +```console +syscont login: admin +Password: +Welcome to Ubuntu 18.04.3 LTS (GNU/Linux 5.0.0-31-generic x86_64) + + * Documentation: https://help.ubuntu.com + * Management: https://landscape.canonical.com + * Support: https://ubuntu.com/advantage +This system has been minimized by removing packages and content that are +not required on a system that users do not log into.
+ +To restore this content, you can run the 'unminimize' command. + +The programs included with the Ubuntu system are free software; +the exact distribution terms for each program are described in the +individual files in /usr/share/doc/*/copyright. + +Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by +applicable law. + +To run a command as administrator (user "root"), use "sudo ". +See "man sudo_root" for details. + +admin@syscont:~$ +``` + +Now verify that Docker is running inside the system container: + +```console +admin@syscont:~$ systemctl status docker.service +● docker.service - Docker Application Container Engine + Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled) + Active: active (running) since Thu 2019-10-24 00:33:09 UTC; 8s ago + Docs: https://docs.docker.com + Main PID: 715 (dockerd) + Tasks: 12 + CGroup: /system.slice/docker.service + └─715 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock + +admin@syscont:~$ docker ps +CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES +``` + +And run an inner container: + +```console +admin@syscont:~$ docker run -it busybox +Unable to find image 'busybox:latest' locally +latest: Pulling from library/busybox +7c9d20b9b6cd: Pull complete +Digest: sha256:fe301db49df08c384001ed752dff6d52b4305a73a7f608f21528048e8a08b51e +Status: Downloaded newer image for busybox:latest +/ # +``` + +Now let's ssh into the system container. In order to do this, we need +the host's IP address as well as the host port that is mapped to the +system container's sshd port. + +In my case, the host's IP address is 10.0.0.230. The ssh daemon is +listening on port 22 in the system container, which is mapped to some +arbitrary port on the host machine. + +Let's find out what that arbitrary port is. 
From the host, type: + +```console +$ docker ps +CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES +e22773df703e nestybox/ubuntu-bionic-systemd-docker:latest "/sbin/init" 16 seconds ago Up 15 seconds 0.0.0.0:32770->22/tcp sad_kepler +``` + +Now let's ssh into the system container from a different machine: + +```console +$ ssh admin@10.0.0.230 -p 32770 + +The authenticity of host '[10.0.0.230]:32770 ([10.0.0.230]:32770)' can't be established. +ECDSA key fingerprint is SHA256:VNHrxvsHp4aJYH/DQjvBMdeoF0HBP2yKtWc815WtnnI. +Are you sure you want to continue connecting (yes/no)? yes +Warning: Permanently added '[10.0.0.230]:32770' (ECDSA) to the list of known hosts. +admin@10.0.0.230's password: +Last login: Thu Oct 24 03:47:39 2019 +To run a command as administrator (user "root"), use "sudo ". +See "man sudo_root" for details. + +admin@syscont:~$ docker ps +CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES +``` + +The ssh worked without problem. + +This is cool because you now have a system container that is acting +like a virtual host with Systemd and sshd. Plus it has Docker inside +so you can deploy application containers in complete isolation from +the underlying host. + +## Deploy a System Container with Supervisord and Docker inside + +Systemd is great but may be too heavy-weight for some use cases. + +A good alternative is to use supervisord as a lightweight process +manager inside a system container. + +We will use a system container image called `nestybox/alpine-supervisord-docker:latest`, which is in the +Nestybox DockerHub public repo. The Dockerfile, supervisord.conf, and docker-entrypoint.sh files +can be found [here](../dockerfiles/alpine-supervisord-docker). + +```console +$ docker run --runtime=sysbox-runc -d --rm -P --hostname=syscont nestybox/alpine-supervisord-docker:latest +f3b90976ad0550fc8142568d988c8fa65c54864d04c1637e88323a32f87cf3af +``` + +Let's check that all services inside the system container were started +correctly.
From the host, type: + +```console +$ docker ps +CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES +f3b90976ad05 nestybox/alpine-supervisord-docker:latest "/usr/bin/docker-ent…" 2 seconds ago Up 1 second 0.0.0.0:32776->22/tcp sleepy_shamir + +$ docker exec -it sleepy_shamir ps +PID USER TIME COMMAND + 1 root 0:00 {supervisord} /usr/bin/python2 /usr/bin/supervisord -n + 7 root 0:00 /usr/sbin/sshd -D + 8 root 0:02 /usr/bin/dockerd + 36 root 0:03 containerd --config /var/run/docker/containerd/containerd. + 980 root 0:00 ps +``` + +As shown, supervisord is running as the init process and has spawned +sshd and dockerd. Cool. + +Now let's ssh into the system container. In this example the host +machine is at IP 10.0.0.230, and the system container's ssh port is +mapped to host port 32776 as indicated by the `docker ps` output +above. The login is `root:root` as configured in the image's Dockerfile. + +```console +$ ssh root@10.0.0.230 -p 32776 +The authenticity of host '[10.0.0.230]:32776 ([10.0.0.230]:32776)' can't be established. +RSA key fingerprint is SHA256:/p++Ju2yo5SF1obEV4TeI+Fq6Q2DBErdboO287aSNp0. +Are you sure you want to continue connecting (yes/no)? yes +Warning: Permanently added '[10.0.0.230]:32776' (RSA) to the list of known hosts. +root@10.0.0.230's password: +Welcome to Alpine! + +The Alpine Wiki contains a large amount of how-to guides and general +information about administrating Alpine systems. +See . + +You can setup the system with the command: setup-alpine + +You may change this message by editing /etc/motd.
+syscont:~# +``` + +Now run a Docker container inside the system container: + +```console +syscont:~# docker ps +CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES +syscont:~# docker run -it busybox +Unable to find image 'busybox:latest' locally +latest: Pulling from library/busybox +7c9d20b9b6cd: Pull complete +Digest: sha256:fe301db49df08c384001ed752dff6d52b4305a73a7f608f21528048e8a08b51e +Status: Downloaded newer image for busybox:latest +/ # syscont:~# +``` + +## Building A System Container That Includes Inner Container Images + +One of the novel features of Sysbox is that it allows you to use Docker +to create a system container image that includes one or more inner Docker +container images within it. + +This way, you can deploy the system container image and deploy its +inner containers without having to pull the inner container images +from the network. + +There are two ways to create such a system container image: +via the `docker build` command or via the `docker commit` command. + +This section shows an example with `docker build`. The [next section](#committing-a-system-container-that-includes-inner-container-images) +shows an example with `docker commit`. + +In order to use the `docker build` command, we must first reconfigure +the host's Docker daemon to use the `sysbox-runc` runtime as its +default runtime. + +This is needed because the Docker build process must call into the +Sysbox runtime and unfortunately the `docker build` command does not +currently take a `--runtime` option (unlike `docker run`). + +To reconfigure the Docker daemon, we edit its +`/etc/docker/daemon.json` file. Here is what the +`/etc/docker/daemon.json` file should look like.
Notice the line at +the end of the file: + +```console +# more /etc/docker/daemon.json +{ + "runtimes": { + "sysbox-runc": { + "path": "/usr/local/sbin/sysbox-runc" + } + }, + "default-runtime": "sysbox-runc" +} +``` + +We then restart the Docker daemon service: + +```console +$ systemctl restart docker.service +``` + +Once we've configured sysbox-runc as the Docker default runtime, we +can now build the system container image. + +Here is a sample Dockerfile: + +```dockerfile +FROM nestybox/alpine-docker + +COPY docker-pull.sh /usr/bin +RUN chmod +x /usr/bin/docker-pull.sh && docker-pull.sh && rm /usr/bin/docker-pull.sh +``` + +This Dockerfile inherits from the `nestybox/alpine-docker` base image which simply contains +Alpine plus a Docker daemon (the Dockerfile is [here](../dockerfiles/alpine-docker/Dockerfile)). + +The presence of the Docker daemon in the base image is required since +we will use it to pull the inner container images. + +The key instruction in the Dockerfile shown above is the `RUN` +instruction. Notice that it's copying a script called `docker-pull.sh` +into the system container, executing it, and removing it. + +The `docker-pull.sh` script is shown below. + +```bash +#!/bin/sh + +# dockerd start +dockerd > /var/log/dockerd.log 2>&1 & +sleep 2 + +# pull inner images +docker pull busybox:latest +docker pull alpine:latest + +# dockerd cleanup (remove the .pid file as otherwise it prevents +# dockerd from launching correctly inside sys container) +kill $(cat /var/run/docker.pid) +kill $(cat /run/docker/containerd/containerd.pid) +rm -f /var/run/docker.pid +rm -f /run/docker/containerd/containerd.pid +``` + +As shown, the script simply runs Docker inside the system container, +pulls the inner container images (in this case the busybox and alpine +images), and does some cleanup. Pretty simple. + +The reason we need this script in the first place is because it's hard +to put all of these commands into a single Dockerfile `RUN` +instruction. 
It's simpler to put them in a separate script and call +it from the `RUN` instruction. + +Let's see what happens when we execute `docker build` on this Dockerfile +to build the system container image: + +```console +$ docker build -t nestybox/syscont-with-inner-containers:latest . + +Sending build context to Docker daemon 3.072kB +Step 1/3 : FROM nestybox/alpine-docker + ---> b51716d05554 +Step 2/3 : COPY docker-pull.sh /usr/bin + ---> Using cache + ---> df2af1f26937 +Step 3/3 : RUN chmod +x /usr/bin/docker-pull.sh && docker-pull.sh && rm /usr/bin/docker-pull.sh + ---> Running in 7fa2687f2385 +latest: Pulling from library/busybox +7c9d20b9b6cd: Pulling fs layer +7c9d20b9b6cd: Verifying Checksum +7c9d20b9b6cd: Download complete +7c9d20b9b6cd: Pull complete +Digest: sha256:fe301db49df08c384001ed752dff6d52b4305a73a7f608f21528048e8a08b51e +Status: Downloaded newer image for busybox:latest +latest: Pulling from library/alpine +89d9c30c1d48: Pulling fs layer +89d9c30c1d48: Verifying Checksum +89d9c30c1d48: Download complete +89d9c30c1d48: Pull complete +Digest: sha256:c19173c5ada610a5989151111163d28a67368362762534d8a8121ce95cf2bd5a +Status: Downloaded newer image for alpine:latest +Removing intermediate container 7fa2687f2385 + ---> 9c33554fd4cf +Successfully built 9c33554fd4cf +Successfully tagged nestybox/syscont-with-inner-containers:latest +``` + +We can see from above that the Docker build process has pulled the +busybox and alpine container images and stored them inside the system +container image. Cool! + +Once the build is complete, we can optionally revert the +`default-runtime` config in the `/etc/docker/daemon.json` file we did +earlier (it's only needed for the Docker build, but not for running +the system container as we mentioned earlier). + +Before proceeding, it's a good idea to prune any dangling images +created during the Docker build process to save storage. 
+ +```console +$ docker image prune +``` + +Now, let's run the newly created system container image: + +```console +$ docker run --runtime=sysbox-runc -it --rm --hostname=syscont nestybox/syscont-with-inner-containers:latest +/ # +``` + +And let's start Docker inside the system container: + +```console +/ # dockerd > /var/log/dockerd.log 2>&1 & + +/ # docker image ls +REPOSITORY TAG IMAGE ID CREATED SIZE +alpine latest 965ea09ff2eb 2 days ago 5.55MB +busybox latest 19485c79a9bb 7 weeks ago 1.22MB +``` + +As shown, the inner container images are already there. They came +pre-packaged with the system container! + +This is cool because it allows us to create a pre-configured Docker +sandbox environment and capture them within a single, portable, +easy-to-deploy system container image. + +In the next section we show how to do something similar but using +the `docker commit` command. + +## Committing A System Container That Includes Inner Container Images + +The `docker commit` command allows users to create a new container image +which contains a snapshot of the contents of a running container (excepting +bind mounts). + +We can leverage this command to create a system container image that +includes images of inner containers. + +We do this by deploying a system container, using Docker inside the +system container to pull inner container images, and then doing a +`docker commit` of the running system container into a new system +container image. 
+ +First, let's deploy a system container, start dockerd within it, and +pull some images inside: + +```console +$ docker run --runtime=sysbox-runc -it --rm nestybox/alpine-docker + +/ # dockerd > /var/log/dockerd.log 2>&1 & + +/ # docker image ls +REPOSITORY TAG IMAGE ID CREATED SIZE + +/ # docker pull busybox +Using default tag: latest +latest: Pulling from library/busybox +7c9d20b9b6cd: Pull complete +Digest: sha256:fe301db49df08c384001ed752dff6d52b4305a73a7f608f21528048e8a08b51e +Status: Downloaded newer image for busybox:latest + +/ # docker pull alpine +Using default tag: latest +latest: Pulling from library/alpine +89d9c30c1d48: Pull complete +Digest: sha256:c19173c5ada610a5989151111163d28a67368362762534d8a8121ce95cf2bd5a +Status: Downloaded newer image for alpine:latest +``` + +Now, from the host, let's use Docker to "commit" the system container image (i.e., +to take a snapshot of its contents): + +```console +$ docker ps +CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES +31b9a7975749 nestybox/alpine-docker "/bin/sh" 54 seconds ago Up 52 seconds zen_mirzakhani + +$ docker commit zen_mirzakhani nestybox/syscont-with-inner-containers:latest +sha256:82686f19cd10d2830e9104f46cbc8fc4a7d12c248f7757619513ca2982ae8464 +``` + +A couple of restrictions apply here: + +- The `docker commit` instruction takes a `--pause` option which + is set to `true` by default. Do not set it to `false` when trying to + commit a system container image with inner containers. It won't work. + +- The `docker commit` instruction does not capture the contents of + volumes or bind mounts mounted into the system container (per + Docker's design). This means that for the above example to work, + we must not run the system container with a volume or bind mount + into `/var/lib/docker` (such mounts are useful for persistence of + inner container images, as described in this + [section](#persistence-of-inner-container-images-with-docker-volumes)). 
+ +The commit operation may take anywhere from a few seconds to a few +minutes, depending on how many changes were done in the container's +files since it was created. + +Let's run the newly committed system container image. If all is well, it +should contain the inner container images for busybox and alpine within it. + +```console +$ docker run --runtime=sysbox-runc -it --rm nestybox/syscont-with-inner-containers:latest + +/ # rm -f /var/run/docker.pid +/ # rm -f /run/docker/containerd/containerd.pid + +/ # dockerd > /var/log/dockerd.log 2>&1 & + +/ # docker image ls +REPOSITORY TAG IMAGE ID CREATED SIZE +alpine latest 965ea09ff2eb 3 days ago 5.55MB +busybox latest 19485c79a9bb 7 weeks ago 1.22MB +``` + +There they are! This is cool because it gives us a mechanism to take a +snapshot of a running system container's contents that includes inner +container images. It's helpful as a way of saving work or exporting a +working system container for deployment in another machine (i.e., +commit the image, docker push to a repo, and docker pull from another +machine). + +One final note: + +In the example above, we manually removed the `/var/run/docker.pid` and +`/run/docker/containerd/containerd.pid` files prior to starting the +Docker instance inside the committed system container. This was done +because the Docker commit captures the pid files of the Docker and +containerd instances running in the system container at the time the +commit occurred. 
If we don't remove these stale files, the Docker daemon +in the committed container may fail to start and report errors such +as: + +```console +Error starting daemon: pid file found, ensure docker is not running or delete /var/run/docker.pid +``` + +or + +```console +Failed to start containerd: timeout waiting for containerd to start +``` + +Such a failure does not occur when the system container has Systemd +inside, as the Systemd service scripts take care of ensuring the +Docker daemon starts correctly regardless of whether the docker.pid +file is present or not. + +## Persistence of Inner Container Images with Docker Volumes + +The Docker instance running inside a system container stores its +images in a cache located in the `/var/lib/docker` directory +inside the container. + +When the system container is removed (i.e., not just stopped, but +actually removed via `docker rm`), the contents of that directory will +also be removed. In other words, inner Docker's image cache is +destroyed when the associated system container is removed. + +It's possible to override this behavior by mounting host storage into +the system container's `/var/lib/docker` in order to persist the +inner Docker's image cache across system container life-cycles. + +To do this, follow these steps: + +1) Create a Docker volume on the host to serve as the persistent image cache for + the Docker daemon inside the system container. + +```console +$ docker volume create my-image-cache +my-image-cache + +$ docker volume list +DRIVER VOLUME NAME +local my-image-cache +``` + +2) Launch the system container and mount the volume into the system + container's `/var/lib/docker` directory. + +```console +$ docker run --runtime=sysbox-runc -it --rm --hostname syscont --mount source=my-image-cache,target=/var/lib/docker nestybox/alpine-docker +/ # +``` + +3) Start Docker inside the system container: + +```console +/ # dockerd > /var/log/dockerd.log 2>&1 & +``` + +4) Pull an inner container image (e.g. 
busybox): + +```console +/ # docker pull busybox +Using default tag: latest +latest: Pulling from library/busybox +7c9d20b9b6cd: Pull complete +Digest: sha256:fe301db49df08c384001ed752dff6d52b4305a73a7f608f21528048e8a08b51e +Status: Downloaded newer image for busybox:latest + +/ # docker image ls +REPOSITORY TAG IMAGE ID CREATED SIZE +busybox latest 19485c79a9bb 7 weeks ago 1.22MB +``` + +5) Exit the system container. Since it was started with the `--rm` + option, Docker will remove the system container from the system. + However, the contents of the system container's `/var/lib/docker` + will persist since they are stored in volume `my-image-cache`. + +6) Start a new system container and mount the `my-image-cache` volume: + +```console +$ docker run --runtime=sysbox-runc -it --rm --hostname syscont --mount source=my-image-cache,target=/var/lib/docker nestybox/alpine-docker + +/ # dockerd > /var/log/dockerd.log 2>&1 & + +/ # docker image ls +REPOSITORY TAG IMAGE ID CREATED SIZE +busybox latest 19485c79a9bb 7 weeks ago 1.22MB +``` + +As shown, the inner container images persist across the life-cycle of +the system container. This is cool because it means that a system +container can leverage an existing Docker image cache stored somewhere +on the host, and thus avoid having to pull inner Docker images from +the network each time a new system container is started. + +A warning though: a persistent Docker image cache must only be mounted +on a **single system container at any given time**. This is a +restriction imposed by the Docker daemon, which does not allow its +image cache to be shared concurrently among multiple daemon instances. +Sysbox will check for violations of this rule and report an +appropriate error during system container creation. + +## Persistence of Inner Container Images with Bind Mounts + +This section is similar to the prior one, but uses bind mounts instead +of Docker volumes when launching the system container. 
+ +The steps to do this are the following: + +1) Create a directory on the host to serve as the persistent image cache for + the Docker daemon inside the system container. + + As described in the [Sysbox User's Guide](usage.md#bind-mounts-in-exclusive-userns-remap-mode), + the directory should be owned by a user in the range [0:65535] and + will show up with those same user-IDs within the system + container. In this example we choose user-ID 0 (root) so that the + Docker instance inside the system container will see its + `/var/lib/docker` directory owned by `root:root` inside the system + container. + + For extra security, we will also set the permissions to 0700 as + recommended in the Sysbox User's Guide. + +```console +$ sudo mkdir /home/someuser/image-cache +$ sudo chmod 700 /home/someuser/image-cache +``` + +2) Launch the system container and bind-mount the newly created + directory into the system container's `/var/lib/docker` directory. + +```console +$ docker run --runtime=sysbox-runc -it --rm --hostname syscont --mount type=bind,source=/home/someuser/image-cache,target=/var/lib/docker nestybox/alpine-docker +/ # +``` + +3) Start Docker inside the system container and pull an image (e.g., busybox): + +```console +/ # dockerd > /var/log/dockerd.log 2>&1 & + +/ # docker pull busybox +Using default tag: latest +latest: Pulling from library/busybox +7c9d20b9b6cd: Pull complete +Digest: sha256:fe301db49df08c384001ed752dff6d52b4305a73a7f608f21528048e8a08b51e +Status: Downloaded newer image for busybox:latest + +/ # docker image ls +REPOSITORY TAG IMAGE ID CREATED SIZE +busybox latest 19485c79a9bb 7 weeks ago 1.22MB +``` + +4) Exit the system container.
+ +5) Start a new system container and bind-mount the `/home/someuser/image-cache` + directory as before: + +```console +$ docker run --runtime=sysbox-runc -it --rm --hostname syscont --mount type=bind,source=/home/someuser/image-cache,target=/var/lib/docker nestybox/alpine-docker +``` + +6) Start Docker inside the system container and verify that it sees + the images from the bind-mounted cache: + +```console +/ # dockerd > /var/log/dockerd.log 2>&1 & +/ # docker ps +CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES + +/ # docker image ls +REPOSITORY TAG IMAGE ID CREATED SIZE +busybox latest 19485c79a9bb 7 weeks ago 1.22MB +``` + +## Sharing Storage Among System Containers + +It's easy to share storage among multiple system containers by simply +bind-mounting the shared storage into each system container. + +However, this requires that the permissions on the bind-mounted +shared storage be set appropriately, which depends on the system +container isolation mode as described in the +[Sysbox User's Guide](usage.md#system-container-bind-mount-requirements). + +In this example we assume Sysbox operates in exclusive userns-remap +mode (its default isolation mode). + +1) Create the shared storage on the host. In this example we use + a Docker volume.
+ +```console +$ docker volume create shared-storage +shared-storage +``` + +2) Create a system container and mount the shared storage volume into it: + +```console +$ docker run --runtime=sysbox-runc -it --rm --hostname syscont --mount source=shared-storage,target=/mnt/shared-storage alpine:latest +/ # +``` + +3) From the system container, add a shared file to the shared storage: + +```console +/ # touch /mnt/shared-storage/shared-file +/ # ls -l /mnt/shared-storage/shared-file +-rw-r--r-- 1 root root 0 Oct 24 22:08 /mnt/shared-storage/shared-file +``` + +4) In another shell, create another system container and mount the shared storage volume into it: + +```console +$ docker run --runtime=sysbox-runc -it --rm --hostname syscont2 --mount source=shared-storage,target=/mnt/shared-storage alpine:latest +/ # +``` + +5) Confirm that the second system container sees the shared file: + +```console +/ # ls -l /mnt/shared-storage/shared-file +-rw-r--r-- 1 root root 0 Oct 24 22:08 /mnt/shared-storage/shared-file +``` + +Notice that both system containers see the shared file with +`root:root` permissions, even though each system container is using +the Linux user namespace with exclusive user-ID and group-ID mappings +for enhanced security. + +From the first system container: + +```console +/ # cat /proc/self/uid_map + 0 268600992 65536 +``` + +From the second system container: + +```console +/ # cat /proc/self/uid_map + 0 268666528 65536 +``` + +The reason both system containers see the correct `root:root` +ownership on the shared storage is through the magic of the Ubuntu +shiftfs filesystem, which Sysbox mounts over the shared storage. + +From the first system container: + +```console +/ # mount | grep shared-storage +/var/lib/docker/volumes/shared-storage/_data on /mnt/shared-storage type shiftfs (rw,relatime) +``` + +Finally, in the example above we used a Docker volume as the shared storage. However, +we could also use an arbitrary host directory as the shared storage. 
We simply need to +bind-mount it into the system containers, though we must follow the requirements +for bind-mounts described in the [Sysbox User's Guide](usage.md#system-container-bind-mount-requirements). + +## System Container Isolation Features + +In this section we will show the following system container isolation +features by way of example: + +- Linux user namespace + +- Exclusive user-ID and group-ID mappings per system container + +- Linux capabilities + +We assume Sysbox is configured in exclusive userns-remap mode (its +default operating mode). + +First let's deploy a system container: + +```console +$ docker run --runtime=sysbox-runc --rm -it --hostname syscont debian:latest +root@syscont:/# +``` + +Nestybox system containers always use the Linux user-namespace (in fact +they use all Linux namespaces) to provide strong isolation between the +system container and the host. + +Let's verify this by comparing the namespaces between a process inside +the system container and a process on the host.
+ +From the system container: + +```console +root@syscont:/# ls -l /proc/self/ns/ +total 0 +lrwxrwxrwx 1 root root 0 Oct 23 22:06 cgroup -> 'cgroup:[4026532563]' +lrwxrwxrwx 1 root root 0 Oct 23 22:06 ipc -> 'ipc:[4026532506]' +lrwxrwxrwx 1 root root 0 Oct 23 22:06 mnt -> 'mnt:[4026532504]' +lrwxrwxrwx 1 root root 0 Oct 23 22:06 net -> 'net:[4026532509]' +lrwxrwxrwx 1 root root 0 Oct 23 22:06 pid -> 'pid:[4026532507]' +lrwxrwxrwx 1 root root 0 Oct 23 22:06 pid_for_children -> 'pid:[4026532507]' +lrwxrwxrwx 1 root root 0 Oct 23 22:06 user -> 'user:[4026532503]' +lrwxrwxrwx 1 root root 0 Oct 23 22:06 uts -> 'uts:[4026532505]' +``` + +Now from the host: + +```console +$ ls -l /proc/self/ns +total 0 +lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 cgroup -> 'cgroup:[4026531835]' +lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 ipc -> 'ipc:[4026531839]' +lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 mnt -> 'mnt:[4026531840]' +lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 net -> 'net:[4026531992]' +lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 pid -> 'pid:[4026531836]' +lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 pid_for_children -> 'pid:[4026531836]' +lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 user -> 'user:[4026531837]' +lrwxrwxrwx 1 chino chino 0 Oct 23 22:07 uts -> 'uts:[4026531838]' +``` + +You can see the system container uses dedicated namespaces, including +the user and cgroup namespaces. It has no namespaces in common with +the host, which gives it stronger isolation compared to regular Docker +containers. + +In addition, by default Sysbox assigns each system container exclusive +user-ID and group-ID mappings. This further +isolates system containers from the host and from each other. + +```console +root@syscont:/# cat /proc/self/uid_map + 0 268994208 65536 +root@syscont:/# cat /proc/self/gid_map + 0 268994208 65536 +``` + +You can see the system container's users in the range [0:65535] are +mapped to a range of users on the host chosen by Sysbox.
In this +example they map to the host user-ID range [268994208 : +268994208+65535]. + +Now let's deploy another system container and check its user-ID +and group-ID map: + +```console +$ docker run --runtime=sysbox-runc --rm -it --hostname syscont2 debian:latest + +root@syscont2:/# cat /proc/self/uid_map + 0 269059744 65536 +root@syscont2:/# cat /proc/self/gid_map + 0 269059744 65536 +``` + +Notice how Sysbox chose different user-ID and group-ID mappings for +this new system container. This provides isolation from the host as +well as from other system containers. More info on this can be found +in the [design guide](design.md#exclusive-user-namespace-mappings). + +Now, let's check the capabilities of a root process inside the +system container: + +```console +root@syscont:/# grep Cap /proc/self/status +CapInh: 0000003fffffffff +CapPrm: 0000003fffffffff +CapEff: 0000003fffffffff +CapBnd: 0000003fffffffff +CapAmb: 0000003fffffffff +``` + +As shown, a root process inside the system container has all +capabilities enabled. However, those capabilities only take effect +with respect to host resources assigned to the system container +(courtesy of the Linux user namespace). In fact, a system container +process has no capabilities outside of the Linux user-namespace +associated with the system container, providing further isolation from +the host. + +Contrast this to a regular Docker container. A root process in such a +container has a reduced set of capabilities (typically `CapEff: +00000000a80425fb`) and does not use the Linux user namespace. This has +two drawbacks: + +1) The container's root process is limited in what it can do within the container. + +2) The container's root process has those same capabilities on the host, which + poses a higher security risk should the process escape + the container's chroot jail. + +System containers overcome both of these drawbacks.
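
As a rough illustration of the user-ID mappings shown earlier in this section, the sketch below translates a container user-ID into the corresponding host user-ID using a single `uid_map` entry. The `map_uid` helper is a hypothetical name used only for this example (it is not part of Sysbox), and the sketch assumes the map holds a single entry, as in the `uid_map` outputs above.

```bash
#!/bin/sh

# map_uid is a hypothetical helper for illustration only (not part of Sysbox).
# It translates a container user-ID into the matching host user-ID, given one
# uid_map entry of the form "<container-start> <host-start> <length>".
map_uid() {
    uid=$1
    # Split the uid_map entry into its three fields: $1, $2 and $3.
    set -- $2
    if [ "$uid" -ge "$1" ] && [ "$uid" -lt $(( $1 + $3 )) ]; then
        echo $(( $2 + $uid - $1 ))
    else
        echo "unmapped"
    fi
}

# Entry taken from the first system container's /proc/self/uid_map output.
entry="0 268994208 65536"

map_uid 0 "$entry"       # prints 268994208
map_uid 65535 "$entry"   # prints 269059743
map_uid 65536 "$entry"   # prints unmapped
```

The last host user-ID in the first container's range is 269059743, so the next range can begin at 269059744, which is the value seen in the second container's `uid_map`.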
+ +## Sysbox Uninstallation + +Refer to the [Sysbox README file](../README.md) for the uninstallation +instructions. + +## Further reading + +Refer to the [Sysbox User's Guide](usage.md) for details on the Sysbox +features shown in the above examples. + +Also, the [Nestybox blog site](https://blog.nestybox.com) has more examples +on how to use system containers. diff --git a/docs/security.md b/docs/security.md new file mode 100644 index 0000000..7a5d9a1 --- /dev/null +++ b/docs/security.md @@ -0,0 +1,201 @@ +# Sysbox Security Guide + +This document describes security aspects of Sysbox system containers. + +**Note:** it's early days for Nestybox and while our system containers +already incorporate important security features such as the use of the Linux user +namespace and exclusive user-ID mappings per container, other security +aspects need further development. The text below describes what +security features are currently in place and where more work is needed. + +## Contents + +- [System Container Security](#system-container-security) + - [Root Filesystem Jail](#root-filesystem-jail) + - [Linux Namespaces](#linux-namespaces) + - [Exclusive User Namespace Mappings](#exclusive-user-namespace-mappings) + - [Procfs](#procfs) + - [Sysfs](#sysfs) + - [Process Capabilities](#process-capabilities) + - [System Calls](#system-calls) + - [AppArmor](#apparmor) + - [Devices](#devices) + - [Resource Limiting & Cgroups](#resource-limiting--cgroups) + - [Host PID or Network Sharing](#host-pid-or-network-sharing) + - [Privileged Container Support](#privileged-container-support) + +## System Container Security + +### Root Filesystem Jail + +System container processes are confined to the directory hierarchy +associated with the container's root filesystem, plus any +configured mounts (e.g., Docker volumes or bind-mounts). + +### Linux Namespaces + +System containers always use **all** Linux namespaces, including the +user-namespace, as described [here](design.md#linux-namespace-usage).
+ +This provides strong isolation from the underlying host and from other +containers. + +### Exclusive User Namespace Mappings + +By default, system containers use exclusive user-namespace user-ID and +group-ID mappings per container. This enhances container-to-host and +container-to-container isolation. + +See [here](design.md#exclusive-user-namespace-mappings) for further +details. + +### Procfs + +The system container's procfs (i.e., `/proc`) is mounted read-write, +but protected by the Linux user-namespace which ensures that only +resources assigned to the system container are accessible via `/proc`. + +Having said that, there are several improvements that Sysbox needs in +this area to ensure information about system resources outside of the +system container is not exposed inside the container, ensuring +unmounts/re-mounts work, etc. + +### Sysfs + +The system container's sysfs (i.e., `/sys`) is mounted read-only, +with the exception of `/sys/fs/cgroup`. + +This ensures system container processes can't modify system-level +controls exposed via `/sys`. + +Having said that, this is also an area where improvements are needed +in Sysbox to ensure sensitive information is not exposed, +unmounts/re-mounts work correctly, etc. + +The `/sys/fs/cgroup` directory is mounted read-write to allow system +container processes to assign cgroup resources. Sysbox sets up the +system container in such a way that processes inside the system +container can't modify cgroup resources assigned to the system +container itself. In addition, system container processes can only use +cgroups to assign a subset of the system container resources. + +### Process Capabilities + +A system container's init process configured with user-ID 0 (root) +always starts with all capabilities enabled. 
+ + $ docker run --runtime=sysbox-runc -it alpine:latest + / # grep -i cap /proc/self/status + CapInh: 0000003fffffffff + CapPrm: 0000003fffffffff + CapEff: 0000003fffffffff + CapBnd: 0000003fffffffff + CapAmb: 0000003fffffffff + +Note that the system container's Linux user-namespace ensures that +these capabilities are only applicable to resources assigned to the +system container itself. It does not mean that the process has full +capabilities on the host (in fact it has no capabilities on resources +not assigned to the system container). + +A system container's init process configured with a non-zero user-ID +starts with the capabilities passed to Sysbox by the container engine. + +For example, when deploying system containers with Docker: + + $ docker run --runtime=sysbox-runc --user 1000 -it alpine:latest + / $ grep -i cap /proc/self/status + CapInh: 00000000a80425fb + CapPrm: 0000000000000000 + CapEff: 0000000000000000 + CapBnd: 00000000a80425fb + CapAmb: 0000000000000000 + +### System Calls + +Nestybox system containers allow a minimum set of 300+ syscalls, using +Linux seccomp. + +Significant syscalls blocked within system containers are the +same as those listed [in this Docker article](https://docs.docker.com/engine/security/seccomp/), +except that system containers allow these system calls too: + + mount + umount + umount2 + add_key + request_key + keyctl + pivot_root + gethostname + sethostname + setns + unshare + +It's currently not possible to reduce the set of syscalls allowed within +a system container (i.e., the Docker `--security-opt seccomp=` option +is not supported). + +### AppArmor + +Nestybox system containers do not yet support AppArmor for mandatory +access control. + +When using Docker to deploy a system container, the [default AppArmor profile](https://docs.docker.com/engine/security/apparmor/) +used by Docker is ignored as it's too restrictive for system containers +(i.e., the Docker `--security-opt apparmor=` option is not supported).
+ +In the near future we plan to develop a default profile for system +containers. + +### Devices + +The following devices are always present in the system container: + + /dev/null + /dev/zero + /dev/full + /dev/random + /dev/urandom + /dev/tty + +Additional devices may be added by the container engine. For example, +when deploying system containers with Docker, you typically see the +following devices in addition to the ones listed above: + + /dev/console + /dev/pts + /dev/mqueue + /dev/shm + +Sysbox does not currently support exposing host devices inside system +containers (e.g., via the `docker run --device` option). We are +working on adding support for this. + +### Resource Limiting & Cgroups + +System container resource consumption can be limited via cgroups. + +This can be used to balance resource consumption as well as to prevent +denial-of-service attacks in which a buggy or compromised system +container consumes all available resources in the system. + +For example, when using Docker to deploy system containers, the +`docker run --cpu*`, `--memory*`, `--blkio*`, etc., settings can be +used for this purpose. + +### Host PID or Network Sharing + +System containers do not support sharing the pid or network namespaces +with the host (as this is not secure and it's incompatible with the +system container's user namespace). + +For example, when using Docker to launch system containers, the +`docker run --pid=host` and `docker run --network=host` options +do not work with system containers. + +### Privileged Container Support + +System containers are incompatible with the Docker `--privileged` +flag. See the [usage guide](usage.md#privileged-container-support) +for info on this. 
diff --git a/docs/troubleshoot.md b/docs/troubleshoot.md index a59c1c1..afbcc91 100644 --- a/docs/troubleshoot.md +++ b/docs/troubleshoot.md @@ -1,5 +1,19 @@ -Sysbox Troubleshooting -======================== +# Sysbox Troubleshooting + +## Contents + +- [Upgrading the Ubuntu Kernel](#upgrading-the-ubuntu-kernel) + - [Bionic Beaver](#bionic-beaver) + - [Disco Dingo](#disco-dingo) +- [Sysbox Installation Problems](#sysbox-installation-problems) +- [Sysbox Logs](#sysbox-logs) + - [sysbox-mgr and sysbox-fs](#sysbox-mgr-and-sysbox-fs) + - [sysbox-runc](#sysbox-runc) +- [Docker reports Unknown Runtime error](#docker-reports-unknown-runtime-error) +- [Bind Mount Permissions Error](#bind-mount-permissions-error) +- [Ubuntu Shiftfs Module Not Present](#ubuntu-shiftfs-module-not-present) +- [Unprivileged User Namespace Creation Error](#unprivileged-user-namespace-creation-error) +- [Failed to Setup Docker Volume Manager Error](#failed-to-setup-docker-volume-manager-error) ## Upgrading the Ubuntu Kernel @@ -39,31 +53,25 @@ Unpacking sysbox (1:0.0.1-0~ubuntu-bionic) ... Setting up sysbox (1:0.0.1-0~ubuntu-bionic) ... Non-disruptive changes made to docker configuration. Sending SIGHUP signal to docker daemon... + +Created symlink /etc/systemd/system/sysbox.service.wants/sysbox-fs.service → /lib/systemd/system/sysbox-fs.service. +Created symlink /etc/systemd/system/sysbox.service.wants/sysbox-mgr.service → /lib/systemd/system/sysbox-mgr.service. +Created symlink /etc/systemd/system/multi-user.target.wants/sysbox.service → /lib/systemd/system/sysbox.service. ``` -Or if your Docker daemon is configured with [userns-remap](usage.md#interaction-with-docker-userns-remap), the -expected output is: +If your Docker daemon is configured with userns-remap enabled, you may also see the following: ```console -Selecting previously unselected package sysbox. -(Reading database ... 150254 files and directories currently installed.) 
-Preparing to unpack .../sysbox_0.0.1-0~ubuntu-bionic_amd64.deb ... -Unpacking sysbox (1:0.0.1-0~ubuntu-bionic) ... -Setting up sysbox (1:0.0.1-0~ubuntu-bionic) ... - Disruptive changes made to docker configuration. Restarting docker service... -Created symlink /etc/systemd/system/sysbox.service.wants/sysbox-fs.service → /lib/systemd/system/sysbox-fs.service. -Created symlink /etc/systemd/system/sysbox.service.wants/sysbox-mgr.service → /lib/systemd/system/sysbox-mgr.service. -Created symlink /etc/systemd/system/multi-user.target.wants/sysbox.service → /lib/systemd/system/sysbox.service. ``` Both mean that the installation completed successfully. -In case an error is observed above as a consequence of a missing -software dependency, proceed to download and install the missing -package(s) as indicated below. Once this requirement is satisfied, -Sysbox's installation process will be automatically re-launched to -conclude this task. +In case an error occurs during installation as a consequence of a +missing software dependency, proceed to download and install the +missing package(s) as indicated below. Once this requirement is +satisfied, Sysbox's installation process will be automatically +re-launched to conclude this task. Missing dependency output: @@ -100,7 +108,7 @@ so the `active exited` status above is expected. ## Sysbox Logs -### sysbox-mgr & sysbox-fs +### sysbox-mgr and sysbox-fs The Sysbox daemons (i.e. sysbox-fs and sysbox-mgr) will log information related to their activities in the @@ -112,17 +120,17 @@ exercises. 
For sysbox-runc, logging is handled as follows: -* When running Docker + sysbox-runc, the sysbox-runc logs are actually stored in - a containerd directory such as: +- When running Docker + sysbox-runc, the sysbox-runc logs are actually stored in + a containerd directory such as: - `/run/containerd/io.containerd.runtime.v1.linux/moby//log.json` + `/run/containerd/io.containerd.runtime.v1.linux/moby//log.json` - where `` is the container ID returned by Docker. + where `` is the container ID returned by Docker. -* When running sysbox-runc directly, sysbox-runc will not produce any logs by default. - Use the `sysbox-runc --log` option to change this. +- When running sysbox-runc directly, sysbox-runc will not produce any logs by default. + Use the `sysbox-runc --log` option to change this. -## Docker reports "Unknown runtime" error +## Docker reports Unknown Runtime error When creating a system container, Docker may report the following error: @@ -147,20 +155,21 @@ sysbox-runc as follows: } ``` -When this file is changed, the Docker daemon needs to be restarted: +If this file is changed, the Docker daemon needs to be restarted: ```console # systemctl restart docker.service ``` **Note:** The sysbox installer automatically configures the `/etc/docker/daemon.json` -file to add the `sysbox-runc` runtime to it, and restarts the Docker daemon. +file to add the `sysbox-runc` runtime to it, and restarts the Docker daemon. Thus +this error is uncommon. ## Bind Mount Permissions Error When running a system container with a bind mount, you may see that the files and directories associated with the mount have -`nobody:nouser` ownership when listed from within the container. +`nobody:nogroup` ownership when listed from within the container. This typically occurs when the source of the bind mount is owned by a user on the host that is different from the user on the host to which @@ -168,16 +177,8 @@ the system container's root user maps. 
Recall that Sysbox containers always use the Linux user namespace and thus map the root user in the system container to a non-root user on the host. -If the system container was created via Docker with userns-remap -disabled (the default configuration of Docker), then make sure that -the bind mount source has `root:root` ownership on the host. - -If the system container was created via Docker with userns-remap -enabled, then make sure that the bind mount source has the same owner -(user:group) as the Docker userns remap configuration. - -Refer to [Docker Bind Mount Permissions](usage.md#docker-bind-mount-permissions) for further -details. +Refer to [System Container Bind Mount Requirements](usage.md#system-container-bind-mount-requirements) for +info on how to set the correct permissions on the bind mount. ## Ubuntu Shiftfs Module Not Present @@ -187,14 +188,28 @@ in the Linux kernel: ```console # docker run --runtime=sysbox-runc -it debian:latest -docker: Error response from daemon: OCI runtime create failed: container requires uid shifting but error was found: shiftfs module is not loaded in the kernel +docker: Error response from daemon: OCI runtime create failed: container requires user-ID shifting but error was found: shiftfs module is not loaded in the kernel. Update your kernel to include shiftfs module or enable Docker with userns-remap. Refer to the Sysbox troubleshooting guide for more info: unknown ``` -The Ubuntu shiftfs module is required when Sysbox detects that Docker is -running without userns-remap (Docker's default configuration). +The error likely means you are running Sysbox on an older Ubuntu +kernel, as newer Ubuntu kernels come with shiftfs. + +The Ubuntu shiftfs module is required when Sysbox is configured in +[exclusive userns-remap mode](usage.md#exclusive-userns-remap-mode) +(its default operating mode). + +You can work around this error by either: + +- Updating your Linux distro.
See + [here](../README.md#supported-linux-distros) for the list of Linux + distros supported by Sysbox, and [here](#upgrading-the-ubuntu-kernel) + for recommendations on how to update the distro. + +or -The error likely means you are running Sysbox on an older Ubuntu kernel. See [here](../README.md#supported-linux-distros) -for the list of Linux distros supported by Sysbox. +- Configuring Sysbox in docker userns-remap mode, as described + [here](usage.md#system-container-isolation-modes). This + mode does not require use of shiftfs. ## Unprivileged User Namespace Creation Error @@ -227,13 +242,15 @@ docker run --runtime=sysbox-runc -it ubuntu:latest docker: Error response from daemon: OCI runtime create failed: failed to setup docker volume manager: host dir for docker store /var/lib/sysbox/docker can't be on ..." ``` -This means that directory `/var/lib/sysbox` is on a filesystem not supported by Sysbox. +This means that Sysbox's `/var/lib/sysbox` directory is on a +filesystem not supported by Sysbox. This directory must be on one of the following filesystems: - * ext4 - * btrfs +- ext4 +- btrfs The same requirement applies to the `/var/lib/docker` directory. -This is normally the case for vanilla Ubuntu installations. +This is normally the case for vanilla Ubuntu installations, so this +error is not common. diff --git a/docs/usage.md b/docs/usage.md index e381dbf..146b1e1 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -1,9 +1,39 @@ -Sysbox User's Guide -==================== - -The Sysbox [README](../README.md) file contains the basic information -on how to install Sysbox and create system containers with it. This -document supplements the README file with additional information. +# Sysbox User's Guide + +The [Sysbox Quick Start Guide](quickstart.md) contains several examples on how +to install Sysbox and use system containers. 
+ +This document supplements the Quick Start Guide with more detailed +information on Sysbox's features, requirements, and restrictions. + +## Contents + +- [Running System Containers with Sysbox](#running-system-containers-with-sysbox) + - [Using Docker](#using-docker) + - [Using the sysbox-runc command](#using-the-sysbox-runc-command) + - [Using Other Container Managers](#using-other-container-managers) +- [System Container Isolation Modes](#system-container-isolation-modes) + - [Exclusive userns-remap mode](#exclusive-userns-remap-mode) + - [Docker userns-remap mode](#docker-userns-remap-mode) +- [Running Software inside the System Container](#running-software-inside-the-system-container) + - [Docker-in-Docker](#docker-in-docker) + - [Inner and Outer Containers](#inner-and-outer-containers) + - [Inner Docker Restrictions](#inner-docker-restrictions) + - [Inner Docker Image Persistence](#inner-docker-image-persistence) + - [Systemd](#systemd) +- [System Container Bind Mount Requirements](#system-container-bind-mount-requirements) + - [Bind Mounts in Exclusive Userns-Remap Mode](#bind-mounts-in-exclusive-userns-remap-mode) + - [Bind Mounts in Docker Userns-Remap Mode](#bind-mounts-in-docker-userns-remap-mode) +- [Storage Sharing Among System Containers](#storage-sharing-among-system-containers) +- [Support for Linux Security Modules](#support-for-linux-security-modules) + - [AppArmor](#apparmor) + - [SELinux](#selinux) + - [Other LSMs](#other-lsms) +- [Support for Exposing Host Devices inside System Containers](#support-for-exposing-host-devices-inside-system-containers) +- [Support for `docker run userns`](#support-for-docker-run-userns) +- [Rootless Container Support](#rootless-container-support) +- [Privileged Container Support](#privileged-container-support) +- [Sysbox Reconfiguration](#sysbox-reconfiguration) ## Running System Containers with Sysbox @@ -13,9 +43,11 @@ We currently support two ways of running system containers with Sysbox: 2) Using the 
`sysbox-runc` command directly (more control but harder) +Both of these are explained below. + ### Using Docker -It's easy to run system container using Docker. Simply use the `--runtime=sysbox-runc` +It's easy to run system container using Docker. Simply add the `--runtime=sysbox-runc` flag in the `docker run` command: ```console @@ -23,14 +55,15 @@ $ docker run --runtime=sysbox-runc --rm -it --hostname my_cont debian:latest root@my_cont:/# ``` -It's possible to configure Sysbox as the default runtime for Docker. This -way you don't have to use the `--runtime` flag everytime. If you wish to do this, -refer to the [Docker website](https://docs.docker.com/engine/reference/commandline/dockerd/). +If you wish, you can configure Sysbox as the default runtime for Docker. This +way you don't have to use the `--runtime` flag every time. To do this, +refer to this [Docker article](https://docs.docker.com/engine/reference/commandline/dockerd/). ### Using the sysbox-runc command It's also possible to launch system containers directly via the -`sysbox-runc` command. +`sysbox-runc` command. This is useful when wishing to control +the configuration of the system container at the lowest level. As the root user, follow these steps: @@ -64,39 +97,117 @@ Also, in step (2) above, feel free to modify the system container's of the OCI directives in this file (refer to the [Sysbox design document](design.md#oci-compatibility) for details). -### Support for Other Container Managers +### Using Other Container Managers We officially only support the above methods to run Sysbox. However, we plan to add support for other OCI compatible container managers (e.g., [cri-o](https://cri-o.io/)) soon. +## System Container Isolation Modes + +Nestybox system containers **always** use the Linux user-namespace for +enhanced isolation between the system container processes and the rest +of the system. 
+ +This is one key thing that differentiates them from regular Docker +containers, and it's done in order for the system container to be more +strongly isolated from the rest of the system while enabling +functionality that requires root in the system container to have full +privileges within the container. + +The Linux user namespace works by mapping user-IDs and group-IDs +between the container and the host. There are different ways this +mapping can be done, which give rise to what we call "system container +isolation modes". + +Sysbox supports the following system container isolation modes, listed +from strongest to weakest isolation. + +- [Exclusive userns-remap mode](#exclusive-userns-remap-mode) +- [Docker userns-remap mode](#docker-userns-remap-mode) + +A quick summary of the modes is shown in the table below. + +| Isolation Mode | User Namespace User-ID and Group-ID Mappings | Isolation Strength | Enabling Action | Shiftfs | +| ---------------------- | -------------------------------------------- | ------------------ | --------------------------------------------- | ------- | +| Exclusive userns-remap | Exclusive per system container | Highest | None (default operating mode) | Yes | +| Docker userns-remap | Common for all system containers | Medium | Configure the Docker daemon with userns-remap | No | + +More detailed explanations on each isolation mode and how to enable +them are in the sub-sections that follow. + +### Exclusive userns-remap mode + +In this mode, Sysbox allocates **exclusive** user-namespace user-ID and +group-ID mappings to each system container. + +This ensures that if a system container process somehow escapes the +container's root filesystem jail, it will find itself without any +permissions to access any other files in the system. + +It's the most secure mode offered by Sysbox, and it's the default +operating mode for Nestybox system containers (whether they are +deployed with Docker or directly via the sysbox-runc command). 
No +action from the user is required to enable this mode. + +Note that this mode requires the presence of the [shiftfs module](design.md#ubuntu-shiftfs-module) +in the Linux kernel, which in turn restricts the [distros](../README.md#supported-linux-distros) +on which Sysbox is supported. + +For further details on Sysbox's usage of the Linux user namespace and +exclusive user-ID/group-ID mappings, refer to the [Sysbox design document](design.md#linux-namespace-usage). + +### Docker userns-remap mode + +This mode is automatically entered by Sysbox when the Docker daemon +is configured with "userns-remap" enabled, as described in this +[Docker article](https://docs.docker.com/engine/security/userns-remap). + +When Docker is configured with userns-remap, it enables the +user-namespace in all containers and configures the user-ID and +group-ID mappings for them. Sysbox automatically detects this and +honors Docker's selection of user-ID and group-ID mappings. + +Note however that this mode is less secure than the exclusive +userns-remap mode described previously because Docker currently uses +the same user-ID and group-ID mappings for all containers, thereby +decreasing cross-container isolation. + +Having said that, this mode is useful because it's a bit more mature +than the [Exclusive userns-remap mode](#exclusive-userns-remap-mode) +and does not require the presence of the shiftfs module in the Linux +kernel, thereby increasing the [distros](../README.md#supported-linux-distros) +on which it's supported. + ## Running Software inside the System Container A system container is logically a super-set of a regular Docker -application container, and thus should be able to run any application -that runs in a regular Docker container plus system-level software +application container and thus should be able to run any application +that runs in a regular Docker container, plus system-level software (e.g., Docker inside the system container). 
Nestybox's goal is to allow you run any software inside the system -container just as you would on a physical host. Ideally there -shouldn't be any difference. +container just as you would on a physical host or VM. The sub-sections below provide information on running system-level -software inside a system container (i.e., software that normally does -not run inside a regular Docker container). +software inside a system container. This software does not normally +run inside a regular Docker container (unless unsecure privileged +containers are used and/or complex container configurations are set). ### Docker-in-Docker Nestybox system containers support running Docker inside the system container, without using privileged containers or bind-mounting the host's Docker socket into the container. In other words, cleanly and -securely, with total isolation between the inner and outer Docker +securely, with **total isolation** between the inner and outer Docker daemons. +This is useful for Docker sandboxing, testing and CI/CD use cases. + Moreover, it's fast: the Docker daemon inside the container uses the fast overlay2 (or btrfs) storage drivers, rather than alternative -docker-in-docker solutions that resort to the slower vfs driver. +Docker-in-Docker solutions that resort to the slower vfs driver. To run Docker inside a system container (a.k.a Docker-in-Docker), the easiest way is to use a system container image that has Docker @@ -104,66 +215,10 @@ pre-installed in it. You can find a few such images in the [Nestybox DockerHub repo](https://hub.docker.com/r/nestybox). 
-For example, to run a system container that has Ubuntu Disco + Docker inside, simply -type: - -```console -$ docker run --runtime=sysbox-runc -it --hostname sc nestybox/ubuntu-disco-docker:latest -root@sc:/# -``` - -From within the system container, start Docker: - -```console -root@sc:/# dockerd > /var/log/dockerd.log 2>&1 & -``` - -Note that we don't yet support systemd inside the system container, so -the Docker daemon needs to be started manually as shown above. - -The Docker daemon should now be running inside the system container. -You can verify this by looking at the Docker daemon log file: - -```console -root@sc:/# tail -n 2 /var/log/dockerd.log -time="2019-08-28T23:37:42.570893165Z" level=info msg="Daemon has completed initialization" -time="2019-08-28T23:37:42.593056000Z" level=info msg="API listen on /var/run/docker.sock" -``` - -Also, `docker ps` should work fine: - -```console -root@sc:/# docker ps -CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES -``` - -Once Docker is running inside the system container, use it as usual. -For example, to start an (inner) busybox container: - -```console -root@sc:/# docker run -it --hostname my_inner_cont busybox -Unable to find image 'busybox:latest' locally -latest: Pulling from library/busybox -ee153a04d683: Pull complete -Digest: sha256:9f1003c480699be56815db0f8146ad2e22efea85129b5b5983d0e0fb52d9ab70 -Status: Downloaded newer image for busybox:latest -/ # -``` - -As you can see, the system container is running Docker inside of it, -with total isolation from the host's Docker daemon. +The [Sysbox Quickstart Guide](quickstart.md) has several examples showing how +to run Docker inside a system container. -The Dockerfiles for the images in the Nestybox repo are -[here](../dockerfiles/). - -Feel free to source the Nestybox sample images from your own Dockerfile, -or make a copy of a Nestybox Dockerfile and modify it per your needs. -Instructions for doing so are [here](../dockerfiles/README.md). 
-The [Nestybox blog site](https://blog.nestybox.com) has more info on -Docker-in-Docker and it's use cases. - -### Inner & Outer Containers +#### Inner and Outer Containers When launching Docker inside a system container, terminology can quickly get confusing due to container nesting. @@ -171,212 +226,171 @@ quickly get confusing due to container nesting. To prevent confusion we refer to the containers as the "outer" and "inner" containers. -* The outer container is a system container, created at the host - level; it's launched with Docker + Sysbox. +- The outer container is a system container, created at the host + level; it's launched with Docker + Sysbox. -* The inner container is an application container, created within - the outer container. It's launched by the Docker instance running - inside the outer container. +- The inner container is an application container, created within the + outer container; it's created by the Docker instance running + inside the system container (a.k.a. the inner Docker). -### Inner Docker Restrictions +#### Inner Docker Restrictions The Docker instance inside the system container is assumed to store -it's images at the usual `/var/lib/docker`. This is known as the -Docker "data-root". +its images at the usual `/var/lib/docker`. This directory is known as +the Docker "data-root". While it's possible to configure the inner Docker to store it's images at some other location within the system container (via the Docker daemon's `--data-root` option), Sysbox does **not** currently support this (i.e., the inner Docker won't work). -### Inner Docker Image Caching +#### Inner Docker Image Persistence The Docker instance running inside the system container stores its images in the `/var/lib/docker` directory inside the container. When the system container is removed (i.e., not just stopped, but -actualy removed via `docker rm`), the contents of that directory will -also be removed.
- -This means that the inner Docker's image cache is destroyed when the -associated system container is removed. - -It's possible to override this behavior by mounting a host volume into -the system container's `/var/lib/docker`, in order to persist the -inner Docker's image cache accross system container lifecycles: +actually removed via `docker rm`), the contents of that directory will +also be removed. In other words, the inner Docker's image cache is +destroyed when the associated system container is removed. -```console -$ docker run --runtime=sysbox-runc -it --hostname sc --mount type=bind,source=/some/host/dir,target=/var/lib/docker nestybox/ubuntu-disco-docker:latest -``` +It's possible to override this behavior by mounting a host directory +into the system container's `/var/lib/docker`, in order to persist the +inner Docker's image cache across system container life-cycles. -However, Sysbox does not currently support mounts into the system -container's `/var/lib/docker` when Docker is configured with docker -userns-remap disabled (the mount is in fact ignored in this -case). This restriction is due to a low-level problem in the -interaction between overlayfs and Ubuntu's `shiftfs` module, which is -expected to be resolved soon. +The Sysbox Quick Start Guide has examples [here](quickstart.md#persistence-of-inner-container-images-with-docker-volumes) +and [here](quickstart.md#persistence-of-inner-container-images-with-bind-mounts). -## Sysbox Reconfiguration +A warning though: a given host directory mounted into a system +container's `/var/lib/docker` must only be mounted on a **single +system container at any given time**. This is a restriction imposed by +the Docker daemon, which does not allow its image cache to be shared +concurrently among multiple daemon instances. Sysbox will check +for violations of this rule and report an appropriate error during +system container creation. 
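As a sketch of the override described above, the command below bind mounts a host directory into the system container's `/var/lib/docker`. The host path `/some/host/dir` and the hostname `sc` are placeholders; the image is one of the Nestybox sample images with Docker pre-installed.

```console
$ docker run --runtime=sysbox-runc -it --hostname sc --mount type=bind,source=/some/host/dir,target=/var/lib/docker nestybox/ubuntu-disco-docker:latest
```

Per the warning above, each system container should get its own host directory: starting a second system container with the same `source` while the first one still uses it is expected to be rejected at creation time.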
-The Sysbox installer starts the [Sysbox components](design.md#sysbox-components) -automatically, using systemd. +### Systemd -Normally this is sufficient and the user need not worry about reconfiguring Sysbox. +Nestybox has preliminary support for running Systemd inside a system +container, meaning that Systemd works but there are still some minor +issues that need resolution. -However, there are scenarios where the daemons may need to be -reconfigured (e.g., to enable a given option on sysbox-fs or -sysbox-mgr). For example, increasing the log level by passing the -`--log-level debug` to the sysbox-fs or sysbox-mgr daemons. +Deploying Systemd inside a system container is useful when you plan to +run multiple services inside the system container, or when you want to +use it as a virtual host environment. -In these cases, do the following: +Unlike other solutions, Nestybox system containers run Systemd securely +(without resorting to privileged Docker containers) and without burden +on the user: simply launch a system container image with Systemd as +its entry point and Sysbox will ensure the system container is setup +to run Systemd without problems. -1) Modify the desired service initialization instruction. +The [Sysbox Quick Start Guide](quickstart.md) has a few examples of this. 
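A minimal sketch of launching such a system container is shown below. It assumes an image whose entry point is Systemd; the image name `nestybox/ubuntu-bionic-systemd` is inferred from the sample Dockerfiles in this repo and is an assumption about what is published on DockerHub, and the hostname `syscont` is a placeholder.

```console
$ docker run --runtime=sysbox-runc --rm -it --hostname syscont nestybox/ubuntu-bionic-systemd
```

Note that the command carries no `--privileged` flag or extra mounts; per the description above, Sysbox is expected to take care of the setup that Systemd needs.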
- Example: +## System Container Bind Mount Requirements -```console -$ sudo sed -i '/^ExecStart/ s/$/ --log-level debug/' /etc/systemd/system/sysbox.service.wants/sysbox-fs.service -$ -$ egrep "ExecStart" /etc/systemd/system/sysbox.service.wants/sysbox-fs.service -ExecStart=/usr/local/sbin/sysbox-fs --log /var/log/sysbox-fs.log --log-level debug -``` - -2) Reload **systemd** to digest the previous change: - -```console -$ sudo systemctl daemon-reload -``` - -3) Restart the **sysbox** service: - -```console -$ sudo systemctl restart sysbox -``` - -Note that even though Sysbox is comprised of various daemons and its -respective services, you should only interact with its outer-most -systemd service: **sysbox**. - -## Rootless Container Support - -Sysbox must run with root privileges on the host system. It won't -work if executed without root privileges. - -## Interaction with Docker Userns Remap - -Docker has a configuration called [userns-remap](https://docs.docker.com/engine/security/userns-remap) -that enables the Linux user namespace in containers. By default, userns-remap is disabled in Docker. - -Sysbox works out-of-the-box with either configuration of Docker -(i.e., with or without userns-remap enabled). No change to Sysbox's -configuration is needed either way. - -However, Sysbox works best with Docker userns-remap *disabled* -(though this reduces the number of [distros on which Sysbox runs](../README.md#supported-linux-distros)). - -The reason userns-remap disabled is preferred is that in this case -Sysbox will allocate exclusive user-ID and group-ID mappings for the -system container's user namespace, thereby improving -isolation between system containers. - -If on the other hand Docker userns-remap is enabled, then Docker -chooses the user-ID and group-ID mappings for the container and -Sysbox honors these. 
However Docker currently has a limitation: it -uses the same user-ID and group-ID mappings for all containers, -thereby decreasing isolation between containers (i.e., if a process -escapes the container, it may be able to access the filesystem -of other containers). +Sysbox system containers support all Docker storage mount types: +[volume, bind, or tmpfs](https://docs.docker.com/storage/). -The table below summarizes this: +However, for bind mounts there are important requirements in order to +deal with file ownership and security issues. +These requirements vary depending on the [system container isolation mode](#system-container-isolation-modes). -| Docker userns-remap | Description | -|---------------------|-------------| -| Disabled | Sysbox will allocate exclusive uid(gid) mappings per system container and perform uid(gid) shifting. | -| | Strong container-to-host isolation. | -| | Strong container-to-container isolation. | -| | Uses the Ubuntu shiftfs module in the kernel. | -| | -| Enabled | Sysbox will honor Docker's uid(gid) mappings. | -| | Strong container-to-host isolation. | -| | Reduced container-to-container isolation (same uid(gid) range). | -| | Does not use the Ubuntu shiftfs module in the kernel. | +The following table has a quick summary of the bind mount requirements: -For further info Sysbox's usage of the Linux user namespace and -associated ID mappings, refer to the [Sysbox design document](design.md). +| System Container Isolation Mode | Bind Source Ownership on Host | Recommended Security Precautions | +| ------------------------------- | ----------------------------- | ----------------------------------------------------------- | +| Exclusive userns-remap | 0 to 65535 | Bind source accessible by root only. | +| | | Mount as read-only in sys container when possible. | +| | | Remount on host as `noexec` prior to running sys container. 
| +| Docker userns-remap | uid to (uid+65535) | None | +| | (where `uid` is the subid | | +| | associated with the Docker | | +| | userns remap configuration) | | -## Docker Bind Mount Requirements +The sub-sections below describe in detail the rationale for these +requirements and provide examples. -Sysbox system containers support all Docker storage mount types: -[volume, bind, or tmpfs](https://docs.docker.com/storage/). +Note that the requirements listed in the table above are not specific +to Nestybox system containers; these are also generally applicable to +bind mounts on any Docker container. -However, for bind mounts there are some important requirements in -order to deal with file ownership issues. These vary depending on -whether Docker is configured with userns-remap or not, as explained -below. +Finally, the requirements don't apply to Docker volume or tmpfs +mounts, as they are implicitly met by these. -### Without Docker userns-remap +### Bind Mounts in Exclusive Userns-Remap Mode -This is the default Docker configuration. +When a system container that uses [exclusive userns-remap mode](#exclusive-userns-remap-mode) +is started with a bind mount (e.g., `docker run --runtime=sysbox-runc --mount type=bind,source=/some/source,target=/some/target ...`), +the following applies: -In this case, Sysbox will automatically mount the Ubuntu shiftfs -filesystem on the bind-mount source when a container starts and the -mount will persist until the container is stopped. +- Sysbox automatically mounts the Ubuntu shiftfs filesystem on the + bind-mount source when the container starts; the shiftfs mount is kept + in place until the container is stopped. -Sysbox uses the shiftfs mount to ensure that processes inside the -container see the right ownership for the contents of the bind mounted -directory as described [here](design.md/#ubuntu-shiftfs-module).
+- Sysbox uses the shiftfs mount to ensure that processes inside the + system container see the correct file ownership for the contents of + the bind mounted directory, as described [here](design.md#ubuntu-shiftfs-module). As a result, the following requirements apply to system container bind mounts: -* Files in the bind mounted directory (or the bind mounted file itself) - should be owned by users in the range [0:65535]. The used-ID and group-ID - of these files will be mapped inside the system container as follows: +- The bind mounted source directory or file should be owned by a user-ID + and group-ID in the range [0:65535]. The user-ID and group-ID of + these files will be mapped inside the system container as follows: - | File Ownership on Host | File Ownership in System Container | - |------------------------|------------------------------------| - | 0 (root) -> 65535 | 0 (root) -> 65535 | - | Others | nobody:nogroup | + | File User/Group-ID on Host | File User/Group-ID in System Container | + | -------------------------- | -------------------------------------- | + | 0 to 65535 | 0 to 65535 | + | Others | nobody:nogroup | - Note that it's possible for multiple system containers to share a - bind-mount, as all system containers will use the file ownership - mapping shown above. + In other words, files owned by root in the bind-mounted source + directory are owned by the root user in the system container. Files + owned by user 1000 in the bind-mounted source directory are owned by + user 1000 inside the system container. And so on. + Note that it's possible for multiple system containers to share a + bind-mount source, as all system containers will use the file + ownership mapping shown above. The Sysbox Quick Start Guide has + an example of this [here](quickstart.md#sharing-storage-among-system-containers).
- - It's accessible only by host's root user (i.e., `0700` permission - somewhere in the path). +- For environments where security is important, we recommend that one + or more of the following actions be applied to the bind mount + source file or directory: - - It's mounted into the system container as "read-only" when possible: + - Make it accessible only by the host's root user (i.e., `0700` permission + somewhere in the mount source's absolute path). -```console -$ docker run --runtime=sysbox-runc -it --mount type=bind,source=/path/to/bind/source,target=/path/to/mnt/point,readonly my-syscont -``` + - Mount it into the system container as "read-only" when possible. For example: - - It's re-mounted on the host with the `noexec` attribute prior to - being bind mounted into the system container: + ```console + $ docker run --runtime=sysbox-runc -it --mount type=bind,source=/path/to/bind/source,target=/path/to/mnt/point,readonly my-syscont + ``` -```console -$ sudo mount --bind /path/to/bind/source /path/to/bind/source -$ sudo mount -o remount,bind,noexec /path/to/bind/source /path/to/bind/source -$ docker run --runtime=sysbox-runc -it --mount type=bind,source=/path/to/bind/source,target=/path/to/mnt/point my-syscont` -``` + - Re-mount it on the host with the `noexec` attribute prior to + bind mounting it into the system container. For example: - These are not hard requirements, meaning that if they aren't met the - bind mount into the system container will still work. However, - failure to meet one of these requirements reduces host security - as described [here](design.md#shiftfs-security-precautions).
+ ```console + $ sudo mount --bind /path/to/bind/source /path/to/bind/source + $ sudo mount -o remount,bind,noexec /path/to/bind/source /path/to/bind/source + $ docker run --runtime=sysbox-runc -it --mount type=bind,source=/path/to/bind/source,target=/path/to/mnt/point my-syscont + ``` -### With Docker userns-remap +The above requirements are not "hard requirements", meaning that +Sysbox won't check for them (i.e., if they aren't met the system +container will still work). However, failure to meet one of these +requirements will result in incorrect file ownership on the bind mount +or reduced host security as described [here](design.md#shiftfs-security-precautions). -If Docker is configured with userns-remap, then the following -rule applies to bind mounts: +### Bind Mounts in Docker Userns-Remap Mode -* Files in the bind mounted directory (or the bind mounted file - itself) should be owned by users in the range [uid:uid+65535], where - `uid` is the host's user-ID associated with the docker userns-remap - configuration. The same applies to group-IDs. +When a system container is started with [Docker userns-remap mode](#docker-userns-remap-mode), the +following rule applies to bind mounts: + +- The bind mounted directory or file should be owned by users in the + range [uid:uid+65535], where `uid` is the start of the subid range + associated with the host user set in the docker userns-remap + configuration. The same applies to group-IDs. For example, if the docker userns-remap configuration is the following: @@ -388,11 +402,13 @@ $ cat /etc/docker/daemon.json ``` then the bind mount source directory and/or files must be owned by the -subuid(gid) range associated with `someuser`. That range can be found in -`/etc/subuid` and `/etc/subgid`. This will ensure that the file shows -up with appropriate ownership inside the system container. +subuid(gid) range associated with user `someuser`. That range can be +found in the host's `/etc/subuid` and `/etc/subgid` files.
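As a rough illustration of that lookup, the sketch below prints the first ID of a user's range from a file in the `name:start:count` format used by `/etc/subuid` and `/etc/subgid`. The helper name `subid_start` is hypothetical; it is not part of Sysbox or Docker.

```shell
# Hypothetical helper (not part of Sysbox or Docker): print the first
# subordinate ID assigned to a user in a file that uses the
# "name:start:count" format of /etc/subuid and /etc/subgid.
subid_start() {
    user="$1"   # user name to look up, e.g. someuser
    file="$2"   # file to read, e.g. /etc/subuid

    # Split each line on ':' and print the second field of the first
    # entry whose name matches; print nothing if there is no match.
    awk -F: -v u="$user" '$1 == u { print $2; exit }' "$file"
}
```

Under those assumptions, something like `sudo chown "$(subid_start someuser /etc/subuid):$(subid_start someuser /etc/subgid)" /path/to/bind/source` would give the bind mount source the start-of-range ownership that the rule calls for.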
+ +This rule ensures that files show up with appropriate ownership inside +the system container. -For example: +For example, assume the following configuration of subuids(gids): ```console $ cat /etc/subuid @@ -404,59 +420,45 @@ someuser:100000:65536 sysbox:165536:268435456 ``` -Then the bind mount source should be owned by `100000:100000` because that's -the start of the subuid(gid) range associated with `someuser`. +Then the bind mount source should be owned by `user-ID:group-ID` = +`100000:100000` because that's the start of the subuid(gid) range +associated with `someuser`. -Note that the security requirements listed in the prior section don't -apply when Docker is configured with userns-remap. That's because -files written from within the system container will have ownership -corresponding to the subuid(gid) associated with user `someuser`, -rather than `root:root`. +Note that the security requirements listed in the [prior section](#bind-mounts-in-exclusive-userns-remap-mode) +don't apply when Docker is configured with userns-remap. That's +because files written from within the system container will have +ownership corresponding to the subuid(gid) associated with user +`someuser` rather than `root:root` and thus don't present a security +risk on the host. -## Storage sharing between system containers +## Storage Sharing Among System Containers System containers use the Linux user-namespace for increased isolation -from the host and from other containers (i.e., each container has a -range of uids(gids) that map to a non-root user in the host). +from the host and from other containers. A known issue with containers that use the user-namespace is that sharing storage between them is not trivial because each container may -be assigned an exclusive uid(gid) range on the host, and thus may not -have access to the shared storage (unless such storage has lax -permissions). 
+be assigned an exclusive user-ID(group-ID) range on the host, and thus +may not have permissions to access the files in the shared storage +(unless such storage has lax permissions allowing read-write-execute +by any user). Sysbox system containers support storage sharing between multiple system containers, without lax permissions and in spite of the fact -that each system container may be assigned an exclusive uid/gid range -on the host. - -This is possible due to the uid(gid) shifting performed by the Ubuntu -`shiftfs` module. +that each system container may be assigned an exclusive +user-ID/group-ID range on the host. -Setting it up is simple: +In order to share storage between multiple system containers, simply +run the system containers and bind mount the storage into them. -First, create a shared directory owned by `root:root`: +When performing the bind-mount, the requirements described in +section [System Container Bind Mount Requirements](#system-container-bind-mount-requirements) +apply. -```console -$ sudo mkdir -``` +The Sysbox Quick Start Guide has an example of multiple system containers +sharing storage [here](quickstart.md#sharing-storage-among-system-containers). -Then simply bind-mount the directory into the system container(s): - -```console -$ docker run --runtime=sysbox-runc --rm -it --hostname syscont --mount type=bind,source=,target= debian:latest -``` - -When the system container is launched this way, Sysbox will notice -that bind mounted directory is owned by `root:root` and will mount -shiftfs on top of it, such that the container can have access to -it. Repeating this for multiple system containers will give all of -them access to the shared direcotry. - -Note: for improve host security on the bind mount source, follow the -recommendations described [above](#docker-bind-mount-requirements). - -## Support for Linux Security Modules (LSMs) +## Support for Linux Security Modules ### AppArmor @@ -468,6 +470,106 @@ the near future.
Sysbox does not yet support running on systems with SELinux enabled. -### Others +### Other LSMs Sysbox does not have support for other Linux LSMs at this time. + +## Support for Exposing Host Devices inside System Containers + +Sysbox does not currently support exposing host devices inside system +containers (e.g., via the `docker run --device` option). We are +working on adding support for this. + +## Support for `docker run userns` + +When Docker is configured in userns-remap mode, Docker offers the ability +to disable that mode on a per container basis via the `--userns=host` +option in the `docker run` and `docker create` commands. + +This option **does not work** with Sysbox (i.e., don't use +`docker run --userns=host --runtime=sysbox-runc ...`). + +Usage of this option is rare as it can lead to the problems +described [in this Docker article](https://docs.docker.com/engine/security/userns-remap/#disable-namespace-remapping-for-a-container). + +## Rootless Container Support + +Sysbox must run with root privileges on the host system. It won't +work if executed without root privileges. + +## Privileged Container Support + +System containers must never be launched using the Docker +`--privileged` flag (in fact doing so will fail). Recall that one key +reason for using system containers is to use them as a secure +alternative to privileged containers (i.e., one that can run system +level workloads but with enhanced isolation from the underlying host). + +In addition, within a system container we don't yet support running +Docker privileged containers. That is, inner Docker containers must +not be executed with the `docker run --privileged` flag. In the future +we hope to add support for this, such that the privileged inner +container is privileged within the context of the system container +only rather than the underlying host.
+ +## Sysbox Reconfiguration + +The Sysbox installer starts the [Sysbox components](design.md#sysbox-components) +automatically, using the host's Systemd. + +Normally this is sufficient and the user need not worry about re-configuring Sysbox. + +However, there are scenarios where the daemons may need to be +reconfigured (e.g., to enable a given option on sysbox-fs or +sysbox-mgr). One example is increasing the log level by passing the +`--log-level debug` option to the sysbox-fs or sysbox-mgr daemons. + +In order to reconfigure Sysbox, do the following: + +1) Stop all system containers (there is a sample script for this [here](../scr/rm_all_syscont)). + +2) Modify the desired service initialization instruction. + + For example, if you wish to change the log-level, do the following: + +```console +$ sudo sed -i --follow-symlinks '/^ExecStart/ s/$/ --log-level debug/' /lib/systemd/system/sysbox-fs.service +$ +$ egrep "ExecStart" /lib/systemd/system/sysbox-fs.service +ExecStart=/usr/local/sbin/sysbox-fs --log /var/log/sysbox-fs.log --log-level debug +``` + +3) Reload Systemd to digest the previous change: + +```console +$ sudo systemctl daemon-reload +``` + +4) Restart the sysbox service: + +```console +$ sudo systemctl restart sysbox +``` + +5) Verify the sysbox service is running: + +```console +$ sudo systemctl status sysbox.service +● sysbox.service - Sysbox General Service + Loaded: loaded (/lib/systemd/system/sysbox.service; enabled; vendor preset: enabled) + Active: active (exited) since Sun 2019-10-27 05:18:59 UTC; 14s ago + Process: 26065 ExecStart=/bin/true (code=exited, status=0/SUCCESS) + Main PID: 26065 (code=exited, status=0/SUCCESS) + +Oct 27 05:18:59 disco1 systemd[1]: sysbox.service: Succeeded. +Oct 27 05:18:59 disco1 systemd[1]: Stopped Sysbox General Service. +Oct 27 05:18:59 disco1 systemd[1]: Stopping Sysbox General Service... +Oct 27 05:18:59 disco1 systemd[1]: Starting Sysbox General Service...
+Oct 27 05:18:59 disco1 systemd[1]: Started Sysbox General Service. +``` + +That's it. You can now launch system containers. + +Note that even though Sysbox is composed of various daemons and their +respective services, you should only interact with its outer-most +systemd service called "sysbox".
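
The `sed` command in step 2 appends the option every time it runs, so repeating that step would duplicate the flag on the `ExecStart` line. Below is a minimal sketch of a guard against that, assuming GNU `sed` and an option string without regular-expression metacharacters. The helper name `append_execstart_option` is hypothetical and not part of Sysbox.

```shell
# Hypothetical helper (not part of Sysbox): append an option to the
# ExecStart line of a systemd unit file only if it is not already there,
# so that re-running the step does not duplicate the flag.
append_execstart_option() {
    unit_file="$1"   # e.g. /lib/systemd/system/sysbox-fs.service
    option="$2"      # e.g. "--log-level debug"

    # Nothing to do if an ExecStart line already carries the option.
    if grep -q -- "^ExecStart.*${option}" "$unit_file"; then
        return 0
    fi

    # Same edit as in step 2, parameterized on the option text.
    sed -i --follow-symlinks "/^ExecStart/ s|\$| ${option}|" "$unit_file"
}
```

Run with root privileges, `append_execstart_option /lib/systemd/system/sysbox-fs.service "--log-level debug"` could stand in for the `sed` command in step 2; the remaining steps stay the same.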