Introduction

The vocabulary in the container world is often used in a confusing way. In particular the word container is used to refer to two things:

  • at rest, container means a file or a set of files, which should rather be referred to as a Container Image or Container Repository.
  • when you type a command to start the container, the Container Engine creates a Linux process by making API calls to the Linux kernel; this running process is the second meaning.

Info

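Container processes run with a higher level of isolation than ordinary processes, using kernel features covered later in this document, such as namespaces, cgroups, SELinux and AppArmor.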

Containers have been available on Linux since 2001, through initiatives such as the following:

  • Linux-VServer (2001)
  • OpenVZ (2005)
  • LXC (2008)
  • Let Me Contain That for You (lmctfy) (2013)

The creation of several container image formats and container engines has significantly simplified adoption, but competing standards exist.

Container Images, Image Formats and Repositories

Image Formats

Today, almost all major tools and engines have moved to a format defined by the Open Container Initiative (OCI). This image format defines the layers and metadata within a container image. Essentially, the OCI image format defines a container image composed of tar files for each layer, and a manifest.json file with the metadata.

Historically, each Container Engine had its own container image format: LXD, RKT, and Docker all differed. Some formats were made up of a single layer, while others were made up of a bunch of layers in a tree structure. The Docker v2 image format was used as the basis for OCI.

Images or Repositories?

When people use the words container image, they are often referring to a repository, that is, a bundle of multiple image layers and metadata. In fact, on the command line you specify a repository, not an image:

docker pull rhel7

This is actually expanded automatically to docker pull registry.access.redhat.com/rhel7:latest. This can be confusing, and many people refer to this as an image or a container image. However, the first column in the output of docker images is REPOSITORY:


REPOSITORY                              TAG      IMAGE ID       CREATED       VIRTUAL SIZE
registry.access.redhat.com/rhel7        latest   6883d5422f4e   4 weeks ago   201.7 MB
registry.access.redhat.com/rhel         latest   6883d5422f4e   4 weeks ago   201.7 MB
registry.access.redhat.com/rhel6        latest   05c3d56ba777   4 weeks ago   166.1 MB
registry.access.redhat.com/rhel6/rhel   latest   05c3d56ba777   4 weeks ago   166.1 MB
...

When you specify the repository on the command line, the Container Engine does some extra work for you:

  • resolving the repository against a configured list of registry servers
  • defaulting the tag to latest

To express the full URL yourself, use the format REGISTRY/NAMESPACE/REPOSITORY[:TAG], for example docker pull registry.access.redhat.com/rhel7/rhel:latest
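As a rough illustration of the REGISTRY/NAMESPACE/REPOSITORY[:TAG] format, the following Python sketch splits a fully qualified reference into its parts. The names ImageReference and parse_full_reference are hypothetical, and the logic is a simplification that does not reproduce the rules a real container engine applies.

```python
from typing import NamedTuple


class ImageReference(NamedTuple):
    """Hypothetical container for the four parts of a full reference."""
    registry: str
    namespace: str
    repository: str
    tag: str


def parse_full_reference(reference: str, default_tag: str = "latest") -> ImageReference:
    """Split REGISTRY/NAMESPACE/REPOSITORY[:TAG] into its components."""
    # The registry is everything before the first slash (it may contain a port).
    registry, _, remainder = reference.partition("/")
    # The repository (with optional tag) is everything after the last slash.
    namespace, _, name = remainder.rpartition("/")
    if not namespace or not name:
        raise ValueError(f"not a fully qualified reference: {reference!r}")
    repository, _, tag = name.partition(":")
    # Fall back to the default tag when none is given.
    return ImageReference(registry, namespace, repository, tag or default_tag)


if __name__ == "__main__":
    print(parse_full_reference("registry.access.redhat.com/rhel7/rhel:latest"))
```

Short forms such as rhel7 are deliberately rejected here, because expanding them depends on the engine's configured registry list.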

Image Layer

Image layers in a repository are connected together in a parent-child relationship. Each image layer represents changes between itself and the parent layer.
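The parent-child relationship can be pictured with a small Python sketch in which each layer is a dictionary of changes relative to its parent. This is only a conceptual model: real layers are tar archives, and the WHITEOUT marker and the flatten function are hypothetical names chosen for this illustration.

```python
# Hypothetical marker meaning "this path was deleted in this layer".
WHITEOUT = None


def flatten(layers):
    """Apply layer diffs in order (base first) and return the resulting file view."""
    view = {}
    for diff in layers:
        for path, content in diff.items():
            if content is WHITEOUT:
                # The child layer removes a file that a parent layer provided.
                view.pop(path, None)
            else:
                # The child layer adds a file or replaces the parent's version.
                view[path] = content
    return view


if __name__ == "__main__":
    base = {"/etc/os-release": "rhel7", "/tmp/build.log": "scratch"}
    child = {"/usr/bin/app": "v1", "/tmp/build.log": WHITEOUT}
    grandchild = {"/usr/bin/app": "v2"}
    print(flatten([base, child, grandchild]))
```

Because every layer only records differences, the same base layer can be shared by many images.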

Since Docker 1.7, there is no native tooling to inspect image layers in a local repository (there are tools for online registries). With the help of a tool called Dockviz, you can quickly inspect all of the layers: each layer has a tag and a Universally Unique Identifier (UUID).

Image tags

Tags are a way for image builders to tell consumers which image layers are best to use.

Info

This is only a convention: neither OCI nor any other standard mandates what tags should be used for.

One can list the tags available for a specific repository like so:

curl -s registry.access.redhat.com/v1/repositories/rhel7/tags

Namespaces

Namespaces separate groups of repositories. For example, on Docker Hub the namespace is the username of the person sharing the image, while Red Hat uses the product name (rhel7, openshift, etc.).

Tip

There might be a default repository for a given namespace, so the following commands are equivalent:

docker pull fedora

docker pull docker.io/fedora

docker pull docker.io/library/fedora:latest
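To show why these three commands are equivalent, here is a minimal Python sketch that fills in the missing parts of a short reference. The defaults (docker.io, library, latest) are taken from the longest command above; the function name normalize and the rule used to recognise a registry host (a dot, a colon, or localhost in the first component) are simplifying assumptions, not the engine's actual implementation.

```python
def normalize(reference: str,
              default_registry: str = "docker.io",
              default_namespace: str = "library",
              default_tag: str = "latest") -> str:
    """Expand a short image reference into REGISTRY/NAMESPACE/REPOSITORY:TAG."""
    parts = reference.split("/")
    first = parts[0]
    # Assumption: a leading component is a registry host only if it looks like one.
    if len(parts) > 1 and ("." in first or ":" in first or first == "localhost"):
        registry, parts = first, parts[1:]
    else:
        registry = default_registry
    # A bare repository name gets the default namespace.
    if len(parts) == 1:
        parts = [default_namespace] + parts
    path = "/".join(parts)
    # Only the last component can carry a tag.
    if ":" not in parts[-1]:
        path += ":" + default_tag
    return f"{registry}/{path}"


if __name__ == "__main__":
    for short in ("fedora", "docker.io/fedora", "docker.io/library/fedora:latest"):
        print(short, "->", normalize(short))
```

All three inputs expand to the same fully qualified reference in this sketch.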

Container Engines and container runtimes

A container engine is the piece of software that accepts user requests, including command line options, and runs the container from the user's perspective. There are many container engines:

  • Docker
  • RKT
  • CRI-O
  • LXD
  • Engines built by PaaS and container platforms for internal use

Container engines do not actually run the containers themselves; they delegate that to an OCI runtime such as runc. They are still responsible for:

  • Handling user input
  • Handling input over an API often from a Container Orchestrator
  • Pulling the Container Images from the Registry Server
  • Decompressing and expanding the container image on disk using a Graph Driver (block or file, depending on the driver)
  • Preparing a container mount point, typically on copy-on-write storage (again block or file, depending on the driver)
  • Preparing the metadata which will be passed to the Container Runtime to start the Container correctly
    • Using some defaults from the container image (e.g. ArchX86)
    • Using user input to override defaults in the container image (e.g. CMD, ENTRYPOINT)
    • Using defaults specified by the container image (e.g. seccomp rules)
  • Calling the Container Runtime
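The metadata preparation step in the list above can be sketched in Python as a merge of image defaults and user overrides. The key names (architecture, entrypoint, cmd) and the function build_runtime_metadata are hypothetical and chosen only for this illustration; a real engine works with richer structures.

```python
def build_runtime_metadata(image_defaults: dict, user_overrides: dict) -> dict:
    """Start from the image defaults, then let explicit user input win."""
    metadata = dict(image_defaults)  # copy, so the image configuration is not mutated
    for key, value in user_overrides.items():
        # None means "the user did not set this option", so the default stays.
        if value is not None:
            metadata[key] = value
    return metadata


if __name__ == "__main__":
    image_defaults = {
        "architecture": "x86_64",
        "entrypoint": ["/usr/bin/app"],
        "cmd": ["--serve"],
    }
    user_overrides = {"cmd": ["--debug"], "entrypoint": None}
    print(build_runtime_metadata(image_defaults, user_overrides))
```

The merged result is what the engine would hand over to the container runtime in this simplified picture.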

Container runtime

The container runtime is a lower-level component, typically used by a Container Engine, but it can also be used by hand for testing. The reference implementation of the Open Container Initiative (OCI) Runtime Standard is runc. It is the most widely used container runtime, but there are other OCI-compliant runtimes, such as crun, railcar, and Kata Containers. Docker, CRI-O, and many other Container Engines rely on runc.

The container runtime is responsible for:

  • Consuming the container mount point provided by the Container Engine (can also be a plain directory for testing)
  • Consuming the container metadata provided by the Container Engine (can also be a manually crafted config.json for testing)
  • Communicating with the kernel to start containerized processes (clone system call)
  • Setting up cgroups
  • Setting up SELinux Policy
  • Setting up AppArmor rules
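For testing by hand, the manually crafted config.json mentioned in the list above could be generated with a few lines of Python. This is a minimal sketch: the field names follow the OCI runtime specification as far as the author of this example is aware, but the result has not been validated against the specification, and the function name, the rootfs path and the chosen values are assumptions.

```python
import json


def minimal_runtime_config(rootfs: str, args: list,
                           namespaces=("pid", "network", "ipc", "uts", "mount")) -> dict:
    """Build a small config.json-like structure for an OCI runtime."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "user": {"uid": 0, "gid": 0},
            "cwd": "/",
            "args": list(args),
            "env": ["PATH=/usr/sbin:/usr/bin:/sbin:/bin"],
        },
        # The mount point (or plain directory) consumed by the runtime.
        "root": {"path": rootfs, "readonly": True},
        # One entry per kernel namespace the process should be placed in.
        "linux": {"namespaces": [{"type": ns} for ns in namespaces]},
    }


if __name__ == "__main__":
    config = minimal_runtime_config("rootfs", ["sh"])
    print(json.dumps(config, indent=2))
```

In practice it is usually easier to start from a runtime-generated template and edit it than to write the file from scratch.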

When the Docker engine was first created, it relied on LXC as the container runtime. Later, the Docker team developed their own library, called libcontainer, to start containers. This library was written in Go and compiled into the original Docker engines.

Kernel namespaces

The container runtime makes use of kernel namespaces, a feature that allows each process to have its own mount points, network interfaces, user identifiers, process identifiers, etc.

Instead of creating the new process with a plain fork() system call, the runtime uses clone(), which accepts flags that place the new process in its own namespaces and thereby isolate it.
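One way to observe namespaces without creating any is to read the links the Linux kernel exposes under /proc. The Python sketch below lists the namespaces of the current process; the function name current_namespaces is hypothetical, and on systems without /proc/self/ns it simply returns an empty dictionary.

```python
import os


def current_namespaces(pid: str = "self") -> dict:
    """Map each namespace type (pid, net, mnt, ...) to its kernel identifier."""
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):
        # Not a Linux system, or /proc is not mounted.
        return {}
    namespaces = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            # Each entry is a symbolic link whose target identifies the namespace.
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return namespaces


if __name__ == "__main__":
    for name, identifier in current_namespaces().items():
        print(f"{name}: {identifier}")
```

Two processes that show the same identifier for a given type share that namespace; a containerized process would typically show different identifiers from the host's processes.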

Container Runtime Interface

When Google released Kubernetes in 2015, the individual nodes of the cluster used Docker’s runtime to run containers and manage container images. In late 2016, developers introduced an abstraction between Kubernetes and the container runtime it uses: the Container Runtime Interface — or CRI, for short.

To plug a new container runtime into Kubernetes, all that is needed is a small piece of code called a shim that translates requests made by Kubernetes into requests understandable by the runtime. In theory, each additional runtime would need a custom shim, but a generic one exists for all container runtimes that implement the OCI Specification.
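The role of a shim can be pictured as a small translation layer. The Python sketch below turns a hypothetical orchestrator request into the command lines an OCI runtime such as runc would be asked to run. The request type and function name are invented for this example, and the real CRI is a gRPC API with a much larger surface.

```python
from dataclasses import dataclass


@dataclass
class StartContainerRequest:
    """Hypothetical, heavily simplified request coming from the orchestrator."""
    container_id: str
    bundle_path: str  # directory holding the root filesystem and config.json


def to_runtime_commands(request: StartContainerRequest,
                        runtime_binary: str = "runc") -> list:
    """Translate one orchestrator request into OCI runtime invocations."""
    return [
        # First create the container from its bundle, then start its process.
        [runtime_binary, "create", "--bundle", request.bundle_path, request.container_id],
        [runtime_binary, "start", request.container_id],
    ]


if __name__ == "__main__":
    request = StartContainerRequest("web-1", "/var/lib/bundles/web-1")
    for command in to_runtime_commands(request):
        print(" ".join(command))
```

The sketch only builds the commands and does not execute them; swapping runtime_binary for another OCI-compliant runtime illustrates why a single generic shim can serve several runtimes.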

CRI-O is a minimal runtime implementation that adheres to CRI and allows Kubernetes to run containers without Docker.

Container Host

The container host is the system that runs the containerized processes, often simply called containers. It could be your laptop, a VM instance in your public cloud, etc. Container hosts typically cache images after they are pulled from the registry server.

Registry servers

Registry servers are fancy file servers that are used to store Docker repositories. When a container engine does not have a locally cached copy of a repository, it pulls it from a registry server. By default, docker.io is configured, but others can be added.

Warning

Docker trusts the registry server, so be careful: you might be pulling licensed software, insecure software, etc.

The Graph Driver

The graph driver is the piece of software that maps the necessary image layers to local storage. The image layers can be mapped to a directory using aufs, devicemapper, btrfs, zfs and overlayfs.

When a new container process is started, the image layers are mounted read-only with a kernel namespace, and a copy-on-write layer is created to allow the container to write data.
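The copy-on-write behaviour can be imitated in Python with a ChainMap: lookups fall through the read-only image layers, while every write lands in a fresh top layer. This is only an analogy for what a graph driver does on disk, and the function name mount_container is hypothetical.

```python
from collections import ChainMap
from types import MappingProxyType


def mount_container(image_layers):
    """Return (filesystem view, writable layer) for layers ordered base first."""
    # Image layers are exposed read-only; the topmost image layer is searched first.
    read_only = [MappingProxyType(layer) for layer in reversed(image_layers)]
    writable = {}  # the per-container copy-on-write layer
    return ChainMap(writable, *read_only), writable


if __name__ == "__main__":
    base = {"/etc/os-release": "base"}
    app = {"/app/run": "v1"}
    filesystem, upper = mount_container([base, app])
    filesystem["/app/run"] = "patched"  # the write goes to the writable layer only
    print(filesystem["/app/run"], app["/app/run"], upper)
```

Since the image layers are never modified, several containers can share them while each keeps its own writable layer.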

Container orchestration

The need for container orchestration emerges after teams install a container host and pull some repositories. Soon they want to use a cluster of container hosts to schedule work and standardize how applications are defined.

A container orchestrator really does two things:

  • Dynamically schedules container workloads within a cluster of computers. This is often referred to as distributed computing.
  • Provides a standardized application definition file (kube yaml, docker compose, etc)
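The scheduling half of the list above can be illustrated with a deliberately naive Python sketch that places each workload on the host with the most free CPU. The function name, the single CPU dimension and the greedy strategy are assumptions made for this example; real orchestrators weigh many more constraints.

```python
def schedule(workloads: dict, hosts: dict) -> dict:
    """Assign each workload (name -> CPUs needed) to a host (name -> free CPUs)."""
    free = dict(hosts)  # work on a copy so the caller's view is untouched
    placement = {}
    # Place the most demanding workloads first.
    for name, needed in sorted(workloads.items(), key=lambda item: -item[1]):
        host = max(free, key=free.get)  # host with the most free capacity
        if free[host] < needed:
            raise RuntimeError(f"no host has enough capacity for {name}")
        free[host] -= needed
        placement[name] = host
    return placement


if __name__ == "__main__":
    print(schedule({"db": 3, "web": 2}, {"host-a": 4, "host-b": 2}))
```

An application definition file plays the role of the workloads argument here: it declares what should run, and the orchestrator decides where.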

Kubernetes has become the de facto standard in container orchestration, similar to Linux before it, while alternatives such as Swarm and Mesos are losing traction. If you are looking at container orchestration, Red Hat recommends its enterprise distribution, OpenShift.

Container use cases

Today most containers are application containers (e.g. MySQL), but containers can also be used to run whole operating systems, as with LXC and LXD. Super Privileged Containers (SPC) can be used for monitoring and other administrative tasks, such as loading kernel modules on Kubernetes or OpenShift.

OCI

To make sure that all container runtimes could run images produced by any build tool, the community started the Open Container Initiative — or OCI — to define industry standards around container image formats and runtimes.

Docker’s original image format has become the OCI Image Specification, and various open-source build tools support it, including:

  • BuildKit, an optimized rewrite of Docker’s build engine;
  • Podman, an alternative implementation of Docker’s command-line tool that doesn’t need a daemon;
  • Buildah, a command-line alternative to writing Dockerfiles;
  • Skopeo, a CLI tool to interact with registries.

Given an OCI image, any container runtime that implements the OCI Runtime Specification can unbundle the image and run its contents in an isolated environment. Docker donated its runtime, runc, to the OCI to serve as the first implementation of the standard.

Other open-source implementations

Kata Containers is an implementation that uses virtual machines rather than Linux namespaces. Namespaces allow applications to escape their containers under certain circumstances, and specific use cases, like running untrusted workloads, require stronger security guarantees.

gVisor, a.k.a. runsc, focuses on security and efficiency and was released by Google in 2018. Applications running inside the gVisor sandbox rarely interact with the underlying Linux kernel directly, reducing the attack surface untrusted workloads may exploit. The sandbox implements many Linux system calls in userspace.

Firecracker is a runtime optimized for serverless workloads. This container technology powers AWS Lambda and AWS Fargate. Firecracker runs containerized applications inside MicroVMs: lightweight virtual machines optimized for running single applications instead of entire operating systems.

References