I have always been interested in process isolation technologies such as ZeroVM, which provides a form of isolation based on Google's NaCl and even went through formal verification at CEA as part of the Resilience FUI project led by Nexedi. I did not understand why ZeroVM was not adopted more widely, so I discussed it with one of its creators, Camuel Gilyadov. In his experience, the end of NaCl development prevented the adoption of ZeroVM and has led to new approaches that solve a similar problem.
Workers is a technology created by Cloudflare and based on V8 Isolates. It was created to solve the lack of isolation in the Lua scripting of Cloudflare's CDN. Cloudflare markets it as "Cloud without Containers". From a user's point of view, it is not very different from what has existed for 20 years in Zope: restricted Python scripts. Technically speaking, it consists of running multiple interpreters within the same V8 process (Isolate class). It is not clear, however, how code is isolated between interpreters or how strong this isolation is.
Support for other languages (C, C++, Rust) is based on WebAssembly. A complete description is available in the "Fine-Grained Sandboxing with V8 Isolates" video.
The V8 Isolate class is also used by other libraries such as isolated-vm.
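To make the idea of several interpreters sharing one process more concrete, here is a toy model in Python. It is only an analogy: `ToyIsolate` is a hypothetical class, it has nothing to do with the V8 API, and plain `exec` with a reduced set of builtins is easy to escape, so it must not be mistaken for a security boundary.

```python
class ToyIsolate:
    """Hypothetical stand-in for one interpreter with its own global state."""

    def __init__(self, name):
        self.name = name
        # Each instance gets a private global namespace. Builtins are cut
        # down to a small allowlist purely for illustration.
        self.globals = {"__builtins__": {"len": len, "range": range, "sum": sum}}

    def run(self, source):
        # Execute tenant code against this instance's namespace only.
        exec(source, self.globals)

    def get(self, key):
        return self.globals.get(key)


# Two tenants run in the same OS process but do not share variables.
tenant_a = ToyIsolate("tenant-a")
tenant_b = ToyIsolate("tenant-b")
tenant_a.run("counter = sum(range(5))")
tenant_b.run("counter = len('abc')")
```

Each tenant ends up with its own `counter`, and a name outside the allowlist, such as `open`, is simply not defined for the tenant code. How far a real engine goes beyond this kind of namespace separation is precisely the open question about the strength of the isolation.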
Developed by AWS, Firecracker is presented as an alternative to QEMU. It is written in Rust and provides only the minimal device model required by the guest operating system, excluding non-essential functionality (only 5 emulated devices are available: virtio-net, virtio-block, virtio-vsock, serial console, and a minimal keyboard controller used only to stop the microVM). It also handles resource rate limiting for microVMs, and provides a microVM metadata service to enable the sharing of configuration data between the host and guest.
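Rate limiting of this kind is commonly built on a token bucket. The sketch below shows the general technique only; it is not Firecracker's code, the class and parameter names are hypothetical, and the clock is passed in explicitly so that the behaviour is easy to follow.

```python
class TokenBucket:
    """Generic token bucket: `size` tokens are refilled every `refill_time_ms`."""

    def __init__(self, size, refill_time_ms, now_ms=0):
        self.size = size
        self.refill_time_ms = refill_time_ms
        self.tokens = float(size)  # start with a full bucket
        self.last_ms = now_ms

    def _refill(self, now_ms):
        elapsed = now_ms - self.last_ms
        if elapsed <= 0:
            return
        # Tokens come back proportionally to elapsed time, capped at `size`.
        added = elapsed * self.size / self.refill_time_ms
        self.tokens = min(float(self.size), self.tokens + added)
        self.last_ms = now_ms

    def try_consume(self, amount, now_ms):
        """Return True and spend tokens if the budget allows, else False."""
        self._refill(now_ms)
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False


# Example: a budget of 100 units (bytes or operations) per second.
bucket = TokenBucket(size=100, refill_time_ms=1000)
first = bucket.try_consume(80, now_ms=0)     # fits in the full bucket
second = bucket.try_consume(40, now_ms=0)    # only 20 tokens are left
third = bucket.try_consume(40, now_ms=500)   # half a second refills 50
```

A device emulation layer would call something like `try_consume` before serving each I/O request and delay the request when it returns `False`.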
Firecracker runs in user space and uses the Linux Kernel-based Virtual Machine (KVM) to create microVMs. The fast startup time and low memory overhead of each microVM enable you to pack thousands of microVMs onto the same machine. Each Firecracker microVM is further isolated with common Linux user-space security barriers by a companion program called "jailer". The jailer provides a second line of defense in case the virtualization barrier is ever compromised.
Firecracker is based on the musl C library by default and is usually operated through an API, which makes it suitable for shared hosting or a dynamic CDN, as in the SlapOS "slave instance" model. It can also be started from the command line:
./firecracker --api-sock /tmp/firecracker.socket --config-file
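As a rough illustration of what operating Firecracker through its API looks like, here is a minimal Python sketch that sends HTTP requests over the Unix socket. The endpoint paths and field names are given as I recall them from Firecracker's getting-started documentation and should be checked against the API specification of the version in use. The helper names, the kernel and rootfs paths, the boot arguments and the sizing values are all assumptions made for the example.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock


def build_requests(kernel_path, rootfs_path, vcpus=1, mem_mib=128):
    """Return the ordered (method, path, body) calls that boot one microVM."""
    return [
        ("PUT", "/machine-config",
         {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        ("PUT", "/boot-source",
         {"kernel_image_path": kernel_path,
          "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"}),
        ("PUT", "/drives/rootfs",
         {"drive_id": "rootfs", "path_on_host": rootfs_path,
          "is_root_device": True, "is_read_only": False}),
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


def send_all(socket_path, requests):
    """Send each request in order and stop at the first error response."""
    conn = UnixHTTPConnection(socket_path)
    try:
        for method, path, body in requests:
            conn.request(method, path, body=json.dumps(body),
                         headers={"Content-Type": "application/json"})
            response = conn.getresponse()
            payload = response.read().decode()
            if response.status >= 300:
                raise RuntimeError(f"{method} {path} failed: {payload}")
    finally:
        conn.close()


# Hypothetical image paths, for illustration only.
requests = build_requests("/tmp/vmlinux", "/tmp/rootfs.ext4")
# send_all("/tmp/firecracker.socket", requests)  # needs a running Firecracker
```

Because every microVM is driven by a handful of small requests like these, a hosting platform can create and start instances programmatically, which is what makes the API a good fit for the shared hosting and CDN use cases mentioned above.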
Its streamlined kernel loading process enables a < 125 ms startup time and a < 5 MiB footprint. Yet, it is a traditional VM approach which then spawns containers inside the VM with the runc command and supports the OCI image format. It is similar in its goals to the ChromeOS Linux VMs that are used to run a Debian system.
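A quick back-of-the-envelope calculation shows why these figures allow thousands of microVMs per machine. The sketch below treats 5 MiB as an upper bound on the per-microVM overhead and deliberately ignores the memory assigned to each guest, which has to be added on top; the function name is hypothetical.

```python
MIB_PER_GIB = 1024


def overhead_gib(microvm_count, overhead_mib=5):
    """Upper bound on total VMM overhead in GiB, excluding guest memory."""
    return microvm_count * overhead_mib / MIB_PER_GIB


one_thousand = overhead_gib(1000)   # 5000 MiB in total
four_thousand = overhead_gib(4000)  # 20000 MiB in total
```

Even with 4000 microVMs, the overhead stays below 20 GiB, so in practice the memory given to the guests, rather than the virtualisation layer, is what limits density.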
One should note that there are other small VMs on the market: Jailhouse (Siemens) and XtratuM (fentISS) are quite popular for embedded applications.
gVisor is a user-space kernel, written in Go, that implements a substantial portion of the Linux system call interface. It provides an additional layer of isolation between running applications and the host operating system. One could view it as a revival of user mode Linux, implemented in Go and with better performance.
gVisor seems to be implemented using the ptrace approach, the same one that proot relies on to deliver user-mode chroot isolation. However, it relies on a much better security model to achieve user-mode process isolation without virtualisation.
What gVisor solves best in our opinion is the problem of poor portability of containers, which has led Nexedi to reject the use of containers in production. This is clearly illustrated in the gVisor architecture:
gVisor seems for now to focus primarily on running containers. What would be really nice would be to use gVisor to execute binaries in the same way as qemu does in user space emulation mode or as proot does in combination with qemu.
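The principle of a user-space kernel that answers trapped system calls itself, instead of passing them to the host, can be sketched with a toy model. This is not how gVisor is written: `ToySentry` and its handlers are hypothetical, nothing is actually traced with ptrace, and only a few file operations on an in-memory filesystem are modelled.

```python
import errno


class ToySentry:
    """Hypothetical user-space kernel that serves system calls from its own state."""

    def __init__(self):
        self.files = {}    # in-memory filesystem: path -> bytearray
        self.fds = {}      # file descriptor -> [path, read offset]
        self.next_fd = 3   # 0, 1 and 2 are reserved for stdio
        self.handlers = {
            "open": self.sys_open,
            "write": self.sys_write,
            "read": self.sys_read,
        }

    def trap(self, name, *args):
        """Entry point reached when the application attempts a system call."""
        handler = self.handlers.get(name)
        if handler is None:
            # Anything not implemented is refused and never reaches the host.
            return -errno.ENOSYS
        return handler(*args)

    def sys_open(self, path):
        self.files.setdefault(path, bytearray())
        fd = self.next_fd
        self.next_fd += 1
        self.fds[fd] = [path, 0]
        return fd

    def sys_write(self, fd, data):
        if fd not in self.fds:
            return -errno.EBADF
        self.files[self.fds[fd][0]].extend(data)
        return len(data)

    def sys_read(self, fd, count):
        if fd not in self.fds:
            return -errno.EBADF
        path, offset = self.fds[fd]
        chunk = bytes(self.files[path][offset:offset + count])
        self.fds[fd][1] = offset + len(chunk)
        return chunk


sentry = ToySentry()
writer = sentry.trap("open", "/tmp/example")
written = sentry.trap("write", writer, b"hello")
reader = sentry.trap("open", "/tmp/example")
data = sentry.trap("read", reader, 5)
refused = sentry.trap("mount", "/dev/sda1", "/mnt")
```

The security argument rests on the dispatch step: the application only ever talks to the intermediate layer, so the attack surface exposed to the host is the small set of operations that this layer chooses to perform on its behalf.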
OSv is a minimal operating system made by Cloudius Systems, the company that also makes ScyllaDB (an alternative to Nexedi's NEO). The idea of OSv is to create very small images that can load much faster than a full-blown POSIX OS. OSv achieves this by removing features such as multiple users or process isolation. In a sense, OSv is a single-user POSIX system with a memory model closer to μClinux (or Mac OS 7) than to Linux. By having a single memory address space and "outsourcing" isolation to KVM, OSv boots fast and runs fast.
OSv relies on a utility called Capstan to turn an application developed in C, C++, Python, Node.js, etc. into a minimal image. OSv maintains a set of base images that can be used as templates for this purpose. In this sense, OSv shares the same goals of isolation and portability as SlapOS, but follows a different path: rather than relying on POSIX users (as SlapOS does), it tries to encapsulate a complete application as a standalone OS image.
Another way to understand OSv is as an alternative to the user space emulation mode of QEMU. Rather than trying to convert system calls in user space, OSv embeds a minimal OS with enough system calls to run an application, and executes it in a VM. The overhead of spawning another VM is thus reduced from a few seconds to much less than a second.
I still feel that there is no equivalent as brilliant as ZeroVM was. gVisor seems promising to me because of its flexibility and portability, but I wish it could be used for pure process isolation without containers. Overall, this overview of recent virtualisation technologies raises many questions to be addressed in a next post: