envd is a frontend of BuildKit, just like the Dockerfile. It has been more than a year since I started working on this project. Since the features are now relatively stable, I'd like to write a blog post about my journey with envd.
Why we need this tool
The machine learning development environment has been a pain point for a while. "Which Python are you using now?" is definitely a newbie slayer. It's even worse if you need to use CUDA. "It works on my machine!" happens a lot.
envd was created to solve the problem of the machine learning development environment. However, it goes far beyond that.
Infrastructure as code (IaC)
What a fancy name! Here it means that with an envd config file, you get the same environment on different machines, whether it's a local machine, a remote server, or a Kubernetes cluster.
Naming
It was named MIDI in the beginning, but that is not friendly for SEO.
The "d" in envd has no official meaning (as far as I know). It can be "docker", "deep learning", "dev", etc.
For more information, check this issue.
Logo
We have a cute logo designed by Lily. It's a cat face with the envd characters.
Actually, the cat only blinked once when we created the GIF. The recording tool on macOS is tricky to use. That's why it ends up blinking twice. By the way, we replaced it with SVG to make the animation clear and smooth. Writing the SVG animation from scratch is not that hard.
You can find the drafts here.
Installation
Obviously, envd is a Golang project. However, our target audience mainly uses Python. That's why we spent a lot of effort on supporting installation through pip.
As we know, Python has never done a good job of distributing pre-compiled binaries. I didn't find any good documentation on how to create a Python pre-compiled binary distribution. People just copy & paste the code from other projects. So does envd. The code is mainly copied from mosec.
I did learn something new from others' contributions:
- cibuildwheel has become mature nowadays. It's a great tool for setting up the multi-platform distribution pipeline in CI.
- You can package a binary file without any Python code. (by frostming)
- You can create the Python ABI-agnostic wheel. (by frostming)
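To make the last point concrete: a wheel's filename carries a Python tag, an ABI tag, and a platform tag. A wheel that only ships a pre-compiled executable does not link against the Python ABI, so the ABI tag can be "none" while the platform tag stays specific. The sketch below only assembles such a filename. The version number and the platform tag are made-up values for illustration, not what envd actually publishes.

```python
import re


def wheel_filename(name, version, python_tag="py3", abi_tag="none",
                   platform_tag="manylinux_2_17_x86_64"):
    """Assemble a wheel filename from its tag components.

    The defaults describe an ABI-agnostic wheel: any Python 3 interpreter,
    no dependency on a specific Python ABI, but a specific OS/architecture.
    """
    # The name component may not contain "-", so runs of characters other
    # than letters, digits, "_" and "." are replaced with "_".
    safe_name = re.sub(r"[^\w.]+", "_", name)
    return f"{safe_name}-{version}-{python_tag}-{abi_tag}-{platform_tag}.whl"


# Hypothetical version number, for illustration only.
print(wheel_filename("envd", "0.1.0"))
```

One wheel per platform is then enough, instead of one wheel per platform and Python version.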
Of course, you can use conda-forge. I have tried to create a recipe for mosec. It has a totally different packaging logic.
Rootless
As a developer, I don't like to run commands with sudo unless I have to. When I was debugging the buildkit daemon, I found that we can run it in rootless mode.
Starlark
Starlark is a dialect of Python, which makes it easy to use for machine learning engineers and data scientists.
I know that lots of configuration files are written in YAML. I personally don't like it. You may also have heard lots of complaints about the YAML format. I think the configuration file should be able to validate itself.
You can use if-condition, for-loop, etc. in Starlark. The following code works:
def build(libs, gpu=True):
    base = "ubuntu:20.04"
    if gpu:
        base = "nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04"
    for lib in sorted(libs):
        install.python_packages(name=[lib])
For more information, check the Starlark spec.
Although Starlark evaluates statements in order, we don't rely on that. We parse the file into an internal graph and construct the BuildKit Low-Level Build (LLB) graph on top of it. This tradeoff makes it easy to cache the layers.
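Here is a toy sketch of that idea, under my own assumptions: calls are only recorded into a graph object and then lowered in a fixed order, so two files that declare the same things in a different order produce the same plan. The class, the stage names, and the sorting rule are hypothetical and are not envd's real internals.

```python
class BuildGraph:
    """Toy stand-in for an internal build graph (not envd's real data model)."""

    # Fixed order in which stages are lowered, whatever the call order was.
    STAGE_ORDER = ("base", "apt", "python")

    def __init__(self):
        self.stages = {stage: set() for stage in self.STAGE_ORDER}

    def record(self, stage, value):
        """Remember a declaration; nothing is executed at this point."""
        self.stages[stage].add(value)

    def lower(self):
        """Emit build steps in a deterministic order."""
        return [
            (stage, value)
            for stage in self.STAGE_ORDER
            for value in sorted(self.stages[stage])
        ]


def plan(calls):
    """Record every call, then lower the graph once at the end."""
    graph = BuildGraph()
    for stage, value in calls:
        graph.record(stage, value)
    return graph.lower()


a = plan([("python", "numpy"), ("apt", "git"), ("base", "ubuntu:20.04")])
b = plan([("base", "ubuntu:20.04"), ("apt", "git"), ("python", "numpy")])
print(a == b)
```

A plan that does not depend on call order gives the same steps for the same declarations, which is what makes layer caching predictable.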
Starlark is also easy to extend. We added lots of envd-specific functions to make it more powerful. You can find them in the reference. It has a load function, which is similar to import in Python, to load another file. We created a new one called include (because import is reserved) to import functions from a git repository. People can create their own envd build functions and share them with others.
VSCode support
To make it more user-friendly, we have a VSCode extension for envd, which provides the following features:
- LSP: this enables the Starlark auto-completion.
- Manage envd environments
BuildKit
This is the backend of envd. Integrating with it is troublesome, mainly because it doesn't have any documentation. The only way to learn it is to read the examples. Since the source code is written in a functional style, it's a bit hard to understand. Once you get used to it, things will be easier.
There are some nice features in BuildKit:
- Parallel build
- Distributable workers
- Better cache
- Advanced operators
We will go through them one by one.
Parallel build
The main idea is to split the build graph into multiple sub-graphs and run them in parallel if possible. This is a great feature when some steps take a long time to finish while there is no overlap among them. For example, we can install the system packages and Conda environments in parallel.
The related operators are diff and merge. In the merge list, a later state overrides the previous states if they change the same directories. Sometimes, it may take longer than you expect to compute the diffs and merge them together. This should only be used when you're sure that the parallelism will save time.
Distributable workers
Basically, the frontend will construct the build graph and serialize it in a Protocol Buffer format, then send it to the backend workers through TCP or Unix Domain Socket.
It's recommended to set up a long-running BuildKit daemon and use it as a remote worker since in this way it can benefit from the cache.
By default, we will create a buildkitd container for envd to build the image.
Better cache
BuildKit can import and export the cache from and to a local directory, the image itself (inline), or a registry. You can choose whether to export the intermediate layers.
By default, the cache limit is 10% of your disk space. You can configure this through the buildkit config.
envd v0 will download a pre-built base image that contains the basic development tools and Python environment. This image can be used as the cache layer if none of the dependencies change. This is a great way to speed up the build process. You can check the nightly build benchmark.
Moby
For now, the best user experience is to use envd v1 with the moby worker. This requires the docker engine version >= 22. To enable it, you can create a new envd context like this:
envd context create --name moby --builder moby-worker --use
Note that the moby worker is still experimental. Due to the issue, we have to disable the merge operator used in envd when using the moby worker. Thus the build step might be slower, but the export step will be much faster. Overall it's still faster, especially when you have a large image, which is the common case for machine learning.
Cache
Docker layer cache is a common optimization for image building. Besides that, we also enable the cache for the APT packages, Python wheels, VSCode extensions, and oh-my-zsh plugins. This is done by mounting a cache directory at build time. The machine learning related pip wheels can be huge, which makes the cache very useful.
Horust
I totally agree that for the online environment, one container should only do one thing, which usually means running only one service. However, for the development environment, it's totally fine to run as many processes as you like, as long as they don't conflict with each other.
That's why we need a process management tool to control all of these processes. We have explored several options such as systemd, s6 overlay, and Supervisor. In the end, we decided to use Horust, which is both simple and powerful. You can check the discussion.
Shell prompt
I personally use fish with starship, which gives a great out-of-the-box shell experience. starship works well with the most common shells like bash, zsh, fish, etc. It's easy to configure and extend. You can check the starship documentation.
It works better when you have a Nerd Font installed, but since we cannot control the users' terminal configuration, we have to disable some fancy icons.
Coding in Jupyter Notebook and VSCode
These are the most common coding tools for machine learning engineers and data scientists.
Whether it's Jupyter Notebook or Jupyter Lab, it can be exposed as a normal web service.
VSCode really did a good job on remote development. You can use VSCode on your local machine to connect to a remote server or even a container running on a remote server.
Limited by the license, we have to use the Open VSX Registry. Sometimes the related CI test fails because the registry is not always stable.
Develop in the Kubernetes cluster
We were hoping to monetize envd with this feature, but not many people are interested in it. The code is open sourced as envd-server. Maybe we can bring this feature to the new openmodelz project. Although you can run mdz exec {name} -ti bash to get into a container, it doesn't support VSCode Remote for now.
Use pointer receivers
This is the most common bug in the development of envd. We have an internal build graph, which has many methods to build the LLB graph. Not all of these methods use pointer receivers, which results in an inconsistent state of the internal graph. I would prefer to use pointer receivers for all of the methods.
You might be curious why the linter doesn't catch this. That's because the methods can be nested, with the outer function using a value receiver while the inner function uses a pointer receiver.
This is also a good example to show the language design (personal opinion). You won't see this kind of bug in Rust. But Rust doesn't have a good container ecosystem. :(
Progress bar
The default docker progress bar is really complex. When I was implementing the moby push feature, I chose to reuse another progress bar lib to make life easier, although it lacks multi-line log support.
SSH agent forwarding
Actually, we can forward the host SSH credentials to the container, so we can use the git command as if we were on the host machine.
envd v1
This new version was created to address the inappropriate design of envd v0. The main idea is that the envd file should be a more general frontend of BuildKit. It should be able to build any image, not only images for the machine learning development environment.
Here is a comparison:
Features | v0 | v1 |
---|---|---|
is default for envd<v1.0 | ✅ | ❌ |
support dev | ✅ | ✅ |
support CUDA | ✅ | ✅ |
support serving | ⚠️ | ✅ |
support custom base image | ⚠️ | ✅ |
support installing multiple languages | ⚠️ | ✅ |
support moby builder | ❌ | ✅ |
Make it faster
The compileBaseImage function should be able to run faster. You can try it if you're interested in envd development.
Regrets
This feature will make it much more powerful, but also comes with complexity.
Users can use low-level operators to build the graph. We can execute the commands from the envd file in the user-defined order.
Lots of development environments are not built in one shot. This proposal wants to track the changes in the running environment and update the envd file accordingly.
Summary
It is the first time that I can work on an open source project as my daily work. I have learned a lot from the community. I hope more people can benefit from envd.