Zero-Downtime Deploys with a Single Server

PyDist uses a custom blue/green deploy mechanism to achieve zero-downtime deploys without significant resource overhead. A standard blue/green deploy setup requires two application instances—a live instance serving production traffic, and a standby instance where the new application version is deployed. These sit behind a proxy (usually nginx) which initially sends all traffic to the live instance, but can be hot-reloaded to start routing requests to the standby instance once it is ready:

Typical blue/green architecture

In the above illustration, "blue" is the production instance and "green" is the standby instance at the start of the deploy. The Nginx proxy sends production traffic to the blue instance, but routes special staging URLs to the green instance. New versions of the application are deployed to a standby instance (green in this case). Once the deploy has finished and any verification of the green instance has passed, we can update the Nginx configuration to swap the roles of blue and green, so production traffic starts going to the green instance.
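The routing half of this setup can be sketched as Nginx configuration. The upstream addresses and server names below are hypothetical, but the shape matches the description above: production traffic goes to the blue upstream, staging URLs to green:

```nginx
# nginx-blue.conf (sketch): blue serves production, green is the standby.
upstream blue  { server 10.0.0.1:8000; }
upstream green { server 10.0.0.2:8000; }

server {
    listen 80;
    server_name pydist.example;          # production traffic -> blue
    location / { proxy_pass http://blue; }
}

server {
    listen 80;
    server_name staging.pydist.example;  # staging URLs -> green
    location / { proxy_pass http://green; }
}
```

Swapping the roles of blue and green is then just a matter of installing the mirror-image configuration and reloading Nginx.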

To facilitate this, I maintain two Nginx configurations, one of which routes production traffic to blue and the other to green. To avoid the two drifting out of sync, the green configuration is created from the blue configuration via a make rule:

nginx/nginx-green.conf: nginx/nginx-blue.conf
	cp nginx/nginx-blue.conf nginx/nginx-green.conf
	sed -i 's/blue/cyan/g' nginx/nginx-green.conf
	sed -i 's/green/blue/g' nginx/nginx-green.conf
	sed -i 's/cyan/green/g' nginx/nginx-green.conf

Note that "cyan" is essentially a temporary variable to allow swapping blue and green.
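The same three-step swap can be written in a few lines of Python (the function name is mine, for illustration, not part of the deploy tooling):

```python
def swap_tokens(text: str, a: str = "blue", b: str = "green", tmp: str = "cyan") -> str:
    """Swap every occurrence of a and b, using tmp as a temporary token.

    tmp must not already occur in the text, just as "cyan" must not
    appear in nginx-blue.conf for the sed pipeline to be correct.
    """
    assert tmp not in text
    return text.replace(a, tmp).replace(b, a).replace(tmp, b)
```

For example, `swap_tokens("proxy_pass http://blue;")` yields `"proxy_pass http://green;"`.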

Deploys overwrite the live Nginx configuration with the one that points to the other instance. This does not take effect until I run sudo systemctl reload nginx, after which new requests are routed according to the new configuration. Nginx configuration reloads are atomic and do not disrupt in-flight traffic, so this deploy process results in zero downtime.

The downside of this process is that we are using three servers to do the work of one. PyDist is a self-funded service offering low-cost package hosting, so avoiding unnecessary infrastructure cost is important.

One improvement is to make the standby instance ephemeral, setting it up immediately before a deploy and then tearing the old instance down after connections have fully drained from it. This reduces the number of servers to 2 + ε, but significantly complicates the deploy process. Instead, PyDist does all of this on one server:

Blue/green architecture on a single server

At the cost of higher memory usage on the application server, this architecture reduces resource overhead, reduces latency, and eliminates a point of failure. It also makes deploys simpler and less error-prone because there is only a single server to interact with. Because Nginx is so efficient and the staging routes see so little traffic, the application performance impact is minimal. The standby instance can be stopped once the switchover is complete to reclaim even this small overhead.
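Concretely, the single-server variant amounts to running two copies of the application on different local ports and letting Nginx choose between them. Only the upstream definitions change relative to the two-server layout (ports here are hypothetical):

```nginx
# In the single-server variant, blue and green are two local
# processes distinguished only by port:
upstream blue  { server 127.0.0.1:8001; }
upstream green { server 127.0.0.1:8002; }
```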

Automating the Deploy Process

Deploys should be safe, fast, and painless, which means automating them down to as few commands as possible. Since I want to allow manual testing of the new instance before switching production traffic to it, this requires a minimum of two commands—one to deploy the new instance and another to update the Nginx config and switch traffic over. For PyDist's UI server this looks like:

python [--dry-run] [--remote] [--init] [--install] [--migrate] [--autoswitch]
ssh "sudo /mnt/pydist/"

The deploy script is essentially a wrapper around a few rsync calls and a few small scripts invoked on the server over ssh, with options (such as --dry-run, --migrate, and --autoswitch above) to skip or include individual steps.

Originally I used a bash script for the deploy script, but (as is usually the case) once I added non-trivial logic to it I came to regret that decision and rewrote it in Python. I'm pretty happy with the script now, although its interface could be simplified further by automatically detecting whether updates or database migrations are necessary.

One challenge of blue/green deployments with persistent instances is that the blue and green instances swap roles with each deployment, so subsequent deploys need to target the other instance. To handle this, the Nginx configuration includes a special /bg route, which returns the string blue or green depending on which instance is serving production traffic. The deploy script queries this route and then deploys to the opposite instance.
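The instance-selection logic is tiny. A sketch in Python (URL and function names are mine):

```python
import urllib.request

def other_color(live: str) -> str:
    # Given the color currently serving production, return the deploy target.
    if live not in ("blue", "green"):
        raise ValueError(f"unexpected /bg response: {live!r}")
    return "green" if live == "blue" else "blue"

def deploy_target(bg_url: str = "https://pydist.example/bg") -> str:
    # Ask the proxy which color is live, then deploy to the other one.
    live = urllib.request.urlopen(bg_url).read().decode().strip()
    return other_color(live)
```

On the Nginx side, the /bg route can be as simple as `location = /bg { return 200 "blue"; }` in the blue configuration; conveniently, the sed-based make rule above flips the returned string to "green" along with everything else.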

Keeping it Simple

You may have noticed that I didn't mention building containers or running a CI/CD pipeline—standard features in every infrastructure blog post these days. These technologies have their place, but they come at a cost.

Containers create another layer of abstraction between you and your code, which can be helpful (if the container is easier to reason about than the underlying operating system) or harmful (if the abstraction leaks, or the underlying operating system is more familiar or convenient to work with). They are orders of magnitude larger than my application, slowing down deploys. And they come with performance penalties which are not always easy to reason about.

Of course, containers have their place. In short, they are great for larger companies, and I can wholeheartedly recommend them to my competitors.

CI/CD pipelines are more benign, but they're not really necessary when your build process is trivial enough to fold into a deploy script (in my case, it is just make) and you don't have to worry about other developers skipping proper verification. As a solo developer, I can afford to use a more holistic deploy process: combining automated tests with manual QA for large code changes, but only cursory checks when I correct a typo or publish a new blog post.
