Simple Worker Architecture

The architecture of your AI product doesn't need to be complicated or fancy when starting out.

You can get a lot of bang for your buck with a very simple and stable approach: a worker queue that distributes the more effortful tasks to worker processes or machines. In the beginning, you can get a lot of mileage out of having a single GPU worker!
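To make the idea concrete, here is a minimal sketch using only the Python standard library. A thread plays the role of the single GPU worker; in a real setup the queue would be an external service and the worker a separate process or machine. The names `run_inference`, `job_queue` and `results` are illustrative, not part of any framework.

```python
import queue
import threading

job_queue = queue.Queue()
results = {}

def run_inference(payload):
    # Stand-in for the expensive GPU work.
    return payload.upper()

def worker():
    while True:
        job = job_queue.get()
        if job is None:  # sentinel: shut down cleanly
            break
        job_id, payload = job
        results[job_id] = run_inference(payload)
        job_queue.task_done()

# One worker thread plays the role of the single GPU worker.
t = threading.Thread(target=worker)
t.start()

# The web application just enqueues jobs and moves on.
job_queue.put(("job-1", "hello"))
job_queue.put(("job-2", "world"))
job_queue.join()       # wait until all enqueued jobs are processed
job_queue.put(None)    # stop the worker
t.join()

print(results)  # {'job-1': 'HELLO', 'job-2': 'WORLD'}
```

The web-facing code never blocks on the heavy work; it only enqueues. That's the whole trick, and it survives the jump from one thread to a fleet of GPU machines.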

Well, unless your product gets a LOT of traffic all of a sudden. I hope it'll be paid traffic, because that's a really nice problem to have!

The Majestic Three Tier Architecture

Clear separation of concerns, and plenty of room to grow with your needs.

That's the promise of the three tier architecture.

If the web part of your application is well-architected and can be scaled horizontally, you can get a lot of mileage out of this setup! Don't worry, you don't need to split your monolith into services or anything. No such overhead work required.

Think of three layers. Three parallel stripes next to each other. Each layer can contain one or more running processes of a single type. Processes from one layer can only communicate with the layers directly adjacent to their own. That's about it.

You can have each layer on a single machine at first, and move to multiple machines per layer later. The conceptual structure is enough to give you room to grow. Just because the application is a monolith doesn't mean you can't scale it horizontally.

What goes into the layers? Why are there three?

The first layer is there to accept external requests - the web tier. It takes traffic from the internet, answers some requests itself if possible and forwards the rest to the next layer - the application layer.

The second layer contains instances of your web application. Your Django or Ruby on Rails application servers. Each instance of the web application talks to the last layer.

The third layer contains your database(s). Most likely PostgreSQL or some flavour of MySQL.

You can have all of those layers on a single machine, or spread out across lots of different machines per layer. You can have load balancers between each layer, but that's a technical detail. The communication patterns stay the same: each layer has a clear responsibility, and it's clear where the boundaries are.
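A toy model can make the communication pattern concrete. Here each tier is just a function that only ever calls the tier directly below it - the routes, names and data are all made up for illustration, not a real framework:

```python
# Third tier: the data store (stand-in for PostgreSQL/MySQL).
DATABASE = {"user:1": "Ada"}

def db_tier(key):
    # Only the application tier talks to the database.
    return DATABASE.get(key)

def app_tier(path):
    # The application layer holds the business logic (your Django
    # or Rails app) and is the only tier allowed to query the DB.
    if path.startswith("/users/"):
        user_id = path.rsplit("/", 1)[-1]
        return db_tier(f"user:{user_id}") or "not found"
    return "unknown route"

def web_tier(path):
    # The web tier accepts external traffic; simple requests can be
    # answered here, everything else is forwarded one layer down.
    if path == "/health":
        return "ok"
    return app_tier(path)

print(web_tier("/users/1"))  # Ada
print(web_tier("/health"))   # ok
```

Notice that `web_tier` never touches `DATABASE` directly - that constraint is exactly what keeps the boundaries clear and lets each layer scale on its own.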

Where Are The Workers?

Your workers are next to those three tiers. They are their own layer next to your application servers. You'll probably want to put them on their own GPU-enabled machine, so this (more expensive) part of your setup can be scaled carefully with growing demand.

Traffic from the first layer doesn't go to them directly - instead, the communication goes through an external queue. However, your workers will probably be able to talk to the database as needed. This way everything is decoupled and you'll have an easier time deploying and operating your setup!

Your deep learning models will be loaded and waiting on those worker machine(s) and will react to new entries in the queue. Your web application won't have to run on expensive GPU machines and you won't lose jobs from the queue when the worker code needs to be updated.
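The worker loop described above can be sketched like this. An in-memory deque stands in for the external queue (Redis, SQS, RabbitMQ, ...), a plain function stands in for the model, and a dict stands in for the database - all names here are illustrative assumptions:

```python
import time
from collections import deque

def load_model():
    # In reality: load the model weights onto the GPU once, at startup.
    return lambda text: text[::-1]

external_queue = deque()      # filled by the web application
database = {}                 # workers write results back here

def worker_loop(poll_once=False):
    model = load_model()      # loaded once, kept warm in memory
    while True:
        if external_queue:
            job_id, payload = external_queue.popleft()
            database[job_id] = model(payload)
        elif poll_once:
            break             # for demo purposes: stop when drained
        else:
            time.sleep(0.1)   # idle: wait for new jobs to arrive

# The web tier enqueues work without ever touching the GPU machine.
external_queue.append(("job-42", "stressed"))
worker_loop(poll_once=True)
print(database)  # {'job-42': 'desserts'}
```

Because jobs live in the queue rather than in the worker process, restarting a worker to deploy new model code doesn't lose any pending work - it just picks the jobs up once it's back.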

In Conclusion

Using GPU-powered worker machines next to a classical three-tier architecture can be a great way to create an architecture for your deep learning product that strikes a nice balance between low complexity, convenient ops and scalability.

You don't have to get fancy when starting out, and one or more worker machines listening to a single queue for new jobs is a solid approach. I hope you'll be able to use this information to move forward and deploy your AI product with confidence!

Hi! I'm Vladislav, I help companies deploy GPU-heavy AI products to Kubernetes. If you're interested in this topic, make sure to sign up to the newsletter.