Now You Have SRE Problems

You deployed your AI application and it works. Congratulations!

It's been a long road, and a first release is a big milestone.

You might not be in a hurry (maybe because your production environment isn't a big deal yet), but there are challenges ahead you should know about.

It's just a question of time until you start running into SRE problems.

What's This SRE Thing?

If you don't know, SRE stands for "Site Reliability Engineering". I would translate it to "the new term for keeping your systems running without them going poof all of a sudden". There are whole books about it, and people who have devoted most of their careers to it.

Here's my take on it. You might think that "running the product" is cleanly separated from everything else. Just some server stuff. But it's not just about servers, it's about the interplay between:

Unfortunately, it's not something you can keep separate from your product or your team. It's also not something that is ever really "done and finished"... Kinda like security.

Just as it takes time to build the product and to get it deployed, it will take time to learn how to operate it. It's not just the learning part, however. Think of it as a complex interconnected system where you'll have to iterate on each part and get them to play together nicely.

It's About Processes And Experience

There are no major shortcuts, apart from knowing the terms to look for.

If you're serious about your product, it's a good idea to start treating SRE topics with the attention and care they warrant. The sooner you start, the simpler it will be to tune your product so it's easy to operate, the more time you'll have to collect precious real-world experience with running it, and the more time your team will have to take ownership of the complete process.

Small Steps

Deploying early is a good idea. The sooner you start your first iterations on each part of the whole system, the easier it will be to keep the others attuned.

It's not about tricks, but about paying attention to processes and putting an emphasis on learning. Did you experience unexpected downtime? Note it down somewhere accessible, and try to see if there's something to learn from it. Document your processes and see if you can add a sprinkle of automation to them. Regularly think about important but non-urgent topics like:

The more regular effort you put into such topics, the less likely it is that you'll have to put out fires around your deployed product in the middle of the night.

In Conclusion

If you're so far ahead that your product is deployed, keep an eye out for SRE problems - you will face them sooner or later. It's a good idea to start preparing for them at a leisurely pace while you have the time and calm to do so.

Look into topics like "observability" (you have centralized monitoring and logging, right?), "incident handling", "postmortems" and "SLAs & SLOs". Put an emphasis on learning the fundamentals and processes instead of tools, and you'll have a much easier time growing and stabilizing your product in the future.
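To make the "SLOs" part a bit less abstract, here's a minimal sketch of the arithmetic behind an availability target, usually called an "error budget". The function names, the 30-day window and the 99.9% target are all just assumptions I picked for illustration, not something your product has to use.

```python
# A rough sketch of error-budget arithmetic for an availability SLO.
# All names and numbers here are illustrative assumptions.

MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return how many minutes of downtime the SLO allows in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * MINUTES_PER_DAY
    return total_minutes * (1.0 - slo_target)


def budget_remaining(
    slo_target: float,
    downtime_minutes: list[float],
    window_days: int = 30,
) -> float:
    """Return the unspent budget; a negative value means the SLO was missed."""
    budget = error_budget_minutes(slo_target, window_days)
    return budget - sum(downtime_minutes)


if __name__ == "__main__":
    # Example: a 99.9% target over 30 days, with two noted-down incidents.
    incidents = [10.0, 5.0]  # minutes of downtime per incident (made up)
    print(f"budget:    {error_budget_minutes(0.999):.1f} min")
    print(f"remaining: {budget_remaining(0.999, incidents):.1f} min")
```

The point isn't the code, but the habit: if you note down your downtimes somewhere, a number like this tells you whether you can afford to take risks or should spend time on stability first.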

Hi! I'm Vladislav, I help companies deploy GPU-heavy AI products to Kubernetes.