Site Reliability Engineering: how Google runs production systems

7 April 2016

A new book from technical publisher O’Reilly, “Site Reliability Engineering: how Google runs production systems”, does what it says on the cover: it describes the company’s approach to production engineering in some depth, bringing together some previously public material and much that is new.

Site Reliability Engineering (SRE) is a term coined by Ben Treynor Sloss, the senior vice president overseeing technical operations at Google. It has deep significance within Google, but is less well understood outside the company and needed some definition.

According to Niall Richard Murphy, head of Ads Reliability Engineering at Google and the instigator and chief editor of the book, the reasoning behind it was part altruism, part clarification and part demonstration of the best practices behind most of Google’s public-facing services.

“We really think, on an industry-wide basis,” said Murphy, “if organisations took the best practices that are in this book and deploy them where it makes sense, then the Web is going to be better, it is going to be more reliable and people are going to use it more. That is going to benefit Google. It’s a nice alignment of us externalising the things that we think make our web services great and showing other people how to do it.”

However, there is also much confusion around certain terminology, which Murphy hopes will be tackled by the book.

Term confusion
“Another part of the reason is, today, there is a lot of confusion in the industry around the role of DevOps, and SRE, and to a certain extent a refusal in the community to define DevOps,” Murphy asserts. “This has led to a lot of confusion about what job titles actually mean, especially when hiring people to do these things. Also, other companies, after we started being public about the fact that SRE exists, put up their own job ads for SRE, and it’s not 100% clear that those roles are commensurate.”

“So this is us,” said Murphy, “putting a line in the sand about what the role is, what it involves, the type of work you do, all of that kind of stuff. And not incidentally, it will hopefully increase confidence in the cloud services that we operate, because you can see the best practices we are using in nearly every significant public facing product we have.”

Murphy acknowledges that much of the material in the book is already in the public domain, but here it is organised and structured for clarity, and supplemented by new material.

Chapter 2, Murphy reports, is the overview of the production network at Google, and has things to say that have not been aired before.

There were plans for more material to make its debut, but inevitably, events during the book’s two-year gestation prevented that. One such instance was people leaving Google and then writing about some of the topics covered, so while not brand new, some topics get their first expression by Google here, said Murphy.

New information
However, Murphy said that there are details on new technologies, such as a Big Data technology called the Workflow System, which is covered in the data pipelines chapter.

“Some of the things we talk about, such as budget-based risk management, about how you run services, when you need to balance feature velocity and reliability — that is not super intuitive to the rest of the world, and so a new approach.”

“There’s also an article on automation, which talks about the evolution behind the cluster-level job scheduling system called ‘Borg’, which we released a paper on a while back, but some of the details around that are pretty interesting.”

Murphy said that another highlight is that the book brings together trends and concepts in a manner accessible to people who may not be immersed in DevOps and production engineering every day.

“For people who aren’t necessarily deeply embedded in the industry, the new way of doing IT, the software-orientation that is represented by SRE is something that is pretty new and something that many organisations could benefit from.”

Blackbox
Most vendors present a ‘blackbox’ for the likes of load balancing and traffic management, said Murphy. SRE opens that up, allowing greater control and visibility through software to achieve whatever it is you want, not just what someone else’s control interface or feature set allows.

Murphy emphasised that the book was a gargantuan task, with more than 50 contributors from around the world, but coordinated from Ireland, where his team is based. He is proud of the Irish involvement, as he says it evidences the fact that a lot of important work for Google is done here in Ireland.

The book is available on Kindle, and also in print. Murphy said after the print run, a Creative Commons version will also be available free of charge.

www.oreilly.com

TechCentral Reporters
