Hello, reddit!

We are the Google Site Reliability Engineering (SRE) team. Our previous AMA from almost exactly a year ago got some good questions, so we thought we’d come back and answer any questions about what we do, what it’s like to be an SRE, or anything else.

We have four experienced SREs from three different offices (Mountain View, New York, Dublin) today, but SREs are based in many locations around the globe, and we’re hiring! Hit the link to see more about what it’s like and what we work on.

We’ll be here from 12:00 to 13:00 PST (That’s 15:00 to 16:00 EST) to answer your questions. We are:

Cody Smith (/u/clusteroops), long-time senior SRE from Mountain View. Cody works on Search and Infrastructure.

Dave O’Connor (/u/sre_pointyhair), Site Reliability Manager from our Dublin, Ireland office. Dave manages the Storage SRE team in Dublin that runs Bigtable, Colossus, Spanner, and other storage tech our products are built on.

Carla G (/u/sys_exorcist), Site Reliability Engineer from NYC working on Storage infrastructure.

Marc Alvidrez (/u/toughmttr), SRE TLM (Tech Lead Manager) from Mountain View working on Social, Ads and infra.

EDIT 11:37 PST: If you have questions about today’s issue with Gmail, please see: http://www.google.com/appsstatus -- Our team will continue to post updates there

EDIT 13:00 PST: That's us - thanks for all your questions and your patience!

Comments: 961 • Responses: 23

Adys1876 karma

It is a bit ironic that you post this the very moment GMail goes down.

https://mediacru.sh/-G2CmSsXDyEN

Alright, an actual question: What was the trickiest bug/crash/issue you ever had to debug on production? (Bonus points for something that happened since the last AMA)

Edit: I know nobody else noticed but G+ is still having issues. You guys aren't out of the mud yet.

toughmttr494 karma

A particularly tricky problem to debug was the time that some of our serving jobs became unresponsive intermittently. At certain times of the day they would block for a while, then start serving again, stop, and start, and so on. After a long and tricky debugging process, we found that a big MapReduce job was firing up every few hours and, as part of its normal functioning, it was reading from /dev/random. When too many of the MapReduce workers landed on a machine, they were able to read enough to deplete the randomness available on the entire machine. It was on these machines that our serving binaries were becoming unresponsive: they were blocking on reads of /dev/random! This is when I realized that randomness is one of the finite and exhaustible resources in a serving cluster. Embracing randomness and trickiness is part of the job as an SRE!
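
To make the failure mode concrete, here is a toy sketch (plain Linux behaviour on kernels of that era, not our serving code) showing how you can watch the entropy pool and see a read of /dev/random stall:

    import os

    # Linux-specific: how much entropy the kernel thinks is left in the pool.
    with open("/proc/sys/kernel/random/entropy_avail") as f:
        print("entropy_avail:", f.read().strip())

    # On older kernels a read of /dev/random blocks once the pool is drained;
    # /dev/urandom never blocks.
    fd = os.open("/dev/random", os.O_RDONLY)
    data = os.read(fd, 64)  # can stall here if other readers have drained the pool
    os.close(fd)
    print("got", len(data), "bytes")

With enough greedy readers on one machine, every other process that touches /dev/random ends up waiting on the same exhausted pool.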

Chahk300 karma

So which one of these guys tripped over the power cord to The Cloud?

sre_pointyhair513 karma

Probably the guy.

fragglet957 karma

The guy here. I can confirm this.

Sorry about that.

sre_pointyhair566 karma

WHAAAAAT.

giffenola169 karma

More than slightly ironic. I came into this post for exactly this reason.

We'd love to hear some of the background troubleshooting steps that are taken when the nagios board goes red at Google GMAIL HQ.

sys_exorcist180 karma

When an issue may be occurring, an automatic alert system will ring someone’s pager (hopefully not mine!). Nearly all of our problems are caused by changes to our systems (either human or automated), so the first step is playing the “what is different?” game.
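
As a toy illustration of that first step (hypothetical change data, not our real tooling), you line up everything that changed shortly before the alert fired and look at the most recent suspects first:

    from datetime import datetime, timedelta

    # Hypothetical change log gathered from release, config, and automation systems.
    changes = [
        (datetime(2014, 1, 24, 10, 5), "config push: raise cache TTL"),
        (datetime(2014, 1, 24, 10, 42), "binary release: frontend r2.31"),
        (datetime(2014, 1, 24, 11, 10), "automation: shard rebalance kicked off"),
    ]

    alert_time = datetime(2014, 1, 24, 11, 15)
    window = timedelta(hours=2)

    # "What is different?" -- the most recent changes before the alert are the
    # prime suspects, so examine them newest-first.
    suspects = [c for c in changes if alert_time - window <= c[0] <= alert_time]
    for when, what in sorted(suspects, reverse=True):
        print(when, what)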

ucantsimee50 karma

Pager?!?! Why don't you use text messages for that?

sre_pointyhair378 karma

'Pager' is a synonym for 'a beepy thing that goes beep'.

Arrowmaster43 karma

And as a counterpart, what's the worst thing that's happened in non-production that could have made it to production?

clusteroops56 karma

Our integration tests are generally very good, so we rely on them to catch all sorts of potential catastrophes, like massive performance regressions, corruption of data, and wildly inappropriate results. You name it, we caught it.

notcaffeinefree248 karma

Sooo....what's it like there when a Google service goes down? How much freaking out is done?

sre_pointyhair368 karma

Very little freaking out, actually. We have a well-oiled process for this that all services use: thoroughly documented incident management procedures, so people understand their role explicitly and can act very quickly. We also exercise these processes regularly as part of our DiRT testing.

Running regular service-specific drills is also a big part of making sure that once something goes wrong, we’re straight on it.

DeviousDad138 karma

What would be the most ideal, massively helpful thing you'd like to see in a coworker? I'm talking the most downright pleasantly beneficial aspect of someone hired to work with you.

Would it be complete and thorough knowledge of XYZ programs or languages? Would it be a cooperative attitude over technical knowledge? I feel like the job postings are reviewed by people who have no direct link to the job other than reviewing the applicants. They simply look for marks on a resume, like degrees & certs, which don't speak to someone's experience or knowledge of a topic.

So what would you guys, were you doing the hiring yourself, specifically find the absolute most awesome in a coworker?

sre_pointyhair199 karma

This is a really good question. Of course, you want the people around you to be smart. But knowledge in and of itself can be taught and learned, so it’s less about knowing languages.

A huge one is having a general curiosity about the world and how it works - this translates well to how we do things (i.e. exploring uncharted territory in how our systems are built). I guess a pithy soundbite for what I look for is: “The ability to react to something unexpected happening with curiosity”.

AlwaysTheir79 karma

What well known best practices are terribly wrong when applied to services at the scale you work with? Also please share any "best practices" you personally disagree with for services of any scale.

sre_pointyhair176 karma

One thing that’s different from a lot of other places I’ve observed is that we tend not to do “level one” support (that being a NOC, or a team of people who do initial triage on technical issues before escalating to engineers who built the system).

We’ve found that engineering the system so that alerts go to the people who built it incentivises them to fix stuff properly and permanently.

Xylth60 karma

How do you coordinate a response to something like a GMail outage without email?

toughmttr69 karma

SRE is all about having backup systems with as few dependencies as possible. :-)

dapsays53 karma

When you see problems in production systems (ranging from transient errors or performance blips to serious outages), how do you balance the competing concerns of restoring service quickly and root-causing the problem? If you prioritize restoring service, do many issues ultimately go without a complete root-cause analysis? If you do root-cause them, what tools and techniques do you use, particularly for problems that end up involving several layers of the stack (e.g., kernel, TCP, and a Java application)?

clusteroops69 karma

Mitigating the impact is the top priority, if at all possible. In most cases, we can route around the failing system without destroying the evidence, by reconfiguring the load balancers, which we call "draining." Then once we understand and fix the problem, we undrain. In some cases, we need to recreate production conditions in order to tickle the bug, in which case we rely on synthetic load generators.

Almost all major issues are indeed root-caused as part of our "postmortem" process. Usually it's pretty easy to track a failure to a particular change by simply toggling the change and attempting to trigger the bug. If many changes went out in one batch, we binary-search through them to find the culprit. Understanding why a change is bogus involves all of the standard tools: debuggers, CPU and heap profilers, verbose logging, etc.
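
The binary search itself is plain bisection. A rough sketch (the triggers_bug callback is a hypothetical stand-in for deploying a prefix of the batch somewhere safe and checking whether the bug reproduces):

    def find_culprit(changes, triggers_bug):
        """Bisect an ordered batch of changes for the one that introduced a bug.

        Assumes the bug reproduces with the whole batch applied and not with
        none of it; triggers_bug(prefix) deploys that prefix somewhere safe
        and reports whether the bug shows up.
        """
        lo, hi = 0, len(changes)  # the culprit is changes[k] for some k in [lo, hi)
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if triggers_bug(changes[:mid]):
                hi = mid  # bug already present within the first `mid` changes
            else:
                lo = mid  # bug introduced by a later change in the batch
        return changes[lo]

Each probe is expensive (it means a build and a deploy), which is exactly why the logarithmic number of steps matters.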

For multi-layered problems, we rely on traces between layers, e.g. tcpdump, strace, and Google-specific RPC tracing. We figure out what we expect to see, and then compare that with what we actually observe.

honestbleeps39 karma

I have to say given the fact that Gmail went down while this AMA was starting up, /u/clusteroops is pretty much the best and most appropriate username I've ever seen on Reddit.

My question: Are you all OK? I hope you're all OK. Sometimes tech stuff goes wrong, and usually it happens at the worst possible time.

clusteroops53 karma

I'm fine. Thank you.

sre_pointyhair41 karma

I'm graaaand.

evildoctorspuds38 karma

What is your favorite snack from the breakroom?

sre_pointyhair50 karma

TAYTO

clusteroops30 karma

Kumquats.

sys_exorcist30 karma

Coffee.

toughmttr26 karma

Anything chocolate!

robscomputer36 karma

Hello,

I heard that the Google SRE team is a mix of roles and you patch code while it's live. Could you explain the method to code and fix issues with a production service while still following change management procedures? Or are these fixes done as "break fixes" and exempt from the process?

clusteroops39 karma

For most change-induced problems, we simply roll back to the last known good release or wait until the next release. If neither of those is possible, then we push the narrowest possible fix, using the same change management process but faster.

jmreicha33 karma

What types of tools and workflows do you use in your environment with regards to change management and change control?

Maybe you could take me through the steps of how a change gets put into production, from the tools and software to the different teams and groups involved, and how it affects users?

Also, what kinds of change windows do you guys like to use? Thanks a bunch!

clusteroops34 karma

We use a variety of tools, depending on the frequency and distribution speed. There are roughly a dozen common tools and many, many system-specific ones. For example, we have a tool that automatically and periodically builds and pushes new data files (e.g. lists of blocked results, experiment configuration, images) to the fleet in rsync-like fashion. It also runs automatic sanity checks and will avoid pushing broken data.

Generally pushes follow the same pattern: assemble the change (e.g. building a binary or auto-generating a data file), run some offline sanity checks, push it to a few servers, wait for smoke, and then gradually deploy to the remaining servers. SREs and software developers work together on the process, like one big team.
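
In pseudo-Python, the shape of that pattern looks roughly like this (sanity_check, deploy, and is_healthy are hypothetical stand-ins for internal tooling, not real APIs):

    import time

    def gradual_push(artifact, servers, sanity_check, deploy, is_healthy,
                     canary_size=3, soak_seconds=600, wave_size=50):
        """Canary-then-ramp rollout sketch; every callback is a stand-in."""
        # Offline sanity checks before anything touches production.
        if not sanity_check(artifact):
            raise RuntimeError("artifact failed sanity checks")

        # Push to a few servers first and "wait for smoke".
        canary, rest = servers[:canary_size], servers[canary_size:]
        for s in canary:
            deploy(s, artifact)
        time.sleep(soak_seconds)
        if not all(is_healthy(s) for s in canary):
            raise RuntimeError("canary unhealthy; aborting the push")

        # Gradually deploy to the remaining servers in waves.
        for i in range(0, len(rest), wave_size):
            wave = rest[i:i + wave_size]
            for s in wave:
                deploy(s, artifact)
            if not all(is_healthy(s) for s in wave):
                raise RuntimeError("rollout halted at wave %d" % (i // wave_size))

The important property is that a bad change stops at the canary or at an early wave instead of reaching the whole fleet at once.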

Change windows: particularly for the more complex systems, the goal is to have all of the experts immediately available and well rested in case the change triggers a bug, which sometimes occurs hours or days after the change. So we generally target after peak traffic, during business hours, between Monday and Thursday. We avoid making changes during the holidays as much as possible.

Lykii32 karma

More of a general question: What are some of the ongoing training/professional development opportunities provided by Google for their engineers? Are you encouraged to make connections with university students to mentor and help those entering the workforce?

sys_exorcist41 karma

My favorite professional development is informal chats with my teammates. There is always something they know that I don’t. There are plenty of formal classes available as well (unix internals, deep dives into various programming languages). We spend a lot of time giving talks at universities and have a booming internship program.

BA773528 karma

How did you come to get a job at google?

toughmttr52 karma

I have a degree in History, but was always interested in computers. I played with Linux for fun (remember Slackware 1.0?). After college I got a job as a sysadmin, gained skills and experience, and went on to learn a lot about networks, performance analysis, and system engineering in general. After a bunch of years in the industry, I jumped at the chance to interview at Google!

In SRE we are actually more interested in what people can do rather than CS degrees or candidates with theoretical knowledge that they can't apply. We like people who can think on their feet and figure things out. We have many colleagues here coming from various backgrounds, not necessarily just CS/computer engineering.

koobcamria23 karma

I just want to say that I'm really impressed with Google's response to this (hopefully) temporary crash. I don't blame anyone for it; such things will happen from time to time.

I'm going to go eat a sandwich and read a book. Hopefully it'll be back up and running by then.

Thanks Google Guys!

sre_pointyhair41 karma

This answer aggregates a few ‘who gets fired from a cannon’ questions :-)

Following any service issue, we are more concerned with how we’ll spot and mitigate things like this in the future than with placing blame, and we start working to make our systems better so it won’t happen again.

docwho210019 karma

What exactly do you work on (all products or only certain ones?)

sys_exorcist24 karma

SRE teams typically focus on a single service like Search, or on a piece of our infrastructure. My team works on Storage infrastructure like Colossus and Bigtable.