Hello, reddit!

We are the Google Site Reliability Engineering (SRE) team. Our previous AMA from almost exactly a year ago got some good questions, so we thought we’d come back and answer any questions about what we do, what it’s like to be an SRE, or anything else.

We have four experienced SREs from three different offices (Mountain View, New York, Dublin) today, but SREs are based in many locations around the globe, and we’re hiring! Hit the link to see more about what it’s like and what we work on.

We’ll be here from 12:00 to 13:00 PST (That’s 15:00 to 16:00 EST) to answer your questions. We are:

Cody Smith (/u/clusteroops), long-time senior SRE from Mountain View. Cody works on Search and Infrastructure.

Dave O’Connor (/u/sre_pointyhair), Site Reliability Manager from our Dublin, Ireland office. Dave manages the Storage SRE team in Dublin that runs Bigtable, Colossus, Spanner, and other storage tech our products are built on.

Carla G (/u/sys_exorcist), Site Reliability Engineer from NYC working on Storage infrastructure.

Marc Alvidrez (/u/toughmttr), SRE TLM (Tech Lead Manager) from Mountain View working on Social, Ads and infra.

EDIT 11:37 PST: If you have questions about today’s issue with Gmail, please see: http://www.google.com/appsstatus -- Our team will continue to post updates there

EDIT 13:00 PST: That's us - thanks for all your questions and your patience!

Comments: 961 • Responses: 23

Adys1876 karma

It is a bit ironic that you post this the very moment GMail goes down.

https://mediacru.sh/-G2CmSsXDyEN

Alright, an actual question: What was the trickiest bug/crash/issue you ever had to debug on production? (Bonus points for something that happened since the last AMA)

Edit: I know nobody else noticed but G+ is still having issues. You guys aren't out of the mud yet.

toughmttr494 karma

A particularly tricky problem to debug was the time that some of our serving jobs became unresponsive intermittently. At certain times of the day they would block for a while, then start serving again, stop, and start, and so on. After a long and tricky debugging process, we found that a big MapReduce job was firing up every few hours and, as part of its normal functioning, it was reading from /dev/random. When too many of the MapReduce workers landed on a machine, they were able to read enough to deplete the randomness available on the entire machine. It was on these machines that our serving binaries were becoming unresponsive: they were blocking on reads of /dev/random! This is when I realized that randomness is one of the finite and exhaustible resources in a serving cluster. Embracing randomness and trickiness is part of the job as an SRE!
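
To make the failure mode concrete, here is a toy sketch (plain Linux behaviour on kernels of that era, not our serving code) showing how you can watch the entropy pool and see a read of /dev/random stall:

    import os

    # Linux-specific: how much entropy the kernel thinks is left in the pool.
    with open("/proc/sys/kernel/random/entropy_avail") as f:
        print("entropy_avail:", f.read().strip())

    # On older kernels a read of /dev/random blocks once the pool is drained;
    # /dev/urandom never blocks.
    fd = os.open("/dev/random", os.O_RDONLY)
    data = os.read(fd, 64)  # can stall here if other readers have drained the pool
    os.close(fd)
    print("got", len(data), "bytes")

With enough greedy readers on one machine, every other process that touches /dev/random ends up waiting on the same exhausted pool.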

Chahk300 karma

So which one of these guys tripped over the power cord to The Cloud?

sre_pointyhair513 karma

Probably the guy.

fragglet957 karma

The guy here. I can confirm this.

Sorry about that.

sre_pointyhair566 karma

WHAAAAAT.

giffenola169 karma

More than slightly ironic. I came into this post for exactly this reason.

We'd love to hear some of the background troubleshooting steps that are taken when the nagios board goes red at Google GMAIL HQ.

sys_exorcist180 karma

When an issue may be occurring, an automatic alert system will ring someone’s pager (hopefully not mine!). Nearly all of our problems are caused by changes to our systems (either human or automated), so the first step is playing the “what is different?” game.
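
As a toy illustration of that first step (hypothetical change data, not our real tooling), you line up everything that changed shortly before the alert fired and look at the most recent suspects first:

    from datetime import datetime, timedelta

    # Hypothetical change log gathered from release, config, and automation systems.
    changes = [
        (datetime(2014, 1, 24, 10, 5), "config push: raise cache TTL"),
        (datetime(2014, 1, 24, 10, 42), "binary release: frontend r2.31"),
        (datetime(2014, 1, 24, 11, 10), "automation: shard rebalance kicked off"),
    ]

    alert_time = datetime(2014, 1, 24, 11, 15)
    window = timedelta(hours=2)

    # "What is different?" -- the most recent changes before the alert are the
    # prime suspects, so examine them newest-first.
    suspects = [c for c in changes if alert_time - window <= c[0] <= alert_time]
    for when, what in sorted(suspects, reverse=True):
        print(when, what)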

ucantsimee50 karma

Pager?!?! Why don't you use text messages for that?

sre_pointyhair378 karma

'Pager' is a synonym for 'a beepy thing that goes beep'.

Arrowmaster43 karma

And as a counterpart, what's the worst thing that's happened in non-production that could have made it to production?

clusteroops56 karma

Our integration tests are generally very good, so we rely on them to catch all sorts of potential catastrophes, like massive performance regressions, corruption of data, and wildly inappropriate results. You name it, we caught it.

notcaffeinefree248 karma

Sooo....what's it like there when a Google service goes down? How much freaking out is done?

sre_pointyhair368 karma

Very little freaking out, actually. We have a well-oiled process for this that all services use: thoroughly documented incident management procedures, so people understand their role explicitly and can act very quickly. We also exercise these processes regularly as part of our DiRT testing.

Running regular service-specific drills is also a big part of making sure that once something goes wrong, we’re straight on it.

DeviousDad138 karma

What would be the most ideal, massively helpful thing you'd like to see in a coworker? I'm talking the most downright pleasantly beneficial aspect of someone hired to work with you.

Would it be complete and thorough knowledge of XYZ programs or languages? Would it be a cooperative attitude over technical knowledge? I feel like the job postings are reviewed by people who have no direct link to the job other than reviewing the applicants. They simply look for marks on a resume, like degrees & certs, which don't speak to someone's experience or knowledge of a topic.

So what would you guys, were you doing the hiring yourself, specifically find the absolute most awesome in a coworker?

sre_pointyhair199 karma

This is a really good question. Of course, you want the people around you to be smart. But knowledge in and of itself can be taught and learned, so it’s less about knowing languages.

A huge one is having a general curiosity about the world and how it works - this translates well to how we do things (i.e. exploring uncharted territory in how our systems are built). I guess a pithy soundbite for what I look for is: “The ability to react to something unexpected happening with curiosity”.

AlwaysTheir79 karma

What well known best practices are terribly wrong when applied to services at the scale you work with? Also please share any "best practices" you personally disagree with for services of any scale.

sre_pointyhair176 karma

One thing that’s different from a lot of other places I’ve observed is that we tend not to do “level one” support (that being a NOC, or a team of people who do initial triage on technical issues before escalating to engineers who built the system).

We’ve found that engineering the system so that alerts go to the people who built it incentivises them to fix stuff properly and permanently.

Xylth60 karma

How do you coordinate a response to something like a GMail outage without email?

toughmttr69 karma

SRE is all about having backup systems with as few dependencies as possible. :-)

dapsays53 karma

When you see problems in production systems (ranging from transient errors or performance blips to serious outages), how do you balance the competing concerns of restoring service quickly and root-causing the problem? If you prioritize restoring service, do many issues ultimately go without a complete root-cause analysis? If you do root-cause them, what tools and techniques do you use, particularly for problems that end up involving several layers of the stack (e.g., kernel, TCP, and a Java application)?

clusteroops69 karma

Mitigating the impact is the top priority, if at all possible. In most cases, we can route around the failing system without destroying the evidence, by reconfiguring the load balancers, which we call "draining." Then once we understand and fix the problem, we undrain. In some cases, we need to recreate production conditions in order to tickle the bug, in which case we rely on synthetic load generators.

Almost all major issues are indeed root-caused as part of our "postmortem" process. Usually it's pretty easy to track a failure to a particular change by simply toggling the change and attempting to trigger the bug. If many changes went out in one batch, we binary-search through them to find the culprit. Understanding why a change is bogus involves all of the standard tools: debuggers, CPU and heap profilers, verbose logging, etc.
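
The binary search itself is plain bisection. A rough sketch (the triggers_bug callback is a hypothetical stand-in for deploying a prefix of the batch somewhere safe and checking whether the bug reproduces):

    def find_culprit(changes, triggers_bug):
        """Bisect an ordered batch of changes for the one that introduced a bug.

        Assumes the bug reproduces with the whole batch applied and not with
        none of it; triggers_bug(prefix) deploys that prefix somewhere safe
        and reports whether the bug shows up.
        """
        lo, hi = 0, len(changes)  # the culprit is changes[k] for some k in [lo, hi)
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if triggers_bug(changes[:mid]):
                hi = mid  # bug already present within the first `mid` changes
            else:
                lo = mid  # bug introduced by a later change in the batch
        return changes[lo]

Each probe is expensive (it means a build and a deploy), which is exactly why the logarithmic number of steps matters.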

For multi-layered problems, we rely on traces between layers, e.g. tcpdump, strace, and Google-specific RPC tracing. We figure out what we expect to see, and then compare that with what we actually observe.

honestbleeps39 karma

I have to say given the fact that Gmail went down while this AMA was starting up, /u/clusteroops is pretty much the best and most appropriate username I've ever seen on Reddit.

My question: Are you all OK? I hope you're all OK. Sometimes tech stuff goes wrong, and usually it happens at the worst possible time.

clusteroops53 karma

I'm fine. Thank you.

sre_pointyhair41 karma

I'm graaaand.

evildoctorspuds38 karma

What is your favorite snack from the breakroom?

sre_pointyhair50 karma

TAYTO

clusteroops30 karma

Kumquats.

sys_exorcist30 karma

Coffee.

toughmttr26 karma

Anything chocolate!

robscomputer36 karma

Hello,

I heard that the Google SRE team is a mix of roles and you patch code while it's live. Could you explain the method to code and fix issues with a production service while still following change management procedures? Or are these fixes done as "break fixes" and exempt from the process?

clusteroops39 karma

For most change-induced problems, we simply roll back to the last known good release or wait until the next release. If neither of those is possible, then we push the narrowest possible fix, using the same change management process but faster.

jmreicha33 karma

What types of tools and workflows do you use in your environment with regards to change management and change control?

Maybe you could take me through the steps of how a change gets put into production, from the tools and software to the different teams and groups involved, and how it affects users?

Also, what kinds of change windows do you guys like to use? Thanks a bunch!

clusteroops34 karma

We use a variety of tools, depending on the frequency and distribution speed. There are roughly a dozen common tools and many, many system-specific ones. For example, we have a tool that automatically and periodically builds and pushes new data files (e.g. lists of blocked results, experiment configuration, images) to the fleet in rsync-like fashion. It also runs automatic sanity checks and will avoid pushing broken data.

Generally pushes follow the same pattern: assemble the change (e.g. building a binary or auto-generating a data file), run some offline sanity checks, push it to a few servers, wait for smoke, and then gradually deploy to the remaining servers. SREs and software developers work together on the process, like one big team.
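
In pseudo-Python, the shape of that pattern looks roughly like this (sanity_check, deploy, and is_healthy are hypothetical stand-ins for internal tooling, not real APIs):

    import time

    def gradual_push(artifact, servers, sanity_check, deploy, is_healthy,
                     canary_size=3, soak_seconds=600, wave_size=50):
        """Canary-then-ramp rollout sketch; every callback is a stand-in."""
        # Offline sanity checks before anything touches production.
        if not sanity_check(artifact):
            raise RuntimeError("artifact failed sanity checks")

        # Push to a few servers first and "wait for smoke".
        canary, rest = servers[:canary_size], servers[canary_size:]
        for s in canary:
            deploy(s, artifact)
        time.sleep(soak_seconds)
        if not all(is_healthy(s) for s in canary):
            raise RuntimeError("canary unhealthy; aborting the push")

        # Gradually deploy to the remaining servers in waves.
        for i in range(0, len(rest), wave_size):
            wave = rest[i:i + wave_size]
            for s in wave:
                deploy(s, artifact)
            if not all(is_healthy(s) for s in wave):
                raise RuntimeError("rollout halted at wave %d" % (i // wave_size))

The important property is that a bad change stops at the canary or at an early wave instead of reaching the whole fleet at once.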

Change windows: particularly for the more complex systems, the goal is to have all of the experts immediately available and well rested in case the change triggers a bug, which sometimes occurs hours or days after the change. So we generally target after peak traffic, during business hours, between Monday and Thursday. We avoid making changes during the holidays as much as possible.

Lykii32 karma

More of a general question: What are some of the ongoing training/professional development opportunities provided by Google for their engineers? Are you encouraged to make connections with university students to mentor and help those entering the workforce?

sys_exorcist41 karma

My favorite professional development is informal chats with my teammates. There is always something they know that I don’t. There are plenty of formal classes available as well (unix internals, deep dives into various programming languages). We spend a lot of time giving talks at universities and have a booming internship program.

BA773528 karma

How did you come to get a job at google?

toughmttr52 karma

I have a degree in History, but was always interested in computers. I played with Linux for fun (remember Slackware 1.0?). After college I got a job as a sysadmin, gained skills and experience, and went on to learn a lot about networks, performance analysis, and system engineering in general. After a bunch of years in the industry, I jumped at the chance to interview at Google!

In SRE we are actually more interested in what people can do rather than CS degrees or candidates with theoretical knowledge that they can't apply. We like people who can think on their feet and figure things out. We have many colleagues here coming from various backgrounds, not necessarily just CS/computer engineering.

koobcamria23 karma

I just want to say that I'm really impressed with Google's response to this (hopefully) temporary crash. I don't blame anyone for it; such things will happen from time to time.

I'm going to go eat a sandwich and read a book. Hopefully it'll be back up and running by then.

Thanks Google Guys!

sre_pointyhair41 karma

This answer aggregates a few ‘who gets fired from a cannon’ questions :-)

Following any service issue, we are more concerned with how we’ll spot and mitigate things like this in the future than with placing blame, and we start working to make our systems better so it won’t happen again.

docwho210019 karma

What exactly do you work on (all products or only certain ones?)

sys_exorcist24 karma

SRE teams typically focus on a single service like Search, or on a piece of our infrastructure. My team works on Storage infrastructure like Colossus and Bigtable.