Hello, reddit!

We are the Google Site Reliability (SRE) team. We’re responsible for the 24x7 operation of Google.com, as well as the technical infrastructure behind many other Google products such as Gmail, Maps, G+ and other stuff you know and love. We’ve traditionally been invisible and behind the scenes, but we thought we’d drop in here and answer any questions about what we do, what stuff we come up against, and what it’s like to be an SRE.

Other interesting things to give you an idea of what we do:

A blog post about the Leap Second, written by Chris Pascoe from SRE, gives an idea of the kind of hairy problems we come up against.

Steven Levy wrote a Wired article about the inside of our datacenters, and managed to make us sound like some sort of amazing justice team.

Kripa (who’s one of our participants today!) also writes about DiRT for ACM Queue.

We’ll be here from 12pm to 2pm PST to answer your questions; we’ll post info on our participants then.

Proof (official Google accounts):

https://plus.google.com/+GoogleDevelopers/posts/AjoRBrKYvsR
https://twitter.com/googledevs/status/294234581303967745

EDIT 11:50PST: We're just getting set up here to answer your questions. We are:

Kripa Krishnan (/u/kripakrishnan), SRE Technical Program Manager and DiRT mastermind from our Mountain View HQ. Kripa works on infrastructure efforts in Google Apps.

Cody Smith (/u/clusteroops), long-time senior SRE from Mountain View. Cody works on Search and Infrastructure.

Dave O’Connor (/u/sre_pointyhair), Site Reliability Manager from our Dublin, Ireland office. Dave manages the Storage SRE team in Dublin that runs Bigtable, Colossus, Spanner, and other storage tech our products are built on.

John Collins (/u/jrc-sre), SRE Ombudsman, advocate and general force for good, from Mountain View.

EDIT 13:56PST: OK folks, we're all done. Thanks for the questions, hope our answers were satisfactory. May the queries flow and the pagers be silent.

*EDIT Jan 30: Corrected the spelling of @stevenlevy's name. Whoops-a-daisy.*

Comments: 1433 • Responses: 41

aishataj459 karma

What's the biggest 'Oh shit' moment you've ever had?

kripakrishnan322 karma

Each year we do a test of the resilience of our systems as part of an exercise called DiRT (Disaster Recovery Testing). DiRT was developed to find vulnerabilities in critical systems and business processes by intentionally causing failures in them, and to fix them before such failures happen in an uncontrolled manner. DiRT tests both Google's technical robustness, by breaking live systems, and our business processes, by seeing how the people who run them respond. The expected behavior is that no one will notice anything because our systems are resilient to such failures. This is usually a very exciting time for SREs in Google.

One of the tests we did was taking down a single old database shard just to see what would happen -- a seemingly harmless test. We then realized we had brought down over 30 of our front ends that depended on this single shard. You know that got fixed.
(For more: http://queue.acm.org/detail.cfm?id=2371516)
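
For the curious, here's a toy sketch of the shape of such a test: deliberately take one dependency down, probe everything that might rely on it, and record what actually breaks. This is purely illustrative and is not Google's DiRT tooling; the take_down/restore helpers and the frontend URLs are made-up stand-ins.

import urllib.request

# Hypothetical health-check endpoints for the frontends under test.
FRONTENDS = ["http://frontend-%d.example.com/healthz" % i for i in range(1, 31)]

def is_healthy(url, timeout=2):
    """Return True if the frontend answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def dirt_exercise(shard, take_down, restore):
    """Inject one controlled failure and report every frontend it takes with it."""
    take_down(shard)                      # the deliberate, controlled failure
    try:
        broken = [url for url in FRONTENDS if not is_healthy(url)]
    finally:
        restore(shard)                    # always end the exercise cleanly
    return broken                         # unexpected dependencies show up here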

immerc217 karma

Can a mod give kripakrishnan, clusteroops and jrc-sre some kind of flair so people know that they're also officially part of this AMA?

sre_pointyhair204 karma

Yes, this please.

Slaich147 karma

Running this script will mark you all as submitters, giving you the blue highlight.

In Chrome you can just copy-paste it into your address bar (Chrome removes the "javascript:" at the start, so you have to add that back manually).

In Firefox: if you hit Shift+F4 (or go to Web Developer → Scratchpad in the menu), it should open a Scratchpad. If you paste the code there and run it (Ctrl+R, or use the menu), it should work.

javascript:(function() {
  // Account IDs of the four AMA participants; append the "submitter" class
  // to each of their comments so they get the blue highlight.
  var ids = ["id-t2_a4rx7", "id-t2_abse2", "id-t2_acnhz", "id-t2_ablrr"];
  for (var i = 0; i < ids.length; i++) {
    var nodes = document.getElementsByClassName(ids[i]);
    for (var j = 0; j < nodes.length; j++) {
      nodes[j].className = nodes[j].className + " submitter";
    }
  }
})()

sre_pointyhair104 karma

Seems legit.

thejeero377 karma

  • How often are Google's sites DDoS'd throughout the year?
  • When was the last time Google's main page was down? (I legitimately don't remember not being able to reach google.com)

clusteroops247 karma

We get attacked all the time, although most attacks are so small or simple that our systems automatically defuse or block them. For the larger products (e.g. Search, Gmail), we only really need to intervene manually a handful of times per year.

Home page outages almost never affect all users simultaneously. There are many different systems involved in simply connecting users to Google, and most incidents happen outside of our network. We do occasionally have network outages, which are regional, e.g. a few states or countries. We also occasionally introduce language-specific bugs, e.g. garbling CJK text. As far as I can recall, the last global outage was back in 2005:

http://en.wikinews.org/wiki/Google_suffers_DNS_outage

Nomsfud376 karma

If you're on reddit right now then who is making google work until your AMA is over?

sre_pointyhair725 karma

We have a guy.

fragglet587 karma

The guy here. I can confirm this.

sre_pointyhair1353 karma

OH GOD THE SERVERS WHAT ARE YOU DOING HERE.

DisrespectfulToDirt210 karma

What skills are required to be an SRE member? Are you hardware engineers? Software guys? Do you write shell scripts? Program Lego robots to drive around datacenters all day turning servers off-and-on?

sre_pointyhair184 karma

There are a bunch of people from many different backgrounds that make up SRE - hardware, software, networks, security, etc. Your average SRE usually concentrates on software in their day-to-day work, although you do need to know hardware to do stuff like qualifying new hardware platforms (we do a bunch of this in Storage, where we qualify new drive sizes, new boards, etc. for our software such as Colossus). We do write code to support the service, and also contribute code to the software itself (it’s often easier and faster to just send a patch to the software, rather than opening a feature request).

As for essential skills, it’s hard to define strictly (that’s one of the reasons why it’s so difficult to find good SREs). We take people from a pure software engineering background, sysadmins, network architects, and sometimes people come out of left field from completely different industries to surprise us with their ability to do the job (we did have one guy join SRE who used to be a motorcycle racer).

tonezime36 karma

I know at least one person who went from Sales to SRE.

sre_pointyhair83 karma

Yep - we do an internal training program for people who are in other areas that would like to cross-train (it's hard to do, not everyone makes it). The people who come over from that program are awesome, additional perspective always adds to what we do.

zimmund186 karma

Would you rather fix a cluster-sized server or a server-sized cluster?

Edit: your rock, guys.

jrc-sre215 karma

Sitting atop a horse-sized duck, I would attempt to fix the cluster-sized server.

sre_pointyhair197 karma

I don't understand the question. I fix cluster-sized servers all the time.

Edit: thanks! I'd been looking for that.

nopropulsion180 karma

are there any cool easter eggs that people aren't really aware of that you can mention?

clusteroops334 karma

Not quite an easter egg, but one of my favorite esoteric features is the unit disambiguation in the calculator. For instance, a "pound" could be mass, force, or currency, so you can ask for something ridiculous like:

https://www.google.com/search?q=10+pounds+pounds+pounds+in+kilograms+newtons+dollars

sre_pointyhair195 karma

Searching for 'do a barrel roll' has to be my favourite :-)

annabunches166 karma

I was just hired by Google as an SRE. My start date is in a couple of weeks.

So, my question is...

How much fun am I about to have?

sre_pointyhair179 karma

All the fun. The learning curve is crazy, but it’s learning on systems the scale of which you will never see again.

thatlawyercat119 karma

what's the craziest problem you guys have ever had to deal with?

jrc-sre211 karma

On the technical side: Please go get our brand new and very different cluster network design into production as fast as possible, everywhere (with multiple iterations of "how do we do this faster").

On the non technical side: Please go talk to the Prime Minister of Mongolia and his entourage about Cloud Computing.

ZiggyA110 karma

What do you think the biggest misconception that people have about what you do?

sre_pointyhair121 karma

One misconception we do run into is that we’re an ops/admin group or a NOC. It’s our responsibility to keep the site up, no matter what - that involves working at a scale that’s staggering, and you run into problems you’ve never seen before. You need good engineers who know how to work from first principles on a system that they may not have “trained on”, figure out what’s up, and get to a resolution within minutes, not hours. We also write a bunch of code to make sure the systems stay in a happy state. We do oncall, sure, but we’re also on the hook for automation and mitigation such that oncall isn’t a terrible experience and doesn’t keep you up all night.

SRE is often a better experience for someone who likes to write code. Ben talks about some of that here: https://plus.google.com/+ResearchatGoogle/posts/Vtd6HPAiU5c

immerc70 karma

Judging by the questions here, I'd say it's that the SREs are an ops team (secops / netops, something like that).

sre_pointyhair70 karma

Yes, this :-)

drincruz96 karma

  • How's the on-call rotation?
  • How's the work-life balance?

sre_pointyhair85 karma

Oh, sorry - work-life balance:

It’s what you make it. As a manager, it’s part of my job to make sure that people are able to manage their own work-life balance. We do personal objectives on a larger timescale than I’ve done at other places, so rather than saying “This needs to be done this week”, it’s more like “This needs to be done this month, you’re clever, figure it out”. Personally, I’m able to balance things pretty effectively, and I’d hate to think people who report to me didn’t have the same opportunity (while getting their stuff done, of course).

sre_pointyhair64 karma

Oncall is part of being a SWE at Google whether you work in SRE or in a "pure dev" role on production services. All SWE groups producing user-visible or production code have these oncall rotations, but many have timezone-shifted support, so when you are oncall it doesn't include overnights (i.e. you will get sleep). SRE's oncall rotations are generally more organized and consistent than those of most SWE teams.
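
To illustrate the timezone-shifted idea, here's a minimal sketch of a follow-the-sun pager schedule. The site names and hours are made up for the example; this isn't Google's actual rotation tooling.

# Three sites split the UTC day so nobody is paged overnight locally.
SHIFTS = [
    ("Sydney",        0,  8),   # 00:00-08:00 UTC is daytime in Sydney
    ("Dublin",        8, 16),   # 08:00-16:00 UTC is daytime in Dublin
    ("Mountain View", 16, 24),  # 16:00-24:00 UTC is daytime in California
]

def site_oncall(utc_hour):
    """Return which site holds the pager at a given UTC hour."""
    for site, start, end in SHIFTS:
        if start <= utc_hour % 24 < end:
            return site
    raise AssertionError("shifts should cover the whole day")

# e.g. site_oncall(10) -> "Dublin"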

Icedrive79 karma

What's the effect of people/apps constantly pinging google.com to see if there's an internet connection?

clusteroops127 karma

Pings are cheap, and the aggregate rate is steady, modulo DoS attacks. So we don't worry about it. (Side note: I actually wrote the code in our load balancer that crafts ICMP responses.)
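
As an aside on what "crafts ICMP responses" involves: here's a small illustrative sketch (not the load balancer code itself) that turns the ICMP portion of an echo request into an echo reply -- flip the type from 8 to 0, keep the identifier, sequence number and payload, and recompute the checksum.

import struct

def icmp_checksum(data):
    """One's-complement checksum over 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    total = (total & 0xFFFF) + (total >> 16)   # fold the carries back in
    total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF

def craft_echo_reply(request):
    """Build an ICMP echo reply (type 0) from a raw ICMP echo request (type 8)."""
    icmp_type, code, _checksum, ident, seq = struct.unpack("!BBHHH", request[:8])
    if icmp_type != 8 or code != 0:
        raise ValueError("not an ICMP echo request")
    payload = request[8:]
    header = struct.pack("!BBHHH", 0, 0, 0, ident, seq)   # checksum field zeroed first
    checksum = icmp_checksum(header + payload)
    return struct.pack("!BBHHH", 0, 0, checksum, ident, seq) + payload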

jjheiselman78 karma

What sort of tools (either commercial or custom-built) does your team use to ensure the reliability/health of the Google services?

I once interviewed to be on the SRE team and it was a nagging question that I had that my interviewers refused to answer.

clusteroops97 karma

Lots and lots of tools, most of which we write ourselves. Probably too many to fit in this box.

One big one is a monitoring system called "Borgmon". It collects, aggregates, and records instrumentation data, draws graphs, sends alerts, and dispenses candy. It's extremely powerful, but painful to use, because it has its own programming language. So we have a love-hate relationship with it.
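
Borgmon itself is internal, but a toy sketch of the same shape of system -- scrape counters from a set of tasks, aggregate them, and apply an alerting rule -- looks roughly like this. The /varz endpoints and the variable names are hypothetical.

import time
import urllib.request

# Hypothetical tasks exporting 'name value' counters on /varz.
TASKS = ["http://search-backend-%d:8080/varz" % i for i in range(4)]

def scrape(url):
    """Fetch the counters from one task and parse them into a dict of floats."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        lines = resp.read().decode().splitlines()
    return {name: float(value) for name, value in (line.split() for line in lines)}

def error_ratio(prev, curr):
    """Aggregation rule: errors / requests across all tasks since the last scrape."""
    errors = sum(c["http_errors"] - p["http_errors"] for p, c in zip(prev, curr))
    requests = sum(c["http_requests"] - p["http_requests"] for p, c in zip(prev, curr))
    return errors / requests if requests else 0.0

def monitor(interval_s=60, threshold=0.01):
    """Scrape periodically and alert when the error ratio crosses the threshold."""
    prev = [scrape(t) for t in TASKS]
    while True:
        time.sleep(interval_s)
        curr = [scrape(t) for t in TASKS]
        if error_ratio(prev, curr) > threshold:
            print("ALERT: error ratio above %.1f%% in the last %ds" % (threshold * 100, interval_s))
        prev = curr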

grundlehunter74 karma

How much data storage does Google Maps Street View take on your servers? How much data is added weekly?

jrc-sre91 karma

Sounds like a good interview question... want to be an SRE? How much would you estimate? How would you design a system to handle this? Where's the bottleneck?

oscillat0r73 karma

Do you have any custom-made Nagios-like system, or do you rely on 3rd party software for availability measuring purposes? If not, do you use any 3rd party software at some point of the process? Thanks!

clusteroops73 karma

Grep for "Borgmon" in my other comments.

g0dspeed0ne71 karma

What do your sleep schedules look like?

How big is the team?

Have you met Larry Page or Sergey Brin?

clusteroops90 karma

Sleep schedule: roughly 12-2 to 8-10. Oncall doesn't affect it much, since teammates in different time zones cover nights. And I'm only oncall roughly one week in six.

Team size: roughly 40 in my group, which takes care of Google Search. We're spread across Mountain View, Dublin, and Zurich. But I regularly work closely with many other groups.

Larry and Sergey: both several times. I even briefly shared an office with Sergey.

jrc-sre50 karma

Sleep schedules vary; it's more about your personal approach. In SRE, the on call schedules provide a lot of structure, so it's usually pretty reasonable. Working with groups in other time zones is where it can stretch a bit, if you want to keep in video conference contact with them. I lose more sleep because of my kids than work (I’m not currently on call, but it still applies).

“The team” is kind of a nebulous term. We have lots of different sized teams, with varying amount of overlap. My personal team at the moment is 4 people (we’re hiring!) and we’re not an on call group. We’re spending our time on strategic planning of the infrastructure.

I’ve met them both, on several different occasions, usually when we’re looking for a decision on something big enough to include them, sometimes during a high-level review of something as well. SRE can be involved in rather large decisions, like where we should be putting data centers. We get to focus on engineering problems and talk about them with Larry and/or Sergey. They still spend time talking about engineering, not just typical business questions, and you can still bump into them in the hallways too.

boomboomcamaro70 karma

I read an article in Wired recently discussing Google's data centers and they touched on the Site Reliability team as a group of elites who get their own leather jacket with military style insignias (I'm assuming like a patch). Can we see what this looks like? And thanks for keeping Google up and running!!!

jrc-sre164 karma

Here's the patch from my jacket: http://i.imgur.com/pKRqXKr.jpg?1

EricaJoy65 karma

What is the approximate ratio of My Little Pony figurines to SREs?

sre_pointyhair78 karma

There are 4 SREs in the room, and one MLP figurine. We have failed you.

SkankTillYaDrop60 karma

Hi there!

As someone who is currently majoring in Computer Science and has spent a lot of time working in kitchens, the idea of being on the SRE team is so unbelievably appealing it's almost difficult to put into words.

One of my favorite things about being in a kitchen is the extreme rush, adrenaline, and stress that all combines with a necessity for quick rational thought and on your feet problem solving. The first time I read about DiRT I fell in love with the idea instantly. I have a couple questions relating to participating in the events.

  • Is it really as fun as I think it is? Since I love intense stress, adrenaline, problem solving, high-pressure situations, and computers it seems like a dream come true

  • What sort of CIS focus would you recommend for the SRE team?

  • How often does the team have "Oh shit" moments where something major breaks and you have to jump into action? What kind of things are breaking in those moments?

Thanks for doing the AMA! The work you all do is incredibly inspiring and motivational.

sre_pointyhair32 karma

DiRT is pretty badass when it happens -- it’s a global exercise, so having the pager go off with “DiRT has occurred, battle stations” or whatever is a rush for sure. We try to formulate the tests and manage the time such that it’s as close as possible to the real situation (some teams organise round-the-clock rotations for the exercise in the office, we get catering in, that sort of thing).

As for CIS focus, anything that involves abstract problem-solving is going to be good. I often lament that there’s no ‘degree in sysadmin’. You either get a tech-specific course or something focused on just programming (or a particular language). Anything that exposes you to good, resilient design practices and doesn’t overly focus on specific tech is usually best. Sorry for the wishy-washy answer, but there’s no silver bullet here, really. We hire people who have no degrees at all, if they have the experience and skills.

Things break at a reasonable rate. We’re here to engineer away a lot of the things we can in design, or with redundancy, so for a lot of our tech, ‘fire drills’ are uncommon. They do happen, though, and there’s a certain rush to dealing with them, like with any reactive work.

TheFarmHand57 karma

We all know Google has some unique benefits, but what is the best part about working for Google that we probably haven't heard about?

sre_pointyhair91 karma

I know this will sound corny (especially from the grumpy Irishman) but it has to be the people. Being in a place where you can interact with a group of people who are good at what they do, but who are also, on average, interesting people, is probably what keeps the community strong and keeps people here.

Having access to free food, gyms, other facilities is great, but the novelty would wear off pretty quickly if you had to deal with people you don’t want to work with for whatever reason (i.e. they’re not good so you have to carry them, or they’re just not interesting). We have people in Dublin who make swords, play with lasers, and anything else you’d care to think of. It is kind of terrifying (in a good way) to ask “Does anyone have a lend of a medieval fighting axe” on a mailing list, and get a positive response.

jrc-sre57 karma

I have to agree with Dave. I specifically chose my latest gig at Google (been here 6 years) because of the people. I genuinely like and respect everyone around me, including and especially those above me. I am always learning from people here, and not just on the technical or business sides either. I have a semi-retired professional magician who works for me on security matters, a friend of mine here teaches welding classes, and my boss has taken some of the coolest astronomy pictures I’ve ever seen.

homo-insurgo42 karma

What's Google's best reliability feature?

clusteroops55 karma

Scale. We're large enough that we can mitigate massive failures with minimal user damage, and absorb even large DoS attacks. We can retry queries internally or shift load to other datacenters in seconds.
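
A minimal sketch of that "retry or shift load" idea, with made-up datacenter names and a hypothetical query_backend helper (this is not Google's load-balancing code):

DATACENTERS = ["us-west", "us-east", "europe-west"]   # preference order

def serve(query, query_backend, deadline_per_try_s=0.2):
    """Try each datacenter in turn; the first healthy one answers the query."""
    last_error = None
    for dc in DATACENTERS:
        try:
            return query_backend(dc, query, timeout=deadline_per_try_s)
        except (TimeoutError, ConnectionError) as e:
            last_error = e                 # that datacenter is slow or down; fail over
    raise RuntimeError("all datacenters failed") from last_error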

LM1038 karma

  1. What sort of volume of logging data do you have to peruse on a daily basis?

  2. What are some of the most challenging incidents you have faced while trying to maintain uptime?

  3. What level of uptime do you attempt to maintain? How many "9s"?

  4. What series of checks would you follow if something was wrong? What would be the first response strategy?

  5. How do you run a website so well that people basically use it to check whether or not their Internet is up?

  6. How often do you have to interface with the Information Security folks and what sort of incident response activities do you delegate to them?

clusteroops46 karma

Answering based on my personal experience:

1: Almost none. In practice, aggregated instrumentation data suffices to solve most problems.

2: Queries of death, which cause numerous systems to crash simultaneously. Although we have numerous layers of protection against QoDs these days.

3: A lot. (Seriously though, it's complicated and differs between products and features, so no blanket statement would be very useful.)

4: We typically go straight to our dashboards, which give a heads-up view of many different systems. These allow us to quickly identify the scope of the problem, and likely point of failure. Our first response is usually to mitigate damage by diverting queries at the load balancing layer.

5: Aww, shucks.

6: We don’t delegate to Secops - they run as a separate organisation that does (among many other things) incident management, and security reviews for upcoming launches. We usually talk to them pre-launch for anything that’s sensitive, and they rank highly when we’re triaging bug reports and user reports of problems with products we run.

Supercharged3835 karma

What's your LDAP?

sre_pointyhair82 karma

OpenLDAP 2.4, same as you.

Livesinthefuture34 karma

  • I remember a Googler telling me you guys consider it a disk-space crisis when you've only got 10 PB free disk left. Is that actually true?
  • Are you expecting Google Glass to contribute a fair bit to the beating on your infrastructure?

sre_pointyhair180 karma

OMG WE ONLY HAVE 10PB LEFT SOMEBODY GO TO FRY'S.

lyzing27 karma

What are the majority of the problems you encounter caused by, and what is the biggest problem you face when trying to fix said problems? With a company as large as Google, is internal communication a major issue when relying on other teams to fix things?

sre_pointyhair55 karma

I don’t think we can narrow it down to a small number of root causes; the number of things that can affect services is staggering - everything from software faults to cuts in fiber in random places (we do have seasonal problems with fiber cuts in certain places; hunters get bored when they run out of game to shoot and start shooting fiber distribution boxes). This is why you’re never bored as an SRE - you can never make assumptions about the root cause of a problem :-)

We’re really focused on keeping our teams in touch and in sync with each other -- being in the Dublin office means a lot of videoconferences, being able to manage email and IM, that sorta thing. We use hangouts pretty heavily internally to keep up to date with teams and individuals.

amazon_throwaway_20116 karma

How do you manage outages in systems written by other people? I've heard that once a system is stable enough, devs don't do oncall anymore and pass everything to you guys? If a system written by someone else fails at 3 am and you can't figure out an obvious problem, how do you proceed for the quick fix?

Also, as SRE do you do any dev work? And vice-versa, do developers usually do regular oncall rotation, or they do it just for recently launched services and then pass it to you after an uneventful couple of weeks?

Thanks for the AMA!! 99% of our teams don't have dedicated ops people and oncall period is definitely the most stressful and important for us! You do learn a lot more and a lot faster though so I guess it's a tradeoff!

clusteroops19 karma

Other people: we review systems before agreeing to maintain them, which helps us understand how they work and how they can fail. After handoff, the developers don't just disappear, they remain around to continue supporting the product. And in practice, we often escalate to them for particularly tricky problems. We have their phone numbers.

Dev work: absolutely. We expect to spend no more than 50% time on ops work, although it varies by team in practice. Developers also do oncall, particularly because SRE support for any given system is not inevitable. We focus on high reliability and efficiency for Google's most important systems, and not basic care and feeding for every system ever produced.

johansch13 karma

Google.com obviously runs quite complicated software. From what I have gathered, most of it is C++, Java or Python. (Correct?)

Does the SRE team treat the software like a black box and mostly work with the diagnostic and debugging features its software developers left in, or do you also dig into the source code to do actual live debugging yourself?

If you do, how do you manage to keep up your knowledge of the code/architecture as the various products are developed by their teams?

Do you have people who specialize in the various products, or do you prefer to keep your team as generalists?

clusteroops18 karma

Debugging: we do a fair amount of black box monitoring, but for debugging our focus is on instrumentation and logging. In practice, many of the systems are so complex that we can't practically tell what they're doing unless they tell us. In fact, we often add instrumentation or logging to existing code to make the debugging process easier.

Keeping up: we review major changes before they launch, and regularly work with developers to ensure the systems remain scrutable.

Specialization: yes! My team focuses on Google Search. At a minimum, we all need to be capable of mitigating damage when any part of the system fails, and narrowing the root cause. But for day-to-day work, we regularly specialize in certain subsystems.

amazon_throwaway_20111 karma

Do you ever feel envious that guys who build features get all the credit while you guys operate "behind the scenes" and make sure everything is very smooth (which by the way seems to me much harder and more stressful than pure dev work)?

jrc-sre25 karma

Personally, I don’t. 3 reasons... (1) Folks within Google are usually very appreciative of the work we do. So, we actually do get quite a bit of kudos. (2) Usually by the time people recognize something we did, we’ve already moved on to something else, and actually solving whatever problems we’re facing is why I come to work every day. (3) I like solving big, hairy problems that haven’t been solved before. That usually means having someone else think up something crazy and then I get to figure out how to actually make it work.

And, just to point it out, the features and solutions are always better when it’s not just “guys” coming up with them. Our best work comes when we have a bunch of different folks contributing.

sre_pointyhair22 karma

Not really. In reality, SREs have a pretty integral role in making launches happen. We’re as proud of the stuff we help to put out as the people who design or produce them, and it’s amazing to see how it affects people’s lives. Example: I was listening to a Jazz radio station the other day in the car, and they were reading out random cities people were visiting their live stream from, from Google Analytics Real-Time. I helped launch that last year, and it was an exciting time for everyone involved. Moments like that make it not really matter if my name’s in an easter egg :-)

hareharebhagavan7 karma

Since there is a 'not-so-secret' fraternity of SRE teams from major corporations across the globe (think digital-infrastructure-freemasons with even stranger handshakes) and you gather tri-annually at a location known only as, "The Meadows", who won the softball championship last year? And, does the Google SRE Team work with other teams to provide charitable access to the internet in 3rd world/poor countries? Or even here in the US?

sre_pointyhair8 karma

Uh, cool story bro.

luckynic3 karma

Do you have a master password, and how secure is it?

sre_pointyhair15 karma

It's super secure. We are up to password7

Edit: DAMNIT.