Highest Rated Comments


kripakrishnan 322 karma

Each year we test the resilience of our systems in an exercise called DiRT (Disaster Recovery Testing). DiRT was developed to find vulnerabilities in critical systems and business processes by intentionally causing failures in them, and to fix them before such failures happen in an uncontrolled manner. DiRT tests Google's technical robustness by breaking live systems. The expected behavior is that no one will notice anything because our systems are resilient to such failures. This is usually a very exciting time for SREs at Google.

In one test, we took down a single old database shard just to see what would happen -- a seemingly harmless test. Then we realized we had brought down over 30 of our front ends that depended on this single shard. You know that got fixed.
(For more: http://queue.acm.org/detail.cfm?id=2371516)

kripakrishnan 22 karma

This one is hard to answer - each year has its own theme and it’s all fun. We’ve had our action hero (based on a top executive at Google) save the day with Portals. We’ve had ‘rigidly defined areas of doubt and uncertainty’, and we’ve had zombies. We even have our own music videos sometimes.