Highest Rated Comments


toughmttr (494 karma)

A particularly tricky problem to debug was the time that some of our serving jobs became unresponsive intermittently. At certain times of the day they would block for a while, then start serving again, then stop, then start, and so on. After a long and tricky debugging process, we found that a big MapReduce job was firing up every few hours and, as a part of its normal functioning, it was reading from /dev/random. When too many of the MapReduce workers landed on a machine, they were able to read enough to deplete the randomness available on the entire machine. It was on these machines that our serving binaries were becoming unresponsive: they were blocking on reads of /dev/random! This is when I realized that randomness is one of the finite and exhaustible resources in a serving cluster. Embracing randomness and trickiness is part of the job as an SRE!
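To make the failure mode above concrete, here is a minimal Python sketch, not what was actually run at Google. It assumes a Linux machine that exposes the kernel's entropy estimate at /proc/sys/kernel/random/entropy_avail, and the helper names entropy_available and random_token are made up for illustration. The first helper reports how much entropy the kernel thinks it has, which is the kind of number that would have pointed at the depleted pool. The second gets random bytes through os.urandom, which on Linux does not block once the kernel's random pool has been initialized, unlike reads of /dev/random on the kernels of that era.

```python
import os
from pathlib import Path
from typing import Optional

# Linux-specific location of the kernel's entropy estimate (assumption:
# the path is absent on other operating systems).
ENTROPY_AVAIL = Path("/proc/sys/kernel/random/entropy_avail")


def entropy_available() -> Optional[int]:
    """Return the kernel's entropy estimate in bits, or None if it
    cannot be read (for example, on a non-Linux machine)."""
    try:
        return int(ENTROPY_AVAIL.read_text().strip())
    except (OSError, ValueError):
        return None


def random_token(n_bytes: int = 16) -> bytes:
    """Return n_bytes of random data from os.urandom.

    os.urandom does not block once the kernel's random pool has been
    initialized, so a busy neighbour on the same machine cannot stall
    this call the way it could stall a read of /dev/random.
    """
    return os.urandom(n_bytes)


if __name__ == "__main__":
    print("entropy estimate (bits):", entropy_available())
    print("token:", random_token().hex())
```

Watching a number like the entropy estimate per machine, next to CPU and memory, is one way to treat randomness as the finite resource described above.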

toughmttr (69 karma)

SRE is all about having backup systems with as few dependencies as possible. :-)

toughmttr (52 karma)

I have a degree in History, but was always interested in computers. I played with Linux for fun (remember Slackware 1.0?). After college I got a job as a sysadmin, gained skills and experience, and went on to learn a lot about networks, performance analysis and system engineering in general. After a bunch of years in the industry, I jumped at the chance to interview at Google!

In SRE we are actually more interested in what people can do than in CS degrees or theoretical knowledge that a candidate can't apply. We like people who can think on their feet and figure things out. We have many colleagues here who come from various backgrounds, not necessarily just CS or computer engineering.

toughmttr (26 karma)

Anything chocolate!