Highest Rated Comments


dapsays 53 karma

When you see problems in production systems (ranging from transient errors or performance blips to serious outages), how do you balance the competing concerns of restoring service quickly and root-causing the problem? If you prioritize restoring service, do many issues ultimately go without a complete root-cause analysis? If you do root-cause them, what tools and techniques do you use, particularly for problems that end up involving several layers of the stack (e.g., kernel, TCP, and a Java application)?

dapsays 17 karma

Thanks for the detailed response! I really like the idea of draining unhealthy services while keeping them running to preserve all the state, and then applying synthetic load to tickle the bug again.

I'm surprised to hear that most production issues are so easily pinned on specific code changes. I've been impacted less by regressions than by bugs in hard-to-exercise code paths, unexpected combinations of downstream component failures, and previously-unseen types of load. It's rare that I can even generate an automatic reproduction that I could use to binary-search code changes -- at least until I've root-caused it through other analysis. Do you do that mainly by replaying the original production load?

Thanks again for the thoughtful response.