dapsays

http://www.reddit.com/user/dapsays

Highest Rated Comments

dapsays | 53 karma | 2014-01-24 20:01:12 UTC

When you see problems in production systems (ranging from transient errors or performance blips to serious outages), how do you balance the competing concerns of restoring service quickly and root-causing the problem? If you prioritize restoring service, do many issues ultimately go without a complete root-cause analysis? If you do root-cause them, what tools and techniques do you use, particularly for problems that end up involving several layers of the stack (e.g., kernel, TCP, and a Java application)?

dapsays | 17 karma | 2014-01-24 21:56:49 UTC

Thanks for the detailed response! I really like the idea of draining unhealthy services while keeping them running to preserve all the state, and then applying synthetic load to tickle the bug again.

I'm surprised to hear that most production issues are so easily pinned on specific code changes. I've been impacted less by regressions than by bugs in hard-to-exercise code paths, unexpected combinations of downstream component failures, and previously unseen types of load. It's rare that I can even generate an automatic reproduction that I could use to binary-search code changes -- at least until I've root-caused the problem through other analysis. Do you do that mainly by replaying the original production load?

Thanks again for the thoughtful response.
