Highest Rated Comments


amazon_throwaway_2011 6 karma

How do you manage outages in systems written by other people? I've heard that once a system is stable enough, devs don't do oncall anymore and pass everything to you guys. If a system written by someone else fails at 3 am and you can't find an obvious cause, how do you go about finding a quick fix?

Also, as an SRE, do you do any dev work? And vice versa, do developers usually do a regular oncall rotation, or do they do it just for recently launched services and then pass it to you after an uneventful couple of weeks?

Thanks for the AMA!! 99% of our teams don't have dedicated ops people, and the oncall period is definitely the most stressful and important time for us! You do learn a lot more and a lot faster though, so I guess it's a tradeoff!

amazon_throwaway_2011 1 karma

Do you ever feel envious that the guys who build features get all the credit while you guys operate "behind the scenes" and make sure everything runs smoothly (which, by the way, seems to me much harder and more stressful than pure dev work)?