We are the Operations team at Etsy. Ask us anything!


Hi! My name is Avleen Vig, I’m a Staff Operations Engineer at Etsy. At 2pm ET (or 6pm UTC for those of us who keep time that way!) the team and I will be answering questions on our team, our work, and the field of Operations in general. So come by and ask us anything for the hour!

The members here today are happy to talk about almost anything. We also have some domain expertise in the following areas, but you’re welcome to ask anything that’s on your mind:

Scaling web infrastructures

Monitoring

Culture-bridging between operations and development, CI, CD and other things we affectionately relate to DevOps

Configuration management

Hadoop

Databases

OpsSchool.org and training the next generation of Operations Engineers

Single malts and photography

Running our own network

We also have members of our Security and Corporate IT teams here to field questions in those areas! Some interesting things we’ve talked about before:

Our hardware

Memcache hot-key detection

Syslog-ng performance

Here’s some proof: https://twitter.com/avleen/status/365555119586160642 and http://velocityconf.com/velocityny2013/public/schedule/speaker/137037

Update 1: We're having such a great time answering, we've shifted some things around and are going to stay here until 3:30pm ET and keep answering!

Update 2: Thanks for all the great questions everyone! We had a ton of fun answering them. We're wrapping up now, time to get back to meetings :-)

You can find most of us on twitter, and see our posts on our engineering blog: http://www.codeascraft.com/ We’ll also be attending and speaking at Velocity NY and Velocity London, and Surge in the coming months, if you’d like to come see us in person!

If you want to help collect the knowledge of our trade and help educate future engineers, check out http://www.opsschool.org/ !

Comments: 238 • Responses: 26 • Date: 2013-08-12 17:07:24 UTC

oogachaka13 karma2013-08-12 17:30:16 UTC

What do you use for monitoring? Do you scale by hand, automatically, etc? And most importantly, have you ever established a correlation between alcohol intake and on-call rotations?


avleen8 karma2013-08-12 18:07:19 UTC

On call is an interesting thing.

We have about 7 people in the rotation right now and we spend a lot of time and effort trying to reduce alert fatigue and numbness to pages. We haven't looked at that particular correlation between those two things, but I'm curious now! I'll start tracking it next week ;-)

We do a lot of other things, such as monitoring sleep patterns. Several people on the team wear motion trackers or use sleep tracking apps on their phones. I'll be talking about this more at Velocity NY, but there's something there for sure.

I personally try to avoid drinking while on call ;-)


Scoundrel7 karma2013-08-12 17:57:13 UTC

What software and/or techniques do you use for backing up MySQL databases?


patoarvizu5 karma2013-08-12 18:03:10 UTC

Backing up and restoring any DB in a Continuous Delivery environment is extremely challenging; I haven't heard of anyone who has implemented a solid solution. It would be great if you (Avleen) could touch on this topic a little bit too.


avleen1 karma2013-08-13 03:43:16 UTC

Doing backups is actually pretty easy. The existing tools for any of the major database systems work really well. Restoring... that's more tricky, at least in MySQL.

Let's say you have two servers, db1 and db2, which are in a master-master pair. Db1 dies. The trick is to not restore the backup you made of db1, but rather the backup you last made of db2.

The reason comes down to MySQL's replication: if you restore the backup from db1, it'll come back up and replay all of the replication events which db2 has stored in its binlog. This includes the events which originated on db1 after the backup you just restored was made. As db1 replays the events, it applies those which originated on db2 (as it should), but when it comes across the events which previously originated on itself, it refuses to re-run them! This is a feature designed to prevent replication loops.

There are a couple of ways around it, but we found the easiest is just to apply the backup from the currently live host.
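The restore behaviour described above can be illustrated with a toy model. This is only a sketch, not MySQL itself: the `Event` and `Server` classes, the sequence numbers and the single-letter changes are all made up for illustration, and it assumes the db2 backup was taken after db1's last write. The one real mechanism it mimics is that a replica skips binlog events tagged with its own server ID.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One replicated write, tagged with the server that first ran it."""
    origin_id: int  # server ID of the host where the write originated
    seq: int        # position in db2's binlog (illustrative numbering)
    change: str


@dataclass
class Server:
    server_id: int
    applied: list = field(default_factory=list)

    def replay(self, binlog, backup_seq):
        """Apply events newer than the backup, skipping our own events."""
        for ev in binlog:
            if ev.seq <= backup_seq:
                continue  # already contained in the restored backup
            if ev.origin_id == self.server_id:
                continue  # loop protection: never re-run our own writes
            self.applied.append(ev.change)


# db2's binlog holds its own writes plus the ones it received from db1.
DB2_BINLOG = [
    Event(origin_id=1, seq=1, change="a"),
    Event(origin_id=2, seq=2, change="b"),
    Event(origin_id=1, seq=3, change="c"),  # written on db1 after db1's backup
    Event(origin_id=2, seq=4, change="d"),
]


def restore_db1(backup_contents, backup_seq):
    """Rebuild db1 (server ID 1) from a backup, then catch up from db2."""
    db1 = Server(server_id=1, applied=list(backup_contents))
    db1.replay(DB2_BINLOG, backup_seq)
    return db1.applied


# Restoring db1's own older backup silently loses change "c".
from_db1_backup = restore_db1(["a", "b"], backup_seq=2)
# Restoring the live host's backup keeps everything.
from_db2_backup = restore_db1(["a", "b", "c"], backup_seq=3)
```

In this model the db1 backup ends up without change "c", because that event carries db1's own server ID and is skipped on replay, while the db2 backup already contains it.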


Twirrim1 karma2013-08-13 12:46:08 UTC

Alongside what avleen mentions, there are a number of tools from Percona that are essential for anyone dealing with MySQL environments, which you can find in the "Percona Toolkit". For example, pt-table-checksum helps you confirm that data is consistent between servers, and pt-table-sync helps you fix it when it isn't.


avleen1 karma2013-08-13 17:04:50 UTC

My favourite ones are pt-query-digest and pt-query-analyzer :-)


Isvara6 karma2013-08-12 18:43:00 UTC

Why do you run your own servers instead of using cloud IaaS providers? Did you do a cost comparison? Is there a non-cost reason?


avleen3 karma2013-08-12 19:06:22 UTC

Back when Etsy first started, the cloud was very much in its infancy. At that time everyone was running things on their own platform.

Since then, we've invested heavily in the skills and infrastructure to keep growing our own platform. I don't think there's a strong technical reason we couldn't run on a cloud IaaS platform; it's just that we're not there. The setup we have now runs really well and there isn't a strong technical benefit to moving.

We do have a lot of internal virtual machines which developers use, but most of the rest of our hardware is dedicated to specific functions.


pradeepchhetri6 karma2013-08-12 18:24:04 UTC

How do you monitor Etsy's website from the client's perspective? Do you use any external monitoring services?
Which RDBMS do you use, and how do you scale it, ensure high availability and prevent SPOFs?
What log and data backup solutions do you use?
What skills are required to get a job as an Operations Engineer at Etsy?
Who decides the architecture of the application: developers or DevOps? If both, how do you reconcile the differences?


avleen3 karma2013-08-12 18:57:46 UTC

We use a number of tools to do this. We use external services, but we also use Lognormal to collect performance data, much like many other sites do.
Mostly MySQL. Check this out: http://codeascraft.com/2012/04/20/two-sides-for-salvation/ . We also have a little PostgreSQL, but most of the services that use it have been migrated to MySQL with a master-master setup.
We log a lot of data and centralise it with syslog-ng. From that central point we then do a lot of cool stuff with it, like parsing with Logster, sending to Splunk, etc.
Strong web operations knowledge (linux, apache, mysql, php), configuration management, and networking are all a good starting point.
Both. We get together in a room (or over video conferencing) and talk about what would be best for the product, operations, and long term maintainability.
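To make the log-parsing step in the third answer more concrete, here is a minimal sketch of turning centralised log lines into metrics. It is not Logster's actual API: the function name, the metric names and the assumed Apache common log format are all illustrative.

```python
import re
from collections import Counter

# Assumed line shape (Apache common log format), for example:
# 10.0.0.1 - - [12/Aug/2013:18:57:46 +0000] "GET /listing/1 HTTP/1.1" 200 512
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_status_classes(lines):
    """Count responses per HTTP status class (2xx, 4xx, ...)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            counts["unparsed"] += 1  # keep track of lines we could not read
            continue
        counts[f"http_{match.group(1)[0]}xx"] += 1
    return counts


SAMPLE_LINES = [
    '10.0.0.1 - - [12/Aug/2013:18:57:46 +0000] "GET /listing/1 HTTP/1.1" 200 512',
    '10.0.0.2 - - [12/Aug/2013:18:57:47 +0000] "GET /missing HTTP/1.1" 404 128',
    '10.0.0.3 - - [12/Aug/2013:18:57:48 +0000] "POST /cart HTTP/1.1" 500 64',
    'not an access log line',
]

metrics = count_status_classes(SAMPLE_LINES)
```

Counters like these are the kind of thing that could then be shipped to a graphing system on a fixed interval.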


necessaryillusion5 karma2013-08-12 21:15:19 UTC

Thanks for doing this :)

From another comment here I assume you are using CentOS. How do you handle yum upgrades across your servers?

And also, do you use Percona's MySQL fork? If so what do you think of it compared to standard MySQL?

Thanks again :)


avleen2 karma2013-08-13 02:55:38 UTC

The answer is "very carefully". We keep software versions pinned to one version for the most part. When we need to upgrade something (new features, bug fixes, etc), we test it on a small number of systems and then let Chef upgrade it everywhere.

We open sourced our chef-whitelist library, which we use to accomplish this.


xrothgarx4 karma2013-08-12 18:10:13 UTC

As someone who is trying to make the jump from small scale web services to larger scale companies, where is the best place for me to focus my time to gain more experience?

I'm torn between: web server management and scaling (apache, nginx, etc.), config management (Puppet, Chef, etc.), databases (MySQL, Mongo, etc.), programming (Python, Ruby, etc.), and hosting platforms (AWS, Google Compute, etc.).


avleen3 karma2013-08-12 18:22:22 UTC

First and foremost: configuration management. This is the current best way to manage large systems and you'll save a lot of time. Chef, Puppet, Cfengine: learn one or more. More is better!

Once you've got a good handle on those, then start looking at the "whole stack". If you're thinking about large web companies, this means apache/nginx, php/ruby on rails/django, mysql/postgresql, etc.

"The Cloud" is a big deal - it's changed the way Operations teams work. Getting very familiar with the APIs and methodologies for AWS, Rackspace, Heroku and others is quite important.


Isvara4 karma2013-08-12 18:37:27 UTC

The kind of continuous deployment that companies like yours are doing is, I think, still out of reach for a lot of companies, especially ones that don't have big enough in-house devops teams to produce the systems. What tools or services do you think are still not available that would help companies get there more quickly?


avleen6 karma2013-08-12 18:46:46 UTC

I think you're right. I spoke about this at length at LISA '11 and the same question came up there.

A lot of CD is to do with culture, much more than tools. I know people who do it with Deployinator, Jenkins, Dreadnot, and a bunch of other things. But the culture is really what makes it. Once you have the culture moving in the right direction, where developers are happy pushing code and owning software problems, and operations teams are OK letting go of the control and working with developers, the tools become less important.

(I realise this isn't directly answering your question, so I'll summarise with: better CD tools like Deployinator and Dreadnot, and better/easier unit testing things are always good.)


willigm4 karma2013-08-12 18:38:52 UTC

What are your favorite types of cookies?


avleen4 karma2013-08-12 18:39:42 UTC

MIT magic cookies. No, not those types of magic cookies.


CYPhillis4 karma2013-08-12 18:01:08 UTC

[deleted]


avleen7 karma2013-08-12 18:11:27 UTC

We looked at a number of solutions a few years ago and none of them (at the time) met our requirements for simplicity and flexibility. Most were very heavyweight solutions.

One of the things we love here is the ability to keep things as simple as possible. That includes our software, our processes and our changes. We based our solution on that. It started off as a collection of shell scripts, and eventually was rewritten in Ruby. It's still incredibly simple, easy to debug when things break, and affords us the flexibility we want.

As an example, we can add very custom modules to test backups to ensure they're correct.

I won't say it was easy, or that I'd recommend it for everyone, but it works for us.


Scoundrel4 karma2013-08-12 17:58:05 UTC

What software and/or techniques do you use for backing up Hadoop data (to protect it from being destroyed because of a human error)?


avleen2 karma2013-08-13 03:31:22 UTC

"Human error" is a term we stay away from wherever possible. Frankly, it doesn't exist as a legitimate reason for problems in complex systems. Yes humans make errors, but those aren't themselves the reasons for something breaking.

We have an internal book club where we're currently reading "The Field Guide To Understanding Human Error" by Sidney Dekker. It explains how you should look at the failure of complex systems. There is something Dekker calls the "Old View", which includes things like human error as a reason for a problem. The "New View" takes the time to look into the actual cause of a problem. For example, let's say someone logs into a production server, thinking it's a development server, and wipes the disk.

The problem isn't that the person did it, but that they were able to do it. Why did the system not make it clearer that it was a production machine? What other safeguards should have been in place to prevent this? Did they need some kind of confirmation before doing it? Why did they need to take this particular action in the first place? And so on :-)

I highly recommend the book, and John Allspaw's talks on postmortems (part 1 of his 90 minute talk from Velocity 2011 is available here)
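As one illustration of the kind of safeguard asked about above (a confirmation step before a destructive action on a production machine), here is a minimal sketch. The function name, the environment labels and the hostnames are hypothetical and do not describe Etsy's tooling.

```python
def may_run_destructive(hostname: str, environment: str, typed: str) -> bool:
    """Return True only if a destructive command should be allowed to proceed."""
    if environment != "production":
        # Non-production hosts need no extra confirmation in this sketch.
        return True
    # On production, the operator must retype the exact hostname.
    return typed.strip() == hostname


# A development host goes straight through; production needs the hostname.
dev_ok = may_run_destructive("dev01", "development", "")
prod_wrong = may_run_destructive("db1", "production", "yes")
prod_right = may_run_destructive("db1", "production", "db1\n")
```

Requiring the hostname rather than a plain "yes" forces the operator to look at which machine they are actually on, which addresses the "thought it was a development server" case in the example.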


jtopper3 karma2013-08-12 17:33:50 UTC

Would you rather fight one Allspaw-sized Gene Kim, or 100 Gene Kim-sized Allspaws?


avleen8 karma2013-08-12 18:18:29 UTC

Depends - which one of them will buy me pizza afterwards?


jtopper5 karma2013-08-12 18:23:34 UTC

I said "fight", not "date" ;)


avleen3 karma2013-08-12 18:39:09 UTC

Damn! This is tough because I'd rather date them than fight them.

Ok fine: I'd rather take one Allspaw-sized Gene. No-one needs to fight 100 Allspaws. I can barely take on 1 (although I try regularly, I've yet to find his weak spot - next time I'll try after he's had a big lunch, it might slow his thought process down just enough).


brandyvig3 karma2013-08-12 18:36:48 UTC

Dating co-workers is bad, bad!


avleen6 karma2013-08-12 18:41:11 UTC

If that were the case, I would never have met my wife!

(full disclosure, brandyvig is my wife ;-) )


Adman653 karma2013-08-12 19:07:01 UTC

Hey, thanks for doing this.

What do you use for an ops dashboard?
What are your key operational metrics?
How do you deduce business KPIs from your data and how do you instrument them?
Are you using a message queue like AMQP for anything?


avleen2 karma2013-08-12 19:19:09 UTC

We open sourced our dashboards framework, and we group important metrics together there.

As for key metrics, we have specific metrics per service (How many DB queries are happening? How about web requests?), and also business metrics.

We graph everything. Everything!

Message queues? :-)


MightyBigMinus3 karma2013-08-12 18:46:37 UTC

How do you handle automation and change control on the very uncooperative lower layer stuff like bios and raid card config?


lozzd3 karma2013-08-12 18:58:17 UTC

BIOS: We've managed to avoid it. Which is irritating because recently we discovered that having Hyperthreading enabled was actually hurting performance and scaling, so we had to manually restart tens of servers to fix them, but that was a one off so it was quicker to do it by hand than automate (I'd love to hear if there was a good way to do this if anyone has any ideas...)

RAID cards: We have a set of tools that allow us to burn-in, configure and then install machines automatically. Part of this includes a tool which PXE boots a CentOS "live CD", so we can have full access to the various RAID configuration tools. You can essentially choose a RAID configuration on the command line, and the server is rebooted and reconfigured by loading a configuration file from the master server. No more touching MegaCLI or hpacucli :) More information about how we do this available here: http://www.slideshare.net/jonlives/custom-live-media-spinning


roxnrock17 karma2013-08-12 20:04:47 UTC

> BIOS: We've managed to avoid it. Which is irritating because recently we discovered that having Hyperthreading enabled was actually hurting performance and scaling, so we had to manually restart tens of servers to fix them, but that was a one off so it was quicker to do it by hand than automate (I'd love to hear if there was a good way to do this if anyone has any ideas...)

I went through this nightmare too. A number of servers were shipped to us with the NVRAM flashed to have HT on, when we needed it off.

We fixed it this way:

1. On a dev machine, run 'modprobe nvram && sudo dd if=/dev/nvram of=nvram.enabled'
2. Pop the machine into the BIOS, disable HT, reboot into the OS and repeat with a different output file: 'modprobe nvram && sudo dd if=/dev/nvram of=nvram.disabled'
3. Compare the two binaries with 'cmp -l -b'. Just one bit should have changed. If so, save this md5sum: 'md5sum nvram.enabled'
4. Run a dsh to find all the hosts whose NVRAM has the matching md5sum: 'modprobe nvram && md5sum /dev/nvram'
5. For each host with an md5sum matching the one you saved, copy the disabled NVRAM file over to that host and dd it over /dev/nvram.
6. Reboot the hosts. Observe that HT is now disabled. Yaaay!
7. Repeat for each md5sum that doesn't match the original.

This worked for us, YMMV.
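The compare-and-match logic in the steps above can be sketched as follows. This is only an illustration of the checks, not the commands that were run: the function names and host names are hypothetical, and the byte strings stand in for real dumps of /dev/nvram.

```python
import hashlib


def differing_bytes(before: bytes, after: bytes):
    """List (offset, old_byte, new_byte) for every byte that changed."""
    if len(before) != len(after):
        raise ValueError("NVRAM dumps differ in length")
    return [(i, b, a) for i, (b, a) in enumerate(zip(before, after)) if b != a]


def is_single_bit_flip(before: bytes, after: bytes) -> bool:
    """True when exactly one bit differs between the two dumps."""
    diffs = differing_bytes(before, after)
    if len(diffs) != 1:
        return False
    _, old, new = diffs[0]
    return bin(old ^ new).count("1") == 1


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def hosts_to_patch(reference_md5: str, host_md5s: dict) -> list:
    """Hosts whose NVRAM checksum matches the 'HT enabled' reference dump."""
    return sorted(h for h, digest in host_md5s.items() if digest == reference_md5)


# Stand-in dumps: one bit differs at offset 1.
nvram_enabled = bytes([0x00, 0x10, 0xFF])
nvram_disabled = bytes([0x00, 0x00, 0xFF])

# Checksums as they might come back from running md5sum on each host.
fleet = {
    "web1": md5_hex(nvram_enabled),
    "web2": md5_hex(nvram_disabled),
    "web3": md5_hex(nvram_enabled),
}

safe_to_apply = is_single_bit_flip(nvram_enabled, nvram_disabled)
targets = hosts_to_patch(md5_hex(nvram_enabled), fleet)
```

Only hosts whose checksum matches the reference dump are selected, since writing the saved image over a machine with different NVRAM contents would overwrite settings other than the HT bit.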


avleen1 karma2013-08-13 03:04:42 UTC

Words cannot express how awesome this is. Love it!


asenchi3 karma2013-08-12 20:12:24 UTC

Just want to say I love the Etsy Ops team. Everyone I've chatted with there is incredibly talented. Keep on rocking guys.


avleen1 karma2013-08-13 02:59:26 UTC

Thank you! We try to be helpful, and give back to the community as much as possible too. Part of that means talking openly and clearly about what we do so that others can benefit from our experience. It helps make everything better :-)


kirksan3 karma2013-08-12 18:04:47 UTC

I asked several questions on behalf of my wife in another post. Here's one for me.

I heard you guys were looking at MongoDB for some stuff. I've experimented with MongoDB at a very large site and it was a dismal failure. If you have looked into it what's your opinion?


avleen2 karma2013-08-12 18:30:32 UTC

We experimented with MongoDB for a large project a few years ago. We later decided to move the project from there to MySQL. There were a number of reasons for doing this, which related to stability, domain expertise and wanting to keep our infrastructure more homogeneous.

We don't always follow "use the right tool for the job", because we found it leads to having too many tools. Instead we established a number of patterns internally, and try to stick to them.

That doesn't mean we never add new technologies though. We recently started using Redis because it has some very specific strengths which outweigh the cost of adding another technology to the stack.


rmkbow3 karma2013-08-12 18:34:00 UTC

How do you handle new software releases? Is it a continuous deployment system of some sort?

Do you have scripts that automatically push the new release to the servers?


avleen3 karma2013-08-12 18:49:02 UTC

This is a topic I love to talk about! Due to time constraints I'll simply say that we have tons of information on this process on our engineering blog at http://www.codeascraft.com/ :-)


preflightsiren2 karma2013-08-13 04:02:09 UTC

Sorry I'm late to the party, perhaps these can get answered at a slower pace :)

Have you ever been in a position where PaaS/ outsourced IT was a realistic position for some products/web services within Etsy? How did you manage supporting those products and services?

What points of differentiation did you explore for Etsy operations services vs AWS services like Elastic Beanstalk or Heroku?

How much exposure to ITIL have you had? Do you think it's a good framework for new or ever-changing businesses? (think startups pivoting, or rapidly bringing in new product ideas)


avleen3 karma2013-08-13 05:44:11 UTC

Hi!

We've certainly considered it. In some cases it has made sense, and in other cases not. That sounds like a cop-out, but we really do take everything on a case-by-case basis. If we were a lot smaller, I would advocate more for outsourcing to help stretch our resources farther.

However, we're much farther along at this point, and have the capacity (both technology and personnel) to do almost everything in-house.

When exploring those services, we look at a number of things:

* How would it impact the user experience?
* How would this impact the current architecture?
* How much more (or less) time would it take to support?
* How much would it cost, vs doing it ourselves?
* How does it change our options for the future?

These are really just some of the questions. With the understanding that our infrastructure is a known good design pattern for us, we try to stick to it where possible.


DyHydrogenMonoxide2 karma2013-08-12 17:45:10 UTC

I have some experience running small web servers, and am interested in how a large scale system would function.

What are some useful tips / tricks / software I should look into to expand my knowledge?


avleen2 karma2013-08-12 18:35:25 UTC

Growing your own knowledge is a hard problem. If you don't have the need to work with larger scales on a daily basis, you don't get the skills.

It's a very chicken and egg situation :-)

Read Stack Overflow and Server Fault. Look at what questions people ask about those larger scales. See if you can reproduce the problem and the fix at a smaller scale (maybe in a VM on your computer?)


kaedyr2 karma2013-08-12 21:01:33 UTC

Hey Guys, I love the way you operate and the community seems awesome. I'm currently a sysadmin with programming experience and want to get into devops. Would you guys be willing to hire on someone who's quick to learn and really interested in what you're doing? :)


avleen1 karma2013-08-13 02:58:09 UTC

We hire people with a wide range of backgrounds, and look for the best fits for each position :-)

I would say that you can't really get into "devops" as a job. That would be akin to saying you want to get into "teamwork". It's just a thing you do :-) But having both programming and operations skills is a very good combination and one that is in quite high demand these days!


brandyvig2 karma2013-08-12 17:19:22 UTC

waves to all

How big is your team, who all is here today, and how do you manage all the different personalities you work with?

Also, what's for dinner? :P


avleen2 karma2013-08-12 18:17:10 UTC

The entire Ops organisation is pretty large, and covers infrastructure, security, corp IT, and others. There are 14 people who deal with the production site and network directly. We have about 15 people in the room right now, from a variety of groups who want to say hello!

Personalities are always great to work with. We enjoy all the different ways of thinking. Sure, sometimes we run into bumps just like any group does. Over time we've built strong relationships with each other and they really help us get over things quickly.

Dinner is hand pulled noodles :-)


Matt_Etsy7 karma2013-08-12 18:18:38 UTC

+1 in the Data Center!

Everyone loves noodles!


avleen3 karma2013-08-12 18:42:49 UTC

Keep the noodles away from the servers, dude! Last time I got too close with noodles, the site was flooded with pictures of ramen!


patoarvizu1 karma2013-08-12 17:28:53 UTC

I have been invited a couple of times to their 'Eatsy' days (on Tuesdays and Thursdays Etsy employees cook food for everybody with local ingredients) and the food is really great!


Rocketeers3 karma2013-08-12 18:37:57 UTC

Etsy employees don't cook. Local food vendors/caterers are brought in to serve their food.


avleen3 karma2013-08-12 18:59:30 UTC

Untrue! I make a pretty awesome chicken casserole - just at home though :-)


ancat1 karma2013-08-12 19:20:34 UTC

Why is your security team so cool?


avleen4 karma2013-08-12 19:25:30 UTC

It's all the animated cat gifs.
