The Art of Fixing Code

The title isn’t original, but it’s apt and describes my feelings exactly. Here they are as a list, also referred to as Vishnu’s Commandments for Fixing Code:

  1. Fix The Fucking Problem (hereafter FTFP)

    When you’re faced with a bug, a calamity, a loss of limb, you solve the problem. Don’t worry about its causes, its probable antecedents, which commit broke your painfully arranged view of the multiverse or whether you’ll earn a PeeEtchDee by writing a paper about good software engineering practices. Instead, you FTFP.

    Only after you’ve FTFP shall you think about anything ancillary. Only afterwards shall you blog or tweet about it or submit it to our humorous overlords. Capisce?

  2. Next, FTFP as if your ass is on fire.

    Downtime sucks. If a marginally large system used by a non-trivial number of users (by which I don’t mean your blog, which your mom and your imaginary girlfriend read) goes down the rabbit hole, a lot of people will complain. If you work for Initech, Lumbergh will cluck his tongue and hand you a pink slip if you mess up too often. Forget nine nines of reliability; if you manage a month with just 5 minutes of downtime, that’s great.

    I’ll also extrapolate this from TDD: “Write a test, write the minimum amount of code for the test to pass, refactor, write more tests” becomes “Figure out the problem, write the minimum amount of code that fixes the problem, refactor, find more problems.”
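    Here’s a rough sketch of that loop in Python. The bug is made up for illustration: a hypothetical `average` function that used to crash on an empty list. The function name and the decision to return 0.0 are my assumptions, not gospel.

```python
def average(values):
    """Return the mean of values (hypothetical function that had the bug)."""
    # The minimal fix: the earlier version divided by len(values)
    # unconditionally and blew up on an empty list.
    if not values:
        return 0.0
    return sum(values) / len(values)


def test_empty_list_does_not_crash():
    # Step 1: pin the reported problem down as a test.
    assert average([]) == 0.0


def test_normal_case_still_works():
    # Step 2: check that the fix did not break the old behaviour.
    assert average([1, 2, 3]) == 2.0
```

    Run the two test functions with pytest or just call them by hand. Refactoring comes after both pass, not before.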

  3. Log. Write logs to Disk. Back up & rotate Logs.

    In any after-action report, you’ll want to figure out why the Problem happened. What went wrong. For this you need logs from when it happened. Often you’ll notice that it’s a periodic bug which slowly got worse as your system added more users, so you’ll want to figure out exactly when it first happened and when it escalated. For this you’ll need to log properly and back up those gzipped logs. Rule of thumb: for the first nine months of any production system, logging should be enabled at the maximum possible verbosity. This includes connected systems, such as database logging.
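    If your app happens to be in Python, a minimal sketch of verbose, rotated, gzipped logging with the standard library could look like this. The logger name `app`, the size limit and the backup count are placeholders I picked. Shipping the `.gz` files to another machine is a separate job that this sketch does not cover.

```python
import gzip
import logging
import logging.handlers
import os
import shutil


def gzip_rotator(source, dest):
    """Compress the rotated-out log file into dest, then delete the original."""
    with open(source, "rb") as src, gzip.open(dest, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(source)


def build_logger(path, max_bytes=10_000_000, backups=30):
    """Return a DEBUG-level logger whose files are rotated and gzipped."""
    logger = logging.getLogger("app")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)     # maximum verbosity
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.rotator = gzip_rotator              # how a full file is archived
    handler.namer = lambda name: name + ".gz"   # app.log.1 -> app.log.1.gz
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

    Call `build_logger("/var/log/myapp/app.log")` once at startup (that path is made up too) and log away. The timestamps in each line are what let you work out when the trouble started.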

  4. Use Git already, nitwit!

    It’s pretty much what everybody should use. It gives you easy, quick commits and can even serve as a quick-and-dirty deployer. Git lets you hotchpotch solutions together in an emergency. There will be times when you want to short-circuit every code review check and just deploy the thing, goddammit, and nothing beats Git for that, for now.

  5. Use an automatic deployment tool.

    Your deploy should be just one command, or the click of a button. Hooking up pre-commit hooks is okay as long as they don’t take an eternity.
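    To make “one command” concrete, here is a bare-bones sketch of a deploy runner in Python: it runs each step in order and bails at the first failure. The step list is hypothetical (a `production` Git remote and a `master` branch are assumptions), so swap in whatever your project really does.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (name, argv) step in order and stop at the first failure.

    Returns the name of the failing step, or None if every step passed.
    """
    for name, argv in steps:
        print(f"==> {name}")
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(f"FAILED: {name} (exit code {result.returncode})")
            return name
    return None


# Hypothetical steps, listed only as data; nothing runs until you call
# run_steps(DEPLOY_STEPS) yourself.
DEPLOY_STEPS = [
    ("run tests", [sys.executable, "-m", "unittest", "discover"]),
    ("push code", ["git", "push", "production", "master"]),
]
```

    Wire `run_steps(DEPLOY_STEPS)` into a script or a button, and keep the slow checks out of it so nobody is tempted to skip the whole thing during an outage.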

  6. Make pretty downtime notices so people know you are at least trying.

    Being apathetic sucks. Looking apathetic while you are working hard to save your application sucks harder, because it’s plain stupid. So don’t be stupid. Communicate. Make Twitter work for you or, for the old-fashioned, have a mailing list or an RSS feed. Your blog shouldn’t go down at the same time as your site, so keep it on a different server. (WTF are you doing writing a blog app anyway? Outsource that to people who know better.)

  7. Learn the ins and outs of your deploy OS of choice

    It’s not enough to be a Gee Whiz programmer. Learn your OS inside out. If you’re on *nix (like real men) then this involves figuring out what to do when your load average goes through the roof, your SQL engine hogs CPU, your hard disk fills up or your webserver restarts. Learn about commands like top, iftop, iotop and uptime, and the entire /proc magic filesystem. They’ll help you diagnose code and issues. For blacker magic, learn about strace and dtrace and how to debug difficult issues. All this comes later though. Remember the golden rule: FTFP, so the first take should always be to Google your error.
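    If you want the same first-look numbers from inside a script, here is a small Python sketch, assuming a *nix box. `os.getloadavg()` gives the three load averages that `uptime` prints, and `shutil.disk_usage()` covers the “is the disk full” question. The function name and the dictionary keys are my own invention.

```python
import os
import shutil


def health_snapshot(path="/"):
    """Return load averages and disk usage for a quick first look."""
    load1, load5, load15 = os.getloadavg()  # Unix only
    disk = shutil.disk_usage(path)
    return {
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
        "cpus": os.cpu_count(),
        "disk_used_pct": round(100 * disk.used / disk.total, 1),
    }
```

    A common rule of thumb is that a load average sitting well above the CPU count for a while means the box is overloaded. Treat that as a hint to go look at top and iotop, not as a diagnosis.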

  8. Don’t put all your eggs in one basket

    Do trust in Murphy, he’s eternally right. Things will fuck up. Instead of trying to prevent every failure, plan for contingencies and try to recover from them fast. Work through scenarios where bad things happen to your application. Have a load-balanced setup from the very start, for chrissake! Have a DB in a master-slave configuration the instant you can afford it. Have a system where provisioning servers doesn’t take days. Move as much infrastructure as you can to the cloud where you don’t have to maintain it directly.

  9. Delegate the Debugging

    This might be harder to do because you’ve got to pump up your adrenalin and stay in the zone to figure out problems and implement quicker solutions, but often, three or four helping hands work much better at solving problems—especially when other people can help you cross-reference data to try to come up with probable cause. Remember, speed is key.

    Surround yourself with people who are smarter than you too. That helps to negate your stupidity. Own up to mistakes and implement solutions fast.

  10. Have a Staging Server (or Replicate Production as closely as possible)

    The worst problems are those that happen only on production systems and can never be replicated in development. Remember: OS X is not Linux, minor version differences often introduce incompatible interfaces (I’m looking at you, RubyGems and PHP) and stuff breaks when you add something new without testing it on your deployment OS of choice.

    If you are making any drastic changes to the architecture, test them on a staging server (or if you can’t afford one, a password-protected subdomain) in live conditions before switching. It’ll save you loads of trouble.

  11. Prevention is better than Cure

    Do TDD or have a good test suite. Stuff will break far less often that way. Educate your coders to have decent S/W engineering practices. Indent code and name variables uniformly. Raise exceptions and use assertions within your code. Use transactions too, so that when stuff breaks it doesn’t affect a lot of other important elements. Learn how to use a Queue and how important that is in modern production systems. Don’t abuse an RDBMS for tasks it wasn’t meant for. Learn about newer key-value storage DBs. Beware of caching: it introduces subtle errors and problems. But learn to love it too, because without it you’ll never scale. Write readable code and document it both within and separately.
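    As a small illustration of the assertion, exception and transaction advice, here is a sketch using Python’s built-in sqlite3 module. The `accounts` table and the helper names are hypothetical. The point is only that a failure halfway through leaves the data exactly as it was.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two demo accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)]
    )
    conn.commit()
    return conn


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move amount from src to dst, all or nothing."""
    assert amount > 0, "amount must be positive"
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, src),
        )
        row = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if row is None or row[0] < 0:
            raise ValueError(f"{src} cannot cover {amount}")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, dst),
        )
```

    With the demo data, `transfer(conn, "alice", "bob", 500)` raises `ValueError` and the debit that had already run is rolled back, so both balances stay untouched.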