How to avoid downtimes on updates

Hello,

Our dev team has grown, and we now deploy more frequently. On the other hand, our site has grown too, and downtimes are becoming less acceptable.

My goal is to let us keep deploying frequently and at any time during the day without risking server downtime.

The solution I've gathered from reading around is to have the codebase deployed on two servers, A and B. While traffic is directed to server A, we push new code to server B and restart it. Once it's back online, we transfer traffic to server B. On the next update, we'll push new code to server A instead, and do the same thing.

Is that a good way to go?

How would I set that up in PythonAnywhere? With two web apps connected to the same database? Can I programmatically change which web app receives traffic from my domain, so I can automate the whole process?

I don't think we have a programmatic way to do that right now, but I think you could do it from the interface. Would a couple of minutes' downtime (while you click on things on our site) be acceptable for your use case? If I'm reading what you said correctly, your main concern is that you might deploy a version of your site that has bugs, and you're worried about the downtime involved in backing that out -- is that right?

If so, you could set up two web apps, connected to the same database, as you say, with the live one having the name of your live domain (let's say www.yourdomain.com) and another with a different name (say, staging.yourdomain.com). You'd update staging.yourdomain.com, and if you were happy with it, you could then:

  • Rename www.yourdomain.com to (say) old.yourdomain.com. At that point, you'd start getting a PythonAnywhere "Coming soon" page for your site.
  • Rename staging.yourdomain.com to www.yourdomain.com -- that would put the new site live
  • Just to tidy up, you would then rename old.yourdomain.com to staging.yourdomain.com

Each rename operation would take less than a minute, so you'd have less than two minutes downtime. In the event of a bug in the new code meaning that you needed to roll back, you'd just repeat the process.
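To make the ordering of the three renames concrete, here is a minimal sketch in Python. It is only an illustration: the `rename` callable is hypothetical (there is no rename endpoint in the API at the time of this thread, so today each step would be a manual click), and the hostnames are the example ones used above.

```python
from typing import Callable, List, Tuple


def swap_plan(live: str, staging: str, parked: str) -> List[Tuple[str, str]]:
    """Return the three (old_name, new_name) renames, in order."""
    return [
        (live, parked),     # live site goes offline from here...
        (staging, live),    # ...until this rename completes
        (parked, staging),  # tidy-up: old code becomes the next staging
    ]


def run_swap(rename: Callable[[str, str], None],
             live: str, staging: str, parked: str) -> None:
    """Apply the plan using a caller-supplied (hypothetical) rename function."""
    for old_name, new_name in swap_plan(live, staging, parked):
        rename(old_name, new_name)
```

Running the same plan a second time swaps the two sites back, which is the rollback described above.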

(I've made a note that we should add a "rename website" endpoint to our API -- it would be nice to automate this.)

One extra complexity to think about with any blue/green deploy system like this is that you'll need to work out how to make sure that your database is compatible with both the old and the new code. That can be tricky, especially in the case of a rollback if you've deleted a table or a column.
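One common way to keep a shared database usable by both versions is to make schema changes additive only while both web apps are running, and to postpone deletions until the old code is retired. The sketch below illustrates that idea with an in-memory SQLite database; the `users` table, the `nickname` column and both insert functions are made up for the example and are not taken from anyone's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def old_code_insert(conn: sqlite3.Connection, name: str) -> None:
    # Old release: knows nothing about the new column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_code_insert(conn: sqlite3.Connection, name: str, nickname: str) -> None:
    # New release: writes the new column as well.
    conn.execute(
        "INSERT INTO users (name, nickname) VALUES (?, ?)", (name, nickname)
    )


# Additive migration: a nullable column, so the old code's INSERT still works.
conn.execute("ALTER TABLE users ADD COLUMN nickname TEXT")

old_code_insert(conn, "alice")          # old web app keeps working
new_code_insert(conn, "bob", "bobby")   # new web app uses the new column
rows = conn.execute("SELECT name, nickname FROM users ORDER BY id").fetchall()
```

Dropping a column or table is the risky direction: once it is gone, rolling back to code that still reads it will fail, so that step is safer as a separate, later migration.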

Thank you Giles. We do already have and use a staging environment, and my concern is not so much with bugs as with the server restart when you update code. We like to push code multiple times per day, and the server restart, even though it lasts less than 30 seconds, can cause issues in our users' flows.

So the current solution is to only update the live environment once per day, at a time when few users are around, but that reduces our reactivity.

The renaming solution sounds like a good plan, if it could be done automatically and instantaneously. Having that website renaming endpoint would help.

The automation would go something like this:

  1. New code detected on master branch
  2. Push code to the inactive webapp
  3. Wait for a 200 response from the home page (and a few other key pages, just to be sure)
  4. Rename the active webapp to "old.", the inactive webapp to "www.", and finally "old." to "staging."
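The "wait for a 200 response" step could be sketched as below, using only the Python standard library. The function names, the retry count and the delay are assumptions chosen for illustration, and the URLs to check would be whatever key pages the inactive web app serves.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except OSError:
        return 0  # connection refused, DNS failure, timeout, ...


def wait_until_healthy(urls: Iterable[str],
                       fetch: Callable[[str], int] = http_status,
                       attempts: int = 30,
                       delay: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until every URL returns 200, or give up after `attempts` rounds."""
    url_list = list(urls)
    for attempt in range(attempts):
        if all(fetch(url) == 200 for url in url_list):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A deploy script would only go on to the renames if `wait_until_healthy([...])` returns `True`. `fetch` and `sleep` are parameters so the loop can be tested without real network calls.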

The only downtime that should happen here is between:

  A. "www." gets renamed to "old."
  B. "staging." gets renamed to "www."

Would this happen near-instantaneously with the renaming API?

I don't think it would be quick enough for your requirements, unfortunately -- a rename is essentially a stop on the old name, then a start on the new name, so it's going to be no faster than a regular reload -- probably slower, because you'd be doing it twice (well, technically three times, but the last one would not contribute to downtime).

We do have some plans to move in a direction where we could do something that would fit better with your goals; it's being driven by the requirement that some people have for wildcard domains (so, say, *.something.com would route to the same website, so that they could have separate hostnames for different users of their sites). In order to do that we'll need to decouple websites from domains, so that you could have multiple sites up and running and just redirect the traffic for a particular domain between them in real time.

But that's some way away -- it will require some fairly significant changes to the way we route requests around the system. They're all good changes with multiple benefits, and I'm 99% sure it will happen, but can't give any timelines.

That sounds like a great idea that would be useful to us as well. Would you ever publish a roadmap or a UserVoice-style tool?

The blue/green system also helps when there's more than just a code update: database migrations, installing new dependencies, collecting static files. These can take time while the server is down, so I think blue/green is a good system to have even if the reloads are slow.

For the time being we can keep blue up all day and do the URL switch at night so that green becomes the live system. And if we urgently need a bugfix to go straight to production, we can exceptionally do the switch mid-day, which will take two server reloads as you said, but we can control when that happens and avoid too much trouble.

Thanks for your help, Giles!

We have no plans to publish our roadmap, but I'm glad to hear you have an idea that you can use.