Sean Michael Kerner Blog
Google Gmail 100 minute outage is a big deal

From the 'Everyone Makes Mistakes' files:
Google Gmail users were hit with a 100-minute outage yesterday due to an upgrade issue.

Ben Treynor, VP Engineering for Google Gmail, blogged that Google took some of the Gmail servers offline on Tuesday morning for routine upgrades. It was those upgrades that led to the service disruption. That's right: due to a miscalculation on Google's part, an action (the upgrade) that should have provided better service resulted in no service for tens of millions of Gmail users around the world.

"We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers -- servers which direct web queries to the appropriate Gmail server for response," Treynor blogged.

In my opinion, this is a classic load balancing newbie error. The problem is that Google isn't a newbie. How could they not know the load on their servers? More importantly, how come they don't have some kind of virtual (or physical) pool of on-demand burst bandwidth to deal with issues?

It turns out Google does have extra capacity, which is how they were able to bring back service after 100 minutes. To Google's credit, they did notice the issue immediately and they are taking steps to improve reliability in the future.

"After establishing that the core problem was insufficient available capacity, the team brought a LOT of additional request routers online (flexible capacity is one of the advantages of Google's architecture), distributed the traffic across the request routers, and the Gmail web interface came back online," Treynor blogged.

Treynor added that Google is now increasing request router capacity beyond peak demand. They are also doing some router isolation to ensure they don't have the same kind of cascading effect in the future, so if one router goes down in one data center it doesn't affect routers in other data centers. Graceful degradation, another good best practice, is something Google is now going to implement. So traffic will just be slower when trouble pops up, instead of having routers just not accept traffic in response to an overflow condition. This is the same type of technique that, in my opinion, works well for dealing with Denial of Service (DoS) attacks.

"Gmail remains more than 99.9% available to all users, and we're committed to keeping events like today's notable for their rarity," Treynor wrote.

This incident just goes to show us all that even a big, sophisticated organization like Google can mess up on network upgrades.
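To make the difference between "routers stop accepting traffic" and "traffic just gets slower" concrete, here is a toy model in Python. It is only a sketch: the `Router` class, the capacities, the load figure and the even traffic split are all made up for illustration, and none of it reflects how Google's request routers actually work.

```python
from dataclasses import dataclass


@dataclass
class Router:
    capacity: int        # requests per tick this router can handle (made-up unit)
    online: bool = True


def hard_fail(routers, load):
    """Routers pushed past capacity stop accepting traffic entirely.

    Their share is spread over the survivors, which can push those over
    capacity too. Returns the number of routers still online at the end.
    """
    live = [r for r in routers if r.online]
    while live:
        share = load / len(live)          # assume an even split across live routers
        overloaded = [r for r in live if share > r.capacity]
        if not overloaded:
            break
        for r in overloaded:
            r.online = False
        live = [r for r in live if r.online]
    return len(live)


def graceful(routers, load):
    """Routers serve what they can and queue the rest, so service slows
    down instead of stopping. Returns (served, backlog) for one tick."""
    total = sum(r.capacity for r in routers if r.online)
    served = min(load, total)
    return served, load - served


def make_pool(in_maintenance):
    # Ten hypothetical routers; the first `in_maintenance` are offline for upgrades.
    caps = [120, 120, 110, 110, 100, 100, 100, 100, 90, 90]
    return [Router(c, online=(i >= in_maintenance)) for i, c in enumerate(caps)]


if __name__ == "__main__":
    load = 850
    print(hard_fail(make_pool(0), load))   # every router online
    print(hard_fail(make_pool(2), load))   # two routers offline for upgrades
    print(graceful(make_pool(2), load))    # same situation, degrading gracefully
```

With these invented numbers, taking two routers offline under the hard-fail policy knocks out six more in the first round and the last two in the next, leaving zero online. Under the graceful policy the same pool still serves 800 of the 850 requests and carries a backlog of 50, which users would feel as slowness rather than an outage.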
1 Comment
Thought this was a good article - I was totally dead in the water during the two hour downtime, couldn't get work email, couldn't look up information I needed for an important phone call, and couldn't confirm an account for a server certificate I was trying to create before it expired....
All because of Gmail. Meanwhile, email to my home server (Linux w/Postfix) was still running fine so I could get email in/out.
Basically, I'm extremely skeptical of "Cloud Computing" (vs. ASPs, where a company provides a very specific 24/7 server for a fee). But in either case, not having MY data (incoming, outgoing, and archived) stored on a local machine w/backups just seems like I'm flirting with disaster. If my Google email account got wiped, it would be as bad as losing my wallet, given all the info I have saved there and not on a machine I own (with backups!).
Thanks for the details on what happened - it was indeed a total newbie mistake (although w/more than 20 years in the biz, I still make them myself, I must admit.)