Why Is The Site So Slow? My Eyes Are Glazing Over!

As I wrote in my post on May 7 – “Fessing Up To Our Mistakes” – we ran into problems earlier this month with scaling our application.

That post goes into a great deal of detail about those problems, our early efforts to solve them, and the lessons we learned. We know that our site has been very slow, especially during the day in the Western Hemisphere.

From the moment we ran into the scaling problems, we’ve focused 100% on resolving them as quickly as we can. I’ve been personally involved in those efforts, and our entire team has been involved either in helping to fix the problems or in responding to our users.

I wanted to give you an update on what we’ve done and what’s left to do. I am happy to answer specific questions that the update doesn’t cover – feel free to leave them in the comments on this post.

1. Our first priority has been to find a global way to deliver data to every user more quickly, especially during periods of heavy traffic (7 am to 7 pm Central Time). The problem for us isn’t the number of servers, the data center, or bandwidth. We host with one of the top providers in the world (and pay an obscene amount of money for it every month). As I wrote in the earlier post, our problem is with our existing application and the way data is stored in and queried from the database. We can’t easily solve that problem within the existing application, so we’ve looked for a temporary workaround. We’ve brought in additional developers and system administrators to assist, and we’re close to a solution. We’ll be doing further testing tomorrow on a test site before we roll the solution out to our production site.

For those who want slightly more technical details – we’ve tried to implement Squid (without success) and Varnish (with some success, and we hope to finalize Varnish tomorrow) as caching proxies in front of the site, so that we can serve huge numbers of people without the delays you’ve been seeing. In the brief windows when we’ve tested this on the production site, it has worked very well. Unfortunately, we’ve run into problems that forced us to revert. That’s why some of you saw pages that wouldn’t load last week – and why we posted a note at the top of the site warning that some pages might not load.
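
To make the caching idea concrete for the technically curious: a proxy like Varnish can only absorb traffic for responses the application marks as cacheable. The sketch below is not our actual code – it’s a minimal, hypothetical WSGI example (the paths and the 60-second lifetime are made up for illustration) showing the kind of Cache-Control headers an application might emit so that read-heavy pages can be served from the proxy instead of the database.

```python
# Hypothetical sketch (not our production code): a tiny WSGI app that marks
# read-heavy pages as cacheable so a proxy such as Varnish can serve them
# without hitting the application or database on every request.
from wsgiref.simple_server import make_server

CACHEABLE_PREFIXES = ("/browse", "/designs")  # made-up paths for illustration


def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    body = b"<html><body>page body would go here</body></html>"

    headers = [("Content-Type", "text/html; charset=utf-8")]
    if environ["REQUEST_METHOD"] == "GET" and path.startswith(CACHEABLE_PREFIXES):
        # Public and cacheable for 60 seconds: the proxy can serve thousands
        # of visitors from one cached copy instead of one database trip each.
        headers.append(("Cache-Control", "public, max-age=60"))
    else:
        # Anything personalized (carts, accounts) must never be cached.
        headers.append(("Cache-Control", "private, no-store"))

    start_response("200 OK", headers)
    return [body]


if __name__ == "__main__":
    make_server("127.0.0.1", 8000, app).serve_forever()
```

The appeal of this approach for us is that a cached page never touches the database at all, which is where our bottleneck lives.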

We debated whether we should “test” in our production environment, but concluded that unless we subjected the proxies to real load, we could not be sure whether they would work. In fact, they worked beautifully in our test environment and then failed horribly under real load. Many of last week’s problems were the result of those live tests.
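
For readers wondering how you approximate “real load” before touching production: the snippet below is a deliberately crude, hypothetical load driver – the URL, request count, and concurrency are placeholders, and it is nowhere near a full load test – just to show the shape of hammering a staging proxy with concurrent requests and looking at the latencies.

```python
# Hypothetical sketch: a crude concurrent load driver for a staging proxy.
# The URL, request count, and concurrency below are placeholders.
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

TARGET = "http://staging.example.com/browse"  # placeholder URL
REQUESTS = 500
CONCURRENCY = 50


def fetch(_):
    start = time.time()
    try:
        with urlopen(TARGET, timeout=10) as resp:
            resp.read()
        return time.time() - start
    except Exception:
        return None  # treat any error as a failed request


with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(fetch, range(REQUESTS)))

latencies = sorted(r for r in results if r is not None)
failures = results.count(None)
if latencies:
    print(f"ok={len(latencies)} failed={failures}")
    print(f"median={latencies[len(latencies) // 2]:.3f}s "
          f"p95={latencies[int(len(latencies) * 0.95)]:.3f}s")
```

Even a driver like this only approximates production traffic – real users hit a far wider mix of pages, with real cookies and real cache-miss patterns – which is part of why something can look fine in a test environment and still fall over under real load.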

We are very proud of our customer service team for keeping our community well informed about the problems, and for helping the buyers and creatives who were having all sorts of problems on the site. By Friday of last week, we had resolved virtually all of the outstanding problems except for the overall sluggish site performance during the day.

2. Our second priority (which we’re pursuing in parallel with the work above) is to audit all of our server configurations, identify errors and areas where we can improve performance, and then test and implement those fixes. This has been an ongoing process, and we’ve made numerous improvements that have helped significantly during off hours but have had only a marginal impact during the day. We continue to make tweaks to ensure our servers are performing at their best.

3. Our third priority is to explore the addition of more servers. If we could fix all problems by deploying more servers, we would.

This is not as simple a solution as it appears, because more servers can actually hurt our performance by putting more demands on the database and our file server. We learned this when we added two servers to our farm last week. That’s why, at the moment, this is not our highest priority, but we continue to evaluate the option so that once we find a way to add servers without those negative side effects, we can do so.
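
Here is a back-of-the-envelope way to see why extra web servers can make things worse rather than better. The numbers below are illustrative only, not ours: each application server holds its own pool of database connections, so demand on the single database behind them grows with every server you add, while the database itself gets no faster.

```python
# Illustrative arithmetic only - these numbers are made up, not ours.
# Each web server keeps its own pool of database connections, so adding
# servers multiplies the demand on the single database behind them.
DB_CONNECTIONS_PER_SERVER = 30   # hypothetical pool size per app server
QUERIES_PER_REQUEST = 12         # hypothetical queries a slow page issues
DB_CAPACITY_QPS = 2000           # hypothetical queries/sec the database sustains

for servers in (4, 6, 8):
    total_connections = servers * DB_CONNECTIONS_PER_SERVER
    # The farm's ceiling is set by the database, not by the web tier:
    max_requests_per_sec = DB_CAPACITY_QPS / QUERIES_PER_REQUEST
    print(f"{servers} servers -> {total_connections} DB connections competing "
          f"for ~{max_requests_per_sec:.0f} req/s of database capacity")
```

The ceiling in that sketch is set entirely by the database, so extra web servers mostly add contention for it – consistent with what we saw when the two new servers went in, and one reason the caching work described above comes first.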

Once we have stabilized the site and returned performance to levels that don’t embarrass us (and believe us, we are embarrassed by how the site has performed over the last few weeks), we’ll refocus on completing our refactoring effort, thoroughly testing the new code, and deploying it at the earliest opportunity. We are confident that the new code will resolve virtually all of these issues and, more importantly, will allow us to scale cleanly and efficiently.

Please feel free to ask questions. I am happy to get into more technical discussions in the comments if you’re interested or if it’ll help you avoid making some of the same mistakes we’ve made.

Thanks to our entire community for your patience with us as we deal with the real problems of scaling. We continue to be humbled by your confidence in our ability to promptly get past these issues.

Photo credit: law_keven