Friday, November 12, 2010

Dangers of multi-threaded applications

Creating multi-threaded applications can be a scary task (at least for me). The common problems, such as deadlocks, race conditions and corrupted memory, are pretty intimidating, especially if you're just starting out. One less famous, but still nefarious, problem you can run into in multi-threaded development is the continuously spinning thread. I'm sure there's a more common name for these, but essentially they are threads running in a non-stop loop, polling for new work. Threads like these run through the loop as fast as possible and greedily consume as much CPU time as they can, which leads to horrendous performance loss.
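To make the pattern concrete, here is a minimal Python sketch of such a thread. It is not our server code; the names `spinning_worker`, `run_idle`, `jobs`, `stop` and `stats` are made up for illustration. The worker polls a queue with a non-blocking call and, when it finds nothing, immediately loops around and polls again.

```python
import queue
import threading
import time


def spinning_worker(jobs, stop, stats):
    """Poll `jobs` without ever blocking or sleeping (the problematic pattern)."""
    while not stop.is_set():
        try:
            jobs.get_nowait()  # non-blocking: raises queue.Empty right away
        except queue.Empty:
            stats["empty_polls"] += 1  # nothing to do, yet we loop again at once
            continue
        stats["handled"] += 1


def run_idle(seconds=0.2):
    """Run the worker against an empty queue and report what it did."""
    jobs, stop = queue.Queue(), threading.Event()
    stats = {"empty_polls": 0, "handled": 0}
    worker = threading.Thread(target=spinning_worker, args=(jobs, stop, stats))
    worker.start()
    time.sleep(seconds)  # the queue stays empty the whole time
    stop.set()
    worker.join()
    return stats


if __name__ == "__main__":
    print(run_idle())
```

Even with no work at all, the worker racks up a huge number of empty polls in a fraction of a second, and every one of them costs CPU time.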

We ran into a very bad version of this today on new server software we were mass testing for the first time. A lot of people complained about being unable to connect to the server, move through our game levels, or interact with items in the levels. I confirmed the server processes were running at 100% CPU, so we started looking at where our performance was going. What made matters frustrating was that on our (beefy) dev machines, the servers were only running at about 8-10% usage. We spent a little time looking at some game engine profile output and at some asset issues we thought were the culprit. I wish I could say we found the problem through some tool or cool debug technique, but in the end it came about when, on a whim, I asked another developer about some threaded functionality he had recently added. I asked him to check his code to make sure his new thread was yielding when there was nothing for it to do, and voila, we found the problem.

It turned out he had originally built the threaded task with some blocking functionality, which had (unintentionally) kept the thread's CPU usage under control, but later changed it to use non-blocking calls. After this change he forgot to add a thread sleep call, so the thread was simply spinning in a loop looking for something to do. While he fixed the code, I figured out how to set up a test case on our dev machines to reproduce the 100% CPU usage we saw on the servers. I remembered we could set the processor affinity through the Task Manager, which would let us simulate a single-core machine to run the game server on. A single spinning thread can only saturate one core, which is presumably why it barely registered on our multi-core dev machines. Once I did this, I was able to see the game server spike to 100% CPU usage on the assigned core. Next I integrated our fix, which was simply to sleep the thread for 10ms every loop, and re-ran my test. The server dropped from 100% CPU usage to between 0 and 1% while idling!
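The effect of the fix is easy to reproduce in miniature. The Python sketch below is only an illustration under my own assumptions, not our actual server code: `poll_worker` and `idle_cpu_seconds` are hypothetical names, and the measurement uses `time.process_time()`, which reports CPU time for the whole process. It runs the same idle polling loop twice, once spinning and once sleeping 10ms whenever the queue is empty.

```python
import queue
import threading
import time


def poll_worker(jobs, stop, idle_sleep):
    """Poll `jobs`; when idle, sleep `idle_sleep` seconds (0 means spin)."""
    while not stop.is_set():
        try:
            jobs.get_nowait()  # non-blocking poll
        except queue.Empty:
            if idle_sleep > 0:
                time.sleep(idle_sleep)  # give the CPU back until the next poll


def idle_cpu_seconds(idle_sleep, wall_seconds=0.5):
    """Return the CPU time the process burns while the worker sits idle."""
    jobs, stop = queue.Queue(), threading.Event()
    worker = threading.Thread(target=poll_worker, args=(jobs, stop, idle_sleep))
    cpu_before = time.process_time()
    worker.start()
    time.sleep(wall_seconds)  # the queue stays empty the whole time
    stop.set()
    worker.join()
    return time.process_time() - cpu_before


if __name__ == "__main__":
    print("spinning:      ", idle_cpu_seconds(idle_sleep=0.0))
    print("sleeping 10 ms:", idle_cpu_seconds(idle_sleep=0.010))
```

The exact numbers depend on the machine, but the sleeping version should use only a small fraction of the CPU time of the spinning one. The trade-off is that new work can wait up to 10ms before it gets picked up. Where that matters, a blocking call with a timeout (such as `jobs.get(timeout=...)` in this sketch) keeps the thread idle without adding a fixed delay.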

Overall, it was a good day today, long as hell though.