Thursday 16 July 2009

The Amazing Windows Specific Bug

I've just spent a week and a bit tracking down possibly the weirdest, most pernicious bug I've seen all year. It was obviously going to be a doozy - the application crashed only when the system had multiple cores, only when the input images were all of different sizes, and the crashes were both intermittent and unpredictable. Oh, and in the middle of code that had been seriously hammered in the past, and in the wild, and shown no signs of falling over.

Where to start?

Well, first of all, that intermittent crashing? That looks like a threading issue. It could just be memory corruption, but running on the exact same input data in the exact same order gave significantly different times to failure each run. It just feels like a threading issue. That plus it's only apparent on multi-core machines. A check of the code reveals some worker threads that are spawned in the midst of some number crunching to make use of all the available cores - so if there's only one core, there're no more threads. Aha, this looks likely. Right, so, if there's a weird issue going on here, we might be able to track it down using Valgrind, right? right?

Wrong. Because on Linux... there is no issue. Exact same source code. No crashing.

Here's a pseudocode version of the section of code in question:

In the main thread, for each image processed:
//Initialise the shared job queue (that all threads will pull jobs from)
this->jobQueue.Initialise(jobData);
//Initialise and resume the worker threads
for (i=0; i<threadCount; i++)
{
this->threads[i]->Initialise(data);
this->threads[i]->Resume();
}
while (true)
{
//Keep pulling jobs off of the shared job queue until there's no more work to do
job = this->jobQueue.GetJob();
if (job)
{
this->DoProcessing(job)
}
else break;
}

do
{
//Wait for the worker threads to finish
for (i=0; i<threadCount; i++)
{
busy = this->threads[i]->CheckBusy();
if (busy) break;
}
}
while (busy);
//Pause all the workers until we need them again
for (i=0; i<threadCount; i++) this->threads[i]->Pause();

And in each worker thread :
Thread::Initialise(data)
{
HandleData(data);
//Signal that the thread is now working
this->busy = true;
}


Thread::Entry()
{
while (!this->TestDestroy())
{
job = this->jobQueue->GetJob();
if (job)
{
this->CrunchNumbers();
}
else
{
//Signal that the thread is no longer busy
this->busy = false;
}
}
}

So, to explain a little, what we have here is a pool of threads that are created at one time and then re-used through the lifetime of the application. Whenever we get into the number crunching code, the threads have their operating data passed in and they're resumed. As soon as all the jobs are processed, the threads signal that they're finished. Once all the threads are done, the main thread pauses them all and the program goes on its merry way.

Anyone figured it out yet?

There's one last piece of interesting information, to do with the way that threads are paused on the two different platforms. On Linux, the thread won't honour the Pause() call immediately, but will pause the next time it gets to TestDestroy() (for those wanting to replicate this, I'm using wxThread though this behaviour is a result of the underlying threading of the target OS). On Windows, the thread is paused immediately.

The upshot of this is that you can't guarantee, when you resume a thread in Windows, where it's going to start execution from. Now, with this last piece of information, it should all fall into place. See where the thread sets its busy flag to false? Right, in your head pause the thread there. Now, next iteration the thread will be set up properly by the initialise call, which raises the busy flag... only for it to instantly lower the flag when the thread resumes. So now the thread is merrily doing its processing, but it's signalling that it isn't, and is ready to be paused.

That in itself, while worrying, isn't so bad. What can happen next is. The Pause() can now happen anywhere in the worker threads loop. So if the thread happens to still be working when the main thread starts checking the workers to see if they're ready to be paused, it's more than likely going to be paused in the middle of doing number crunching on the input data.

And then, next iteration, you change all the input parameters and resume the thread mid-calculation with totally different data. If you're lucky, that means you get garbage out for this one iteration. If you're not, and your worker is mucking around with variable length data, you just wandered off your allocated heap and into the wild unknowns of Segfault City.

There's an easy fix, of course. The issue only arises because when the thread is in a state that it flags as "pausable", it's still performing write operations on itself (constantly re-clearing the busy flag). If we consider that the pause, and any re-initialisation of the thread data, is asynchronous to the thread entry (which it is) then it's obvious this is going to cause trouble. Instead, we can make sure that once the thread has signalled it's ready to be paused it no longer performs any writes. Since it's not going to be writing, it's now safe to pause it, do some asynchronous writes, and then set it going again. (note: I say "safe". It's safe in this context. You could still really confuse the thread if it's checking badly-considered conditionals that you just monkeyed with the control parameters of, but let's take that as read).
Thread::Entry()
{
while (!this->TestDestroy())
{
//Make sure we don't keep setting the busy flag to false over and over
if (this->busy)
{
job = this->jobQueue->GetJob();
if (job)
{
this->CrunchNumbers();
}
else
{
this->busy = false;
}
}
else Sleep(1); //Oh, and may as well consume fewer resources while we're at it
}
}