Friday, February 08, 2008

PBXT & DBT2: Debugging C/C++ 101

Yesterday I started testing PBXT using the DBT2 benchmark. Following the implementation of durability and SELECT FOR UPDATE for the engine, I was more interested in the benchmark as a test of stability and concurrency than of performance. I was not disappointed...

Which bug first?

Well, I immediately ran into 3 bugs. Isn't it funny how bugs often come in batches, leaving you thinking: "Oh sh.., where do I start?" Here's my advice: start with the bug that is most likely to disappear if you fix the others!

A simple example: you have 2 bugs: an unexpected exception is occurring, and you're losing memory. First look for the memory loss, because it may disappear when you fix the exception (you may, for example, be losing memory in the error handler).

Take things one problem at a time:

Another thing: once you have decided on one of the bugs, stick with it (no matter how hard it gets) to the bitter end! Thrashing around only builds frustration!

So what happened with the DBT2 test? I started the test and immediately noticed that the engine was throwing "duplicate key" errors (it was too much to hope that this behavior was intended). Next I hit an assertion claiming that a semaphore was not initialized (but I knew the semaphore was initialized). Finally, on restart after the assertion failure, the engine crashed during recovery, in the clib memory manager (not a good sign!).

So where to start? Taking my own advice, I quickly secured the state of the database before the restart, and confirmed that I could repeat the restart crash. So that one could wait for later.

The duplicate key error seemed to be a fairly stable repeat, so I took a closer look at the semaphore problem. Here I noticed that the assertion was failing because the check bytes that indicate that the semaphore was initialized had been overwritten. Not a happy situation!

Make the bug quick and easy to repeat:

This bug was also difficult to repeat: I had to restore a fresh environment to get it to repeat consistently. So this is where I started.

But before we go on: make sure, in such a situation, that you can repeat the bug as quickly and easily as possible. Eliminate as many manual steps as you can; it will save time in the long run. For example, in this case I wrote a line of shell commands to delete the database and copy in a fresh one, to provide the correct starting point for repeating the bug.

Check your last bug fix first!

Unfortunately this bug turned out to be the result of a short lapse of concentration during my last bug fix. But I did not notice the error while testing that bug fix, and so I moved on to DBT2. When the error occurred during the DBT2 test, I did not relate the problem to my last bug fix.

If I had, I would have found the problem quickly enough simply by re-reading the code of the bug fix. This has happened to me before, so my advice is: check your last bug fix, even when the new error does not seem to be related!

Debugging C/C++ 101, 3 lessons:

Conveniently for my little refresher course, each of the 3 bugs proved to touch on a different aspect of C/C++ debugging:

Bug 1. Using an uninitialized pointer.

For goodness' sake, if you suspect this, then compile your program with optimization on and the warning for "uninitialized variables" enabled. I didn't do this, and doing so might have saved me a lot of time. Anyway, this does not always work (for example, if you have taken the variable's address with '&'). Unfortunately, if the compiler does not help, there is no easy way to find these bugs.
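As a minimal illustration (this snippet is made up for the example, it is not PBXT code), GCC will normally only produce the "may be used uninitialized" warning when optimization is on:

  /* uninit.c - compile with, e.g.: gcc -O2 -Wall -Wuninitialized -c uninit.c */
  #include <stdio.h>

  int print_length(int use_default)
  {
      int len;                          /* only set on one path... */

      if (use_default)
          len = 10;

      /* ...but read on both paths: "len may be used uninitialized" */
      printf("length = %d\n", len);
      return len;
  }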

Debugging method: Probing

I call the method I used to find this error "probing". The idea is to write a special piece of check code that tests for the memory overwrite. The semaphore that was being overwritten was not global, so I added some code to set a global pointer to the semaphore when it was initialized. Then I wrote a little check (or probe) function that tested whether the check bytes were still OK.
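Here is a rough sketch of what such a probe looks like. The names below (check_sem_t, probe_sem and so on) are invented for this example; they are not the actual PBXT identifiers:

  #include <stdio.h>
  #include <stdlib.h>

  #define SEM_CHECK_INIT  0xABCD1234u       /* check bytes written on init */

  typedef struct check_sem {
      unsigned int check;                   /* must stay SEM_CHECK_INIT */
      /* ... the real semaphore data would follow here ... */
  } check_sem_t;

  /* Global pointer set when the suspect semaphore is initialized,
   * so that the probe can reach it from anywhere in the program. */
  static check_sem_t *probe_target = NULL;

  void sem_init_checked(check_sem_t *sem)
  {
      sem->check = SEM_CHECK_INIT;
      probe_target = sem;                   /* register it for probing */
  }

  /* The probe: report and stop as soon as the check bytes are gone. */
  void probe_sem(const char *file, int line)
  {
      if (probe_target && probe_target->check != SEM_CHECK_INIT) {
          fprintf(stderr, "PROBE FAILED at %s:%d\n", file, line);
          abort();
      }
  }

  #define PROBE()  probe_sem(__FILE__, __LINE__)

Because PROBE() records the file and line, the first probe that fails tells you how far the program got before the semaphore was overwritten.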

Next I spread calls to the check function around my program, trying to close in on the point where things go wrong. When doing this you have 2 difficulties to deal with:

Finding the right thread - If you are probing the wrong thread, then you get very misleading results. For example, I started by adding the probes to the engine API functions. When the probe failed on the entry point, it took me a while to realize that the problem must be in the PBXT background threads. So, using the probe, first try to isolate the thread(s) that are causing the corruption.

This meant removing all probes from the engine API functions and placing them in the background threads, starting with the main loops. Then by elimination I managed to narrow the problem down to one particular thread.

Dealing with disappearing repeatability - The problem with this kind of bug is that it is really shy! As soon as you start to probe it, it disappears. I mean that the changes made to the program alter the execution so that the bug no longer repeats.

At this stage it is very tempting to leave the debug statements in the code and declare the bug fixed! But, alas, the bug is still there; it has just moved on to overwrite some other part of memory in some quiet little corner.

Here I can give the following advice: when the bug disappears, always return to the last point at which it repeated and try taking smaller steps this time.

Another thing: approach the corruption point from the bottom. By this I mean, close in with the probe from a point after the corruption has occurred. This is because if the corruption is due to an uninitialized stack value, then as you move the probe towards the corruption point from the top, the probe itself disturbs the state of the stack.

As I mentioned above, when I found this bug it turned out to be a result of the last bug fix, bummer :(

Bug 2. Overwriting memory.

Next I decided to look at the crash on startup, which occurred in the memory manager. This is most often due to a memory overwrite that has wiped out some of the management data the memory manager stores with each block.

Fortunately I have the right tools to deal with this problem.

Debugging method: Scanning Memory

It's like "don't leave home without it": never start a C/C++ program without a debug memory manager that does at least the following:
  • Adds and checks headers and footers on every block of memory allocated.
  • Wipes out blocks that are freed (for example set all bytes to 0xFE).
  • Always moves a block of memory that is reallocated.
  • Records every block allocated, noting the file and line number where it was allocated.
  • Checks on shutdown that all memory has been freed, and reports blocks not freed.
This can also be done for objects allocated in C++ by overriding the new and delete operators.

Now, using the fact that I have a list of all allocated pointers, I have implemented a function that scans all allocated blocks and checks the headers and footers to tell me if anything has been corrupted.
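To make this concrete, here is a minimal sketch of such a debug allocator and scan routine. Everything below (dbg_malloc, dbg_scan_all and friends) is invented for the example; it is not the PBXT debug memory manager, and a real one would also need locking for multi-threaded use:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define HEAD_GUARD  0xDEADBEEFu
  #define FOOT_GUARD  0xCAFEBABEu
  #define FREE_FILL   0xFE              /* freed blocks are wiped with this */

  typedef struct dbg_block {
      struct dbg_block *next, *prev;    /* registry of live blocks */
      const char       *file;           /* where it was allocated  */
      int               line;
      size_t            size;           /* user-visible size       */
      unsigned int      head_guard;     /* header check bytes      */
      /* user data follows, then a footer guard                    */
  } dbg_block;

  static dbg_block *dbg_list = NULL;

  void *dbg_malloc(size_t size, const char *file, int line)
  {
      unsigned int  foot = FOOT_GUARD;
      dbg_block    *b = malloc(sizeof(dbg_block) + size + sizeof(foot));

      if (!b)
          return NULL;
      b->file = file;
      b->line = line;
      b->size = size;
      b->head_guard = HEAD_GUARD;
      memcpy((char *)(b + 1) + size, &foot, sizeof(foot));
      b->prev = NULL;                   /* link into the registry */
      b->next = dbg_list;
      if (dbg_list)
          dbg_list->prev = b;
      dbg_list = b;
      return b + 1;
  }

  static int dbg_block_ok(dbg_block *b)
  {
      unsigned int foot;

      memcpy(&foot, (char *)(b + 1) + b->size, sizeof(foot));
      return b->head_guard == HEAD_GUARD && foot == FOOT_GUARD;
  }

  void dbg_free(void *p)
  {
      dbg_block *b = (dbg_block *)p - 1;

      if (!dbg_block_ok(b))
          fprintf(stderr, "corrupt block allocated at %s:%d\n", b->file, b->line);
      if (b->prev) b->prev->next = b->next; else dbg_list = b->next;
      if (b->next) b->next->prev = b->prev;
      memset(b, FREE_FILL, sizeof(dbg_block) + b->size);    /* wipe it */
      free(b);
  }

  /* The scan: walk every live block and stop at the first corruption.
   * Sprinkle calls to this into suspect loops to close in on the bug. */
  void dbg_scan_all(const char *file, int line)
  {
      dbg_block *b;

      for (b = dbg_list; b; b = b->next) {
          if (!dbg_block_ok(b)) {
              fprintf(stderr, "SCAN at %s:%d: block of %lu bytes "
                      "allocated at %s:%d is corrupted\n",
                      file, line, (unsigned long)b->size, b->file, b->line);
              abort();
          }
      }
  }

  #define DBG_MALLOC(sz)  dbg_malloc((sz), __FILE__, __LINE__)
  #define DBG_SCAN()      dbg_scan_all(__FILE__, __LINE__)

A shutdown check is then just a walk of the same list, reporting any block that is still linked in.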

So to find the recovery crash I added the scan call to some of the loops that do recovery, and soon managed to narrow things down and find the point where the corruption was occurring.

Note that with this method it may not even be necessary to do such a search. One call to the scan routine tells me which block has been corrupted. If it was a simple overwrite of the end of the block, then my debug memory manager will tell me which block it was, and where it was allocated. This may be enough to find the problem.

In my case it turned out that I had taken a pointer to a block and then some sub-function reallocated the block. But this is why it is so important that the debug memory manager always moves blocks on realloc(). If it had not done this, I probably never would have noticed the bug until it happened in some production situation (ugh!).
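Here is a sketch of the "always move on realloc()" rule; dbg_realloc is a made-up name for the example, not the PBXT routine:

  #include <stdlib.h>
  #include <string.h>

  void *dbg_realloc(void *old, size_t old_size, size_t new_size)
  {
      void *new_block = malloc(new_size);

      if (!new_block)
          return NULL;
      if (old) {
          memcpy(new_block, old, old_size < new_size ? old_size : new_size);
          memset(old, 0xFE, old_size);  /* stale pointers now read junk... */
          free(old);                    /* ...and the old block is gone    */
      }
      return new_block;                 /* never the same address as 'old' */
  }

Because the old block is both moved and wiped, any pointer that someone kept into it fails loudly in the debug build, instead of silently surviving until production.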

Bug 3. Concurrency problems.

I was worried about the 3rd bug, which was causing an unexpected "duplicate key" error, because I was afraid it might be a conceptual problem. This fear stemmed from the fact that there are indeed serious conceptual problems involving MVCC and SELECT FOR UPDATE (which requires locking), but fortunately my fear was unfounded, and it turned out to be just a normal bug, whew!

I knew this bug must be related to concurrency because I had tested all aspects of the row level locking in a simple, controlled environment.

Debugging method: Trace it

The way to find concurrency problems is to trace them. The better your trace, the easier it is to find the bug. I think it is almost impossible to find a concurrency bug just using the debugger (unless you have a deadlock, for example). This is because in the debugger you only have a snapshot of the situation, and you don't see the interaction between the threads.

In my case the duplicate key error turned out to be the result of a SELECT FOR UPDATE that failed earlier.

There is, of course, a big problem with tracing and concurrency. Sometimes the error is robust, and sticks around while you bombard it with printf() statements. Mine was more of the shy type, where the timing was really critical, and it disappeared when I added print statements.

In this case, I also have the right tool for the job. It is a trace function which records the information only in RAM, in one huge block of memory which rolls over if necessary. It is also worth taking a bit of extra time to make the trace function handle printf()-style syntax, so that it is as easy to use as printf() itself.
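A minimal sketch of such an in-RAM trace is below. The names (xt_trace, xt_dump_trace) and sizes are invented for the example, and a real version would need a lock or atomic index once several threads trace at the same time:

  #include <stdarg.h>
  #include <stdio.h>
  #include <string.h>

  #define TRACE_BUF_SIZE  (16 * 1024 * 1024)    /* one big block of RAM */

  static char   trace_buf[TRACE_BUF_SIZE];
  static size_t trace_pos = 0;

  /* printf()-style trace that writes only to memory, so the timing is
   * disturbed far less than by a real printf(). */
  void xt_trace(const char *fmt, ...)
  {
      char    line[512];
      va_list ap;
      int     len;

      va_start(ap, fmt);
      len = vsnprintf(line, sizeof(line), fmt, ap);
      va_end(ap);
      if (len <= 0)
          return;
      if ((size_t)len >= sizeof(line))
          len = sizeof(line) - 1;               /* message was truncated */
      if (trace_pos + (size_t)len > TRACE_BUF_SIZE)
          trace_pos = 0;                        /* roll over when full */
      memcpy(trace_buf + trace_pos, line, (size_t)len);
      trace_pos += (size_t)len;
  }

  /* Dump whatever has been collected so far; call this at the point
   * where the error is detected, or from a breakpoint. */
  void xt_dump_trace(const char *path)
  {
      FILE *f = fopen(path, "w");

      if (!f)
          return;
      fwrite(trace_buf, 1, trace_pos, f);       /* simplest case: ignores any
                                                   wrapped-around part */
      fclose(f);
  }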

What I did was set a breakpoint at the spot in my code where the duplicate key error is generated. At this point in the code I also built in a call to my "trace dump" function, which dumps the trace information collected so far to a file.

Then I can examine the trace to find out how I got to this point, and I can also use the debugger to examine the threads and find other information I need.

Now some advice on trace code:
  • Firstly, never build in trace code that you think you might need! This is a waste of time if it is never used, and it clutters the code unnecessarily.
  • Secondly, when faced with a problem that needs to be traced, do not waste too much time or thought trying to guess what information you will need. Initially, just get something out there. Examining the trace is the best way to decide what information is missing.
  • Finally, when you are done, and you have found your bug, you will probably feel quite attached to your new trace code and not want to part with it. Don't worry, I understand, and I am not going to tell you now that you have to remove it :) Well, not all of it, anyway.
Really, you have to be very critical and decide what parts of the trace are useful in general, and what parts just helped you find this particular bug. That stuff must go. And the rest: take a bit of time to clean it up, and make sure it can only be enabled in debug mode (a sketch of this follows below)! Ever had to recall a version because you forgot a trace in it? Hmmm...
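One simple way to do that, building on the hypothetical xt_trace() sketch above, is to let the trace calls compile away entirely in release builds:

  /* DEBUG and XT_TRACE are placeholder names, not the actual PBXT macros. */
  #ifdef DEBUG
  #define XT_TRACE(...)  xt_trace(__VA_ARGS__)
  #else
  #define XT_TRACE(...)  do { } while (0)
  #endif

  /* Usage: XT_TRACE("T%lu: row lock on %lu failed, err=%d\n", thd_id, row_id, err); */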

Well in the end I was very happy with my trace code. It allowed me to pinpoint the bug in SELECT FOR UPDATE, and I added a GOTCHA to my code which you can search for as soon as I get this version released (soon I hope).

OK, so this has been quite a long post, thanks for sticking with me :)