Debug"ing,

a. & n. from Debug, v. (hah missed that one Webster 1913 ! What ? The word didn't exist in 1913 ? You win this one Webster 1913, but you'd better watch your back from now on...)
Put simply, it is the act of finding (and hopefully fixing) the bugs (i.e. errors) in your code. Often infuriating, but usually rewarding when you finally find and crush the little bugger.
Debugging normally starts when your program exhibits some abnormal behaviour. If you're lucky you will be able to reproduce the problem easily and get to work. If you are unlucky the program will behave fine every time you try. You are of course neglecting the influence of the alignment of the planets on your code.
Your first task is to gradually narrow down where in your code the problem is occurring. You may already have a fairly good idea of where the problem is thanks to things such as core dumps, StdLog files (generated by MacsBug on versions of the Mac OS prior to Mac OS X), crash logs etc., which provide information on the state of the program if it actually crashes. Here's one from a program I've been working on:
Date/Time:  2002-07-18 02:17:45 +0200
OS Version: 10.1.5 (Build 5S66)
Host:       localhost

Command:    Speed Download
PID:        2495

Exception:  EXC_BAD_ACCESS (0x0001)
Codes:      KERN_PROTECTION_FAILURE (0x0002) at 0x00000000

Thread 0:
 #0   0x70000978 in mach_msg_overwrite_trap
 #1   0x7024a0e8 in SwitchContexts
 #2   0x702c9eec in YieldToThread
 #3   0x00015190 in ThreadSchedulerTimer
 #4   0x70196cbc in __CFRunLoopDoTimer
 #5   0x7017c244 in __CFRunLoopRun
 #6   0x701b70ec in CFRunLoopRunSpecific
 #7   0x7017b8cc in CFRunLoopRunInMode
 #8   0x79587904 in RunEventLoopInModeUntilEventArrives
 #9   0x7959a818 in ReceiveNextEventCommon
 #10  0x7974dbac in _AcquireNextEvent
 #11  0x795f1090 in RunApplicationEventLoop
 #12  0x00010d6c in main
 #13  0x000040ac in start
 #14  0x00003edc in start

Thread 1:
 #0   0x7000497c in syscall
 #1   0x70557600 in BSD_waitevent
 #2   0x7002054c in _pthread_body

Thread 2:
 #0   0x7003f4c8 in semaphore_wait_signal_trap
 #1   0x7003f2c8 in _pthread_cond_wait
 #2   0x705593ec in CarbonOperationThreadFunc
 #3   0x7002054c in _pthread_body

Thread 3 Crashed:
 #0   0x0002abec in HTTPValidator
 #1   0x000345fc in StartHTTPDownload
 #2   0x7027ae50 in CooperativeThread
 #3   0x7002054c in _pthread_body

PPC Thread State:
  srr0: 0x0002abec srr1: 0x0000d030                vrsave: 0x00000000
   xer: 0x20000014   lr: 0x0002abc8  ctr: 0x70002af0   mq: 0x00000000
    r0: 0x00000000   r1: 0x021974c0   r2: 0x00000000   r3: 0x01dc91b7
    r4: 0x00000000   r5: 0x00000006   r6: 0x01dc91b0   r7: 0x01dc91b4
    r8: 0x000001fc   r9: 0x8024099c  r10: 0x000bc1a0  r11: 0x84000280
   r12: 0x70002af0  r13: 0x00000000  r14: 0x00000000  r15: 0x00000000
   r16: 0x00000000  r17: 0x0005ab34  r18: 0x00000000  r19: 0x01dcec90
   r20: 0x00000000  r21: 0x0006da51  r22: 0x0006dbf0  r23: 0x00000000
   r24: 0x01dc91b7  r25: 0x00000000  r26: 0x01dcec90  r27: 0x00000000
   r28: 0x000e8ff0  r29: 0x000f20a0  r30: 0x00164000  r31: 0x0002ab34

Besides providing me with information such as date and time, this crash log tells me what caused the crash:
Exception:  EXC_BAD_ACCESS (0x0001)
Codes:      KERN_PROTECTION_FAILURE (0x0002) at 0x00000000
This tells me that my program crashed because I tried to access some memory which I wasn't allowed to (specifically at address 0, which usually means a null pointer was dereferenced). A bit further down you can see
Thread 3 Crashed:
 #0   0x0002abec in HTTPValidator
 #1   0x000345fc in StartHTTPDownload
 #2   0x7027ae50 in CooperativeThread
 #3   0x7002054c in _pthread_body
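It is giving me a stack trace for each of my threads, which basically tells me the name of the function each thread was executing when my program crashed, and the chain of calls that led there. Here thread 3 crashed inside HTTPValidator, which was called from StartHTTPDownload. As you might expect, this narrows down the problem significantly.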
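If you collect a lot of crash logs, pulling out the crashed thread by hand gets old quickly. Here is a minimal sketch in Python that does it for the log format shown above. The function name crashed_frames and the regular expressions are my own, and they assume the layout of this particular log rather than any documented format.

```python
import re

# Excerpt of the crash log above (thread sections only).
CRASH_LOG = """\
Thread 2:
 #0   0x7003f4c8 in semaphore_wait_signal_trap
 #1   0x7003f2c8 in _pthread_cond_wait

Thread 3 Crashed:
 #0   0x0002abec in HTTPValidator
 #1   0x000345fc in StartHTTPDownload
 #2   0x7027ae50 in CooperativeThread
 #3   0x7002054c in _pthread_body
"""


def crashed_frames(log_text):
    """Return (thread_number, [function names]) for the crashed thread.

    Hypothetical helper: assumes headers look like "Thread N Crashed:"
    and frames look like " #0   0x0002abec in Name".
    """
    thread = None
    frames = []
    for line in log_text.splitlines():
        header = re.match(r"Thread (\d+)( Crashed)?:", line)
        if header:
            if thread is not None:
                break  # reached the section after the crashed thread
            if header.group(2):
                thread = int(header.group(1))
            continue
        frame = re.match(r"\s*#\d+\s+0x[0-9a-fA-F]+ in (\S+)", line)
        if thread is not None and frame:
            frames.append(frame.group(1))
    return thread, frames


thread, frames = crashed_frames(CRASH_LOG)
print(f"Thread {thread} crashed in {frames[0]}, called from {frames[1]}")
```

The first frame is where the crash happened; the second is its caller, which is often where the bad value came from.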

If the bug you are hunting doesn't actually cause a crash then your best bet is to follow the program and see what it is doing. If you are lucky enough to have a debugger (which you will almost certainly have these days), you will usually put breakpoints at various places in your code. When the execution of your program reaches one of these points the debugger will step in and let you examine the contents of memory and variables, and step through your code line by line. If a crash occurs, the debugger will often show you what line caused it. If your code is doing the wrong thing you may see it happen. If you don't have a debugger, you will have to add statements to your code that output the data you are interested in. This is of course less flexible and may also interfere with the problem you are trying to fix. Depending on your operating system, you may have other tools at your disposal, for example environment variables that cause system libraries to print extra information about what they are doing (or sometimes separate "debug" versions of these libraries).
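You can borrow the environment variable trick for your own output statements, so they stay in the code but only talk when asked. A minimal sketch in Python, assuming a made-up variable called MYAPP_DEBUG and a made-up logger name "myapp" (neither comes from any real library):

```python
import logging
import os


def make_logger(environ=os.environ):
    """Return a logger that prints debug output only if MYAPP_DEBUG=1.

    MYAPP_DEBUG is a hypothetical variable name chosen for this sketch.
    """
    logger = logging.getLogger("myapp")
    enabled = environ.get("MYAPP_DEBUG") == "1"
    logger.setLevel(logging.DEBUG if enabled else logging.WARNING)
    if not logger.handlers:
        logger.addHandler(logging.StreamHandler())  # writes to stderr
    return logger


def average(values, logger):
    total = sum(values)
    # The kind of statement you would otherwise add and remove by hand.
    logger.debug("average: total=%r count=%d", total, len(values))
    return total / len(values)


log = make_logger()
print(average([2, 4, 9], log))
```

Run normally, this prints only the result. Run with MYAPP_DEBUG=1 set in the environment, it also prints the intermediate values to stderr.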

Debugging your program may alter the way your program works. For example, if the problem is caused by two threads trying to access the same piece of data or resource at the same time (a common problem, known as a race condition), then interrupting the execution may stop the simultaneous access from happening. Even something as innocent as adding a printf statement can change the timing of your program enough to hide the bug.
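To make the race condition concrete, here is a small Python sketch. Real races depend on unlucky timing, so to get the same result on every run I force the bad interleaving with a barrier: both threads read the counter before either writes it back. The Counter class and the pause hook are inventions for this example, not a pattern taken from a real program.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def unsafe_increment(self, pause):
        current = self.value      # read
        pause()                   # stand-in for an unlucky thread switch
        self.value = current + 1  # write back a possibly stale value

    def safe_increment(self):
        with self.lock:           # read and write happen as one step
            self.value += 1


def run(make_target):
    """Run two threads against a fresh counter and return the final value."""
    counter = Counter()
    threads = [threading.Thread(target=make_target(counter)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


def run_unsafe():
    barrier = threading.Barrier(2)  # both threads read before either writes
    return run(lambda c: lambda: c.unsafe_increment(barrier.wait))


def run_safe():
    return run(lambda c: c.safe_increment)


print("unsafe:", run_unsafe())  # one increment is lost
print("safe:  ", run_safe())
```

In a real program nothing forces the interleaving, which is exactly why stopping in a debugger or adding an output statement can make the lost update disappear.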

Often the problem is simply the final result of an earlier problem. Part of your program may trash some data another part of your program relies on. A crash may happen when the second part executes, but this may give you very little information on where the actual problem occurs. Even worse is when your program is trashing the stack or the heap, which will usually cause your program to crash at seemingly random points.
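A rough Python sketch of this "crash far from the cause" effect, and of one common defence: checking data where it is produced rather than where it is consumed. All the names here (parse_header, store_header, content_length) are made up for illustration.

```python
def parse_header(line):
    """Buggy helper: quietly returns None for a malformed line."""
    name, sep, value = line.partition(":")
    if not sep:
        return None  # the real bug: bad data silently enters the program
    return name.strip(), value.strip()


def store_header(headers, line, strict=False):
    parsed = parse_header(line)
    if strict and parsed is None:
        # Fail where the bad data is produced, not where it is consumed.
        raise ValueError(f"malformed header line: {line!r}")
    headers.append(parsed)


def content_length(headers):
    # Raises TypeError here if an earlier call stored None,
    # even though this function is not where the bug lives.
    for name, value in headers:
        if name.lower() == "content-length":
            return int(value)
    return None


headers = []
store_header(headers, "garbage without a colon")
store_header(headers, "Content-Length: 42")
try:
    content_length(headers)
except TypeError as error:
    print("crashed far from the cause:", error)
```

With strict=True the error is raised by store_header, on the line that introduced the bad data, which is a much better place to start looking.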
Yet another type of problem is what is known as a deadlock. When this happens you don't actually get a crash; the program just locks up. This happens when part A of the program is waiting for part B to complete while part B is waiting for part A to complete. As you can see, when this happens you will wait forever.
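Here is a sketch of the classic two-lock deadlock in Python. A true deadlock would hang the example forever, so each thread gives up after a timeout and reports that it could not get its second lock. The barriers are only there to make the bad timing happen on every run; the function names are my own.

```python
import threading


def _run_pair(worker_a, worker_b):
    threads = [threading.Thread(target=worker_a), threading.Thread(target=worker_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def opposite_order(timeout=0.2):
    """Thread A takes lock_a then lock_b; thread B takes them the other way round."""
    lock_a, lock_b = threading.Lock(), threading.Lock()
    both_hold_first = threading.Barrier(2)
    both_gave_up = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            both_hold_first.wait()  # each thread now owns one lock
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            both_gave_up.wait()     # keep holding `first` until both attempts end
            if acquired:
                second.release()

    _run_pair(lambda: worker("A", lock_a, lock_b),
              lambda: worker("B", lock_b, lock_a))
    return got_second


def same_order():
    """Both threads take lock_a before lock_b, so neither can block the other forever."""
    lock_a, lock_b = threading.Lock(), threading.Lock()
    finished = []

    def worker(name):
        with lock_a:
            with lock_b:
                finished.append(name)

    _run_pair(lambda: worker("A"), lambda: worker("B"))
    return sorted(finished)


print("opposite order:", opposite_order())
print("same order:    ", same_order())
```

Without the timeout, the first version would simply sit there. Always taking locks in the same order is one common way to avoid this particular trap, though it is not the only one.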

All these previous types of bugs are what I might call implementation bugs: you had the right idea when you were writing your code, you just messed up when you converted your ideas into code. Equally insidious is what I call a logical bug, i.e. a bug that is caused by a fault in your logic or design. You can step through code till you're blue in the face; it won't help much until you realise you were thinking about the task your program is doing in the wrong way. And even then you have to come up with the right way of doing it, which may involve rewriting significant amounts of code.

At some point you will probably end up looking through hundreds or thousands of lines of code trying to work out what is happening, cup of coffee in one hand, mouse in the other. You may make random changes, and keep your fingers crossed while you run the program or send it off to testers. Oh the sinking feeling when you get an email with the subject "Bug not fixed"!!

But in the end it's all worth it: the feeling of satisfaction you get when you have sent the little bugger into the other world keeps you going until the next bug is found.

I hope some of the non-developers out there have gained a brief insight into what we are actually doing staring at our screens at 2 am. I would like to finish with a few words of advice, should you ever submit a bug report:

  • Don't just say "The program crashes": it's not very helpful
  • Do be specific and give details
  • Do try and find a scenario that causes the problem to happen reliably: fixing bugs you can't reproduce is hard
  • Do be clear and articulate about the problem
  • Do give any details you have (such as core dumps etc.)
  • Don't be abusive: it doesn't help anyone
If you're thinking "Hey I'm just a user, you're the developer, it's up to you to fix all that!" then bear in mind that the better the bug report is, the easier it will be to find and fix the bug and you will have a better product sooner.

Help a geek today, send in nice bug reports !