The process of figuring out how something works by observation, disassembly and/or analysis of its behaviour, attributes and effects.

A common method of duplicating a product is called "clean room reverse engineering". This involves having two separate teams of engineers. The important thing is that these two teams never see each other, talk to each other, pass each other in the hallway, or have any contact whatsoever. Team A (or the dirty team) is sent to reverse engineer the original product (also called the target product). They take it apart, run tests on the components and the whole, and generally figure out what makes it tick. They then draw up their conclusion, as a set of complete specifications for the product, in a document called a Design Document. Their job is over; they are then dismissed. The design document is handed over to Team B (or the clean team). Using this document, they then reconstruct the product from scratch to do exactly the same thing that the original did.

This process is used to sidestep legal issues with reverse engineering a product with the intent of duplicating it. The dirty team has no intent of duplicating it and the clean team has never reverse engineered the product. It is important, for legal purposes, that the clean team has never seen or used the original product at all.

A well-known case of clean room reverse engineering involves the IBM PC. In 1982, the IBM PC was a widely-used computer. While various attempts to copy it had been made, these attempts all resulted in an imperfect copy that was similar, but not identical to the PC, and were unfit for many purposes. An engineer named Rod Canion decided to make money off the situation by forming a company called Compaq, which used the clean room procedure to clone the IBM ROM BIOS (the most important piece of the PC that was proprietary to IBM) and create the Compaq Portable PC.

Being not only 100% compatible with the IBM PC, but also cheaper, not to mention portable, made the Compaq very popular and sparked the whole PC clone industry.

This is a trivial example.

Some time ago, we wrote a graphics-intensive Windows program, and we wanted it to degrade gracefully on systems with eight-bit or four-bit color. We spent two painful weeks thrashing around with eight-bit color palettes in several different versions of Windows. In the end we pretty much gave up1. The full horror of it all belongs in another writeup, but along the way, we learned a few things about color palettes. In fact, we wrote a Win32 utility for editing them2. Most bitmap editors (e.g. Photoshop and Paint Shop Pro) will let you export a color palette from a bitmap that uses indexed color. This turns out also to be true of the resource editor that's built into Microsoft's C++ IDE, DevStudio. Of course, everybody uses a different file format for these things. Photoshop just blasts an array of three-byte RGB triplets onto the disk with no header, PSP uses a text file, and so on. DevStudio exports a RIFF file.

What the hell's a RIFF file?!

Today, we decided (for no sane reason at all) to train our palette editor to read those files, so we sat down and figured that out.

First, we decided to close our eyes and charge madly at it, swinging wildly and howling imprecations. Real programmers don't need no stinking documentation.

First, we created a trivial case to start sniffing at: A sixteen-color palette, with sixteen identical and easily recognizable entries (red == 1, green == 2, blue == 3). We created that, and created a simple bitmap which used it. We then loaded the bitmap into the resource editor, exported the palette to a file, and hex-dumped that file. This is what we saw:

00000000  52 49 46 46 50 00 00 00  50 41 4c 20 64 61 74 61  RIFFP...PAL data
00000010  44 00 00 00 00 03 10 00  01 02 03 00 01 02 03 00  D...............
00000020  01 02 03 00 01 02 03 00  01 02 03 00 01 02 03 00  ................
00000030  01 02 03 00 01 02 03 00  01 02 03 00 01 02 03 00  ................
00000040  01 02 03 00 01 02 03 00  01 02 03 00 01 02 03 00  ................
00000050  01 02 03 00 01 02 03 00                           ........

We expected the first four bytes to be "RIFF", because we've dumped these things before. That's how we knew in advance that the RIFF thing was happening here. We aren't surprised by all that 01 02 03 00 stuff starting at byte 0x18, either. Those are the color palette entries. They're easy to recognize because we used an easily recognized pattern of bytes. We wouldn't have been surprised to see them be three bytes each, but four makes equal or greater sense: Windows usually stores colors as four byte values3, with the last byte containing flags or just left zeroed as padding. Maybe the first or the last palette entry is something else, something which just coincidentally happens to have the same value as one of our colors -- but there's only sixteen of these things. Yeah, those are all palette entries.

So what about the rest? Bytes 0x08 through 0x0f are easy: "PAL data" is a string which makes sense in English. We vaguely remember that RIFF files are some kind of metafile (AVI files are RIFFs too, for example), so it makes sense for there to be a tag identifying what kind of data the file contains.

Now the hard parts: Bytes 0x04 through 0x07 (50 00 00 00) and bytes 0x10 through 0x17 (44 00 00 00 00 03 10 00).

0x04 through 0x07 could be either two WORD (16 bits) values, or one DWORD (32 bits) value. The way these things usually go, it's probably something about the file size, or the size of the data segment that comes next in the file. We're on Planet Wintel here, so it's little-endian: That means that if it's a DWORD, we're looking at 0x00000050, or decimal 80. How big is our file? 88 bytes. A-ha! The file is four bytes of "RIFF" followed by four bytes of the DWORD we're sniffing at right now, followed by 80 more bytes of other stuff. It looks like what we've got here is a "data segment size" value. We can check that by creating another palette, this one with 256 entries. Sure enough:

00000000  52 49 46 46 10 04 00 00  50 41 4c 20 64 61 74 61  RIFF....PAL data
00000010  04 04 00 00 00 03 00 01  01 02 03 00 01 02 03 00  ................
...

The new file is 1048 bytes in size. In bytes 0x04 through 0x07 (in boldface), it's got 0x00000410 (little-endian, remember). That's decimal 1040, once again eight less than the file size. We'll assume we got that one right. Of course, it could be a two-byte value: The second two bytes there could be something else entirely. However, if that were the case, the size of the thing would be limited to 0xffff (decimal 65535) bytes. That's tiny: 64k. Microsoft was that stupid fifteen years ago, but we'll assume that they've woken up by now. Ha ha.

Now what about those other eight bytes, between "PAL data" and the RGB data? Let's tackle the first four bytes, again starting with the assumption that it's a DWORD. Okay, 0x00000044, what's that in decimal? Maybe when we see the number, it'll look like it means something. 0x44 is 68 decimal. Maybe it's a size. Does it describe the size of anything we've got on our hands? Well, how many bytes left in the file? Ohh, about 68. Gotcha. Does this assumption hold true for our other file? 0x404 equals decimal 1028 equals, well, golly... Once again, twenty bytes less than the file size. There it is. WE ARE COOKING WITH GAS.

(Any questions about this "twenty bytes" figure? Well, I got confused just now writing about it, so I'll cover that part in detail. Let's look back up at the first hex dump up above. We're looking at the first four bytes of the second "line" of data. The first "line" is sixteen bytes, and we're looking at the next four bytes after that: So, the remainder of the file is everything from byte number 21 onwards.)

Next four bytes: As a DWORD, that's 0x00100300, decimal 1049344. Um. That doesn't look like anything meaningful. Let's try WORDs: 0x0300 (768d) and 0x0010 (16d). 768 decimal is 256 * 3, which probably means something, since 256 is a magic number (2 to the eigth power, or 0x0100), but we don't see 768 of anything in particular lying around on the floor here, so we'll move on.

16 decimal rings a bell immediately: It's the number of entries in our palette. Does this assumption hold true for the other palette? Yes: 0x0100 == 256.

So that's it, except for that 0x0300, which both of them have in the same spot. Ohhh... Hang on. We have seen 0x0300 recently: The Win32 GDI ("graphics device interface") has an internal palette format called LOGPALETTE ("logical palette"). We used LOGPALETTE to store eight-bit palettes when we were fighting our epic losing battle with indexed color eight months ago, and just this morning we wrote code to read and write files containing them. They look like this:

typedef struct tagLOGPALETTE { 
    WORD         palVersion;
    WORD         palNumEntries;
    PALETTEENTRY palPalEntry[1]; // See note 4
} LOGPALETTE; 

We know what a WORD is: Two bytes. Suddenly, we remember that the version member of that struct must always... equal... 0x0300! That's where we remember that number from. And the version member is two bytes, followed by a two-byte count of the palette entries. This is exactly what we've been looking at. What about the rest? PALETTEENTRY is four bytes: One byte each of red, green, blue, and flags. The flags can be zero. Our palette entries in the file are one byte each of red, green, and blue, followed by a zero byte.

So. I think what we've got on our hands here is a LOGPALETTE, crammed for obscure reasons into some wacky metafile.

By this time, we feel good enough about ourselves that we just might deign to RTFM. It turns out that RIFF files consist of a series of "chunks", which may be nested. The word "chunk" nauseates us. Nevertheless, we read on. A "chunk" is a four-byte code, usually ASCII (but not null-terminated), followed by a four-byte "chunk" size, followed by the appropriate number of bytes of data. The data in a "chunk" could be any old thing, including a series of "sub-chunks" (which are laid out just like any other chunk). The data in the top-level chunk in the file -- the chunk identified by the "RIFF" at the beginning of the file -- is a four-byte file-type string ("PAL ", "AVI ", etc.) followed by a series of "chunks". So this is what we've really got (I'll de-little-endianize the integers for the sake of your patience and my own rusty neurons):

"RIFF"      ;;  ID for the "top level" "chunk"
0x00000050  ;;  Top-level "chunk" data size (80 decimal)
"PAL "      ;;  File-type ID
"data"      ;;  ID of first contained "chunk" 
0x00000044  ;;  First contained "chunk" data size (68 decimal)
            ;;  (it's the last "chunk" too, in our case)
;; ...and 0x44 bytes of data, which in this case happens to be a LOGPALETTE:
    0x0300      ;;  Version
    0x0010      ;;  Number of entries
    01 02 03 00 ;;  First palette entry
    01 02 03 00 ;;  Second palette entry
    ;; ...and so on, for a total of sixteen palette entries.

How d'ya like them apples? MSDN doesn't have anything about this "PAL " RIFF file stuff, by the way, nor anything meaningful that I could find about RIFFs in general -- just some stuff about AVIs which presupposes that you're already semi-clued about RIFFs5.





1 The big problem was that our thing was a clock utility, and it needed to look right even when it didn't have focus. It turned out that we could safely degrade to four-bit color on eight-bit displays, so we settled for that.

2 http://divinewrath.com/gpl It's called PaletteEd. There's a binary, but no documentation.

3 See PALETTEENTRY, RGBQUAD, and COLORREF.

4 The last member is weird: The '1' in square brackets tells us that it's an array with only one element. That's not as pointless as you think: This is a C data structure. C doesn't care how many elements an array has, much less how many you declare it to have. After the program is compiled, nobody really even knows. It's up to the programmer to write code to keep track (or not). This struct is meant to be allocated dynamically. You allocate enough space for a LOGPALETTE struct, plus additional space for however many PALETTEENTRY structs you expect you'll need. Then you access the thing through a pointer. You'll never declare an actual LOGPALETTE struct, just a pointer to one. At runtime, you can index that array anywhere you like. If you allocated enough memory, you're okay. If not, you're doomed. So don't do that, okay?

5 Some of it discusses how AVIs store color palettes, but that's done in a completely different way from the above. AVIs use a BITMAPINFOHEADER with RGBQUADs: Different header, different byte order in the entries. There's pretty much no resemblance at all to what we've looked at here. It's just as well I didn't read the damn documentation first.

Log in or register to write something here or to contact authors.