As a Unix programmer, most of the data files I deal with are text files, which are not only human readable but also easy for a program to parse. Recently I discovered the joy of reading and writing binary data structures to and from disk. This is probably old hat to most C hackers, but hopefully it will help the newbies.

By writing data structures in their native form to a file, you eliminate a lot of potential errors from converting data members or parsing the text file. You lose some of the flexibility of human readable files with variable length data members, but if you have to do fast reading and writing of fixed length data to disk, doing it without the conversion is the way to go.


First, let's take a simple example. You want to save and restore personal information about a user. The information will have the following components:

  • First Name
  • Last Name
  • Age
  • Record Creation Time

The first thing you want to do is create a structure to hold this data.

struct user
{
   char first_name[30];
   char last_name[30];
   unsigned int age;
   time_t create_time;
};

Now, the old way of dealing with saving and restoring this data would be to write it in text format to a file, delimiting the parts of the record with an identifier. Reading the record would be reading the file line by line, parsing the data on the delimiter and copying the data back into variables. For example:

/* Writing */
void write_data( struct user* data )
{
   FILE* fp = 0;
   char* buffer = 0;
   
   /* allocate and zero the buffer */
   buffer = malloc ( 80 );
   memset( buffer, 0, 80 );

   /* copy the data to a string, one record per line so fgets() can read it back */
   snprintf( buffer, 80,
      "%s|%s|%u|%ld\n", 
      data->first_name, data->last_name, data->age, (long) data->create_time );
   
   /* open the file in append mode (note that error checking is left as an exercise to the reader) */
   fp = fopen( "data.dat", "a" );
   fputs( buffer, fp );
   free( buffer );
   fclose( fp );
}

In the above we opened a file (in append mode, or we'd overwrite our file with each new data record), and wrote a string with (for example) "bob|barker|45|1000000000". This method creates a human readable file that is then parsed when it is read, such as with the following:

/* Reading */
void read_data()
{
   FILE* fp = 0;
   char* buffer = 0;
   char tokens[] = "|\n";
   struct user data;

   /* allocate and zero the buffer */
   buffer = malloc ( 80 );
   memset( buffer, 0, 80 );

   /* open the file in read mode */
   fp = fopen( "data.dat", "r" );
   while( fgets( buffer, 80, fp ))
   {
      /* the line is in 'buffer', now parse it into the 'data' variable using strtok() */
      strcpy( data.first_name, strtok( buffer, tokens ));
      strcpy( data.last_name, strtok( NULL, tokens ));
      data.age = atoi( strtok( NULL, tokens ));
      data.create_time = atol( strtok( NULL, tokens ));
   }
   free( buffer );
   fclose( fp );
}

Pretty ugly, huh? This is example code and has absolutely no error checking in it. If you were to make this into any sort of a "real" application you would have to check the return value of strtok() to ensure that it was not returning NULL, and make sure that the data was of the correct type and that the type conversions worked properly. If your program does parsing within parsing, you will have to use strtok_r(), the reentrant version of strtok(), because strtok() keeps its position in hidden static state.
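
To give an idea of what those checks might look like, here is a minimal sketch of a checked parser for one line. The function name parse_user_line() and the choice of strtol()/strtoll() for the conversions are my own assumptions for illustration, not part of the code above.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

struct user
{
   char first_name[30];
   char last_name[30];
   unsigned int age;
   time_t create_time;
};

/* Hypothetical helper: parse one "first|last|age|time" line into 'out'.
   'line' is modified in place by strtok().
   Returns 0 on success, -1 if the line is malformed. */
int parse_user_line( char* line, struct user* out )
{
   char tokens[] = "|\n";
   char* end = 0;

   char* first = strtok( line, tokens );
   char* last = strtok( NULL, tokens );
   char* age = strtok( NULL, tokens );
   char* created = strtok( NULL, tokens );

   /* a missing field shows up as a NULL token */
   if( !first || !last || !age || !created )
      return -1;

   /* names must fit the fixed size arrays, terminator included */
   if( strlen( first ) >= sizeof out->first_name ||
       strlen( last ) >= sizeof out->last_name )
      return -1;

   /* strtol() reports stray characters through 'end' and overflow through errno */
   errno = 0;
   long age_val = strtol( age, &end, 10 );
   if( errno != 0 || *end != '\0' || age_val < 0 )
      return -1;

   errno = 0;
   long long time_val = strtoll( created, &end, 10 );
   if( errno != 0 || *end != '\0' )
      return -1;

   strcpy( out->first_name, first );
   strcpy( out->last_name, last );
   out->age = (unsigned int) age_val;
   out->create_time = (time_t) time_val;
   return 0;
}
```

The caller would pass each buffer returned by fgets() and skip or report any line for which the function returns -1.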


Now, for the "better" way of doing things. Using the same data structure as above, let's do this cleaner, faster and cheaper. By writing the data structure directly to the file we lose the human readability, but gain in other areas. There is no need to do data conversion or string parsing. To do this you use the read() and write() functions.

/* Writing */
void write_data( struct user* data )
{
   int fd = 0;
   /* open the file in append mode; O_CREAT needs a permission mode as the third argument */
   fd = open( "data.dat", O_RDWR|O_CREAT|O_APPEND, 0644 );
   
   /* write the binary structure right to the file */
   write( fd, data, sizeof( struct user ));
   close( fd );
}

Of course there's error checking to do for the open, write and close calls, but it's still much simpler. The reading is also much easier.

/* Reading */
void read_data()
{
   int fd = 0;
   struct user data;
   
   /* open the file */
   fd = open( "data.dat", O_RDONLY );
   
   /* read whole records into 'data'; stop at end of file (0), on error (-1) or on a partial record */
   while( read( fd, &data, sizeof( struct user )) == (ssize_t) sizeof( struct user ))
   {
      printf("Just read:\nfirst: %s\nlast: %s\nage: %u\ncreate time: %ld\n",
         data.first_name, data.last_name, data.age, (long) data.create_time );
   }

   /* close file */
   close( fd );
}

Again, not much to it. Just tell read() where you want to put the data and how big it is. In this case I declared the user struct as a local variable (just to be different) rather than allocating it with malloc(). The sizeof operator takes care of determining the number of bytes to read. All the function does is read into the data variable until it reaches EOF (read() returns 0) and print out each record. Of course there should be good error checking, and you will probably do more than just print out the data, but those are the minor details that are filled in when this is turned into a real application.
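
For the curious, here is one way the error checking might be filled in. This is only a sketch: the names append_user() and load_users(), the path parameter and the 0644 file mode are assumptions of mine, and a real application may want to retry short writes or report errno.

```c
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

struct user
{
   char first_name[30];
   char last_name[30];
   unsigned int age;
   time_t create_time;
};

/* Hypothetical helper: append one record to 'path'.
   Returns 0 on success, -1 on any failure. */
int append_user( const char* path, const struct user* data )
{
   int fd = open( path, O_WRONLY|O_CREAT|O_APPEND, 0644 );
   if( fd == -1 )
      return -1;

   /* anything other than a full record written counts as a failure */
   ssize_t n = write( fd, data, sizeof( *data ));
   int rc = ( n == (ssize_t) sizeof( *data )) ? 0 : -1;

   if( close( fd ) == -1 )
      rc = -1;
   return rc;
}

/* Hypothetical helper: read up to 'max' records from 'path' into 'out'.
   Returns the number of records read, or -1 on an open error,
   a read error or a trailing partial record. */
long load_users( const char* path, struct user* out, long max )
{
   int fd = open( path, O_RDONLY );
   if( fd == -1 )
      return -1;

   long count = 0;
   ssize_t n = 0;
   while( count < max &&
          ( n = read( fd, &out[count], sizeof( struct user ))) == (ssize_t) sizeof( struct user ))
      count++;
   close( fd );

   /* n == 0 is a clean EOF; a full record means we stopped because 'out' is full */
   if( n != 0 && n != (ssize_t) sizeof( struct user ))
      return -1;
   return count;
}
```

The point of comparing against sizeof( struct user ) rather than testing for non-zero is that read() returns -1 on error, which a plain truth test would treat as "keep going".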

Hopefully this will help programmers such as myself find new and interesting ways of doing their programming tasks.

Reactions
Thanks for all the msg's regarding bits of bad code or spelling errors. Kudos to y'all.

Wicker808 notes that writing binary files also has the disadvantage of not being portable between different OSs or compilers, due to differences in endianness and struct padding.
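
To see what a particular compiler does with the struct, something like the following sketch can be used. The helper names are made up for illustration, and the numbers it prints are deliberately not shown here because they depend on the platform (the width of time_t and the alignment rules both vary).

```c
#include <stddef.h>
#include <stdio.h>
#include <time.h>

struct user
{
   char first_name[30];
   char last_name[30];
   unsigned int age;
   time_t create_time;
};

/* size of one member without needing an actual object */
#define MEMBER_SIZE( type, member ) sizeof( ((type*) 0)->member )

/* Bytes the four members would occupy if laid end to end with no padding. */
size_t user_packed_size( void )
{
   return MEMBER_SIZE( struct user, first_name )
        + MEMBER_SIZE( struct user, last_name )
        + MEMBER_SIZE( struct user, age )
        + MEMBER_SIZE( struct user, create_time );
}

/* Bytes of padding this compiler inserted into struct user (may well be 0). */
size_t user_padding_bytes( void )
{
   return sizeof( struct user ) - user_packed_size();
}

/* Print the layout so two platforms can be compared by eye. */
void print_user_layout( void )
{
   printf( "sizeof( struct user ): %zu\n", sizeof( struct user ));
   printf( "offset of age:         %zu\n", offsetof( struct user, age ));
   printf( "offset of create_time: %zu\n", offsetof( struct user, create_time ));
   printf( "padding bytes:         %zu\n", user_padding_bytes());
}
```

If two machines print different values, a data.dat written on one will not read back correctly on the other.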

flamingweasel also mentions that the old way of doing things isn't all that great due to the fixed buffer size and the potential for overruns.

tftv256 says that for the first example, I would just use fprintf() or dprintf() - that way, you don't have to do any mallocs at all.
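
A sketch of what that suggestion could look like with fprintf(). The function name write_data_text() and the path parameter are assumptions for illustration, and the cast to long assumes the timestamp fits in a long on your platform.

```c
#include <stdio.h>
#include <time.h>

struct user
{
   char first_name[30];
   char last_name[30];
   unsigned int age;
   time_t create_time;
};

/* Hypothetical helper: append one record as a '|' delimited line.
   Returns 0 on success, -1 on failure. */
int write_data_text( const char* path, const struct user* data )
{
   FILE* fp = fopen( path, "a" );
   if( fp == NULL )
      return -1;

   int rc = 0;

   /* fprintf() formats straight into the stream, so no scratch buffer is needed */
   if( fprintf( fp, "%s|%s|%u|%ld\n",
          data->first_name, data->last_name, data->age,
          (long) data->create_time ) < 0 )
      rc = -1;

   /* buffered output is only flushed here, so fclose() can fail too */
   if( fclose( fp ) == EOF )
      rc = -1;
   return rc;
}
```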

ariels says You really should pass all ints through htonl() or htons() on write, and the reverse functions on read. Otherwise your files cannot be written on one platform and read from another. E.g. Sun and PC will have this problem. See big endian and little endian for details.
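
Here is a rough sketch of that idea. The 72 byte layout, the names user_encode() and user_decode(), and the decision to split the timestamp into two 32 bit halves (htonl() only handles 32 bits) are all my own assumptions, not something from the node above. It also assumes unsigned int is no wider than 32 bits.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

struct user
{
   char first_name[30];
   char last_name[30];
   unsigned int age;
   time_t create_time;
};

/* assumed on-disk layout: 30 + 30 bytes of names, 4 of age, 8 of time */
#define USER_WIRE_SIZE 72

/* Hypothetical helper: pack 'in' into a fixed, byte order independent buffer. */
void user_encode( const struct user* in, unsigned char out[USER_WIRE_SIZE] )
{
   uint32_t age = htonl( (uint32_t) in->age );
   uint64_t t = (uint64_t) in->create_time;
   uint32_t hi = htonl( (uint32_t)( t >> 32 ));
   uint32_t lo = htonl( (uint32_t)( t & 0xFFFFFFFFu ));

   memcpy( out, in->first_name, 30 );
   memcpy( out + 30, in->last_name, 30 );
   memcpy( out + 60, &age, 4 );
   memcpy( out + 64, &hi, 4 );
   memcpy( out + 68, &lo, 4 );
}

/* Hypothetical helper: the reverse of user_encode(). */
void user_decode( const unsigned char in[USER_WIRE_SIZE], struct user* out )
{
   uint32_t age, hi, lo;

   memcpy( out->first_name, in, 30 );
   memcpy( out->last_name, in + 30, 30 );
   memcpy( &age, in + 60, 4 );
   memcpy( &hi, in + 64, 4 );
   memcpy( &lo, in + 68, 4 );

   out->age = ntohl( age );
   out->create_time = (time_t)(( (uint64_t) ntohl( hi ) << 32 ) | ntohl( lo ));
}
```

You would then write() and read() the 72 byte buffer instead of the struct itself, which sidesteps struct padding as well as byte order.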


Please /msg me with any corrections or errors that may be found in this node.

I have problems with the above approach. Aside from the obvious flaw of not playing nice, and suddenly not being able to use all of my handy tools like grep and head and cut and fold on said data, here is why this is bad:

There's a reason why UNIX people tend to put their files in ASCII. And it's not that hard to use fprintf(), fscanf(), dprintf(), ... Everybody else is doing it!

Don't pretend ASCII storage is limited to UNIX, either. Look at all the Windows applications with INI files, or take a look at the data forks of files in an (old) Mac OS box's Preferences folder. Or hey, look at XML.


If you really want to use this approach, though, here's a tip: the mmap() system call.
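
A minimal sketch of the mmap() route, assuming the file holds nothing but whole struct user records written on the same machine. The function name scan_users_mmap() and the "largest age" calculation are just placeholders for whatever you actually want to do with the records.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

struct user
{
   char first_name[30];
   char last_name[30];
   unsigned int age;
   time_t create_time;
};

/* Hypothetical helper: map 'path' and walk it as an array of records.
   Stores the largest age seen in '*max_age'.
   Returns the number of records, or -1 on failure. */
long scan_users_mmap( const char* path, unsigned int* max_age )
{
   struct stat st;
   int fd = open( path, O_RDONLY );
   if( fd == -1 )
      return -1;

   /* the file must be a whole number of records */
   if( fstat( fd, &st ) == -1 ||
       (size_t) st.st_size % sizeof( struct user ) != 0 )
   {
      close( fd );
      return -1;
   }

   size_t len = (size_t) st.st_size;
   long count = (long)( len / sizeof( struct user ));
   *max_age = 0;

   /* mmap() rejects a zero length mapping, so handle the empty file here */
   if( len == 0 )
   {
      close( fd );
      return 0;
   }

   const struct user* users = mmap( NULL, len, PROT_READ, MAP_PRIVATE, fd, 0 );
   close( fd );   /* the mapping stays valid after the descriptor is closed */
   if( users == MAP_FAILED )
      return -1;

   /* the file contents are now addressable like an ordinary array */
   for( long i = 0; i < count; i++ )
      if( users[i].age > *max_age )
         *max_age = users[i].age;

   munmap( (void*) users, len );
   return count;
}
```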
