Unfortunately E2 doesn't like using UTF-8 in its XML streams. This can cause myriad problems for people writing E2 clients of various forms, since the XML standard specifies that XML streams are to be UTF-8 unless otherwise specified.

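Thankfully it's fairly easy to convert straight 8-bit ASCII to UTF-8. (Strictly speaking ASCII is only 7 bits. The method here treats each byte value from 0x00 to 0xFF as its own character code, which is how ISO-8859-1 maps onto Unicode.)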

The Brief Explanation as to What I'm Doing Here

Since 8-bit ASCII goes from (in hexadecimal) 0x00 to 0xFF (0 to 255), the maximum length of our UTF-8 character will be 16 bits (two bytes). A two-byte UTF-8 sequence carries 11 bits of data, so it will hold any value up to 0x7FF (2047), which is more than enough.

Say we have a character c. If c is less than or equal to 0x7F (127) then in UTF-8 it will be represented as:

0xxxxxxx in binary

If, however, c is greater than 0x7F then it gets represented as (again in binary):

110xxxxx 10xxxxxx

We can see that if c is an 8-bit ASCII value then it will actually be:

110000xx 10xxxxxx

since we only have 8 bits to deal with, so the top three x bits of the first byte are always 0.
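
As a rough sketch of the layout above, the following C helpers work out how many UTF-8 bytes an 8-bit value needs and put a 110000xx 10xxxxxx pair back together into the original value, which shows that nothing gets lost along the way. The names utf8_length and utf8_pair_to_byte are made up for illustration and aren't part of any library.

```c
#include <stddef.h>

/* Hypothetical helper: number of UTF-8 bytes needed for one 8-bit value. */
size_t utf8_length(unsigned char c)
{
	return (c <= 0x7F) ? 1 : 2;
}

/* Hypothetical helper: recover the original 8-bit value from a two-byte
   sequence of the form 110000xx 10xxxxxx. */
unsigned char utf8_pair_to_byte(unsigned char first, unsigned char second)
{
	/* the low two bits of the first byte are bits 7 and 6 of the value,
	   the low six bits of the second byte are bits 5 to 0 */
	return (unsigned char)(((first & 0x03) << 6) | (second & 0x3F));
}
```

For instance, the pair 0xC3 0xA9 (11000011 10101001) goes back to 0xE9 (11101001).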

So to convert an ASCII character to UTF-8 we need to do the following:

If the character has a value of less than or equal to 127 then we leave it alone.

If it is greater than 127 then we need to split it into two bytes. For the first byte we want only the upper two bits of our character, which move from bits 7 and 6 down to bits 1 and 0, so we right shift our ASCII character by six places. We also need to make sure we have our 110 in bits 7, 6 and 5, so we bitwise OR our shifted value against 11000000 (0xC0 in hex). To put it more simply we have: 11000000 OR 000000xx to get 110000xx.

For the second byte it's a bit easier. We want only the lower six bits from our ASCII character, so we block off the top two bits by doing a bitwise AND against 00111111 (0x3F in hex) to ensure that the top two bits will both be 0. We then need to ensure that the top two bits are 10, so we bitwise OR our byte with 10000000 (0x80 in hex). Simply we do: (00111111 AND xxxxxxxx) OR 10000000 to get 10xxxxxx.

So without further ado, here are some code examples.
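
The two steps can be sketched in isolation as a pair of one-line C functions. The names utf8_first_byte and utf8_second_byte are hypothetical and only here to illustrate the arithmetic; both assume the input is greater than 127.

```c
/* Hypothetical helper: first byte, 11000000 OR 000000xx gives 110000xx. */
unsigned char utf8_first_byte(unsigned char c)
{
	return (unsigned char)((c >> 6) | 0xC0);
}

/* Hypothetical helper: second byte,
   (00111111 AND xxxxxxxx) OR 10000000 gives 10xxxxxx. */
unsigned char utf8_second_byte(unsigned char c)
{
	return (unsigned char)((c & 0x3F) | 0x80);
}
```

As a worked case, take 0xE9 (11101001). Shifting right by six leaves 00000011, and ORing with 0xC0 gives 11000011 (0xC3). Masking with 0x3F leaves 00101001, and ORing with 0x80 gives 10101001 (0xA9). So 0xE9 becomes the two bytes 0xC3 0xA9.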

In each example the function will get as its input a single character and output a (null terminated if necessary) string of one or more characters.

Please note that I'm writing this from work and as such can't test the following code snippets. The C version I'm pretty sure works as advertised since I'm using it in a different application at the moment, but I'm not 100% sure about the Perl version.

In C:

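#include <stdlib.h>

/* Returns a calloc()ed, null-terminated string; the caller must free() it. */
char *ascii_to_utf8(unsigned char c)
{
	unsigned char *out;
	if(c < 128) {
		out = (unsigned char *)calloc(2, sizeof(char));
		if(out == NULL) return NULL;
		out[0] = c;
		out[1] = '\0';
	} else {
		out = (unsigned char *)calloc(3, sizeof(char));
		if(out == NULL) return NULL;
		out[0] = (c >> 6) | 0xC0;	/* first byte: 110000xx */
		out[1] = (c & 0x3F) | 0x80;	/* second byte: 10xxxxxx */
		out[2] = '\0';
	}
	return (char *)out;
}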
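
Since an XML stream is a whole string rather than a single character, it may be handier to convert a buffer in one go instead of allocating per character. The following is only a sketch under that assumption: ascii_string_to_utf8 is a made-up name, and it applies the same two-byte rule to every byte above 127.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts a whole null-terminated 8-bit string.
   Returns a malloc()ed UTF-8 string that the caller must free(),
   or NULL if the allocation fails. */
char *ascii_string_to_utf8(const char *in)
{
	size_t len = strlen(in);
	size_t i, j = 0;
	/* worst case: every input byte turns into two output bytes */
	unsigned char *out = (unsigned char *)malloc(2 * len + 1);

	if (out == NULL)
		return NULL;

	for (i = 0; i < len; i++) {
		unsigned char c = (unsigned char)in[i];
		if (c < 128) {
			out[j++] = c;	/* 0xxxxxxx, left alone */
		} else {
			out[j++] = (unsigned char)((c >> 6) | 0xC0);	/* 110000xx */
			out[j++] = (unsigned char)((c & 0x3F) | 0x80);	/* 10xxxxxx */
		}
	}
	out[j] = '\0';
	return (char *)out;
}
```

Allocating twice the input length plus one is wasteful for mostly 7-bit text, but it avoids a second pass to count the bytes above 127.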

And in Perl:

sub ascii_to_utf8
{
	my $c = ord(shift(@_));
	if($c < 128) {
		return chr($c);
	} else {
		# first byte 110000xx, second byte 10xxxxxx
		return pack("C*", ($c >> 6) | 0xC0, ($c & 0x3F) | 0x80);
	}
}

More languages might be added later... :)
