Tuesday, July 1, 2008

Fast binary visualization (to hex conversions) and back

This spin-off started with the immutable System.Data.Linq.Binary class. It's somewhat irritating because it does not expose any read-only access to its contents. As such, one must rely either on the ToString method, which surrounds the base64-encoded value in quotes, or on the ToArray method, which uses defensive copying.
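
A minimal sketch of the irritation, assuming a reference to System.Data.Linq (the Demo method is made up; the ToString output shown in the comment is the quoted base64 form mentioned above):

static void Demo()
{
    System.Data.Linq.Binary binary = new System.Data.Linq.Binary( new byte[] { 0x12, 0x34 } );

    string text = binary.ToString(); // the base64 value in quotes: "EjQ=" (quote characters included)
    byte[] copy = binary.ToArray();  // hands back a fresh copy of the internal array on every call

    // No indexer and no read-only view, so every look at the raw bytes pays for a copy.
}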

At this point, I wonder why they never defined a public getter, like the indexer on the string class (string s; s[0]), since what I need is to be able to serialize this object. Object serialization comes in many flavors; I've chosen one which uses reflection to emit a collection of strongly typed serializers. It's way faster than anything you get with the built-in stuff, but I had purposely ignored System.Data.Linq.Binary because I hadn't used it, up until now.

LINQ to SQL maps SQL Server timestamps to the Binary type, and you need timestamps to apply optimistic concurrency, so I have to be able to handle them as well. And this is where this hack'ish hex stuff comes in.
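
For context, a typical LINQ to SQL mapping for such a column looks roughly like this (the table, class and property names are invented for the example):

using System.Data.Linq;
using System.Data.Linq.Mapping;

[Table( Name = "dbo.Orders" )]
public class Order
{
    [Column( IsPrimaryKey = true )]
    public int Id { get; set; }

    // The timestamp/rowversion column maps to Binary and is what LINQ to SQL
    // compares when it performs the optimistic concurrency check on updates.
    [Column( IsVersion = true, IsDbGenerated = true, AutoSync = AutoSync.Always )]
    public Binary Version { get; set; }
}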

In the end, timestamps are 8 bytes in size, so I accept the defensive-copying overhead that takes place. I rely on the ToArray method and the .ctor(byte[]) constructor for getting and setting the internal representation.

When you look at hex, the 4 high bits and the 4 low bits of a byte are each usually referred to as a nibble. A nibble maps to one hexadecimal character, usually in the range [0-9A-F], but why? There's really no relation between the binary value and its ASCII representation.
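
To see what I mean, this is roughly what the conventional mapping has to do; because '9' (0x39) and 'A' (0x41) are not adjacent in ASCII you need either a branch or a table (the helper name is mine):

static char NibbleToHex( int nibble )
{
    // 0-9 map to '0'-'9' and 10-15 map to 'A'-'F'; the gap between '9' and 'A'
    // is why a plain bitwise trick doesn't cut it for the conventional alphabet.
    return (char)(nibble < 10 ? '0' + nibble : 'A' + nibble - 10);
}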

I gave it some thought, looked at the bit patterns, and this is what I came up with.

Each nibble represents a 4-bit integer in the range 0-15. As such, if I were to use a lookup table I could map this space to any 16 characters, but I chose not to do that. Instead I searched the byte range for a bit pattern which would not involve any lookup map and would still remain a presumably safe ASCII representation.
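
The lookup-table alternative I decided against would look something like this sketch (the table contents are arbitrary, any 16 distinct characters would do):

static readonly char[] NibbleMap = "0123456789ABCDEF".ToCharArray();

static void AppendByte( System.Text.StringBuilder sb, byte b )
{
    // One table lookup per nibble; flexible, but it touches memory
    // where a plain bitwise OR would not.
    sb.Append( NibbleMap[b >> 4] );
    sb.Append( NibbleMap[b & 0xf] );
}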

Funnily enough, the commercial at '@' character happens to not use the low 4 bits of its byte ('@' is 0x40), and the following 16 characters constitute "ABCDEFGHIJKLMNOP", the beginning of the alphabet. What's even more coincidental is that caret notation is exactly this (Ctrl+C really is the ASCII character at zero-based index 3).

But enough talk, let's look at the code:

public static string Serialize( this System.Data.Linq.Binary binary )
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    byte[] bytes = binary.ToArray();
    for ( int i = 0 ; i < bytes.Length ; i++ )
    {
        // Unpack: high nibble first, then low nibble.
        // '@' is 0x40, so the low 4 bits of each output character carry the nibble unmodified.
        sb.Append( (char)('@' | (bytes[i] >> 4)) );
        sb.Append( (char)('@' | (bytes[i] & 0xf)) );
    }
    return sb.ToString();
}

public static System.Data.Linq.Binary ToBinary( string s )
{
    byte[] bytes = new byte[s.Length >> 1];
    for ( int i = 0 ; i < bytes.Length ; i++ )
    {
        // Pack: the & 0xf mask strips the '@' marker from the low character,
        // and the (byte) cast truncates it away from the shifted high character.
        bytes[i] = (byte)((s[i << 1] << 4) | (s[(i << 1) + 1] & 0xf));
    }
    return new System.Data.Linq.Binary( bytes );
}

The above C# code can be used as is. But for those of you who find bit manipulation a bit daunting, I recommend taking a while to understand the code (make sure you fully understand the impact of explicit and implicit type casts in conjunction with bit arithmetic).
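
As a concrete illustration of what those casts do, here is a walk-through of the high-nibble half of the pack line in ToBinary (the intermediate values in the comments are mine):

char c = 'A';                     // 0x41, i.e. '@' | 1, carrying the nibble value 1
int shifted = c << 4;             // 0x410: the char is promoted to int, so nothing is lost yet
byte packedHigh = (byte)shifted;  // 0x10:  the cast truncates to 8 bits, discarding the '@' marker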

Once unpacked, each nibble occupies only the 4 low bits of its byte, and since we want to be able to represent them as strings, the bitwise | with '@' (0x40) gives us a very nice ASCII representation. One which is also very efficient to parse, as each nibble is stored unmodified together with the '@' character: we simply bitwise & with the mask 0xf to get just the low part, and then bit shift everything into place. Another nice property of this method is that the byte ordering of the binary value carries over unchanged to the textual representation.
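
To make the round trip concrete, here is a small usage sketch with a made-up 3-byte value (the string in the comment is exactly what the mapping above produces):

static void RoundTrip()
{
    System.Data.Linq.Binary original = new System.Data.Linq.Binary( new byte[] { 0x00, 0x3A, 0xFF } );

    string text = original.Serialize();                    // "@@CJOO": 0x00 -> "@@", 0x3A -> "CJ", 0xFF -> "OO"
    System.Data.Linq.Binary restored = ToBinary( text );

    // restored.ToArray() is { 0x00, 0x3A, 0xFF } again, and the bytes appear in the
    // string in the same order as they appear in the binary value.
}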

This bi-directional relationship actually makes me want to stop using the old-fashioned hexadecimal representation altogether and start using this caret notation stuff. I reckon it's just a matter of wrapping your head around another numerical representation.

There's 10 kinds of people, right?
Those who speak binary and those who don't.

Update July 25, '08: I've observed that the commercial at character '@' is sometimes encoded, despite the fact that every document I could find on the subject says otherwise. So if you want a pure ASCII and URL-safe binary representation, add 1 to each output character during serialization and subtract 1 from each input character during de-serialization. It will do the trick.
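
In code, the adjustment amounts to something like this sketch (the UrlSafe names are mine; everything else follows the two methods above):

public static string SerializeUrlSafe( this System.Data.Linq.Binary binary )
{
    System.Text.StringBuilder sb = new System.Text.StringBuilder();
    byte[] bytes = binary.ToArray();
    for ( int i = 0 ; i < bytes.Length ; i++ )
    {
        // Same unpacking as before, shifted by one so the alphabet becomes 'A'..'P' and '@' never appears
        sb.Append( (char)(('@' | (bytes[i] >> 4)) + 1) );
        sb.Append( (char)(('@' | (bytes[i] & 0xf)) + 1) );
    }
    return sb.ToString();
}

public static System.Data.Linq.Binary ToBinaryUrlSafe( string s )
{
    byte[] bytes = new byte[s.Length >> 1];
    for ( int i = 0 ; i < bytes.Length ; i++ )
    {
        // Undo the shift before packing the nibbles back together
        bytes[i] = (byte)((((s[i << 1] - 1) & 0xf) << 4) | ((s[(i << 1) + 1] - 1) & 0xf));
    }
    return new System.Data.Linq.Binary( bytes );
}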