Friday, September 2, 2011

7-bit Encoding with BinaryWriter in .Net

JetBrains dotPeek LogoAt work this week, I had the need to serialize objects and encrypt them while trying to keep the smallest data footprint I could. I figured the easiest thing to do would be to binary serialize them with the BinaryFormatter. It was, indeed, the easiest thing to do; however, the BinaryFormatter seems to come with a fair amount of overhead. By the time the BinaryFormatter was finished listing all of the necessary assembly-qualified type names, the 60 bytes of data I wanted to preserve were more than a kilobyte!

I needed another way so I extended BinaryWriter (mostly to get it to serialize the types I needed it to) and now my 60 bytes take 68 bytes to serialize. In the process of writing this class, I looked through the disassembled BinaryWriter using JetBrains's dotPeek and found this little gem (Write7BitEncodedInt(int)) and decided it'd make a great dotPeek of the Week:
protected void Write7BitEncodedInt(int value)
{
  uint num = (uint) value;

  while (num >= 128U)
  {
    this.Write((byte) (num | 128U));
    num >>= 7;
  }

  this.Write((byte) num);
}
This is how the BinaryWriter.Write(string) encodes the length of the string. Thus, if you have a string with fewer than 128 characters, it only takes one byte to encode the length. In my implementation, I used this to specify the length of collections too. In fact, when writing a 32 bit signed integer, you should be able to save space for all positive numbers less than 2 ^ 22 and break even (require four bytes) for 2 ^ 29. Encoding an Int32 that is greater than or equal to 2 ^29 will require five bytes. Thus, if your integers tend to be smaller than 536,870,912, you'll probably save space encoding this way. A similar function could be used for a long where all positive values less than 2 ^ 57 will result in at least breaking even.

Here's how it works:
  1. Convert the number to an unsigned int so you can do arithmetic on positive numbers (after all, you're just writing bits)
  2. While the converted value is greater than or equal to 128 (i.e., 8 bits)
    1. Write low 7 bits and put a 1 in the high bit (to indicate to the decoder more bytes are coming)
    2. Shift the 7 bits you just wrote off the number
  3. When the loop finishes, there will be 7 or fewer bits to write so write them
I think this is really clever and very easy. Reading the data back is a smidgen more complicated (Read7BitEncodedInt()):
protected internal int Read7BitEncodedInt()
{
  // some names have been changed to protect the readability
  int returnValue = 0;
  int bitIndex = 0;

  while (bitIndex != 35)
  {
    byte currentByte = this.ReadByte();
    returnValue |= ((int) currentByte & (int) sbyte.MaxValue) << bitIndex;
    bitIndex += 7;

    if (((int) currentByte & 128) == 0)
      return returnValue;
  }

  throw new FormatException(Environment.GetResourceString("Format_Bad7BitInt32"));
}
Here's how this works:
  1. Set up an int to accumulate your data as you read them
  2. Set up a place to keep track of which 7-bit block you're reading
  3. While your bit index is less than 35 (i.e., you've read no more than 5 bytes comprising 1 "more bytes" indicator and 7 data bits)
    1. Read a byte
    2. Take the byte you just read and logical conjuction (bitwise and) it with sbyte.MaxValue (127 or in binary 0111 1111)
    3. Left shift those 7 bits to the position in which they belong
    4. Use a logical disjunction (bitwise or) to write those seven bits into your accumulator int
    5. Add 7 to your bit index for the next 7 bits you read
    6. If the current byte you read does not have a 1 in its most significant bit (i.e, byte & 1000 0000 == 0)
      1. There are no more bytes to read so just return the current accumulator value
  4. If you get to this point, you've read a 6th of 5 bytes and it's time to let the user know there was a formatting problem

No comments:

Post a Comment