D. Patrick Caldwell on Software Engineering

Thursday, August 25, 2011

.Net GetHashcode Functions

Last week, I wrote a post about a .Net method to combine hashcodes. In the process of designing that method, I looked into dozens of .Net's GetHashcode implementations.

If you'd like to know the principles of hashcodes, take a look at the aforementioned article. This one is part of the dotPeek of the Week series so I'll just be sharing the insight I got from the framework's implementations here.

First, some really basic ones:

// from Int32
public override int GetHashCode()
{
  return this;
}

// from Int16
public override int GetHashCode()
{
  return (int) (ushort) this | (int) this << 16;
}

// from Int64
public override int GetHashCode()
{
  return (int) this ^ (int) (this >> 32);
}

The Int32 implementation easily meets the hashcode requirements by returning itself. The Int16 copies itself to the top half of the Int32 and returns that. The Int64 takes the bottom half of itself and XORs it with the top half.

This XOR is the first valuable piece of information. If you look at the logical functions, the XOR produces the best bit twiddling for hashing. Basically, the XOR produces a relatively even distribution of bits as opposed to the OR and the AND operations which will be biased 3 to 1 in one direction or the other.

Some more complicated implementations:

// from String
public override unsafe int GetHashCode()
{
  fixed (char* chPtr = this)
  {
    int num1 = 352654597;
    int num2 = num1;
    int* numPtr = (int*) chPtr;
    int length = this.Length;
    while (length > 0)
    {
      num1 = (num1 << 5) + num1 + (num1 >> 27) ^ *numPtr;
      if (length > 2)
      {
        num2 = (num2 << 5) + num2 + (num2 >> 27) ^ numPtr[1];
        numPtr += 2;
        length -= 4;
      }
      else
        break;
    }
    return num1 + num2 * 1566083941;
  }
}

// from Tuple<>
internal static int CombineHashCodes(int h1, int h2)
{
  return (h1 << 5) + h1 ^ h2;
}

These two methods have some really good information in them. The String.GetHashcode implementation has what's called a rolling hash. It loops through the characters, does a barrel shift by 5, and then XORs with the next character.

While I liked this, I preferred the simplicity of the Bernstein Hash. The main component of the Bernstein Hash is the (i << 5) + i. i << 5 == i * 32 (except that bit shifting is much faster than multiplying).

Thus, i << 5 + i == 32i + i == 33i. The Bernstein Hash just takes the current hash value, multiplies it by 33, and XORs in the new value.

I didn't use 33 in my hash function because I feel like using prime numbers is healthy. Thus, instead of adding I subtract so my hashcode method is (hash << 5) - hash ^ value (or 31 * hash ^ value). This, by the way, is the way Java tends to do it.

Here are some interesting ones from System.Drawing:

// from Size
public override int GetHashCode()
{
  return this.width ^ this.height;
}

// from Rectangle
public override int GetHashCode()
{
  return this.X ^ (this.Y << 13 | (int) ((uint) this.Y >> 19)) ^ (this.Width << 26 | (int) ((uint) this.Width >> 6)) ^ (this.Height << 7 | (int) ((uint) this.Height >> 25));
}

// from Point
public override int GetHashCode()
{
  return this.x ^ this.y;
}

Thursday, August 18, 2011

.Net Method to Combine Hash Codes

A while back I developed a helper method that I've been using to aid me in computing good hash values from sets of properties for an object. I was writing a dotPeek of the Week entry about .Net GetHashcode implementations and I decided to spruce up my implementation and post it on my blog.

I know It's pretty uncommon that you need to override the Equals method, but when you do, you have to take particular care in considering overriding the GetHashCode method as well. To make this a little easier, I developed a method I call CombineHashCodes().

Using CombineHashCodes, I can easily compute a well randomized composite hash code based on several parameters from an object. This curtails a lot of the work involved when dealing with objects that have complicated .Equals overrides.

Microsoft's documentation on Object.GetHashCode() lists the following rules:

A hash function must have the following properties:

If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.

The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.

For the best performance, a hash function must generate a random distribution for all input.

Thus, if you override the .Equals method such that it's no longer doing the default reference equality, you probably need to change your GetHashCode implementation. I've found this is usually a very simple task. I whipped up a quick example in LINQPad:

void Main()
{
 var lauren1 = new Person { FirstName = "Lauren" };
 var lauren2 = new Person { FirstName = "Lauren" };
 
 Console.WriteLine(lauren1.Equals(lauren2));
 Console.WriteLine("{0}.GetHashCode() = {1}", "lauren1", lauren1.GetHashCode());
 Console.WriteLine("{0}.GetHashCode() = {1}", "lauren2", lauren2.GetHashCode());
 
 if (lauren1.Equals(lauren2) && lauren1.GetHashCode() != lauren2.GetHashCode())
  Console.WriteLine("This is bad.  This is very bad.");
}

public class Person
{
   public string FirstName { get; set; }
}

/*
False
lauren1.GetHashCode() = 47858386
lauren2.GetHashCode() = 20006478
*/

As one would expect, lauren1 does not equal lauren2 and neither do the hashcodes; however, if we were to override the .Equals method on the Person object to compare just the first name, we'll find that lauren1 and lauren2 will be equal but will have different hashcodes! This violates the first rule of GetHashCode Club. Here's an example:

public class Person
{
   public string FirstName { get; set; }
  
   public override bool Equals (object obj)
   {
  var otherPerson = obj as Person;
  
  if (otherPerson == null)
   return false;
   
  return String.Compare(this.FirstName, otherPerson.FirstName, StringComparison.OrdinalIgnoreCase) == 0;
 }
}

/*
True
lauren1.GetHashCode() = 33193253
lauren2.GetHashCode() = 37386806
This is bad.  This is very bad.
*/

That's why we override GetHashCode as well and something like this will work:

public override int GetHashCode()
{
 return FirstName.GetHashCode();
}

/*
True
lauren1.GetHashCode() = 2962686
lauren2.GetHashCode() = 2962686
*/

That's all well and good (except the .Equals() does a case insensitive compare so "Lauren" and "lauren" will be equal but will result in different hash codes unless you do something like return FirstName.ToLower().GetHashCode(). Why'd I leave that error in the post just to parenthetically correct it . . . because I think it's a good demonstration of how easy it is to screw this up :)). Consider what happens when you end up with a more complex class that looks like this:

public class Person
{
   public string FirstName { get; set; }
 public string LastName { get; set; }
 public DateTime BirthDate { get; set; }
  
   public override bool Equals (object obj)
   {
  var otherPerson = obj as Person;
  
  if (otherPerson == null)
   return false;
   
  return String.Compare(FirstName, otherPerson.FirstName, StringComparison.OrdinalIgnoreCase) == 0
   && String.Compare(LastName, otherPerson.LastName, StringComparison.OrdinalIgnoreCase) == 0
   && BirthDate.Equals(otherPerson.BirthDate);
 }
}

You could leave the GetHashCode function the way it was previously, but then all Lauren's will be lumped in the same bucket together which violates the third rule about a random distribution. I've seen people combat this problem by concatenating the string values of the equals fields delimited by some extremely unlikely string and getting the hash code of that.

I felt like that was not really the best way to do it in case someone decides a colon is a good character in a first name or something worse.

Thus, I wrote something like this (which has recently been modified to what you see now based on my explorations for my dotPeek of the Week series). This is CombineHashCodes:

public static class HashCodeHelper
{
 public static int CombineHashCodes(params object[] args)
 {
  return CombineHashCodes(EqualityComparer<object>.Default, args);
 }

 public static int CombineHashCodes(IEqualityComparer comparer, params object[] args)
 {
  if (args == null) throw new ArgumentNullException("args");
  if (args.Length == 0) throw new ArgumentException("args");

  int hashcode = 0;

  unchecked
  {
   foreach (var arg in args)
    hashcode = (hashcode << 5) - hashcode ^ comparer.GetHashCode(arg);
  }

  return hashcode;
 }
}

With this method, you can get a nice pseudo-random distribution of hash codes without violating any of the GetHashCode rules as long as you include the values relevant to your Equals method in your call to CombineHashCodes:

public override int GetHashCode()
{
 return HashCodeHelper.CombineHashCodes(FirstName.ToLower(), LastName.ToLower(), BirthDate);
}

/*
True
lauren1.GetHashCode() = -891990792
lauren2.GetHashCode() = -891990792
*/

Monday, August 15, 2011

.Net Optimization for Int32

Another entry in the dotPeek of the Week series here. I was digging through some .GetHashcode() implementations (expect that as the next dotPeek of the week) and noticed some interesting implementations of overridden .Equals() methods.

I found this in Int16:

public override bool Equals(object obj)
{
  if (!(obj is short))
    return false;
  else
    return (int) this == (int) (short) obj;
}

public bool Equals(short obj)
{
  return (int) this == (int) obj;
}

So, I checked Byte and sure enough:

public override bool Equals(object obj)
{
  if (!(obj is byte))
    return false;
  else
    return (int) this == (int) (byte) obj;
}

public bool Equals(byte obj)
{
  return (int) this == (int) obj;
}

Can you guess how Char.Equals() is implemented?

public override bool Equals(object obj)
{
  if (!(obj is char))
    return false;
  else
    return (int) this == (int) (char) obj;
}

public bool Equals(char obj)
{
  return (int) this == (int) obj;
}

Even the unsigned int gets cast as a signed int in UInt.Equals(). I was pretty curious, so I looked at some other methods:

public int CompareTo(short value)
{
  return (int) this - (int) value;
}

public int CompareTo(byte value)
{
  return (int) this - (int) value;
}

public int CompareTo(char value)
{
  return (int) this - (int) value;
}

I took note and discussed it with a few friends. It seems to make sense that on a 32 bit operating system, it'd be optimal to work with 32 bit integers. It turns out, even on a 64 bit architecture, .net is optimized for the Int32. I found this gem online:

Best Practices: Optimizing performance with built-in types

The runtime optimizes the performance of 32-bit integer types (Int32 and UInt32), so use those types for counters and other frequently accessed integral variables.

For floating-point operations, Double is the most efficient type because those operations are optimized by hardware.

MCTS Self-Paced Training Kit (Exam 70-536): Microsoft® .NET Framework 2.0—Application Development Foundation

So storing integer values, no matter the ceiling of your expected value, is best done using Int32, particularly if they'll be operated on heavily like an index to a collection or as a counter in an iteration control structure.

Monday, August 8, 2011

Enumerable.Any vs. Enumerable.Count

This is the innagural entry into the section of my blog I'm calling "dotPeek of the Week." Workload permitting, it will be a weekly thing where I'll be using JetBrain's dotPeek - a free .NET decompiler to learn more about the .NET framework and share any interesting things I come up with.

In this post, I'll be sharing a little tip I picked up from my friend David Govek.

I can't .Count() the number of times I've seen a line of code like this one:

if (someEnumerable.Count() > 0) 
    doSomething();

I know I've done that myself a handful of times. I think it comes from the pre-.net 3.5 days when you were used to dealing with ICollections that had a .Count property that returned the value of a private field:

public virtual int Count { get { return this._size; } }

When 3.5 came out, the System.Linq.Enumerable class brought with it a Count extension method. I believe it was at that point that people started using .Count() everywhere, including on IEnumerables, which previously didn't have a method for getting the length of the enumerable.

Most of the time, there was very little pain because the engineers over at Microsoft were clever enough to help us out. The first thing they try to do in the .Count extension method is check to see if the IEnumerable is an ICollection. If it is, they just use the Count property which we already know is plenty fast; however, if it's not an ICollection, they have to iterate the Enumerable and count the elements.

Here's what that looks like:

public static int Count<TSource>(this IEnumerable<TSource> source)
{
  if (source == null)
    throw Error.ArgumentNull("source");

  ICollection<TSource> collection1 = source as ICollection<TSource>;
  if (collection1 != null)
    return collection1.Count;

  ICollection collection2 = source as ICollection;
  if (collection2 != null)
    return collection2.Count;

  int num = 0;
  using (IEnumerator enumerator = source.GetEnumerator())
  {
    while (enumerator.MoveNext())
      checked { ++num; }
  }
  return num;
}

If your source is indeed an Enumerable, this is still a fine way to find out how many items there are. The problem is, in the sample code where we're just simply checking to ensure that the Enumerable isn't empty, using .Count can prove costly. Checking that the count is greater than 0 has to enumerate the entire collection and count each item despite the fact that we know it's not empty as soon as we spot the first element.

Fortunately, the clever folks at Microsoft thought of this and also gave us .Any(). This extension method simply gets the Enumerator, calls .MoveNext(), and disposes the Enumerator. MoveNext tries to move to the next element in the collection and returns true until it passes the end of the collection.

Here's what the .Any() method looks like:

public static bool Any(this IEnumerable<TSource> source)
{
  if (source == null)
    throw Error.ArgumentNull("source");

  using (IEnumerator<TSource> enumerator = source.GetEnumerator())
  {
    if (enumerator.MoveNext())
      return true;
  }

  return false;
}

Thus, there's no need to Enumerate the entire Enumerable if you can use .Any() like this refactored code:

if (someEnumerable.Any()) 
    doSomething();

Sunday, August 7, 2011

About dotPeek of the Week

Yesterday, I posted the first post of a series I call the dotPeek of the Week. Why am I using a JetBrains product name in my blog? Well, 'cause I like JetBrains. I love ReSharper for C#, IntelliJ for Android development, and of course, dotPeek for decompiling .Net code.

I also want to support JetBrains by encouraging people to use dotPeek so that they'll keep it free, unlike that other one that got bought by that one company and now is a paid app. Hopefully, I won't have to change the name of this section any time soon :).

The other (and primary) objective of this exercise is to learn more about the .Net framework and how it functions. If I find something interesting, I'll share it here on my blog.

Latest dotPeek of the Weeks

All dotPeek of the Week posts

Monday, July 25, 2011

C# Yield Keyword, IEnumerable<T>, and Infinite Enumerations

Visual Studio 2005 came with a slew of .net and .net compiler features. One of those features that I particularly enjoy is the yield keyword. It was Microsoft's way of building IEnumerable (or IEnumerator) classes and generic classes around your iterator code block. I'm not going to discuss the yield keyword much in this post because the feature has been around a long time now and the internet is replete with discussions on the topic.

Despite the long life of the yield keyword, I still find it conspicuous when I see it in projects I work on. I suppose it's just rare that I find myself writing my own enumerable. As a result, when I see yield return, it tends to stand out. I started looking around to see how the rest of the programming world uses the yield keyword wondering if I was under-utilizing the flexibility provided.

Specifically, I wondered if I was missing out on the lazy nature of the iterator and the numerous linq extension methods optimized to take advantage of that aspect of iterators. What I mean by that is that an iterator doesn't need to store each sequential value in memory the way a collection would and thus you can use each value without necessarily increasing the memory overhead. Further, you can take advantage of calculations which tend to already be sequential in nature (like the Fibbonacci sequence for example).

The second thing I thought about was an infinite (well, sort of infinite) enumerable. I'm not sure if I feel like it's a bad idea or not so I wrote these examples to be unending? I may eventually find a use for such an iterator and then get burned when someone tries to call Fibbonacci.Min() and the application throws an overflow exception, but I suppose at that point I'll make it a method and take a sanity check variable.

In the meantime, here are a few examples of some iterators I thought were interesting and fun challenges:

static IEnumerable<ulong> Fibbonacci
{
    get
    {
        yield return 0;
        yield return 1;

        ulong previous = 0, current = 1;
        while (true)
        {
            ulong swap = checked(previous + current);
            previous = current;
            current = swap;
            yield return current;
        }
    }
}

static IEnumerable<long> EnumerateGeometricSeries(long @base)
{
    yield return 1;

    long accumulator = 1;
    while (true)
        yield return accumulator = checked(accumulator * @base);
}

static IEnumerable<ulong> PrimeNumbers
{
    get
    {
        var prime = 0UL;
        while (true)
            yield return prime = prime.GetNextPrime();
    }
}

static IEnumerable<List<uint>> PascalsTriangle
{
    get
    {
        var row = new List<uint> { 1 };
        yield return row;

        while (true)
        {
            var last = row[0];
            for (var i = 1; i < row.Count; i++)
            {
                var current = row[i];
                row[i] = current + last;
                last = current;
            }

            row.Add(1);
            yield return row;
        }
    }
}

Some of these methods use a few extension methods I adapted from places in the .net framework:

public static ulong GetNextPrime(this ulong from)
{
    for (var j = from + 1 | 1UL; j < ulong.MaxValue; j += 2)
        if (j.IsPrime())
            return j;

    return from;
}

public static bool IsPrime(this ulong value)
{
    if ((value & 1) != 0)
    {
        var squareRoot = (ulong)Math.Sqrt((double)value);
        for (ulong i = 3; i <= squareRoot; i += 2)
            if (value % i == 0)
                return false;

        return true;
    }
    return value == 2;
}

To keep the test console app clean, of course, I used my favorite IEnumerable.Each() extension method:

public static void Each<T>(this IEnumerable<T> enumerable, Action<T> action)
{
    foreach (var element in enumerable)
        action(element);
}

Here's the sample code:

static void Main()
{
    Fibbonacci.Skip(10).Take(10).Each(Console.WriteLine);
    Console.WriteLine();

    EnumerateGeometricSeries(2).Take(10).Each(Console.WriteLine);
    Console.WriteLine();

    PrimeNumbers.Where(p => p > 600).Take(10).Each(Console.WriteLine);
    Console.WriteLine();

    foreach (var row in PascalsTriangle.Take(10))
    {
        row.Each(element => Console.Write("{0} ", element)); 
        Console.WriteLine();
    }

    Console.ReadLine();
}

// The second set of 10 elements in the Fibbonacci sequence
// 55 89 144 233 377 610 987 1597 2584 4181

// Base of 2 to the first 10 powers
// 1 2 4 8 16 32 64 128 256 512

// The first 10 prime numbers greater than 600
// 601 607 613 617 619 631 641 643 647 653

// The first 10 rows of Pascal's Triangle
// 1
// 1 1
// 1 2 1
// 1 3 3 1
// 1 4 6 4 1
// 1 5 10 10 5 1
// 1 6 15 20 15 6 1
// 1 7 21 35 35 21 7 1
// 1 8 28 56 70 56 28 8 1
// 1 9 36 84 126 126 84 36 9 1

Wednesday, July 20, 2011

.Net ObjectFormatter - Using Tokens in a Format String

If you've already read this article and you don't feel like scrolling through my sample formats, you can jump directly to ObjectFormatter on github to get the source.

At some point in almost every business application, it seems you eventually run into the ubiquitous email notifications requirement. Suddenly, in the middle of what was once a pleasant and enjoyable project come the dozens of email templates with «guillemets» marking the myriad fields which will need replacing with data values.

You concoct some handy way of storing these templates in a database, on the file system, or in resource files. You compose your many String.Format() statements with the dozens of variables required to format the email templates and you move on to the greener pastures of application development.

Now, you've got a dozen email templates like this one:

Dear {0},

{1} has created a {2} task for your approval. This task must be reviewed between {3} and {4} to be considered for final approval.

This is an automagically generated email sent from an unmonitored email address. Please do not reply to this message. No, seriously . . . stop that. Nobody is going to read what you are typing right now. Don't you dare touch that send button. Stop right now. I hate you. I wish I could hate you to death.

Thank you,
The Task Approval Team

No big deal, everything is going swimmingly, and the application goes into beta. Then, it turns out, the stakeholders don't want an email template that looks like that. That was more of a draft really. Besides, you should've already known what they wanted in the template to begin with. After all, it's like you have ESPN or something.

It's important to add information about the user for whom this action is taking place, so this is your new template:

Dear {0},

{1} has created a {2} task for your approval regarding {5}({6}). This task must be reviewed between {3} and {4} to be considered for final approval.

If you have questions, please contact your approvals management supervisor {7}.

This is an automagically generated email sent from an unmonitored email address. Please do not reply to this message. No, seriously . . . stop that. Nobody is going to read what you are typing right now. Don't you dare touch that send button. Stop right now. I hate you. I wish I could hate you to death.

Thank you,
The Task Approval Team

So far so good. You've updated the template, updated your String.Format() parameters, passed QA and gone into production. But, now that users are actually hitting the system, it turns out that you need a few more changes. Specifically, you need to add contact information for the supervisor, remove the originator of the task, and by the way, what kind of sense does it make to put a low end limit on a deadline? Here's your new template:

Dear {0},

A {2} task for {5}({6}) is awaiting your approval. This task must be reviewed by {4} to be considered for final approval.

If you have questions, please contact your approvals management supervisor {7} at {1}.

This is an automagically generated email sent from an unmonitored email address. Please do not reply to this message. No, seriously . . . stop that. Nobody is going to read what you are typing right now. Don't you dare touch that send button. Stop right now. I hate you. I wish I could hate you to death.

Thank you,
The Task Approval Team

Now you have an email template format with various numbers all over the place, a String.Format() call with more parameters than there are tokens, and you have to go through the QA - deployment cycle again.

I've gone through this process on almost every application throughout my career as a software engineer. Hence the ObjectFormatter. Now, my email template looks like this:

Dear {Employee.FullName},

A {Task.Description} task for {TargetUser.FullName}({TargetUser.UserId}) is awaiting your approval. This task must be reviewed by {DueDate} to be considered for final approval.

If you have questions, please contact your approvals management supervisor {Supervisor.FullName} at {Supervisor.PhoneNumber}.

This is an automagically generated email sent from an unmonitored email address. Please do not reply to this message. No, seriously . . . stop that. Nobody is going to read what you are typing right now. Don't you dare touch that send button. Stop right now. I hate you. I wish I could hate you to death.

Thank you,
The Task Approval Team

I find that the ObjectFormatter makes my templating much easier to maintain and much more flexible. It also usually makes my calling code a lot cleaner. Here's an example of the approaches you could take to populate the sample templates:

// plain string formatting
String.Format(template, Employee.FullName, Supervisor.PhoneNumber, Task.Description, String.Empty, DueDate, TargetUser.FullName, TargetUser.UserId);

// if you have a dto already built
ObjectFormatter.Format(template, myDto);

// if you don't have a dto built
ObjectFormatter.Format(template, new { Employee, Supervisor, Task, DueDate, TargetUser });

I've found that most of the time they ask for template changes, they want me to add some value that is already a property on an object that's already in my object graph because of the current email template. That way, when they come tell me they want he target user's name formatted differently, I don't even need to recompile (well, sometimes I do . . . I mean, I can't predict everything). I can implement a lot of changes by using objects I already know I'm passing into the ObjectFormatter.Format() method. Here's the new template with the changes and I didn't have to change a line of code to make it work:

Dear {Employee.FullName},

A {Task.Description} task for {TargetUser.LastName}, {TargetUser.FirstName}({TargetUser.UserId}) is awaiting your approval. This task must be reviewed by {DueDate} to be considered for final approval.

If you have questions, please contact your approvals management supervisor {Supervisor.FullName} at {Supervisor.PhoneNumber}.

This is an automagically generated email sent from an unmonitored email address. Please do not reply to this message. No, seriously . . . stop that. Nobody is going to read what you are typing right now. Don't you dare touch that send button. Stop right now. I hate you. I wish I could hate you to death.

Thank you,
The Task Approval Team

If you'd like to check out the source or use the ObjectFormatter in your own projects, look for ObjectFormatter on github. If you make any cool changes, please let me know and I'll try to figure out how to merge them into the repository.