Thursday, October 28, 2010

Improving Search Performance when using SQL Server 2008 Encrypted Columns

This is a story of courage, honor, and data encryption. In addition to being something of a tribute to one of my favorite games of all time, I do feel I need to preface this post with a bit of a disclaimer.

Generally, when I describe this problem to someone, we go through pretty much the same conversation:
Why didn't you try SQL Server 2008 Transparent Data Encryption? Why don't you just update your ACLs? Why don't you try this? Or that?

Well, sometimes the environment, either technologically or politically, precludes some of your better options. Ultimately, you have to go with the best permissible solution to the problem at hand. That being said, let's say you've been charged with securing your organization's PII, and your only option is to encrypt the column with SQL Server's column-level encryption.

You start looking at the impact this is going to have on your current and future applications. If you're like me, one of your first concerns is going to be performance. For example, let's say you need to search on the encrypted field. A good example of an encrypted field that you'll likely have to search is the Social Security number.

In my original benchmarks, searching the encrypted column directly was slow enough that I knew I was going to have to improve the performance, so I started playing with some searching alternatives. First of all, it's obviously much faster to search an indexed column (especially one that is clustered), so my first goal was to generate an indexed column.

Ryan McGarty (my best friend and a damn fine programmer) and I discussed and quickly ruled out a basic hash. Sure, it'd be fast to search an indexed column containing the hash value, but that opens you up to a relatively simple plain text attack (especially with a known set of possible values). I decided that you could reduce the threat and still speed up the search by using hash buckets, à la hash tables, instead.

I concocted a hash function that produced a reasonable distribution within the buckets. My good friend and esteemed colleague David Govek pointed out that an MD5 hash would produce a pretty effective distribution between buckets (at least with high-cardinality data like the Social Security number). David was right, and I ended up with this hash function:
CREATE FUNCTION [dbo].[GroupData]
(
  @String  VARCHAR(MAX),
  @Divisor INT
)
RETURNS INT
AS BEGIN

  DECLARE @Result INT;

  -- MD5 returns 16 bytes; converting to INT keeps the last 4 bytes,
  -- which is plenty of entropy for bucketing
  SET @Result = CONVERT(INT, HASHBYTES('MD5', @String));

  -- reduce the hash to a bucket number; T-SQL's % keeps the sign of the
  -- dividend, so the result falls between -(@Divisor - 1) and (@Divisor - 1)
  IF (@Divisor > 0)
    SET @Result = @Result % @Divisor;

  RETURN @Result;

END
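If you want to see what the function actually produces, you can call it directly. The divisor of 500 below is just an illustrative value (it's the one I talk about further down):

-- returns the bucket for a given value; with @Divisor = 500 the result
-- lands somewhere between -499 and 499
SELECT dbo.GroupData('457555462', 500) AS BucketNumber;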
I created the test data just as I had before, except that this time I also added the hash bucket column and put a clustered index on it (a sketch of what that setup might look like follows the queries below). Then I changed my select statement a little:
-- original select statement
-- (assumes the symmetric key protecting the column has already been opened)
SELECT *
FROM PersonData
WHERE 
  CONVERT(CHAR(9), DECRYPTBYKEY(SocialSecurityNumberEncrypted)) = '457555462'

-- hash key select statement
-- @Divisor is the divisor for the modulus operation in the hash function;
-- the seek on the clustered bucket column means only the rows in that bucket get decrypted
SELECT *
FROM PersonData
WHERE 
  SocialSecurityNumberGroup = dbo.GroupData('457555462', @Divisor) AND
  CONVERT(CHAR(9), DECRYPTBYKEY(SocialSecurityNumberEncrypted)) = '457555462'
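For reference, here's roughly what that setup might look like. The table name and the two SSN columns come from the queries above; PersonId, the key name SsnKey, the certificate SsnCert, and the divisor of 500 are assumptions I'm making for illustration:

-- a minimal sketch of the test table; SsnKey and SsnCert are hypothetical names
-- the primary key is nonclustered so the clustered index can go on the bucket column
CREATE TABLE dbo.PersonData
(
  PersonId                      INT IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
  SocialSecurityNumberEncrypted VARBINARY(128)    NOT NULL,
  SocialSecurityNumberGroup     INT               NOT NULL
);

-- cluster on the bucket so each bucket's rows sit together on disk
CREATE CLUSTERED INDEX IX_PersonData_SsnGroup
  ON dbo.PersonData (SocialSecurityNumberGroup);

-- inserting a row: encrypt the value and compute its bucket in one pass
OPEN SYMMETRIC KEY SsnKey DECRYPTION BY CERTIFICATE SsnCert;

INSERT INTO dbo.PersonData (SocialSecurityNumberEncrypted, SocialSecurityNumberGroup)
VALUES (ENCRYPTBYKEY(KEY_GUID('SsnKey'), '457555462'),
        dbo.GroupData('457555462', 500));

CLOSE SYMMETRIC KEY SsnKey;

The important detail is that the clustered index lives on the plain INT bucket column, so the hash key query above can seek straight to one bucket and only decrypt the rows that share it.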

The results were dramatically better: with the clustered bucket index in place, SQL Server only has to decrypt the handful of rows in the matching bucket instead of decrypting its way through the entire table.

So, what of the known plain text attack then?

I don't want to gloss over the plain text attack issue. Given the possible set of socials, hash collisions would be very unlikely, so if I stored the raw hash you'd be able to recover most people's socials easily by hashing every possible social and joining against that lookup table. By modulo dividing the hash value, I'm able to evenly distribute Social Security numbers among a known set of buckets. That means I can control the approximate number of socials in each bucket for my set of values; I generally aim for about 1,000 socials per bucket.

For example, an MD5 % 100 has a possible set of values from -99 to +99. That's 199 buckets, so if you have 2,010 rows of data to hash, you'll have about 10 rows per bucket. The benefit is that you now only have to decrypt about 10 rows to find the exact row you're looking for. The detriment is that you've narrowed the possible result set: within your own data set, you'd have narrowed it to about 10 possible plaintext values. However, given that those values are unknown to an attacker, you then have to look at the full set of possible values.
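As an aside, the buckets span negative and positive numbers because T-SQL's modulo operator keeps the sign of the dividend, and the INT derived from the MD5 hash can be negative. A quick check shows the behavior:

-- the % operator takes the sign of the left-hand operand,
-- so the buckets here range from -99 to 99
SELECT  123456789 % 100 AS PositiveBucket,  -- 89
       -123456789 % 100 AS NegativeBucket;  -- -89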

Social Security numbers have a set of possible values less than 1,000,000,000. It's hard to say exactly how many of those are actually in use, so let's say we're limited to the socials currently held by living Americans. The population of the United States at the time of writing was about 312,000,000. As I said, I usually aim for about 1,000 records per bucket, so if I have 1,000,000 rows of data, I would modulo divide by 500 (999 buckets). If you knew my set of values, you'd only have about 1,000 possible values to work through, and given that you also knew when and where I was born, you could probably narrow it to 20 or so possible socials.

But you don't know my set of values, so you really have about 312,000 values to choose from. Even if you did know the first five digits of my Social Security number (based on my state of issuance and my date of birth), your set of possible options would be so large that you'd probably be better off just pulling my credit report and getting my social that way.

Thus, while the bucketing seems to introduce a weakness to plain text attacks (and with low-cardinality data it would be an issue), in the case of Social Security numbers I don't believe it to be a reasonable attack.
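If you want to check that your own data spreads across the buckets the way the arithmetic suggests, a quick aggregate on the bucket column (using the PersonData table from the sketches above) shows the rows per bucket:

-- rough distribution check: one row per bucket with its row count
SELECT SocialSecurityNumberGroup, COUNT(*) AS RowsInBucket
FROM dbo.PersonData
GROUP BY SocialSecurityNumberGroup
ORDER BY RowsInBucket DESC;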

5 comments:

  1. Hmm, but why is MS SQL Server so slow at searching through encrypted columns? Doesn't it also create hash tables, or does "linear search," as you mentioned in the article, mean that MS SQL Server decrypts records one after another and compares them with the search value?

  2. I haven't looked at the internals, but my impression is that it decrypts a block of values, scans them, and moves on to the next batch.

    I don't know why Microsoft didn't implement their own hash tables. There could be several reasons, though. One could be that they didn't want the overhead of hashing during inserts and updates. Another could be that they didn't want to inadvertently expose plain text attack vectors. Another could be that the configuration required to adequately support indexing of encrypted columns would take nearly as much effort as having developers implement their own hashing. I just don't know.

    I don't advocate using column-level encryption in SQL Server; I don't believe it is the correct approach, and I certainly don't advocate searching encrypted columns. I prefer SQL Server 2008's Transparent Data Encryption. It's a great deal faster and provides better protection of your data.

    For example, your data are encrypted on disk (including TempDB). Thus, if a developer decrypts values into a temp table, those values will still be protected by TDE but not by column encryption.

    Patrick

  3. Whew, you're replying fast...

    Well, looks like column level encryption has more downsides than I thought. But it's interesting stuff. And it's surprising how much faster lookups work with your hash table approach. Thanks for the article.

    Oliver

  4. I try to help. Thanks for reading my blog, Oliver. Good luck with your encryption needs.
