Thursday, October 28, 2010

Results of the "What Really Makes a Good Programmer" Survey

Back in September I wrote a blog post called What Really Makes a Good Programmer? My goal was to ask various members of the development community what traits they thought contributed to the quality of a programmer. If you haven't taken the survey yet, I'd recommend you do that before reading the results. Wouldn't want to bias your opinions, right?

If you have taken the survey, but haven't told everyone you know to take it too, go ahead and do that; I'll wait.

Now that we've covered that, let's get on to the analysis. As you recall from taking the survey, there were 17 traits and you rated them on a scale of Very Important, Important, A little Important, Negligible, and Completely Unimportant. In order to do any analysis, I had to take these ratings and convert them to numbers.

I decided that A Little Important was the baseline and basically represented a lack of opinion on the topic. These traits are the ones most people feel are nice to have but aren't requirements. Basically, a good programmer often has these traits but not having them doesn't mean you're not a good programmer. I recoded these values with 0 points.

Very Important and Completely Unimportant are the ratings that people used when they felt strongly about the item. That means that, respectively, the trait is either absolutely necessary to being a good programmer or the trait has no bearing on the quality of programmer you are whatsoever (or perhaps a little bit of a negative indicator). These values were recoded as 10 and -10 respectively.

The central values of Important and Negligible are what I call the non-committal values. You feel that they make a difference but they're not quite important enough to bank on them. I gave these a 3 and -3 respectively.

For example, "Has good problem solving skills" was rated Very Important and "Cheap" was rated Completely Unimportant. One could say, "someone with poor problem solving skills would make a poor programmer." On the other hand, you cannot say, "someone who is not cheap is a poor programmer." Of course, the contrapositive could also be stated that, "someone who is a good programmer is also a good problem solver." The contrapositive "someone who is a good programmer is also cheap" is considered by the community to be a largely invalid assertion.

"Fast" and "Co-located" appear nearest to the baseline. This may be due to the fact that many of the respondents didn't know what I meant by co-located. In any case, one might say, "a good programmer is a good programmer whether or not she's in the same building," or that "just because you're fast doesn't mean you're good and just because you're slow doesn't mean you're bad."

"Communicates effectively" and "Interested in helping teammates" are moderately rated in favor of contributing to being a good programmer while the two "college degree" items are rated moderately against contributing to being a good programmer. For example, you might know a lot of programmers who are great developers but who lack the social skills or interest to become good communicators or team players. As a result, you might say that good programmers often communicate effectively and help teammates, but some good programmers cannot.

You might also say, "Many people without college degrees, let alone a computer science degree, are great programmers. Thus, while having a degree is helpful, you can learn to be a great programmer without it."

Now, I feel it is worth noting that "unimportant" does not mean "negative." Just because most of the development community feels that having a CS degree or other certifications is unimportant to being a good programmer, it doesn't mean that the degree itself is unimportant. It just means that having the degree won't make you a good programmer; you still have a lot of learning to do.

With notes on the analysis out of the way, what do you say we take a look at some data? Currently, the new Google charts won't let me force the display of the categories, so you'll have to hover over the dots you're interested in. Here's a legacy chart you can look at if the javascript version below is unsatisfactory. The legacy chart has current data; it's just not as fancy as this one:
I'm sure you noticed there are two series in this graph. All Responses represents the average responses for all respondents; however, as you noticed in the respondent histogram that I have a dramatically unbalanced sample with programmers outnumbering all other groups combined by almost 3 to 1. To get a more accurate representation of the "general feel" of the community, I included the Group Average measure. This is the average of the averages. It's sort of the electoral college of informal research.

This chart demonstrates the average responses for each group:

Improving Search Performance when using SQL Server 2008 Encrypted Columns

This is a story of courage, honor, and data encryption. In addition to being something of a tribute to one of my favorite games of all time, I do feel I need to preface this post with a bit of a disclaimer.

Generally, when I describe this problem to someone, we go through pretty much the same conversation.

Why didn't you try Sql Server 2008 Transparent Data Encryption? Why don't you just update your ACL? Why don't you try this? Or that?

Well, sometimes the environment, either technologically or politically, precludes some of your better options. Ultimately, you have to go with your best permissible solution to solve the problem at hand. That being said, let's say you've been charged with securing your organizations PII. Your only option is to encrypt the column with SQL Server's column encryption.

You start looking at the impact this is going to have on your current and future applications. If you're like me, one of the first concerns you're going to have is performance. For example, let's say you have the need to search on the encrypted field. A good example of an encrypted field that you'll likely have to search is the Social Security Number.

In my original benchmarks, I found the following results:Knowing that I was going to have to improve this performance, I started playing with some searching alternatives. First of all, it's obviously much faster to search an indexed column (especially one that is clustered), so my first goal was to generate an indexed column.

Ryan McGarty (my best friend and a damn fine programmer) and I discussed and quickly ruled out a basic hash. Sure, it'd be fast to search an indexed column containing the hash value, but that opens you up to a relatively simple plain text attack (especially with a known set of possible values). I decided that you could reduce the threat and still speed up the search by using hash buckets a la hashtables instead.

I concocted a hash function that produced a reasonable distribution within the buckets. My good friend and esteemed colleague David Govek pointed out that an MD5 hash would produce a pretty effective distribution between buckets (with high cardinality data like the social security number. David was right and I ended up with this hash function:

CREATE FUNCTION [dbo].[GroupData] 
(
 @String VARCHAR(MAX),
 @Divisor INT
) 
RETURNS INT
AS BEGIN

  DECLARE @Result INT;
  
  SET @Result = HASHBYTES('MD5', @String);
  
  IF (@Divisor > 0)
    SET @Result =  @Result % @Divisor;
  
  RETURN @Result;

END

I created the test data just as I had before, except that this time I also added the hash bucket. I created a clustered index on the bucket and I changed my select statement a little:

-- original select statement
SELECT *
FROM PersonData
WHERE 
  CONVERT(CHAR(9), DECRYPTBYKEY(SocialSecurityNumberEncrypted)) = '457555462'

-- hash key select statement
-- @Divisor is the divisor for the modulus operation in the hash function
SELECT *
FROM PersonData
WHERE 
  SocialSecurityNumberGroup = dbo.GroupData('457555462', @Divisor) AND
  CONVERT(CHAR(9), DECRYPTBYKEY(SocialSecurityNumberEncrypted)) = '457555462'

Here are my results:

So, what of the known plain text attack then?

So, I don't want to gloss over the plain text attack issue. Given the possible set of socials, the hashes would be very unlikely to have a collision. Thus, for most people, you'd be able to get their socials easily by hashing every possible social and joining to that table. By modulo dividing the hash value, I'm able to evenly distribute social security numbers among a known set of buckets. That means, I can control the approximate number of socials in each bucket given my set of values. I generally aim for about 1000 socials per bucket.

For example, an MD5 % 100 has a possible set of values from -100 to +100. That's 201 buckets so if you have 2010 rows of data to hash, you'll have about 10 rows per bucket. The benefit is that you now only have to decrypt 10 rows to find the exact row you're looking for. The detriment is that you've narrowed your possible result set. Within your own data set, you'd have narrowed it to 10 possible plaintext values; however, given that these values are unknown, you then have to look at the set of possible values.

Social security numbers have a set of possible values less than 1,000,000,000. It's hard to say exactly how many of them are the possible set of values, so let's say we're using only those socials currently in use by living Americans. The population of the United States at the time of writing was about 312,000,000. As I said, I usually aim for about 1,000 records per bucket. If I have 1,000,000 rows of data, I would modulo divide by 500 (1,001 buckets). If you knew my set of values, then you'd only have 1,000 possible values to reduce. Given that you know when and where I was born, you could probably narrow it to 20 or so possible socials.

But, you don't know my set of values, so you really have 312,000 values to chose from. Even if you did know the first 5 digits of my social security number (based on my state of issuance and my date of birth), your set of possible options would be so large, you'd probably be better off just pulling my credit report and getting my social that way.

Thus, while it seems to introduce a weakness to plain text attacks (and with low cardinality data it would be an issue), in the case of social security numbers, I don't believe it to be a reasonable attack.

Column Encryption in SQL Server 2008 with Symmetric Keys

Most of the time when I write blog posts, I do it to share ideas with my fellow developers. Sometimes I do it just so I can have a place to reference when I forget the syntax for something. This is one of those reference posts.

Recently I've been charged to column level encrypt some personally identifiable information. The present post is not intended to discuss the merits of column level encryption; rather, as I said it is to put a few code snippets up so that I can reference them later. If you should find yourself in a column level encryption predicament in a SQL Server 2008 environment, you may find these useful as well.

First thing's first. Get the database ready for column level encryption by creating a master key:

--if there is no master key create one
IF NOT EXISTS 
(
  SELECT * 
  FROM sys.symmetric_keys 
  WHERE symmetric_key_id = 101
)
CREATE MASTER KEY ENCRYPTION BY 
  PASSWORD = 'This is where you would put a really long key for creating a symmetric key.'
GO

Now, you'll need a certificate or a set of certificates with which you will encrypt your symmetric key or keys:

-- if the certificate doesn't, exist create it now
IF NOT EXISTS
(
  SELECT *
  FROM sys.certificates
  WHERE name = 'PrivateDataCertificate'
)
CREATE CERTIFICATE PrivateDataCertificate
   WITH SUBJECT = 'For encrypting private data';
GO

Once you have your certificates, you can create your key or keys:

-- if the key doesn't exist, create it too
IF NOT EXISTS
(
  SELECT *
  FROM sys.symmetric_keys
  WHERE name = 'PrivateDataKey'
)
CREATE SYMMETRIC KEY PrivateDataKey
  WITH ALGORITHM = AES_256
  ENCRYPTION BY CERTIFICATE PrivateDataCertificate;
GO

Before you can use your symmetric key, you have to open it. I recommend that you get in the habit of closing it when you're finished with it. The symmetric key remains open for the life of the session. Let's say that you have a stored procedure in which you open the symmetric key to decrypt some private data which your stored procedure uses internally. Someone who has access to the stored procedure can run it and then will have the key opened for use in decrypting private data. My point, close the key before you leave the procedure. Here's how you open and close keys.

-- open the symmetric key with which to encrypt the data.
OPEN SYMMETRIC KEY PrivateDataKey
   DECRYPTION BY CERTIFICATE PrivateDataCertificate;

-- close the symmetric key
CLOSE SYMMETRIC KEY PrivateDataKey;

Here's a little test script I wrote to demonstrate a few points. First, the syntax for encrypting and decrypting. Second, the fact that the the cipher text changes each time you do the encryption. This prevents a plain text attack.

-- open the symmetric key with which to encrypt the data.
OPEN SYMMETRIC KEY PrivateDataKey
   DECRYPTION BY CERTIFICATE PrivateDataCertificate;

-- somewhere to put the data
DECLARE @TestEncryption TABLE
(
  PlainText VARCHAR(100),
  Cipher1 VARBINARY(100),
  Cipher2 VARBINARY(100)
);

-- some test data
INSERT INTO @TestEncryption (PlainText)
SELECT 'Boogers'
UNION ALL
SELECT 'Foobar'
UNION ALL
SELECT '457-55-5462'; -- ignoranus

-- encrypt twice
UPDATE @TestEncryption
SET 
  Cipher1 = ENCRYPTBYKEY(KEY_GUID('PrivateDataKey'), PlainText),
  Cipher2 = ENCRYPTBYKEY(KEY_GUID('PrivateDataKey'), PlainText);

-- decrypt and display results  
SELECT
  *,
  CiphersDiffer = CASE WHEN Cipher1 <> Cipher2 THEN 'TRUE' ELSE 'FALSE' END,
  PlainText1 = CONVERT(VARCHAR, DECRYPTBYKEY(Cipher1)),
  PlainText2 = CONVERT(VARCHAR, DECRYPTBYKEY(Cipher2))
FROM @TestEncryption;

-- close the symmetric key
CLOSE SYMMETRIC KEY PrivateDataKey;