D. Patrick Caldwell on Software Engineering: Providing both Authentication and Anonymity

I was reading a little tgdaily today and I found an article about a new iPhone app that may be showing up in the app store in the near future. The new application is called Trapster and it's a "social-networking speed trap warning website."

I know what you're asking. Well, I don't actually know what you're asking, but what you should be asking is, "Why are you blogging about this Patrick? You spend most of your time writing Human Resources Software for paperless onboarding and business process automation." Well, you're right, but I'm always fascinated by new ideas, new technology, and of course, social networking. I watched the tgdaily video, I read the article, and it got me to thinking . . . and blogging.

Trapster allows users to track their current location, to see where speed traps and cameras are located, and to support the community of "moving violationally" challenged people in their local area by reporting these pesky traffic control devices. It's a neat and clever idea. I started developing a similar app for Windows Mobile a few years ago, but abandoned the project, mostly because at the time there just weren't that many mobile devices with built-in GPS receivers. One feature I considered for ensuring the validity of the speed trap data was to collect statistics on how frequently users reported a speed trap in the same location and on the estimated duration that the speed trap was in place, but this would obviously require a large user base.
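
The time-and-frequency idea can be sketched roughly as follows. This is only an illustration, not code from the abandoned app: the function name `trap_statistics`, the `(lat, lon, timestamp)` report shape, and the choice to treat coordinates that round to the same three decimal places as "the same location" are all assumptions.

```python
from collections import defaultdict


def trap_statistics(reports, precision=3):
    """Group anonymous reports by rounded location and summarize each group.

    reports: iterable of (lat, lon, unix_timestamp) tuples (hypothetical shape).
    precision: decimal places used to decide two reports share a location (assumed).
    """
    cells = defaultdict(list)
    for lat, lon, timestamp in reports:
        cell = (round(lat, precision), round(lon, precision))
        cells[cell].append(timestamp)

    stats = {}
    for cell, timestamps in cells.items():
        stats[cell] = {
            # How many times a trap was reported at this spot.
            "count": len(timestamps),
            # Rough estimate of how long the trap was in place.
            "duration_seconds": max(timestamps) - min(timestamps),
        }
    return stats
```

A location with many reports spread over a long stretch of time looks more credible than a single isolated report, which is why this approach needs a large user base to say anything useful.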

Pete Tenereillo, the maker of Trapster, addressed the issue by allowing users to rate the validity of reports and by using these ratings to calculate a historical "trustworthiness" of any particular user and his or her reports. In my application, I didn't associate reports with users, so I could only calculate a trustworthiness factor on a time-based historical basis, not on a per-user basis. I opted to use time and frequency instead of user ratings because I was concerned about the privacy and security of my users. While unlikely, it is conceivable that an irritated government that finds itself losing revenue from ticketing fees and that sees an increase in brazen drivers may decide to outlaw the reporting of traffic devices. Not only that, but it could also decide to address the interference with officers' duties by requesting access to Trapster's user data.

At first, you would think that your user information and driving history would be unavailable to an interested third party, but in the case of the government, you'd quickly find that this is a difficult battle to win. The law calls these data "regularly kept records" and they are subject to subpoena and seizure. Even search engine giant Google has struggled with this. Google provided 12TB of YouTube user data to Viacom in one case and another dataset to the Brazilian government in another.

The problem is, I really like Tenereillo's brilliant idea of having the community rate contributor data. For one thing, people interested in creating false positives by posting fake reports will quickly have discounted authority in the system. Furthermore, people who consistently try to create false negatives by disagreeing with other raters will also have reduced consideration by the system (at least, that's how I presume it will work 'cause that's what I would do). So, how do you calculate inter-observer reliability if you don't keep user data around? Furthermore, if you do keep user records, how can you keep track of user submissions without being able to relate users to their submissions? So, my question is this: is it possible to provide both authentication and anonymity with the same system?

I've put some thought into the problem and I'm off to a start with a potential pattern to solve it. The problem is demonstrated in the scenario below:

1. User registers with username and password
2. System assigns user id
3. System hashes and stores password
4. User logs in
5. User reports speed trap
6. System records trap report with user id
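
As a rough illustration, the conventional scenario above might look like the sketch below. The in-memory `users` and `reports` structures stand in for database tables, and every function name is hypothetical. Plain SHA-256 is used only to keep the sketch short; it is an assumption, not a recommendation for password storage.

```python
import hashlib
import itertools

users = {}    # username -> {"user_id": int, "password_hash": str}
reports = []  # each report row carries the reporter's user_id
_next_user_id = itertools.count(1)


def hash_password(password):
    # Assumption: unsalted SHA-256, chosen for brevity only.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def register(username, password):
    """Assign a user id and store only the hash of the password."""
    user_id = next(_next_user_id)
    users[username] = {"user_id": user_id, "password_hash": hash_password(password)}
    return user_id


def log_in(username, password):
    """Return the user id on success, or None if authentication fails."""
    user = users.get(username)
    if user is None or user["password_hash"] != hash_password(password):
        return None
    return user["user_id"]


def report_trap(user_id, lat, lon):
    """Record a trap report tied directly to the user id."""
    reports.append({"user_id": user_id, "lat": lat, "lon": lon})


def reports_for(username):
    """Anyone holding the data store can list a named user's reports."""
    user_id = users[username]["user_id"]
    return [r for r in reports if r["user_id"] == user_id]
```

Note that `reports_for` needs nothing but the stored data and a username, which is exactly the link a subpoena could exploit.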

Most systems will work with a design that approximates this one. In this configuration, the system is aware of the relationship between reports and users, the system can provide historical report data, and a third party could subpoena historical user reports. An alternative would be, obviously, to save speed trap reports without a user id. As discussed before, the system would thus be unaware of the relationship between users and reports and a third party couldn't subpoena these data, but there would be no way to relate reports to each other. There is, however, a third alternative. Imagine the above scenario modified as follows:

1. User registers with username and password
2. System assigns user id
3. System hashes and stores password
4. System hashes password + username and stores it with hash id
5. User logs in
6. User reports speed trap
7. System records trap report with hash id
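
The modified scenario could be sketched like this. Again, the in-memory structures and function names are hypothetical stand-ins for database tables and procedures. Two details are my own assumptions rather than part of the steps above: a `":"` separator between password and username, and a random hash id, so the hash id cannot be matched to the user id simply by counting.

```python
import hashlib
import secrets

users = {}       # username -> {"user_id": int, "password_hash": str}
identities = {}  # hash(password + username) -> hash id; holds no username
reports = []     # each report carries a hash id, never a user id


def _sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _combined_hash(username, password):
    # Assumption: a separator keeps ("ab", "c") and ("a", "bc") distinct.
    return _sha256(password + ":" + username)


def register(username, password):
    users[username] = {
        "user_id": len(users) + 1,
        "password_hash": _sha256(password),
    }
    # Assumption: a random hash id instead of a sequential one.
    identities[_combined_hash(username, password)] = secrets.token_hex(8)


def log_in(username, password):
    """Authenticate, then return the hash id derived from the credentials."""
    user = users.get(username)
    if user is None or user["password_hash"] != _sha256(password):
        return None
    return identities[_combined_hash(username, password)]


def report_trap(hash_id, lat, lon):
    reports.append({"hash_id": hash_id, "lat": lat, "lon": lon})


def history(hash_id):
    """Reports from the same (unnamed) contributor can still be grouped."""
    return [r for r in reports if r["hash_id"] == hash_id]
```

One caveat the sketch does not handle: insertion order or timestamps in a real data store could still hint at which identity row belongs to which user row.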

With this pattern, you can provide historical report data even though the system is unaware of the relationship between users and reports, because the reports can be associated with other reports from the same user. In fact, the user can even view his or her own history. When the user logs in, he or she enters the password, which is then hashed with the username and stored in short-term memory rather than in the application database. Passing this hash to procedures in the database will allow the user to retrieve historical post data and will allow the system to calculate "user trustworthiness" statistics even though it cannot associate specific users with their posts.
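
Here is a minimal sketch of what passing that hash to the database might look like, using an in-memory SQLite table in place of stored procedures. The table layout, the function names, and the trust formula (the fraction of ratings that were positive) are all assumptions for illustration. To keep it short, the sketch stores the combined hash directly on each report.

```python
import hashlib
import sqlite3


def identity_hash(username, password):
    # Computed at login and kept only in session memory, never next to the username.
    # Assumption: SHA-256 with a ":" separator between password and username.
    return hashlib.sha256((password + ":" + username).encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (identity TEXT, location TEXT, upvotes INTEGER, downvotes INTEGER)"
)


def add_report(identity, location, upvotes=0, downvotes=0):
    conn.execute(
        "INSERT INTO reports VALUES (?, ?, ?, ?)",
        (identity, location, upvotes, downvotes),
    )


def history(identity):
    """Return this contributor's report locations, oldest first."""
    rows = conn.execute(
        "SELECT location FROM reports WHERE identity = ? ORDER BY rowid", (identity,)
    )
    return [row[0] for row in rows]


def trustworthiness(identity):
    """Fraction of ratings that were positive, or None with no ratings yet."""
    up, down = conn.execute(
        "SELECT COALESCE(SUM(upvotes), 0), COALESCE(SUM(downvotes), 0) "
        "FROM reports WHERE identity = ?",
        (identity,),
    ).fetchone()
    total = up + down
    return up / total if total else None
```

Neither query ever touches a username or user id; the only key is the hash the session derived from the credentials.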

For the sake of deeper explanation, here's how the system provides both authentication and anonymity. Hashing the password and storing it in the user table allows you to securely keep authentication information because a cryptographic hash cannot practically be reversed, thus providing the authentication function. The other hash, of the username concatenated with the password, provides a unique identifier for the user that also cannot practically be reversed. This way, once the application has authenticated and validated the user, it can then use the second hash for retrieving and posting data. Hashing data which are not stored in the database (i.e., the password) means that the database alone cannot be used to associate the users with historical report data, thus providing anonymity.

One concern I haven't yet addressed (though it is truly unlikely to be an issue) is that by hashing the same password twice, you make it slightly easier to brute-force the password.
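
One possible way to soften that concern, offered as a sketch and not as a vetted design, is to derive the two values with a deliberately slow key-derivation function and different salts, so that every password guess costs an attacker more work. The example uses PBKDF2 from Python's standard library. The iteration count, the `"identity:"` salt prefix, and all function names are assumptions.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; tune for your hardware


def auth_hash(password, salt):
    """Slow hash stored in the user table for authentication."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)


def identity_hash(username, password):
    """Slow hash used as the anonymous identifier on reports.

    The salt is rebuilt from the username at login, so nothing that links
    the result to the user has to be stored.
    """
    salt = b"identity:" + username.encode("utf-8")
    derived = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return derived.hex()


def new_user_record(password):
    """Create the stored authentication record with a random per-user salt."""
    salt = os.urandom(16)
    return {"salt": salt, "password_hash": auth_hash(password, salt)}


def verify(record, password):
    """Constant-time comparison of the stored and recomputed hashes."""
    return hmac.compare_digest(record["password_hash"], auth_hash(password, record["salt"]))
```

This does not remove the underlying trade-off: anyone who obtains the user table and guesses a password correctly can still recompute the identity hash. It only raises the cost of each guess.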


Wednesday, September 10, 2008
