D. Patrick Caldwell on Software Engineering: September 2008

Friday, September 12, 2008

Statistical Method for Estimating Software Projects

As the Vice President for Research and Development at Emerald Software Group, a large part of my job comprises managing software projects from conception to completion. As a programmer in a management position, I've discovered a few things. First, I am very good at estimating how many human hours of programming time it will take to write an application. Second, I am not very good at estimating how many man hours it will take to complete the entire project life cycle. So, with a background in psychology, experience with statistics for social sciences, and a knack for inquisitive observation, I set out to generate a formula for providing accurate and timely estimates by isolating the variable that I am best at estimating . . . development hours.

I started by observing my work with one of our larger clients because, after all, they were impetus for my little side-project anyhow. I took notes on all things client related. How much time did we spend talking to the client? How many people were in conference calls? How much time did it usually take to deploy in the client's environment? How much time did it take to test an application before deployment? I gathered as much data as I could and once I felt like I had a reasonable sample, I looked at the trends in the data. I noticed that four separate components stood out: coding time, testing time, deployment time, conference calls. I've renamed these factors, which I call the 4 Ds, and they are the basis for my formula: develop, debug, deploy, and discuss.

So, my records showed that for this particular client, if I spent 10 hours coding, I would have to spend 15 hours testing. If I spent 5 hours coding, I would spend 7.5 hours testing. I generally do most of my testing as I go along, so it was a little difficult to arrive at these numbers, but I found that I reliably spend 1.5 times as much time testing as I do programming. Thus, for every n hours I spend coding, I will spend 1.5n hours testing. I also noted that we spent an average of 1 hour deploying our applications to this client regardless of the amount of time spent programming so I add 1 hour for deployment regardless of the project size.

Finally, conference calls. Conference calls were kind of difficult to factor out of the raw data. It turns out that for this client, we always have at least 1 conference call and the average conference call lasts about 45 minutes. I also noted that in addition to the one conference call we have for every project, there's also one conference call for every six hours of development. Thus, to figure out how much time we spent discussing a project, I calculated (⌈n/6⌉ + 1) and multiplied that by 45 minutes or 3/4 hour. The problem was that this only lined up for projects with low complexity.

I reviewed the data again to find the source for the discrepancy. The time spent in conference calls after factoring out development time had a particularly high standard deviation. I decided there must be another factor. I thought about the many meetings we've had discussing projects for this client and I realized what was causing the within measure variability; the more hours we spent developing, the more people had to be involved with the conference call. I took that factor out and identified an average of 1.5 people per call and the estimates became a little more accurate.

The problem was that there was still some variability, not only in conference calls, but throughout the entire formula as well. To keep a long explanation from getting longer, I found that the overall variability wasn't a factor within the formula itself per se, rather a complexity factor that generally increased all of the estimates throughout the formula. Thus, my entire formula needed to factor in complexity anywhere the formula referenced development time.

My finished formula was:
h ≈ develop + debug + deploy + discuss
h ≈ cn + 1.5cn + 1 + 3 (1.5 * (⌈cn / 6⌉ + 1)) / 4

After simplifying:
h ≈ 2.6875cn + 2.125

Now, any time I get a new project for this client, I estimate the number of hours and I guess the unfortunately subjective complexity. I apply the above formula and the estimates are accurate within a 95% confidence interval when compared with actual times.

Wednesday, September 10, 2008

Providing both Authentication and Anonymity

I was reading a little tgdaily today and I found an article about a new iPhone app that may be showing up in the app store in the near future. The new application is called Trapster and it's a "social-networking speed trap warning website."

I know what you're asking. Well, I don't actually know what you're asking, but what you should be asking is, "Why are you blogging about this Patrick? You spend most of your time writing Human Resources Software for paperless onboarding and business process automation." Well, you're right, but I'm always fascinated by new ideas, new technology, and of course, social networking. I watched the tgdaily video, I read the article, and it got me to thinking . . . and blogging.

Trapster allows users to track their current location, to see where speed traps and cameras are located, and to support the community of "moving violationally" challenged people in their local area by reporting these pesky traffic control devices. It's a neat and clever idea. I started developing a similar app in windows mobile a few years ago, but abandoned the project, mostly because at the time there just weren't that many mobile devices with built in GPS receivers. One feature I considered for ensuring the validity of the speed trap data was to collect statistics on the frequency with which users reported a speed trap in the same location and the estimated duration that the speed trap was in place, but this would obviously require a large user base.

Pete Tenereillo, the maker of Trapster, addressed the issue by allowing users to rate the validity of reports and by using these ratings to calculate a historical "trustworthyness" of any particular user and his or her reports. In my application, I didn't associate reports with users so there was no way for me to calculate a trustworthyness factor on a per user basis rather than on a time-based historical basis. I opted to use time and frequency instead of user ratings because I was concerned about the privacy and security of my users. While unlikely, it is concievabe that an irritated government that finds itself losing revenue from ticketing fees and which sees an increase in brazen drivers may decide that they would like to outlaw the reporting of traffic devices. Not only that, but they could also decide to address the interferance of officers' duties by requesting access to Trapster's user data.

At first, you would think that your user information and driving history would be unavailable to an interested third party, but in the case of the government, you'd quickly find that this is a difficult battle to win. The law calls these data "regularly kept records" and they are subject to subpoena and seisure. Even search engine giant Google has suffered with this. Google provided 12TB of YouTube user data to Viacom in one case and another dataset to the Brazillian government in another case.

The problem is, I really like Tenereillo's brilliant idea of having the community rate contributor data. For one thing, people interested false positives by posting fake reports will quickly have discounted authority in the system. Furthermore, people who consistently try to create false negatives by disagreeing with other raters will also have reduced consideration by the system (At least, that's how I presume it will work 'cause that's what I would do). So, how do you calculate inter-observer reliability if you don't keep user data around? Furthermore, if you do keep user records, how can you keep track of user submissions without being able to relate users to their submissions? So, my question is this: is it possible to provide both authentication and anonymity with the same system?

I've put some thought into the problem and I'm off to a start with a potential pattern to solve it. The problem is demonstrated in the scenario below:

User registers with username and password

System assigns user id

System hashes and stores password

User logs in

User reports speed trap

System records trap report with user id

Most systems will work with a design that approximates this one. In this configuration, the system is aware of the relationship between reports and users, the system can provide historical report data, and a third party could subpoena historical user reports. An alternative would be, obviously, to save speed trap reports without a user id. As discussed before, the system would thus be unaware of the relationship between users and reports and a third party couldn't subpoena these data, but there would be no way to relate reports to eachother. There is, however, a third alternative. Imagine the above scenario modified as follows:

User registers with username and password

System assigns user id

System hashes and stores password

System hashes password + username and stores it with hash id

User logs in

User reports speed trap

System records trap report with hash id

With this pattern, you can provide historical report data even though the system is is unaware of the relationship between users and reports because the reports can be associated with other reports from the same user. In fact, the user can even view his or her own history. When the user logs in, he or she enters the password which is then hashed with the username and is stored in short-term memory rather than in the application database. Passing this hash to procedures in the database will allow the user to retrieve historical post data and will allow the system to calculate "user trustworthyness" statistics even though it cannot associate specific users with their posts.

For the sake of deeper explanation, here's how the system provides both authentication and anonymity. Hashing the password and storing it in the user table allows you to securely keep authentication information because a hash cannot be reversed, thus providing the authentication funciton. The other hash of the username concatenated with the password provides a unique identifier for the user that also cannot be reversed. This way, once the application has authenticated and validated the user, it can then use the second hash for retrieving and posting data. Hashing data which are not stored in the database (i.e., the password) means that the database alone cannot be used to associate the users with historical report data, thus providing anonymity.

One concern I haven't yet addressed (though it is truly unlikely to be an issue) is that by hashing the same password twice, you make it slightly easier to bruteforce the password.