Small Sample Sizes Don’t Make Studies Invalid

Everyone in the UX field has heard the same complaint from critics, day in and day out: your sample sizes are too small, therefore your findings are meaningless in the grand scheme of things. Coming from a human factors background, I had the same gripe. (That, and the fact that aspects of tests would seemingly change every few participants based on trends or clients’ whims. But time and experience have shown me that those things are not big deals.)

From my perspective, the lack-of-statistical-significance criticism can be quite valid, depending upon the goals of the study. For instance, you should consider using larger sample sizes if you’re conducting any sort of research whose goals include assessing attitudes toward the product. You also want larger sample sizes in summative testing situations, where you are assessing overall system usability based upon a set of multiple metrics collected during test sessions.

So, then, what of formative testing situations and other situations where researchers are restricted to small sample sizes? Well, all is not lost in this case. Let’s consider some typical metrics first. Time on task, task completion rates, task error counts, and task satisfaction are some of the major metrics used to assess usability. I will put these forward one by one for examination.

Time on Task

Time on task is one of those measures that many folks are in love with. Now, in formative studies, where people commonly follow think-aloud protocols, I do not advocate measuring time on task. Arguments exist on both sides: that thinking aloud slows people down, and that it speeds them up because it forces beneficial cognitive processing that propels users along successfully. Regardless of which side you fall on, I’m going to say that it doesn’t represent actual use, since most people don’t use a website or an iPad while talking themselves through the process. This, however, once again raises the question: what’s the purpose of your test? For studies whose purpose is to explore and understand user reactions (which is commonly the case in formative studies), it is probably best to leave time on task out unless you are comfortable with retrospective probing. But if you intend to gain some sort of benchmark from a small sample size, then by all means measure time on task, as there are statistical methods that will lend credibility to your findings.

How to analyze this data when it comes from a small sample size

Student’s t-test. This particular statistical test is designed for testing hypotheses about a mean when the sample size is small and the population variance is unknown. Along with this, consider calculating the confidence interval, which extends from the sample mean in both directions by the margin of error: the product of the critical t-value and the standard error.
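To make that concrete, here is a minimal sketch of a t-based confidence interval in Python. The task times and the helper name are hypothetical, purely for illustration; scipy supplies the critical t-value.

```python
import math

from scipy import stats

def t_confidence_interval(samples, confidence=0.95):
    """Mean and t-based confidence interval for a small sample."""
    n = len(samples)
    mean = sum(samples) / n
    # Sample standard deviation (n - 1 in the denominator)
    sd = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    se = sd / math.sqrt(n)  # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)  # critical t-value
    margin = t_crit * se  # margin of error
    return mean, mean - margin, mean + margin

# Hypothetical task times (in seconds) from a five-participant session
times = [48.2, 62.5, 55.1, 71.0, 59.3]
mean, low, high = t_confidence_interval(times)
print(f"Mean: {mean:.1f}s, 95% CI: [{low:.1f}s, {high:.1f}s]")
```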

Task Completion Rates

This is one of those metrics that needs to be strictly defined before starting a study. Defining what constitutes a success and what constitutes a failure is important, because a lack of understanding could lead to unintentional misclassification. That said, this is a pretty straightforward metric: the number of successful task completions divided by the total number of users who attempted the task. The individual data points, however, are discrete binary data: someone either completes a task or does not. And once again, if you are collecting this information in a study, you should provide confidence intervals. Calculating confidence intervals on task completion data, however, differs from calculating them for time-on-task data, given the nature of the individual data points.

How to calculate confidence intervals for discrete binary data

Modified Wald Interval. The Modified Wald Interval will produce an interval that will contain the true completion rate about 95 percent of the time on average.
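Here is a minimal sketch of the calculation in Python, using the common adjustment of adding z²/2 successes and z²/2 failures before computing an ordinary Wald interval. The 4-out-of-5 result is a hypothetical example, and it also shows the completion-rate calculation from above.

```python
import math

from scipy import stats

def modified_wald_interval(successes, attempts, confidence=0.95):
    """Modified (adjusted) Wald confidence interval for a binomial proportion."""
    z = stats.norm.ppf((1 + confidence) / 2)  # ~1.96 for a 95% interval
    # Adjust the counts: add z^2/2 successes and z^2/2 failures
    n_adj = attempts + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Hypothetical result: 4 of 5 participants completed the task
low, high = modified_wald_interval(4, 5)
print(f"Completion rate: {4 / 5:.0%}, 95% CI: [{low:.0%}, {high:.0%}]")
```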

What if everyone passed? What if everyone failed?

Well, we all know that unless your client has created the most awesome (or horrifying) product known to man, this is probably just a result of the sample size being so small. So, use the Laplace method to obtain a point estimate. The reason this is a good method is that it is closely tied to the sample size you’re using. It is calculated by adding 1 to the number of observed successes, and then dividing by the number of task attempts plus 2. So: (x+1)/(n+2).
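The formula is simple enough to show directly; here is a tiny sketch with hypothetical all-pass and all-fail sessions of five participants.

```python
def laplace_estimate(successes, attempts):
    """Laplace point estimate: (x + 1) / (n + 2)."""
    return (successes + 1) / (attempts + 2)

# Hypothetical extreme outcomes from a five-participant session
print(f"{laplace_estimate(5, 5):.0%}")  # everyone passed -> 86%, not 100%
print(f"{laplace_estimate(0, 5):.0%}")  # everyone failed -> 14%, not 0%
```

Note how the estimate is pulled away from the extremes, and how a larger sample would pull it less, which is exactly why the method suits small-sample work.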

Task Error Counts

Task error counts are extremely straightforward, as long as what counts as an error is clearly defined at the outset of the study. You simply count how many times users make a mistake while trying to complete a task. Error counts are technically discrete, but they are commonly treated as continuous for analysis, so the approach is the same as for time on task: a t-test and confidence intervals.
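The mechanics mirror the time-on-task sketch above; a self-contained version using scipy’s interval helper might look like this (the error counts are hypothetical):

```python
from scipy import stats

errors = [0, 2, 1, 3, 1]  # hypothetical error counts for five participants
n = len(errors)
mean = sum(errors) / n
sem = stats.sem(errors)  # standard error of the mean
low, high = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
print(f"Mean errors per user: {mean:.1f}, 95% CI: [{low:.1f}, {high:.1f}]")
```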

Task Satisfaction

These are subjective data points taken on a point scale (usually 1 through 5), and the scale is assumed to be continuous and to have equal magnitude between points. In truth, we cannot say that this data has equal magnitude between different points on the scale, since the difference between a 4 and a 5 can mean different things to different users. My own personal preference for reporting statistics with this data is as follows:

For 5 or fewer users: List the range of scores.

For 7 to 15 users: Report the median.

For more than 15 users: I still prefer the median because it rarely produces decimal values, and typically you are not allowing users to give decimal ratings on a scale. However, it is safe to consider using the mean at this point. Just be aware that the mean will likely be a decimal value, which is not a true representation of the scale. A short sketch of this reporting logic follows the list.
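As a closing sketch, here is one way to encode those reporting preferences in Python. The thresholds follow the list above (a sample of 6, which the list skips, is treated here like the 7-to-15 band, purely as an assumption), and the score data is hypothetical.

```python
import statistics

def report_satisfaction(scores):
    """Report satisfaction scores per the sample-size preferences above."""
    n = len(scores)
    if n <= 5:
        return f"Range: {min(scores)}-{max(scores)} (n={n})"
    if n <= 15:
        return f"Median: {statistics.median(scores)} (n={n})"
    # Beyond 15 users the mean becomes reasonable, but it will
    # likely be a decimal value that no participant actually gave.
    return (f"Median: {statistics.median(scores)}, "
            f"mean: {statistics.mean(scores):.2f} (n={n})")

print(report_satisfaction([3, 4, 4, 5, 2]))                  # small sample -> range
print(report_satisfaction([3, 4, 4, 5, 2, 4, 5, 3, 4, 4]))   # mid-size -> median
```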