Extension:CampaignEvents/Aggregating participants' responses

When an event ends, all the participants' responses are anonymized and presented to the event organizers in aggregate form. Our goal is to do this in a way that is useful for the organizers (who can then see statistics about their events), while at the same time protecting participant data. Many smaller events will not have enough participants for us to be able to cleverly obfuscate the data, for instance by randomizing the answers: with a small data set, any amount of noise we could add would quickly bias the data and make it almost useless. This page explains what measures were implemented in order to balance anonymization and usefulness.

Question-level threshold
For each question, we first check whether the number of responses meets a minimum threshold. The value is somewhat arbitrary; given the expected amounts of data, we chose 10. If a question received fewer than 10 total responses, no data will be shown for that question. This not only helps us protect the data, but it also keeps the presented information more relevant, given the next measure.

Option-level threshold
For each question that has at least 10 total responses, we show the exact number of participants who chose a given option if and only if that option received at least 5 responses (the option-level threshold); otherwise we just display "less than 5". Here, each option is evaluated independently of the others. With this criterion in mind, it becomes easier to understand the rationale behind a question-level threshold. Without a question-level threshold, most of the options would be reported as having received "less than 5" responses, which is not very useful. In fact, given the values that we have chosen for these thresholds, if a question received fewer than 10 responses, then there can be at most one option with at least 5 responses. This wouldn't be very useful for the organizers, especially if the question has many answer choices.
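The two thresholds can be sketched in a few lines of Python. This is only an illustration, not the extension's actual code: the function and constant names are hypothetical, and it assumes each participant picks exactly one option, so that the per-option counts add up to the question's total number of responses.

```python
from typing import List, Optional, Union

QUESTION_THRESHOLD = 10  # minimum total responses for a question to be shown
OPTION_THRESHOLD = 5     # minimum responses for an option's exact count to be shown


def displayed_counts(counts: List[int]) -> Optional[List[Union[int, str]]]:
    """Return what the organizer would see for one question, or None if hidden.

    `counts` holds the true number of responses per option (hypothetical input;
    assumes one selected option per response).
    """
    if sum(counts) < QUESTION_THRESHOLD:
        # Question-level threshold: too few responses, show nothing at all.
        return None
    # Option-level threshold: each option is checked independently.
    return [c if c >= OPTION_THRESHOLD else "less than 5" for c in counts]
```

With per-option counts of 42, 33, 2, 4 and 9, this returns `[42, 33, "less than 5", "less than 5", 9]`, whereas a question with counts 5, 2 and 2 (9 responses in total) returns `None`.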

In practice, this is what it would look like:

Displaying the number of nonresponses
One of the things we wanted to include in the report is how many participants did not answer a certain question. From now on, let:
 * $$T$$ be the total number of participants; in our example, assume $$T=100$$.
 * $$\mathit{NR}$$ be the number of nonresponses that is shown to the organizer (what we want to determine).
 * $$R_i$$ be the number of responses to option $$i$$.

The naïve approach would be to derive $$\mathit{NR}$$ with a simple subtraction: $$\mathit{NR} := T - \sum_i R_i$$. In our example we would have $$\mathit{NR} = 100 - (42 + 33 + 2 + 4 + 9) = 100 - 90 = 10$$, and therefore:

However, this is problematic. When we did the subtraction, we used all the $$R_i$$ values, even those that are below the threshold and that the organizers cannot see. In practice, from an organizer's point of view, this means that the number of nonresponses actually carries additional information. And because $$\mathit{NR}$$ is derived from (and unambiguously determined by) all the $$R_i$$, this additional information necessarily describes the $$R_i$$ themselves.

Let $$S_5 := \sum_{R_i \ge 5} R_i$$ be the sum of responses across all options with at least 5 responses; in our example, $$S_5=42+33+9=84$$. The organizer still cannot know that 2 people responded "Elephant" and 4 "Penguin", but they can reverse the subtraction and discover that a total of 6 people responded either "Elephant" or "Penguin": $$T - S_5 - \mathit{NR} = 100 - 84 - 10 = 6$$. By seeing the number of nonresponses, the organizer gained additional information; this can be easily proven, for example by observing that $$(\text{Elephant}, \text{Penguin}) = (4,4)$$ was a plausible scenario without seeing the number of nonresponses, but it no longer is plausible in the last table.
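The leak can be reproduced with a short Python sketch. The function name is hypothetical; the point is that the final subtraction uses only values the organizer can see (the total, the visible counts, and the naïvely computed nonresponse number), yet it recovers the combined count of the hidden options.

```python
OPTION_THRESHOLD = 5  # options below this value are shown as "less than 5"


def hidden_total_from_naive_nr(total, counts):
    """Show what an organizer can infer when NR is computed naively.

    `total` is T and `counts` are the true R_i values (illustrative inputs).
    """
    # What we (the system) would publish with the naive approach:
    visible_sum = sum(c for c in counts if c >= OPTION_THRESHOLD)  # S_5
    naive_nr = total - sum(counts)  # uses the hidden counts too

    # What the organizer can compute from published values only:
    return total - visible_sum - naive_nr


leaked = hidden_total_from_naive_nr(100, [42, 33, 2, 4, 9])
print(leaked)  # combined responses of all hidden options
```

For the running example this prints 6, matching the calculation above.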

Sometimes, depending on the number of options and the distribution of responses, the additional information carried by the nonresponse number can actually be enough for the organizer to infer the obfuscated values for the whole dataset. For instance, in the following example, assuming a total of 20 participants:

The organizer could easily deduce that exactly 2 people responded "Dancer".

Solution
To avoid leaking information, our goal is to compute $$\mathit{NR}$$ without using any additional information beyond what the organizer already knows. In the naïve approach, the extra information came from using all the $$R_i$$ in our calculation, when the organizer was not aware of the $$R_i$$ values smaller than 5. For those, all the organizer knows is that $$0 \le R_i < 5$$. In other words, from an organizer's perspective, those $$R_i$$ values lie in the $$[0, 5)$$ interval; because response counts are integers, this is equivalent to saying that $$R_i \in [0,4]$$.

Because these intervals are known to the organizers, we can use elementary interval arithmetic to compute $$\mathit{NR}$$ without using more information than the organizer already knows. Let $$N := \#\{ i \mid R_i < 5\}$$ be the number of options with fewer than 5 responses. We can now change our definition: $$\mathit{NR} := T - S_5 - N \cdot [0,4]$$.

All the values used in this formula are already known to the organizer, meaning this does not carry any additional information. The formula can also be rewritten as follows:

$$\mathit{NR} := T - S_5 - N \cdot [0,4] = T - S_5 - [ 0, 4N ] = [ T - S_5 - 4N, T - S_5 ]$$
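As a minimal sketch, the rewritten formula translates directly into Python. The function and parameter names are hypothetical; the inputs correspond to $$T$$, $$S_5$$ and $$N$$, all of which the organizer can read off the report.

```python
def nonresponse_interval(total, visible_sum, hidden_options, option_threshold=5):
    """Return (lower, upper) bounds for NR using only organizer-visible values.

    total          -- T, the total number of participants
    visible_sum    -- S_5, the sum of all counts shown exactly
    hidden_options -- N, the number of options shown as "less than 5"
    """
    # Each hidden option contributes between 0 and option_threshold - 1 responses.
    max_hidden = (option_threshold - 1) * hidden_options
    upper = total - visible_sum   # all hidden options have 0 responses
    lower = upper - max_hidden    # all hidden options have 4 responses
    return (lower, upper)
```

Note that this version does not yet guard against a negative lower bound.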

In our first example, we have $$\mathit{NR} = [ 100 - 84 - 4 \cdot 2, 100 - 84 ] = [8, 16]$$. Putting it all together:

Usefulness
One thing to note about this calculation is that the resulting $$\mathit{NR}$$ value is somewhat useless, because it does not convey additional information; that is, the organizer could have computed this value on their own by just looking at the rest of the table. However, that's precisely our goal: we don't want the $$\mathit{NR}$$ number to carry any new information. In practical terms, there's still value in showing this number: the organizer doesn't have to do the subtraction themselves, which could lead to arithmetic mistakes, especially if the numbers are large.

Clamping
The formula we're using does a good job of protecting the data, but sometimes it may result in intervals that make little sense. For instance, considering our previous example with 20 participants but slightly different numbers:

Here we have an $$\mathit{NR}$$ interval with negative numbers, which clearly aren't possible in practice. This can easily be addressed by tweaking our formula so that the lower bound is capped at 0, thus yielding "between 0 and 3".

Note that the only reason why we need this correction is that the "less than 5" threshold wasn't accurate in the first place. Because the organizer already knows that 17 out of 20 people responded "Human", they also know that the number of "Dancer" responses must be between 0 and 3.

Likewise, if all 20 people responded "Human", the organizer would know that "less than 5" actually means 0. This is just a limitation in the simple thresholding approach we're using for individual options, and there isn't much we can do about it.
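The clamped lower bound described in this section could be sketched as follows. As before, the names are hypothetical and the inputs are $$T$$, $$S_5$$ and $$N$$.

```python
def clamped_nonresponse_interval(total, visible_sum, hidden_options, option_threshold=5):
    """Return (lower, upper) bounds for NR, with the lower bound capped at 0."""
    upper = total - visible_sum
    # A negative number of nonresponses is impossible, so clamp at 0.
    lower = max(0, upper - (option_threshold - 1) * hidden_options)
    return (lower, upper)


# 20 participants, 17 visible "Human" responses, one hidden option ("Dancer").
print(clamped_nonresponse_interval(20, 17, 1))
```

For the example with 17 "Human" responses this yields `(0, 3)`, i.e. "between 0 and 3"; if all 20 participants responded "Human", it yields `(0, 0)`.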

Worst case scenario
Finally, consider the worst-case scenario for the nonresponse interval, i.e. how wide it can get. Before clamping, the width of the interval is $$4N$$, so it is maximized by maximizing $$N$$ (and consequently minimizing $$S_5$$). This corresponds to the scenario where all options received fewer than 5 responses: then $$S_5 = 0$$ and $$\mathit{NR} = [\max(0, T - 4N), T]$$, which is $$[0, T]$$ whenever $$4N \ge T$$. This really adds no information for the organizer, but that was our goal in the first place, so what else to say if not: mission accomplished!
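To illustrate the worst case end to end, here is a sketch that starts from the true per-option counts, derives the organizer-visible quantities, and returns the clamped interval. The function name and the sample numbers are made up for illustration; the sample is chosen so that every option is below the option-level threshold and $$4N \ge T$$.

```python
def nonresponse_interval_from_counts(total, counts, option_threshold=5):
    """Compute the clamped NR interval from the true per-option counts.

    Only organizer-visible quantities (T, S_5, N) enter the result.
    """
    visible_sum = sum(c for c in counts if c >= option_threshold)    # S_5
    hidden_options = sum(1 for c in counts if c < option_threshold)  # N
    upper = total - visible_sum
    lower = max(0, upper - (option_threshold - 1) * hidden_options)
    return (lower, upper)


# 12 participants, three options with 4 responses each: every option is hidden.
print(nonresponse_interval_from_counts(12, [4, 4, 4]))
```

Here the result is `(0, 12)`: the interval spans every possible value and tells the organizer nothing new.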