What is information theory?

Information theory is the branch of applied mathematics concerned with quantifying information.

The basic unit of information is a "bit", which represents the amount of information produced by flipping a fair coin. Two independent coin flips, likewise, should produce 2 bits of information. Also, the order in which they are flipped or added together should not affect the amount of information they produce.

Essentially one function meets these two simple requirements of additivity and commutativity: the negative logarithm of the probability (the choice of base just sets the unit; base 2 gives bits). This is the definition of the mathematical measure known as "information":

I = -log2(p) (Where p is the probability of the event, and I is the amount of information.)

The relationship is probably clearer in reverse:

(1/2)^I = p (Where p is the probability of the event, and I is the amount of information.)

The amount of information is the power to which 1/2 must be raised to give the probability of the event.

The main difference between information and probability is that while the probabilities of two independent events multiply together, the information they produce adds together. That's why the logarithm is taken - to turn multiplication into addition. This makes the quantity much easier to work with, and it has a clear, intuitive interpretation: the amount of information produced.
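
To make this concrete, here is a small Python sketch (purely illustrative, not part of the program) that computes information from probability and checks that multiplying probabilities adds information:

    from math import log2

    def information(p):
        # Bits of information produced by observing an event of probability p.
        return -log2(p)

    print(information(1/2))   # fair coin flip: 1.0 bit
    print(information(1/4))   # two independent flips: 2.0 bits
    # Probabilities of independent events multiply; their information adds.
    print(information(1/2 * 1/2), information(1/2) + information(1/2))  # 2.0 2.0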

Why are you using information theory in a redistricting program?

An election is fundamentally a transmission of information. A vote is a piece of information. The goal of an election is to "encode" that information with high fidelity as representation in congress. An electoral district is a "channel" for this information. As all channels do, it adds noise. We want to eliminate noise. To do that we have to measure it. We need to measure how accurately the elected delegates "encode" the information in the votes. In the mathematical study of information and communication, relative entropy measures precisely this.

What is self-entropy?

Self-entropy is a measure of how unpredictable something is, expressed in terms of information. A fair coin has a self-entropy of 1 bit. If we flip it twice, that's twice as much information, or two bits. Likewise, if we flip two coins at the same time, that's two bits. Also, if we roll a 4-sided die once, that's like flipping a fair coin twice, so that's two bits. (If you have trouble seeing why this is, picture one side of the die representing heads-heads, another heads-tails, another tails-heads, and another tails-tails.)

To carry this analogy to the extreme, think of a roulette table. If it had just 4 slots of equal size, that'd be like a 4-sided die, or 2 coins. If it had 4*2=8 slots of equal size, that'd be like flipping 3 coins, so that'd be a self-entropy of 3 bits. A table with 16 equal slots would be like flipping 4 coins, so would be 4 bits. And so on.
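
As a quick check, here is a hypothetical Python snippet for the equal-slot case: with n equal slots, each outcome has probability 1/n, so each spin produces log2(n) bits.

    from math import log2

    # Self-entropy of a wheel with n equal slots: each spin yields log2(n) bits.
    for n in (2, 4, 8, 16):
        print(f"{n} slots: {log2(n)} bits")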

Now the slots don't have to be of equal size. Let's say half of the wheel is one slot, and the other half is divided into 8 evenly sized slots.

When it lands on the side that's just one slot, that's like just flipping heads on a coin-toss. That's 1 bit of information. When it lands in one of the 8 slots on the other side, that's like we flipped a coin and it came up tails, and then we spun an 8-slot roulette table and it landed in that slot. So that's 1 bit, plus 3 bits, equals 4 bits. Half the time it will land in the one big slot, for 1 bit of information, and half the time it will land in one of the small slots, for 4 bits of information.

1 bit  - half the time
4 bits - half the time


To get the self-entropy we take the weighted average of these possibilities: half times 1 bit plus half times 4 bits gives us (1+4)/2 = 5/2, or 2.5 bits. This tells us the unpredictability of our roulette table is about the same as 2 1/2 coin flips. Or: 2 spins would be the same as 5 coin flips.

We say that the "self-entropy" of this roulette table is 2.5 bits. By that we mean, on average, that's how much information a single spin of the wheel produces. About the same as 2 1/2 coin flips.
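
Here is that calculation as a short Python sketch (the probabilities are just the slot sizes described above):

    from math import log2

    # One slot covers half the wheel; eight slots share the other half (1/16 each).
    slot_sizes = [1/2] + [1/16] * 8

    # Self-entropy: the information of each outcome, weighted by how often it occurs.
    self_entropy = sum(p * -log2(p) for p in slot_sizes)
    print(self_entropy)  # 2.5 bits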

What is cross-entropy?

Now let us imagine our roulette table again, with its 9 slots. But let's pretend that we are wrong about the sizes of the slots. The amount of information we count, on average, while using the wrong slot sizes is called the "cross-entropy". It's called "cross-entropy" because we are taking an estimate of the distribution and "crossing" it with the actual distribution.

Let's say we thought the first slot - the one that really takes up half the board - only took up a quarter of the board. In other words, we think it represents 2 bits of information. For the remaining 8, we think 4 are evenly distributed over the other half, and the final 4 are evenly distributed in the remaining quarter. So the first 4, we think, represent 3 bits of information, and the last 4, we think, represent 4 bits of information (i.e. as unpredictable as 4 coin flips).

Under these - incorrect - assumptions, when we actually spin the wheel, we'll count:

2 bits - half the time
3 bits - 1/4 the time
4 bits - 1/4 the time


So 2*0.5 + 3*0.25 + 4*0.25 is how much information we'll count on average, per spin. This comes out to 1 + 0.75 + 1 = 2.75 bits. Note this is higher than the actual "self-entropy" of 2.5 bits. Cross-entropy is never lower than self-entropy, and it is strictly higher whenever our estimates are wrong, because wrong estimates always leave us a bit extra "surprised" by the results.
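
The same bookkeeping in Python, using the actual slot sizes to weight the (wrong) information counts:

    from math import log2

    # Actual slot sizes: one half-wheel slot, eight slots of 1/16 each.
    actual = [1/2] + [1/16] * 8
    # Our wrong guess: a quarter-wheel slot, four slots of 1/8, four slots of 1/16.
    guess = [1/4] + [1/8] * 4 + [1/16] * 4

    # Cross-entropy: information counted with the guessed sizes,
    # weighted by how often each slot actually comes up.
    cross_entropy = sum(p * -log2(q) for p, q in zip(actual, guess))
    print(cross_entropy)  # 2.75 bits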

What is relative entropy?

Relative entropy, also known as "Kullback-Leibler divergence", is just the cross-entropy minus the self-entropy. It measures the extra "surprisal" we get from our estimate not matching the actual (or ideal) distribution. When our estimate matches the actual distribution perfectly, relative entropy is zero. In this way, "relative entropy" measures how "incorrect" our estimate is. In our example above, there are 2.75 - 2.5 = 0.25 bits of "surprisal" per spin, or about one coin-flip's worth of surprise every four spins.
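
In Python, the relative entropy can be computed directly from the two distributions, and it matches the cross-entropy minus the self-entropy:

    from math import log2

    actual = [1/2] + [1/16] * 8              # real slot sizes
    guess = [1/4] + [1/8] * 4 + [1/16] * 4   # our wrong estimate

    # Relative entropy (KL divergence): extra bits of surprise per spin
    # caused by using the wrong slot sizes.
    relative_entropy = sum(p * log2(p / q) for p, q in zip(actual, guess))
    print(relative_entropy)  # 0.25 bits = 2.75 (cross-entropy) - 2.5 (self-entropy)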

How is relative entropy used in the scoring?

Use in "proportional representation" - In our use of "relative entropy" or "KL-divergence", for the "proportional representation" criteria, we treat the distribution of elected seats as the "estimation", and the votes of the citizens as the "actual". That is, we are saying that we are trying to "estimate" the will of the citizens via the distribution of seats in congress. The further those two distributions are from each other, the higher the "relative entropy", the less accurately the distribution of seats represents the will of the people.

Use in "voting power balance" - In our use of "relative entropy" or "KL-divergence", for the "voting power balance" criteria, we treat the distribution of voting power among the districts as the "estimation", and a perfectly even distribution as the "ideal". That is, we are saying that we are trying to "estimate" a perfectly even distribution. The further those two distributions are from each other, the higher the "relative entropy", the less even the distribution of voting power is among the districts.

Another way to look at it is that the self-entropy of an election in a district represents how much information that district generates about the votes. We want each person's vote to count for the same amount of information. One person, one bit.

There are actually two reasons we want this. The first is obvious: because it's fair. The second is perhaps a little more interesting: we want to maximize the net "informativeness". In other words, we want to maximize its "self-entropy". There's a well-known result in information theory that the self-entropy of a discrete distribution is maximized when the distribution is uniform. That is, when each "partition" has the same size. Or, in our case, when each person's vote represents the same amount of information.

So we take the self-entropy of each district's election outcomes, divide that by its population, and then use relative entropy to measure its distance from the ideal, maximally informative distribution, which is simply the uniform distribution.
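
A rough sketch of that calculation, with made-up districts and some details (the direction of the divergence, handling of zero vote shares) simplified for illustration:

    from math import log2

    def self_entropy(shares):
        # Self-entropy (bits) of one district's election outcome distribution.
        return sum(p * -log2(p) for p in shares if p > 0)

    # Hypothetical districts: within-district vote shares and populations.
    districts = [
        {"shares": [0.50, 0.50], "population": 100_000},  # maximally informative
        {"shares": [0.90, 0.10], "population": 100_000},  # a lopsided district
        {"shares": [0.65, 0.35], "population": 120_000},
    ]

    # Bits of information generated per person in each district.
    per_person = [self_entropy(d["shares"]) / d["population"] for d in districts]

    # Normalize into a distribution of voting power across districts, then measure
    # its distance from the even ("ideal") distribution with relative entropy.
    total = sum(per_person)
    power = [x / total for x in per_person]
    even = [1 / len(power)] * len(power)
    imbalance = sum(p * log2(p / q) for p, q in zip(even, power))
    print(imbalance)  # zero only when every person's vote carries the same information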

Why use relative entropy?

Well, there are a few reasons.

First, because it describes what we are actually interested in. We want to minimize the amount of "surprise" we feel when, for instance, we see that there are 15 seats for one party and 18 for the other, and then learn that the popular vote was actually about 52-48 in the other direction. We don't want to be surprised like that. We want to see the seat ratio, then see the popular vote, and not be surprised at all. That's exactly what relative entropy measures: the amount of surprisal.

Also because it has many important properties. Not the least of which is that it is never negative, with a minimum of zero. The most important property is that it is convex. This is a fancy way of saying that it has no "bumps". This guarantees that, by using this measure, our program won't get "stuck" in "local optima": there is always a slope leading down, and the only place with no downhill direction is the true bottom.

And finally, because we really don't have much of a choice. We don't have just two parties; we have many. Currently there are only two viable parties in any major election, yes, but if we want to do this right, we need a measure that works regardless. If we had just two parties we could use a much simpler statistic, such as a variance, but the right formula should be able to handle more than two, because that's certainly possible in theory. Relative entropy, and, more generally, statistics based on information theory, naturally handle these scenarios where there are many "buckets". Indeed, it is their home court.