I recently read about using physical dice to roll private Bitcoin keys, for example here. Intrigued, I ordered a pair of hexadecimal dice from GameStation.

Soon I began to wonder… How random are they? Do rolls result in an expected distribution? Can knowledge of the actual distribution characteristics of the dice be used to narrow private keys generated by this method into a practically searchable space, making such keys more susceptible to discovery?

I rolled my pair of hexadecimal dice 1,014 times for a total of 2,028 samples for the pair. Here are the results:

The values 3 and C are already visibly and severely underrepresented.

But how about a goodness of fit test? We begin with the null hypothesis that the dice are fair, so that each of the 16 values is equally likely. The data yield χ² ≈ 105. Using a chi-square table such as this one and looking at the row for 15 degrees of freedom (16 possible values minus 1), we see that the value 105 is off the chart, or p < 0.01. In fact, using this calculator, we see that p < 0.0001. The deviation is therefore statistically significant and we reject the null hypothesis: the dice are not observed to be fair over these 2,028 rolls.
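To make the calculation concrete, here is a minimal Python sketch of the Pearson chi-square statistic for a 16-sided die. The counts in `example_counts` are made up for illustration and are not the counts from the rolls described above. The constant 30.578 is the standard table value for 15 degrees of freedom at p = 0.01.

```python
def chi_square_statistic(counts):
    """Pearson chi-square statistic against a uniform (fair-die) expectation."""
    total = sum(counts)
    expected = total / len(counts)  # every face equally likely under the null
    return sum((observed - expected) ** 2 / expected for observed in counts)


# HYPOTHETICAL counts for faces 0..F, not the real data from this post.
# They sum to 2,028, with faces 3 and C deliberately underrepresented.
example_counts = [135, 130, 128, 80, 133, 131, 136, 129,
                  134, 132, 130, 137, 78, 139, 138, 138]

# Chi-square critical value for 15 degrees of freedom at p = 0.01.
CRITICAL_DF15_P01 = 30.578

stat = chi_square_statistic(example_counts)
reject_fairness = stat > CRITICAL_DF15_P01
print(f"chi-square = {stat:.2f}, reject fairness at p < 0.01: {reject_fairness}")
```

Swapping in real per-face counts is the only change needed. A statistic above the critical value means the fair-die hypothesis is rejected at that significance level.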

Here’s a good article that examines the question of how many rolls give good results for the goodness of fit test.

However, commenter Matthew Neagly makes an interesting point in this post that “the larger your sample size, the more exactly you have to match a potential distribution to not reject a match. So eventually you hit a point where you’re ‘cursed’ to always reject your hypothesis.” In other words, since no physical die is perfectly fair, the more rolls you make, the higher the probability that the test will detect the imperfection and reject the die as unfair. (He mentions two terms for this phenomenon, “curse of significance” and “doomed to significance,” but I was unable to find any discussion or related use of these terms.)
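The effect is easy to see with a small sketch. It assumes a hypothetical die with a tiny fixed bias (one face 0.005 more likely than fair, another 0.005 less likely) and computes the chi-square statistic that results if the observed counts match those probabilities exactly. This is an idealization that ignores sampling noise, so it only shows the trend: the statistic grows in proportion to the number of rolls.

```python
def expected_chi_square(probabilities, n_rolls):
    """Chi-square statistic if observed counts equalled n_rolls * p exactly."""
    fair = 1 / len(probabilities)
    return n_rolls * sum((p - fair) ** 2 / fair for p in probabilities)


# HYPOTHETICAL, very slightly biased 16-sided die (illustration only).
slightly_biased = [1 / 16] * 16
slightly_biased[0] += 0.005
slightly_biased[1] -= 0.005

# Chi-square critical value for 15 degrees of freedom at p = 0.01.
CRITICAL_DF15_P01 = 30.578

for n in (1_000, 10_000, 100_000, 1_000_000):
    stat = expected_chi_square(slightly_biased, n)
    print(f"{n:>9} rolls: chi-square ~ {stat:.1f}, rejected: {stat > CRITICAL_DF15_P01}")
```

The same small bias that passes at a thousand rolls is rejected once the sample is large enough, which is the point the commenter is making.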

He also says, “This is one of the reasons why some statisticians favor visual interpretation of plots.” From that perspective, the histogram above is pretty clear.

The dice didn’t have fair distribution for my test… So what? It’s a fair (pun!) question.

At what point are dice “good enough”? Can these dice I’ve acquired, given the apparently wildly uneven distribution observed, be used to create private Bitcoin keys no more susceptible to discovery than keys generated by other means? Is the unevenness attributable to something specific or unique to these two dice or to the rolling method and conditions? Or would a similar pattern be observed with other dice, that is, is it attributable to systemic or common causes in the manufacturing process? Finally, if they are truly fair and random dice, then certainly the exact sequence observed for this test is a valid possibility, however “unfair” it may seem.
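One way to start putting a number on “good enough” is to estimate how many bits of entropy each roll contributes. The sketch below is a rough illustration only. It assumes the rolls are independent, treats the observed frequencies as the true probabilities, and uses made-up counts rather than the data from this post. It reports both the Shannon entropy and the more conservative min-entropy, scaled to the 64 hexadecimal digits of a 256-bit key.

```python
import math


def entropy_bits_per_roll(counts):
    """Return (Shannon entropy, min-entropy) in bits per roll from face counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    shannon = -sum(p * math.log2(p) for p in probs)
    min_entropy = -math.log2(max(probs))  # driven by the most likely face
    return shannon, min_entropy


# HYPOTHETICAL counts for faces 0..F, not the real data from this post.
example_counts = [135, 130, 128, 80, 133, 131, 136, 129,
                  134, 132, 130, 137, 78, 139, 138, 138]

ROLLS_PER_KEY = 64  # 64 hex digits = 256 bits for a perfectly fair die

shannon, min_ent = entropy_bits_per_roll(example_counts)
key_shannon_bits = shannon * ROLLS_PER_KEY
key_min_entropy_bits = min_ent * ROLLS_PER_KEY
print(f"Shannon: {key_shannon_bits:.1f} bits, min-entropy: {key_min_entropy_bits:.1f} bits")
```

A perfectly fair die gives exactly 4 bits per roll. Any shortfall from 256 bits in the output is an estimate of how much an attacker who knew the distribution could shrink the search space, under the assumptions stated above.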