What to do when you can’t find a coin
What follows is the academic residue of a spirited discussion between a fellow PhD student and myself, concerning the use of measure theory in probability. The central question is “Why bother?” Here is my attempt at an answer, through a small demonstration of measure theory’s ability to generalise. This is not an attempt to teach any measure theory, but I will point to a few resources at the end that I found helpful for reacquainting myself during our discussion, should you wish to do the same.
First, we must establish in traditional terms the result we will later emulate measure-theoretically. I will only talk about non-negative random variables; the result generalises by splitting into positive and negative parts, but the notation is drastically simplified.
Theorem 1: If $X$ is a non-negative random variable with density $f$ and probability function $P$, then $E[X] = \int_0^\infty P(X > x)\,dx$.
Proof:

$$E[X] = \int_0^\infty x f(x)\,dx = \int_0^\infty \int_0^x 1\,dt\, f(x)\,dx = \int_0^\infty \int_t^\infty f(x)\,dx\,dt = \int_0^\infty P(X > t)\,dt.$$
There are two conceptually important points here. The less theoretically troublesome one is the switching of integrals, which Fubini lets us do, but which I’ve always found a little cheeky. More foundationally important is that we assume the existence of a density here, yet it is absent from the result of the theorem. It is an achievable exercise to prove the equivalent result for discrete distributions, and I concede that most continuous distributions I have encountered in the wild have a density, but this does have practical importance. The usefulness of the theorem is in being able to compute an expectation when we don’t have, or don’t want to find, a density, so it’s essentially useless if having a density is a pre-condition to its application. Can we get around this somehow?
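As a quick sanity check of Theorem 1, here is a numerical comparison of the two integrals for an exponential distribution. The choice of distribution, rate, and grid are mine, purely for illustration:

```python
import numpy as np

# Sanity check of Theorem 1 for X ~ Exponential(rate=2), whose mean is 1/2.
# Both integrals are approximated by Riemann sums on a fine grid; the
# distribution and grid are illustrative choices.
rate = 2.0
x = np.linspace(0.0, 50.0, 2_000_001)
dx = x[1] - x[0]

density = rate * np.exp(-rate * x)            # f(x)
survival = np.exp(-rate * x)                  # P(X > x)

mean_from_density = np.sum(x * density) * dx  # E[X] = integral of x f(x)
mean_from_survival = np.sum(survival) * dx    # E[X] = integral of P(X > x)

print(mean_from_density, mean_from_survival)  # both ≈ 0.5
```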
I will spare the majority of the details of satisfactorily defining a random variable measure-theoretically, but some objects need to be defined.
The premise of measure-theoretic probability is that we start with a measure space $(\Omega, \mathcal{F}, P)$. In probability terms, this gives us a sample space $\Omega$, a set of events $\mathcal{F}$, and a probability measure $P$, as long as $P(\Omega) = 1$. We will brush over what $P(\Omega) = 1$ is really saying, but suffice to say it imposes Kolmogorov’s unit measure axiom. The other axioms of probability are packaged up in what a measure space is. This gives us a notion of what probability means on $\Omega$. We can then define a real-valued random variable as a measurable function from $\Omega$ to the reals, that is, a function $X : \Omega \to \mathbb{R}$ such that the pre-image $X^{-1}((a,b))$ of any open interval $(a,b)$ is an element of $\mathcal{F}$.
For our purposes, we can define any real-valued random variable as follows, by first defining the uniform distribution. Take $([0,1], \mathcal{L}, \lambda)$ to be our measure space, where $\mathcal{L}$ is the set of Lebesgue-measurable subsets of $[0,1]$, and $\lambda$ is the Lebesgue measure. Then the uniform distribution $U$ can be defined as the identity map $U(\omega) = \omega$. You can check for yourself that any property you like about the uniform distribution carries over perfectly. In particular, we can check that $P(U \le u) = \lambda([0,u]) = u$ for any $u \in [0,1]$.
Now, anyone familiar with the inverse-transform method will know that defining any other real-valued random variable is a piece of cake. Every real-valued random variable has a distribution function $F$, so we define $X(\omega) = F^{-1}(\omega) = \inf\{x : F(x) \ge \omega\}$. $F^{-1}$ might not be easy to compute, but it definitely exists.
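As a concrete sketch of this inverse-transform construction, here is one case where $F^{-1}$ is easy to compute. The choice of an Exponential(1) distribution is mine, for illustration:

```python
import numpy as np

# Inverse-transform sketch: X = F^{-1}(U) for U uniform on [0, 1].
# Here F is the Exponential(1) CDF, F(x) = 1 - e^{-x}, so
# F^{-1}(u) = -log(1 - u). The distribution is an illustrative choice.
rng = np.random.default_rng(42)

u = rng.uniform(size=10**6)   # draws from the "identity map" uniform variable
x = -np.log(1.0 - u)          # X(omega) = F^{-1}(omega)

print(x.mean())               # ≈ 1, the Exponential(1) mean
```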
We are still missing one key element: expected value. We define it as $E[X] = \int_\Omega X \,dP$. I will leave undefined what it means to actually compute an integral this way, but it can be done. Importantly, it is still achieving the same goal of finding area under a curve. We are now ready to prove:
Theorem 2: If $X$ is a non-negative random variable with probability function $P$, then $E[X] = \int_0^\infty P(X > x)\,dx$.
Proof:
If you’ll allow me a couple of pictures, I argue that it is true by definition. We see that the area integrated by $\int_\Omega X\,dP$ and the area integrated by $\int_0^\infty P(X > x)\,dx$ are in fact the same areas.


Q.E.D.
Now, we should be skeptical of any proof which follows so readily from the definitions. The traditional discrete distribution is marvelously intuitive. Further, we can squint at the traditional continuous expected value definition and notice the pattern. By comparison, the measure-theoretic definition is quite opaque. So far it seems like we just made up a definition so that this proof was easy. What’s the value in that? Here’s how I see it.
I liken it to the intermediate value theorem (IVT). The point of proving the IVT is not to dispel any doubt that if an arrow pierces my heart it must also have pierced my ribcage. The point of the IVT is in showing that the definition of mathematical continuity we have written down captures the same notion of physical and temporal continuity we sense in the real world.
What we have really learned from theorem 2 then, is that we can define expected value in terms of the probability function directly. We essentially drop the density assumption by fiat. The value is in discovering this more powerful definition which unites previously disparate discrete and continuous cases, as well as distributions which are a mix of both.
My favourite mixed distribution is the zero-inflated exponential, with probability function $P(X > x) = (1-p)e^{-\lambda x}$ when $x \ge 0$, and $P(X > x) = 1$ otherwise.
Traditionally, to evaluate an expected value we would have to be rather careful or apply some clever insight. Now with measure theory, we can ham-fistedly shove straight in to $E[X] = \int_0^\infty (1-p)e^{-\lambda x}\,dx = \frac{1-p}{\lambda}$ and call it a day.
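The ham-fisted integral for the zero-inflated exponential gives $\int_0^\infty (1-p)e^{-\lambda x}\,dx = (1-p)/\lambda$. A quick Monte Carlo check, with $p$ and $\lambda$ values of my own choosing:

```python
import numpy as np

# Monte Carlo check that E[X] = (1 - p)/lam for the zero-inflated
# exponential: X = 0 with probability p, else Exponential(rate=lam).
# p and lam are illustrative values.
rng = np.random.default_rng(1)
p, lam = 0.3, 2.0
size = 10**6

is_zero = rng.uniform(size=size) < p
x = np.where(is_zero, 0.0, rng.exponential(scale=1.0 / lam, size=size))

print(x.mean(), (1 - p) / lam)   # both ≈ 0.35
```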
We can also start sparring with more exotic random variables on non-numeric spaces with confidence. I’m currently working through Diaconis’ Group Representations in Probability and Statistics, so hopefully I can speak on these “applications” in more detail in the future. But for now, I’ll leave it as an enticing mountaintop rather than trying to spoil the ending.
It is no secret that I don’t like the IVT, or theorem-motivated definitions more broadly, so I am uncomfortable leaning on it in an argument. What I will provide here is my own post-hoc intuition for the measure-theoretic expected value. Rather fortuitously it leans on the IVT, so I’ll point out pedantically that I’m actually using the Intermediate Value Property (IVP) in its own right. Observe below that the area on the left is the area defining a measure-theoretic expected value, as we have seen above.

Note that this area is the same as the area of the rectangular region. As it has unit width, its height is also its area. This height is not coincidentally the mean value of $X$ guaranteed by the IVP, so we see that the measure-theoretic definition gives us a measure of central tendency. That this is the same measure of central tendency as the traditional definition can be shown in many ways, but we have seen it today as a porism of Theorem 2.
In short, generalisation is cool, and measure theory is not as scary as I thought after failing it in third year. It gives us steady footing to go and explore exotic spaces, and it provides some nice perspectives on old favourites. Is it of practical use to the working statistician? Debatable. Our main theorem can certainly be used without actually doing any measure theory. Perhaps it also provides nice perspectives on transformations, if one needs to compute certain integrals which aren’t recognisable. What do you think? Have I convinced you measure-theoretic probability isn’t useless? Do you know any interesting applications I didn’t mention? As always, I’d love to hear your thoughts.
I am always hesitant to endorse texts based solely on how helpful they were to me. We should remember that one always understands something better the second time. That being said, the following two probability-oriented texts were useful to me. Matthew N. Bernstein has a trio of nice blog posts entitled Demystifying measure-theoretic probability theory, which are a nice, slow introduction to some of the basics. I also found Sebastien Roch’s Lecture Notes on Measure-theoretic Probability Theory useful as a much denser, more comprehensive reference. As for strictly measure-theoretic principles, I found plenty enough information by simply clicking the first Wikipedia article to pop up when I searched the relevant terms.
See mathsfeed.blog/problem-adding-dice-rolls/ for the motivation, introduction and immediate discussion of the problem.
We want to roll $n$ dice at a time, add up the total over repeated rolls, and continue to do so until we reach a threshold $T$. When we reach or exceed $T$, we note $R$, the sum of the dice we just rolled. What is the distribution of $R$? We will call the range $T - 6n$ to $T - 1$ the striking range, as a cumulative total in this range might get us to $T$ on the next roll. Inside this range we need to pay extra attention to the value of our dice roll, as depending on its value we might have to roll again, or we might have to stop.
Let $S_k$ be the cumulative sum of $n$ dice rolled $k$ times. Let $S$ be the cumulative sum of $n$ dice rolled an unknown number of times. We want to find a mixing point $M$ after which the distribution of $S$ is locally uniform. Why? If we find such an $M$, then as long as our threshold is at least a few maximum dice rolls past $M$, it doesn’t really matter how far away it is: we can always assume our cumulative total approaches the striking range from somewhere uniform in an appropriately wide interval just outside the striking range. This significantly reduces the complexity of an analytic solution or a computer simulation. If the threshold is not significantly past the mixing point, then we have to be careful, as our cumulative total is more likely to be at particular points, and the calculations become more complex, as our cumulative total will come in chunks of about $3.5n$ at a time.
It might be that $M$ doesn’t really exist, and there is always some very subtle nonuniformity to $S$. This isn’t necessarily the case, but showing that would be another problem entirely, and we’re probably quite fine with $M$ technically being a function of some tolerance level. Let’s quickly develop a mental picture with some histograms. This will hopefully convince us that $M$ exists (or that we get close enough to uniform that it looks like $M$ exists), and show how we might capture its meaning.
For a particular set of $n$ and $k$, $S_k$ just follows some discrete distribution with a nice bell-curvy shape. Of real interest is sampling from $S$, as we don’t know how many rolls it took to reach our threshold. The immediate problem is that this requires sampling a $k$ first, and we don’t want to have any assumption on $k$‘s value. So instead I will just uniformly sample $k$ values between 0 and 200 and hope that our brains can imagine the extension to unbounded $k$. Play around with my code in the Colab notebook here. Actually, let’s always sample $S_{200}$ but keep track of the partial sum at each intermediate step. I know this technically violates independence assumptions, but whatever. I’m also going to work with $n = 20$ for these pictures, as the results are suitably interesting.
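Here is a minimal sketch of that simulation (my own reconstruction, not the Colab notebook itself): always roll 200 times, and record every partial sum along the way.

```python
import numpy as np

# Reconstruction of the histogram simulation: n = 20 dice per roll,
# 200 rolls per run, keeping the running total after every roll.
# (One run contributes 200 correlated partial sums, as admitted above.)
rng = np.random.default_rng(2)
n_dice, max_rolls, n_runs = 20, 200, 1_000

rolls = rng.integers(1, 7, size=(n_runs, max_rolls, n_dice)).sum(axis=2)
totals = rolls.cumsum(axis=1)    # totals[r, k-1] is S_k for run r

# The first partial sum clusters around 3.5 * n = 70, which is why the
# early histogram shows spikes roughly 70 units apart.
print(totals[:, 0].mean())
```

Histogramming `totals.ravel()` reproduces the pictures below: spiky at first, flattening out well before the 200th roll.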

Here we can see a histogram for 10 000 samples of $S_{200}$. As expected it forms a lovely bell-curve shape.

Here we see a histogram for 10 000 samples of every partial sum $S_k$, $k \le 200$. We can see the nice uniform property we’re looking for emerge definitively once the total exceeds about 3 000, but it’s hard to tell at this scale. Let’s zoom in on the more interesting part of the graph.

Here we only look at samples with totals below 3 000. We can now see more clearly the spikes which indicate we are not at the mixing point yet, which appears to be at about 2 000. Zooming in further:

We can identify the spikes more clearly here. Given we roll 20 dice at a time, we should not be surprised to see the initial values occur in spikes about 70 units apart, which is roughly what we see.
They do get a bit wider as we move to the right, as the tails of slightly fatter and further-right spikes gently nudge up their neighbours. So whatever more technical answer we derive below should line up roughly with these observations, namely that by the time the total is about 2 000, or about 29 dice rolls, it should be well mixed.
For a more technical definition, you can be as picky as you like as to how you define suitably uniform, probably with some sort of $\varepsilon$ floating around, but I want a rough and ready answer, and I don’t personally enjoy having $\varepsilon$‘s littered throughout my work, so my working definition is as follows:
If, for all $s > M$, it is no longer obvious to infer from $S = s$ how many rolls $k$ it took to reach $s$, then $M$ is a mixing point.
Of course this is entirely heuristic: it is no longer obvious (in some sense) as long as there is more than one $k$ for which $P(S_k = s)$ is nonzero. This happens very quickly and does not capture what we see in the simulations. In the other direction, for any $s$, there will always be some much more sensible guesses for $k$ than others, probably an integer close to $s/3.5n$. So we need to start by deciding on our criteria for obvious. I’ve come up with a couple of different definitions, and I’ll discuss them both below.
It can be checked that $E[S_k] = 3.5nk$, and that $\mathrm{Var}(S_k) = \frac{35}{12}nk$. From now on, if I need to, I will approximate $S_k$ with $N_k$, a normal random variable with the same mean and variance. Then I can say we have reached the mixing point if there is significant overlap between $N_k$ and $N_{k+j}$ for some $j$. Again there are lots of choices for what is meant by significant overlap and choice of $j$. Inspired by mathsfeed.blog/is-human-height-bimodal I think a reasonable choice is to compare $N_k$ and $N_{k+1}$, and consider the overlap significant if there is only one mode, not two. Using the fact that a normal pdf is concave down within 1 standard deviation of its mean, we would like that one standard deviation above the mean for $N_k$:

$$3.5nk + \sqrt{\tfrac{35}{12}nk}$$

is equal to one standard deviation below the mean for $N_{k+1}$:

$$3.5n(k+1) - \sqrt{\tfrac{35}{12}n(k+1)}$$
One can do some rather boring algebra to arrive at $3.5n = \sqrt{\frac{35}{12}nk} + \sqrt{\frac{35}{12}n(k+1)}$. You can solve this properly I guess, but I am a deeply lazy person, so I’m going to approximate the right-hand side as $2\sqrt{\frac{35}{12}nk}$. If this upsets you, then I am deeply sorry, but I will not change. ($k$ is big enough and we’re rounding to a whole number at the end of the day so it’s fine, but I’ve already spent more time justifying this than I wanted to.) This allows us to arrive at $k = 1.05n$.
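To see how much the lazy step actually costs, we can compare the approximate root $k = 1.05n$ with the exact root of the one-sigma condition, found by bisection (my own quick check, with $n = 20$):

```python
import math

# Exact condition: 3.5n = sqrt(35nk/12) + sqrt(35n(k+1)/12).
# The lazy approximation replaces the right-hand side with 2*sqrt(35nk/12),
# giving k = 1.05n. Compare the two for n = 20.
n = 20

def gap(k: float) -> float:
    """Positive while the two one-sigma points have not yet met."""
    return 3.5 * n - (math.sqrt(35 * n * k / 12) + math.sqrt(35 * n * (k + 1) / 12))

k_approx = 1.05 * n   # = 21 for n = 20

# Bisection: gap is decreasing in k, so find where it crosses zero.
lo, hi = 1.0, 100.0
for _ in range(60):
    mid = (lo + hi) / 2
    if gap(mid) > 0:
        lo = mid
    else:
        hi = mid

print(k_approx, round(lo, 2))   # 21.0 vs roughly 20.5
```

So the approximation overshoots by about half a roll, which the whole-number rounding comfortably absorbs.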
This roughly agrees with the scale we wanted for $M$. If you try and count out the first 21 spikes in the above plots, they become very hard to make out by the end. So I’m actually fairly happy with this answer, subject to some proper checking with more choices for $n$, and maybe just topping off with another 20% for good measure. More important, I think, is convincing yourself that if I had chosen some other number of standard deviations or some larger $j$, then $k$ as a function of $n$ should still be linear! So instead of rederiving all of these calculations, just remember that if you’re happy that $n = 20$, $k = 21$ is well-mixed, then $n = 40$, $k = 42$ should be well-mixed too. Note that this condition truly isn’t enough to guarantee uniformity, as it makes no attempt to consider the contribution of any $N_i$ other than $N_k$ and $N_{k+1}$, but it should ensure any spikiness is rather muted. If you’re happy with this condition, good, so am I, but I may as well mention the other method I thought of for measuring mixing.
The definition of suitably uniform above is very heavily based in conditional probability, and I am a dyed-in-the-wool bayesian, so I’m going to attack with all the bayesian magic spells I can muster. If you’re a committed frequentist, maybe it’s time to look away.
We want to derive $P(k \mid S = s)$. Can we derive $P(S = s \mid k)$? Well, by the definition of conditional probability,

$$P(k \mid S = s) = \frac{P(S = s \mid k)\,P(k)}{P(S = s)}.$$

I know $S_k$ is approximately normal, so

$$P(S = s \mid k) \approx \frac{1}{\sqrt{2\pi \cdot \frac{35}{12}nk}} \exp\left(-\frac{(s - 3.5nk)^2}{2 \cdot \frac{35}{12}nk}\right),$$

and we have no prior information about what $k$ should be, so we can treat $k$ with a constant uninformative prior. Finally, $P(S = s)$ is not a function of $k$; it’s just a scaling factor, so

$$P(k \mid S = s) \propto \frac{1}{\sqrt{k}} \exp\left(-\frac{(s - 3.5nk)^2}{\frac{35}{6}nk}\right).$$
Now admittedly it’s been a while since I was properly in the stats game, so my tools might be a bit rusty, but this doesn’t look like a pmf I’m familiar with. It looks like it’s in the exponential family, so maybe somebody with more experience in the dark arts can take it from here. I guess you could always figure out some sort of acceptance-rejection sampler if needed. Okay, but what’s the point? Well, now we have our posterior for $k$, we can be more precise about it being suitably non-obvious what to infer for $k$. The criteria that come to mind for me are either specifying that the variance should be suitably large (which can be approximated up to proportionality with the pdf, though that proportionality depends on $s$ generally), or that the mode of the distribution is suitably unlikely (also easy up to proportionality, but knowing the actual probability itself feels more integral to the interpretation). Of course, in both cases we can approximate the proportionality constant by computing an appropriate partial sum. I’ve knocked up a quick demo on Desmos of what this would look like in practice.
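For concreteness, here is a sketch of evaluating that unnormalised posterior on a grid and normalising by a partial sum. The values of $n$ and $s$, and the grid cutoff, are all illustrative choices of mine:

```python
import numpy as np

# Evaluate P(k | S = s) ∝ k^{-1/2} exp(-(s - 3.5nk)^2 / (35nk/6)) on a
# grid of k, normalising by a partial sum. n, s, and the cutoff at
# k = 100 (standing in for the infinite sum) are illustrative choices.
n, s = 20, 500

k = np.arange(1, 101, dtype=float)
mean_k = 3.5 * n * k                  # E[S_k]
var_k = (35.0 / 12.0) * n * k         # Var(S_k)

log_post = -0.5 * np.log(k) - (s - mean_k) ** 2 / (2.0 * var_k)
post = np.exp(log_post - log_post.max())
post /= post.sum()                    # partial-sum normalisation

mode = k[post.argmax()]               # most plausible number of rolls
mean = (k * post).sum()
var = ((k - mean) ** 2 * post).sum()
print(mode, mean, var)                # mode is 7, near s / (3.5 n) ≈ 7.14
```

The mode and variance read off this grid are exactly the two "obviousness" criteria above, computed without ever knowing the true normalising constant.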
Note of course that the normal approximation itself only works if the number of dice in each roll is suitably large to apply the CLT. It also then feels like no coincidence that ‘about 30 rolls’ is the conclusion, as it sounds an awful lot like my usual retort when asked if a sample is big enough to make a normal approximation. Overall I’m okay with making approximations which assume a large $k$ for the same reason we are more interested in deriving results for large $T$: namely, for small $k$ and/or $T$, we can probably simulate the answer with high precision using a computer, or even by hand for very small values. But these asymptotic results help us to be confident in when we can truncate the simulation for speed, or when we can stop doing simulations and rely only on the asymptotic results.
If you enjoyed this, you might enjoy my other posts on problems I would like to see solved, or find out more about my research from my homepage.
Last year, as part of the assessment for a machine learning course, we were asked to write a tutorial paper on a topic of our choice. I chose to write about Evolutionary Algorithms. I’m happy with how it turned out, and the lecturer chose it as an exemplar for future students. For posterity, I’m going to share it here. Unfortunately, the original .tex file is lost to the sands of time (more specifically, it was saved on a thumb drive that was the only casualty in a backpack-dropping accident while getting off the bus). Thus, the only copy I have access to is the PDF file that I submitted online. You can find that here: