Why do measure theory in probability?

What follows is the academic residue of a spirited discussion between a fellow PhD student and myself, concerning the use of measure theory in probability. The central question is “Why bother?” Here is my attempt at an answer to this question, through a small demonstration of measure theory’s ability to generalise. This is not an attempt to teach any measure theory, but I will point to a few resources at the end that I found helpful for reacquainting myself during our discussion if you would like to do the same.

The traditional result

First, we must establish in traditional terms the result we will later emulate measure-theoretically. I will only talk about non-negative random variables; the result generalises by splitting a general random variable into its positive and negative parts, but restricting to the non-negative case keeps the notation drastically simpler.

Theorem 1: If X is a non-negative random variable with density f(x) and probability function F(x), then \mathbb{E}(X) = \int_0^{\infty} 1-F(x) dx.

Proof:

\begin{aligned} \int_0^{\infty} 1 - F(x) \mathop{dx} &= \int_0^{\infty} P(X > x) \mathop{dx} \\ &= \int_{x=0}^{x=\infty} \int_{u=x}^{u=\infty} f(u) \mathop{du} \mathop{dx} \\ &= \int_{u=0}^{u=\infty} \int_{x=0}^{x=u} f(u) \mathop{dx} \mathop{du} \\ &= \int_{u=0}^{u=\infty} f(u) \int_{x=0}^{x=u} \mathop{dx} \mathop{du} \\ &= \int_{u=0}^{u=\infty} f(u) u \mathop{du} \\ &= \mathbb{E}(X) \end{aligned}

There are two conceptually important points here. The less theoretically troublesome one is the switching of integrals, which Fubini lets us do, but I’ve always found a little cheeky. More foundationally important is that we assume the existence of a density f(x) here, but it is absent from the result of the theorem. It is an achievable exercise to prove the equivalent result for discrete distributions, and I concede that most continuous distributions I have encountered in the wild have a density, but this does have practical importance. The usefulness of the theorem is in being able to compute an expectation when we don’t have or don’t want to find a density, so it’s essentially useless if having a density is a pre-condition to its application. Can we get around this somehow?

In steps measure theory

I will spare the majority of the details of satisfactorily defining a random variable measure-theoretically, but some objects need to be defined.

The premise of measure-theoretic probability is that we start with a measure space \left(\Omega, E, \mathbb{P} \right). In probability terms, this gives us a sample space \Omega, a set of events E and a probability measure \mathbb{P}, as long as \int_{\Omega} d \mathbb{P} = 1. We will brush over what \int_{\Omega} d \mathbb{P} = 1 is really saying, but suffice to say it imposes Kolmogorov’s unit measure axiom. The other axioms of probability are packaged up in what a measure space is. This gives us a notion of what probability means on \Omega. We can then define a real-valued random variable as a measurable function from \Omega to the reals, that is, a function f: \Omega \to \mathbb{R} such that the pre-image f^{-1}((a,b)) of any open interval (a,b) is an element of E.

For our purposes, we can define any real-valued random variable as follows, by first defining the \mathrm{Uniform}\left(0,1\right) distribution. Take \left(\left[0,1\right], \lambda\left(\left[0,1\right]\right), \mu \right) to be our measure space, where \lambda\left(\left[0,1\right]\right) is the set of Lebesgue-measurable subsets of \left[0,1\right], and \mu is the Lebesgue measure. Then the uniform distribution U can be defined as the identity map U: \left[0,1\right] \to \left[0,1\right]; U(\omega) = \omega. You can check for yourself that any property you like about the uniform distribution carries over perfectly. In particular, we can check that P\left(a \leq U \leq b\right) = \mu\left( \left[ a,b \right] \right) = b-a.

Now, anyone familiar with the inverse-transform method will know that defining any other real-valued random variable is a piece of cake. Every real-valued random variable X has a distribution function P\left(X \leq x\right) = F_X(x), so we define X : \left[0,1\right] \to \mathbb{R}; X(\omega) = F_X^{-1} \circ U(\omega) = F_X^{-1}(\omega). F_X^{-1} might not be easy to compute, and may need to be read as the generalised inverse (the quantile function) when F_X has jumps or flat sections, but it always exists.
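To make this concrete, here is a minimal sketch in Python (the Exponential(\lambda) example, the parameter values and the sample size are my own choices for illustration, not part of the argument): a uniform draw \omega plays the role of a point of the sample space \left[0,1\right], and X(\omega) = F_X^{-1}(\omega) is the random variable.

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 2.0                                # rate of an Exponential(lam) distribution
F_inv = lambda u: -np.log(1 - u) / lam   # its quantile function F_X^{-1}

# omega ~ Uniform(0, 1) plays the role of a point of the sample space [0, 1];
# X(omega) = F_X^{-1}(omega) is then our random variable.
omega = rng.uniform(0.0, 1.0, size=100_000)
x = F_inv(omega)

print(x.mean())  # should land close to 1 / lam = 0.5
```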

We are still missing one key element, expected value. We define it as \mathbb{E}(X) = \int_\Omega X d \mathbb{P}. I will leave undefined what it means to actually compute an integral this way, but it can be done. Importantly, it is still achieving the same goal of finding area under a curve. We are now ready to prove:

Theorem 2: If X is a non-negative random variable with probability function F(x), then \mathbb{E}(X) = \int_0^{\infty} 1-F(x) dx.

Proof:

If you’ll allow me a couple of pictures, I argue that it is true by definition. We see that the area integrated by \int_0^{\infty} 1 - F(x) \mathop{dx} and the area integrated by \int_{\left[0,1\right]} F^{-1} \mathop{d\mu} are in fact the same area, just viewed with the axes swapped.

\int_0^{\infty} 1 - F(x) dx.
\int_{\left[0,1\right]} F^{-1} \mathop{d\mu}.

Q.E.D.
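Before the skepticism, a quick numerical sanity check of Theorem 2 never hurts. Here is a minimal sketch (the Exponential(\lambda) example and all names are my own choices): both integrals are approximated by crude midpoint sums and land on the same value.

```python
import numpy as np

lam = 2.0
F = lambda x: 1 - np.exp(-lam * x)        # distribution function of an Exponential(lam)
F_inv = lambda u: -np.log(1 - u) / lam    # its inverse

n = 1_000_000

# E(X) as the integral of F_inv over [0, 1] (midpoint rule; the interval has length 1).
u = (np.arange(n) + 0.5) / n
lhs = F_inv(u).mean()

# E(X) as the integral of 1 - F(x) over [0, infinity), truncated where the tail is negligible.
width = 20.0 / n
x = (np.arange(n) + 0.5) * width
rhs = (1 - F(x)).sum() * width

print(lhs, rhs)   # both approximately 1 / lam = 0.5
```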

Some healthy skepticism

Now, we should be skeptical of any proof which follows so readily from the definitions. The traditional discrete definition of expected value is marvelously intuitive. Further, we can squint at the traditional continuous expected value definition and notice the pattern. By comparison, the measure-theoretic definition is quite opaque. So far it seems like we just made up a definition so that this proof was easy. What’s the value in that? Here’s how I see it.

I liken it to the intermediate value theorem (IVT). The point of proving the IVT is not to dispel any doubt that if an arrow pierces my heart it must also have pierced my ribcage. The point of the IVT is in showing that the definition of mathematical continuity we have written down captures the same notion of physical and temporal continuity we sense in the real world.

What we have really learned from theorem 2 then, is that we can define expected value in terms of the probability function directly. We essentially drop the density assumption by fiat. The value is in discovering this more powerful definition which unites previously disparate discrete and continuous cases, as well as distributions which are a mix of both.

A concrete mixed distribution example

My favourite mixed distribution is the zero-inflated exponential, with probability function F(x) = p + (1-p)(1-e^{-\lambda x}) when x \geq 0, and 0 otherwise.

Traditionally, to evaluate an expected value we would have to be rather careful or apply some clever insight. Now, with measure theory in hand, we can ham-fistedly shove F(x) straight into \int_0^{\infty} 1-F(x) \mathop{dx} and call it a day.
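For example, with the F above we have 1 - F(x) = (1-p)e^{-\lambda x} for x \geq 0, so the whole computation is

\mathbb{E}(X) = \int_0^{\infty} 1 - F(x) \mathop{dx} = \int_0^{\infty} (1-p)e^{-\lambda x} \mathop{dx} = \frac{1-p}{\lambda},

with no case analysis on the atom at zero required.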

We can also start sparring with more exotic random variables on non-numeric spaces with confidence. I’m currently working through Diaconis’ Group Representations in Probability and Statistics, so hopefully I can speak on these “applications” in more detail in the future. But for now, I’ll leave it as an enticing mountaintop rather than trying to spoil the ending.

Intuition

It is no secret that I don’t like the IVT, or theorem-motivated definitions more broadly, so I am uncomfortable leaning on it in an argument. What I will provide here is my own post-hoc intuition for the measure-theoretic expected value. Rather fortuitously it leans on the IVT, so I’ll pedantically point out that I’m really using the Intermediate Value Property (IVP) in its own right. Observe below that the area on the left is the area defining a measure-theoretic expected value as we have seen above.

The area defining a measure-theoretic expected value, and the rectangular region of equal area guaranteed by the IVP.

Note that this area is the same as the area of the rectangular region. As it has unit width, its height is also its area. This height is not coincidentally the mean value of F_X^{-1} guaranteed by the IVP, so we see that the measure-theoretic definition gives us a measure of central tendency. That this is the same measure of central tendency as the traditional definition can be shown in many ways, but we have seen it today as a porism of theorem 2.

What have we learned?

In short, that generalisation is cool, and measure theory is not as scary as I thought after failing it in third-year. It gives us steady footing to go and explore exotic spaces, and it provides some nice perspectives on old favourites. Is it of practical use to the working statistician? Debatable. Our main theorem can certainly be used without actually doing any measure theory. Perhaps it provides nice perspectives on transformations if one does need to compute certain integrals which aren’t recognisable. What do you think? Have I convinced you measure-theoretic probability isn’t useless? Do you know any interesting applications I didn’t mention? As always, I’d love to hear your thoughts.

Resources

I am always hesitant to endorse texts based solely on how helpful they were to me. We should remember that one always understands something better the second time. That being said, the following two probability-oriented texts were useful to me. Matthew N. Bernstein has a trio of nice blog posts entitled Demystifying measure-theoretic probability theory, which are a nice, slow introduction to some of the basics. I also found Sebastien Roch’s Lecture Notes on Measure-theoretic Probability Theory useful as a much denser, more comprehensive reference. As for strictly measure-theoretic principles, I found plenty enough information by simply clicking the first Wikipedia article to pop up when I searched the relevant terms.

On the problem of adding dice rolls to a threshold

See mathsfeed.blog/problem-adding-dice-rolls/ for the motivation, introduction and immediate discussion of the problem.

The problem

We want to roll N dice at a time, adding up the total over repeated rolls, and continue to do so until we reach a threshold t. When we reach or exceed t, we note X, the sum of the dice we just rolled. What is the distribution of X? We will call the range t - 6N to t-1 the striking range, as a cumulative total in this range might be carried to t (or beyond) by the next roll. Inside this range we need to pay extra attention to the value of the next roll, as depending on its value we might have to roll again, or we might have to stop.

Let C_{k,N} be the cumulative sum of N dice rolled k times. Let C be the cumulative sum of N dice rolled an unknown number of times. We want to find a mixing point M such that, for all b \geq M, f(c;b) := \mathbb{P} (C = c \mid b \leq C \leq b + 5 N) is (approximately) uniform in c. Why? If we find such an M, then as long as our threshold is at least a few maximum dice rolls past M, it doesn’t really matter how far away it is: we can always assume our cumulative total approaches the striking range from somewhere uniform in an appropriately wide interval just outside the striking range. This significantly reduces the complexity of an analytic solution or a computer simulation. If the threshold is not significantly past the mixing point, then we have to be careful, as our cumulative total is more likely to be at particular points and the calculations become more complex, our cumulative total arriving in chunks of about 3.5N at a time.

Does M exist?

It might be that M doesn’t really exist, and there is always some very subtle nonuniformity to f(c;b). Ruling that out would be another problem entirely, and we’re probably quite fine with M technically being a function of some tolerance level. Let’s quickly develop a mental picture with some histograms. This will hopefully convince us that M exists (or that we get close enough to uniform that it looks like M exists), and suggest how we might capture its meaning.

Pictures

For a particular pair of k and N, C_{k,N} just follows some discrete distribution with a nice bell-curvy shape. Of real interest is sampling from C, as we don’t know how many rolls it took to reach our threshold. The immediate problem is that this requires sampling a k first, and we don’t want to make any assumption about k‘s value. So instead I will just uniformly sample k values between 0 and 200 and hope that our brains can imagine the extension to unbounded k. Play around with my code in the Colab notebook here. Actually, let’s always sample C_{200,N} but keep track of the partial sum at each intermediate step. I know this technically violates independence assumptions, but whatever. I’m also going to work with N = 20 for these pictures, as the results are suitably interesting.
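For those who’d rather skim than open the notebook, here is a rough sketch of the kind of simulation I mean (the variable names and sample sizes are mine, not lifted from the Colab):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

N = 20           # dice per roll
K = 200          # rolls per sample
samples = 2_000  # number of samples

# rolls[s, k] is the total of the N dice on roll k of sample s.
rolls = rng.integers(1, 7, size=(samples, K, N)).sum(axis=2)

# partial[s, k] is C_{k+1, N} for sample s: the cumulative total after k+1 rolls.
partial = rolls.cumsum(axis=1)

plt.hist(partial.ravel(), bins=500)
plt.xlabel("cumulative total C")
plt.ylabel("count")
plt.show()
```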

Here we can see a histogram for 10 000 samples of C_{100, 20}. As expected it forms a lovely bell-curve shape.

Here we see a histogram for 10 000 samples of C_{k,20} for all 1 \leq k \leq 200. We can see the nice uniform property we’re looking for emerge definitively once C exceeds about 3 000, but it’s hard to tell at this scale. Let’s zoom in on the more interesting part of the graph.

Here we only look at samples of C_{k,20} for k \leq 50. We can now see more clearly the spikes which indicate we are not yet at the mixing point, which from this view appears to be at about M = 2000. Zooming in further:

We can identify the spikes more clearly here. Given we roll 20 dice at a time, we should not be surprised to see the initial values occur in spikes about 70 units apart (3.5 \times 20 = 70), which is roughly what we see.

They do get a bit wider as we move to the right, as the tails of slightly fatter and further right spikes gently nudge up their neighbours. So, whatever more technical answer we derive below should line up roughly with these observations, namely that by the time C is about 2000, or about 29 dice rolls, it should be well mixed.

Defining M

For a more technical definition, you can be as picky as you like as to how you define suitably uniform, probably with some sort of \varepsilon floating around, but I want a rough and ready answer, and I don’t personally enjoy having \varepsilon‘s littered throughout my work, so my working definition is as follows:

If, for all C \geq M, it is no longer obvious to infer from C how many rolls k it took to reach C, then M is a mixing point.

Of course this is entirely heuristic: in some weak sense it is no longer obvious as soon as there is more than one value of k for which \mathbb{P}(k \mid C) is nonzero, but that happens very quickly and does not capture what we see in the simulations. In the other direction, for any C, there will always be some much more sensible guesses for k than others, probably an integer close to \frac{C}{3.5N}. So we need to start by deciding on our criteria for obvious. I’ve come up with a couple of different definitions, and I’ll discuss them both below.

Finding M the easier way

It can be checked that \mathbb{E} (C_{k,N}) = 3.5 k N, and that \mathrm{var}(C_{k,N}) = \frac{35}{12}kN. From now on, if I need to, I will approximate C_{k,N} with \hat{C}_{k,N}, a normal random variable with the same mean and variance. Then I can say we have reached the mixing point if there is significant overlap between C_{k,N} and C_{k+\delta, N} for some \delta \geq 1. Again there are lots of choices for what is meant by significant overlap and for the choice of \delta. Inspired by mathsfeed.blog/is-human-height-bimodal I think a reasonable choice is to take \delta = 1, and consider the overlap significant if there is only one mode, not two. Using the fact that a normal pdf is concave down within 1 standard deviation of its mean, we would like that one standard deviation above the mean for \hat{C}_{k,N}:

3.5 k N + \sqrt{\frac{35}{12} k N}

is equal to one standard deviation below the mean for \hat{C}_{k+1,N}:

3.5(k+1)N-\sqrt{\frac{35}{12}(k+1)N}

One can do some rather boring algebra to arrive at \sqrt{4.2N} = \sqrt{k+1}+\sqrt{k}. You can solve this properly I guess, but I am a deeply lazy person, so I’m going to approximate the right-hand side as 2\sqrt{k}. If this upsets you, then I am deeply sorry, but I will not change. (k is big enough, and we’re rounding to a whole number at the end of the day, so it’s fine, but I’ve already spent more time justifying this than I wanted to.) This allows us to arrive at k \approx 1.05N.

This roughly agrees with the scale we wanted for N = 20. If you try to count out the first 21 spikes in the above plots, they become very hard to make out by the end. So I’m actually fairly happy with this answer, subject to some proper checking with more choices of N, and maybe topping it off with another 20% just for good measure. More important, I think, is convincing yourself that if I had chosen some other number of standard deviations or some larger \delta, then k as a function of N should still be linear! So instead of rederiving all of these calculations, just remember that if you’re happy that N = 1, t = 500 is well-mixed, then N = 10, t = 5000 should be well-mixed too.

Note that this condition truly isn’t enough to guarantee uniformity, as it makes no attempt to consider the contribution of any C_{i,N} other than i = k and i = k + \delta, but it should ensure any spikiness is rather muted. If you’re happy with this condition, good, so am I, but I may as well mention the other method I thought of for measuring mixing.
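If you don’t trust my laziness, the exact condition \sqrt{k}+\sqrt{k+1} \geq \sqrt{4.2N} is easy to check numerically against the k \approx 1.05N shortcut. A throwaway sketch (nothing to do with the notebook above):

```python
import numpy as np

def k_exact(N):
    """Smallest k with sqrt(k) + sqrt(k + 1) >= sqrt(4.2 * N), i.e. where the modes merge."""
    target = np.sqrt(4.2 * N)
    k = 1
    while np.sqrt(k) + np.sqrt(k + 1) < target:
        k += 1
    return k

for N in (1, 5, 10, 20, 50, 100):
    print(N, k_exact(N), round(1.05 * N, 1))
```

For N = 20 this gives k = 21, in line with the k \approx 1.05N shortcut.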

Finding M the bayesian way

The definition of suitably uniform above is very heavily based in conditional probability, and I am a dyed-in-the-wool bayesian, so I’m going to attack with all the bayesian magic spells I can muster. If you’re a committed frequentist, maybe it’s time to look away.

We want to derive

p(k \mid C) = \frac{p(C,k)}{p(C)}.

Can we derive p(C,k)? Well by the definition of conditional probability,

p(C,k) = p(k) \cdot p(C \mid k).

I know p(C \mid k) is approximately normal, so

p(C = c \mid k) \approx \frac{1}{\sqrt{2\pi \cdot \frac{35}{12}kN}} \exp{\left(-\frac{1}{2} \cdot \frac{(c-3.5kN)^2}{\frac{35}{12}kN}\right)}
\propto \frac{1}{\sqrt{k}} \exp \left( -\frac{6}{35N} \cdot \frac{(c - 3.5kN)^2}{k} \right),

and we have no prior information about what k should be, so we can give p(k) a constant, uninformative prior. Finally, p(C) is not a function of k, it’s just a scaling factor, so

p(k \mid C) \propto p(C,k) \propto \frac{1}{\sqrt{k}} \exp \left( -\frac{6}{35N} \cdot \frac{(c - 3.5kN)^2}{k} \right).

Now admittedly it’s been a while since I was properly in the stats game, so my tools might be a bit rusty, but this doesn’t look like a pmf I’m familiar with. It looks like it’s in the exponential family, so maybe somebody with more experience in the dark arts can take it from here. I guess you could always figure out some sort of acceptance-rejection sampler if needed. Okay, but what’s the point? Well, now that we have our posterior for k \mid C, we can be more precise about it being suitably non-obvious what to infer for k. The first criteria that come to mind for me are either specifying that the variance should be suitably large (which can be approximated up to proportionality from the unnormalised pmf, though that proportionality depends on C in general), or that the mode of the distribution is suitably unlikely (also easy up to proportionality, but knowing the actual probability itself feels more integral to the interpretation). Of course, in both cases we can approximate the normalising constant by computing an appropriate partial sum. I’ve knocked up a quick demo on Desmos of what this would look like in practice.
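As a sketch of what I mean by computing an appropriate partial sum (my own rough code, in the same spirit as the Desmos demo rather than a copy of it):

```python
import numpy as np

def posterior_k(c, N, k_max=1000):
    """Approximate p(k | C = c) on k = 1..k_max under the normal approximation."""
    k = np.arange(1, k_max + 1, dtype=float)
    var = (35.0 / 12.0) * k * N
    # Unnormalised normal likelihood with a flat prior on k.
    log_post = -0.5 * np.log(var) - 0.5 * (c - 3.5 * k * N) ** 2 / var
    post = np.exp(log_post - log_post.max())   # stabilise before normalising
    return k, post / post.sum()

k, post = posterior_k(c=2000, N=20)
print("mode probability:", post.max())
print("posterior variance:", np.sum(post * k**2) - np.sum(post * k) ** 2)
```

Either summary (the mode’s probability or the variance) can then be tracked as c grows, to decide when the value of k has become suitably non-obvious.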

Concluding remarks

Note of course that the normal approximation itself only works if the number of dice in each roll is large enough to apply the CLT. It also then feels like no coincidence that ‘about 30 rolls’ is the conclusion, as it sounds an awful lot like my usual retort when asked whether a sample is big enough for its mean to be approximately normal. Overall I’m okay with making approximations which assume a large N for the same reason we are more interested in deriving results for large t: for small t and/or N, we can probably simulate the answer with high precision using a computer, or even by hand for very small values. But these asymptotic results help us to be confident about when we can truncate the simulation for speed, or when we can stop doing simulations and rely only on the asymptotic results.

If you enjoyed this, you might enjoy my other posts on problems I would like to see solved, or find out more about my research from my homepage.

Does this graph have a Hamilton decomposition?

My hope is that this post distills the interesting unanswered questions from my honours thesis. The final question relies on various results and definitions which I will try to introduce in the most interesting way possible, occasionally avoiding formalism if it conflicts with reader enjoyment and intuition. I will try to title broad sections so those already familiar can skip ahead if they want to. I will assume you are already familiar with, or able to look up, basic graph theory notation and definitions. If ever you crave more of the gruesome details, they are of course to be found in the thesis.

Hamilton decomposition of a finite graph

Here we see K_5, the complete graph on 5 vertices. The edges have been coloured so that the edges of each colour form a spanning, connected, 2-regular subgraph. Each colour is called a Hamilton cycle, and the colouring of the graph is called a Hamilton decomposition. A Hamilton cycle provides a route which visits every vertex exactly once and returns to the starting vertex.
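As a quick sanity check of the definition, here is a small sketch that verifies a standard decomposition of K_5 (the particular colouring is my choice, not necessarily the one in the picture): each colour class should be spanning, connected and 2-regular, and together they should use every edge exactly once.

```python
import networkx as nx

# Two colour classes on vertices 0..4; together they use each edge of K_5 once.
colour_classes = [
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)],   # first Hamilton cycle
    [(0, 2), (2, 4), (4, 1), (1, 3), (3, 0)],   # second Hamilton cycle
]

K5 = nx.complete_graph(5)

# Every edge of K_5 appears in exactly one colour class.
used = sorted(tuple(sorted(e)) for c in colour_classes for e in c)
assert used == sorted(tuple(sorted(e)) for e in K5.edges())

# Each colour class is a spanning, connected, 2-regular subgraph: a Hamilton cycle.
for edges in colour_classes:
    H = nx.Graph(edges)
    assert set(H.nodes()) == set(K5.nodes())
    assert all(d == 2 for _, d in H.degree())
    assert nx.is_connected(H)

print("This colouring is a Hamilton decomposition of K_5.")
```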

The natural question for a given graph is whether or not it has a Hamilton decomposition. The obvious necessary criteria are that the graph must be connected and even-regular. The less obvious necessary condition is a refinement of the connectedness condition; more precisely, a 2k-regular graph with a Hamilton decomposition must be 2k-edge-connected, i.e. there is no set of 2k-1 edges whose deletion disconnects the graph.

Exercise: Prove this necessary condition.

A graph which fits the necessary criteria will be called admissible. If every admissible graph had a Hamilton decomposition, then we would be done, but we know of examples where this is not the case, even if we assume a high level of structure in the graph. In general it is still hard to tell if an admissible graph has a Hamilton decomposition, but there are some lovely families of admissible graphs for which we do know a Hamilton decomposition.

One family of graphs for which we know a lot is the Cartesian product G \Box H of two graphs G and H. (wiki) At each vertex of G, insert a copy of H, and connect corresponding vertices of H whenever their copies sit at adjacent vertices of G.

Exercises: Check that G \Box H is isomorphic to H \Box G. Check that if G is r_G-regular and H is r_H-regular, then G \Box H is (r_G + r_H)-regular.

Exercise: (for those who are familiar with graph automorphisms) Verify that Aut(G) \leq Aut(G \Box H).

Of relevance for us are the following results:

Kotzig 1973: The Cartesian product of two cycles C_n \Box C_m has a Hamilton decomposition.

Foregger 1978 [1]: The Cartesian product of three cycles C_{\ell} \Box C_m \Box C_n has a Hamilton decomposition.

Aubert and Schneider 1982 [2]: If G is a 4-regular graph with a Hamilton decomposition, then G \Box C_n has a Hamilton decomposition.

There are stronger results, culminating with Stong (1991) [3], but these results are enough for us to forge ahead towards our final problem.

Hamilton decomposition of an infinite graph

We borrow the interpretation of a Hamilton cycle in a finite graph as a spanning, connected, 2-regular subgraph. In an infinite graph this no longer forms a cycle, as a cycle would only visit a finite number of vertices; it instead forms a two-way infinite path. We call such a path in an infinite graph a Hamilton double-ray. Then a Hamilton decomposition of an infinite graph is a colouring of the edges so that every colour is a Hamilton double-ray. We will denote by P the two-way infinite path, the graph to which every Hamilton double-ray is isomorphic.

The Hamilton decomposition of the \{1,3\}-distance graph, which has the integers as vertices, and an edge between two vertices x,y iff \left| x - y \right| \in \{1,3\}.
The Hamilton decomposition of the 2-dimensional lattice.

Observe in the above two examples some lovely properties. In the distance graph, we have the same chunk of colouring repeated every 6 units, a clever exploitation of the sliding symmetry of the integers. Similarly in the lattice, each colour is self-symmetric under a 180º-rotation, and the two colours are each a 90º-rotation of the other, cleverly exploiting the rotational symmetry of the lattice. I find this to be such a simple and lovely picture, I can’t believe this would be a 21st century observation. Like the six-petal rosette, it should have been noticed countless times throughout history, perhaps as a girih pattern in a Syrian shrine, or painted onto an ancient Roman pot. Art historians, please get in touch. In the meantime, we should pin down a key difference in the structure of these two infinite graphs.

The six-petal rosette is the starting point for another infinite graph with its own lovely Hamilton decomposition.

Ends of infinite graphs

Informally, we can think of an end of a graph as a distinct direction into which our graph can extend infinitely. Slightly less informally, we say two ends are different if there is some finite vertex-deletion which makes the two ends disconnected. Then we can say that the number of ends of a graph is the maximum number of infinite connected components under some finite vertex-deletion. Under this definition, any finite graph has 0 ends. What are other possible values for the number of ends of a graph? The distance graph has 2: deleting a reasonably large chunk of vertices in the middle will produce two infinite connected components, one on the left (a negative end) and one on the right (a positive end). It might seem at first that the lattice has many, corresponding to different angles of escape, but in fact it only has 1. Any finite deletion of vertices might leave some finite components in the middle, but will never leave two infinite connected components. A graph can certainly have more than 2: taking the k-star graph and extending each leaf indefinitely produces a k-ended infinite graph, for example. But if a graph does have more than two, we could never hope to visit every vertex in that graph with a single Hamilton double-ray; some end would always be left out. This explains the differing patterns between our two exemplars above. As the distance graph has two ends, and so does a Hamilton double-ray, we can extend each end of the double-ray into each end of the graph quite comfortably. Whereas for the one-ended lattice, we have to carefully wrap our two ends up in a spiralling fashion to, in a sense, make a thicker one-ended spiral.

Extra necessary conditions

Unfortunately, there is another obstacle which arises when we begin to consider such infinite graphs. As well as requiring no more than 2 ends, if the graph is two-ended there may be a parity issue as to why no decomposition can exist. Consider the following two-ended graph C_3 \Box P:

C_3 \Box P.

Note the three dotted edges. Each Hamilton double-ray must pass through this edge-set an odd number of times, as otherwise it would have both its ends on the same side of the graph, which in turn would mean it only visits a finite number of vertices on the other side, contradicting the requirement that it visit every vertex. As this graph is 4-regular, it would be decomposed into two Hamilton double-rays. Two odds add to an even, but there are 3 edges, so clearly this is not possible. This argument can be generalised and refined in the case where our two-ended graph is of the form C_n \Box P; namely, a Hamilton decomposition is only possible if n is even. What a relief it is, then, that it is always possible if n is even. The below image suggests the pattern we can apply in such a case.

The Hamilton decomposition of C_{10} \Box P.

The big questions

Erde, Lehner and Pitz 2020 [4]: If G and H are infinite, finite-degree graphs, each with a Hamilton decomposition into double-rays, then G \Box H has a Hamilton decomposition into double-rays.

C_3 \Box P \Box P.

In general, it would be nice if C_n \Box P \Box P had a Hamilton decomposition; this would be a nice generalisation of Foregger’s and of Aubert and Schneider’s results for finite graphs, either viewing P as an infinite cycle, in which case Foregger’s result would apply, or viewing P \Box P as the lattice, which already has a Hamilton decomposition, in which case Aubert and Schneider’s result would apply. If n is even, then we already know that C_n \Box P has a Hamilton decomposition as above, so we can use Erde, Lehner and Pitz’s result to say that C_n \Box P \Box P has a Hamilton decomposition. If n is odd, then C_n \Box P doesn’t have a Hamilton decomposition, so we can’t approach it this way. We already know it’s possible for one odd case, namely n = 1, in which case we simply have the lattice.

Understanding any symmetry and pattern in the even case would be incredibly insightful for determining the odd case, but the construction becomes a bit too complex, and the pictures a bit too messy, so it would take a very careful hand to draw out the decomposition in a way we could wrap our heads around. If you can manage to draw out the decomposition for n = 4, this would already be a more concrete example than anything I have managed to draw, and would probably provide quite a valuable mental image of how to translate this to odd n. Alternatively, it seems that the simplest odd case n = 3 has all the right components for its own lovely symmetric solution, given that a decomposition would have 3 colours and the graph has an order 3 symmetry. Such a solution would provide its own insight into the other odd cases. So there is no reason a clever human brain couldn’t figure out a nice solution directly, perhaps inspired by the lattice decomposition. Alas, so far it seems that clever human brain might not be mine, but it might be yours!

Bibliography

[1] Marsha F. Foregger. Hamiltonian decompositions of products of cycles. Discrete Mathematics, 24:251–260, 1978.
[2] Jacques Aubert and Bernadette Schneider. Décomposition de la somme cartésienne d’un cycle et de l’union de deux cycles hamiltoniens en cycles hamiltoniens. Discrete Mathematics, 38:7–16, 1982.
[3] Richard Stong. Hamilton decompositions of Cartesian products of graphs. Discrete Mathematics, 90:169–190, 1991.
[4] Joshua Erde, Florian Lehner, and Max Pitz. Hamilton decompositions of one-ended Cayley graphs. Journal of Combinatorial Theory, Series B, 140:171–191, 2020.

Does every pair of bases have a shared palindrome?

The shortest possible answer:

I don’t know.

The slightly longer answer:

The evidence suggests so, but I can’t prove it, nor can I find an existing proof. Maybe you can!

Notation:

An integer x written with no other distinguishing feature is always in base 10. Otherwise, I will write an integer x in its base b notation as [a_n, a_{n-1}, \dots, a_0]_b = \sum_{i=0}^{n} a_i b^i, where 0 < a_n < b (i.e. the leading digit is non-zero) and 0 \leq a_i < b (i.e. every other digit is less than b). A palindrome is an integer which reads the same forwards as backwards. More formally, an integer x is a palindrome in base b if x = [a_n, \dots, a_0]_b and a_{i} = a_{n-i} for all 0 \leq i \leq n. An integer which is a palindrome in both base a and base b will be called a dual palindrome (or just a dual) in bases a and b. That’s all really, let’s get stuck in.
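In code, this notation amounts to two tiny helpers (my own throwaway versions, not necessarily what’s in the Colab notebook linked further down):

```python
def digits(x, b):
    """The base-b digits [a_n, ..., a_0] of a positive integer x."""
    ds = []
    while x > 0:
        x, r = divmod(x, b)
        ds.append(r)
    return ds[::-1]

def is_palindrome(x, b):
    """Is x a palindrome in base b?"""
    ds = digits(x, b)
    return ds == ds[::-1]

assert digits(5, 2) == [1, 0, 1] and digits(5, 4) == [1, 1]
assert is_palindrome(5, 2) and is_palindrome(5, 4)
```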

Observation:

  • 5 = [1,0,1]_2 = [1,1]_4 is a palindrome in base 2 and base 4, I wonder if I can find such a palindrome for any pair of bases?
  • 1 = [1]_a = [1]_b is a palindrome for any pair of bases, but one-digit palindromes are not very interesting. Let’s make sure our palindrome has at least 2 digits in both bases.

Formal statement of the conjecture:

For any pair of integers 2 \leq a < b, there exists an integer x such that b < x and x is a palindrome in both base a and base b.

Numerical evidence:

After enumerating a few small examples, I did what one should always do when one stumbles on a list of interesting numbers: plug it into the OEIS. It pointed me to A293925, which enumerates exactly my sequence, first authored in 2017. Note that it is natural to view this sequence as an upper-triangular matrix T = (t_{ij}), with the entry t_{ij}, for row i less than column j, being the least non-trivial dual palindrome in bases i and j. So I’m not the first person to think of this problem. But the major unstated premise is that this is in fact a sequence. It is not immediately obvious that:

  1. every pair of bases has such a nice, finite, dual palindrome;
  2. there isn’t some point/line in the matrix beyond which all pairs of bases fail to have a dual palindrome.

As is, the OEIS listing makes no mention of proving either of these statements, and my own perusal of the literature could not find anything either. (I am not saying it couldn’t possibly be some obscure corollary of a paper too dense to read or too obscure to be findable online, but even if it is, it needs to be brought to light. The OEIS does like to make sure its sequences are well defined and actually sequences.)

The element in the ith row, jth column is the smallest dual palindrome in bases i and j. Cells coloured by number of digits in base 10.

The next step is to generate some more data. I have put together a Google Colab notebook with my computational work on the matter. You’re welcome to have a look through it yourself, but ultimately generating more finite cases doesn’t really help a great deal with proving the conjecture. The details are not worth going over here, but it does produce some lovely images which I will scatter throughout. The important takeaway is that there is fairly thorough evidence to suggest that the conjecture is true. It would be selfish to state a conjecture based on purely computational results and demand somebody else solve the problem, without at least trying to solve the problem myself, so here goes. Fair warning: I have no experience in number theory, so consider that a warning if you do.
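For what it’s worth, the search itself is nothing clever. Here is a sketch (re-stating the palindrome helper so the snippet stands on its own; the actual notebook is more careful about performance):

```python
def is_palindrome(x, b):
    """Is x a palindrome in base b? (Same helper as in the earlier sketch.)"""
    ds = []
    while x > 0:
        x, r = divmod(x, b)
        ds.append(r)
    return ds == ds[::-1]

def smallest_dual(a, b):
    """Smallest x > b that is a palindrome in both base a and base b."""
    x = b + 1
    while not (is_palindrome(x, a) and is_palindrome(x, b)):
        x += 1
    return x

assert smallest_dual(2, 4) == 5   # the opening observation
assert smallest_dual(3, 7) == 8   # the starter dual of Remark 3 below
print(smallest_dual(2, 3))        # 6643, the bases 2 and 3 entry from the records table below
```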

Partial theoretical results

So far I can contribute the following results. They are incredibly elementary, but it feels like I would need a much stronger understanding of number theory to make a more insightful contribution.

Remark 1: x = a^c + 1 = [1,0,0,\dots,0,0,1]_a = [1,1]_{a^c} is a palindrome in bases a and a^c.

This fact is self-evident, and enough on its own to answer the second point, namely that there are finite entries to the matrix for arbitrarily large rows and columns. But of course the number of entries guaranteed by this result is quite sparse. We can generalise it as follows:

Remark 2: x = a^{bc} + 1 = (a^b)^c + (a^b)^0 = [1,0,0,\dots,0,1]_{a^b} = (a^c)^b + (a^c)^0 = [1,0,0,\dots,0,1]_{a^c} is a palindrome in bases a^b and a^c.

That gives us many more examples, and solves the problem whenever the two bases are both powers of a common integer a. The following observation gives us a much less sparse family of dual palindromes. To motivate it, observe:

Same as the previous image, with a noteworthy sequence of duals identified between the yellow tramlines.

There seems to be a sequence of duals all a knight’s move apart, and each only 2 greater than the last, while the nearby duals seem to get bigger much quicker. The following remark explains why we see such a nice sequence.

Remark 3: If x = b+1 = [1,1]_b = 2a+2 = [2,2]_a is a dual palindrome for bases a and b, then x + 2 = (b+1)+2 = (b+2)+1 = [1,1]_{b+2} = (2a+2)+2 = 2(a+1)+2 = [2,2]_{a+1} is a dual palindrome for bases a+1 and b+2.

8 = [1,1]_7 = [2,2]_3, so there is a dual palindrome x for all bases of the form 7+2i and 3+i, of the form x = 8 + 2i = [1,1]_{7+2i} = [2,2]_{3+i}. This is still only scratching the surface of a general solution, but in a sense it is much denser, and it seems we should be able to generalise this observation. So let’s work on that for a while.

Remark 4: If x = b+1 = [1,1]_b = da+d = [d,d]_a is a dual palindrome for bases a and b, then x + d = (b+1)+d = (b+d)+1 = [1,1]_{b+d} = (da+d)+d = d(a+1)+d = [d,d]_{a+1} is a dual palindrome for bases a+1 and b+d.

This is technically a generalisation, but not a very useful one. A cursory search of the first few cells reveals that most of the dual palindromes which would serve as the first in a sequence of this form are all of the form [1,1]_b = [2,2]_a already. Eventually we find an example like 15 = [1,1]_{14} = [3,3]_4 which generates an infinite family, but ultimately the constraint that the dual is a 2-digit palindrome in both bases is rather restrictive. Let’s call a dual like [1,1]_b = [d,d]_a a starter dual; then a sequence of duals, each one a knight’s move of d to the right and 1 down from the last, beginning at the starter dual, will be called a (d,1)-knight’s move family.

The (3,1)-knight’s move family produced by remark 4.

If we squint at our matrix, we eventually notice other knight’s move families, but it seems that having a starter dual is the hardest part, and the easiest way to find a family is to look for a streak of blues along a knight’s move diagonal, not to look for the starter. Further, if we pick out a random cell, it would be nice if there were another cell just down and to the right holding a slightly larger value; this would suggest it belongs to some sort of knight’s move family. But we typically don’t find this. More broadly, if most duals belonged to a knight’s move family, we would expect the size of the smallest dual palindrome to grow roughly linearly as we make knight’s moves, but again, we typically don’t see this. From this we can learn that any attempt to capture a general dual palindrome as part of a knight’s move family will probably need the duals in that family to grow faster than linearly. So we need to generalise more. Eventually we might spot the following family:

A (2,5)-knight’s move family growing by 10 each step.

Which motivates the following:

Remark 5: If x =2b+2 = [2,2]_b = 5a+5 = [5,5]_a is a dual palindrome for bases a and b, then x + 10 = 2b+2+10 = 2b+10+2 = 2(b+5)+2 = [2,2]_{b+5} = 5a+5+10 = 5a+10+5 = 5(a+2)+5 = [5,5]_{a+2} is a dual palindrome for bases a+2 and b+5.

The interesting observation is that our knight’s move is now more than 1 step in both directions, and that the increase in the size of the dual at each step is a common multiple of the two digits. Like a broken record, we generalise as follows:

Remark 6: If x =cb+c = [c,c]_b = da+d = [d,d]_a is a dual palindrome for bases a and b, then x + cd = cb+c + cd = c(b+d)+c = [c,c]_{b+d} = da+d + cd = d(a+c)+d = [d,d]_{a+c} is a dual palindrome for bases a+c and b+d.
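Remark 6 is easy to spot-check by machine. Here is a sketch using the starter 15 = [1,1]_{14} = [3,3]_4 mentioned above (the helper and all names are my own):

```python
def is_palindrome(x, b):
    """Is x a palindrome in base b? (Same helper as in the earlier sketches.)"""
    ds = []
    while x > 0:
        x, r = divmod(x, b)
        ds.append(r)
    return ds == ds[::-1]

# Starter dual: 15 = [1,1]_14 = [3,3]_4, so c = 1, d = 3, a = 4, b = 14.
a, b, c, d, x = 4, 14, 1, 3, 15

# Walk 100 knight's moves: at step i the bases are (a + i*c, b + i*d) and the dual is x + i*c*d.
for i in range(100):
    xi, ai, bi = x + i * c * d, a + i * c, b + i * d
    assert is_palindrome(xi, ai) and is_palindrome(xi, bi)

print("Remark 6 checks out for the first 100 steps of this family.")
```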

This feels like the end of the easy generalisations of what we’ve seen so far, in that we can take any dual palindrome x which has two digits in both bases, and generate a knight’s move family with x as the starter. To generalise further, we would either need to:

  1. handle starters with more than two digits in either base,
  2. generate a knight’s move family in an arbitrary direction, not one dictated by the digits of the starter, or
  3. envision an entirely new approach to creating dual palindromes.

The last of these is a blind hope based on unique insight (that’s where somebody else can swoop in, please!), not a predictable strategy, so really we are stuck with one of the first two. The second point would itself require some new insight, as everything we have done so far carefully chooses the step size at each knight’s move so it factors nicely with both digits; in general it’s not clear how we could do this. So we are left with the first point. Certainly it would be relevant to understand how many digits a typical minimum dual palindrome has. We can analyse the record entries of the sequence as a heuristic for the worst case scenario:

| smallest dual palindrome x in bases a and b | a | b | digits in base a | digits in base b |
| --- | --- | --- | --- | --- |
| 6643 | 2 | 3 | 13 | 9 |
| 9686 | 4 | 26 | 7 | 3 |
| 9902 | 7 | 34 | 5 | 3 |
| 16008 | 7 | 35 | 5 | 3 |
| 20512 | 30 | 37 | 3 | 3 |
| 33577 | 5 | 42 | 7 | 3 |
| 33621 | 34 | 51 | 3 | 3 |
| 40243 | 36 | 54 | 3 | 3 |
| 47723 | 38 | 57 | 3 | 3 |
| 55155 | 10 | 60 | 5 | 3 |
| 56115 | 40 | 60 | 3 | 3 |
| 65347 | 42 | 63 | 3 | 3 |
| 65883 | 11 | 66 | 5 | 3 |
| 73606 | 44 | 66 | 3 | 3 |
| 120250 | 57 | 68 | 3 | 3 |
| 248833 | 12 | 71 | 6 | 3 |
| 620832 | 11 | 88 | 6 | 3 |
| 648170 | 90 | 135 | 3 | 3 |
| 694314 | 92 | 138 | 3 | 3 |
| 44838080 | 42 | 139 | 5 | 4 |
| 2000608059 | 18 | 161 | 8 | 5 |
| 4045926645 | 24 | 191 | 7 | 5 |
| 25594427566 | 40 | 399 | 7 | 5 |
| 61756843258 | 35 | 489 | 7 | 5 |
| 906370255519 | 60 | 899 | 7 | 5 |
| 2873008152176 | 84 | 1175 | 7 | 5 |
| 84598768520000 | 72 | 1727 | 8 | 5 |

My first thought on seeing these records is to remark that the dual palindrome in bases 2 and 3 (literally the first entry) is by far the worst in terms of number of digits, even up to the first \binom{1727}{2} entries. In some sense this is promising: maybe it is the lack of choices for the digits which is the main restriction, so for larger pairs of bases it should be easier. One might go so far as to say that the pair 2 and 3 is the worst possible pair. After a bit of meditation this is clearly not true: any non-trivial palindrome in base 2^{1000} is at least 2^{1000}, and so has at least 1001 digits in base 2, so things definitely get worse at that pair of bases; we could just never dream of enumerating the entries that far, so we get tricked by what are ultimately only small cases. If we measure not by the number of digits in the smaller base but instead the larger, there seems to be more hope. We already know that x = [1,0,\dots,0,1]_{2} = [1,1]_{2^{1000}} is a palindrome in bases 2 and 2^{1000}, so in this sense that pair of bases isn’t so problematic. But at the moment, all bets are off regarding base 2 and base b = 2^{1000}+1. Given the mind-bogglingly large number of palindromes in base b with fewer than 5 digits, it seems unlikely that we would need 6 or more digits to find one which is also a palindrome in base 2. Nonetheless, we see the number of digits in the larger base slowly increasing in the table above, so really who is to say. What we can say with confidence is that we will often need to resort to more than 2 digits in at least the smaller base, so it seems more sophisticated techniques than the remarks above are required.

Results I’m not really sure what to do with

The heading speaks for itself, perhaps someone smarter than me can figure out why these might be useful observations.

Remark A: If [a_0, a_1, a_2, \dots, a_2, a_1, a_0] is a palindrome with an even number of digits (i.e. n odd), then the formal polynomial:

a_0 + a_1 b + a_2 b^2 + \dots + a_2 b^{n-2} + a_1 b^{n-1} + a_0 b^n

has b = -1 as a root.

Proof:

a_0 + a_1 (-1) + a_2 (-1)^2 + \dots + a_2 (-1)^{n-2} + a_1 (-1)^{n-1} + a_0 (-1)^n

= a_0 - a_1 + a_2 - \dots - a_2 + a_1 - a_0 = 0

That [a_0, a_1, a_2, \dots, a_{n-2}, a_{n-1}, a_n] is a palindrome with an even number of digits is a sufficient condition for \sum_{i=0}^{n} a_{i} b^i to have -1 as a root, but it is by no means necessary. This suggests some sort of field-like behaviour, with b+1 as a factor of the polynomial, but this holds for non-prime-power bases too, and our coefficients do not truly behave like field elements: we need to maintain the “carry-the-one” behaviour inherent to our base representation when we add a pair of polynomials.

Remark B: By colouring each cell of the table by residue class \pmod{n} for various n, we get the following:

Table of smallest dual palindromes coloured by residue class \pmod{n} for various n. Purple is 0, through to yellow is n -1. [Warning: very big gif, might take a while to load.]

For each value of n, we seem to get these consistent grid-like streaks of purple (0 \pmod{n}) which become more apparent for larger n. Why are they there? We can observe the following:

  • There are gaps, not every element on one of the gridlines is necessarily purple.
  • The gridlines are spaced n units apart.
  • The vertical gridlines seem to be one unit wide while the horizontal gridlines seem to be two units wide.
  • They seem to become more clearly defined further down and to the right.
  • Some values of n yield stronger lines than others.

More images for higher values of n are available on the Colab.

Remark C: Of particular note is the \pmod{2} colouring:

\pmod{2} colouring. For technical reasons this image only shows the 500×500 table, but if you click the image you can view an even more enormous image which shows the 1000×1000 table in its full glory. Warning, the image is 7606×7616 pixels.

What gridlines 2 units apart would look like isn’t exactly clear, but we see interesting and varied emergent behaviour in different parts of the table:

In the top right we struggle to make out a pattern. There seem to be vertical streaks, but they are very muddled.
In the bottom right we make out clear maze-like patterns reminiscent of some ancient pottery decoration.

Concluding remarks

Many thanks go to Robert G. Wilson v, without whose collaboration the depth of computational results would be noticeably more meagre.

If you have any further insight, please share 🙂

Edits

January 2023

I have traced this question back to Question #3 of Erich Friedman’s Problem of the month, June 1999. Letting P(a,b) be the smallest non-trivial dual palindrome in bases a and b, Friedman attributes to Ulrich Schimke the formula P(a,a+1) = 2a^2 + 3a +2 = (a+1)^2 + a(a+1) + 1 for all a \geq 4, which provides the smallest non-trivial dual palindrome in bases which differ by 1. Friedman also attributes to Schimke a similar pair of polynomials for bases which differ by 2, depending on their parity, and states that Schimke believes a polynomial exists for all P(a,a+n).
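Indeed, for a \geq 4 this value is a three-digit palindrome in both bases (the middle digit 3 is only a valid base-a digit once a \geq 4):

2a^2 + 3a + 2 = [2,3,2]_a, \qquad (a+1)^2 + a(a+1) + 1 = [1,a,1]_{a+1}.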

At the time of writing I could not come up with a formal description of the maze-like pattern in the \mod 2 colouring. I now make the following observations. These are still imprecise, up to some m \neq 1 which depends on the property, mostly because an exhaustive computer search for these properties would be a headache to code up, and consequently because my eyeballs started to hurt after a while. Nonetheless, each of these properties has been tediously verified up against the main diagonal. Inside the area between the a = b diagonal and the a = mb line, there does not exist a set of four entries \{(a,b), (a+1,b), (a,b+1), (a+1,b+1)\} all the same colour (i.e. a 2×2 square). Similarly, I can’t find any monochromatic S-tetrominoes or Z-tetrominoes. Conversely, it is effortless to find I-, J-, L-, and T-tetrominoes. Why would that be?

I summarised this problem in a post on MathsFeed.