
A Quick and Dirty Overview of the Central Limit Theorem

The central limit theorem, commonly referred to as the CLT, is one of the most popular theorems in classical statistics. It underpins t-distributions, hypothesis testing, and confidence intervals. While I suspect it isn’t quite as important to those neck-deep in machine learning for a variety of reasons, the CLT describes a really cool phenomenon that is pivotal to much of classical statistics.

Let’s say we have some population of people, each with some annual income. Income data tends to be skewed right — that is, relatively few people are out in the right tail making lots of money, while a bunch of people/observations form a peak just left of the median income. That is to say, income data is NOT normally distributed (that nice bell-curve shape everyone loves so much).

We could take a single random sample of n observations from our income data and calculate the average (or mean) of that data. In a classical statistical approach, we are often working with a sample to say something about the broader population. We may want to say that our sample’s average income is reflective of the average of the overall population (putting aside for the moment that medians are generally a better metric for the central tendency of skewed distributions). The mean from this sample, though, may or may not be representative of the population mean. In fact, it could vary from the true population mean quite a lot — even if we think we have taken an unbiased random sample of the data.

We can take another, independent random sample of the same size from the same distribution of income data and calculate another average. And another and yet another. We can keep going and each time, we can also PLOT these averages we’ve calculated (such that the height of the bars in our graph corresponds to the number of times we observed a particular value as our sample average). The resulting graph, over many samples, will show you the sampling distribution of the sample mean for income data.
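If you’d like to see this repeated-sampling procedure in code, here’s a minimal sketch in Python (numpy assumed; the lognormal “income” population is a stand-in I chose purely because it has the right-skewed shape described above):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a right-skewed income population (lognormal is an
# illustration choice, not real income data)
population = rng.lognormal(mean=10.5, sigma=0.8, size=100_000)

n = 50               # observations per sample
num_samples = 2_000  # how many independent samples to draw

# Draw repeated samples and record each sample's average
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(num_samples)
])

# A histogram of sample_means is the empirical sampling distribution:
# import matplotlib.pyplot as plt
# plt.hist(sample_means, bins=40)
# plt.show()
```

Even though the underlying “income” data is heavily skewed, a histogram of `sample_means` comes out looking bell-shaped.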

CLT states that as long as the samples we are taking are of “sufficient size”, something kind of awesome will happen — regardless of the shape of the population distribution! The sampling distribution of the sample mean will converge to (approximate) a normal distribution whose mean is equal to the population mean of the distribution you’re drawing samples from. (What!) The sampling distribution also has a variance equal to the population variance σ² divided by the sample size n.
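Both of those claims (mean ≈ μ, variance ≈ σ²/n) are easy to check numerically. A quick sketch, again assuming numpy, using an exponential population I picked because it’s skewed and has a known mean (μ = 2) and variance (σ² = 4):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 40                # sample size
num_samples = 50_000  # number of repeated samples

# Exponential(scale=2): population mean mu = 2, population variance sigma^2 = 4
means = rng.exponential(scale=2.0, size=(num_samples, n)).mean(axis=1)

print(means.mean())  # should land near mu = 2
print(means.var())   # should land near sigma^2 / n = 4 / 40 = 0.1
```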

You can play around with this and see it for yourself by checking out this awesome simulator and watching how the sampling distribution of various statistics takes shape. I highly recommend playing with a custom distribution there too.

When I say the population distribution can be anything, I should note that there are some distributions where the CLT doesn’t hold (e.g., the Cauchy distribution). It is important that the distribution you’re drawing samples from has nonzero, finite variance. But it does work in cases I wouldn’t have expected, like Bernoulli random variables! I remember that being really exciting to me back in the day. Bernoulli random variables take two possible values: 1 with some probability p and 0 with probability (1 − p). This is actually a really fun little simulation project to do in R/Python/your statistical programming language of choice, as it can be a little hard to see in the simulator I linked.
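Here’s roughly what that Bernoulli simulation project might look like in Python (numpy assumed; p, n, and the number of samples are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

p = 0.3               # probability of drawing a 1
n = 100               # Bernoulli draws per sample
num_samples = 20_000  # number of repeated samples

# Each row is one sample of n Bernoulli(p) draws; average each row
draws = rng.binomial(1, p, size=(num_samples, n))
sample_means = draws.mean(axis=1)

# CLT predicts these means are approximately Normal(p, p*(1-p)/n)
print(sample_means.mean())  # near p = 0.3
print(sample_means.var())   # near p*(1-p)/n = 0.0021
```

A histogram of `sample_means` shows the bell curve emerging from data that only ever takes the values 0 and 1.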

There are a few points I'd like to highlight here:

  1. In practice, people are working with just their one sample. They may not have a lot of data to sample and re-sample from again and again. There may also be biases present in the sample depending on how the data was collected (e.g., nonrandom collection). You’re not, in practice, taking a bajillion random samples of fixed size from the population, but you may very well be using tests that build off of the result the central limit theorem provides, so it is still useful to know this theorem beyond it just being super cool.
  2. The sampling distribution of a statistic is NOT the same thing as the distribution of the data - the distribution of the individual data points you have. When I learned it, I think I grasped this and hopefully, it is clear above! But over time, I think it kind of got mushed together in my brain as teachers used simply “distribution” to refer to both the data distribution and the sampling distribution without specifying. (Statistics professors are smart people and will think you know what they're talking about from context but in my experience, if you’re a beginner, it is easy to lose the thread as you build upon this fundamental theorem.)
  3. The threshold of n that constitutes a sample of “sufficient size” isn’t specified by the theorem. The rule of thumb you’ll see in a lot of textbooks is that you should repeatedly sample at least 30 elements randomly from the population distribution to build up your sampling distribution of the sample mean, but the real answer seems to be that n should be as big as you need it to be for the sampling distribution to look normal. The larger the samples being taken, the more normal the sampling distribution of the statistic will look.
  4. On a related note, a repeated sample of 5 observations (n = 5) averaged and plotted will produce a wider-spread sampling distribution than a repeated sample of 50 observations. Note that the variance of the sampling distribution is inversely related to n (σ² / n). A larger sample size will reduce the variance of the sampling distribution of the mean and you’ll see the distribution wrap more tightly around the population mean. This becomes beautifully clear if you play with the linked simulator (or take a stab at writing your own code for it), and the implication is something many intuitively understand: the larger your sample, the more likely it is that your sample mean estimates the population mean well.
  5. Normal distributions? Not all that normal really. Be wary of thinking of normal distributions as some kind of default data distribution. The reason the normal distribution (a.k.a. the Gaussian distribution after this total G) pops up so much in statistics is, in part, due to this awesome central limit theorem where the sampling distribution is shown to converge to this normal shape.
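Point 4 above is quick to demonstrate in code too. A sketch (numpy assumed; the exponential population and sample counts are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def sampling_std(n, num_samples=10_000):
    """Std dev of the sampling distribution of the mean for samples of size n,
    drawn from a skewed exponential population (mean 1, variance 1)."""
    means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)
    return means.std()

print(sampling_std(5))   # wider spread, near 1/sqrt(5)
print(sampling_std(50))  # much tighter, near 1/sqrt(50)
```

The n = 5 spread comes out roughly three times wider than the n = 50 spread, matching the σ/√n prediction.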

Anyway, I hope that was informative. :) Good night, internet.

#statistics