6 Introduction to statistics
In this final chapter we give an introduction to statistics. The goal of statistics is of an entirely different kind from that of probability, and indeed of any area of mathematics. In contrast to mathematics, which can be regarded as reasoning based on axioms and logic, entirely unconcerned with any physical reality, statistics is an empirical and pragmatic science whose goal is to understand the real world by analysing data from empirical observations. Very briefly, this difference can be summarised as follows.
In probability, one is given a probability measure and random variables, and one studies their behaviour.
In statistics, one is given a collection of observations obtained by repeating a random experiment, and one wishes to determine the law of the underlying random variable.
In practice, as in this course, the study of statistics is often combined with that of probability because it relies heavily on the tools and language of probability.
6.1 Estimators
We suppose that the observations are obtained from repeated experiments that are performed independently under the same conditions. This leads to the following notion.
An \(n\)-sample drawn from \(\mathbb{P}\) is a family \(X_1, \dots, X_n\) of independent random variables with law \(\mathbb{P}.\)
In parametric statistics, we suppose that the probability measure \(\mathbb{P}\equiv \mathbb{P}_\theta\) depends on a parameter \(\theta\) in some parameter set \(\Theta.\) The goal is to estimate the parameter \(\theta \in \Theta\) from an \(n\)-sample drawn from \(\mathbb{P}_\theta.\) Put differently, the rules of the game are the following.
Known: an \(n\)-sample (the observation).
Unknown: the parameter \(\theta\) of \(\mathbb{P}_\theta.\)
The choice of the parametrisation \(\theta \mapsto \mathbb{P}_\theta\) is a subjective decision to be made by the statistician, depending on various constraints and her insight into the nature of the questions she is investigating. As such, for a given question and \(n\)-sample of observations, there is usually no right or wrong parametrisation, although there are certainly reasonable and less reasonable choices. One has to appeal to common sense. Here are some examples along with reasonable choices of \(\mathbb{P}_\theta.\)
A yes/no opinion poll among \(n\) individuals. We choose \(\mathbb{P}_\theta\) to be the Bernoulli distribution on \(\{0,1\}\) with parameter \(\theta \in [0,1] = \Theta.\)
The lifetime of an appliance. We choose \(\mathbb{P}_\theta\) to be the exponential distribution on \([0,\infty)\) with parameter \(\theta \in (0,\infty) = \Theta.\)
The size of an individual in a homogeneous population. We choose \(\mathbb{P}_\theta\) to be the normal distribution with mean \(m\) and variance \(\sigma^2,\) i.e. \(\theta = (m, \sigma^2) \in \mathbb{R}\times (0,\infty) = \Theta.\)
We seek to determine the parameter \(\theta\) from the observed \(n\)-sample, or, more generally, any function \(f(\theta)\) of the parameter.
A statistic is a measurable function of an \(n\)-sample.
Let \(f\) be a function on \(\Theta.\) An estimator of \(f(\theta)\) is a statistic with values in \(f(\Theta).\)
Note that an estimator can only depend on the sample (which is observable) and not on the parameter \(\theta\) (which is unknown). It is customary in statistics to denote estimators with decorated symbols, such as \(\hat f, \tilde f, \bar f,\) to distinguish them from (unknown) deterministic functions of \(\theta.\)
A good estimator \(\hat f = F(X_1, \dots, X_n)\) of \(f(\theta)\) should with high probability be close to \(f(\theta)\) provided that the underlying \(n\)-sample was drawn from \(\mathbb{P}_\theta.\) This leads to the following condition.
For each \(n \in \mathbb{N}^*,\) let \(\hat f_n = F_n(X_1, \dots, X_n)\) be an estimator of \(f(\theta)\) depending on an \(n\)-sample \(X_1, \dots, X_n\) drawn from \(\mathbb{P}_\theta.\) The family of estimators \(\hat f_n\) is called consistent if for all \(\theta \in \Theta\) we have \[\hat f_n \overset{\mathbb{P}}{\longrightarrow} f(\theta)\] as \(n \to \infty.\)
The empirical mean \[\bar X_n :=\frac{1}{n} (X_1 + \cdots + X_n)\] is a consistent estimator of the mean \(f(\theta) :=\mathbb{E}_\theta[X_1],\) by the law of large numbers.
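As an illustration, here is a short Python simulation of this convergence for a Bernoulli sample; the parameter value \(\theta = 0.3\), the sample sizes, and the seed are arbitrary choices made for this sketch.

```python
import random

def empirical_mean(sample):
    """The estimator Xbar_n = (X_1 + ... + X_n) / n."""
    return sum(sample) / len(sample)

# Draw n-samples from a Bernoulli law with (hypothetical) parameter theta = 0.3
# and watch the empirical mean approach theta as n grows.
random.seed(0)
theta = 0.3
for n in (10, 1000, 100000):
    sample = [1 if random.random() < theta else 0 for _ in range(n)]
    print(n, empirical_mean(sample))
```

The printed values approach \(\theta\) as \(n\) grows, as the law of large numbers predicts.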
A classical characteristic of an estimator is its bias.
The bias of an estimator \(\hat f\) of \(f(\theta)\) is \(\mathbb{E}_\theta[\hat f] - f(\theta).\) If the bias is zero for all \(\theta \in \Theta,\) the estimator is called unbiased. Otherwise, it is called biased.
Being unbiased can be a desirable property for an estimator. However, in practice it is not always a good idea to insist on a lack of bias: it can indeed happen that a biased estimator performs better than an unbiased one. We shall see examples of this later on. What one almost always requires, however, is that the bias tends to zero for large values of \(n.\)
A family of estimators \(\hat f_n\) of \(f(\theta)\) is asymptotically unbiased if for all \(\theta \in \Theta\) we have \[\lim_{n \to \infty} \mathbb{E}_\theta[\hat f_n] = f(\theta)\,.\]
The empirical mean \(\bar X_n\) from Example 6.5 is an unbiased estimator of the mean \(f(\theta) :=\mathbb{E}_\theta[X_1].\)
We would like to estimate the variance \[\sigma^2 = f(\theta) :=\mathop{\mathrm{Var}}_\theta(X_1) = \mathbb{E}_\theta[X_1^2] - \mathbb{E}_\theta[X_1]^2\] of the sample. A natural way to come up with an estimator is to replace \(\mathbb{E}_\theta\) with an empirical average over the \(n\)-sample: \[\tilde \sigma^2 :=\frac{1}{n} (X_1^2 + \cdots + X_n^2) - \biggl(\frac{1}{n} (X_1 + \cdots + X_n)\biggr)^2\,.\] By the law of large numbers, \(\tilde \sigma^2\) is a consistent estimator of \(\sigma^2.\)
Is it biased? Let us find out: \[\mathbb{E}_\theta[\tilde \sigma^2] = \mathbb{E}_\theta[X_1^2] - \frac{1}{n} \mathbb{E}_\theta[X_1^2] - \frac{n-1}{n} \mathbb{E}_\theta[X_1]^2 = \frac{n - 1}{n} \sigma^2\,,\] so that the bias is \(-\sigma^2/n.\) Hence, \(\tilde \sigma^2\) is biased but asymptotically unbiased. The bias can be removed by considering the slightly modified estimator \[S_n^2 :=\frac{n}{n-1} \tilde \sigma^2 = \frac{1}{n-1} \sum_{i = 1}^n (X_i - \bar X_n)^2\,,\] which is usually called the empirical variance. It is a consistent and unbiased estimator of the variance.
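The bias computed above can be checked numerically. The following Python sketch averages both estimators over many independent \(5\)-samples from a standard normal law, whose true variance is \(1\); the sample size, the law, and the seed are arbitrary choices made for this illustration.

```python
import random

def biased_var(sample):
    """The estimator tilde sigma^2 = (1/n) sum (X_i - Xbar_n)^2."""
    n = len(sample)
    m = sum(sample) / n
    return sum((x - m) ** 2 for x in sample) / n

def empirical_var(sample):
    """The empirical variance S_n^2 = n/(n-1) * tilde sigma^2."""
    n = len(sample)
    return biased_var(sample) * n / (n - 1)

random.seed(1)
n, trials = 5, 100000
avg_biased = avg_unbiased = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    avg_biased += biased_var(sample) / trials
    avg_unbiased += empirical_var(sample) / trials
print(avg_biased)     # close to (n-1)/n * sigma^2 = 0.8
print(avg_unbiased)   # close to sigma^2 = 1
```

The averaged values of \(\tilde\sigma^2\) cluster around \(\frac{n-1}{n}\sigma^2 = 0.8\), while those of \(S_n^2\) cluster around the true variance \(1\), in agreement with the bias calculation.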
We have seen a few examples of estimators that we defined essentially by guessing. In general, how does one find estimators? There is no general formula or algorithm, but there are some general recipes that are a good place to start. We discuss the two most common methods: the method of moments and the maximum likelihood estimator.
We have already used the method of moments in Examples 6.5 and 6.9. In general, if \(f(\theta) = \mathbb{E}_\theta[g(X_1)]\) for some function \(g,\) we can estimate \(f(\theta)\) using the estimator \[\hat f :=\frac{1}{n} (g(X_1) + \cdots + g(X_n))\,,\] which is unbiased and consistent (by the law of large numbers). A classical choice, giving the method its name, is \(g(x) :=x^r\) for some exponent \(r \in \mathbb{R},\) which allows one to estimate any parameter of the form \(\theta = h(\mathbb{E}_\theta[X_1^r])\) using the estimator \[\hat \theta :=h\biggl(\frac{1}{n} (X_1^r + \cdots + X_n^r)\biggr)\,.\] This estimator is consistent by the law of large numbers (and the continuity of \(h\)), but in general biased.
If \(\mathbb{P}_\theta\) is the exponential law with parameter \(\theta,\) then we have \(\mathbb{E}_\theta[X_1] = 1/\theta,\) and therefore \[\hat \theta = 1 / \bar X_n\] is a consistent estimator of \(\theta.\)
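This method-of-moments estimator is easy to check by simulation. In the sketch below, the true rate \(\theta = 2\), the sample size, and the seed are arbitrary choices for illustration.

```python
import random

# Method-of-moments estimator for the rate of an exponential law:
# E_theta[X_1] = 1/theta, so theta_hat = 1 / Xbar_n.
random.seed(2)
theta = 2.0                      # hypothetical true parameter
n = 100000
sample = [random.expovariate(theta) for _ in range(n)]
theta_hat = 1 / (sum(sample) / n)
print(theta_hat)                 # close to theta for large n
```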
The maximum likelihood estimator is somewhat more subtle and based on a simple but powerful idea: given a realisation \(x_1, \dots, x_n\) of an \(n\)-sample, we look for \(\theta \in \Theta\) for which the probability of observing \(x_1, \dots, x_n\) is the highest. Informally, we look for the \(\theta\) that best explains the observed sample. We shall construct an estimator based on this idea. For its definition, it is necessary to distinguish the discrete and continuous cases.
Let \(\mathbb{P}_\theta\) be discrete. The likelihood at \(x_1, \dots, x_n\) is the function \[L(\theta ; x_1, \dots, x_n) :=\prod_{i = 1}^n \mathbb{P}_\theta(X_i = x_i)\,.\]
Let \(\mathbb{P}_\theta(\mathrm dx) = f_\theta(x) \, \mathrm dx\) be continuous with density \(f_\theta.\) The likelihood at \(x_1, \dots, x_n\) is the function \[L(\theta ; x_1, \dots, x_n) :=\prod_{i = 1}^n f_\theta(x_i)\,.\]
In each case, \(L\) should be regarded as a function of \(\theta,\) with \(x_1, \dots, x_n\) acting as fixed parameters.
The maximum likelihood estimator of \(\theta\) is then defined as \[\hat \theta :=\operatorname{argmax}_{\theta \in \Theta} L(\theta; X_1, \dots, X_n)\,.\]
Note that such a maximum might not exist or, if it exists, it might not be unique. Hence, the operator \(\operatorname{argmax}\) is rarely used in mathematics, but it is very useful in statistics, where such questions of existence and uniqueness are always understood to be solved by common sense (e.g. choosing one of the maxima in some prescribed way in case the maximum is not unique.)
Here, and in many other situations, the following remark is helpful: because the function \(\log\) is strictly increasing, maximising \(L\) is equivalent to maximising \(\log L.\) The derivative of the latter is often easier to compute. For this reason, one often considers the log-likelihood instead of the likelihood.
If \(\mathbb{P}_\theta\) is the uniform law on \([0,\theta]\) and \(x_1, \dots, x_n \geqslant 0\) is a realisation of an \(n\)-sample, then \[L(\theta; x_1, \dots, x_n) = \frac{1}{\theta^n} \mathbf 1_{x_1 \leqslant\theta} \cdots \mathbf 1_{x_n \leqslant\theta} = \frac{1}{\theta^n} \mathbf 1_{\max_i x_i \leqslant\theta}\,.\] The maximum likelihood estimator is therefore \[\hat \theta = \max_i X_i\,.\]
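In code, this maximum likelihood estimator is a one-liner. The following sketch uses the arbitrary choices \(\theta = 5\) and \(n = 1000\) to illustrate that \(\hat\theta = \max_i X_i\) always underestimates \(\theta,\) but only slightly for large \(n.\)

```python
import random

# MLE for the uniform law on [0, theta]: the likelihood
# theta^{-n} * 1_{max_i x_i <= theta} is maximised at theta = max_i x_i.
random.seed(3)
theta = 5.0                      # hypothetical true parameter
n = 1000
sample = [random.uniform(0, theta) for _ in range(n)]
theta_hat = max(sample)
print(theta_hat)                 # always <= theta, close to theta for large n
```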
We conclude this section by quantifying the quality of an estimator. Since one can often find several estimators for the same quantity, it is important to be able to quantify their accuracy and compare them.
Let \(\hat f\) be an estimator of \(f(\theta).\) The quadratic risk of \(\hat f\) is \[R_{\hat f}(\theta) :=\mathbb{E}_\theta[(\hat f - f(\theta))^2]\,.\]
By writing \(\hat f - f(\theta) = \hat f - \mathbb{E}_\theta[\hat f] + \mathbb{E}_\theta[\hat f] - f(\theta)\) and expanding in Definition 6.15, we obtain \[\tag{6.1} R_{\hat f}(\theta) = \mathop{\mathrm{Var}}_\theta(\hat f) + (\mathbb{E}_\theta[\hat f] - f(\theta))^2\,.\] In particular, if \(\hat f\) is unbiased then \[R_{\hat f}(\theta) = \mathop{\mathrm{Var}}_\theta(\hat f)\,.\]
If \(\hat f\) and \(\tilde f\) are estimators of \(f(\theta),\) we say that \(\hat f\) is better than \(\tilde f\) if, for all \(\theta \in \Theta,\) \[R_{\hat f}(\theta) \leqslant R_{\tilde f}(\theta)\,.\]
The relation (6.1) shows that the quadratic risk can be viewed as the sum of the variance and the square of the bias. In order to minimise the quadratic risk, it can therefore be advantageous to introduce a bias, provided that doing so sufficiently reduces the variance.
Let \(\mathbb{P}_\theta\) be the uniform law on \([0,\theta].\) The estimator \[\bar \theta = \frac{2}{n}(X_1 + \cdots + X_n)\] is an unbiased estimator of \(\theta.\) Its quadratic risk is \[\tag{6.2} R_{\bar \theta}(\theta) = \frac{4}{n} \mathop{\mathrm{Var}}_\theta(X_1) = \frac{\theta^2}{3n}\,.\]
Let us compare \(\bar \theta\) to the maximum likelihood estimator \(\hat \theta = \max_i X_i\) from Example 6.14. Clearly, \(\hat \theta\) is biased because \(\mathbb{E}_\theta[\hat \theta] < \theta.\) To compute the quadratic risk of \(\hat \theta,\) let us first compute its cumulative distribution function \[\mathbb{P}_\theta(\hat \theta \leqslant x) = \mathbb{P}_\theta(X_1 \leqslant x, \dots, X_n \leqslant x) = \mathbb{P}_\theta(X_1 \leqslant x)^n = \biggl(\frac{x}{\theta}\biggr)^n\] for \(x \leqslant\theta.\) Differentiating in \(x,\) we deduce that the density of the law of \(\hat \theta\) is \[f_{\hat \theta}(x) = \frac{n}{\theta^n} x^{n-1} \mathbf 1_{0 \leqslant x \leqslant\theta}\,.\] We deduce that \[\mathbb{E}_\theta[\hat \theta] = \int x \, f_{\hat \theta}(x) \, \mathrm dx = \frac{n}{n+1} \theta\,,\] so that \(\hat \theta\) is asymptotically unbiased. For the quadratic risk, we obtain \[\tag{6.3} R_{\hat \theta}(\theta) = \int (x - \theta)^2 \, f_{\hat \theta}(x) \, \mathrm dx = \frac{2 \theta^2}{(n+1) (n+2)}\,.\]
Comparing (6.2) and (6.3), we conclude: for \(n \geqslant 3\) the estimator \(\hat \theta\) is better than \(\bar \theta,\) despite being biased. (Note that the bias of \(\hat \theta\) can be removed by considering the estimator \(\frac{n+1}{n} \hat \theta.\))
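A Monte Carlo simulation confirms this comparison. The sketch below estimates both quadratic risks for the arbitrary choices \(\theta = 1\) and \(n = 10\); the theoretical values are \(\theta^2/(3n) = 1/30\) and \(2\theta^2/((n+1)(n+2)) = 2/132.\)

```python
import random

# Monte Carlo estimate of the quadratic risks (6.2) and (6.3) for the
# uniform law on [0, theta], with hypothetical theta = 1 and n = 10.
random.seed(4)
theta, n, trials = 1.0, 10, 100000
risk_mean = risk_max = 0.0
for _ in range(trials):
    sample = [random.uniform(0, theta) for _ in range(n)]
    bar = 2 * sum(sample) / n    # unbiased estimator theta_bar
    hat = max(sample)            # maximum likelihood estimator theta_hat
    risk_mean += (bar - theta) ** 2 / trials
    risk_max += (hat - theta) ** 2 / trials
print(risk_mean)   # theory: theta^2 / (3n)            = 0.0333...
print(risk_max)    # theory: 2 theta^2 / ((n+1)(n+2))  = 0.01515...
```

Despite its bias, the maximum likelihood estimator has a markedly smaller quadratic risk here.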
6.2 Confidence intervals
Suppose that we have a sample drawn from \(\mathbb{P}_\theta\) and we wish to estimate \(f(\theta) \in \mathbb{R}.\) Often one not only wishes to estimate \(f(\theta)\) but one would also like to have a notion of how likely it is that this estimate is close to the true value \(f(\theta).\) Thus, we seek an interval \(I,\) depending only on the sample (and hence random), such that we know that \(f(\theta) \in I\) with a certain confidence.
Let \(0 < \gamma < 1.\) Let \(X_1, \dots, X_n\) be an \(n\)-sample drawn from \(\mathbb{P}_\theta\) and let \(f \,\colon\Theta \to \mathbb{R}.\) An interval \(I = I(X_1, \dots, X_n)\) is a confidence interval for \(f(\theta)\) with confidence level \(\gamma\) if \[\mathbb{P}_\theta(f(\theta) \in I) \geqslant\gamma\,, \qquad \forall \theta \in \Theta\,.\] If there is equality, we call \(I\) a strict confidence interval.
We note that there is no universal convention in the literature regarding this terminology, and both \(\geqslant\) and \(=\) are commonly used in the definition of confidence intervals.
Suppose that \(\mathbb{P}_\theta\) is the normal distribution with mean \(\theta\) and variance one. Let \[I :=[\bar X_n - a, \bar X_n + a]\,.\] We know that \(\sqrt{n} (\bar X_n - \theta) =:Z\) is normal with mean zero and variance one. The condition of a strict confidence interval for \(\theta\) reads \[\gamma = \mathbb{P}_\theta(\theta \in I) = \mathbb{P}_\theta(\lvert \bar X_n - \theta \rvert \leqslant a) = \mathbb{P}(\lvert Z \rvert \leqslant a \sqrt{n})\,.\] The right-hand side is an explicit function of the cumulative distribution function of the standard normal distribution and can be easily computed numerically. For instance, for the confidence level \(\gamma = 90\%\) we have \[I = \biggl[\bar X_n - \frac{1.64}{\sqrt{n}}, \bar X_n + \frac{1.64}{\sqrt{n}}\biggr]\,.\]
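The claimed \(90\%\) coverage can be verified empirically. The sketch below draws many independent \(n\)-samples and counts how often the interval contains the true mean; \(\theta = 0\), \(n = 50\), and the seed are arbitrary choices.

```python
import random

# Coverage check for the 90% confidence interval
# [Xbar_n - 1.64/sqrt(n), Xbar_n + 1.64/sqrt(n)] for the mean theta of a
# normal law with variance one (theta = 0 is a hypothetical choice).
random.seed(5)
theta, n, trials = 0.0, 50, 20000
a = 1.64 / n ** 0.5
covered = 0
for _ in range(trials):
    xbar = sum(random.gauss(theta, 1) for _ in range(n)) / n
    if xbar - a <= theta <= xbar + a:
        covered += 1
print(covered / trials)    # close to 0.90
```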
Suppose that \(\mathbb{P}_\theta\) is the uniform distribution on \([0,\theta].\) We use the estimator \(\hat \theta = \max_i X_i\) from Example 6.14. Clearly, \(\hat \theta \leqslant\theta.\) Consider the interval \[I = [\hat \theta, C \hat \theta]\] for \(C > 1.\) The condition of a strict confidence interval for \(\theta\) reads \[\gamma = \mathbb{P}_\theta(\theta \in I) = \mathbb{P}_\theta(\theta \leqslant C \hat \theta) = 1 - \mathbb{P}_\theta \biggl(\hat \theta < \frac{\theta}{C}\biggr) = 1 - \frac{1}{C^n}\,,\] where in the last step we used the law of \(\hat \theta\) computed in Example 6.18. We conclude that the strict confidence interval with confidence level \(\gamma\) is \[I = \biggl[\hat \theta, \biggl(\frac{1}{1 - \gamma}\biggr)^{1/n} \hat \theta\biggr]\,.\]
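This strict confidence interval has a closed form, so it translates directly into code. The helper function and the observations below are hypothetical, for illustration only.

```python
def uniform_ci(sample, gamma):
    """Strict confidence interval [theta_hat, C * theta_hat] for the uniform
    law on [0, theta], where theta_hat = max_i x_i and C = (1/(1-gamma))^(1/n)."""
    n = len(sample)
    theta_hat = max(sample)
    c = (1 / (1 - gamma)) ** (1 / n)
    return theta_hat, c * theta_hat

# Example with hypothetical observations at confidence level 95%:
print(uniform_ci([0.3, 1.7, 0.9, 1.2], 0.95))
```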
In the previous examples, thanks to a rather special form of the law \(\mathbb{P}_\theta,\) we could compute the probability \(\mathbb{P}_\theta(f(\theta) \in I)\) exactly, and hence obtain strict confidence intervals. In general, such an exact computation is not possible and one is reduced to finding non-strict confidence intervals.
Suppose that \(\mathbb{P}_\theta\) has variance \(\mathop{\mathrm{Var}}_\theta(X_1) = \sigma^2\) and expectation \(\mu = \mathbb{E}_\theta[X_1].\) We would like to estimate \(\mu.\) As an estimator, we use the empirical mean \(\bar X_n.\) From Chebyshev’s inequality we get \[\mathbb{P}_\theta(\lvert \bar X_n - \mu \rvert < \delta) \geqslant 1 - \frac{\sigma^2}{n \delta^2}\,,\] from which we conclude that \[\tag{6.4} I :=\biggl[\bar X_n - \frac{\sigma}{\sqrt{n(1 - \gamma)}}, \bar X_n + \frac{\sigma}{\sqrt{n(1 - \gamma)}}\biggr]\] is a confidence interval for \(\mu\) with confidence level \(\gamma.\)
If \(n\) is large, by the Central Limit Theorem the estimate from the previous example is highly wasteful, as \(\bar X_n\) is asymptotically Gaussian. The following definition is sometimes used to capture this phenomenon.
Let \(0 < \gamma < 1.\) Let \(X_1, \dots, X_n\) be an \(n\)-sample drawn from \(\mathbb{P}_\theta.\) Let \(f \,\colon\Theta \to \mathbb{R}.\) An interval \(I_n = I_n(X_1, \dots, X_n)\) is an asymptotic confidence interval for \(f(\theta)\) with confidence level \(\gamma\) if \[\lim_{n \to \infty} \mathbb{P}_\theta(f(\theta) \in I_n) \geqslant\gamma \,, \qquad \forall \theta \in \Theta\,.\] If there is equality, we call the \(I_n\) a strict asymptotic confidence interval.
Let us return to Example 6.22. By the Central Limit Theorem, \[\lim_{n \to \infty} \mathbb{P}_\theta \biggl(\bar X_n \in \biggl[\mu - \frac{a \sigma}{\sqrt{n}}, \mu + \frac{a \sigma}{\sqrt{n}}\biggr]\biggr) = \mathbb{P}(\lvert Z \rvert \leqslant a)\,,\] where \(Z\) is a standard Gaussian random variable. Given \(\gamma,\) choose \(a\) such that \(\mathbb{P}(\lvert Z \rvert \leqslant a) = \gamma.\) In that case the interval \[\tag{6.5} I_n :=\biggl[\bar X_n - \frac{a \sigma}{\sqrt{n}}, \bar X_n + \frac{a \sigma}{\sqrt{n}}\biggr]\] is an asymptotic confidence interval for \(\mu\) with confidence level \(\gamma.\) Note that, if one aims for high confidence levels, where \(\gamma\) is close to \(1,\) the interval (6.5) is much smaller than (6.4), because of the strong Gaussian decay of \(Z.\)
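The gap between (6.4) and (6.5) is easy to quantify numerically. The sketch below compares the two half-widths at confidence level \(\gamma = 99\%\), using the quantile \(a = 2.58\) from the example below and the arbitrary choices \(\sigma = 1\) and \(n = 1000.\)

```python
import math

def chebyshev_halfwidth(sigma, n, gamma):
    """Half-width sigma / sqrt(n (1 - gamma)) of the Chebyshev interval (6.4)."""
    return sigma / math.sqrt(n * (1 - gamma))

def gaussian_halfwidth(sigma, n, a):
    """Half-width a * sigma / sqrt(n) of the asymptotic interval (6.5)."""
    return a * sigma / math.sqrt(n)

sigma, n = 1.0, 1000
print(chebyshev_halfwidth(sigma, n, 0.99))   # about 0.316
print(gaussian_halfwidth(sigma, n, 2.58))    # about 0.082, roughly 4x smaller
```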
For instance, suppose that I am measuring the mean size \(\mu\) of a population. The standard deviation is \(\sigma = 0.73.\) How many samples do I need to determine \(\mu\) with an accuracy of \(0.1\)? To answer the question, I first have to choose a confidence level; this is a choice to be made by the statistician based on personal preferences and needs. Let us say that I would like \(\gamma = 99\%.\) This yields \(a \approx 2.58\) and hence \[I_n = \biggl[\bar X_n - \frac{1.88}{\sqrt{n}}, \bar X_n + \frac{1.88}{\sqrt{n}}\biggr]\,.\] I need to choose \(n\) such that \(\frac{1.88}{\sqrt{n}} \leqslant 0.1,\) i.e. \(n \geqslant 355.\)
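The sample size calculation above generalises to a small helper function; it is a sketch, with the quantile \(a\) passed in by hand (e.g. \(a \approx 2.58\) for \(\gamma = 99\%\)).

```python
import math

def sample_size(sigma, accuracy, a):
    """Smallest n with a * sigma / sqrt(n) <= accuracy, i.e. the number of
    samples needed for the asymptotic confidence interval (6.5) to have
    half-width at most `accuracy`. Here a is the quantile with
    P(|Z| <= a) = gamma, e.g. a = 2.58 for gamma = 99%."""
    return math.ceil((a * sigma / accuracy) ** 2)

print(sample_size(0.73, 0.1, 2.58))   # 355, as in the example above
```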