6 Introduction to statistics

In this final chapter we give an introduction to statistics. The goal of statistics is of an entirely different kind from that of probability, and indeed of any area of mathematics. In contrast to mathematics, which can be regarded as reasoning based on axioms and logic, entirely unconcerned with any physical reality, statistics is an empirical and pragmatic science whose goal is to understand the real world by analysing data from empirical observations.

In practice, as in this course, the study of statistics is often combined with that of probability because it relies heavily on the tools and language of probability.

6.1 Estimators

We suppose that the observations are obtained from repeated experiments that are performed independently under the same conditions. This leads to the following notion.

Definition 6.1

An \(n\)-sample drawn from \(\mathbb{P}\) is a family \(X_1, \dots, X_n\) of independent random variables with law \(\mathbb{P}.\)

In parametric statistics, we suppose that the probability measure \(\mathbb{P}\equiv \mathbb{P}_\theta\) depends on a parameter \(\theta\) in some parameter set \(\Theta.\) The goal is to estimate the parameter \(\theta \in \Theta\) from an \(n\)-sample drawn from \(\mathbb{P}_\theta.\) Put differently, the rules of the game are these: the parameter \(\theta\) is unknown, and the only information available to the statistician is the observed \(n\)-sample.

The choice of the parametrisation \(\theta \mapsto \mathbb{P}_\theta\) is a subjective decision to be made by the statistician, depending on various constraints and her insight into the nature of the questions she is investigating. As such, for a given question and \(n\)-sample of observations, there is usually no right or wrong parametrisation, although there are certainly reasonable and less reasonable choices. One has to appeal to common sense. Here are some examples along with reasonable choices of \(\mathbb{P}_\theta.\)

Example 6.2

  • A yes/no opinion poll among \(n\) individuals. We choose \(\mathbb{P}_\theta\) to be the Bernoulli distribution on \(\{0,1\}\) with parameter \(\theta \in [0,1] = \Theta.\)

  • The lifetime of an appliance. We choose \(\mathbb{P}_\theta\) to be the exponential distribution on \([0,\infty)\) with parameter \(\theta \in (0,\infty) = \Theta.\)

  • The size of an individual in a homogeneous population. We choose \(\mathbb{P}_\theta\) to be the normal distribution with mean \(m\) and variance \(\sigma^2,\) i.e. \(\theta = (m, \sigma^2) \in \mathbb{R}\times [0,\infty) = \Theta.\)

We seek to determine the parameter \(\theta\) from the observed \(n\)-sample, or, more generally, any function \(f(\theta)\) of the parameter.

Definition 6.3

  1. A statistic is a measurable function of an \(n\)-sample.

  2. Let \(f\) be a function on \(\Theta.\) An estimator of \(f(\theta)\) is a statistic with values in \(f(\Theta).\)

Note that an estimator can only depend on the sample (which is observable) and not on the parameter \(\theta\) (which is unknown). It is customary in statistics to denote estimators with decorated symbols, such as \(\hat f, \tilde f, \bar f,\) to distinguish them from (unknown) deterministic functions of \(\theta.\)

A good estimator \(\hat f = F(X_1, \dots, X_n)\) of \(f(\theta)\) should with high probability be close to \(f(\theta)\) provided that the underlying \(n\)-sample was drawn from \(\mathbb{P}_\theta.\) This leads to the following condition.

Definition 6.4

For each \(n \in \mathbb{N}^*,\) let \(\hat f_n = F_n(X_1, \dots, X_n)\) be an estimator of \(f(\theta)\) depending on an \(n\)-sample \(X_1, \dots, X_n\) drawn from \(\mathbb{P}_\theta.\) The family of estimators \(\hat f_n\) is called consistent if for all \(\theta \in \Theta\) we have \[\hat f_n \overset{\mathbb{P}}{\longrightarrow} f(\theta)\] as \(n \to \infty.\)

Example 6.5 • Empirical mean

The empirical mean \[\bar X_n :=\frac{1}{n} (X_1 + \cdots + X_n)\] is a consistent estimator of the mean \(f(\theta) :=\mathbb{E}_\theta[X_1],\) by the law of large numbers.
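Consistency can be seen in a quick simulation. The following sketch (using only Python's standard library; the parameter value \(\theta = 0.3\) is an arbitrary illustration) draws Bernoulli \(n\)-samples and shows the empirical mean concentrating around the true parameter as \(n\) grows.

```python
import random

random.seed(0)
theta = 0.3  # true Bernoulli parameter (unknown to the statistician)

def empirical_mean(n):
    """Draw an n-sample from Bernoulli(theta) and return the empirical mean."""
    sample = [1 if random.random() < theta else 0 for _ in range(n)]
    return sum(sample) / n

# Consistency: as n grows, the empirical mean concentrates around theta.
for n in (10, 1_000, 100_000):
    print(n, empirical_mean(n))
```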

A classical characteristic of an estimator is its bias.

Definition 6.6

The bias of an estimator \(\hat f\) of \(f(\theta)\) is \(\mathbb{E}_\theta[\hat f] - f(\theta).\) If the bias is zero for all \(\theta \in \Theta,\) the estimator is called unbiased. Otherwise, it is called biased.

Being unbiased can be a desirable property for an estimator. However, in practice it is not always a good idea to insist on a lack of bias: it can indeed happen that a biased estimator performs better than an unbiased one. We shall see examples of this later on. What one will almost always require, however, is that the bias tends to zero for large values of \(n.\)

Definition 6.7

A family of estimators \(\hat f_n\) of \(f(\theta)\) is asymptotically unbiased if for all \(\theta \in \Theta\) we have \[\lim_{n \to \infty} \mathbb{E}_\theta[\hat f_n] = f(\theta)\,.\]

Example 6.8 • Example 6.5 continued

The empirical mean \(\bar X_n\) from Example 6.5 is an unbiased estimator of the mean \(f(\theta) :=\mathbb{E}_\theta[X_1].\)

Example 6.9 • Empirical variance

We would like to estimate the variance \[\sigma^2 = f(\theta) :=\mathop{\mathrm{Var}}_\theta(X_1) = \mathbb{E}_\theta[X_1^2] - \mathbb{E}_\theta[X_1]^2\] of the sample. A natural way to come up with an estimator is to replace \(\mathbb{E}_\theta\) with an empirical average over the \(n\)-sample: \[\tilde \sigma^2 :=\frac{1}{n} (X_1^2 + \cdots + X_n^2) - \biggl(\frac{1}{n} (X_1 + \cdots + X_n)\biggr)^2\,.\] By the law of large numbers, \(\tilde \sigma^2\) is a consistent estimator of \(\sigma^2.\)

Is it biased? Let us find out: since \(\mathbb{E}_\theta[\bar X_n^2] = \frac{1}{n} \mathbb{E}_\theta[X_1^2] + \frac{n-1}{n} \mathbb{E}_\theta[X_1]^2\) by the independence of the sample, we get \[\mathbb{E}_\theta[\tilde \sigma^2] = \mathbb{E}_\theta[X_1^2] - \frac{1}{n} \mathbb{E}_\theta[X_1^2] - \frac{n-1}{n} \mathbb{E}_\theta[X_1]^2 = \frac{n - 1}{n} \sigma^2\,,\] so that the bias is \(-\sigma^2/n.\) Hence, \(\tilde \sigma^2\) is biased but asymptotically unbiased. The bias can be removed by considering the slightly modified estimator \[S_n^2 :=\frac{n}{n-1} \tilde \sigma^2 = \frac{1}{n-1} \sum_{i = 1}^n (X_i - \bar X_n)^2\,,\] which is usually called the empirical variance. It is a consistent and unbiased estimator of the variance.
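The factor \((n-1)/n\) is easy to observe numerically. The sketch below (standard library only; the choice of \(N(0,1)\) samples with true variance \(1\) and the sample size \(n = 5\) are illustrative) averages both estimators over many small samples: \(\tilde\sigma^2\) averages to about \((n-1)/n = 0.8\), while \(S_n^2\) averages to about \(1\).

```python
import random

random.seed(1)
n, trials = 5, 20_000

def var_estimators(sample):
    """Return (sigma_tilde^2, S_n^2) for a given sample."""
    m = len(sample)
    mean = sum(sample) / m
    ss = sum((x - mean) ** 2 for x in sample)
    return ss / m, ss / (m - 1)

# Average both estimators over many samples from N(0, 1), whose true
# variance is 1. The biased one should average to (n-1)/n = 0.8,
# the unbiased one to 1.
tilde_avg = s_avg = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    t, s = var_estimators(sample)
    tilde_avg += t / trials
    s_avg += s / trials
```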

We have seen a few examples of estimators that we defined essentially by guessing. In general, how does one find estimators? There is no universal formula or algorithm, but there are some general recipes that are a good place to start. We discuss the two most common methods: the method of moments and the maximum likelihood estimator.

We have already used the method of moments in Examples 6.5 and 6.9. In general, if \(f(\theta) = \mathbb{E}_\theta[g(X_1)]\) for some function \(g,\) we can estimate \(f(\theta)\) using the estimator \[\hat f :=\frac{1}{n} (g(X_1) + \cdots + g(X_n))\,,\] which is unbiased and consistent (by the law of large numbers). A classical choice, giving the method its name, is \(g(x) :=x^r\) for some exponent \(r \in \mathbb{R},\) which allows one to estimate any parameter of the form \(\theta = h(\mathbb{E}_\theta[X_1^r])\) using the estimator \[\hat \theta :=h\biggl(\frac{1}{n} (X_1^r + \cdots + X_n^r)\biggr)\,.\] This estimator is consistent by the law of large numbers (provided \(h\) is continuous), but in general biased.

Example 6.10

If \(\mathbb{P}_\theta\) is the exponential law with parameter \(\theta,\) then we have \(\mathbb{E}_\theta[X_1] = 1/\theta,\) and therefore \[\hat \theta = 1 / \bar X_n\] is a consistent estimator of \(\theta.\)
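This method-of-moments estimator can be sketched in a few lines (standard library only; the true rate \(\theta = 2.5\) and the sample size are arbitrary illustrations):

```python
import random

random.seed(2)
theta = 2.5  # true rate parameter of the exponential law

n = 50_000
sample = [random.expovariate(theta) for _ in range(n)]

# Method of moments: E[X_1] = 1/theta, so invert the empirical mean.
theta_hat = n / sum(sample)
```

For a sample this large, `theta_hat` lands close to the true rate by the law of large numbers.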

The maximum likelihood estimator is somewhat more subtle and based on a simple but powerful idea: given a realisation \(x_1, \dots, x_n\) of an \(n\)-sample, we look for \(\theta \in \Theta\) for which the probability of observing \(x_1, \dots, x_n\) is the highest. Informally, we look for the \(\theta\) that best explains the observed sample. We shall construct an estimator based on this idea. For its definition, it is necessary to distinguish the discrete and continuous cases.

Definition 6.11 • Likelihood

  1. Let \(\mathbb{P}_\theta\) be discrete. The likelihood at \(x_1, \dots, x_n\) is the function \[L(\theta ; x_1, \dots, x_n) :=\prod_{i = 1}^n \mathbb{P}_\theta(X_i = x_i)\,.\]

  2. Let \(\mathbb{P}_\theta(\mathrm dx) = f_\theta(x) \, \mathrm dx\) be continuous with density \(f_\theta.\) The likelihood at \(x_1, \dots, x_n\) is the function \[L(\theta ; x_1, \dots, x_n) :=\prod_{i = 1}^n f_\theta(x_i)\,.\]

In each case, \(L\) should be regarded as a function of \(\theta,\) with \(x_1, \dots, x_n\) acting as fixed parameters.

Definition 6.12 (Maximum likelihood estimator). The maximum likelihood estimator of \(\theta\) is the estimator \[\hat \theta :=\operatorname{argmax}_\theta L(\theta; X_1, \dots, X_n)\,,\] where \(\operatorname{argmax}\) denotes the value of \(\theta \in \Theta\) at which the function \(L\) attains its maximum.
Example 6.13

If \(\mathbb{P}_\theta\) is the exponential law with parameter \(\theta\) and \(x_1, \dots, x_n \geqslant 0\) is a realisation of an \(n\)-sample, then \[L(\theta; x_1, \dots, x_n) = \prod_{i = 1}^n \theta \mathrm e^{-\theta x_i} = \theta^n \mathrm e^{-\theta(x_1 + \cdots + x_n)}\,.\] The maximising argument is easily determined by differentiation in \(\theta\): \[\hat \theta = \frac{n}{X_1 + \cdots + X_n}\,,\] which coincides with the estimator from Example 6.10 found using the method of moments.
Example 6.14

If \(\mathbb{P}_\theta\) is the uniform law on \([0,\theta]\) and \(x_1, \dots, x_n \geqslant 0\) is a realisation of an \(n\)-sample, then \[L(\theta; x_1, \dots, x_n) = \frac{1}{\theta^n} \mathbf 1_{x_1 \leqslant\theta} \cdots \mathbf 1_{x_n \leqslant\theta} = \frac{1}{\theta^n} \mathbf 1_{\max_i x_i \leqslant\theta}\,.\] The maximum likelihood estimator is therefore \[\hat \theta = \max_i X_i\,.\]
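In code, this maximum likelihood estimator is just the sample maximum. A minimal sketch (the endpoint \(\theta = 4\) and the sample size are illustrative assumptions):

```python
import random

random.seed(3)
theta = 4.0  # true endpoint of the uniform law on [0, theta]

n = 10_000
sample = [random.uniform(0, theta) for _ in range(n)]

# The likelihood theta^(-n) * 1{max_i x_i <= theta} is decreasing in theta
# on the admissible region, so it is maximised at the sample maximum.
theta_hat = max(sample)
```

Note that `theta_hat` always sits slightly below the true endpoint, reflecting the bias discussed below.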

We conclude this section by quantifying the quality of an estimator. Since one can often find several estimators for the same quantity, it is important to be able to quantify their accuracy and compare them.

Definition 6.15

Let \(\hat f\) be an estimator of \(f(\theta).\) The quadratic risk of \(\hat f\) is \[R_{\hat f}(\theta) :=\mathbb{E}_\theta[(\hat f - f(\theta))^2]\,.\]

Remark 6.16

By writing \(\hat f - f(\theta) = \hat f - \mathbb{E}_\theta[\hat f] + \mathbb{E}_\theta[\hat f] - f(\theta)\) and expanding in Definition 6.15, we obtain \[\tag{6.1} R_{\hat f}(\theta) = \mathop{\mathrm{Var}}_\theta(\hat f) + (\mathbb{E}_\theta[\hat f] - f(\theta))^2\,.\] In particular, if \(\hat f\) is unbiased then \[R_{\hat f}(\theta) = \mathop{\mathrm{Var}}_\theta(\hat f)\,.\]

Definition 6.17

If \(\hat f\) and \(\tilde f\) are estimators of \(f(\theta),\) we say that \(\hat f\) is better than \(\tilde f\) if, for all \(\theta \in \Theta,\) \[R_{\hat f}(\theta) \leqslant R_{\tilde f}(\theta)\,.\]

The relation (6.1) shows that the quadratic risk is the sum of the variance and the square of the bias. In order to minimise the quadratic risk, it can therefore be advantageous to introduce a bias, provided this sufficiently reduces the variance.

Example 6.18 • Example 6.14 continued

Let \(\mathbb{P}_\theta\) be the uniform law on \([0,\theta].\) The estimator \[\bar \theta = \frac{2}{n}(X_1 + \cdots + X_n)\] is an unbiased estimator of \(\theta.\) Its quadratic risk is \[\tag{6.2} R_{\bar \theta}(\theta) = \frac{4}{n} \mathop{\mathrm{Var}}_\theta(X_1) = \frac{\theta^2}{3n}\,.\]

Let us compare \(\bar \theta\) to the maximum likelihood estimator \(\hat \theta = \max_i X_i\) from Example 6.14. Clearly, \(\hat \theta\) is biased because \(\mathbb{E}_\theta[\hat \theta] < \theta.\) To compute the quadratic risk of \(\hat \theta,\) let us first compute its cumulative distribution function \[\mathbb{P}_\theta(\hat \theta \leqslant x) = \mathbb{P}_\theta(X_1 \leqslant x, \dots, X_n \leqslant x) = \mathbb{P}_\theta(X_1 \leqslant x)^n = \biggl(\frac{x}{\theta}\biggr)^n\] for \(0 \leqslant x \leqslant\theta.\) Differentiating in \(x,\) we deduce that the density of the law of \(\hat \theta\) is \[f_{\hat \theta}(x) = \frac{n}{\theta^n} x^{n-1} \mathbf 1_{0 \leqslant x \leqslant\theta}\,.\] We deduce that \[\mathbb{E}_\theta[\hat \theta] = \int x \, f_{\hat \theta}(x) \, \mathrm dx = \frac{n}{n+1} \theta\,,\] so that \(\hat \theta\) is asymptotically unbiased. For the quadratic risk, we obtain \[\tag{6.3} R_{\hat \theta}(\theta) = \int (x - \theta)^2 \, f_{\hat \theta}(x) \, \mathrm dx = \frac{2 \theta^2}{(n+1) (n+2)}\,.\]

Comparing (6.2) and (6.3), we conclude: for \(n \geqslant 3\) the estimator \(\hat \theta\) is better than \(\bar \theta,\) despite being biased. (Note that the bias of \(\hat \theta\) can be removed by considering the estimator \(\frac{n+1}{n} \hat \theta.\))
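This comparison is easy to check by Monte Carlo simulation. The following sketch (standard library only; \(\theta = 1\), \(n = 10\) and the number of trials are arbitrary choices) estimates both quadratic risks empirically; they should land near \(\theta^2/(3n) \approx 0.033\) and \(2\theta^2/((n+1)(n+2)) \approx 0.015\) respectively, with the biased estimator winning.

```python
import random

random.seed(4)
theta, n, trials = 1.0, 10, 50_000

risk_bar = risk_hat = 0.0
for _ in range(trials):
    sample = [random.uniform(0, theta) for _ in range(n)]
    theta_bar = 2 * sum(sample) / n  # unbiased method-of-moments estimator
    theta_hat = max(sample)          # biased maximum likelihood estimator
    risk_bar += (theta_bar - theta) ** 2 / trials
    risk_hat += (theta_hat - theta) ** 2 / trials

# risk_bar estimates theta^2/(3n), risk_hat estimates 2 theta^2/((n+1)(n+2)).
```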

6.2 Confidence intervals

Suppose that we have a sample drawn from \(\mathbb{P}_\theta\) and we wish to estimate \(f(\theta) \in \mathbb{R}.\) Often one not only wishes to estimate \(f(\theta)\) but one would also like to have a notion of how likely it is that this estimate is close to the true value \(f(\theta).\) Thus, we seek an interval \(I,\) depending only on the sample (and hence random), such that we know that \(f(\theta) \in I\) with a certain confidence.

Definition 6.19 (Confidence interval). Let \(0 < \gamma < 1.\) Let \(X_1, \dots, X_n\) be an \(n\)-sample drawn from \(\mathbb{P}_\theta.\) Let \(f \,\colon\Theta \to \mathbb{R}.\) An interval \(I = I(X_1, \dots, X_n)\) is a confidence interval for \(f(\theta)\) with confidence level \(\gamma\) if \[\mathbb{P}_\theta(f(\theta) \in I) \geqslant\gamma \,, \qquad \forall \theta \in \Theta\,.\] If there is equality, we call \(I\) a strict confidence interval.
Example 6.20

Suppose that \(\mathbb{P}_\theta\) is the normal distribution with mean \(\theta\) and variance one. Let \[I :=[\bar X_n - a, \bar X_n + a]\,.\] We know that \(\sqrt{n} (\bar X_n - \theta) =:Z\) is normal with mean zero and variance one. The condition of a strict confidence interval for \(\theta\) reads \[\gamma = \mathbb{P}_\theta(\theta \in I) = \mathbb{P}_\theta(\lvert \bar X_n - \theta \rvert \leqslant a) = \mathbb{P}(\lvert Z \rvert \leqslant a \sqrt{n})\,.\] The right-hand side is an explicit function of the standard normal distribution and can easily be computed numerically. For instance, for the confidence level \(\gamma = 90\%\) we have \[I = \biggl[\bar X_n - \frac{1.64}{\sqrt{n}}, \bar X_n + \frac{1.64}{\sqrt{n}}\biggr]\,.\]
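The numerical step, finding \(a\) with \(\mathbb{P}(\lvert Z\rvert \leqslant a\sqrt n) = \gamma\), amounts to one call to the standard normal quantile function. A sketch using Python's `statistics.NormalDist` (the function name and the example values `n=100`, `xbar=0` are illustrative):

```python
from statistics import NormalDist

def normal_mean_ci(xbar, n, gamma):
    """Strict confidence interval for the mean of an n-sample from N(theta, 1).

    The half-width a solves P(|Z| <= a*sqrt(n)) = gamma for Z standard normal.
    """
    q = NormalDist().inv_cdf((1 + gamma) / 2)  # e.g. about 1.645 for gamma = 0.90
    a = q / n ** 0.5
    return xbar - a, xbar + a

lo, hi = normal_mean_ci(xbar=0.0, n=100, gamma=0.90)  # half-width about 0.164
```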

Example 6.21

Suppose that \(\mathbb{P}_\theta\) is the uniform distribution on \([0,\theta].\) We use the estimator \(\hat \theta = \max_i X_i\) from Example 6.14. Clearly, \(\hat \theta \leqslant\theta.\) Consider the interval \[I = [\hat \theta, C \hat \theta]\] for \(C > 1.\) The condition of a strict confidence interval for \(\theta\) reads \[\gamma = \mathbb{P}_\theta(\theta \in I) = \mathbb{P}_\theta(\theta \leqslant C \hat \theta) = 1 - \mathbb{P}_\theta \biggl(\hat \theta < \frac{\theta}{C}\biggr) = 1 - \frac{1}{C^n}\,,\] where in the last step we used the law of \(\hat \theta\) computed in Example 6.18. We conclude that the strict confidence interval with confidence level \(\gamma\) is \[I = \biggl[\hat \theta, \biggl(\frac{1}{1 - \gamma}\biggr)^{1/n} \hat \theta\biggr]\,.\]
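The exact coverage of this interval can be verified empirically. The sketch below (standard library only; \(\theta = 3\), \(n = 20\), \(\gamma = 0.95\) are illustrative choices) checks that the interval \([\hat\theta, C\hat\theta]\) with \(C = (1-\gamma)^{-1/n}\) contains \(\theta\) in about a fraction \(\gamma\) of repeated samples.

```python
import random

random.seed(5)
theta, n, gamma, trials = 3.0, 20, 0.95, 20_000

C = (1 / (1 - gamma)) ** (1 / n)  # upper factor of the strict interval
covered = 0
for _ in range(trials):
    theta_hat = max(random.uniform(0, theta) for _ in range(n))
    if theta_hat <= theta <= C * theta_hat:
        covered += 1

coverage = covered / trials  # should be close to gamma
```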

In the previous examples, thanks to a rather special form of the law \(\mathbb{P}_\theta,\) we could compute the probability \(\mathbb{P}_\theta(f(\theta) \in I)\) exactly, and hence obtain strict confidence intervals. In general, such an exact computation is not possible and one is reduced to finding non-strict confidence intervals.

Example 6.22

Suppose that \(\mathbb{P}_\theta\) has expectation \(\mu = \mathbb{E}_\theta[X_1]\) and variance \(\sigma^2 = \mathop{\mathrm{Var}}_\theta(X_1).\) We would like to estimate \(\mu.\) As an estimator, we use the empirical mean \(\bar X_n.\) From Chebyshev’s inequality we get \[\mathbb{P}_\theta(\lvert \bar X_n - \mu \rvert < \delta) \geqslant 1 - \frac{\sigma^2}{n \delta^2}\,,\] and choosing \(\delta = \sigma/\sqrt{n(1-\gamma)}\) we conclude that \[\tag{6.4} I :=\biggl[\bar X_n - \frac{\sigma}{\sqrt{n(1 - \gamma)}}, \bar X_n + \frac{\sigma}{\sqrt{n(1 - \gamma)}}\biggr]\] is a confidence interval for \(\mu\) with confidence level \(\gamma.\)

If \(n\) is large, the estimate from the previous example is highly wasteful: by the Central Limit Theorem, \(\bar X_n\) is asymptotically Gaussian. The following definition is sometimes used to capture this phenomenon.

Definition 6.23 • Asymptotic confidence interval

Let \(0 < \gamma < 1.\) Let \(X_1, \dots, X_n\) be an \(n\)-sample drawn from \(\mathbb{P}_\theta.\) Let \(f \,\colon\Theta \to \mathbb{R}.\) An interval \(I_n = I_n(X_1, \dots, X_n)\) is an asymptotic confidence interval for \(f(\theta)\) with confidence level \(\gamma\) if \[\lim_{n \to \infty} \mathbb{P}_\theta(f(\theta) \in I_n) \geqslant\gamma \,, \qquad \forall \theta \in \Theta\,.\] If there is equality, we call the \(I_n\) a strict asymptotic confidence interval.

Example 6.24 • Example 6.22 continued

Let us return to Example 6.22. By the Central Limit Theorem, \[\lim_{n \to \infty} \mathbb{P}_\theta \biggl(\bar X_n \in \biggl[\mu - \frac{a \sigma}{\sqrt{n}}, \mu + \frac{a \sigma}{\sqrt{n}}\biggr]\biggr) = \mathbb{P}(\lvert Z \rvert \leqslant a)\,,\] where \(Z\) is a standard Gaussian random variable. Given \(\gamma,\) choose \(a\) such that \(\mathbb{P}(\lvert Z \rvert \leqslant a) = \gamma.\) In that case the interval \[\tag{6.5} I_n :=\biggl[\bar X_n - \frac{a \sigma}{\sqrt{n}}, \bar X_n + \frac{a \sigma}{\sqrt{n}}\biggr]\] is an asymptotic confidence interval for \(\mu\) with confidence level \(\gamma.\) Note that, if one aims for high confidence levels, where \(\gamma\) is close to \(1,\) the interval (6.5) is much smaller than (6.4), because of the strong Gaussian decay of \(Z.\)

For instance, suppose that I am measuring the mean size \(\mu\) of a population whose standard deviation is \(\sigma = 0.73.\) How many samples do I need to determine \(\mu\) with an accuracy of \(0.1\)? To answer the question, I first have to choose a confidence level; this is a choice to be made by the statistician based on personal preferences and needs. Let us say that I would like \(\gamma = 99\%.\) This yields \(a \approx 2.58\) and hence \[I_n = \biggl[\bar X_n - \frac{1.88}{\sqrt{n}}, \bar X_n + \frac{1.88}{\sqrt{n}}\biggr]\,.\] I need to choose \(n\) such that \(\frac{a \sigma}{\sqrt{n}} \leqslant 0.1,\) i.e. \(n \geqslant (a\sigma/0.1)^2 \approx 355.\)
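This sample-size computation can be packaged as a small helper, sketched below with Python's `statistics.NormalDist` (the function name is illustrative). Note that using the exact \(99\%\) quantile \(a \approx 2.576\) rather than the rounded value \(2.58\) gives \(n = 354\), one fewer than with the rounded value; such off-by-one differences are an artifact of rounding the quantile.

```python
from math import ceil
from statistics import NormalDist

def sample_size(sigma, accuracy, gamma):
    """Smallest n with a*sigma/sqrt(n) <= accuracy, where P(|Z| <= a) = gamma."""
    a = NormalDist().inv_cdf((1 + gamma) / 2)  # two-sided normal quantile
    return ceil((a * sigma / accuracy) ** 2)
```

For example, `sample_size(sigma=0.73, accuracy=0.1, gamma=0.99)` reproduces the computation in the text with the exact quantile.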
