Introduction to Mathematical Statistics, Seventh Edition, provides students with a comprehensive introduction to mathema


*English*
*Pages 654 [655]*
*Year 2013*

- Author / Uploaded
- Robert V. Hogg
- Joseph W. McKean
- Allen T. Craig

*Table of contents: Cover; Table of Contents; 1. Probability and Distributions; 2. Multivariate Distributions; 3. Some Special Distributions; 4. Some Elementary Statistical Inferences; 5. Consistency and Limiting Distributions; 6. Maximum Likelihood Methods; 7. Sufficiency; 8. Optimal Tests of Hypotheses; 9. Inferences About Normal Models; 10. Nonparametric and Robust Statistics; 11. Appendix: Mathematical Comments; 12. Appendix: R Functions; 13. Appendix: Tables of Distributions; 14. Appendix: Lists of Common Distributions; Index*

Introduction to Mathematical Statistics, Seventh Edition
Robert V. Hogg, Joseph W. McKean, Allen T. Craig

ISBN 978-1-29202-499-8

Pearson Education Limited
Edinburgh Gate, Harlow, Essex CM20 2JE, England
and Associated Companies throughout the world

Visit us on the World Wide Web at: www.pearsoned.co.uk

© Pearson Education Limited 2014

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without either the prior written permission of the publisher or a licence permitting restricted copying in the United Kingdom issued by the Copyright Licensing Agency Ltd, Saffron House, 6–10 Kirby Street, London EC1N 8TS.

All trademarks used herein are the property of their respective owners. The use of any trademark in this text does not vest in the author or publisher any trademark ownership rights in such trademarks, nor does the use of such trademarks imply any affiliation with or endorsement of this book by such owners.

ISBN 10: 1-292-02499-2
ISBN 13: 978-1-292-02499-8

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Printed in the United States of America

Pearson Custom Library

Table of Contents

All chapters by Robert V. Hogg, Joseph W. McKean, and Allen T. Craig.

1. Probability and Distributions
2. Multivariate Distributions
3. Some Special Distributions
4. Some Elementary Statistical Inferences
5. Consistency and Limiting Distributions
6. Maximum Likelihood Methods
7. Sufficiency
8. Optimal Tests of Hypotheses
9. Inferences About Normal Models
10. Nonparametric and Robust Statistics
11. Appendix: Mathematical Comments
12. Appendix: R Functions
13. Appendix: Tables of Distributions
14. Appendix: Lists of Common Distributions

Index

Probability and Distributions

1 Introduction

Many kinds of investigations may be characterized in part by the fact that repeated experimentation, under essentially the same conditions, is more or less standard procedure. For instance, in medical research, interest may center on the effect of a drug that is to be administered; or an economist may be concerned with the prices of three specified commodities at various time intervals; or the agronomist may wish to study the effect that a chemical fertilizer has on the yield of a cereal grain. The only way in which an investigator can elicit information about any such phenomenon is to perform the experiment. Each experiment terminates with an outcome. But it is characteristic of these experiments that the outcome cannot be predicted with certainty prior to the performance of the experiment.

Suppose that we have such an experiment, but the experiment is of such a nature that a collection of every possible outcome can be described prior to its performance. If this kind of experiment can be repeated under the same conditions, it is called a random experiment, and the collection of every possible outcome is called the experimental space or the sample space.

Example 1.1. In the toss of a coin, let the outcome tails be denoted by T and let the outcome heads be denoted by H. If we assume that the coin may be repeatedly tossed under the same conditions, then the toss of this coin is an example of a random experiment in which the outcome is one of the two symbols T and H; that is, the sample space is the collection of these two symbols.

Example 1.2. In the cast of one red die and one white die, let the outcome be the ordered pair (number of spots up on the red die, number of spots up on the white die). If we assume that these two dice may be repeatedly cast under the same conditions, then the cast of this pair of dice is a random experiment. The sample space consists of the 36 ordered pairs: (1, 1), . . . , (1, 6), (2, 1), . . . , (2, 6), . . . , (6, 6).

From Chapter 1 of Introduction to Mathematical Statistics, Seventh Edition. Robert V. Hogg, Joseph W. McKean, Allen T. Craig. Copyright © 2013 by Pearson Education, Inc. All rights reserved.

Let $\mathcal{C}$ denote a sample space, let $c$ denote an element of $\mathcal{C}$, and let $C$ represent a collection of elements of $\mathcal{C}$. If, upon the performance of the experiment, the outcome is in $C$, we shall say that the event $C$ has occurred. Now conceive of our having made $N$ repeated performances of the random experiment. Then we can count the number $f$ of times (the frequency) that the event $C$ actually occurred throughout the $N$ performances. The ratio $f/N$ is called the relative frequency of the event $C$ in these $N$ experiments. A relative frequency is usually quite erratic for small values of $N$, as you can discover by tossing a coin. But as $N$ increases, experience indicates that we associate with the event $C$ a number, say $p$, that is equal or approximately equal to that number about which the relative frequency seems to stabilize. If we do this, then the number $p$ can be interpreted as that number which, in future performances of the experiment, the relative frequency of the event $C$ will either equal or approximate. Thus, although we cannot predict the outcome of a random experiment, we can, for a large value of $N$, predict approximately the relative frequency with which the outcome will be in $C$.

The number $p$ associated with the event $C$ is given various names. Sometimes it is called the probability that the outcome of the random experiment is in $C$; sometimes it is called the probability of the event $C$; and sometimes it is called the probability measure of $C$. The context usually suggests an appropriate choice of terminology.

Example 1.3. Let $\mathcal{C}$ denote the sample space of Example 1.2 and let $C$ be the collection of every ordered pair of $\mathcal{C}$ for which the sum of the pair is equal to seven. Thus $C$ is the collection (1, 6), (2, 5), (3, 4), (4, 3), (5, 2), and (6, 1). Suppose that the dice are cast $N = 400$ times and let $f$, the frequency of a sum of seven, be $f = 60$. Then the relative frequency with which the outcome was in $C$ is $f/N = \frac{60}{400} = 0.15$. Thus we might associate with $C$ a number $p$ that is close to 0.15, and $p$ would be called the probability of the event $C$.

Remark 1.1.
The preceding interpretation of probability is sometimes referred to as the relative frequency approach, and it obviously depends upon the fact that an experiment can be repeated under essentially identical conditions. However, many persons extend probability to other situations by treating it as a rational measure of belief. For example, the statement $p = \frac{2}{5}$ would mean to them that their personal or subjective probability of the event $C$ is equal to $\frac{2}{5}$. Hence, if they are not opposed to gambling, this could be interpreted as a willingness on their part to bet on the outcome of $C$ so that the two possible payoffs are in the ratio $p/(1 - p) = \frac{2}{5} / \frac{3}{5} = \frac{2}{3}$. Moreover, if they truly believe that $p = \frac{2}{5}$ is correct, they would be willing to accept either side of the bet: (a) win 3 units if $C$ occurs and lose 2 if it does not occur, or (b) win 2 units if $C$ does not occur and lose 3 if it does. However, since the mathematical properties of probability given in Section 3 are consistent with either of these interpretations, the subsequent mathematical development does not depend upon which approach is used.

The primary purpose of having a mathematical theory of statistics is to provide mathematical models for random experiments. Once a model for such an experiment has been provided and the theory worked out in detail, the statistician may, within this framework, make inferences (that is, draw conclusions) about the random experiment. The construction of such a model requires a theory of probability. One of the more logically satisfying theories of probability is that based on the concepts of sets and functions of sets. These concepts are introduced in Section 2.
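The relative frequency idea of Example 1.3 can be tried out with a short simulation. The sketch below is only an illustration: it assumes fair dice (which the text has not yet assumed at this point), uses Python's standard `random` module with a fixed seed chosen arbitrarily, and the function name `relative_frequency` is hypothetical.

```python
import random
from itertools import product

# Sample space of Example 1.2: ordered pairs (red die, white die).
sample_space = list(product(range(1, 7), repeat=2))

# Event C of Example 1.3: the two dice sum to seven.
event_c = {pair for pair in sample_space if sum(pair) == 7}


def relative_frequency(n_casts, seed=0):
    """Cast two fair dice n_casts times and return f/N for the event C."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    hits = 0
    for _ in range(n_casts):
        outcome = (rng.randint(1, 6), rng.randint(1, 6))
        if outcome in event_c:
            hits += 1
    return hits / n_casts


# The ratio f/N is erratic for small N and settles down as N grows.
for n in (10, 400, 100_000):
    print(n, relative_frequency(n))
```

Under the fair-dice assumption the relative frequency should settle near 6/36, since six of the 36 ordered pairs sum to seven; the value 0.15 in Example 1.3 is one hypothetical run of 400 casts.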

2 Set Theory

The concept of a set or a collection of objects is usually left undefined. However, a particular set can be described so that there is no misunderstanding as to what collection of objects is under consideration. For example, the set of the first 10 positive integers is sufficiently well described to make clear that the numbers $\frac{3}{4}$ and 14 are not in the set, while the number 3 is in the set. If an object belongs to a set, it is said to be an element of the set. For example, if $C$ denotes the set of real numbers $x$ for which $0 \le x \le 1$, then $\frac{3}{4}$ is an element of the set $C$. The fact that $\frac{3}{4}$ is an element of the set $C$ is indicated by writing $\frac{3}{4} \in C$. More generally, $c \in C$ means that $c$ is an element of the set $C$.

The sets that concern us are frequently sets of numbers. However, the language of sets of points proves somewhat more convenient than that of sets of numbers. Accordingly, we briefly indicate how we use this terminology. In analytic geometry considerable emphasis is placed on the fact that to each point on a line (on which an origin and a unit point have been selected) there corresponds one and only one number, say $x$; and that to each number $x$ there corresponds one and only one point on the line. This one-to-one correspondence between the numbers and points on a line enables us to speak, without misunderstanding, of the “point $x$” instead of the “number $x$.” Furthermore, with a plane rectangular coordinate system and with $x$ and $y$ numbers, to each symbol $(x, y)$ there corresponds one and only one point in the plane; and to each point in the plane there corresponds but one such symbol. Here again, we may speak of the “point $(x, y)$,” meaning the “ordered number pair $x$ and $y$.” This convenient language can be used when we have a rectangular coordinate system in a space of three or more dimensions. Thus the “point $(x_1, x_2, \ldots, x_n)$” means the numbers $x_1, x_2, \ldots, x_n$ in the order stated.
Accordingly, in describing our sets, we frequently speak of a set of points (a set whose elements are points), being careful, of course, to describe the set so as to avoid any ambiguity. The notation $C = \{x : 0 \le x \le 1\}$ is read “$C$ is the one-dimensional set of points $x$ for which $0 \le x \le 1$.” Similarly, $C = \{(x, y) : 0 \le x \le 1,\ 0 \le y \le 1\}$ can be read “$C$ is the two-dimensional set of points $(x, y)$ that are interior to, or on the boundary of, a square with opposite vertices at (0, 0) and (1, 1).”

We say a set $C$ is countable if $C$ is finite or has as many elements as there are positive integers. For example, the sets $C_1 = \{1, 2, \ldots, 100\}$ and $C_2 = \{1, 3, 5, 7, \ldots\}$ are countable sets. The interval of real numbers $(0, 1]$, though, is not countable.

We now give some definitions (together with illustrative examples) that lead to an elementary algebra of sets adequate for our purposes.

Definition 2.1. If each element of a set $C_1$ is also an element of set $C_2$, the set $C_1$ is called a subset of the set $C_2$. This is indicated by writing $C_1 \subset C_2$. If $C_1 \subset C_2$ and also $C_2 \subset C_1$, the two sets have the same elements, and this is indicated by writing $C_1 = C_2$.

Example 2.1. Let $C_1 = \{x : 0 \le x \le 1\}$ and $C_2 = \{x : -1 \le x \le 2\}$. Here the one-dimensional set $C_1$ is seen to be a subset of the one-dimensional set $C_2$; that is, $C_1 \subset C_2$. Subsequently, when the dimensionality of the set is clear, we do not make specific reference to it.

Example 2.2. Define the two sets $C_1 = \{(x, y) : 0 \le x = y \le 1\}$ and $C_2 = \{(x, y) : 0 \le x \le 1,\ 0 \le y \le 1\}$. Because the elements of $C_1$ are the points on one diagonal of the square, $C_1 \subset C_2$.

Definition 2.2. If a set $C$ has no elements, $C$ is called the null set. This is indicated by writing $C = \phi$.

Definition 2.3. The set of all elements that belong to at least one of the sets $C_1$ and $C_2$ is called the union of $C_1$ and $C_2$. The union of $C_1$ and $C_2$ is indicated by writing $C_1 \cup C_2$. The union of several sets $C_1, C_2, C_3, \ldots$ is the set of all elements that belong to at least one of the several sets, denoted by $C_1 \cup C_2 \cup C_3 \cup \cdots = \cup_{j=1}^{\infty} C_j$ or by $C_1 \cup C_2 \cup \cdots \cup C_k = \cup_{j=1}^{k} C_j$ if a finite number $k$ of sets is involved. We refer to a union of the form $\cup_{j=1}^{\infty} C_j$ as a countable union.

Example 2.3. Define the sets $C_1 = \{x : x = 8, 9, 10, 11, \text{ or } 11 < x \le 12\}$ and $C_2 = \{x : x = 0, 1, \ldots, 10\}$. Then

$$
\begin{aligned}
C_1 \cup C_2 &= \{x : x = 0, 1, \ldots, 8, 9, 10, 11, \text{ or } 11 < x \le 12\} \\
&= \{x : x = 0, 1, \ldots, 8, 9, 10 \text{ or } 11 \le x \le 12\}.
\end{aligned}
$$

Example 2.4. Define $C_1$ and $C_2$ as in Example 2.1. Then $C_1 \cup C_2 = C_2$.

Example 2.5. Let $C_2 = \phi$. Then $C_1 \cup C_2 = C_1$, for every set $C_1$.

Example 2.6. For every set $C$, $C \cup C = C$.

Example 2.7. Let

$$
C_k = \left\{ x : \frac{1}{k+1} \le x \le 1 \right\}, \quad k = 1, 2, 3, \ldots.
$$

Then $\cup_{k=1}^{\infty} C_k = \{x : 0 < x \le 1\}$. Note that the number zero is not in this set, since it is not in any of the sets $C_1, C_2, C_3, \ldots$.

Definition 2.4. The set of all elements that belong to each of the sets $C_1$ and $C_2$ is called the intersection of $C_1$ and $C_2$. The intersection of $C_1$ and $C_2$ is indicated by writing $C_1 \cap C_2$. The intersection of several sets $C_1, C_2, C_3, \ldots$ is the set of all elements that belong to each of the sets $C_1, C_2, C_3, \ldots$. This intersection is denoted by $C_1 \cap C_2 \cap C_3 \cap \cdots = \cap_{j=1}^{\infty} C_j$ or by $C_1 \cap C_2 \cap \cdots \cap C_k = \cap_{j=1}^{k} C_j$ if a finite number $k$ of sets is involved. We refer to an intersection of the form $\cap_{j=1}^{\infty} C_j$ as a countable intersection.

Example 2.8. Let $C_1 = \{(0, 0), (0, 1), (1, 1)\}$ and $C_2 = \{(1, 1), (1, 2), (2, 1)\}$. Then $C_1 \cap C_2 = \{(1, 1)\}$.

Example 2.9. Let $C_1 = \{(x, y) : 0 \le x + y \le 1\}$ and $C_2 = \{(x, y) : 1 < x + y\}$. Then $C_1$ and $C_2$ have no points in common and $C_1 \cap C_2 = \phi$.

Example 2.10. For every set $C$, $C \cap C = C$ and $C \cap \phi = \phi$.
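For finite sets, the union, intersection, and subset relations defined above correspond directly to operations on Python's built-in `set` type. The following is a minimal sketch of Examples 2.6, 2.8, and 2.10 under that correspondence; the variable names are illustrative only, and the checks for a particular set do not replace the general statements.

```python
# Example 2.8: finite sets of points in the plane.
c1 = {(0, 0), (0, 1), (1, 1)}
c2 = {(1, 1), (1, 2), (2, 1)}

union = c1 | c2          # elements in at least one of the sets
intersection = c1 & c2   # elements in each of the sets

print(sorted(union))
print(intersection)      # {(1, 1)}

# Examples 2.6 and 2.10, checked for one particular set C.
c = {1, 2, 3}
empty = set()            # plays the role of the null set
assert c | c == c
assert c & c == c and c & empty == empty

# Definition 2.1: C1 is a subset of C2 when every element of C1 is in C2.
assert {1, 2} <= {1, 2, 3}
```

Sets described by inequalities, such as intervals, are not finite collections and would need a different representation.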

Figure 2.1: (a) $C_1 \cup C_2$ and (b) $C_1 \cap C_2$.

Example 2.11. Let Ck =

x:0 0, y > 0}.

Definition 2.6. Let $\mathcal{C}$ denote a space and let $C$ be a subset of the set $\mathcal{C}$. The set that consists of all elements of $\mathcal{C}$ that are not elements of $C$ is called the complement of $C$ (actually, with respect to $\mathcal{C}$). The complement of $C$ is denoted by $C^c$. In particular, $\mathcal{C}^c = \phi$.

Example 2.15. Let $\mathcal{C}$ be defined as in Example 2.13, and let the set $C = \{0, 1\}$. The complement of $C$ (with respect to $\mathcal{C}$) is $C^c = \{2, 3, 4\}$.

Example 2.16. Given $C \subset \mathcal{C}$. Then $C \cup C^c = \mathcal{C}$, $C \cap C^c = \phi$, $C \cup \mathcal{C} = \mathcal{C}$, $C \cap \mathcal{C} = C$, and $(C^c)^c = C$.

Example 2.17 (DeMorgan’s Laws). A set of useful rules is known as DeMorgan’s Laws. Let $\mathcal{C}$ denote a space and let $C_i \subset \mathcal{C}$, $i = 1, 2$. Then

$$
(C_1 \cap C_2)^c = C_1^c \cup C_2^c \tag{2.1}
$$

$$
(C_1 \cup C_2)^c = C_1^c \cap C_2^c. \tag{2.2}
$$

The reader is asked to prove these in Exercise 2.4 and to extend them to countable unions and intersections.

Many of the functions used in calculus and in this chapter are functions which map real numbers into real numbers. We are often, however, concerned with functions that map sets into real numbers. Such functions are naturally called functions of a set or, more simply, set functions. Next we give some examples of set functions and evaluate them for certain simple sets.

Example 2.18. Let $C$ be a set in one-dimensional space and let $Q(C)$ be equal to the number of points in $C$ which correspond to positive integers. Then $Q(C)$ is a function of the set $C$. Thus, if $C = \{x : 0 < x < 5\}$, then $Q(C) = 4$; if $C = \{-2, -1\}$, then $Q(C) = 0$; if $C = \{x : -\infty < x < 6\}$, then $Q(C) = 5$.

Example 2.19. Let $C$ be a set in two-dimensional space and let $Q(C)$ be the area of $C$ if $C$ has a finite area; otherwise, let $Q(C)$ be undefined. Thus, if $C = \{(x, y) : x^2 + y^2 \le 1\}$, then $Q(C) = \pi$; if $C = \{(0, 0), (1, 1), (0, 1)\}$, then $Q(C) = 0$; if $C = \{(x, y) : 0 \le x,\ 0 \le y,\ x + y \le 1\}$, then $Q(C) = \frac{1}{2}$.

Example 2.20. Let $C$ be a set in three-dimensional space and let $Q(C)$ be the volume of $C$ if $C$ has a finite volume; otherwise, let $Q(C)$ be undefined. Thus, if $C = \{(x, y, z) : 0 \le x \le 2,\ 0 \le y \le 1,\ 0 \le z \le 3\}$, then $Q(C) = 6$; if $C = \{(x, y, z) : x^2 + y^2 + z^2 \ge 1\}$, then $Q(C)$ is undefined.

At this point we introduce the following notations. The symbol

$$
\int_C f(x)\,dx
$$

means the ordinary (Riemann) integral of $f(x)$ over a prescribed one-dimensional set $C$; the symbol

$$
\iint_C g(x, y)\,dx\,dy
$$

means the Riemann integral of $g(x, y)$ over a prescribed two-dimensional set $C$; and so on. To be sure, unless these sets $C$ and these functions $f(x)$ and $g(x, y)$ are chosen with care, the integrals frequently fail to exist. Similarly, the symbol

$$
\sum_C f(x)
$$

means the sum extended over all $x \in C$; the symbol

$$
\mathop{\sum\sum}_C g(x, y)
$$

means the sum extended over all $(x, y) \in C$; and so on.

Example 2.21. Let $C$ be a set in one-dimensional space and let $Q(C) = \sum_C f(x)$, where

$$
f(x) = \begin{cases} \left(\frac{1}{2}\right)^x & x = 1, 2, 3, \ldots \\ 0 & \text{elsewhere.} \end{cases}
$$

If $C = \{x : 0 \le x \le 3\}$, then $Q(C) = \frac{1}{2} + \left(\frac{1}{2}\right)^2 + \left(\frac{1}{2}\right)^3 = \frac{7}{8}$.

Example 2.22. Let $Q(C) = \sum_C f(x)$, where

$$
f(x) = \begin{cases} p^x (1 - p)^{1 - x} & x = 0, 1 \\ 0 & \text{elsewhere.} \end{cases}
$$

If $C = \{0\}$, then

$$
Q(C) = \sum_{x=0}^{0} p^x (1 - p)^{1 - x} = 1 - p;
$$

if $C = \{x : 1 \le x \le 2\}$, then $Q(C) = f(1) = p$.

Example 2.23. Let $C$ be a one-dimensional set and let

$$
Q(C) = \int_C e^{-x}\,dx.
$$

Thus, if $C = \{x : 0 \le x < \infty\}$, then

$$
Q(C) = \int_0^{\infty} e^{-x}\,dx = 1;
$$

if $C = \{x : 1 \le x \le 2\}$, then

$$
Q(C) = \int_1^2 e^{-x}\,dx = e^{-1} - e^{-2};
$$

if $C_1 = \{x : 0 \le x \le 1\}$ and $C_2 = \{x : 1 < x \le 3\}$, then

$$
\begin{aligned}
Q(C_1 \cup C_2) &= \int_0^3 e^{-x}\,dx \\
&= \int_0^1 e^{-x}\,dx + \int_1^3 e^{-x}\,dx \\
&= Q(C_1) + Q(C_2).
\end{aligned}
$$
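The set functions of Examples 2.21 and 2.23 can be evaluated numerically as a check on the arithmetic. This is a minimal sketch, not a general integrator: the helper names `q_sum` and `q_exp` are hypothetical, `q_sum` assumes $C$ is given as a finite collection of points, and `q_exp` assumes $C$ is an interval and uses the antiderivative $-e^{-x}$.

```python
import math
from fractions import Fraction


def q_sum(points):
    """Example 2.21: sum f(x) = (1/2)**x over the positive integers in C."""
    return sum(Fraction(1, 2) ** x for x in points
               if isinstance(x, int) and x >= 1)


def q_exp(a, b):
    """Example 2.23 for an interval C with endpoints a <= b:
    the integral of exp(-x) over C, from the antiderivative -exp(-x)."""
    return math.exp(-a) - math.exp(-b)


# C = {x : 0 <= x <= 3}: only x = 1, 2, 3 contribute to the sum.
print(q_sum(range(0, 4)))          # 7/8

# C = {x : 0 <= x < infinity}.
print(q_exp(0, math.inf))          # 1.0

# Additivity over the disjoint sets C1 = [0, 1] and C2 = (1, 3].
total = q_exp(0, 3)
parts = q_exp(0, 1) + q_exp(1, 3)
print(math.isclose(total, parts))  # True
```

Single points have no effect on the integral, which is why the open or closed endpoints of $C_1$ and $C_2$ do not appear in `q_exp`.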

Example 2.24. Let $C$ be a set in $n$-dimensional space and let

$$
Q(C) = \int \cdots \int_C dx_1\,dx_2 \cdots dx_n.
$$

If $C = \{(x_1, x_2, \ldots, x_n) : 0 \le x_1 \le x_2 \le \cdots \le x_n \le 1\}$, then

$$
Q(C) = \int_0^1 \int_0^{x_n} \cdots \int_0^{x_3} \int_0^{x_2} dx_1\,dx_2 \cdots dx_{n-1}\,dx_n = \frac{1}{n!},
$$

where $n! = n(n - 1) \cdots 3 \cdot 2 \cdot 1$.

EXERCISES

2.1. Find the union $C_1 \cup C_2$ and the intersection $C_1 \cap C_2$ of the two sets $C_1$ and $C_2$, where
(a) $C_1 = \{0, 1, 2\}$, $C_2 = \{2, 3, 4\}$.
(b) $C_1 = \{x : 0 < x < 2\}$, $C_2 = \{x : 1 \le x < 3\}$.
(c) $C_1 = \{(x, y) : 0 < x < 2,\ 1 < y < 2\}$, $C_2 = \{(x, y) : 1 < x < 3,\ 1 < y < 3\}$.

2.2. Find the complement $C^c$ of the set $C$ with respect to the space $\mathcal{C}$ if
(a) $\mathcal{C} = \{x : 0 < x < 1\}$, $C = \{x : \frac{5}{8} < x < 1\}$.
(b) $\mathcal{C} = \{(x, y, z) : x^2 + y^2 + z^2 \le 1\}$, $C = \{(x, y, z) : x^2 + y^2 + z^2 = 1\}$.
(c) $\mathcal{C} = \{(x, y) : |x| + |y| \le 2\}$, $C = \{(x, y) : x^2 + y^2 < 2\}$.

2.3. List all possible arrangements of the four letters m, a, r, and y. Let $C_1$ be the collection of the arrangements in which y is in the last position. Let $C_2$ be the collection of the arrangements in which m is in the first position. Find the union and the intersection of $C_1$ and $C_2$.

2.4. Referring to Example 2.17, verify DeMorgan’s Laws (2.1) and (2.2) by using Venn diagrams and then prove that the laws are true. Generalize the laws to countable unions and intersections.

2.5. By the use of Venn diagrams, in which the space $\mathcal{C}$ is the set of points enclosed by a rectangle containing the circles $C_1$, $C_2$, and $C_3$, compare the following sets. These laws are called the distributive laws.
(a) $C_1 \cap (C_2 \cup C_3)$ and $(C_1 \cap C_2) \cup (C_1 \cap C_3)$.
(b) $C_1 \cup (C_2 \cap C_3)$ and $(C_1 \cup C_2) \cap (C_1 \cup C_3)$.

2.6. If a sequence of sets $C_1, C_2, C_3, \ldots$ is such that $C_k \subset C_{k+1}$, $k = 1, 2, 3, \ldots$, the sequence is said to be a nondecreasing sequence. Give an example of this kind of sequence of sets.

2.7. If a sequence of sets $C_1, C_2, C_3, \ldots$ is such that $C_k \supset C_{k+1}$, $k = 1, 2, 3, \ldots$, the sequence is said to be a nonincreasing sequence. Give an example of this kind of sequence of sets.

2.8. Suppose $C_1, C_2, C_3, \ldots$ is a nondecreasing sequence of sets, i.e., $C_k \subset C_{k+1}$, for $k = 1, 2, 3, \ldots$. Then $\lim_{k \to \infty} C_k$ is defined as the union $C_1 \cup C_2 \cup C_3 \cup \cdots$. Find $\lim_{k \to \infty} C_k$ if
(a) $C_k = \{x : 1/k \le x \le 3 - 1/k\}$, $k = 1, 2, 3, \ldots$.
(b) $C_k = \{(x, y) : 1/k \le x^2 + y^2 \le 4 - 1/k\}$, $k = 1, 2, 3, \ldots$.

2.9. If $C_1, C_2, C_3, \ldots$ are sets such that $C_k \supset C_{k+1}$, $k = 1, 2, 3, \ldots$, $\lim_{k \to \infty} C_k$ is defined as the intersection $C_1 \cap C_2 \cap C_3 \cap \cdots$. Find $\lim_{k \to \infty} C_k$ if
(a) $C_k = \{x : 2 - 1/k < x \le 2\}$, $k = 1, 2, 3, \ldots$.
(b) $C_k = \{x : 2 < x \le 2 + 1/k\}$, $k = 1, 2, 3, \ldots$.
(c) $C_k = \{(x, y) : 0 \le x^2 + y^2 \le 1/k\}$, $k = 1, 2, 3, \ldots$.

2.10. For every one-dimensional set $C$, define the function $Q(C) = \sum_C f(x)$, where $f(x) = \left(\frac{2}{3}\right)\left(\frac{1}{3}\right)^x$, $x = 0, 1, 2, \ldots$, zero elsewhere. If $C_1 = \{x : x = 0, 1, 2, 3\}$ and $C_2 = \{x : x = 0, 1, 2, \ldots\}$, find $Q(C_1)$ and $Q(C_2)$.
Hint: Recall that $S_n = a + ar + \cdots + ar^{n-1} = a(1 - r^n)/(1 - r)$ and, hence, it follows that $\lim_{n \to \infty} S_n = a/(1 - r)$ provided that $|r| < 1$.

2.11. For every one-dimensional set $C$ for which the integral exists, let $Q(C) = \int_C f(x)\,dx$, where $f(x) = 6x(1 - x)$, $0 < x < 1$, zero elsewhere; otherwise, let $Q(C)$ be undefined. If $C_1 = \{x : \frac{1}{4} < x < \frac{3}{4}\}$, $C_2 = \{\frac{1}{2}\}$, and $C_3 = \{x : 0 < x < 10\}$, find $Q(C_1)$, $Q(C_2)$, and $Q(C_3)$.

2.12. For every two-dimensional set $C$ contained in $R^2$ for which the integral exists, let $Q(C) = \iint_C (x^2 + y^2)\,dx\,dy$. If $C_1 = \{(x, y) : -1 \le x \le 1,\ -1 \le y \le 1\}$, $C_2 = \{(x, y) : -1 \le x = y \le 1\}$, and $C_3 = \{(x, y) : x^2 + y^2 \le 1\}$, find $Q(C_1)$, $Q(C_2)$, and $Q(C_3)$.

2.13. Let $\mathcal{C}$ denote the set of points that are interior to, or on the boundary of, a square with opposite vertices at the points (0, 0) and (1, 1). Let $Q(C) = \iint_C dy\,dx$.
(a) If $C \subset \mathcal{C}$ is the set $\{(x, y) : 0 < x < y < 1\}$, compute $Q(C)$.
(b) If $C \subset \mathcal{C}$ is the set $\{(x, y) : 0 < x = y < 1\}$, compute $Q(C)$.
(c) If $C \subset \mathcal{C}$ is the set $\{(x, y) : 0 < x/2 \le y \le 3x/2 < 1\}$, compute $Q(C)$.

2.14. Let $\mathcal{C}$ be the set of points interior to or on the boundary of a cube with edge of length 1. Moreover, say that the cube is in the first octant with one vertex at the point (0, 0, 0) and an opposite vertex at the point (1, 1, 1). Let $Q(C) = \iiint_C dx\,dy\,dz$.
(a) If $C \subset \mathcal{C}$ is the set $\{(x, y, z) : 0 < x < y < z < 1\}$, compute $Q(C)$.
(b) If $C$ is the subset $\{(x, y, z) : 0 < x = y = z < 1\}$, compute $Q(C)$.

2.15. Let $C$ denote the set $\{(x, y, z) : x^2 + y^2 + z^2 \le 1\}$. Using spherical coordinates, evaluate

$$
Q(C) = \iiint_C x^2 + y^2 + z^2 \,dx\,dy\,dz.
$$

2.16. To join a certain club, a person must be either a statistician or a mathematician or both. Of the 25 members in this club, 19 are statisticians and 16 are mathematicians. How many persons in the club are both a statistician and a mathematician?

2.17. After a hard-fought football game, it was reported that, of the 11 starting players, 8 hurt a hip, 6 hurt an arm, 5 hurt a knee, 3 hurt both a hip and an arm, 2 hurt both a hip and a knee, 1 hurt both an arm and a knee, and no one hurt all three. Comment on the accuracy of the report.

3 The Probability Set Function

Given an experiment, let $\mathcal{C}$ denote the sample space of all possible outcomes. As discussed in Section 1, we are interested in assigning probabilities to events, i.e., subsets of $\mathcal{C}$. What should be our collection of events? If $\mathcal{C}$ is a finite set, then we could take the set of all subsets as this collection. For infinite sample spaces, though, with assignment of probabilities in mind, this poses mathematical technicalities which are better left to a course in probability theory. We assume that in all cases, the collection of events is sufficiently rich to include all possible events of interest and is closed under complements and countable unions of these events. Using DeMorgan’s Laws, Example 2.17, the collection is then also closed under countable intersections. We denote this collection of events by $\mathcal{B}$. Technically, such a collection of events is called a $\sigma$-field of subsets.

Now that we have a sample space, $\mathcal{C}$, and our collection of events, $\mathcal{B}$, we can define the third component in our probability space, namely a probability set function. In order to motivate its definition, we consider the relative frequency approach to probability.

Remark 3.1. The definition of probability consists of three axioms which we motivate by the following three intuitive properties of relative frequency. Let $\mathcal{C}$ be a sample space and let $C \subset \mathcal{C}$. Suppose we repeat the experiment $N$ times. Then the relative frequency of $C$ is $f_C = \#\{C\}/N$, where $\#\{C\}$ denotes the number of times $C$ occurred in the $N$ repetitions. Note that $f_C \ge 0$ and $f_{\mathcal{C}} = 1$. These are the first two properties. For the third, suppose that $C_1$ and $C_2$ are disjoint events. Then $f_{C_1 \cup C_2} = f_{C_1} + f_{C_2}$. These three properties of relative frequencies form the axioms of a probability, except that the third axiom is in terms of countable unions. As with the axioms of probability, the readers should check that the theorems we prove below about probabilities agree with their intuition of relative frequency.
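The three properties of relative frequency in Remark 3.1 hold exactly for any recorded sequence of outcomes, which a few lines of code can confirm. This is an illustrative sketch only: the experiment (casting one die 600 times), the seed, and the name `rel_freq` are assumptions made for the example.

```python
import random
from fractions import Fraction

rng = random.Random(1)                                # arbitrary seed (assumption)
outcomes = [rng.randint(1, 6) for _ in range(600)]    # N = 600 casts of one die


def rel_freq(event):
    """Relative frequency f_C = #{C}/N of the event C in the recorded casts."""
    return Fraction(sum(1 for c in outcomes if c in event), len(outcomes))


space = {1, 2, 3, 4, 5, 6}
c1, c2 = {1, 2}, {5}                                  # two disjoint events

assert rel_freq(c1) >= 0                                  # first property
assert rel_freq(space) == 1                               # second property
assert rel_freq(c1 | c2) == rel_freq(c1) + rel_freq(c2)   # third property
```

Exact fractions are used so that the additivity check is not affected by floating-point rounding.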

Definition 3.1 (Probability). Let $\mathcal{C}$ be a sample space and let $\mathcal{B}$ be the set of events. Let $P$ be a real-valued function defined on $\mathcal{B}$. Then $P$ is a probability set function if $P$ satisfies the following three conditions:

1. $P(C) \ge 0$, for all $C \in \mathcal{B}$.
2. $P(\mathcal{C}) = 1$.
3. If $\{C_n\}$ is a sequence of events in $\mathcal{B}$ and $C_m \cap C_n = \phi$ for all $m \ne n$, then

$$
P\left( \bigcup_{n=1}^{\infty} C_n \right) = \sum_{n=1}^{\infty} P(C_n).
$$

A collection of events whose members are pairwise disjoint, as in (3), is said to be a mutually exclusive collection. The collection is further said to be exhaustive if the union of its events is the sample space, in which case $\sum_{n=1}^{\infty} P(C_n) = 1$. We often say that a mutually exclusive and exhaustive collection of events forms a partition of $\mathcal{C}$.

A probability set function tells us how the probability is distributed over the set of events, $\mathcal{B}$. In this sense we speak of a distribution of probability. We often drop the word “set” and refer to $P$ as a probability function.

The following theorems give us some other properties of a probability set function. In the statement of each of these theorems, $P(C)$ is taken, tacitly, to be a probability set function defined on the collection of events $\mathcal{B}$ of a sample space $\mathcal{C}$.

Theorem 3.1. For each event $C \in \mathcal{B}$, $P(C) = 1 - P(C^c)$.

Proof: We have $\mathcal{C} = C \cup C^c$ and $C \cap C^c = \phi$. Thus, from (2) and (3) of Definition 3.1, it follows that $1 = P(C) + P(C^c)$, which is the desired result.

Theorem 3.2. The probability of the null set is zero; that is, $P(\phi) = 0$.

Proof: In Theorem 3.1, take $C = \phi$ so that $C^c = \mathcal{C}$. Accordingly, we have $P(\phi) = 1 - P(\mathcal{C}) = 1 - 1 = 0$ and the theorem is proved.

Theorem 3.3. If $C_1$ and $C_2$ are events such that $C_1 \subset C_2$, then $P(C_1) \le P(C_2)$.

Proof: Now $C_2 = C_1 \cup (C_1^c \cap C_2)$ and $C_1 \cap (C_1^c \cap C_2) = \phi$. Hence, from (3) of Definition 3.1, $P(C_2) = P(C_1) + P(C_1^c \cap C_2)$. From (1) of Definition 3.1, $P(C_1^c \cap C_2) \ge 0$. Hence, $P(C_2) \ge P(C_1)$.
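Theorems 3.1 to 3.3 can be checked on a small finite model. The sketch below assumes the equally likely model on the 36 ordered pairs of Example 1.2, so that every subset is an event and $P(C)$ is a ratio of counts; the function name `prob` and the particular events are illustrative choices, and a check on one model is not a proof.

```python
from fractions import Fraction
from itertools import product

space = frozenset(product(range(1, 7), repeat=2))    # the 36 ordered pairs


def prob(event):
    """Equally likely model (an assumption): P(C) = |C| / 36."""
    return Fraction(len(event & space), len(space))


c1 = {pt for pt in space if sum(pt) == 7}
c2 = {pt for pt in space if sum(pt) >= 7}             # c1 is a subset of c2
complement = space - c1

assert prob(space) == 1                               # condition 2 of Definition 3.1
assert prob(c1) == 1 - prob(complement)               # Theorem 3.1
assert prob(set()) == 0                               # Theorem 3.2
assert c1 <= c2 and prob(c1) <= prob(c2)              # Theorem 3.3
print(prob(c1), prob(c2))
```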

Theorem 3.4. For each $C \in \mathcal{B}$, $0 \le P(C) \le 1$.

Proof: Since $\phi \subset C \subset \mathcal{C}$, we have by Theorem 3.3 that

$$
P(\phi) \le P(C) \le P(\mathcal{C}) \quad \text{or} \quad 0 \le P(C) \le 1,
$$

the desired result.

Part (3) of the definition of probability says that $P(C_1 \cup C_2) = P(C_1) + P(C_2)$ if $C_1$ and $C_2$ are disjoint, i.e., $C_1 \cap C_2 = \phi$. The next theorem gives the rule for any two events.

Theorem 3.5. If $C_1$ and $C_2$ are events in $\mathcal{C}$, then

$$
P(C_1 \cup C_2) = P(C_1) + P(C_2) - P(C_1 \cap C_2).
$$

Proof: Each of the sets $C_1 \cup C_2$ and $C_2$ can be represented, respectively, as a union of nonintersecting sets as follows:

$$
C_1 \cup C_2 = C_1 \cup (C_1^c \cap C_2) \quad \text{and} \quad C_2 = (C_1 \cap C_2) \cup (C_1^c \cap C_2).
$$

Thus, from (3) of Definition 3.1,

$$
P(C_1 \cup C_2) = P(C_1) + P(C_1^c \cap C_2)
$$

and

$$
P(C_2) = P(C_1 \cap C_2) + P(C_1^c \cap C_2).
$$

If the second of these equations is solved for $P(C_1^c \cap C_2)$ and this result substituted in the first equation, we obtain

$$
P(C_1 \cup C_2) = P(C_1) + P(C_2) - P(C_1 \cap C_2).
$$

This completes the proof.

Remark 3.2 (Inclusion Exclusion Formula). It is easy to show (Exercise 3.9) that

$$
P(C_1 \cup C_2 \cup C_3) = p_1 - p_2 + p_3,
$$

where

$$
\begin{aligned}
p_1 &= P(C_1) + P(C_2) + P(C_3) \\
p_2 &= P(C_1 \cap C_2) + P(C_1 \cap C_3) + P(C_2 \cap C_3) \\
p_3 &= P(C_1 \cap C_2 \cap C_3).
\end{aligned} \tag{3.1}
$$
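The three-event formula (3.1) can be checked numerically on a finite model. The sketch below assumes fair dice on the sample space of Example 1.2; the three events and the helper `prob` are hypothetical choices made only to illustrate the formula.

```python
from fractions import Fraction
from itertools import product

space = set(product(range(1, 7), repeat=2))


def prob(event):
    """Equally likely model (an assumption): P(C) = |C| / 36."""
    return Fraction(len(event), len(space))


c1 = {pt for pt in space if pt[0] == 1}       # red die shows 1
c2 = {pt for pt in space if pt[1] == 1}       # white die shows 1
c3 = {pt for pt in space if sum(pt) <= 4}     # sum of spots at most 4

p1 = prob(c1) + prob(c2) + prob(c3)
p2 = prob(c1 & c2) + prob(c1 & c3) + prob(c2 & c3)
p3 = prob(c1 & c2 & c3)

# Left side computed directly from the union, right side from (3.1).
assert prob(c1 | c2 | c3) == p1 - p2 + p3
print(p1, p2, p3, prob(c1 | c2 | c3))
```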

This can be generalized to the inclusion exclusion formula:

$$
P(C_1 \cup C_2 \cup \cdots \cup C_k) = p_1 - p_2 + p_3 - \cdots + (-1)^{k+1} p_k, \tag{3.2}
$$

where $p_i$ equals the sum of the probabilities of all possible intersections involving $i$ sets. It is clear in the case $k = 3$ that $p_1 \ge p_2 \ge p_3$, but more generally $p_1 \ge p_2 \ge \cdots \ge p_k$. As shown in Theorem 3.7,

$$
p_1 = P(C_1) + P(C_2) + \cdots + P(C_k) \ge P(C_1 \cup C_2 \cup \cdots \cup C_k).
$$

This is known as Boole’s inequality. For $k = 2$, we have

$$
1 \ge P(C_1 \cup C_2) = P(C_1) + P(C_2) - P(C_1 \cap C_2),
$$

which gives Bonferroni’s inequality,

$$
P(C_1 \cap C_2) \ge P(C_1) + P(C_2) - 1, \tag{3.3}
$$

which is only useful when $P(C_1)$ and $P(C_2)$ are large. The inclusion exclusion formula provides other inequalities that are useful, such as

$$
p_1 \ge P(C_1 \cup C_2 \cup \cdots \cup C_k) \ge p_1 - p_2
$$

and

$$
p_1 - p_2 + p_3 \ge P(C_1 \cup C_2 \cup \cdots \cup C_k) \ge p_1 - p_2 + p_3 - p_4.
$$

Example 3.1. Let $\mathcal{C}$ denote the sample space of Example 1.2. Let the probability set function assign a probability of $\frac{1}{36}$ to each of the 36 points in $\mathcal{C}$; that is, the dice are fair. If $C_1 = \{(1, 1), (2, 1), (3, 1), (4, 1), (5, 1)\}$ and $C_2 = \{(1, 2), (2, 2), (3, 2)\}$, then $P(C_1) = \frac{5}{36}$, $P(C_2) = \frac{3}{36}$, $P(C_1 \cup C_2) = \frac{8}{36}$, and $P(C_1 \cap C_2) = 0$.

Example 3.2. Two coins are to be tossed and the outcome is the ordered pair (face on the first coin, face on the second coin). Thus the sample space may be represented as $\mathcal{C} = \{(H, H), (H, T), (T, H), (T, T)\}$. Let the probability set function assign a probability of $\frac{1}{4}$ to each element of $\mathcal{C}$. Let $C_1 = \{(H, H), (H, T)\}$ and $C_2 = \{(H, H), (T, H)\}$. Then $P(C_1) = P(C_2) = \frac{1}{2}$, $P(C_1 \cap C_2) = \frac{1}{4}$, and, in accordance with Theorem 3.5, $P(C_1 \cup C_2) = \frac{1}{2} + \frac{1}{2} - \frac{1}{4} = \frac{3}{4}$.

Example 3.3 (Equilikely Case). Let $\mathcal{C}$ be partitioned into $k$ mutually disjoint subsets $C_1, C_2, \ldots, C_k$ in such a way that the union of these $k$ mutually disjoint subsets is the sample space $\mathcal{C}$. Thus the events $C_1, C_2, \ldots, C_k$ are mutually exclusive and exhaustive. Suppose that the random experiment is of such a character that it is reasonable to assume that each of the mutually exclusive and exhaustive events $C_i$, $i = 1, 2, \ldots, k$, has the same probability. It is necessary then that $P(C_i) = 1/k$, $i = 1, 2, \ldots, k$; and we often say that the events $C_1, C_2, \ldots, C_k$ are equally likely. Let the event $E$ be the union of $r$ of these mutually exclusive events, say

$$
E = C_1 \cup C_2 \cup \cdots \cup C_r, \quad r \le k.
$$

Then

$$
P(E) = P(C_1) + P(C_2) + \cdots + P(C_r) = \frac{r}{k}.
$$

Frequently, the integer $k$ is called the total number of ways (for this particular partition of $\mathcal{C}$) in which the random experiment can terminate and the integer $r$ is called the number of ways that are favorable to the event $E$. So, in this terminology, $P(E)$ is equal to the number of ways favorable to the event $E$ divided by the total number of ways in which the experiment can terminate. It should be emphasized that in order to assign, in this manner, the probability $r/k$ to the event $E$, we must assume that each of the mutually exclusive and exhaustive events $C_1, C_2, \ldots, C_k$ has the same probability $1/k$. This assumption of equally likely events then becomes a part of our probability model. Obviously, if this assumption is not realistic in an application, the probability of the event $E$ cannot be computed in this way.

In order to illustrate the equilikely case, it is helpful to use some elementary counting rules. These are usually discussed in an elementary algebra course. In the next remark, we offer a brief review of these rules.

Remark 3.3 (Counting Rules). Suppose we have two experiments. The first experiment results in $m$ outcomes, while the second experiment results in $n$ outcomes. The composite experiment, first experiment followed by second experiment, has $mn$ outcomes, which can be represented as $mn$ ordered pairs. This is called the multiplication rule or the $mn$-rule. This is easily extended to more than two experiments. Let $A$ be a set with $n$ elements. Suppose we are interested in $k$-tuples whose components are elements of $A$. Then by the extended multiplication rule, there are $n \cdot n \cdots n = n^k$ such $k$-tuples whose components are elements of $A$. Next, suppose $k \le n$ and we are interested in $k$-tuples whose components are distinct (no repeats) elements of $A$. There are $n$ elements from which to choose for the first component, $n - 1$ for the second component, . . . , $n - (k - 1)$ for the $k$th. Hence, by the multiplication rule, there are $n(n - 1) \cdots (n - (k - 1))$ such $k$-tuples with distinct elements.
We call each such $k$-tuple a permutation and use the symbol $P^n_k$ to denote the number of $k$ permutations taken from a set of $n$ elements. Hence, we have the formula

$$P^n_k = n(n-1)\cdots(n-(k-1)) = \frac{n!}{(n-k)!}. \tag{3.4}$$

Next, suppose order is not important, so instead of counting the number of permutations we want to count the number of subsets of $k$ elements taken from $A$. We use the symbol $\binom{n}{k}$ to denote the total number of these subsets. Consider a subset of $k$ elements from $A$. By the permutation rule it generates $P^k_k = k(k-1)\cdots 1$ permutations. Furthermore, all these permutations are distinct from permutations generated by other subsets of $k$ elements from $A$. Finally, each permutation of $k$ distinct elements drawn from $A$ must be generated by one of these subsets. Hence, we have just shown that $P^n_k = \binom{n}{k} k!$; that is,

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}. \tag{3.5}$$

We often use the terminology combinations instead of subsets. So we say that there are $\binom{n}{k}$ combinations of $k$ things taken from a set of $n$ things. Another common symbol for $\binom{n}{k}$ is $C^n_k$.

It is interesting to note that if we expand the binomial, $(a+b)^n = (a+b)(a+b)\cdots(a+b)$, we get

$$(a+b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k} \tag{3.6}$$

because we can select the $k$ factors from which to take $a$ in $\binom{n}{k}$ ways. So $\binom{n}{k}$ is also referred to as a binomial coefficient.
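The counting rules of Remark 3.3 can be checked by brute-force enumeration. The following is a minimal sketch, assuming Python 3.8 or later (for `math.perm` and `math.comb`); the five-element set `A` and the values of `a` and `b` are illustrative choices, not part of the text.

```python
from itertools import combinations, permutations, product
from math import comb, factorial, perm

A = ["a", "b", "c", "d", "e"]  # illustrative set with n = 5 elements
n, k = len(A), 3

# Enumerate the three kinds of objects counted in Remark 3.3.
n_tuples = len(list(product(A, repeat=k)))   # k-tuples, repeats allowed
n_perms = len(list(permutations(A, k)))      # k-tuples with distinct components
n_subsets = len(list(combinations(A, k)))    # subsets of size k

# Closed forms: n^k, formula (3.4), and formula (3.5).
formula_tuples = n ** k
formula_perms = factorial(n) // factorial(n - k)
formula_subsets = factorial(n) // (factorial(k) * factorial(n - k))

# Binomial expansion (3.6), evaluated at one integer point.
a, b = 2, 3
binomial_sum = sum(comb(n, j) * a**j * b ** (n - j) for j in range(n + 1))

print(n_tuples, n_perms, n_subsets)  # 125 60 10
```

The enumerated counts agree with the closed forms, and `binomial_sum` equals `(a + b) ** n`.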

Example 3.4 (Poker Hands). Let a card be drawn at random from an ordinary deck of 52 playing cards which has been well shuffled. The sample space $\mathcal{C}$ is the union of $k = 52$ outcomes, and it is reasonable to assume that each of these outcomes has the same probability $\frac{1}{52}$. Accordingly, if $E_1$ is the set of outcomes that are spades, $P(E_1) = \frac{13}{52} = \frac{1}{4}$ because there are $r_1 = 13$ spades in the deck; that is, $\frac{1}{4}$ is the probability of drawing a card that is a spade. If $E_2$ is the set of outcomes that are kings, $P(E_2) = \frac{4}{52} = \frac{1}{13}$ because there are $r_2 = 4$ kings in the deck; that is, $\frac{1}{13}$ is the probability of drawing a card that is a king. These computations are very easy because there are no difficulties in the determination of the appropriate values of $r$ and $k$.

However, instead of drawing only one card, suppose that five cards are taken, at random and without replacement, from this deck; i.e., a five-card poker hand. In this instance, order is not important. So a hand is a subset of five elements drawn from a set of 52 elements. Hence, by (3.5) there are $\binom{52}{5}$ poker hands. If the deck is well shuffled, each hand should be equilikely; i.e., each hand has probability $1/\binom{52}{5}$. We can now compute the probabilities of some interesting poker hands. Let $E_1$ be the event of a flush, all five cards of the same suit. There are $\binom{4}{1} = 4$ suits to choose for the flush and in each suit there are $\binom{13}{5}$ possible hands; hence, using the multiplication rule, the probability of getting a flush is

$$P(E_1) = \frac{\binom{4}{1}\binom{13}{5}}{\binom{52}{5}} = \frac{4 \cdot 1287}{2598960} = 0.00198.$$

Real poker players note that this includes the probability of obtaining a straight flush.

Next, consider the probability of the event $E_2$ of getting exactly three of a kind (the other two are distinct and are of different kinds). Choose the kind for the three, in $\binom{13}{1}$ ways; choose the three, in $\binom{4}{3}$ ways; choose the other two kinds, in $\binom{12}{2}$ ways; and choose one card from each of these last two kinds, in $\binom{4}{1}\binom{4}{1}$ ways. Hence the probability of exactly three of a kind is

$$P(E_2) = \frac{\binom{13}{1}\binom{4}{3}\binom{12}{2}\binom{4}{1}^2}{\binom{52}{5}} = 0.0211.$$

Now suppose that $E_3$ is the set of outcomes in which exactly three cards are kings and exactly two cards are queens. Select the kings, in $\binom{4}{3}$ ways, and select the queens, in $\binom{4}{2}$ ways. Hence, the probability of $E_3$ is

$$P(E_3) = \frac{\binom{4}{3}\binom{4}{2}}{\binom{52}{5}} = 0.0000092.$$
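The three poker-hand probabilities above follow directly from the combination formula (3.5). A short sketch, assuming the equilikely model of the example and using Python's `math.comb`; the variable names are illustrative.

```python
from math import comb

total_hands = comb(52, 5)  # number of five-card hands

# Flush: choose the suit, then five cards from that suit.
flush_hands = comb(4, 1) * comb(13, 5)

# Exactly three of a kind: kind of the triple, the triple itself,
# the two other kinds, and one card from each of those kinds.
three_kind_hands = comb(13, 1) * comb(4, 3) * comb(12, 2) * comb(4, 1) ** 2

# Three kings and two queens.
kings_queens_hands = comb(4, 3) * comb(4, 2)

p_flush = flush_hands / total_hands
p_three_kind = three_kind_hands / total_hands
p_kings_queens = kings_queens_hands / total_hands

print(total_hands, flush_hands, three_kind_hands, kings_queens_hands)
```

Dividing each count by `total_hands` gives the probabilities quoted in the example, up to rounding.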

The event $E_3$ is an example of a full house: three of one kind and two of another kind. Exercise 3.18 asks for the determination of the probability of a full house.

Example 3.4 and the previous discussion allow us to see one way in which we can define a probability set function, that is, a set function that satisfies the requirements of Definition 3.1. Suppose that our space $\mathcal{C}$ consists of $k$ distinct points, which, for this discussion, we take to be in a one-dimensional space. If the random experiment that ends in one of those $k$ points is such that it is reasonable to assume that these points are equally likely, we could assign $1/k$ to each point and let, for $C \subset \mathcal{C}$,

$$P(C) = \frac{\text{number of points in } C}{k} = \sum_{x \in C} f(x), \quad \text{where } f(x) = \frac{1}{k}, \ x \in \mathcal{C}.$$

For illustration, in the cast of a die, we could take $\mathcal{C} = \{1, 2, 3, 4, 5, 6\}$ and $f(x) = \frac{1}{6}$, $x \in \mathcal{C}$, if we believe the die to be unbiased. Clearly, such a set function satisfies Definition 3.1.

The word unbiased in this illustration suggests the possibility that all six points might not, in all such cases, be equally likely. As a matter of fact, loaded dice do exist. In the case of a loaded die, some numbers occur more frequently than others in a sequence of casts of that die. For example, suppose that a die has been loaded so that the relative frequencies of the numbers in $\mathcal{C}$ seem to stabilize proportional to the number of spots that are on the up side. Thus we might assign $f(x) = x/21$, $x \in \mathcal{C}$, and the corresponding

$$P(C) = \sum_{x \in C} f(x)$$

would satisfy Definition 3.1. For illustration, this means that if $C = \{1, 2, 3\}$, then

$$P(C) = \sum_{x=1}^{3} f(x) = \frac{1}{21} + \frac{2}{21} + \frac{3}{21} = \frac{6}{21} = \frac{2}{7}.$$

Whether this probability set function is realistic can only be checked by performing the random experiment a large number of times.
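The loaded-die assignment $f(x) = x/21$ can be written out with exact rational arithmetic. This is only a sketch; the helper name `prob` is an illustrative choice.

```python
from fractions import Fraction

sample_space = range(1, 7)

# Loaded die: probability proportional to the number of spots.
f = {x: Fraction(x, 21) for x in sample_space}

def prob(event):
    """P(C) as the sum of f(x) over the points x in the event C."""
    return sum((f[x] for x in event), Fraction(0))

total_mass = prob(sample_space)   # should equal 1
p_low = prob({1, 2, 3})           # the event C = {1, 2, 3}
print(total_mass, p_low)          # 1 2/7
```

The total mass is 1 and each $f(x)$ is nonnegative, which is what Definition 3.1 requires on a finite space once additivity is built in through the sum.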

We end this section with an additional property of probability which proves useful in the sequel. Recall in Exercise 2.8 we said that a sequence of events $\{C_n\}$ is a nondecreasing sequence if $C_n \subset C_{n+1}$, for all $n$, in which case we wrote $\lim_{n\to\infty} C_n = \cup_{n=1}^{\infty} C_n$. Consider $\lim_{n\to\infty} P(C_n)$. The question is: can we interchange the limit and $P$? As the following theorem shows, the answer is yes. The result also holds for a decreasing sequence of events. Because of this interchange, this theorem is sometimes referred to as the continuity theorem of probability.

Theorem 3.6. Let $\{C_n\}$ be a nondecreasing sequence of events. Then

$$\lim_{n\to\infty} P(C_n) = P\left(\lim_{n\to\infty} C_n\right) = P\left(\bigcup_{n=1}^{\infty} C_n\right). \tag{3.7}$$

Let $\{C_n\}$ be a decreasing sequence of events. Then

$$\lim_{n\to\infty} P(C_n) = P\left(\lim_{n\to\infty} C_n\right) = P\left(\bigcap_{n=1}^{\infty} C_n\right). \tag{3.8}$$

Proof. We prove the result (3.7) and leave the second result as Exercise 3.19. Define the sets, called rings, as $R_1 = C_1$ and for $n > 1$, $R_n = C_n \cap C_{n-1}^c$. It follows that $\cup_{n=1}^{\infty} C_n = \cup_{n=1}^{\infty} R_n$ and that $R_m \cap R_n = \phi$, for $m \ne n$. Also, $P(R_n) = P(C_n) - P(C_{n-1})$. Applying the third axiom of probability yields the following string of equalities:

$$
\begin{aligned}
P\left[\lim_{n\to\infty} C_n\right] &= P\left(\bigcup_{n=1}^{\infty} C_n\right) = P\left(\bigcup_{n=1}^{\infty} R_n\right) = \sum_{n=1}^{\infty} P(R_n) = \lim_{n\to\infty} \sum_{j=1}^{n} P(R_j) \\
&= \lim_{n\to\infty} \left\{ P(C_1) + \sum_{j=2}^{n} \left[P(C_j) - P(C_{j-1})\right] \right\} = \lim_{n\to\infty} P(C_n).
\end{aligned} \tag{3.9}
$$

This is the desired result.

Another useful result for arbitrary unions is given by

Theorem 3.7 (Boole's Inequality). Let $\{C_n\}$ be an arbitrary sequence of events. Then

$$P\left(\bigcup_{n=1}^{\infty} C_n\right) \le \sum_{n=1}^{\infty} P(C_n). \tag{3.10}$$

Proof: Let $D_n = \cup_{i=1}^{n} C_i$. Then $\{D_n\}$ is an increasing sequence of events which go up to $\cup_{n=1}^{\infty} C_n$. Also, for all $j$, $D_j = D_{j-1} \cup C_j$. Hence, by Theorem 3.5, $P(D_j) \le P(D_{j-1}) + P(C_j)$, that is,

$$P(D_j) - P(D_{j-1}) \le P(C_j).$$

In this case, the $C_i$s are replaced by the $D_i$s in expression (3.9). Hence, using the above inequality in this expression and the fact that $P(C_1) = P(D_1)$, we have

$$
\begin{aligned}
P\left(\bigcup_{n=1}^{\infty} C_n\right) &= P\left(\bigcup_{n=1}^{\infty} D_n\right) = \lim_{n\to\infty} \left\{ P(D_1) + \sum_{j=2}^{n} \left[P(D_j) - P(D_{j-1})\right] \right\} \\
&\le \lim_{n\to\infty} \sum_{j=1}^{n} P(C_j) = \sum_{n=1}^{\infty} P(C_n).
\end{aligned}
$$
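Boole's inequality and the partial unions $D_n$ used in its proof can be illustrated on a small finite example. The sketch below assumes a fair die and three overlapping events chosen only for illustration; it is a numerical check, not a proof.

```python
from fractions import Fraction

def prob(event, k=6):
    """Equilikely probability on a fair die: (points in event) / k."""
    return Fraction(len(event), k)

# Three illustrative, overlapping events on the sample space {1, ..., 6}.
events = [{1, 2}, {2, 3, 4}, {4, 5}]

# Partial unions D_n = C_1 ∪ ... ∪ C_n form a nondecreasing sequence.
partial_unions = []
running = set()
for c in events:
    running = running | c
    partial_unions.append(set(running))

p_partial = [prob(d) for d in partial_unions]   # 1/3, 2/3, 5/6
p_union = prob(set().union(*events))            # left side of (3.10)
sum_of_probs = sum(prob(c) for c in events)     # right side of (3.10)

print(p_partial, p_union, sum_of_probs)
```

Here the union has probability 5/6 while the sum of the individual probabilities is 7/6, so the bound holds with room to spare because the events overlap.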

EXERCISES

3.1. A positive integer from one to six is to be chosen by casting a die. Thus the elements $c$ of the sample space $\mathcal{C}$ are 1, 2, 3, 4, 5, 6. Suppose $C_1 = \{1, 2, 3, 4\}$ and $C_2 = \{3, 4, 5, 6\}$. If the probability set function $P$ assigns a probability of $\frac{1}{6}$ to each of the elements of $\mathcal{C}$, compute $P(C_1)$, $P(C_2)$, $P(C_1 \cap C_2)$, and $P(C_1 \cup C_2)$.

3.2. A random experiment consists of drawing a card from an ordinary deck of 52 playing cards. Let the probability set function $P$ assign a probability of $\frac{1}{52}$ to each of the 52 possible outcomes. Let $C_1$ denote the collection of the 13 hearts and let $C_2$ denote the collection of the 4 kings. Compute $P(C_1)$, $P(C_2)$, $P(C_1 \cap C_2)$, and $P(C_1 \cup C_2)$.

3.3. A coin is to be tossed as many times as necessary to turn up one head. Thus the elements $c$ of the sample space $\mathcal{C}$ are $H$, $TH$, $TTH$, $TTTH$, and so forth. Let the probability set function $P$ assign to these elements the respective probabilities $\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}$, and so forth. Show that $P(\mathcal{C}) = 1$. Let $C_1 = \{c : c \text{ is } H, TH, TTH, TTTH, \text{ or } TTTTH\}$. Compute $P(C_1)$. Next, suppose that $C_2 = \{c : c \text{ is } TTTTH \text{ or } TTTTTH\}$. Compute $P(C_2)$, $P(C_1 \cap C_2)$, and $P(C_1 \cup C_2)$.

3.4. If the sample space is $\mathcal{C} = C_1 \cup C_2$ and if $P(C_1) = 0.8$ and $P(C_2) = 0.5$, find $P(C_1 \cap C_2)$.

3.5. Let the sample space be $\mathcal{C} = \{c : 0 < c < \infty\}$. Let $C \subset \mathcal{C}$ be defined by $C = \{c : 4 < c < \infty\}$ and take $P(C) = \int_C e^{-x}\,dx$. Show that $P(\mathcal{C}) = 1$. Evaluate $P(C)$, $P(C^c)$, and $P(C \cup C^c)$.

3.6. If the sample space is $\mathcal{C} = \{c : -\infty < c < \infty\}$ and if $C \subset \mathcal{C}$ is a set for which the integral $\int_C e^{-|x|}\,dx$ exists, show that this set function is not a probability set function. What constant do we multiply the integrand by to make it a probability set function?

3.7. If $C_1$ and $C_2$ are subsets of the sample space $\mathcal{C}$, show that

$$P(C_1 \cap C_2) \le P(C_1) \le P(C_1 \cup C_2) \le P(C_1) + P(C_2).$$

3.8. Let $C_1$, $C_2$, and $C_3$ be three mutually disjoint subsets of the sample space $\mathcal{C}$. Find $P[(C_1 \cup C_2) \cap C_3]$ and $P(C_1^c \cup C_2^c)$.

3.9. Consider Remark 3.2.

(a) If $C_1$, $C_2$, and $C_3$ are subsets of $\mathcal{C}$, show that

$$
\begin{aligned}
P(C_1 \cup C_2 \cup C_3) = {} & P(C_1) + P(C_2) + P(C_3) - P(C_1 \cap C_2) \\
& - P(C_1 \cap C_3) - P(C_2 \cap C_3) + P(C_1 \cap C_2 \cap C_3).
\end{aligned}
$$

(b) Now prove the general inclusion exclusion formula given by the expression (3.2).

Remark 3.4. In order to solve Exercises (3.10)-(3.18), certain reasonable assumptions must be made.

3.10. A bowl contains 16 chips, of which 6 are red, 7 are white, and 3 are blue. If four chips are taken at random and without replacement, find the probability that: (a) each of the four chips is red; (b) none of the four chips is red; (c) there is at least one chip of each color.

3.11. A person has purchased 10 of 1000 tickets sold in a certain raffle. To determine the five prize winners, five tickets are to be drawn at random and without replacement. Compute the probability that this person wins at least one prize. Hint: First compute the probability that the person does not win a prize.

3.12. Compute the probability of being dealt at random and without replacement a 13-card bridge hand consisting of: (a) 6 spades, 4 hearts, 2 diamonds, and 1 club; (b) 13 cards of the same suit.

3.13. Three distinct integers are chosen at random from the first 20 positive integers. Compute the probability that: (a) their sum is even; (b) their product is even.

3.14. There are five red chips and three blue chips in a bowl. The red chips are numbered 1, 2, 3, 4, 5, respectively, and the blue chips are numbered 1, 2, 3, respectively. If two chips are to be drawn at random and without replacement, find the probability that these chips have either the same number or the same color.

3.15. In a lot of 50 light bulbs, there are 2 bad bulbs. An inspector examines five bulbs, which are selected at random and without replacement. (a) Find the probability of at least one defective bulb among the five. (b) How many bulbs should be examined so that the probability of finding at least one bad bulb exceeds $\frac{1}{2}$?

3.16. If $C_1, \ldots, C_k$ are $k$ events in the sample space $\mathcal{C}$, show that the probability that at least one of the events occurs is one minus the probability that none of them occur; i.e.,

$$P(C_1 \cup \cdots \cup C_k) = 1 - P(C_1^c \cap \cdots \cap C_k^c). \tag{3.11}$$

3.17. A secretary types three letters and the three corresponding envelopes. In a hurry, he places at random one letter in each envelope. What is the probability that at least one letter is in the correct envelope? Hint: Let $C_i$ be the event that the $i$th letter is in the correct envelope. Expand $P(C_1 \cup C_2 \cup C_3)$ to determine the probability.

3.18. Consider poker hands drawn from a well-shuffled deck as described in Example 3.4. Determine the probability of a full house, i.e., three of one kind and two of another.

3.19. Prove expression (3.8).

3.20. Suppose the experiment is to choose a real number at random in the interval $(0, 1)$. For any subinterval $(a, b) \subset (0, 1)$, it seems reasonable to assign the probability $P[(a, b)] = b - a$; i.e., the probability of selecting the point from a subinterval is directly proportional to the length of the subinterval. If this is the case, choose an appropriate sequence of subintervals and use expression (3.8) to show that $P[\{a\}] = 0$, for all $a \in (0, 1)$.

3.21. Consider the events $C_1, C_2, C_3$. (a) Suppose $C_1, C_2, C_3$ are mutually exclusive events. If $P(C_i) = p_i$, $i = 1, 2, 3$, what is the restriction on the sum $p_1 + p_2 + p_3$? (b) In the notation of part (a), if $p_1 = 4/10$, $p_2 = 3/10$, and $p_3 = 5/10$, are $C_1, C_2, C_3$ mutually exclusive?

For the last two exercises it is assumed that the reader is familiar with σ-fields.

3.22. Suppose $\mathcal{D}$ is a nonempty collection of subsets of $\mathcal{C}$. Consider the collection of events

$$\mathcal{B} = \cap\{\mathcal{E} : \mathcal{D} \subset \mathcal{E} \text{ and } \mathcal{E} \text{ is a σ-field}\}.$$

Note that $\phi \in \mathcal{B}$ because it is in each σ-field, and, hence, in particular, it is in each σ-field $\mathcal{E} \supset \mathcal{D}$. Continue in this way to show that $\mathcal{B}$ is a σ-field.

3.23. Let $\mathcal{C} = R$, where $R$ is the set of all real numbers. Let $\mathcal{I}$ be the set of all open intervals in $R$. The Borel σ-field on the real line is given by

$$\mathcal{B}_0 = \cap\{\mathcal{E} : \mathcal{I} \subset \mathcal{E} \text{ and } \mathcal{E} \text{ is a σ-field}\}.$$

By definition, $\mathcal{B}_0$ contains the open intervals. Because $[a, \infty) = (-\infty, a)^c$ and $\mathcal{B}_0$ is closed under complements, it contains all intervals of the form $[a, \infty)$, for $a \in R$. Continue in this way and show that $\mathcal{B}_0$ contains all the closed and half-open intervals of real numbers.

4 Conditional Probability and Independence

In some random experiments, we are interested only in those outcomes that are elements of a subset $C_1$ of the sample space $\mathcal{C}$. This means, for our purposes, that the sample space is effectively the subset $C_1$. We are now confronted with the problem of defining a probability set function with $C_1$ as the "new" sample space.

Let the probability set function $P(C)$ be defined on the sample space $\mathcal{C}$ and let $C_1$ be a subset of $\mathcal{C}$ such that $P(C_1) > 0$. We agree to consider only those outcomes of the random experiment that are elements of $C_1$; in essence, then, we take $C_1$ to be a sample space. Let $C_2$ be another subset of $\mathcal{C}$. How, relative to the new sample space $C_1$, do we want to define the probability of the event $C_2$? Once defined, this probability is called the conditional probability of the event $C_2$, relative to the hypothesis of the event $C_1$, or, more briefly, the conditional probability of $C_2$, given $C_1$. Such a conditional probability is denoted by the symbol $P(C_2|C_1)$. We now return to the question that was raised about the definition of this symbol. Since $C_1$ is now the sample space, the only elements of $C_2$ that concern us are those, if any, that are also elements of $C_1$, that is, the elements of $C_1 \cap C_2$. It seems desirable, then, to define the symbol $P(C_2|C_1)$ in such a way that

$$P(C_1|C_1) = 1 \quad \text{and} \quad P(C_2|C_1) = P(C_1 \cap C_2|C_1).$$

Moreover, from a relative frequency point of view, it would seem logically inconsistent if we did not require that the ratio of the probabilities of the events $C_1 \cap C_2$ and $C_1$, relative to the space $C_1$, be the same as the ratio of the probabilities of these events relative to the space $\mathcal{C}$; that is, we should have

$$\frac{P(C_1 \cap C_2|C_1)}{P(C_1|C_1)} = \frac{P(C_1 \cap C_2)}{P(C_1)}.$$

These three desirable conditions imply that the relation

$$P(C_2|C_1) = \frac{P(C_1 \cap C_2)}{P(C_1)}$$

is a suitable definition of the conditional probability of the event $C_2$, given the event $C_1$, provided that $P(C_1) > 0$. Moreover, we have

1. $P(C_2|C_1) \ge 0$.
2. $P\left(\cup_{j=2}^{\infty} C_j \,|\, C_1\right) = \sum_{j=2}^{\infty} P(C_j|C_1)$, provided that $C_2, C_3, \ldots$ are mutually exclusive events.
3. $P(C_1|C_1) = 1$.

Properties (1) and (3) are evident and the proof of property (2) is left as Exercise 4.1. But these are precisely the conditions that a probability set function must satisfy. Accordingly, $P(C_2|C_1)$ is a probability set function, defined for subsets of $C_1$. It may be called the conditional probability set function, relative to the hypothesis $C_1$, or the conditional probability set function, given $C_1$. It should be noted that this conditional probability set function, given $C_1$, is defined at this time only when $P(C_1) > 0$.

Example 4.1. A hand of five cards is to be dealt at random without replacement from an ordinary deck of 52 playing cards. The conditional probability of an all-spade hand ($C_2$), relative to the hypothesis that there are at least four spades in the hand ($C_1$), is, since $C_1 \cap C_2 = C_2$,

$$
P(C_2|C_1) = \frac{P(C_2)}{P(C_1)} = \frac{\binom{13}{5} \big/ \binom{52}{5}}{\left[\binom{13}{4}\binom{39}{1} + \binom{13}{5}\right] \big/ \binom{52}{5}} = \frac{\binom{13}{5}}{\binom{13}{4}\binom{39}{1} + \binom{13}{5}} = 0.0441.
$$
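The ratio in Example 4.1 can be evaluated by counting hands. A brief sketch under the example's equilikely assumption; the variable names are illustrative.

```python
from math import comb

total_hands = comb(52, 5)

all_spades = comb(13, 5)                       # hands in C2 (five spades)
exactly_four_spades = comb(13, 4) * comb(39, 1)
at_least_four_spades = exactly_four_spades + all_spades   # hands in C1

p_c1 = at_least_four_spades / total_hands
p_c2 = all_spades / total_hands

# C2 is a subset of C1, so P(C1 ∩ C2) = P(C2).
p_c2_given_c1 = p_c2 / p_c1
print(round(p_c2_given_c1, 4))  # 0.0441
```

Because both probabilities share the denominator `total_hands`, the conditional probability reduces to a ratio of counts, 1287 / 29172.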

Note that this is not the same as drawing for a spade to complete a flush in draw poker; see Exercise 4.3.

From the definition of the conditional probability set function, we observe that

$$P(C_1 \cap C_2) = P(C_1)P(C_2|C_1).$$

This relation is frequently called the multiplication rule for probabilities. Sometimes, after considering the nature of the random experiment, it is possible to make reasonable assumptions so that both $P(C_1)$ and $P(C_2|C_1)$ can be assigned. Then $P(C_1 \cap C_2)$ can be computed under these assumptions. This is illustrated in Examples 4.2 and 4.3.

Example 4.2. A bowl contains eight chips. Three of the chips are red and the remaining five are blue. Two chips are to be drawn successively, at random and without replacement. We want to compute the probability that the first draw results in a red chip ($C_1$) and that the second draw results in a blue chip ($C_2$). It is reasonable to assign the following probabilities:

$$P(C_1) = \tfrac{3}{8} \quad \text{and} \quad P(C_2|C_1) = \tfrac{5}{7}.$$

Thus, under these assignments, we have $P(C_1 \cap C_2) = \left(\tfrac{3}{8}\right)\left(\tfrac{5}{7}\right) = \tfrac{15}{56} = 0.2679$.
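One way to see that the assignments in Example 4.2 are reasonable is to enumerate every ordered pair of distinct chips and treat the pairs as equally likely. That equilikely model is an assumption of this sketch, as are the chip labels.

```python
from fractions import Fraction
from itertools import permutations

# Chips 0-2 are red, chips 3-7 are blue (labels are illustrative).
colors = ["R"] * 3 + ["B"] * 5

# All ordered draws of two distinct chips: 8 * 7 = 56 outcomes.
draws = list(permutations(range(8), 2))

red_then_blue = [d for d in draws
                 if colors[d[0]] == "R" and colors[d[1]] == "B"]

p_enumerated = Fraction(len(red_then_blue), len(draws))
p_multiplication_rule = Fraction(3, 8) * Fraction(5, 7)

print(p_enumerated, p_multiplication_rule)  # 15/56 15/56
```

The enumeration and the multiplication rule give the same value, 15/56.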

Example 4.3. From an ordinary deck of playing cards, cards are to be drawn successively, at random and without replacement. The probability that the third spade appears on the sixth draw is computed as follows. Let $C_1$ be the event of two spades in the first five draws and let $C_2$ be the event of a spade on the sixth draw. Thus the probability that we wish to compute is $P(C_1 \cap C_2)$. It is reasonable to take

$$P(C_1) = \frac{\binom{13}{2}\binom{39}{3}}{\binom{52}{5}} = 0.2743 \quad \text{and} \quad P(C_2|C_1) = \frac{11}{47} = 0.2340.$$

The desired probability $P(C_1 \cap C_2)$ is then the product of these two numbers, which to four places is 0.0642.

The multiplication rule can be extended to three or more events. In the case of three events, we have, by using the multiplication rule for two events,

$$P(C_1 \cap C_2 \cap C_3) = P[(C_1 \cap C_2) \cap C_3] = P(C_1 \cap C_2)P(C_3|C_1 \cap C_2).$$

But $P(C_1 \cap C_2) = P(C_1)P(C_2|C_1)$. Hence, provided $P(C_1 \cap C_2) > 0$,

$$P(C_1 \cap C_2 \cap C_3) = P(C_1)P(C_2|C_1)P(C_3|C_1 \cap C_2).$$

This procedure can be used to extend the multiplication rule to four or more events. The general formula for $k$ events can be proved by mathematical induction.

Example 4.4. Four cards are to be dealt successively, at random and without replacement, from an ordinary deck of playing cards. The probability of receiving a spade, a heart, a diamond, and a club, in that order, is $\left(\tfrac{13}{52}\right)\left(\tfrac{13}{51}\right)\left(\tfrac{13}{50}\right)\left(\tfrac{13}{49}\right) = 0.0044$. This follows from the extension of the multiplication rule.

Consider $k$ mutually exclusive and exhaustive events $C_1, C_2, \ldots, C_k$ such that $P(C_i) > 0$, $i = 1, 2, \ldots, k$; i.e., $C_1, C_2, \ldots, C_k$ form a partition of $\mathcal{C}$. Here the events $C_1, C_2, \ldots, C_k$ do not need to be equally likely. Let $C$ be another event such that $P(C) > 0$. Thus $C$ occurs with one and only one of the events $C_1, C_2, \ldots, C_k$; that is,

$$C = C \cap (C_1 \cup C_2 \cup \cdots \cup C_k) = (C \cap C_1) \cup (C \cap C_2) \cup \cdots \cup (C \cap C_k).$$

Since $C \cap C_i$, $i = 1, 2, \ldots, k$, are mutually exclusive, we have

$$P(C) = P(C \cap C_1) + P(C \cap C_2) + \cdots + P(C \cap C_k).$$

However, $P(C \cap C_i) = P(C_i)P(C|C_i)$, $i = 1, 2, \ldots, k$; so

$$P(C) = P(C_1)P(C|C_1) + P(C_2)P(C|C_2) + \cdots + P(C_k)P(C|C_k) = \sum_{i=1}^{k} P(C_i)P(C|C_i).$$

This result is sometimes called the law of total probability. From the definition of conditional probability, we have, using the law of total probability, that

$$P(C_j|C) = \frac{P(C \cap C_j)}{P(C)} = \frac{P(C_j)P(C|C_j)}{\sum_{i=1}^{k} P(C_i)P(C|C_i)}, \tag{4.1}$$

which is the well-known Bayes' Theorem. This permits us to calculate the conditional probability of $C_j$, given $C$, from the probabilities of $C_1, C_2, \ldots, C_k$ and the conditional probabilities of $C$, given $C_i$, $i = 1, 2, \ldots, k$.

Example 4.5. Say it is known that bowl $C_1$ contains three red and seven blue chips and bowl $C_2$ contains eight red and two blue chips. All chips are identical in size and shape. A die is cast and bowl $C_1$ is selected if five or six spots show on the side that is up; otherwise, bowl $C_2$ is selected. In a notation that is fairly obvious, it seems reasonable to assign $P(C_1) = \tfrac{2}{6}$ and $P(C_2) = \tfrac{4}{6}$. The selected bowl is handed to another person and one chip is taken at random. Say that this chip is red, an event which we denote by $C$. By considering the contents of the bowls, it is reasonable to assign the conditional probabilities $P(C|C_1) = \tfrac{3}{10}$ and $P(C|C_2) = \tfrac{8}{10}$. Thus the conditional probability of bowl $C_1$, given that a red chip is drawn, is

$$
P(C_1|C) = \frac{P(C_1)P(C|C_1)}{P(C_1)P(C|C_1) + P(C_2)P(C|C_2)} = \frac{\left(\tfrac{2}{6}\right)\left(\tfrac{3}{10}\right)}{\left(\tfrac{2}{6}\right)\left(\tfrac{3}{10}\right) + \left(\tfrac{4}{6}\right)\left(\tfrac{8}{10}\right)} = \frac{3}{19}.
$$

In a similar manner, we have $P(C_2|C) = \tfrac{16}{19}$.
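Formula (4.1) translates into a few lines of code: multiply each prior by its conditional probability, sum the products to get $P(C)$ by the law of total probability, and normalize. The function name `posterior` and its list-based interface are illustrative assumptions; the inputs are those of Example 4.5.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(C_j | C) for each j, given P(C_j) and P(C | C_j).

    Assumes the C_j form a partition and that P(C) > 0.
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(C_j)P(C|C_j)
    p_c = sum(joint)                                      # law of total probability
    return [j / p_c for j in joint]

# Example 4.5: bowl C1 chosen with probability 2/6, bowl C2 with 4/6;
# chance of a red chip is 3/10 from C1 and 8/10 from C2.
priors = [Fraction(2, 6), Fraction(4, 6)]
likelihoods = [Fraction(3, 10), Fraction(8, 10)]

post = posterior(priors, likelihoods)
print(post[0], post[1])  # 3/19 16/19
```

Using `Fraction` keeps the arithmetic exact, so the output matches the hand computation without rounding; the same function accepts floats when exactness is not needed.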

In Example 4.5, the probabilities $P(C_1) = \tfrac{2}{6}$ and $P(C_2) = \tfrac{4}{6}$ are called prior probabilities of $C_1$ and $C_2$, respectively, because they are known to be due to the random mechanism used to select the bowls. After the chip is taken and observed to be red, the conditional probabilities $P(C_1|C) = \tfrac{3}{19}$ and $P(C_2|C) = \tfrac{16}{19}$ are called posterior probabilities. Since $C_2$ has a larger proportion of red chips than does $C_1$, it appeals to one's intuition that $P(C_2|C)$ should be larger than $P(C_2)$ and, of course, $P(C_1|C)$ should be smaller than $P(C_1)$. That is, intuitively the chances of having bowl $C_2$ are better once a red chip is observed than before a chip is taken. Bayes' theorem provides a method of determining exactly what those probabilities are.

Example 4.6. Three plants, $C_1$, $C_2$, and $C_3$, produce, respectively, 10%, 50%, and 40% of a company's output. Although plant $C_1$ is a small plant, its manager believes in high quality and only 1% of its products are defective. The other two, $C_2$ and $C_3$, are worse and produce items that are 3% and 4% defective, respectively. All products are sent to a central warehouse. One item is selected at random and observed to be defective, say event $C$. The conditional probability that it comes from plant $C_1$ is found as follows. It is natural to assign the respective prior probabilities of getting an item from the plants as $P(C_1) = 0.1$, $P(C_2) = 0.5$, and $P(C_3) = 0.4$, while the conditional probabilities of defective items are $P(C|C_1) = 0.01$, $P(C|C_2) = 0.03$, and $P(C|C_3) = 0.04$. Thus the posterior probability of $C_1$, given a defective, is

$$P(C_1|C) = \frac{P(C_1 \cap C)}{P(C)} = \frac{(0.1)(0.01)}{(0.1)(0.01) + (0.5)(0.03) + (0.4)(0.04)},$$

which equals $\tfrac{1}{32}$; this is much smaller than the prior probability $P(C_1) = \tfrac{1}{10}$. This is as it should be because the fact that the item is defective decreases the chances that it comes from the high-quality plant $C_1$.

Example 4.7. Suppose we want to investigate the percentage of abused children in a certain population. The events of interest are: a child is abused ($A$) and its complement a child is not abused ($N = A^c$). For the purposes of this example, we assume that $P(A) = 0.01$ and, hence, $P(N) = 0.99$. The classification as to whether a child is abused or not is based upon a doctor's examination. Because doctors are not perfect, they sometimes classify an abused child ($A$) as one that is not abused ($N_D$, where $N_D$ means classified as not abused by a doctor). On the other hand, doctors sometimes classify a nonabused child ($N$) as abused ($A_D$). Suppose these error rates of misclassification are $P(N_D|A) = 0.04$ and $P(A_D|N) = 0.05$; thus the probabilities of correct decisions are $P(A_D|A) = 0.96$ and $P(N_D|N) = 0.95$. Let us compute the probability that a child taken at random is classified as abused by a doctor. Because this can happen in two ways, $A \cap A_D$ or $N \cap A_D$, we have

$$P(A_D) = P(A_D|A)P(A) + P(A_D|N)P(N) = (0.96)(0.01) + (0.05)(0.99) = 0.0591,$$

which is quite high relative to the probability of an abused child, 0.01. Further, the probability that a child is abused when the doctor classified the child as abused is

$$P(A|A_D) = \frac{P(A \cap A_D)}{P(A_D)} = \frac{(0.96)(0.01)}{0.0591} = 0.1624,$$

which is quite low. In the same way, the probability that a child is not abused when the doctor classified the child as abused is 0.8376, which is quite high. The reason that these probabilities are so poor at recording the true situation is that the doctors' error rates are so high relative to the fraction 0.01 of the population that is abused. An investigation such as this would, hopefully, lead to better training of doctors for classifying abused children. See also Exercise 4.17.

Sometimes it happens that the occurrence of event $C_1$ does not change the probability of event $C_2$; that is, when $P(C_1) > 0$,

$$P(C_2|C_1) = P(C_2).$$

In this case, we say that the events $C_1$ and $C_2$ are independent. Moreover, the multiplication rule becomes

$$P(C_1 \cap C_2) = P(C_1)P(C_2|C_1) = P(C_1)P(C_2). \tag{4.2}$$

This, in turn, implies, when $P(C_2) > 0$, that

$$P(C_1|C_2) = \frac{P(C_1 \cap C_2)}{P(C_2)} = \frac{P(C_1)P(C_2)}{P(C_2)} = P(C_1).$$

Note that if $P(C_1) > 0$ and $P(C_2) > 0$, then by the above discussion, independence is equivalent to

$$P(C_1 \cap C_2) = P(C_1)P(C_2). \tag{4.3}$$

What if either $P(C_1) = 0$ or $P(C_2) = 0$? In either case, the right side of (4.3) is 0. However, the left side is 0 also because $C_1 \cap C_2 \subset C_1$ and $C_1 \cap C_2 \subset C_2$. Hence, we take Equation (4.3) as our formal definition of independence; that is,

Definition 4.2. Let $C_1$ and $C_2$ be two events. We say that $C_1$ and $C_2$ are independent if Equation (4.3) holds.

Suppose $C_1$ and $C_2$ are independent events. Then the following three pairs of events are independent: $C_1$ and $C_2^c$, $C_1^c$ and $C_2$, and $C_1^c$ and $C_2^c$ (see Exercise 4.11).

Remark 4.1. Events that are independent are sometimes called statistically independent, stochastically independent, or independent in a probability sense. In most instances, we use independent without a modifier if there is no possibility of misunderstanding.

Example 4.8. A red die and a white die are cast in such a way that the numbers of spots on the two sides that are up are independent events. If $C_1$ represents a four on the red die and $C_2$ represents a three on the white die, with an equally likely assumption for each side, we assign $P(C_1) = \tfrac{1}{6}$ and $P(C_2) = \tfrac{1}{6}$. Thus, from independence, the probability of the ordered pair (red = 4, white = 3) is

$$P[(4, 3)] = \left(\tfrac{1}{6}\right)\left(\tfrac{1}{6}\right) = \tfrac{1}{36}.$$

The probability that the sum of the up spots of the two dice equals seven is

$$P[(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)] = \tfrac{1}{6}\cdot\tfrac{1}{6} + \tfrac{1}{6}\cdot\tfrac{1}{6} + \tfrac{1}{6}\cdot\tfrac{1}{6} + \tfrac{1}{6}\cdot\tfrac{1}{6} + \tfrac{1}{6}\cdot\tfrac{1}{6} + \tfrac{1}{6}\cdot\tfrac{1}{6} = \tfrac{6}{36}.$$

In a similar manner, it is easy to show that the probabilities of the sums of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 are, respectively,

$$\tfrac{1}{36}, \tfrac{2}{36}, \tfrac{3}{36}, \tfrac{4}{36}, \tfrac{5}{36}, \tfrac{6}{36}, \tfrac{5}{36}, \tfrac{4}{36}, \tfrac{3}{36}, \tfrac{2}{36}, \tfrac{1}{36}.$$
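The distribution of the sum in Example 4.8 can be tabulated by listing all 36 ordered pairs, each of which has probability 1/36 under the independence and equally likely assumptions of the example. A minimal sketch with illustrative variable names:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All ordered pairs (red, white); each has probability (1/6)(1/6) = 1/36.
pairs = list(product(range(1, 7), repeat=2))

# Number of pairs producing each sum 2, 3, ..., 12.
counts = Counter(red + white for red, white in pairs)

# Probability of each sum as an exact fraction.
pmf = {s: Fraction(counts[s], len(pairs)) for s in range(2, 13)}

print([counts[s] for s in range(2, 13)])
# [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
```

The counts are the numerators over 36 listed above; `Fraction` reduces them, so for instance `pmf[7]` prints as 1/6.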

Suppose now that we have three events, $C_1$, $C_2$, and $C_3$. We say that they are mutually independent if and only if they are pairwise independent:

$$P(C_1 \cap C_2) = P(C_1)P(C_2), \quad P(C_1 \cap C_3) = P(C_1)P(C_3), \quad P(C_2 \cap C_3) = P(C_2)P(C_3),$$

and

$$P(C_1 \cap C_2 \cap C_3) = P(C_1)P(C_2)P(C_3).$$

More generally, the $n$ events $C_1, C_2, \ldots, C_n$ are mutually independent if and only if for every collection of $k$ of these events, $2 \le k \le n$, the following is true: Say that $d_1, d_2, \ldots, d_k$ are $k$ distinct integers from $1, 2, \ldots, n$; then

$$P(C_{d_1} \cap C_{d_2} \cap \cdots \cap C_{d_k}) = P(C_{d_1})P(C_{d_2}) \cdots P(C_{d_k}).$$

In particular, if $C_1, C_2, \ldots, C_n$ are mutually independent, then

$$P(C_1 \cap C_2 \cap \cdots \cap C_n) = P(C_1)P(C_2) \cdots P(C_n).$$

Also, as with two sets, many combinations of these events and their complements are independent, such as

1. The events $C_1^c$ and $C_2 \cup C_3^c \cup C_4$ are independent,
2. The events $C_1 \cup C_2^c$, $C_3^c$ and $C_4 \cap C_5^c$ are mutually independent.

If there is no possibility of misunderstanding, independent is often used without the modifier mutually when considering more than two events.
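The definition for three events has two parts, the pairwise conditions and the triple-product condition, and they can be checked separately by enumeration. The sketch below uses the two-spin experiment of Example 4.9 (a fair spinner numbered 1 to 4), with all 16 ordered outcomes assumed equally likely.

```python
from fractions import Fraction
from itertools import combinations, product

# 16 equally likely ordered outcomes of two spins.
outcomes = set(product(range(1, 5), repeat=2))

def prob(event):
    return Fraction(len(event), len(outcomes))

c1 = {o for o in outcomes if o[0] + o[1] == 5}  # sum of the spins is 5
c2 = {o for o in outcomes if o[0] == 1}         # first spin is 1
c3 = {o for o in outcomes if o[1] == 4}         # second spin is 4

# Pairwise condition: P(Ci ∩ Cj) = P(Ci)P(Cj) for every pair.
pairwise = all(prob(a & b) == prob(a) * prob(b)
               for a, b in combinations([c1, c2, c3], 2))

# Triple-product condition.
triple = prob(c1 & c2 & c3) == prob(c1) * prob(c2) * prob(c3)

print(pairwise, triple)  # True False
```

The pairwise conditions hold but the triple product fails, since the triple intersection has probability 1/16 rather than 1/64, so these events are not mutually independent.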

Example 4.9. Pairwise independence does not imply mutual independence. As an example, suppose we twice spin a fair spinner with the numbers 1, 2, 3, and 4. Let $C_1$ be the event that the sum of the numbers spun is 5, let $C_2$ be the event that the first number spun is a 1, and let $C_3$ be the event that the second number spun is a 4. Then $P(C_i) = 1/4$, $i = 1, 2, 3$, and for $i \ne j$, $P(C_i \cap C_j) = 1/16$. So the three events are pairwise independent. But $C_1 \cap C_2 \cap C_3$ is the event that $(1, 4)$ is spun, which has probability $1/16 \ne 1/64 = P(C_1)P(C_2)P(C_3)$. Hence the events $C_1$, $C_2$, and $C_3$ are not mutually independent.

We often perform a sequence of random experiments in such a way that the events associated with one of them are independent of the events associated with the others. For convenience, we refer to these events as outcomes of independent experiments, meaning that the respective events are independent. Thus we often refer to independent flips of a coin or independent casts of a die or, more generally, independent trials of some given random experiment.

Example 4.10. A coin is flipped independently several times. Let the event $C_i$ represent a head (H) on the $i$th toss; thus $C_i^c$ represents a tail (T). Assume that $C_i$ and $C_i^c$ are equally likely; that is, $P(C_i) = P(C_i^c) = \tfrac{1}{2}$. Thus the probability of an ordered sequence like HHTH is, from independence,

$$P(C_1 \cap C_2 \cap C_3^c \cap C_4) = P(C_1)P(C_2)P(C_3^c)P(C_4) = \left(\tfrac{1}{2}\right)^4 = \tfrac{1}{16}.$$

Similarly, the probability of observing the first head on the third flip is

$$P(C_1^c \cap C_2^c \cap C_3) = P(C_1^c)P(C_2^c)P(C_3) = \left(\tfrac{1}{2}\right)^3 = \tfrac{1}{8}.$$

Also, the probability of getting at least one head on four flips is

$$
\begin{aligned}
P(C_1 \cup C_2 \cup C_3 \cup C_4) &= 1 - P[(C_1 \cup C_2 \cup C_3 \cup C_4)^c] \\
&= 1 - P(C_1^c \cap C_2^c \cap C_3^c \cap C_4^c) \\
&= 1 - \left(\tfrac{1}{2}\right)^4 = \tfrac{15}{16}.
\end{aligned}
$$
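With a fair coin and independent flips, all $2^4 = 16$ sequences of four flips are equally likely, so the three probabilities of Example 4.10 can be recovered by counting sequences. A small sketch under that assumption; the helper `prob` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All 16 sequences of four flips, e.g. ('H', 'H', 'T', 'H').
flips = set(product("HT", repeat=4))

def prob(event):
    return Fraction(len(event), len(flips))

hhth = {s for s in flips if s == ("H", "H", "T", "H")}
first_head_on_third = {s for s in flips if s[:3] == ("T", "T", "H")}
at_least_one_head = {s for s in flips if "H" in s}

print(prob(hhth), prob(first_head_on_third), prob(at_least_one_head))
# 1/16 1/8 15/16
```

The event "first head on the third flip" places no restriction on the fourth flip, which is why it contains two of the sixteen sequences.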

See Exercise 4.13 to justify this last probability.

Example 4.11. A computer system is built so that if component $K_1$ fails, it is bypassed and $K_2$ is used. If $K_2$ fails, then $K_3$ is used. Suppose that the probability that $K_1$ fails is 0.01, that $K_2$ fails is 0.03, and that $K_3$ fails is 0.08. Moreover, we can assume that the failures are mutually independent events. Then the probability of failure of the system is

$$(0.01)(0.03)(0.08) = 0.000024,$$

as all three components would have to fail. Hence, the probability that the system does not fail is $1 - 0.000024 = 0.999976$.

EXERCISES

4.1. If $P(C_1) > 0$ and if $C_2, C_3, C_4, \ldots$ are mutually disjoint sets, show that

$$P(C_2 \cup C_3 \cup \cdots |C_1) = P(C_2|C_1) + P(C_3|C_1) + \cdots.$$

4.2. Assume that $P(C_1 \cap C_2 \cap C_3) > 0$. Prove that

$$P(C_1 \cap C_2 \cap C_3 \cap C_4) = P(C_1)P(C_2|C_1)P(C_3|C_1 \cap C_2)P(C_4|C_1 \cap C_2 \cap C_3).$$

4.3. Suppose we are playing draw poker. We are dealt (from a well-shuffled deck) five cards, which contain four spades and another card of a different suit. We decide to discard the card of a different suit and draw one card from the remaining cards to complete a flush in spades (all five cards spades). Determine the probability of completing the flush.

4.4. From a well-shuffled deck of ordinary playing cards, four cards are turned over one at a time without replacement. What is the probability that the spades and red cards alternate?

4.5. A hand of 13 cards is to be dealt at random and without replacement from an ordinary deck of playing cards. Find the conditional probability that there are at least three kings in the hand given that the hand contains at least two kings.

4.6. A drawer contains eight different pairs of socks. If six socks are taken at random and without replacement, compute the probability that there is at least one matching pair among these six socks. Hint: Compute the probability that there is not a matching pair.

4.7. A pair of dice is cast until either the sum of seven or eight appears. (a) Show that the probability of a seven before an eight is 6/11. (b) Next, this pair of dice is cast until a seven appears twice or until each of a six and eight has appeared at least once. Show that the probability of the six and eight occurring before two sevens is 0.546.

4.8. In a certain factory, machines I, II, and III are all producing springs of the same length. Machines I, II, and III produce 1%, 4%, and 2% defective springs, respectively. Of the total production of springs in the factory, Machine I produces 30%, Machine II produces 25%, and Machine III produces 45%. (a) If one spring is selected at random from the total springs produced in a given day, determine the probability that it is defective. (b) Given that the selected spring is defective, find the conditional probability that it was produced by Machine II.

4.9. Bowl I contains six red chips and four blue chips. Five of these 10 chips are selected at random and without replacement and put in bowl II, which was originally empty. One chip is then drawn at random from bowl II. Given that this chip is blue, find the conditional probability that two red chips and three blue chips are transferred from bowl I to bowl II.

4.10. In an office there are two boxes of computer disks: Box $C_1$ contains seven Verbatim disks and three Control Data disks, and box $C_2$ contains two Verbatim disks and eight Control Data disks. A person is handed a box at random with prior probabilities $P(C_1) = \tfrac{2}{3}$ and $P(C_2) = \tfrac{1}{3}$, possibly due to the boxes' respective locations. A disk is then selected at random and the event $C$ occurs if it is from Control Data. Using an equally likely assumption for each disk in the selected box, compute $P(C_1|C)$ and $P(C_2|C)$.

4.11. If $C_1$ and $C_2$ are independent events, show that the following pairs of events are also independent: (a) $C_1$ and $C_2^c$, (b) $C_1^c$ and $C_2$, and (c) $C_1^c$ and $C_2^c$. Hint: In (a), write $P(C_1 \cap C_2^c) = P(C_1)P(C_2^c|C_1) = P(C_1)[1 - P(C_2|C_1)]$. From the independence of $C_1$ and $C_2$, $P(C_2|C_1) = P(C_2)$.

4.12. Let $C_1$ and $C_2$ be independent events with $P(C_1) = 0.6$ and $P(C_2) = 0.3$. Compute (a) $P(C_1 \cap C_2)$, (b) $P(C_1 \cup C_2)$, and (c) $P(C_1 \cup C_2^c)$.

4.13. Generalize Exercise 2.5 to obtain

$$(C_1 \cup C_2 \cup \cdots \cup C_k)^c = C_1^c \cap C_2^c \cap \cdots \cap C_k^c.$$

Say that $C_1, C_2, \ldots, C_k$ are independent events that have respective probabilities $p_1, p_2, \ldots, p_k$. Argue that the probability of at least one of $C_1, C_2, \ldots, C_k$ is equal to

$$1 - (1 - p_1)(1 - p_2) \cdots (1 - p_k).$$

4.14. Each of four persons fires one shot at a target. Let $C_k$ denote the event that the target is hit by person $k$, $k = 1, 2, 3, 4$. If $C_1, C_2, C_3, C_4$ are independent and if $P(C_1) = P(C_2) = 0.7$, $P(C_3) = 0.9$, and $P(C_4) = 0.4$, compute the probability that (a) all of them hit the target; (b) exactly one hits the target; (c) no one hits the target; (d) at least one hits the target.

4.15. A bowl contains three red (R) balls and seven white (W) balls of exactly the same size and shape. Select balls successively at random and with replacement so that the events of white on the first trial, white on the second, and so on, can be assumed to be independent. In four trials, make certain assumptions and compute the probabilities of the following ordered sequences: (a) WWRW; (b) RWWW; (c) WWWR; and (d) WRWW.
Compute the probability of exactly one red ball in the four trials. 4.16. A coin is tossed two independent times, each resulting in a tail (T) or a head (H). The sample space consists of four ordered pairs: TT, TH, HT, HH. Making certain assumptions, compute the probability of each of these ordered pairs. What is the probability of at least one head? 4.17. For Example 4.7, obtain the following probabilities. Explain what they mean in terms of the problem. (a) P (ND ). (b) P (N | AD ). (c) P (A | ND ).

29

Probability and Distributions (d) P (N | ND ). 4.18. A die is cast independently until the ﬁrst 6 appears. If the casting stops on an odd number of times, Bob wins; otherwise, Joe wins. (a) Assuming the die is fair, what is the probability that Bob wins? (b) Let p denote the probability of a 6. Show that the game favors Bob, for all p, 0 < p < 1. 4.19. Cards are drawn at random and with replacement from an ordinary deck of 52 cards until a spade appears. (a) What is the probability that at least four draws are necessary? (b) Same as part (a), except the cards are drawn without replacement. 4.20. A person answers each of two multiple choice questions at random. If there are four possible choices on each question, what is the conditional probability that both answers are correct given that at least one is correct? 4.21. Suppose a fair 6-sided die is rolled six independent times. A match occurs if side i is observed on the ith trial, i = 1, . . . , 6. (a) What is the probability of at least one match on the six rolls? Hint: Let Ci be the event of a match on the ith trial and use Exercise 4.13 to determine the desired probability. (b) Extend part (a) to a fair n-sided die with n independent rolls. Then determine the limit of the probability as n → ∞. 4.22. Players A and B play a sequence of independent games. Player A throws a die ﬁrst and wins on a “six.” If he fails, B throws and wins on a “ﬁve” or “six.” If he fails, A throws and wins on a “four,” “ﬁve,” or “six.” And so on. Find the probability of each player winning the sequence. 4.23. Let C1 , C2 , C3 be independent events with probabilities tively. Compute P (C1 ∪ C2 ∪ C3 ).

1 1 1 2, 3, 4,

respec-

4.24. From a bowl containing five red, three white, and seven blue chips, select four at random and without replacement. Compute the conditional probability of one red, zero white, and three blue chips, given that there are at least three blue chips in this sample of four chips.

4.25. Let the three mutually independent events $C_1$, $C_2$, and $C_3$ be such that $P(C_1) = P(C_2) = P(C_3) = \frac{1}{4}$. Find $P[(C_1^c \cap C_2^c) \cup C_3]$.

4.26. Person A tosses a coin and then person B rolls a die. This is repeated independently until a head or one of the numbers 1, 2, 3, 4 appears, at which time the game is stopped. Person A wins with the head and B wins with one of the numbers 1, 2, 3, 4. Compute the probability that A wins the game.

4.27. Each bag in a large box contains 25 tulip bulbs. It is known that 60% of the bags contain bulbs for 5 red and 20 yellow tulips, while the remaining 40% of the bags contain bulbs for 15 red and 10 yellow tulips. A bag is selected at random and a bulb taken at random from this bag is planted. (a) What is the probability that it will be a yellow tulip? (b) Given that it is yellow, what is the conditional probability it comes from a bag that contained 5 red and 20 yellow bulbs?

4.28. A bowl contains 10 chips numbered 1, 2, . . . , 10, respectively. Five chips are drawn at random, one at a time, and without replacement. What is the probability that two even-numbered chips are drawn and they occur on even-numbered draws?

4.29. A person bets 1 dollar to $b$ dollars that he can draw two cards from an ordinary deck of cards without replacement and that they will be of the same suit. Find $b$ so that the bet is fair.

4.30 (Monte Hall Problem). Suppose there are three curtains. Behind one curtain there is a nice prize, while behind the other two there are worthless prizes. A contestant selects one curtain at random, and then Monte Hall opens one of the other two curtains to reveal a worthless prize. Hall then expresses the willingness to trade the curtain that the contestant has chosen for the other curtain that has not been opened. Should the contestant switch curtains or stick with the one that she has? To answer the question, determine the probability that she wins the prize if she switches.

4.31. A French nobleman, Chevalier de Méré, had asked a famous mathematician, Pascal, to explain why the following two probabilities were different (the difference had been noted from playing the game many times): (1) at least one six in four independent casts of a six-sided die; (2) at least a pair of sixes in 24 independent casts of a pair of dice. From proportions it seemed to de Méré that the probabilities should be the same. Compute the probabilities of (1) and (2).

4.32. Hunters A and B shoot at a target; the probabilities of hitting the target are $p_1$ and $p_2$, respectively. Assuming independence, can $p_1$ and $p_2$ be selected so that P(zero hits) = P(one hit) = P(two hits)?

4.33. At the beginning of a study of individuals, 15% were classified as heavy smokers, 30% were classified as light smokers, and 55% were classified as nonsmokers. In the five-year study, it was determined that the death rates of the heavy and light smokers were five and three times that of the nonsmokers, respectively. A randomly selected participant died over the five-year period: calculate the probability that the participant was a nonsmoker.

4.34. A chemist wishes to detect an impurity in a certain compound that she is making. There is a test that detects an impurity with probability 0.90; however, this test indicates that an impurity is there when it is not about 5% of the time. The chemist produces compounds with the impurity about 20% of the time. A compound is selected at random from the chemist's output. The test indicates that an impurity is present. What is the conditional probability that the compound actually has the impurity?

5 Random Variables

The reader perceives that a sample space $\mathcal{C}$ may be tedious to describe if the elements of $\mathcal{C}$ are not numbers. We now discuss how we may formulate a rule, or a set of rules, by which the elements $c$ of $\mathcal{C}$ may be represented by numbers. We begin the discussion with a very simple example. Let the random experiment be the toss of a coin and let the sample space associated with the experiment be $\mathcal{C} = \{H, T\}$, where $H$ and $T$ represent heads and tails, respectively. Let $X$ be a function such that $X(T) = 0$ and $X(H) = 1$. Thus $X$ is a real-valued function defined on the sample space $\mathcal{C}$ which takes us from the sample space $\mathcal{C}$ to a space of real numbers $\mathcal{D} = \{0, 1\}$. We now formulate the definition of a random variable and its space.

Definition 5.1. Consider a random experiment with a sample space $\mathcal{C}$. A function $X$, which assigns to each element $c \in \mathcal{C}$ one and only one number $X(c) = x$, is called a random variable. The space or range of $X$ is the set of real numbers $\mathcal{D} = \{x : x = X(c),\ c \in \mathcal{C}\}$.

In this text, $\mathcal{D}$ generally is a countable set or an interval of real numbers. We call random variables of the first type discrete random variables, while we call those of the second type continuous random variables. In this section, we present examples of discrete and continuous random variables and then in the next two sections we discuss them separately.

Given a random variable $X$, its range $\mathcal{D}$ becomes the sample space of interest. Besides inducing the sample space $\mathcal{D}$, $X$ also induces a probability which we call the distribution of $X$. Consider first the case where $X$ is a discrete random variable with a finite space $\mathcal{D} = \{d_1, \ldots, d_m\}$. The only events of interest in the new sample space $\mathcal{D}$ are subsets of $\mathcal{D}$. The induced probability distribution of $X$ is also clear. Define the function $p_X(d_i)$ on $\mathcal{D}$ by
\[ p_X(d_i) = P[\{c : X(c) = d_i\}], \quad \text{for } i = 1, \ldots, m. \tag{5.1} \]
In the next section, we formally define $p_X(d_i)$ as the probability mass function (pmf) of $X$. Then the induced probability distribution, $P_X(\cdot)$, of $X$ is
\[ P_X(D) = \sum_{d_i \in D} p_X(d_i), \quad D \subset \mathcal{D}. \]

As Exercise 5.11 shows, $P_X(D)$ is a probability on $\mathcal{D}$. An example is helpful here.

Example 5.1 (First Roll in the Game of Craps). Let $X$ be the sum of the upfaces on a roll of a pair of fair 6-sided dice, each with the numbers 1 through 6 on it. The sample space is $\mathcal{C} = \{(i, j) : 1 \le i, j \le 6\}$. Because the dice are fair, $P[\{(i, j)\}] = 1/36$. The random variable $X$ is $X(i, j) = i + j$. The space of $X$ is $\mathcal{D} = \{2, \ldots, 12\}$. By enumeration, the pmf of $X$ is given by

Range value   x        2     3     4     5     6     7     8     9     10    11    12
Probability   p_X(x)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

To illustrate the computation of probabilities concerning $X$, suppose $B_1 = \{x : x = 7, 11\}$ and $B_2 = \{x : x = 2, 3, 12\}$. Then, using the values of $p_X(x)$ given in the table,
\[ P_X(B_1) = \sum_{x \in B_1} p_X(x) = \frac{6}{36} + \frac{2}{36} = \frac{8}{36} \]
\[ P_X(B_2) = \sum_{x \in B_2} p_X(x) = \frac{1}{36} + \frac{2}{36} + \frac{1}{36} = \frac{4}{36}. \]
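The enumeration in Example 5.1 can be reproduced mechanically. The following is a minimal sketch, not part of the original text; the names `pmf` and `prob` are our own. It builds the induced pmf from the 36 equally likely outcomes and sums it over the events $B_1$ and $B_2$, using exact fractions.

```python
from fractions import Fraction
from itertools import product

# Sample space C = {(i, j) : 1 <= i, j <= 6}; each outcome has probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))
p_outcome = Fraction(1, len(outcomes))

# Induced pmf p_X(x) = P[{c : X(c) = x}] for X(i, j) = i + j.
pmf = {}
for i, j in outcomes:
    pmf[i + j] = pmf.get(i + j, Fraction(0)) + p_outcome


def prob(event):
    """P_X(B): the sum of p_X(x) over the values x in the event B."""
    return sum(pmf.get(x, Fraction(0)) for x in event)


B1 = {7, 11}
B2 = {2, 3, 12}
print(prob(B1), prob(B2))  # fractions are printed in lowest terms
```

Because `Fraction` reduces automatically, the printed values are the reduced forms of 8/36 and 4/36.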

The second case is when $X$ is a continuous random variable. In this case, $\mathcal{D}$ is an interval of real numbers. In practice, continuous random variables are often measurements. For example, the weight of an adult is modeled by a continuous random variable. Here we would not be interested in the probability that a person weighs exactly 200 pounds, but we may be interested in the probability that a person weighs over 200 pounds. Generally, for the continuous random variables, the simple events of interest are intervals. We can usually determine a nonnegative function $f_X(x)$ such that for any interval of real numbers $(a, b) \subset \mathcal{D}$, the induced probability distribution of $X$, $P_X(\cdot)$, is defined as
\[ P_X[(a, b)] = P[\{c \in \mathcal{C} : a < X(c) < b\}] = \int_a^b f_X(x)\,dx; \tag{5.2} \]

that is, the probability that $X$ falls between $a$ and $b$ is the area under the curve $y = f_X(x)$ between $a$ and $b$. Besides $f_X(x) \ge 0$, we also require that $P_X(\mathcal{D}) = \int_{\mathcal{D}} f_X(x)\,dx = 1$ (total area under the curve over the sample space of $X$ is 1). There are some technical issues in defining events in general for the space $\mathcal{D}$; however, it can be shown that $P_X(D)$ is a probability on $\mathcal{D}$; see Exercise 5.11. The function $f_X$ is formally defined as the probability density function (pdf) of $X$ in Section 7. An example is in order.

Example 5.2. For an example of a continuous random variable, consider the following simple experiment: choose a real number at random from the interval (0, 1). Let $X$ be the number chosen. In this case the space of $X$ is $\mathcal{D} = (0, 1)$. It is not obvious as it was in the last example what the induced probability $P_X$ is. But there are some intuitive probabilities. For instance, because the number is chosen at random, it is reasonable to assign
\[ P_X[(a, b)] = b - a, \quad \text{for } 0 < a < b < 1. \tag{5.3} \]


It follows that the pdf of $X$ is
\[ f_X(x) = \begin{cases} 1 & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases} \]

2|X > 1).

4 The Correlation Coefficient

Because the result that we obtain in this section is more familiar in terms of $X$ and $Y$, we use $X$ and $Y$ rather than $X_1$ and $X_2$ as symbols for our two random variables. Rather than discussing these concepts separately for continuous and discrete cases, we use continuous notation in our discussion. But the same properties hold for the discrete case also. Let $X$ and $Y$ have the joint pdf $f(x, y)$. If $u(x, y)$ is a function of $x$ and $y$, then $E[u(X, Y)]$ was defined, subject to its existence, in Section 1. The existence of all mathematical expectations is assumed in this discussion. The means of $X$ and $Y$, say $\mu_1$ and $\mu_2$, are obtained by taking $u(x, y)$ to be $x$ and $y$, respectively; and the variances of $X$ and $Y$, say $\sigma_1^2$ and $\sigma_2^2$, are obtained by setting the function $u(x, y)$ equal to $(x - \mu_1)^2$ and $(y - \mu_2)^2$, respectively. Consider the mathematical expectation
\begin{align*}
E[(X - \mu_1)(Y - \mu_2)] &= E(XY - \mu_2 X - \mu_1 Y + \mu_1\mu_2) \\
&= E(XY) - \mu_2 E(X) - \mu_1 E(Y) + \mu_1\mu_2 \\
&= E(XY) - \mu_1\mu_2.
\end{align*}

This number is called the covariance of $X$ and $Y$ and is often denoted by $\operatorname{cov}(X, Y)$. If each of $\sigma_1$ and $\sigma_2$ is positive, the number
\[ \rho = \frac{E[(X - \mu_1)(Y - \mu_2)]}{\sigma_1\sigma_2} = \frac{\operatorname{cov}(X, Y)}{\sigma_1\sigma_2} \]
is called the correlation coefficient of $X$ and $Y$. It should be noted that the expected value of the product of two random variables is equal to the product of their expectations plus their covariance; that is, $E(XY) = \mu_1\mu_2 + \rho\sigma_1\sigma_2 = \mu_1\mu_2 + \operatorname{cov}(X, Y)$.

Example 4.1. Let the random variables $X$ and $Y$ have the joint pdf
\[ f(x, y) = \begin{cases} x + y & 0 < x < 1,\ 0 < y < 1 \\ 0 & \text{elsewhere.} \end{cases} \]
We next compute the correlation coefficient $\rho$ of $X$ and $Y$. Now
\[ \mu_1 = E(X) = \int_0^1\!\!\int_0^1 x(x + y)\,dx\,dy = \frac{7}{12} \]
and
\[ \sigma_1^2 = E(X^2) - \mu_1^2 = \int_0^1\!\!\int_0^1 x^2(x + y)\,dx\,dy - \left(\frac{7}{12}\right)^2 = \frac{11}{144}. \]
Similarly,
\[ \mu_2 = E(Y) = \frac{7}{12} \quad \text{and} \quad \sigma_2^2 = E(Y^2) - \mu_2^2 = \frac{11}{144}. \]
The covariance of $X$ and $Y$ is
\[ E(XY) - \mu_1\mu_2 = \int_0^1\!\!\int_0^1 xy(x + y)\,dx\,dy - \left(\frac{7}{12}\right)^2 = -\frac{1}{144}. \]
Accordingly, the correlation coefficient of $X$ and $Y$ is
\[ \rho = \frac{-\frac{1}{144}}{\sqrt{\left(\frac{11}{144}\right)\left(\frac{11}{144}\right)}} = -\frac{1}{11}. \]
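The arithmetic of Example 4.1 can be checked exactly. The sketch below is our own illustration, not part of the original text. It relies on the closed form $\int_0^1\!\int_0^1 x^a y^b\,dx\,dy = 1/((a+1)(b+1))$, so that $E(X^iY^j)$ for $f(x, y) = x + y$ is a sum of two such terms; the helper name `moment` is hypothetical.

```python
from fractions import Fraction
from math import sqrt


def moment(i, j):
    """E(X^i Y^j) for the joint pdf f(x, y) = x + y on the unit square.

    The integrand x^i y^j (x + y) splits into x^(i+1) y^j + x^i y^(j+1),
    and each monomial x^a y^b integrates to 1 / ((a + 1)(b + 1)).
    """
    return Fraction(1, (i + 2) * (j + 1)) + Fraction(1, (i + 1) * (j + 2))


mu1, mu2 = moment(1, 0), moment(0, 1)
var1 = moment(2, 0) - mu1**2
var2 = moment(0, 2) - mu2**2
cov = moment(1, 1) - mu1 * mu2

# The square root is taken in floating point, so rho is a float.
rho = float(cov) / sqrt(float(var1) * float(var2))
print(mu1, var1, cov, rho)
```

Keeping the moments as `Fraction` values means the means, variances, and covariance agree with the text exactly; only the final correlation is rounded.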

Remark 4.1. For certain kinds of distributions of two random variables, say $X$ and $Y$, the correlation coefficient $\rho$ proves to be a very useful characteristic of the distribution. Unfortunately, the formal definition of $\rho$ does not reveal this fact. At this time we make some observations about $\rho$, some of which will be explored more fully at a later stage. It will soon be seen that if a joint distribution of two variables has a correlation coefficient (that is, if both of the variances are positive), then $\rho$ satisfies $-1 \le \rho \le 1$. If $\rho = 1$, there is a line with equation $y = a + bx$, $b > 0$, the graph of which contains all of the probability of the distribution of $X$ and $Y$. In this extreme case, we have $P(Y = a + bX) = 1$. If $\rho = -1$, we have the same state of affairs except that $b < 0$. This suggests the following interesting question: When $\rho$ does not have one of its extreme values, is there a line in the $xy$-plane such that the probability for $X$ and $Y$ tends to be concentrated in a band about this line? Under certain restrictive conditions this is, in fact, the case, and under those conditions we can look upon $\rho$ as a measure of the intensity of the concentration of the probability for $X$ and $Y$ about that line.

Next, let $f(x, y)$ denote the joint pdf of two random variables $X$ and $Y$ and let $f_1(x)$ denote the marginal pdf of $X$. Recall from Section 3 that the conditional pdf of $Y$, given $X = x$, is
\[ f_{2|1}(y|x) = \frac{f(x, y)}{f_1(x)} \]
at points where $f_1(x) > 0$, and the conditional mean of $Y$, given $X = x$, is given by
\[ E(Y|x) = \int_{-\infty}^{\infty} y f_{2|1}(y|x)\,dy = \frac{\int_{-\infty}^{\infty} y f(x, y)\,dy}{f_1(x)}, \]
when dealing with random variables of the continuous type. This conditional mean of $Y$, given $X = x$, is, of course, a function of $x$, say $u(x)$. In a like vein, the conditional mean of $X$, given $Y = y$, is a function of $y$, say $v(y)$.
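As an illustration that we add here (it is not worked in the original text), the function $u(x)$ can be computed for the joint pdf $f(x, y) = x + y$, $0 < x < 1$, $0 < y < 1$, of Example 4.1. The steps below use only the definitions just recalled.

```latex
\begin{align*}
f_1(x) &= \int_0^1 (x + y)\,dy = x + \tfrac{1}{2}, \qquad 0 < x < 1, \\
u(x) = E(Y|x) &= \frac{\int_0^1 y(x + y)\,dy}{f_1(x)}
  = \frac{\tfrac{x}{2} + \tfrac{1}{3}}{x + \tfrac{1}{2}}
  = \frac{3x + 2}{6x + 3}, \qquad 0 < x < 1.
\end{align*}
```

Here $u(x)$ is a ratio of linear functions rather than a linear function of $x$, so this distribution does not have a linear conditional mean.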


In case $u(x)$ is a linear function of $x$, say $u(x) = a + bx$, we say the conditional mean of $Y$ is linear in $x$; or that $Y$ has a linear conditional mean. When $u(x) = a + bx$, the constants $a$ and $b$ have simple values which we summarize in the following theorem.

Theorem 4.1. Suppose $(X, Y)$ have a joint distribution with the variances of $X$ and $Y$ finite and positive. Denote the means and variances of $X$ and $Y$ by $\mu_1, \mu_2$ and $\sigma_1^2, \sigma_2^2$, respectively, and let $\rho$ be the correlation coefficient between $X$ and $Y$. If $E(Y|X)$ is linear in $X$ then
\[ E(Y|X) = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(X - \mu_1) \tag{4.1} \]
and
\[ E(\operatorname{Var}(Y|X)) = \sigma_2^2(1 - \rho^2). \tag{4.2} \]

Proof: The proof is given in the continuous case. The discrete case follows similarly by changing integrals to sums. Let $E(Y|x) = a + bx$. From
\[ E(Y|x) = \frac{\int_{-\infty}^{\infty} y f(x, y)\,dy}{f_1(x)} = a + bx, \]
we have
\[ \int_{-\infty}^{\infty} y f(x, y)\,dy = (a + bx) f_1(x). \tag{4.3} \]
If both members of Equation (4.3) are integrated on $x$, it is seen that $E(Y) = a + bE(X)$ or
\[ \mu_2 = a + b\mu_1, \tag{4.4} \]
where $\mu_1 = E(X)$ and $\mu_2 = E(Y)$. If both members of Equation (4.3) are first multiplied by $x$ and then integrated on $x$, we have $E(XY) = aE(X) + bE(X^2)$, or
\[ \rho\sigma_1\sigma_2 + \mu_1\mu_2 = a\mu_1 + b(\sigma_1^2 + \mu_1^2), \tag{4.5} \]
where $\rho\sigma_1\sigma_2$ is the covariance of $X$ and $Y$. The simultaneous solution of equations (4.4) and (4.5) yields
\[ a = \mu_2 - \rho\frac{\sigma_2}{\sigma_1}\mu_1 \quad \text{and} \quad b = \rho\frac{\sigma_2}{\sigma_1}. \]
These values give the first result (4.1).


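The conditional variance of $Y$ is given by
\begin{align*}
\operatorname{Var}(Y|x) &= \int_{-\infty}^{\infty} \left[ y - \mu_2 - \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1) \right]^2 f_{2|1}(y|x)\,dy \\
&= \frac{\int_{-\infty}^{\infty} \left[ (y - \mu_2) - \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1) \right]^2 f(x, y)\,dy}{f_1(x)}. \tag{4.6}
\end{align*}
This variance is nonnegative and is at most a function of $x$ alone. If it is multiplied by $f_1(x)$ and integrated on $x$, the result obtained is nonnegative. This result is
\begin{align*}
&\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \left[ (y - \mu_2) - \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1) \right]^2 f(x, y)\,dy\,dx \\
&\quad = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \left[ (y - \mu_2)^2 - 2\rho\frac{\sigma_2}{\sigma_1}(y - \mu_2)(x - \mu_1) + \rho^2\frac{\sigma_2^2}{\sigma_1^2}(x - \mu_1)^2 \right] f(x, y)\,dy\,dx \\
&\quad = E[(Y - \mu_2)^2] - 2\rho\frac{\sigma_2}{\sigma_1}E[(X - \mu_1)(Y - \mu_2)] + \rho^2\frac{\sigma_2^2}{\sigma_1^2}E[(X - \mu_1)^2] \\
&\quad = \sigma_2^2 - 2\rho\frac{\sigma_2}{\sigma_1}\rho\sigma_1\sigma_2 + \rho^2\frac{\sigma_2^2}{\sigma_1^2}\sigma_1^2 \\
&\quad = \sigma_2^2 - 2\rho^2\sigma_2^2 + \rho^2\sigma_2^2 = \sigma_2^2(1 - \rho^2),
\end{align*}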

which is the desired result.

Note that if the variance, Equation (4.6), is denoted by $k(x)$, then $E[k(X)] = \sigma_2^2(1 - \rho^2) \ge 0$. Accordingly, $\rho^2 \le 1$, or $-1 \le \rho \le 1$. It is left as an exercise to prove that $-1 \le \rho \le 1$ whether the conditional mean is or is not linear; see Exercise 4.7.

Suppose that the variance, Equation (4.6), is positive but not a function of $x$; that is, the variance is a constant $k > 0$. Now if $k$ is multiplied by $f_1(x)$ and integrated on $x$, the result is $k$, so that $k = \sigma_2^2(1 - \rho^2)$. Thus, in this case, the variance of each conditional distribution of $Y$, given $X = x$, is $\sigma_2^2(1 - \rho^2)$. If $\rho = 0$, the variance of each conditional distribution of $Y$, given $X = x$, is $\sigma_2^2$, the variance of the marginal distribution of $Y$. On the other hand, if $\rho^2$ is near 1, the variance of each conditional distribution of $Y$, given $X = x$, is relatively small, and there is a high concentration of the probability for this conditional distribution near the mean $E(Y|x) = \mu_2 + \rho(\sigma_2/\sigma_1)(x - \mu_1)$. Similar comments can be made about $E(X|y)$ if it is linear. In particular, $E(X|y) = \mu_1 + \rho(\sigma_1/\sigma_2)(y - \mu_2)$ and $E[\operatorname{Var}(X|Y)] = \sigma_1^2(1 - \rho^2)$.

Example 4.2. Let the random variables $X$ and $Y$ have the linear conditional means $E(Y|x) = 4x + 3$ and $E(X|y) = \frac{1}{16}y - 3$. In accordance with the general formulas for the linear conditional means, we see that $E(Y|x) = \mu_2$ if $x = \mu_1$ and $E(X|y) = \mu_1$ if $y = \mu_2$. Accordingly, in this special case, we have $\mu_2 = 4\mu_1 + 3$ and $\mu_1 = \frac{1}{16}\mu_2 - 3$ so that $\mu_1 = -\frac{15}{4}$ and $\mu_2 = -12$. The general formulas for the linear conditional means also show that the product of the coefficients of $x$ and $y$, respectively, is equal to $\rho^2$ and that the quotient of these coefficients is equal to $\sigma_2^2/\sigma_1^2$. Here $\rho^2 = 4\left(\frac{1}{16}\right) = \frac{1}{4}$ with $\rho = \frac{1}{2}$ (not $-\frac{1}{2}$), and $\sigma_2^2/\sigma_1^2 = 64$. Thus, from the two linear conditional means, we are able to find the values of $\mu_1$, $\mu_2$, $\rho$, and $\sigma_2/\sigma_1$, but not the values of $\sigma_1$ and $\sigma_2$.
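The reasoning of Example 4.2 applies to any pair of linear conditional means. The sketch below is our own illustration, not part of the original text. The function name `summarize` and its parameter names are hypothetical, and it assumes $E(Y|x) = a + bx$ and $E(X|y) = c + dy$ with $b$ and $d$ of the same sign and $0 < bd < 1$.

```python
from fractions import Fraction
from math import sqrt


def summarize(a, b, c, d):
    """Return (mu1, mu2, rho, sigma2/sigma1) from E(Y|x) = a + b*x
    and E(X|y) = c + d*y.

    Assumes both conditional means are linear, b and d share a sign,
    and 0 < b*d < 1.
    """
    # Solve mu2 = a + b*mu1 and mu1 = c + d*mu2 simultaneously.
    mu1 = (c + d * a) / (1 - b * d)
    mu2 = a + b * mu1
    # b = rho*sigma2/sigma1 and d = rho*sigma1/sigma2, so
    # b*d = rho^2 and b/d = (sigma2/sigma1)^2; rho takes the sign of b.
    rho = sqrt(b * d) if b > 0 else -sqrt(b * d)
    ratio = sqrt(b / d)
    return mu1, mu2, rho, ratio


# The numbers of Example 4.2: E(Y|x) = 3 + 4x and E(X|y) = -3 + y/16.
mu1, mu2, rho, ratio = summarize(
    Fraction(3), Fraction(4), Fraction(-3), Fraction(1, 16)
)
print(mu1, mu2, rho, ratio)
```

Passing `Fraction` arguments keeps the two means exact, while $\rho$ and $\sigma_2/\sigma_1$ come back as floats because of the square roots. As the text notes, the individual values of $\sigma_1$ and $\sigma_2$ cannot be recovered this way.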

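Figure 4.1: Illustration for Example 4.3. (Plot omitted: axes $x$ and $y$ with origin (0, 0); the line $E(Y|x) = bx$; marks at $-h$ and $h$ on the $x$-axis and at $-a$ and $a$ on the $y$-axis.)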

Example 4.3. To illustrate how the correlation coefficient measures the intensity of the concentration of the probability for $X$ and $Y$ about a line, let these random variables have a distribution that is uniform over the area depicted in Figure 4.1. That is, the joint pdf of $X$ and $Y$ is
\[ f(x, y) = \begin{cases} \frac{1}{4ah} & -a + bx < y < a + bx,\ -h < x < h \\ 0 & \text{elsewhere.} \end{cases} \]
We assume here that $b \ge 0$, but the argument can be modified for $b \le 0$. It is easy to show that the pdf of $X$ is uniform, namely
\[ f_1(x) = \begin{cases} \int_{-a+bx}^{a+bx} \frac{1}{4ah}\,dy = \frac{1}{2h} & -h < x < h \\ 0 & \text{elsewhere.} \end{cases} \]
The conditional mean and variance are
\[ E(Y|x) = bx \quad \text{and} \quad \operatorname{var}(Y|x) = \frac{a^2}{3}. \]
From the general expressions for those characteristics we know that
\[ b = \rho\frac{\sigma_2}{\sigma_1} \quad \text{and} \quad \frac{a^2}{3} = \sigma_2^2(1 - \rho^2). \]
Additionally, we know that $\sigma_1^2 = h^2/3$. If we solve these three equations, we obtain an expression for the correlation coefficient, namely
\[ \rho = \frac{bh}{\sqrt{a^2 + b^2h^2}}. \]
Referring to Figure 4.1, we note
1. As $a$ gets small (large), the straight-line effect is more (less) intense and $\rho$ is closer to 1 (0).
2. As $h$ gets large (small), the straight-line effect is more (less) intense and $\rho$ is closer to 1 (0).
3. As $b$ gets large (small), the straight-line effect is more (less) intense and $\rho$ is closer to 1 (0).

Recall that in Section 1 we introduced the mgf for the random vector $(X, Y)$. As for random variables, the joint mgf also gives explicit formulas for certain moments. In the case of random variables of the continuous type,
\[ \frac{\partial^{k+m} M(t_1, t_2)}{\partial t_1^k\,\partial t_2^m} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} x^k y^m e^{t_1 x + t_2 y} f(x, y)\,dx\,dy, \]
so that
\[ \left. \frac{\partial^{k+m} M(t_1, t_2)}{\partial t_1^k\,\partial t_2^m} \right|_{t_1 = t_2 = 0} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} x^k y^m f(x, y)\,dx\,dy = E(X^k Y^m). \]
For instance, in a simplified notation that appears to be clear,
\begin{align*}
\mu_1 &= E(X) = \frac{\partial M(0, 0)}{\partial t_1} \\
\mu_2 &= E(Y) = \frac{\partial M(0, 0)}{\partial t_2} \\
\sigma_1^2 &= E(X^2) - \mu_1^2 = \frac{\partial^2 M(0, 0)}{\partial t_1^2} - \mu_1^2 \\
\sigma_2^2 &= E(Y^2) - \mu_2^2 = \frac{\partial^2 M(0, 0)}{\partial t_2^2} - \mu_2^2 \\
E[(X - \mu_1)(Y - \mu_2)] &= \frac{\partial^2 M(0, 0)}{\partial t_1\,\partial t_2} - \mu_1\mu_2, \tag{4.7}
\end{align*}

and from these we can compute the correlation coefficient $\rho$. It is fairly obvious that the results of equations (4.7) hold if $X$ and $Y$ are random variables of the discrete type. Thus the correlation coefficients may be computed by using the mgf of the joint distribution if that function is readily available. An illustrative example follows.
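The relations in (4.7) can also be checked numerically in the discrete case. The sketch below is an illustration that we add here, not part of the original text. The joint pmf in `pmf` is hypothetical and chosen only for convenience, the step `h` is an assumed value, and the partial derivatives of the mgf at (0, 0) are approximated by central differences rather than computed symbolically.

```python
from math import exp

# Hypothetical joint pmf of (X, Y) on a few points, for illustration only.
pmf = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.1, (2, 1): 0.3}


def M(t1, t2):
    """Joint mgf E[exp(t1*X + t2*Y)], a finite sum for a discrete pmf."""
    return sum(p * exp(t1 * x + t2 * y) for (x, y), p in pmf.items())


def E(u):
    """Direct expectation E[u(X, Y)] under the joint pmf."""
    return sum(p * u(x, y) for (x, y), p in pmf.items())


h = 1e-4  # assumed step size for the central differences

# First partials at (0, 0) approximate mu1 and mu2.
dM_dt1 = (M(h, 0) - M(-h, 0)) / (2 * h)
dM_dt2 = (M(0, h) - M(0, -h)) / (2 * h)

# Mixed second partial at (0, 0) approximates E(XY).
d2M_dt1dt2 = (M(h, h) - M(h, -h) - M(-h, h) + M(-h, -h)) / (4 * h * h)

cov_from_mgf = d2M_dt1dt2 - dM_dt1 * dM_dt2
cov_direct = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)
print(cov_from_mgf, cov_direct)
```

The two covariances agree only up to the finite-difference error, so they should be compared with a tolerance and not for exact equality.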


Example 4.4 (Example 1.7, Continued). In Example 1.7, we considered the joint density −y 0 0, x2 ∈ S2 , zero elsewhere.

Proof. If $X_1$ and $X_2$ are independent, then $f(x_1, x_2) \equiv f_1(x_1)f_2(x_2)$, where $f_1(x_1)$ and $f_2(x_2)$ are the marginal probability density functions of $X_1$ and $X_2$, respectively. Thus the condition $f(x_1, x_2) \equiv g(x_1)h(x_2)$ is fulfilled.

Conversely, if $f(x_1, x_2) \equiv g(x_1)h(x_2)$, then, for random variables of the continuous type, we have
\[ f_1(x_1) = \int_{-\infty}^{\infty} g(x_1)h(x_2)\,dx_2 = g(x_1)\int_{-\infty}^{\infty} h(x_2)\,dx_2 = c_1 g(x_1) \]
and
\[ f_2(x_2) = \int_{-\infty}^{\infty} g(x_1)h(x_2)\,dx_1 = h(x_2)\int_{-\infty}^{\infty} g(x_1)\,dx_1 = c_2 h(x_2), \]
where $c_1$ and $c_2$ are constants, not functions of $x_1$ or $x_2$. Moreover, $c_1c_2 = 1$ because
\[ 1 = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} g(x_1)h(x_2)\,dx_1\,dx_2 = \left[ \int_{-\infty}^{\infty} g(x_1)\,dx_1 \right]\left[ \int_{-\infty}^{\infty} h(x_2)\,dx_2 \right] = c_2c_1. \]
These results imply that
\[ f(x_1, x_2) \equiv g(x_1)h(x_2) \equiv c_1 g(x_1)\,c_2 h(x_2) \equiv f_1(x_1)f_2(x_2). \]


Accordingly, $X_1$ and $X_2$ are independent.

This theorem is true for the discrete case also. Simply replace the joint pdf by the joint pmf. If we now refer to Example 5.1, we see that the joint pdf
\[ f(x_1, x_2) = \begin{cases} x_1 + x_2 & 0 < x_1 < 1,\ 0 < x_2 < 1 \\ 0 & \text{elsewhere} \end{cases} \]
cannot be written as the product of a nonnegative function of $x_1$ and a nonnegative function of $x_2$. Accordingly, $X_1$ and $X_2$ are dependent.

Example 5.2. Let the joint pdf of the random variables $X_1$ and $X_2$ be $f(x_1, x_2) = 8x_1x_2$, $0 < x_1 < x_2 < 1$, zero elsewhere. The formula $8x_1x_2$ might suggest to some that $X_1$ and $X_2$ are independent. However, if we consider the space $\mathcal{S} = \{(x_1, x_2) : 0 < x_1 < x_2 < 1\}$, we see that it is not a product space. This should make it clear that, in general, $X_1$ and $X_2$ must be dependent if the space of positive probability density of $X_1$ and $X_2$ is bounded by a curve that is neither a horizontal nor a vertical line.

Instead of working with pdfs (or pmfs) we could have presented independence in terms of cumulative distribution functions. The following theorem shows the equivalence.

Theorem 5.2. Let $(X_1, X_2)$ have the joint cdf $F(x_1, x_2)$ and let $X_1$ and $X_2$ have the marginal cdfs $F_1(x_1)$ and $F_2(x_2)$, respectively. Then $X_1$ and $X_2$ are independent if and only if
\[ F(x_1, x_2) = F_1(x_1)F_2(x_2) \quad \text{for all } (x_1, x_2) \in R^2. \tag{5.1} \]

Proof: We give the proof for the continuous case. Suppose expression (5.1) holds. Then the mixed second partial is
\[ \frac{\partial^2}{\partial x_1\,\partial x_2} F(x_1, x_2) = f_1(x_1)f_2(x_2). \]
Hence, $X_1$ and $X_2$ are independent. Conversely, suppose $X_1$ and $X_2$ are independent. Then by the definition of the joint cdf,
\begin{align*}
F(x_1, x_2) &= \int_{-\infty}^{x_1}\!\int_{-\infty}^{x_2} f_1(w_1)f_2(w_2)\,dw_2\,dw_1 \\
&= \int_{-\infty}^{x_1} f_1(w_1)\,dw_1 \cdot \int_{-\infty}^{x_2} f_2(w_2)\,dw_2 = F_1(x_1)F_2(x_2).
\end{align*}
Hence, condition (5.1) is true.

We now give a theorem that frequently simplifies the calculations of probabilities of events which involve independent variables.


Theorem 5.3. The random variables $X_1$ and $X_2$ are independent random variables if and only if the following condition holds,
\[ P(a < X_1 \le b,\ c < X_2 \le d) = P(a < X_1 \le b)\,P(c < X_2 \le d) \tag{5.2} \]
for every $a < b$ and $c < d$, where $a$, $b$, $c$, and $d$ are constants.

Proof: If $X_1$ and $X_2$ are independent, then an application of the last theorem and expression (1.2) shows that
\begin{align*}
P(a < X_1 \le b,\ c < X_2 \le d) &= F(b, d) - F(a, d) - F(b, c) + F(a, c) \\
&= F_1(b)F_2(d) - F_1(a)F_2(d) - F_1(b)F_2(c) + F_1(a)F_2(c) \\
&= [F_1(b) - F_1(a)][F_2(d) - F_2(c)],
\end{align*}
which is the right side of expression (5.2). Conversely, condition (5.2) implies that the joint cdf of $(X_1, X_2)$ factors into a product of the marginal cdfs, which in turn by Theorem 5.2 implies that $X_1$ and $X_2$ are independent.

Example 5.3 (Example 5.1, Continued). Independence is necessary for condition (5.2). For example, consider the dependent variables $X_1$ and $X_2$ of Example 5.1. For these random variables, we have
\[ P(0 < X_1 < \tfrac{1}{2},\ 0 < X_2 < \tfrac{1}{2}) = \int_0^{1/2}\!\!\int_0^{1/2} (x_1 + x_2)\,dx_1\,dx_2 = \tfrac{1}{8}, \]
whereas
\[ P(0 < X_1 < \tfrac{1}{2}) = \int_0^{1/2} (x_1 + \tfrac{1}{2})\,dx_1 = \tfrac{3}{8} \]
and
\[ P(0 < X_2 < \tfrac{1}{2}) = \int_0^{1/2} (\tfrac{1}{2} + x_2)\,dx_2 = \tfrac{3}{8}. \]

Hence, condition (5.2) does not hold.

Not merely are calculations of some probabilities usually simpler when we have independent random variables, but many expectations, including certain moment generating functions, have comparably simpler computations. The following result proves so useful that we state it in the form of a theorem.

Theorem 5.4. Suppose $X_1$ and $X_2$ are independent and that $E(u(X_1))$ and $E(v(X_2))$ exist. Then
\[ E[u(X_1)v(X_2)] = E[u(X_1)]\,E[v(X_2)]. \]

Proof. We give the proof in the continuous case. The independence of $X_1$ and $X_2$ implies that the joint pdf of $X_1$ and $X_2$ is $f_1(x_1)f_2(x_2)$. Thus we have, by definition of expectation,
\begin{align*}
E[u(X_1)v(X_2)] &= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} u(x_1)v(x_2)f_1(x_1)f_2(x_2)\,dx_1\,dx_2 \\
&= \left[ \int_{-\infty}^{\infty} u(x_1)f_1(x_1)\,dx_1 \right]\left[ \int_{-\infty}^{\infty} v(x_2)f_2(x_2)\,dx_2 \right] \\
&= E[u(X_1)]\,E[v(X_2)].
\end{align*}
Hence, the result is true.

Upon taking the functions $u(\cdot)$ and $v(\cdot)$ to be the identity functions in Theorem 5.4, we have that for independent random variables $X_1$ and $X_2$,
\[ E(X_1X_2) = E(X_1)E(X_2). \tag{5.3} \]
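Theorem 5.4 and Equation (5.3) can be verified by direct enumeration in a small discrete case. The sketch below is our own illustration, not part of the original text. It assumes two independent fair dice, so the joint pmf is the product of the marginals; the functions `u` and `v` are arbitrary choices and the helpers `E1` and `E2` are hypothetical names.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)  # marginal pmf of each fair die


def E1(u):
    """Expectation of u(X) for a single fair die."""
    return sum(p * u(x) for x in faces)


def E2(w):
    """Expectation of w(X1, X2) under the product pmf p(x1) * p(x2)."""
    return sum(p * p * w(x1, x2) for x1, x2 in product(faces, repeat=2))


def u(x):
    return x * x  # arbitrary function of X1


def v(x):
    return 2 * x + 1  # arbitrary function of X2


lhs = E2(lambda x1, x2: u(x1) * v(x2))  # E[u(X1) v(X2)]
rhs = E1(u) * E1(v)                     # E[u(X1)] E[v(X2)]
print(lhs, rhs)
```

With exact fractions the two sides are equal, not just close. The factorization depends on independence: for the dependent pair with joint pdf $x_1 + x_2$ on the unit square, the moments computed in Example 4.1 give $E(X_1X_2) = \frac{1}{3}$ while $E(X_1)E(X_2) = \frac{49}{144}$.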

Example 5.4. Let $X$ and $Y$ be two independent random variables with means $\mu_1$ and $\mu_2$ and positive variances $\sigma_1^2$ and $\sigma_2^2$, respectively. We show that the independence of $X$ and $Y$ implies that the correlation coefficient of $X$ and $Y$ is zero. This is true because the covariance of $X$ and $Y$ is equal to
\[ E[(X - \mu_1)(Y - \mu_2)] = E(X - \mu_1)E(Y - \mu_2) = 0. \]

We next prove a very useful theorem about independent random variables. The proof of the theorem relies heavily upon our assertion that an mgf, when it exists, is unique and that it uniquely determines the distribution of probability.

Theorem 5.5. Suppose the joint mgf, $M(t_1, t_2)$, exists for the random variables $X_1$ and $X_2$. Then $X_1$ and $X_2$ are independent if and only if
\[ M(t_1, t_2) = M(t_1, 0)M(0, t_2); \]
that is, the joint mgf is identically equal to the product of the marginal mgfs.

Proof. If $X_1$ and $X_2$ are independent, then
\begin{align*}
M(t_1, t_2) &= E(e^{t_1X_1 + t_2X_2}) \\
&= E(e^{t_1X_1}e^{t_2X_2}) \\
&= E(e^{t_1X_1})E(e^{t_2X_2}) \\
&= M(t_1, 0)M(0, t_2).
\end{align*}
Thus the independence of $X_1$ and $X_2$ implies that the mgf of the joint distribution factors into the product of the moment-generating functions of the two marginal distributions.

Suppose next that the mgf of the joint distribution of $X_1$ and $X_2$ is given by $M(t_1, t_2) = M(t_1, 0)M(0, t_2)$. Now $X_1$ has the unique mgf, which, in the continuous case, is given by
\[ M(t_1, 0) = \int_{-\infty}^{\infty} e^{t_1x_1} f_1(x_1)\,dx_1. \]
Similarly, the unique mgf of $X_2$, in the continuous case, is given by
\[ M(0, t_2) = \int_{-\infty}^{\infty} e^{t_2x_2} f_2(x_2)\,dx_2. \]
Thus we have
\begin{align*}
M(t_1, 0)M(0, t_2) &= \left[ \int_{-\infty}^{\infty} e^{t_1x_1} f_1(x_1)\,dx_1 \right]\left[ \int_{-\infty}^{\infty} e^{t_2x_2} f_2(x_2)\,dx_2 \right] \\
&= \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} e^{t_1x_1 + t_2x_2} f_1(x_1)f_2(x_2)\,dx_1\,dx_2.
\end{align*}
We are given that $M(t_1, t_2) = M(t_1, 0)M(0, t_2)$; so
\[ M(t_1, t_2) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} e^{t_1x_1 + t_2x_2} f_1(x_1)f_2(x_2)\,dx_1\,dx_2. \]
But $M(t_1, t_2)$ is the mgf of $X_1$ and $X_2$. Thus
\[ M(t_1, t_2) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} e^{t_1x_1 + t_2x_2} f(x_1, x_2)\,dx_1\,dx_2. \]

The uniqueness of the mgf implies that the two distributions of probability that are described by $f_1(x_1)f_2(x_2)$ and $f(x_1, x_2)$ are the same. Thus $f(x_1, x_2) \equiv f_1(x_1)f_2(x_2)$. That is, if $M(t_1, t_2) = M(t_1, 0)M(0, t_2)$, then $X_1$ and $X_2$ are independent. This completes the proof when the random variables are of the continuous type. With random variables of the discrete type, the proof is made by using summation instead of integration.

Example 5.5 (Example 1.7, Continued). Let (X, Y) be a pair of random variables with the joint pdf −y e 0 0 and the integral converges (absolutely). A useful random variable is given by $h(X_1) = E[u(X_2, \ldots, X_n)\,|\,X_1]$.

The above discussion of marginal and conditional distributions generalizes to random variables of the discrete type by using pmfs and summations instead of integrals.

Let the random variables $X_1, X_2, \ldots, X_n$ have the joint pdf $f(x_1, x_2, \ldots, x_n)$ and the marginal probability density functions $f_1(x_1), f_2(x_2), \ldots, f_n(x_n)$, respectively. The definition of the independence of $X_1$ and $X_2$ is generalized to the mutual independence of $X_1, X_2, \ldots, X_n$ as follows: The random variables $X_1, X_2, \ldots, X_n$ are said to be mutually independent if and only if

$$f(x_1, x_2, \ldots, x_n) \equiv f_1(x_1)f_2(x_2)\cdots f_n(x_n),$$

for the continuous case. In the discrete case, $X_1, X_2, \ldots, X_n$ are said to be mutually independent if and only if

$$p(x_1, x_2, \ldots, x_n) \equiv p_1(x_1)p_2(x_2)\cdots p_n(x_n).$$

Suppose $X_1, X_2, \ldots, X_n$ are mutually independent. Then

$$P(a_1 < X_1 < b_1, a_2 < X_2 < b_2, \ldots, a_n < X_n < b_n) = P(a_1 < X_1 < b_1)P(a_2 < X_2 < b_2)\cdots P(a_n < X_n < b_n) = \prod_{i=1}^{n} P(a_i < X_i < b_i),$$

where the symbol $\prod_{i=1}^{n} \varphi(i)$ is defined to be

$$\prod_{i=1}^{n} \varphi(i) = \varphi(1)\varphi(2)\cdots\varphi(n).$$

The theorem that $E[u(X_1)v(X_2)] = E[u(X_1)]E[v(X_2)]$ for independent random variables $X_1$ and $X_2$ becomes, for mutually independent random variables $X_1, X_2, \ldots, X_n$,

$$E[u_1(X_1)u_2(X_2)\cdots u_n(X_n)] = E[u_1(X_1)]E[u_2(X_2)]\cdots E[u_n(X_n)],$$

or

$$E\left[\prod_{i=1}^{n} u_i(X_i)\right] = \prod_{i=1}^{n} E[u_i(X_i)].$$

The moment-generating function (mgf) of the joint distribution of n random variables $X_1, X_2, \ldots, X_n$ is defined as follows. Let $E[\exp(t_1X_1 + t_2X_2 + \cdots + t_nX_n)]$ exist for $-h_i < t_i < h_i$, $i = 1, 2, \ldots, n$, where each $h_i$ is positive. This expectation is denoted by $M(t_1, t_2, \ldots, t_n)$ and it is called the mgf of the joint distribution of $X_1, \ldots, X_n$ (or simply the mgf of $X_1, \ldots, X_n$). As in the cases of one and two variables, this mgf is unique and uniquely determines the joint distribution of the n variables (and hence all marginal distributions). For example, the mgf of the marginal distribution of $X_i$ is $M(0, \ldots, 0, t_i, 0, \ldots, 0)$, $i = 1, 2, \ldots, n$; that of the marginal distribution of $X_i$ and $X_j$ is $M(0, \ldots, 0, t_i, 0, \ldots, 0, t_j, 0, \ldots, 0)$; and so on. Theorem 5.5 of this chapter can be generalized, and the factorization

$$M(t_1, t_2, \ldots, t_n) = \prod_{i=1}^{n} M(0, \ldots, 0, t_i, 0, \ldots, 0) \tag{6.6}$$

is a necessary and sufficient condition for the mutual independence of $X_1, X_2, \ldots, X_n$. Note that we can write the joint mgf in vector notation as

$$M(\mathbf{t}) = E[\exp(\mathbf{t}'\mathbf{X})], \quad \text{for } \mathbf{t} \in B \subset R^n,$$

where $B = \{\mathbf{t} : -h_i < t_i < h_i,\ i = 1, \ldots, n\}$.

The following is a theorem that proves useful in the sequel. It gives the mgf of a linear combination of independent random variables.

Theorem 6.1. Suppose $X_1, X_2, \ldots, X_n$ are n mutually independent random variables. Suppose, for all $i = 1, 2, \ldots, n$, $X_i$ has mgf $M_i(t)$, for $-h_i < t < h_i$, where $h_i > 0$. Let $T = \sum_{i=1}^{n} k_i X_i$, where $k_1, k_2, \ldots, k_n$ are constants. Then T has the mgf given by

$$M_T(t) = \prod_{i=1}^{n} M_i(k_i t), \quad -\min_i\{h_i\} < t < \min_i\{h_i\}. \tag{6.7}$$

Proof. Assume t is in the interval $(-\min_i\{h_i\}, \min_i\{h_i\})$. Then, by independence,

$$M_T(t) = E\left[e^{\sum_{i=1}^{n} t k_i X_i}\right] = E\left[\prod_{i=1}^{n} e^{(t k_i) X_i}\right] = \prod_{i=1}^{n} E\left[e^{t k_i X_i}\right] = \prod_{i=1}^{n} M_i(k_i t),$$

which completes the proof.

Example 6.2. Let $X_1$, $X_2$, and $X_3$ be three mutually independent random variables and let each have the pdf

$$f(x) = \begin{cases} 2x & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{6.8}$$

The joint pdf of $X_1, X_2, X_3$ is $f(x_1)f(x_2)f(x_3) = 8x_1x_2x_3$, $0 < x_i < 1$, $i = 1, 2, 3$, zero elsewhere. Then, for illustration, the expected value of $5X_1X_2^3 + 3X_2X_3^4$ is

$$\int_0^1\int_0^1\int_0^1 (5x_1x_2^3 + 3x_2x_3^4)\,8x_1x_2x_3\,dx_1\,dx_2\,dx_3 = 2.$$

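Let Y be the maximum of $X_1$, $X_2$, and $X_3$. Then, for instance, we have

$$P(Y \le \tfrac{1}{2}) = P(X_1 \le \tfrac{1}{2}, X_2 \le \tfrac{1}{2}, X_3 \le \tfrac{1}{2}) = \int_0^{1/2}\int_0^{1/2}\int_0^{1/2} 8x_1x_2x_3\,dx_1\,dx_2\,dx_3 = \left(\tfrac{1}{2}\right)^6 = \tfrac{1}{64}.$$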
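The two numbers in this example can be checked with exact arithmetic. The sketch below is only an illustration in Python (not part of the text); the helper names `moment` and `cdf` are made up here, and it relies on the facts that, for the pdf 2x on (0, 1), $E(X^k) = 2/(k+2)$ and $P(X \le x) = x^2$.

```python
from fractions import Fraction

def moment(k):
    """E[X^k] for the pdf f(x) = 2x on (0, 1): the integral of 2 x^(k+1) is 2/(k+2)."""
    return Fraction(2, k + 2)

def cdf(x):
    """P(X <= x) = x^2 for 0 <= x <= 1."""
    return Fraction(x) ** 2

# Mutual independence lets each expectation factor into one-variable moments.
expected = 5 * moment(1) * moment(3) + 3 * moment(1) * moment(4)

# The maximum is at most 1/2 exactly when all three variables are at most 1/2.
p_max = cdf(Fraction(1, 2)) ** 3

print(expected, p_max)
```

Because the variables are independent, no triple integral is needed: each term reduces to a product of one-dimensional moments.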

In a similar manner, we find that the cdf of Y is ⎧ ⎨ 0 y 0.80 and $0.20 > (\tfrac{2}{3})^n$. Either by inspection or by use of logarithms, we see that n = 4 is the solution. That is, the probability of at least one success throughout n = 4 independent repetitions of a random experiment with probability of success $p = \tfrac{1}{3}$ is greater than 0.80.

Example 1.4. Let the random variable Y be equal to the number of successes throughout n independent repetitions of a random experiment with probability p of success. That is, Y is b(n, p). The ratio Y/n is called the relative frequency of success. Recall the second version of Chebyshev's inequality. Applying this result, we have for all $\epsilon > 0$ that

$$P\left(\left|\frac{Y}{n} - p\right| \ge \epsilon\right) \le \frac{\text{Var}(Y/n)}{\epsilon^2} = \frac{p(1-p)}{n\epsilon^2}$$

[Exercise 1.3 asks for the determination of Var(Y/n)]. Now, for every fixed $\epsilon > 0$, the right-hand member of the preceding inequality is close to zero for sufficiently large n. That is,

$$\lim_{n\to\infty} P\left(\left|\frac{Y}{n} - p\right| \ge \epsilon\right) = 0$$

and

$$\lim_{n\to\infty} P\left(\left|\frac{Y}{n} - p\right| < \epsilon\right) = 1.$$

Since this is true for every fixed $\epsilon > 0$, we see, in a certain sense, that the relative frequency of success is, for large values of n, close to the probability p of success. This result is one form of the Weak Law of Large Numbers.

Example 1.5. Let the independent random variables $X_1, X_2, X_3$ have the same cdf F(x). Let Y be the middle value of $X_1, X_2, X_3$. To determine the cdf of Y, say $F_Y(y) = P(Y \le y)$, we note that $Y \le y$ if and only if at least two of the random variables $X_1, X_2, X_3$ are less than or equal to y. Let us say that the ith "trial" is a success if $X_i \le y$, i = 1, 2, 3; here each "trial" has the probability of success F(y). In this terminology, $F_Y(y) = P(Y \le y)$ is then the probability of at least two successes in three independent trials. Thus

$$F_Y(y) = \binom{3}{2}[F(y)]^2[1 - F(y)] + [F(y)]^3.$$

If F(x) is a continuous cdf so that the pdf of X is $F'(x) = f(x)$, then the pdf of Y is

$$f_Y(y) = F_Y'(y) = 6[F(y)][1 - F(y)]f(y).$$

Example 1.6. Consider a sequence of independent repetitions of a random experiment with constant probability p of success. Let the random variable Y denote the total number of failures in this sequence before the rth success, that is, Y + r is equal to the number of trials necessary to produce exactly r successes. Here r is a fixed positive integer. To determine the pmf of Y, let y be an element of {y : y = 0, 1, 2, ...}. Then, by the multiplication rule of probabilities, P(Y = y) = g(y) is equal to the product of the probability

$$\binom{y+r-1}{r-1} p^{r-1}(1-p)^y$$

of obtaining exactly r − 1 successes in the first y + r − 1 trials and the probability p of a success on the (y + r)th trial. Thus the pmf of Y is

$$p_Y(y) = \begin{cases} \binom{y+r-1}{r-1} p^r (1-p)^y & y = 0, 1, 2, \ldots \\ 0 & \text{elsewhere.} \end{cases} \tag{1.4}$$
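As a quick numerical sanity check of pmf (1.4), the following Python sketch (an illustration only; the function name `neg_binomial_pmf` and the values r = 3, p = 0.4 are arbitrary choices made here) evaluates the pmf and confirms that its values sum to 1 up to truncation of the infinite series.

```python
from math import comb

def neg_binomial_pmf(y, r, p):
    """P(Y = y): y failures before the r-th success, success probability p."""
    if y < 0:
        return 0.0
    return comb(y + r - 1, r - 1) * p**r * (1 - p) ** y

r, p = 3, 0.4
# Truncate the infinite sum; the neglected tail is far below floating-point precision.
total = sum(neg_binomial_pmf(y, r, p) for y in range(400))

# With r = 1 the formula reduces to p (1 - p)^y.
one_success = neg_binomial_pmf(2, 1, p)

print(total, one_success)
```

The r = 1 evaluation previews the special case in which the binomial coefficient equals 1.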

A distribution with a pmf of the form $p_Y(y)$ is called a negative binomial distribution; and any such $p_Y(y)$ is called a negative binomial pmf. The distribution derives its name from the fact that $p_Y(y)$ is a general term in the expansion of $p^r[1 - (1-p)]^{-r}$. It is left as an exercise to show that the mgf of this distribution is $M(t) = p^r[1 - (1-p)e^t]^{-r}$, for $t < -\log(1-p)$. If r = 1, then Y has the pmf

$$p_Y(y) = p(1-p)^y, \quad y = 0, 1, 2, \ldots, \tag{1.5}$$

zero elsewhere, and the mgf $M(t) = p[1 - (1-p)e^t]^{-1}$. In this special case, r = 1, we say that Y has a geometric distribution.

Suppose we have several independent binomial distributions with the same probability of success. Then it makes sense that the sum of these random variables is binomial, as shown in the following theorem.

Theorem 1.1. Let $X_1, X_2, \ldots, X_m$ be independent random variables such that $X_i$ has binomial $b(n_i, p)$ distribution, for i = 1, 2, ..., m. Let $Y = \sum_{i=1}^{m} X_i$. Then Y has a binomial $b(\sum_{i=1}^{m} n_i, p)$ distribution.

Proof: The mgf of $X_i$ is $M_{X_i}(t) = (1 - p + pe^t)^{n_i}$. By independence it follows that

$$M_Y(t) = \prod_{i=1}^{m} (1 - p + pe^t)^{n_i} = (1 - p + pe^t)^{\sum_{i=1}^{m} n_i}.$$

Hence, Y has a binomial $b(\sum_{i=1}^{m} n_i, p)$ distribution.

The binomial distribution is generalized to the multinomial distribution as follows. Let a random experiment be repeated n independent times. On each repetition, the experiment results in but one of k mutually exclusive and exhaustive ways, say $C_1, C_2, \ldots, C_k$. Let $p_i$ be the probability that the outcome is an element of $C_i$ and let $p_i$ remain constant throughout the n independent repetitions, i = 1, 2, ..., k. Define the random variable $X_i$ to be equal to the number of outcomes that are elements of $C_i$, i = 1, 2, ..., k − 1. Furthermore, let $x_1, x_2, \ldots, x_{k-1}$ be nonnegative integers so that $x_1 + x_2 + \cdots + x_{k-1} \le n$. Then the probability that exactly $x_1$ terminations of the experiment are in $C_1$, ..., exactly $x_{k-1}$ terminations are in $C_{k-1}$, and hence exactly $n - (x_1 + \cdots + x_{k-1})$ terminations are in $C_k$ is

$$\frac{n!}{x_1! \cdots x_{k-1}!\,x_k!}\, p_1^{x_1} \cdots p_{k-1}^{x_{k-1}} p_k^{x_k},$$

where $x_k$ is merely an abbreviation for $n - (x_1 + \cdots + x_{k-1})$. This is the multinomial pmf of k − 1 random variables $X_1, X_2, \ldots, X_{k-1}$ of the discrete type. To see that this is correct, note that the number of distinguishable arrangements of $x_1$ $C_1$s, $x_2$ $C_2$s, ..., $x_k$ $C_k$s is

$$\binom{n}{x_1}\binom{n - x_1}{x_2} \cdots \binom{n - x_1 - \cdots - x_{k-2}}{x_{k-1}} = \frac{n!}{x_1!\,x_2! \cdots x_k!}$$

and the probability of each of these distinguishable arrangements is $p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$. Hence the product of these two latter expressions gives the correct probability, which is in agreement with the formula for the multinomial pmf.

When k = 3, we often let $X = X_1$ and $Y = X_2$; then $n - X - Y = X_3$. We say that X and Y have a trinomial distribution. The joint pmf of X and Y is

$$p(x, y) = \frac{n!}{x!\,y!\,(n - x - y)!}\, p_1^x p_2^y p_3^{n-x-y},$$

where x and y are nonnegative integers with $x + y \le n$, and $p_1$, $p_2$, and $p_3$ are positive proper fractions with $p_1 + p_2 + p_3 = 1$; and let p(x, y) = 0 elsewhere. Accordingly, p(x, y) satisfies the conditions of being a joint pmf of two random variables X and Y of the discrete type; that is, p(x, y) is nonnegative and its sum over all points (x, y) at which p(x, y) is positive is equal to $(p_1 + p_2 + p_3)^n = 1$.

If n is a positive integer and $a_1, a_2, a_3$ are fixed constants, we have

$$\begin{aligned}
\sum_{x=0}^{n}\sum_{y=0}^{n-x} \frac{n!}{x!\,y!\,(n-x-y)!}\, a_1^x a_2^y a_3^{n-x-y}
&= \sum_{x=0}^{n} \frac{n!\,a_1^x}{x!\,(n-x)!} \sum_{y=0}^{n-x} \frac{(n-x)!}{y!\,(n-x-y)!}\, a_2^y a_3^{n-x-y} \\
&= \sum_{x=0}^{n} \frac{n!}{x!\,(n-x)!}\, a_1^x (a_2 + a_3)^{n-x} \\
&= (a_1 + a_2 + a_3)^n.
\end{aligned} \tag{1.6}$$
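Identity (1.6) is easy to confirm by brute force for particular constants. The Python sketch below is illustrative only; the function name `trinomial_sum` and the numerical values of n and the constants are arbitrary choices made here.

```python
from math import factorial

def trinomial_sum(n, a1, a2, a3):
    """Double sum on the left-hand side of (1.6), evaluated term by term."""
    total = 0.0
    for x in range(n + 1):
        for y in range(n - x + 1):
            coef = factorial(n) // (factorial(x) * factorial(y) * factorial(n - x - y))
            total += coef * a1**x * a2**y * a3 ** (n - x - y)
    return total

n = 6
lhs = trinomial_sum(n, 0.5, 1.5, 2.0)
rhs = (0.5 + 1.5 + 2.0) ** n

# With probabilities in place of the constants, the same sum is the total mass of the pmf.
pmf_total = trinomial_sum(n, 0.2, 0.3, 0.5)

print(lhs, rhs, pmf_total)
```

Taking the constants to be probabilities that add to 1 shows, in the same computation, that the trinomial pmf sums to 1.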

Consequently, the mgf of a trinomial distribution, in accordance with Equation (1.6), is given by

$$M(t_1, t_2) = \sum_{x=0}^{n}\sum_{y=0}^{n-x} \frac{n!}{x!\,y!\,(n-x-y)!}\,(p_1e^{t_1})^x (p_2e^{t_2})^y p_3^{n-x-y} = (p_1e^{t_1} + p_2e^{t_2} + p_3)^n,$$

for all real values of $t_1$ and $t_2$. The moment-generating functions of the marginal distributions of X and Y are, respectively,

$$M(t_1, 0) = (p_1e^{t_1} + p_2 + p_3)^n = [(1 - p_1) + p_1e^{t_1}]^n$$

and

$$M(0, t_2) = (p_1 + p_2e^{t_2} + p_3)^n = [(1 - p_2) + p_2e^{t_2}]^n.$$

We see immediately that X and Y are dependent random variables, since $M(t_1, t_2)$ is not identically equal to $M(t_1, 0)M(0, t_2)$. In addition, X is $b(n, p_1)$ and Y is $b(n, p_2)$. Accordingly, the means and variances of X and Y are, respectively, $\mu_1 = np_1$, $\mu_2 = np_2$, $\sigma_1^2 = np_1(1 - p_1)$, and $\sigma_2^2 = np_2(1 - p_2)$.

Consider next the conditional pmf of Y, given X = x. We have

$$p_{2|1}(y|x) = \begin{cases} \dfrac{(n-x)!}{y!\,(n-x-y)!}\left(\dfrac{p_2}{1-p_1}\right)^{y}\left(\dfrac{p_3}{1-p_1}\right)^{n-x-y} & y = 0, 1, \ldots, n - x \\ 0 & \text{elsewhere.} \end{cases}$$

Thus the conditional distribution of Y, given X = x, is $b[n - x, p_2/(1 - p_1)]$. Hence the conditional mean of Y, given X = x, is the linear function

$$E(Y|x) = (n - x)\left(\frac{p_2}{1 - p_1}\right).$$

Also, the conditional distribution of X, given Y = y, is $b[n - y, p_1/(1 - p_2)]$ and thus

$$E(X|y) = (n - y)\left(\frac{p_1}{1 - p_2}\right).$$

Now recall that the square of the correlation coefficient $\rho^2$ is equal to the product of $-p_2/(1 - p_1)$ and $-p_1/(1 - p_2)$, the coefficients of x and y in the respective conditional means. Since both of these coefficients are negative (and thus $\rho$ is negative), we have

$$\rho = -\sqrt{\frac{p_1p_2}{(1 - p_1)(1 - p_2)}}.$$

In general, the mgf of a multinomial distribution is given by

$$M(t_1, \ldots, t_{k-1}) = (p_1e^{t_1} + \cdots + p_{k-1}e^{t_{k-1}} + p_k)^n$$

for all real values of $t_1, t_2, \ldots, t_{k-1}$. Thus each one-variable marginal pmf is binomial, each two-variable marginal pmf is trinomial, and so on.

The hypergeometric distribution is related to the binomial distribution.

Example 1.7 (Hypergeometric Distribution). A frequently used application of the hypergeometric distribution is in acceptance sampling. Suppose we have a lot of N items of which D are defective. Let X denote the number of defective items

in a sample of size n. If the sampling is done with replacement and the items are chosen at random, then X has a binomial distribution with parameters n and D/N. In this case the mean and variance of X are n(D/N) and n(D/N)[(N − D)/N], respectively. Suppose, however, that the sampling is without replacement, which is often the case in practice. The pmf of X follows by noting in this case that each of the $\binom{N}{n}$ samples is equally likely and that there are $\binom{N-D}{n-x}\binom{D}{x}$ samples that have x defective items. Hence, the pmf of X is

$$p(x) = \frac{\binom{N-D}{n-x}\binom{D}{x}}{\binom{N}{n}}, \quad x = 0, 1, \ldots, n, \tag{1.7}$$

where, as usual, a binomial coefficient is taken to be 0 when the top value is less than the bottom value. We say that X has a hypergeometric distribution with parameters (N, D, n).

The mean of X is

$$\begin{aligned}
E(X) = \sum_{x=0}^{n} x\,p(x) &= \sum_{x=1}^{n} x\,\frac{[D(D-1)!]/[x(x-1)!\,(D-x)!]}{[N(N-1)!]/[(N-n)!\,n(n-1)!]}\binom{N-D}{n-x} \\
&= n\frac{D}{N}\sum_{x=1}^{n} \binom{(N-1)-(D-1)}{(n-1)-(x-1)}\binom{D-1}{x-1}\binom{N-1}{n-1}^{-1} = n\frac{D}{N}.
\end{aligned}$$

In the next-to-last step, we used the fact that the probabilities of a hypergeometric (N − 1, D − 1, n − 1) distribution summed over its entire range equal 1. So the mean for both types of sampling is the same. The variances, though, differ. As Exercise 1.28 shows, the variance of a hypergeometric (N, D, n) is

$$\text{Var}(X) = n\,\frac{D}{N}\,\frac{N-D}{N}\,\frac{N-n}{N-1}. \tag{1.8}$$

The last term is often thought of as the correction term when sampling without replacement. Note that it is close to 1 if N is much larger than n.

EXERCISES

1.1. If the mgf of a random variable X is $(\tfrac{1}{3} + \tfrac{2}{3}e^t)^5$, find P(X = 2 or 3).

1.2. The mgf of a random variable X is $(\tfrac{2}{3} + \tfrac{1}{3}e^t)^9$. Show that

$$P(\mu - 2\sigma < X < \mu + 2\sigma) = \sum_{x=1}^{5} \binom{9}{x}\left(\frac{1}{3}\right)^x\left(\frac{2}{3}\right)^{9-x}.$$

1.3. If X is b(n, p), show that

$$E\left(\frac{X}{n}\right) = p \quad \text{and} \quad E\left[\left(\frac{X}{n} - p\right)^2\right] = \frac{p(1-p)}{n}.$$

1.4. Let the independent random variables $X_1, X_2, X_3$ have the same pdf $f(x) = 3x^2$, 0 < x < 1, zero elsewhere. Find the probability that exactly two of these three variables exceed $\tfrac{1}{2}$.

1.5. Let Y be the number of successes in n independent repetitions of a random experiment having the probability of success $p = \tfrac{2}{3}$. If n = 3, compute $P(2 \le Y)$; if n = 5, compute $P(3 \le Y)$.

1.6. Let Y be the number of successes throughout n independent repetitions of a random experiment with probability of success $p = \tfrac{1}{4}$. Determine the smallest value of n so that $P(1 \le Y) \ge 0.70$.

1.7. Let the independent random variables $X_1$ and $X_2$ have binomial distributions with parameters $n_1 = 3$, $p = \tfrac{2}{3}$ and $n_2 = 4$, $p = \tfrac{1}{2}$, respectively. Compute $P(X_1 = X_2)$. Hint: List the four mutually exclusive ways that $X_1 = X_2$ and compute the probability of each.

1.8. For this exercise, the reader must have access to a statistical package that obtains the binomial distribution. Hints are given for R code, but other packages can be used too. (a) Obtain the plot of the pmf for the b(15, 0.2) distribution. Using R, the following commands return the plot: x 0. Since m > 0, then $p(x) \ge 0$ and

$$\sum_x p(x) = \sum_{x=0}^{\infty} \frac{m^x e^{-m}}{x!} = e^{-m}\sum_{x=0}^{\infty} \frac{m^x}{x!} = e^{-m}e^{m} = 1;$$

that is, p(x) satisﬁes the conditions of being a pmf of a discrete type of random variable. A random variable that has a pmf of the form p(x) is said to have a Poisson distribution with parameter m, and any such p(x) is called a Poisson pmf with parameter m. Remark 2.1. Experience indicates that the Poisson pmf may be used in a number of applications with quite satisfactory results. For example, let the random variable X denote the number of alpha particles emitted by a radioactive substance that enter a prescribed region during a prescribed interval of time. With a suitable value of m, it is found that X may be assumed to have a Poisson distribution. Again let the random variable X denote the number of defects on a manufactured article, such as a refrigerator door. Upon examining many of these doors, it is found, with an appropriate value of m, that X may be said to have a Poisson distribution. The number of automobile accidents in a unit of time (or the number of insurance claims in some unit of time) is often assumed to be a random variable which has a Poisson distribution. Each of these instances can be thought of as a process that generates a number of changes (accidents, claims, etc.) in a ﬁxed interval (of time or space, etc.). A process which leads to a Poisson distribution is called a Poisson process. Some assumptions that ensure a Poisson process are now enumerated. Let g(x, w) denote the probability of x changes in each interval of length w. Let the symbol o(h) represent any function such that limh→0 [o(h)/h] = 0; for example, h2 = o(h) and o(h) + o(h) = o(h). The Poisson postulates are the following: 1. g(1, h) = λh + o(h), where λ is a positive constant and h > 0.

2. $\sum_{x=2}^{\infty} g(x, h) = o(h)$.

3. The numbers of changes in nonoverlapping intervals are independent. Postulates 1 and 3 state, in eﬀect, that the probability of one change in a short interval h is independent of changes in other nonoverlapping intervals and is approximately proportional to the length of the interval. The substance of postulate 2 is that the probability of two or more changes in the same short interval h is essentially equal to zero. If x = 0, we take g(0, 0) = 1. In accordance with postulates 1 and 2, the probability of at least one change in an interval h is λh+o(h)+o(h) = λh+o(h). Hence the probability of zero changes in this interval of length h is 1 − λh − o(h). Thus the probability g(0, w + h) of zero changes in an interval of length w + h is, in accordance with postulate 3, equal to the product of the probability g(0, w) of zero changes in an interval of length w and the probability [1 − λh − o(h)] of zero changes in a nonoverlapping interval of length h. That is, g(0, w + h) = g(0, w)[1 − λh − o(h)]. Then

$$\frac{g(0, w + h) - g(0, w)}{h} = -\lambda g(0, w) - \frac{o(h)\,g(0, w)}{h}.$$

If we take the limit as $h \to 0$, we have

$$D_w[g(0, w)] = -\lambda g(0, w). \tag{2.2}$$

The solution of this differential equation is $g(0, w) = ce^{-\lambda w}$; that is, the function $g(0, w) = ce^{-\lambda w}$ satisfies Equation (2.2). The condition g(0, 0) = 1 implies that c = 1; thus

$$g(0, w) = e^{-\lambda w}.$$

If x is a positive integer, we take g(x, 0) = 0. The postulates imply that

$$g(x, w + h) = [g(x, w)][1 - \lambda h - o(h)] + [g(x - 1, w)][\lambda h + o(h)] + o(h).$$

Accordingly, we have

$$\frac{g(x, w + h) - g(x, w)}{h} = -\lambda g(x, w) + \lambda g(x - 1, w) + \frac{o(h)}{h}$$

and

$$D_w[g(x, w)] = -\lambda g(x, w) + \lambda g(x - 1, w),$$

for x = 1, 2, 3, .... It can be shown, by mathematical induction, that the solutions to these differential equations, with boundary conditions g(x, 0) = 0 for x = 1, 2, 3, ..., are, respectively,

$$g(x, w) = \frac{(\lambda w)^x e^{-\lambda w}}{x!}, \quad x = 1, 2, 3, \ldots.$$

Hence the number of changes X in an interval of length w has a Poisson distribution with parameter $m = \lambda w$.

The mgf of a Poisson distribution is given by

$$M(t) = \sum_x e^{tx}p(x) = \sum_{x=0}^{\infty} e^{tx}\frac{m^x e^{-m}}{x!} = e^{-m}\sum_{x=0}^{\infty} \frac{(me^t)^x}{x!} = e^{-m}e^{me^t} = e^{m(e^t - 1)}$$

for all real values of t. Since

$$M'(t) = e^{m(e^t - 1)}(me^t)$$

and

$$M''(t) = e^{m(e^t - 1)}(me^t) + e^{m(e^t - 1)}(me^t)^2,$$

then $\mu = M'(0) = m$ and $\sigma^2 = M''(0) - \mu^2 = m + m^2 - m^2 = m$. That is, a Poisson distribution has $\mu = \sigma^2 = m > 0$. On this account, a Poisson pmf is frequently written as

$$p(x) = \begin{cases} \dfrac{\mu^x e^{-\mu}}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{elsewhere.} \end{cases}$$

Thus the parameter m in a Poisson pmf is the mean $\mu$. Table I in Appendix: Tables of Distributions gives approximately the distribution for various values of the parameter $m = \mu$. On the other hand, if X has a Poisson distribution with parameter $m = \mu$, then the R command dpois(k,m) returns the value of P(X = k). The cumulative probability $P(X \le k)$ is given by ppois(k,m).

Example 2.1. Suppose that X has a Poisson distribution with $\mu = 2$. Then the pmf of X is

$$p(x) = \begin{cases} \dfrac{2^x e^{-2}}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{elsewhere.} \end{cases}$$

The variance of this distribution is $\sigma^2 = \mu = 2$. If we wish to compute $P(1 \le X)$, we have

$$P(1 \le X) = 1 - P(X = 0) = 1 - p(0) = 1 - e^{-2} = 0.865,$$

approximately, by Table I of Appendix: Tables of Distributions.
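For readers without R or the table at hand, the same quantities can be computed directly from the pmf. The following is a minimal Python sketch, not the R functions themselves; `poisson_pmf` and `poisson_cdf` are names chosen here to play the roles of dpois and ppois.

```python
from math import exp, factorial

def poisson_pmf(k, m):
    """P(X = k) for a Poisson distribution with mean m (the role of R's dpois)."""
    if k < 0:
        return 0.0
    return m**k * exp(-m) / factorial(k)

def poisson_cdf(k, m):
    """P(X <= k), obtained by summing the pmf (the role of R's ppois)."""
    return sum(poisson_pmf(x, m) for x in range(k + 1))

# Example 2.1: P(1 <= X) = 1 - P(X = 0) when the mean is 2.
p_at_least_one = 1 - poisson_pmf(0, 2)
print(round(p_at_least_one, 3))  # 0.865
```

Summing the pmf is adequate for the small values of k used in this section; a production routine would use a numerically stabler method for large means.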

Example 2.2. If the mgf of a random variable X is

$$M(t) = e^{4(e^t - 1)},$$

then X has a Poisson distribution with $\mu = 4$. Accordingly, by way of example,

$$P(X = 3) = \frac{4^3 e^{-4}}{3!} = \frac{32}{3}e^{-4},$$

or, by Table I, $P(X = 3) = P(X \le 3) - P(X \le 2) = 0.433 - 0.238 = 0.195$.

Example 2.3. Let the probability of exactly one blemish in 1 foot of wire be about $\tfrac{1}{1000}$ and let the probability of two or more blemishes in that length be, for all practical purposes, zero. Let the random variable X be the number of blemishes in 3000 feet of wire. If we assume the independence of the number of blemishes in nonoverlapping intervals, then the postulates of the Poisson process are approximated, with $\lambda = \tfrac{1}{1000}$ and w = 3000. Thus X has an approximate Poisson distribution with mean $3000(\tfrac{1}{1000}) = 3$. For example, the probability that there are five or more blemishes in 3000 feet of wire is

$$P(X \ge 5) = \sum_{k=5}^{\infty} \frac{3^k e^{-3}}{k!}$$

and by Table I, $P(X \ge 5) = 1 - P(X \le 4) = 1 - 0.815 = 0.185$, approximately.

The Poisson distribution satisfies the following important additive property.

Theorem 2.1. Suppose $X_1, \ldots, X_n$ are independent random variables and suppose $X_i$ has a Poisson distribution with parameter $m_i$. Then $Y = \sum_{i=1}^{n} X_i$ has a Poisson distribution with parameter $\sum_{i=1}^{n} m_i$.

Proof: We obtain the result by determining the mgf of Y, which is given by

$$M_Y(t) = E\left(e^{tY}\right) = \prod_{i=1}^{n} e^{m_i(e^t - 1)} = e^{\sum_{i=1}^{n} m_i (e^t - 1)}.$$

By the uniqueness of mgfs, we conclude that Y has a Poisson distribution with parameter $\sum_{i=1}^{n} m_i$.
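The additive property can also be seen numerically, without mgfs, by convolving two Poisson pmfs. The Python sketch below is only an illustration for n = 2; the helper names and the means 1.5 and 2.5 are arbitrary choices made here.

```python
from math import exp, factorial

def poisson_pmf(k, m):
    """P(X = k) for a Poisson distribution with mean m."""
    return m**k * exp(-m) / factorial(k)

def convolve_at(y, m1, m2):
    """P(X1 + X2 = y) for independent Poisson X1 and X2, by direct summation."""
    return sum(poisson_pmf(x, m1) * poisson_pmf(y - x, m2) for x in range(y + 1))

m1, m2 = 1.5, 2.5
# Largest discrepancy between the convolution and the Poisson(m1 + m2) pmf.
max_gap = max(abs(convolve_at(y, m1, m2) - poisson_pmf(y, m1 + m2)) for y in range(30))
print(max_gap)
```

The discrepancy is at the level of floating-point rounding, in agreement with the theorem; the mgf argument is what establishes the result for all n and all parameter values.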

Example 2.4 (Example 2.3, Continued). Suppose in Example 2.3 that a bale of wire consists of 3000 feet. Based on the information in the example, we expect three blemishes in a bale of wire, and the probability of five or more blemishes is 0.185. Suppose in a sampling plan, three bales of wire are selected at random and we compute the mean number of blemishes in the wire. Now suppose we want to determine the probability that the mean of the three observations is five or more blemishes. Let $X_i$ be the number of blemishes in the ith bale of wire for i = 1, 2, 3. Then $X_i$ has a Poisson distribution with parameter 3. The mean of $X_1$, $X_2$, and $X_3$ is $\bar{X} = 3^{-1}\sum_{i=1}^{3} X_i$, which can also be expressed as Y/3, where $Y = \sum_{i=1}^{3} X_i$. By the last theorem, because the bales are independent of one another, Y has a Poisson distribution with parameter $\sum_{i=1}^{3} 3 = 9$. Hence, by Table I, the desired probability is

$$P(\bar{X} \ge 5) = P(Y \ge 15) = 1 - P(Y \le 14) = 1 - 0.959 = 0.041.$$

Hence, while it is not too odd that a bale has five or more blemishes (probability is 0.185), it is unusual (probability is 0.041) that three independent bales of wire average five or more blemishes.

EXERCISES

2.1. If the random variable X has a Poisson distribution such that P(X = 1) = P(X = 2), find P(X = 4).

2.2. The mgf of a random variable X is $e^{4(e^t - 1)}$. Show that $P(\mu - 2\sigma < X < \mu + 2\sigma) = 0.931$.

μ1 , respectively. Find the distribution of X2 .

3 The Γ, χ², and β Distributions

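In this section we introduce the gamma (Γ), chi-square (χ²), and beta (β) distributions. It is proved in books on advanced calculus that the integral

$$\int_0^{\infty} y^{\alpha-1}e^{-y}\,dy$$

exists for $\alpha > 0$ and that the value of the integral is a positive number. The integral is called the gamma function of $\alpha$, and we write

$$\Gamma(\alpha) = \int_0^{\infty} y^{\alpha-1}e^{-y}\,dy.$$

If $\alpha = 1$, clearly

$$\Gamma(1) = \int_0^{\infty} e^{-y}\,dy = 1.$$

If $\alpha > 1$, an integration by parts shows that

$$\Gamma(\alpha) = (\alpha - 1)\int_0^{\infty} y^{\alpha-2}e^{-y}\,dy = (\alpha - 1)\Gamma(\alpha - 1).$$

Accordingly, if $\alpha$ is a positive integer greater than 1,

$$\Gamma(\alpha) = (\alpha - 1)(\alpha - 2)\cdots(3)(2)(1)\Gamma(1) = (\alpha - 1)!.$$

Since $\Gamma(1) = 1$, this suggests we take 0! = 1, as we have done.

In the integral that defines $\Gamma(\alpha)$, let us introduce a new variable by writing $y = x/\beta$, where $\beta > 0$. Then

$$\Gamma(\alpha) = \int_0^{\infty} \left(\frac{x}{\beta}\right)^{\alpha-1} e^{-x/\beta}\left(\frac{1}{\beta}\right)dx,$$

or, equivalently,

$$1 = \int_0^{\infty} \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1}e^{-x/\beta}\,dx.$$

Since $\alpha > 0$, $\beta > 0$, and $\Gamma(\alpha) > 0$, we see that

$$f(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\,x^{\alpha-1}e^{-x/\beta}$$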

0 w, for w > 0, is equivalent to the event in which there are fewer than k changes in a time interval of length w. That is, if the random variable X is the number of changes in an interval of length w, then P (W > w) =

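$$\sum_{x=0}^{k-1} P(X = x) = \sum_{x=0}^{k-1} \frac{(\lambda w)^x e^{-\lambda w}}{x!}.$$

In Exercise 3.5, the reader is asked to prove that

$$\int_{\lambda w}^{\infty} \frac{z^{k-1}e^{-z}}{(k-1)!}\,dz = \sum_{x=0}^{k-1} \frac{(\lambda w)^x e^{-\lambda w}}{x!}.$$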
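Before the proof is attempted, the identity can be checked numerically for particular values. The Python sketch below is an illustration only: it approximates the integral with Simpson's rule on a truncated range, and the function names, the truncation width, and the values k = 4, λ = 0.5, w = 3 are all arbitrary choices made here.

```python
from math import exp, factorial

def poisson_cdf(k, m):
    """Sum of the Poisson(m) pmf over x = 0, ..., k."""
    return sum(m**x * exp(-m) / factorial(x) for x in range(k + 1))

def gamma_upper_tail(k, lower, width=60.0, steps=20000):
    """Simpson's rule for the integral of z^(k-1) e^(-z) / (k-1)! over [lower, lower + width]."""
    def f(z):
        return z ** (k - 1) * exp(-z) / factorial(k - 1)
    h = width / steps  # steps must be even for Simpson's rule
    total = f(lower) + f(lower + width)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lower + i * h)
    return total * h / 3

k, lam, w = 4, 0.5, 3.0
lhs = gamma_upper_tail(k, lam * w)   # integral from lambda*w to (effectively) infinity
rhs = poisson_cdf(k - 1, lam * w)    # Poisson probability of fewer than k changes
print(lhs, rhs)
```

The two sides agree to many decimal places, which is consistent with, though of course no substitute for, the proof requested in Exercise 3.5.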

If, momentarily, we accept this result, we have, for w > 0,

$$G(w) = 1 - \int_{\lambda w}^{\infty} \frac{z^{k-1}e^{-z}}{\Gamma(k)}\,dz = \int_0^{\lambda w} \frac{z^{k-1}e^{-z}}{\Gamma(k)}\,dz,$$

and for $w \le 0$, G(w) = 0. If we change the variable of integration in the integral that defines G(w) by writing $z = \lambda y$, then

$$G(w) = \int_0^{w} \frac{\lambda^k y^{k-1}e^{-\lambda y}}{\Gamma(k)}\,dy, \quad w > 0,$$

and G(w) = 0 for $w \le 0$. Accordingly, the pdf of W is

$$g(w) = G'(w) = \frac{\lambda^k w^{k-1}e^{-\lambda w}}{\Gamma(k)}$$

0

0 0, then

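$$G(y) = \int_0^{\beta y/2} \frac{1}{\Gamma(r/2)\beta^{r/2}}\,x^{r/2-1}e^{-x/\beta}\,dx.$$

Accordingly, the pdf of Y is

$$g(y) = G'(y) = \frac{\beta/2}{\Gamma(r/2)\beta^{r/2}}\,(\beta y/2)^{r/2-1}e^{-y/2} = \frac{1}{\Gamma(r/2)2^{r/2}}\,y^{r/2-1}e^{-y/2}$$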

if y > 0. That is, Y is χ²(r).

One of the most important properties of the gamma distribution is its additive property.

Theorem 3.2. Let $X_1, \ldots, X_n$ be independent random variables. Suppose, for i = 1, ..., n, that $X_i$ has a $\Gamma(\alpha_i, \beta)$ distribution. Let $Y = \sum_{i=1}^{n} X_i$. Then Y has a $\Gamma(\sum_{i=1}^{n} \alpha_i, \beta)$ distribution.

Proof: Using the assumed independence and the mgf of a gamma distribution, we have that for $t < 1/\beta$,

$$M_Y(t) = \prod_{i=1}^{n} (1 - \beta t)^{-\alpha_i} = (1 - \beta t)^{-\sum_{i=1}^{n} \alpha_i},$$

which is the mgf of a $\Gamma(\sum_{i=1}^{n} \alpha_i, \beta)$ distribution.

In the sequel, we often use this property for the χ² distribution. For convenience, we state the result as a corollary, since here $\beta = 2$ and $\alpha_i = r_i/2$.

Corollary 3.1. Let $X_1, \ldots, X_n$ be independent random variables. Suppose, for i = 1, ..., n, that $X_i$ has a χ²($r_i$) distribution. Let $Y = \sum_{i=1}^{n} X_i$. Then Y has a χ²($\sum_{i=1}^{n} r_i$) distribution.

The following remark on Poisson processes proves useful in the simulation of these processes.

Remark 3.3 (Poisson Processes, Continued). We continue the discussion of Remark 3.1 concerning the Poisson process with parameter λ. Recall that the Poisson process is counting the number of occurrences of an event over an interval of time.

Let $T_1, T_2, T_3, \ldots$ denote the interarrival times of these events. For instance, $T_1$ is the time until the first occurrence, $T_2$ is the time between the first and second occurrences, and so on. From Remark 3.1, we know that $T_1$ has an exponential distribution with parameter λ. Note that Postulates (1) and (2) of the Poisson process only depend on λ and the length of the interval; in particular, they do not depend on the endpoints of the interval. Further, occurrences in nonoverlapping intervals are independent of one another. Hence, the same reasoning found in Remark 3.1 can be applied to show that $T_j$, $j \ge 2$, also has an exponential distribution with parameter λ and that, further, $T_1, T_2, T_3, \ldots$ are independent. Let $W_n$ be the waiting time until the nth occurrence. Then $W_n = T_1 + \cdots + T_n$. Thus by Theorem 3.2, $W_n$ has a $\Gamma(n, 1/\lambda)$ distribution (that is, $\alpha = n$ and $\beta = 1/\lambda$), confirming the derivation of its distribution given in Remark 3.1. Although this discussion has been intuitive, it can be made rigorous; see, for example, Parzen (1962).

We conclude this section with another important distribution called the beta distribution, which we derive from a pair of independent Γ random variables. Let $X_1$ and $X_2$ be two independent random variables that have Γ distributions and the joint pdf

$$h(x_1, x_2) = \frac{1}{\Gamma(\alpha)\Gamma(\beta)}\,x_1^{\alpha-1}x_2^{\beta-1}e^{-x_1-x_2}, \quad 0 < x_1 < \infty,\ 0 < x_2 < \infty,$$

zero elsewhere, where $\alpha > 0$, $\beta > 0$. Let $Y_1 = X_1 + X_2$ and $Y_2 = X_1/(X_1 + X_2)$. We next show that $Y_1$ and $Y_2$ are independent. The space S is, exclusive of the points on the coordinate axes, the first quadrant of the $x_1x_2$-plane. Now

$$y_1 = u_1(x_1, x_2) = x_1 + x_2, \qquad y_2 = u_2(x_1, x_2) = \frac{x_1}{x_1 + x_2}$$

may be written $x_1 = y_1y_2$, $x_2 = y_1(1 - y_2)$, so

$$J = \begin{vmatrix} y_2 & y_1 \\ 1 - y_2 & -y_1 \end{vmatrix} = -y_1 \not\equiv 0.$$

The transformation is one-to-one, and it maps S onto $T = \{(y_1, y_2) : 0 < y_1 < \infty,\ 0 < y_2 < 1\}$ in the $y_1y_2$-plane. The joint pdf of $Y_1$ and $Y_2$ is then

$$\begin{aligned}
g(y_1, y_2) &= (y_1)\,\frac{1}{\Gamma(\alpha)\Gamma(\beta)}\,(y_1y_2)^{\alpha-1}[y_1(1 - y_2)]^{\beta-1}e^{-y_1} \\
&= \begin{cases} \dfrac{y_2^{\alpha-1}(1 - y_2)^{\beta-1}}{\Gamma(\alpha)\Gamma(\beta)}\,y_1^{\alpha+\beta-1}e^{-y_1} & 0 < y_1 < \infty,\ 0 < y_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}
\end{aligned}$$

The random variables are independent. The marginal pdf of Y2 is

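$$\begin{aligned}
g_2(y_2) &= \frac{y_2^{\alpha-1}(1 - y_2)^{\beta-1}}{\Gamma(\alpha)\Gamma(\beta)}\int_0^{\infty} y_1^{\alpha+\beta-1}e^{-y_1}\,dy_1 \\
&= \begin{cases} \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,y_2^{\alpha-1}(1 - y_2)^{\beta-1} & 0 < y_2 < 1 \\ 0 & \text{elsewhere.} \end{cases}
\end{aligned} \tag{3.5}$$

This pdf is that of the beta distribution with parameters α and β. Since $g(y_1, y_2) \equiv g_1(y_1)g_2(y_2)$, it must be that the pdf of $Y_1$ is

$$g_1(y_1) = \begin{cases} \dfrac{1}{\Gamma(\alpha+\beta)}\,y_1^{\alpha+\beta-1}e^{-y_1} & 0 < y_1 < \infty \\ 0 & \text{elsewhere,} \end{cases}$$

which is that of a gamma distribution with parameter values of α + β and 1.

It is an easy exercise to show that the mean and the variance of $Y_2$, which has a beta distribution with parameters α and β, are, respectively,

$$\mu = \frac{\alpha}{\alpha+\beta}, \qquad \sigma^2 = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}.$$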
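The construction of the beta distribution from two independent gamma variables lends itself to a small simulation. The Python sketch below is an illustration under assumed values (α = 2, β = 3, a fixed seed, 20,000 draws, all chosen here); it uses the standard library's `random.gammavariate(shape, scale)` to generate $X_1$ and $X_2$ with scale 1 and compares the sample mean and variance of $X_1/(X_1 + X_2)$ with the formulas above.

```python
import random

random.seed(12345)  # fixed seed so the run is reproducible
alpha, beta, n = 2.0, 3.0, 20000

y2 = []
for _ in range(n):
    x1 = random.gammavariate(alpha, 1.0)  # gamma with shape alpha, scale 1
    x2 = random.gammavariate(beta, 1.0)   # gamma with shape beta, scale 1
    y2.append(x1 / (x1 + x2))

sample_mean = sum(y2) / n
sample_var = sum((v - sample_mean) ** 2 for v in y2) / (n - 1)

theory_mean = alpha / (alpha + beta)
theory_var = alpha * beta / ((alpha + beta + 1) * (alpha + beta) ** 2)
print(sample_mean, theory_mean, sample_var, theory_var)
```

With these values the theoretical mean and variance are 0.4 and 0.04, and the simulated values should fall close to them; the agreement is statistical, not exact.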

The program R calculates probabilities for the beta distribution. If X has a beta distribution with parameters α = a and β = b, then the command pbeta(x,a,b) returns $P(X \le x)$ and the command dbeta(x,a,b) returns the value of the pdf of X at x.

We close this section with another example of a random variable whose distribution is derived from a transformation of gamma random variables.

Example 3.7 (Dirichlet Distribution). Let $X_1, X_2, \ldots, X_{k+1}$ be independent random variables, each having a gamma distribution with β = 1. The joint pdf of these variables may be written as

$$h(x_1, x_2, \ldots, x_{k+1}) = \begin{cases} \prod_{i=1}^{k+1} \dfrac{1}{\Gamma(\alpha_i)}\,x_i^{\alpha_i-1}e^{-x_i} & 0 < x_i < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

Let

$$Y_i = \frac{X_i}{X_1 + X_2 + \cdots + X_{k+1}}, \quad i = 1, 2, \ldots, k,$$

and $Y_{k+1} = X_1 + X_2 + \cdots + X_{k+1}$ denote k + 1 new random variables. The associated transformation maps $A = \{(x_1, \ldots, x_{k+1}) : 0 < x_i < \infty,\ i = 1, \ldots, k + 1\}$ onto the space

$$B = \{(y_1, \ldots, y_k, y_{k+1}) : 0 < y_i,\ i = 1, \ldots, k,\ y_1 + \cdots + y_k < 1,\ 0 < y_{k+1} < \infty\}.$$

The single-valued inverse functions are $x_1 = y_1y_{k+1}, \ldots, x_k = y_ky_{k+1}$, $x_{k+1} = y_{k+1}(1 - y_1 - \cdots - y_k)$, so that the Jacobian is

$$J = \begin{vmatrix}
y_{k+1} & 0 & \cdots & 0 & y_1 \\
0 & y_{k+1} & \cdots & 0 & y_2 \\
\vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & y_{k+1} & y_k \\
-y_{k+1} & -y_{k+1} & \cdots & -y_{k+1} & (1 - y_1 - \cdots - y_k)
\end{vmatrix} = y_{k+1}^{k}.$$

Hence the joint pdf of $Y_1, \ldots, Y_k, Y_{k+1}$ is given by

$$\frac{y_{k+1}^{\alpha_1+\cdots+\alpha_{k+1}-1}\,y_1^{\alpha_1-1}\cdots y_k^{\alpha_k-1}(1 - y_1 - \cdots - y_k)^{\alpha_{k+1}-1}e^{-y_{k+1}}}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_k)\Gamma(\alpha_{k+1})},$$

provided that $(y_1, \ldots, y_k, y_{k+1}) \in B$ and is equal to zero elsewhere. By integrating out $y_{k+1}$, the joint pdf of $Y_1, \ldots, Y_k$ is seen to be

$$g(y_1, \ldots, y_k) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_{k+1})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{k+1})}\,y_1^{\alpha_1-1}\cdots y_k^{\alpha_k-1}(1 - y_1 - \cdots - y_k)^{\alpha_{k+1}-1}, \tag{3.6}$$

when $0 < y_i$, i = 1, ..., k, $y_1 + \cdots + y_k < 1$, while the function g is equal to zero elsewhere. Random variables $Y_1, \ldots, Y_k$ that have a joint pdf of this form are said to have a Dirichlet pdf. It is seen, in the special case of k = 1, that the Dirichlet pdf becomes a beta pdf. Moreover, it is also clear from the joint pdf of $Y_1, \ldots, Y_k, Y_{k+1}$ that $Y_{k+1}$ has a gamma distribution with parameters $\alpha_1 + \cdots + \alpha_k + \alpha_{k+1}$ and β = 1 and that $Y_{k+1}$ is independent of $Y_1, Y_2, \ldots, Y_k$.

EXERCISES

3.1. If $(1 - 2t)^{-6}$, $t < \tfrac{1}{2}$, is the mgf of the random variable X, find P(X < 5.23).

3.2. If X is χ²(5), determine the constants c and d so that P(c < X < d) = 0.95 and P(X < c) = 0.025.

3.3. Find P(3.28 < X < 25.2) if X has a gamma distribution with α = 3 and β = 4. Hint: Consider the probability of the equivalent event 1.64 < Y < 12.6, where Y = 2X/4 = X/2.

3.4. Let X be a random variable such that $E(X^m) = (m + 1)!\,2^m$, m = 1, 2, 3, .... Determine the mgf and the distribution of X.

3.5. Show that

$$\int_{\mu}^{\infty} \frac{1}{\Gamma(k)}\,z^{k-1}e^{-z}\,dz = \sum_{x=0}^{k-1} \frac{\mu^x e^{-\mu}}{x!}, \quad k = 1, 2, 3, \ldots.$$

This demonstrates the relationship between the cdfs of the gamma and Poisson distributions. Hint: Either integrate by parts k − 1 times or obtain the "antiderivative" by showing that

$$\frac{d}{dz}\left[-e^{-z}\sum_{j=0}^{k-1} \frac{\Gamma(k)}{(k - j - 1)!}\,z^{k-j-1}\right] = z^{k-1}e^{-z}.$$

3.6. Let $X_1$, $X_2$, and $X_3$ be iid random variables, each with pdf $f(x) = e^{-x}$, 0 < x < ∞, zero elsewhere.

(a) Find the distribution of Y = minimum($X_1, X_2, X_3$). Hint: $P(Y \le y) = 1 - P(Y > y) = 1 - P(X_i > y,\ i = 1, 2, 3)$.

(b) Find the distribution of Y = maximum($X_1, X_2, X_3$).

3.7. Let X have a gamma distribution with pdf

$$f(x) = \frac{1}{\beta^2}\,xe^{-x/\beta}, \quad 0 < x < \infty,$$

zero elsewhere. If x = 2 is the unique mode of the distribution, find the parameter β and P(X < 9.49).

3.8. Compute the measures of skewness and kurtosis of a gamma distribution which has parameters α and β.

3.9. Let X have a gamma distribution with parameters α and β. Show that $P(X \ge 2\alpha\beta) \le (2/e)^{\alpha}$.

3.10. Give a reasonable definition of a chi-square distribution with zero degrees of freedom. Hint: Work with the mgf of a distribution that is χ²(r) and let r = 0.

3.11. Using the computer, obtain plots of the pdfs of chi-squared distributions with degrees of freedom r = 1, 2, 5, 10, 20. Comment on the plots.

3.12. Using the computer, plot the cdf of Γ(5, 4) and use it to guess the median. Confirm it with a computer command which returns the median [In R, use the command qgamma(.5,shape=5,scale=4)].

3.13. Using the computer, obtain plots of beta pdfs for α = 1, 5, 10 and β = 1, 2, 5, 10, 20.

3.14. In the Poisson postulates of Remark 2.1, let λ be a nonnegative function of w, say λ(w), such that $D_w[g(0, w)] = -\lambda(w)g(0, w)$. Suppose that $\lambda(w) = krw^{r-1}$, $r \ge 1$.

(a) Find g(0, w), using the boundary condition g(0, 0) = 1.

(b) Let W be the time that is needed to obtain exactly one change. Find the distribution function of W, i.e., $G(w) = P(W \le w) = 1 - P(W > w) = 1 - g(0, w)$, $0 \le w$, and then find the pdf of W. This pdf is that of the Weibull distribution, which is used in the study of breaking strengths of materials.

3.15. Let X have a Poisson distribution with parameter m. If m is an experimental value of a random variable having a gamma distribution with α = 2 and β = 1, compute P(X = 0, 1, 2). Hint: Find an expression that represents the joint distribution of X and m. Then integrate out m to find the marginal distribution of X.

167

3.16. Let $X$ have the uniform distribution with pdf $f(x) = 1$, $0 < x < 1$, zero elsewhere. Find the cdf of $Y = -2 \log X$. What is the pdf of $Y$?

3.17. Find the uniform distribution of the continuous type on the interval $(b, c)$ that has the same mean and the same variance as those of a chi-square distribution with 8 degrees of freedom. That is, find $b$ and $c$.

3.18. Find the mean and variance of the $\beta$ distribution. Hint: From the pdf, we know that

$$\int_0^1 y^{\alpha-1}(1-y)^{\beta-1}\,dy = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

for all $\alpha > 0$, $\beta > 0$.

3.19. Determine the constant $c$ in each of the following so that each $f(x)$ is a $\beta$ pdf:

(a) $f(x) = cx(1-x)^3$, $0 < x < 1$, zero elsewhere.

(b) $f(x) = cx^4(1-x)^5$, $0 < x < 1$, zero elsewhere.

(c) $f(x) = cx^2(1-x)^8$, $0 < x < 1$, zero elsewhere.

3.20. Determine the constant $c$ so that $f(x) = cx(3-x)^4$, $0 < x < 3$, zero elsewhere, is a pdf.

3.21. Show that the graph of the $\beta$ pdf is symmetric about the vertical line through $x = \frac{1}{2}$ if $\alpha = \beta$.

3.22. Show, for $k = 1, 2, \ldots, n$, that

$$\int_p^1 \frac{n!}{(k-1)!\,(n-k)!}\, z^{k-1}(1-z)^{n-k}\,dz = \sum_{x=0}^{k-1} \binom{n}{x} p^x (1-p)^{n-x}.$$

This demonstrates the relationship between the cdfs of the $\beta$ and binomial distributions.

3.23. Let $X_1$ and $X_2$ be independent random variables. Let $X_1$ and $Y = X_1 + X_2$ have chi-square distributions with $r_1$ and $r$ degrees of freedom, respectively. Here $r_1 < r$. Show that $X_2$ has a chi-square distribution with $r - r_1$ degrees of freedom. Hint: Write $M(t) = E(e^{t(X_1 + X_2)})$ and make use of the independence of $X_1$ and $X_2$.

3.24. Let $X_1$, $X_2$ be two independent random variables having gamma distributions with parameters $\alpha_1 = 3$, $\beta_1 = 3$ and $\alpha_2 = 5$, $\beta_2 = 1$, respectively.

(a) Find the mgf of $Y = 2X_1 + 6X_2$.

(b) What is the distribution of $Y$?

3.25. Let $X$ have an exponential distribution.

(a) For $x > 0$ and $y > 0$, show that

$$P(X > x + y \mid X > x) = P(X > y). \tag{3.7}$$

Hence, the exponential distribution has the memoryless property. Recall from (1.9) that the discrete geometric distribution had a similar property.

(b) Let $F(x)$ be the cdf of a continuous random variable $Y$. Assume that $F(0) = 0$ and $0 < F(y) < 1$ for $y > 0$. Suppose property (3.7) holds for $Y$. Show that $F_Y(y) = 1 - e^{-\lambda y}$ for $y > 0$. Hint: Show that $g(y) = 1 - F_Y(y)$ satisfies the equation $g(y + z) = g(y)g(z)$.

3.26. Consider a random variable $X$ of the continuous type with cdf $F(x)$ and pdf $f(x)$. The hazard rate (or failure rate or force of mortality) is defined by

$$r(x) = \lim_{\Delta \to 0} \frac{P(x \le X < x + \Delta \mid X \ge x)}{\Delta}. \tag{3.8}$$

In the case that $X$ represents the failure time of an item, the above conditional probability represents the failure of an item in the interval $[x, x + \Delta]$ given that it has survived until time $x$. Viewed this way, $r(x)$ is the rate of instantaneous failure at time $x > 0$.

(a) Show that $r(x) = f(x)/(1 - F(x))$.

(b) If $r(x) = c$, where $c$ is a positive constant, show that the underlying distribution is exponential. Hence, exponential distributions have constant failure rates over all time.

(c) If $r(x) = cx^b$, where $c$ and $b$ are positive constants, show that $X$ has a Weibull distribution; i.e.,

$$f(x) = \begin{cases} c x^b \exp\left\{-\dfrac{c x^{b+1}}{b+1}\right\} & 0 < x < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

[…]

Suppose we need to compute $\Phi(-z)$, where $z > 0$. Because the pdf of $Z$ is symmetric about 0, we have

$$\Phi(-z) = 1 - \Phi(z); \tag{4.11}$$

see Exercise 4.1. In the examples ahead, we illustrate the computation of normal probabilities and quantiles. Most computer packages offer functions for computation of these probabilities. For example, the R command pnorm(x,a,b) calculates $P(X \le x)$ when $X$ has a normal distribution with mean $a$ and standard deviation $b$, while the command dnorm(x,a,b) returns the value of the pdf of $X$ at $x$.

Figure 4.2: The standard normal density: $p = \Phi(z_p)$ is the area under the curve to the left of $z_p$.

Example 4.3. Let $X$ be $N(2, 25)$. Then, by Table III,

$$P(0 < X < 10) = \Phi\left(\frac{10-2}{5}\right) - \Phi\left(\frac{0-2}{5}\right) = \Phi(1.6) - \Phi(-0.4) = 0.945 - (1 - 0.655) = 0.600$$

and

$$P(-8 < X < 1) = \Phi\left(\frac{1-2}{5}\right) - \Phi\left(\frac{-8-2}{5}\right) = \Phi(-0.2) - \Phi(-2) = (1 - 0.579) - (1 - 0.977) = 0.398.$$

Example 4.4. Let $X$ be $N(\mu, \sigma^2)$. Then, by Table III,

$$P(\mu - 2\sigma < X < \mu + 2\sigma) = \Phi\left(\frac{\mu + 2\sigma - \mu}{\sigma}\right) - \Phi\left(\frac{\mu - 2\sigma - \mu}{\sigma}\right) = \Phi(2) - \Phi(-2) = 0.977 - (1 - 0.977) = 0.954.$$
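The table lookups in Examples 4.3 and 4.4 can be checked with the pnorm command described earlier. The following is a minimal sketch in R; the variable names p1, p2, and p3 are illustrative only, and pnorm takes the standard deviation (here 5), not the variance, as its third argument.

```r
# Example 4.3: X is N(2, 25), so the standard deviation is 5.
p1 <- pnorm(10, mean = 2, sd = 5) - pnorm(0, mean = 2, sd = 5)   # P(0 < X < 10)
p2 <- pnorm(1, mean = 2, sd = 5) - pnorm(-8, mean = 2, sd = 5)   # P(-8 < X < 1)

# Example 4.4: the probability of falling within two standard deviations
# of the mean does not depend on mu or sigma.
p3 <- pnorm(2) - pnorm(-2)

print(round(c(p1, p2, p3), 3))
```

The computed values should agree with the table-based answers up to the three-decimal rounding used in Table III.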

Example 4.5. Suppose that 10% of the probability for a certain distribution that is $N(\mu, \sigma^2)$ is below 60 and that 5% is above 90. What are the values of $\mu$ and $\sigma$? We are given that the random variable $X$ is $N(\mu, \sigma^2)$ and that $P(X \le 60) = 0.10$ and $P(X \le 90) = 0.95$. Thus $\Phi[(60 - \mu)/\sigma] = 0.10$ and $\Phi[(90 - \mu)/\sigma] = 0.95$. From Table III we have

$$\frac{60 - \mu}{\sigma} = -1.28, \qquad \frac{90 - \mu}{\sigma} = 1.64.$$

These conditions require that $\mu = 73.1$ and $\sigma = 10.2$ approximately.

Remark 4.1. In this chapter we have illustrated three types of parameters associated with distributions. The mean $\mu$ of $N(\mu, \sigma^2)$ is called a location parameter because changing its value simply changes the location of the middle of the normal pdf; that is, the graph of the pdf looks exactly the same except for a shift in location. The standard deviation $\sigma$ of $N(\mu, \sigma^2)$ is called a scale parameter because changing its value changes the spread of the distribution. That is, a small value of $\sigma$ requires the graph of the normal pdf to be tall and narrow, while a large value of $\sigma$ requires it to spread out and not be so tall. No matter what the values of $\mu$ and $\sigma$, however, the graph of the normal pdf is that familiar “bell shape.” Incidentally, the $\beta$ of the gamma distribution is also a scale parameter. On the other hand, the $\alpha$ of the gamma distribution is called a shape parameter, as changing its value modifies the shape of the graph of the pdf, as can be seen by referring to Figure 3.1. The parameters $p$ and $\mu$ of the binomial and Poisson distributions, respectively, are also shape parameters.

We close this part of the section with two important theorems.

Theorem 4.1. If the random variable $X$ is $N(\mu, \sigma^2)$, $\sigma^2 > 0$, then the random variable $V = (X - \mu)^2/\sigma^2$ is $\chi^2(1)$.

Proof. Because $V = W^2$, where $W = (X - \mu)/\sigma$ is $N(0, 1)$, the cdf $G(v)$ for $V$ is, for $v \ge 0$,

$$G(v) = P(W^2 \le v) = P(-\sqrt{v} \le W \le \sqrt{v}).$$

That is,

$$G(v) = 2\int_0^{\sqrt{v}} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\,dw, \quad 0 \le v,$$

and $G(v) = 0$, $v < 0$.

If we change the variable of integration by writing $w = \sqrt{y}$, then

$$G(v) = \int_0^{v} \frac{1}{\sqrt{2\pi}\sqrt{y}}\, e^{-y/2}\,dy, \quad 0 \le v.$$

Hence the pdf $g(v) = G'(v)$ of the continuous-type random variable $V$ is

$$g(v) = \begin{cases} \dfrac{1}{\sqrt{\pi}\sqrt{2}}\, v^{1/2 - 1} e^{-v/2} & 0 < v < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

Since $g(v)$ is a pdf and hence

$$\int_0^{\infty} g(v)\,dv = 1,$$

it must be that $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$ and thus $V$ is $\chi^2(1)$.
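As a numerical sanity check on Theorem 4.1, the cdf and pdf obtained in the proof can be compared with R's built-in chi-square functions. This is only a sketch: the grid of points v and the variable names are arbitrary choices made for illustration.

```r
# Theorem 4.1: if W is N(0, 1), then V = W^2 is chi-square with 1 degree of freedom.
v <- c(0.1, 0.5, 1, 2, 3.84, 6.63)   # arbitrary evaluation points

# cdf from the proof: G(v) = P(-sqrt(v) <= W <= sqrt(v)) = 2 * Phi(sqrt(v)) - 1
G_proof <- 2 * pnorm(sqrt(v)) - 1
G_chisq <- pchisq(v, df = 1)

# pdf from the proof, which relies on Gamma(1/2) = sqrt(pi)
g_proof <- v^(1/2 - 1) * exp(-v / 2) / (sqrt(pi) * sqrt(2))
g_chisq <- dchisq(v, df = 1)

# Both maximum absolute differences should be at the level of rounding error.
print(max(abs(G_proof - G_chisq)))
print(max(abs(g_proof - g_chisq)))
```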

One of the most important properties of the normal distribution is its additivity under independence.

Theorem 4.2. Let $X_1, \ldots, X_n$ be independent random variables such that, for $i = 1, \ldots, n$, $X_i$ has a $N(\mu_i, \sigma_i^2)$ distribution. Let $Y = \sum_{i=1}^n a_i X_i$, where $a_1, \ldots, a_n$ are constants. Then the distribution of $Y$ is $N\left(\sum_{i=1}^n a_i \mu_i, \sum_{i=1}^n a_i^2 \sigma_i^2\right)$.

Proof: For $t \in R$, the mgf of $Y$ is

$$M_Y(t) = \prod_{i=1}^n \exp\left\{t a_i \mu_i + (1/2) t^2 a_i^2 \sigma_i^2\right\} = \exp\left\{t \sum_{i=1}^n a_i \mu_i + (1/2) t^2 \sum_{i=1}^n a_i^2 \sigma_i^2\right\},$$

which is the mgf of a $N\left(\sum_{i=1}^n a_i \mu_i, \sum_{i=1}^n a_i^2 \sigma_i^2\right)$ distribution.

A simple corollary to this result gives the distribution of the sample mean $\overline{X} = n^{-1} \sum_{i=1}^n X_i$ when $X_1, X_2, \ldots, X_n$ represents a random sample from a $N(\mu, \sigma^2)$.

Corollary 4.2. Let $X_1, \ldots, X_n$ be iid random variables with a common $N(\mu, \sigma^2)$ distribution. Let $\overline{X} = n^{-1} \sum_{i=1}^n X_i$. Then $\overline{X}$ has a $N(\mu, \sigma^2/n)$ distribution.

To prove this corollary, simply take $a_i = (1/n)$, $\mu_i = \mu$, and $\sigma_i^2 = \sigma^2$, for $i = 1, 2, \ldots, n$, in Theorem 4.2.
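The parameter bookkeeping in Theorem 4.2 and its corollary is easy to carry out in R. The sketch below uses made-up constants, means, and variances (none of them come from the text) purely to show the arithmetic.

```r
# Hypothetical inputs: three independent normals X_i ~ N(mu_i, sigma_i^2).
a      <- c(2, -1, 0.5)   # constants a_1, a_2, a_3
mu     <- c(1, 3, 4)      # means mu_i
sigma2 <- c(1, 4, 16)     # variances sigma_i^2

# Theorem 4.2: Y = sum(a_i X_i) is N(sum(a_i mu_i), sum(a_i^2 sigma_i^2)).
mean_Y <- sum(a * mu)
var_Y  <- sum(a^2 * sigma2)

# P(Y <= 3); pnorm expects the standard deviation, not the variance.
p <- pnorm(3, mean = mean_Y, sd = sqrt(var_Y))

# Corollary 4.2 as the special case a_i = 1/n with a common variance of 4.
n        <- 25
var_mean <- sum(rep(1 / n, n)^2 * rep(4, n))   # reduces to 4 / n

print(c(mean_Y = mean_Y, var_Y = var_Y, p = p, var_mean = var_mean))
```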

4.1 Contaminated Normals

We next discuss a random variable whose distribution is a mixture of normals. As with the normal, we begin with a standardized random variable. Suppose we are observing a random variable that most of the time follows a standard normal distribution but occasionally follows a normal distribution with a larger variance. In applications, we might say that most of the data are “good” but that there are occasional outliers. To make this precise let $Z$ have a $N(0, 1)$ distribution; let $I_{1-\epsilon}$ be a discrete random variable defined by

$$I_{1-\epsilon} = \begin{cases} 1 & \text{with probability } 1 - \epsilon \\ 0 & \text{with probability } \epsilon, \end{cases}$$

and assume that $Z$ and $I_{1-\epsilon}$ are independent. Let $W = Z I_{1-\epsilon} + \sigma_c Z (1 - I_{1-\epsilon})$. Then $W$ is the random variable of interest.

The independence of $Z$ and $I_{1-\epsilon}$ implies that the cdf of $W$ is

$$\begin{aligned} F_W(w) = P[W \le w] &= P[W \le w, I_{1-\epsilon} = 1] + P[W \le w, I_{1-\epsilon} = 0] \\ &= P[W \le w \mid I_{1-\epsilon} = 1] P[I_{1-\epsilon} = 1] + P[W \le w \mid I_{1-\epsilon} = 0] P[I_{1-\epsilon} = 0] \\ &= P[Z \le w](1 - \epsilon) + P[Z \le w/\sigma_c]\,\epsilon \\ &= \Phi(w)(1 - \epsilon) + \Phi(w/\sigma_c)\,\epsilon. \end{aligned} \tag{4.12}$$
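The construction of $W$ and the resulting mixture cdf can be illustrated by simulation. The following is a rough sketch, not part of the derivation: the seed, the sample size, and the values eps = 0.15 and sigc = 10 (standing for the contamination probability and $\sigma_c$) are arbitrary choices.

```r
set.seed(1)                  # arbitrary seed so the run is reproducible
n    <- 100000               # number of simulated values
eps  <- 0.15                 # illustrative contamination probability
sigc <- 10                   # illustrative contaminating standard deviation

Z <- rnorm(n)                                # Z ~ N(0, 1)
I <- rbinom(n, size = 1, prob = 1 - eps)     # I = 1 with probability 1 - eps
W <- Z * I + sigc * Z * (1 - I)              # the contaminated normal variable

w         <- 2
empirical <- mean(W <= w)                                   # simulated P(W <= w)
mixture   <- (1 - eps) * pnorm(w) + eps * pnorm(w / sigc)   # mixture cdf at w
print(c(empirical = empirical, mixture = mixture))
```

With a sample this large the two numbers should differ only in the third decimal place or so, which is consistent with the cdf being a mixture of two normal cdfs.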

Therefore, we have shown that the distribution of $W$ is a mixture of normals. Further, because $W = Z I_{1-\epsilon} + \sigma_c Z (1 - I_{1-\epsilon})$, we have

$$E(W) = 0 \quad \text{and} \quad \mathrm{Var}(W) = 1 + \epsilon(\sigma_c^2 - 1); \tag{4.13}$$

see Exercise 4.25. Upon differentiating (4.12), the pdf of $W$ is

$$f_W(w) = \phi(w)(1 - \epsilon) + \phi(w/\sigma_c)\,\frac{\epsilon}{\sigma_c}, \tag{4.14}$$

where $\phi$ is the pdf of a standard normal.

Suppose, in general, that the random variable of interest is $X = a + bW$, where $b > 0$. Based on (4.13), the mean and variance of $X$ are

$$E(X) = a \quad \text{and} \quad \mathrm{Var}(X) = b^2(1 + \epsilon(\sigma_c^2 - 1)). \tag{4.15}$$

From expression (4.12), the cdf of $X$ is

$$F_X(x) = \Phi\left(\frac{x - a}{b}\right)(1 - \epsilon) + \Phi\left(\frac{x - a}{b\sigma_c}\right)\epsilon, \tag{4.16}$$

which is a mixture of normal cdfs.

Based on expression (4.16) it is easy to obtain probabilities for contaminated normal distributions using R. For example, suppose, as above, $W$ has cdf (4.12). Then $P(W \le w)$ is obtained by the R command (1-eps)*pnorm(w) + eps*pnorm(w/sigc), where eps and sigc denote $\epsilon$ and $\sigma_c$, respectively. Similarly, the pdf of $W$ at $w$ is returned by (1-eps)*dnorm(w) + eps*dnorm(w/sigc)/sigc. In Section 7, we explore mixture distributions in general.

EXERCISES

4.1. If

$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\,dw,$$

show that $\Phi(-z) = 1 - \Phi(z)$.

4.2. If $X$ is $N(75, 100)$, find $P(X < 60)$ and $P(70 < X < 100)$ by using either Table III or, if R is available, the command pnorm.

4.3. If $X$ is $N(\mu, \sigma^2)$, find $b$ so that $P[-b < (X - \mu)/\sigma < b] = 0.90$, by using either Table III of Appendix: Tables of Distributions or, if R is available, the command pnorm.

4.4. Let $X$ be $N(\mu, \sigma^2)$ so that $P(X < 89) = 0.90$ and $P(X < 94) = 0.95$. Find $\mu$ and $\sigma^2$.

4.5. Show that the constant $c$ can be selected so that $f(x) = c\,2^{-x^2}$, $-\infty < x < \infty$, satisfies the conditions of a normal pdf. Hint: Write $2 = e^{\log 2}$.

4.6. If $X$ is $N(\mu, \sigma^2)$, show that $E(|X - \mu|) = \sigma\sqrt{2/\pi}$.

4.7. Show that the graph of a pdf $N(\mu, \sigma^2)$ has points of inflection at $x = \mu - \sigma$ and $x = \mu + \sigma$.

4.8. Evaluate $\int_2^3 \exp[-2(x - 3)^2]\,dx$.

4.9. Determine the 90th percentile of the distribution, which is $N(65, 25)$.

4.10. If $e^{3t + 8t^2}$ is the mgf of the random variable $X$, find $P(-1 < X < 9)$.

4.11. Let the random variable $X$ have the pdf

$$f(x) = \frac{2}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad 0 < x < \infty, \quad \text{zero elsewhere.}$$

Find the mean and the variance of $X$. Hint: Compute $E(X)$ directly and $E(X^2)$ by comparing the integral with the integral representing the variance of a random variable that is $N(0, 1)$.

4.12. Let $X$ be $N(5, 10)$. Find $P[0.04 < (X - 5)^2 < 38.4]$.

4.13. If $X$ is $N(1, 4)$, compute the probability $P(1 < X^2 < 9)$.

4.14. If $X$ is $N(75, 25)$, find the conditional probability that $X$ is greater than 80 given that $X$ is greater than 77.

4.15. Let $X$ be a random variable such that $E(X^{2m}) = (2m)!/(2^m m!)$, $m = 1, 2, 3, \ldots$ and $E(X^{2m-1}) = 0$, $m = 1, 2, 3, \ldots$. Find the mgf and the pdf of $X$.

4.16. Let the mutually independent random variables $X_1$, $X_2$, and $X_3$ be $N(0, 1)$, $N(2, 4)$, and $N(-1, 1)$, respectively. Compute the probability that exactly two of these three variables are less than zero.

4.17. Let $X$ have a $N(\mu, \sigma^2)$ distribution. Use expression (4.9) to derive the third and fourth moments of $X$.

4.18. Compute the measures of skewness and kurtosis of a distribution which is $N(\mu, \sigma^2)$.

4.19. Let the random variable $X$ have a distribution that is $N(\mu, \sigma^2)$.

(a) Does the random variable $Y = X^2$ also have a normal distribution?

(b) Would the random variable $Y = aX + b$, $a$ and $b$ nonzero constants, have a normal distribution? Hint: In each case, first determine $P(Y \le y)$.

4.20. Let the random variable $X$ be $N(\mu, \sigma^2)$. What would this distribution be if $\sigma^2 = 0$? Hint: Look at the mgf of $X$ for $\sigma^2 > 0$ and investigate its limit as $\sigma^2 \to 0$.

4.21. Let $Y$ have a truncated distribution with pdf $g(y) = \phi(y)/[\Phi(b) - \Phi(a)]$, for $a < y < b$, zero elsewhere, where $\phi(x)$ and $\Phi(x)$ are, respectively, the pdf and distribution function of a standard normal distribution. Show then that $E(Y)$ is equal to $[\phi(a) - \phi(b)]/[\Phi(b) - \Phi(a)]$.

4.22. Let $f(x)$ and $F(x)$ be the pdf and the cdf, respectively, of a distribution of the continuous type such that $f'(x)$ exists for all $x$. Let the mean of the truncated distribution that has pdf $g(y) = f(y)/F(b)$, $-\infty < y < b$, zero elsewhere, be equal to $-f(b)/F(b)$ for all real $b$. Prove that $f(x)$ is a pdf of a standard normal distribution.

4.23. Let $X$ and $Y$ be independent random variables, each with a distribution that is $N(0, 1)$. Let $Z = X + Y$. Find the integral that represents the cdf $G(z) = P(X + Y \le z)$ of $Z$. Determine the pdf of $Z$. Hint: We have that $G(z) = \int_{-\infty}^{\infty} H(x, z)\,dx$, where

$$H(x, z) = \int_{-\infty}^{z - x} \frac{1}{2\pi} \exp[-(x^2 + y^2)/2]\,dy.$$

Find $G'(z)$ by evaluating $\int_{-\infty}^{\infty} [\partial H(x, z)/\partial z]\,dx$.

4.24. Suppose $X$ is a random variable with the pdf $f(x)$ which is symmetric about 0; i.e., $f(-x) = f(x)$. Show that $F(-x) = 1 - F(x)$, for all $x$ in the support of $X$.

4.25. Derive the mean and variance of a contaminated normal random variable. They are given in expression (4.13).

4.26. Assuming a computer is available, investigate the probabilities of an “outlier” for a contaminated normal random variable and a normal random variable. Specifically, determine the probability of observing the event $\{|X| \ge 2\}$ for the following random variables:

(a) $X$ has a standard normal distribution.

(b) $X$ has a contaminated normal distribution with cdf (4.12), where $\epsilon = 0.15$ and $\sigma_c = 10$.

(c) $X$ has a contaminated normal distribution with cdf (4.12), where $\epsilon = 0.15$ and $\sigma_c = 20$.

(d) $X$ has a contaminated normal distribution with cdf (4.12), where $\epsilon = 0.25$ and $\sigma_c = 20$.

4.27. Assuming a computer is available, plot the pdfs of the random variables defined in parts (a)–(d) of the last exercise. Obtain an overlay plot of all four pdfs also. In R the domain values of the pdfs can easily be obtained by using the seq command. For instance, the command x […]

[…] $X_2)$. Hint: Write $P(X_1 > X_2) = P(X_1 - X_2 > 0)$ and determine the distribution of $X_1 - X_2$.

4.29. Compute $P(X_1 + 2X_2 - 2X_3 > 7)$ if $X_1$, $X_2$, $X_3$ are iid with common distribution $N(1, 4)$.

4.30. A certain job is completed in three steps in series. The means and standard deviations for the steps are (in minutes)

Step    Mean    Standard Deviation
1       17      2
2       13      1
3       13      2

Assuming independent steps and normal distributions, compute the probability that the job takes less than 40 minutes to complete.

4.31. Let $X$ be $N(0, 1)$. Use the moment generating function technique to show that $Y = X^2$ is $\chi^2(1)$. Hint: Evaluate the integral that represents $E(e^{tX^2})$ by writing $w = x\sqrt{1 - 2t}$, $t < \frac{1}{2}$.

4.32. Suppose $X_1$, $X_2$ are iid with a common standard normal distribution. Find the joint pdf of $Y_1 = X_1^2 + X_2^2$ and $Y_2 = X_2$ and the marginal pdf of $Y_1$. Hint: Note that the space of $Y_1$ and $Y_2$ is given by $-\sqrt{y_1} < y_2 < \sqrt{y_1}$, $0 < y_1 < \infty$.

5 The Multivariate Normal Distribution

In this section we present the multivariate normal distribution. We introduce it in general for an $n$-dimensional random vector, but we offer detailed examples for the bivariate case when $n = 2$. As with Section 4 on the normal distribution, the derivation of the distribution is simplified by first discussing the standard case and then proceeding to the general case. Also, vector and matrix notation is used.

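Consider the random vector $\mathbf{Z} = (Z_1, \ldots, Z_n)'$, where $Z_1, \ldots, Z_n$ are iid $N(0, 1)$ random variables. Then the density of $\mathbf{Z}$ is

$$f_{\mathbf{Z}}(\mathbf{z}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2} z_i^2\right\} = \left(\frac{1}{2\pi}\right)^{n/2} \exp\left\{-\frac{1}{2} \sum_{i=1}^n z_i^2\right\} = \left(\frac{1}{2\pi}\right)^{n/2} \exp\left\{-\frac{1}{2} \mathbf{z}'\mathbf{z}\right\}, \tag{5.1}$$

for $\mathbf{z} \in R^n$. Because the $Z_i$s have mean 0, have variance 1, and are uncorrelated, the mean and covariance matrix of $\mathbf{Z}$ are

$$E[\mathbf{Z}] = \mathbf{0} \quad \text{and} \quad \mathrm{Cov}[\mathbf{Z}] = \mathbf{I}_n, \tag{5.2}$$

where $\mathbf{I}_n$ denotes the identity matrix of order $n$. Recall that the mgf of $Z_i$ evaluated at $t_i$ is $\exp\{t_i^2/2\}$. Hence, because the $Z_i$s are independent, the mgf of $\mathbf{Z}$ is

$$M_{\mathbf{Z}}(\mathbf{t}) = E[\exp\{\mathbf{t}'\mathbf{Z}\}] = E\left[\prod_{i=1}^n \exp\{t_i Z_i\}\right] = \prod_{i=1}^n E[\exp\{t_i Z_i\}] = \exp\left\{\frac{1}{2} \sum_{i=1}^n t_i^2\right\} = \exp\left\{\frac{1}{2} \mathbf{t}'\mathbf{t}\right\}, \tag{5.3}$$

for all $\mathbf{t} \in R^n$. We say that $\mathbf{Z}$ has a multivariate normal distribution with mean vector $\mathbf{0}$ and covariance matrix $\mathbf{I}_n$. We abbreviate this by saying that $\mathbf{Z}$ has an $N_n(\mathbf{0}, \mathbf{I}_n)$ distribution.

For the general case, suppose $\boldsymbol{\Sigma}$ is an $n \times n$, symmetric, and positive semi-definite matrix. Then from linear algebra, we can always decompose $\boldsymbol{\Sigma}$ as

$$\boldsymbol{\Sigma} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma}, \tag{5.4}$$

where $\boldsymbol{\Lambda}$ is the diagonal matrix $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$ are the eigenvalues of $\boldsymbol{\Sigma}$, and the columns of $\boldsymbol{\Gamma}'$, $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$, are the corresponding eigenvectors. This decomposition is called the spectral decomposition of $\boldsymbol{\Sigma}$. The matrix $\boldsymbol{\Gamma}$ is orthogonal, i.e., $\boldsymbol{\Gamma}^{-1} = \boldsymbol{\Gamma}'$, and, hence, $\boldsymbol{\Gamma}\boldsymbol{\Gamma}' = \mathbf{I}$. As Exercise 5.19 shows, we can write the spectral decomposition in another way, as

$$\boldsymbol{\Sigma} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma} = \sum_{i=1}^n \lambda_i \mathbf{v}_i \mathbf{v}_i'. \tag{5.5}$$

Because the $\lambda_i$s are nonnegative, we can define the diagonal matrix $\boldsymbol{\Lambda}^{1/2} = \mathrm{diag}\{\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}\}$. Then the orthogonality of $\boldsymbol{\Gamma}$ implies

$$\boldsymbol{\Sigma} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{1/2}\boldsymbol{\Gamma}\boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{1/2}\boldsymbol{\Gamma}.$$

Define the square root of the positive semi-definite matrix $\boldsymbol{\Sigma}$ as

$$\boldsymbol{\Sigma}^{1/2} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{1/2}\boldsymbol{\Gamma}. \tag{5.6}$$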
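The spectral decomposition and the matrix square root can be computed with R's eigen function. The sketch below uses an arbitrary $2 \times 2$ covariance matrix chosen for illustration; note that eigen returns the eigenvectors as columns, so its `vectors` component plays the role of $\boldsymbol{\Gamma}'$ in the notation above.

```r
# Illustrative covariance matrix (symmetric, positive definite).
Sigma <- matrix(c(4, 2,
                  2, 3), nrow = 2, byrow = TRUE)

e      <- eigen(Sigma, symmetric = TRUE)
lambda <- e$values     # eigenvalues, returned in decreasing order
V      <- e$vectors    # columns are the eigenvectors, so V corresponds to Gamma'

# Spectral decomposition: Sigma = Gamma' Lambda Gamma
Sigma_rebuilt <- V %*% diag(lambda) %*% t(V)

# Square root: Sigma^{1/2} = Gamma' Lambda^{1/2} Gamma
Sigma_half <- V %*% diag(sqrt(lambda)) %*% t(V)

print(all.equal(Sigma_rebuilt, Sigma))
print(all.equal(Sigma_half %*% Sigma_half, Sigma))
```

Both comparisons should report TRUE, up to the default numerical tolerance of all.equal.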

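Note that $\boldsymbol{\Sigma}^{1/2}$ is symmetric and positive semi-definite. Suppose $\boldsymbol{\Sigma}$ is positive definite; that is, all of its eigenvalues are strictly positive. Based on this, it is then easy to show that

$$\left(\boldsymbol{\Sigma}^{1/2}\right)^{-1} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}^{-1/2}\boldsymbol{\Gamma}; \tag{5.7}$$

see Exercise 5.11. We write the left side of this equation as $\boldsymbol{\Sigma}^{-1/2}$. These matrices enjoy many additional properties of the law of exponents for numbers; see, for example, Arnold (1981). Here, though, all we need are the properties given above.

Let $\mathbf{Z}$ have a $N_n(\mathbf{0}, \mathbf{I}_n)$ distribution. Let $\boldsymbol{\Sigma}$ be a positive semi-definite, symmetric matrix and let $\boldsymbol{\mu}$ be an $n \times 1$ vector of constants. Define the random vector $\mathbf{X}$ by

$$\mathbf{X} = \boldsymbol{\Sigma}^{1/2}\mathbf{Z} + \boldsymbol{\mu}. \tag{5.8}$$

By (5.2), we immediately have

$$E[\mathbf{X}] = \boldsymbol{\mu} \quad \text{and} \quad \mathrm{Cov}[\mathbf{X}] = \boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{1/2} = \boldsymbol{\Sigma}. \tag{5.9}$$

Further, the mgf of $\mathbf{X}$ is given by

$$\begin{aligned} M_{\mathbf{X}}(\mathbf{t}) = E[\exp\{\mathbf{t}'\mathbf{X}\}] &= E\left[\exp\{\mathbf{t}'\boldsymbol{\Sigma}^{1/2}\mathbf{Z} + \mathbf{t}'\boldsymbol{\mu}\}\right] \\ &= \exp\{\mathbf{t}'\boldsymbol{\mu}\}\, E\left[\exp\left\{\left(\boldsymbol{\Sigma}^{1/2}\mathbf{t}\right)'\mathbf{Z}\right\}\right] \\ &= \exp\{\mathbf{t}'\boldsymbol{\mu}\} \exp\left\{(1/2)\left(\boldsymbol{\Sigma}^{1/2}\mathbf{t}\right)'\boldsymbol{\Sigma}^{1/2}\mathbf{t}\right\} \\ &= \exp\{\mathbf{t}'\boldsymbol{\mu}\} \exp\{(1/2)\mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}\}. \end{aligned} \tag{5.10}$$

This leads to the following definition:

Definition 5.1 (Multivariate Normal). We say an $n$-dimensional random vector $\mathbf{X}$ has a multivariate normal distribution if its mgf is

$$M_{\mathbf{X}}(\mathbf{t}) = \exp\{\mathbf{t}'\boldsymbol{\mu} + (1/2)\mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}\}, \tag{5.11}$$

for all $\mathbf{t} \in R^n$ and where $\boldsymbol{\Sigma}$ is a symmetric, positive semi-definite matrix and $\boldsymbol{\mu} \in R^n$. We abbreviate this by saying that $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution.

Note that our definition is for positive semi-definite matrices $\boldsymbol{\Sigma}$. Usually $\boldsymbol{\Sigma}$ is positive definite, in which case we can further obtain the density of $\mathbf{X}$. If $\boldsymbol{\Sigma}$ is positive definite, then so is $\boldsymbol{\Sigma}^{1/2}$ and, as discussed above, its inverse is given by expression (5.7). Thus the transformation between $\mathbf{X}$ and $\mathbf{Z}$, (5.8), is one-to-one with the inverse transformation

$$\mathbf{Z} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{X} - \boldsymbol{\mu}),$$

with Jacobian $|\boldsymbol{\Sigma}^{-1/2}| = |\boldsymbol{\Sigma}|^{-1/2}$. Hence, upon simplification, the pdf of $\mathbf{X}$ is given by

$$f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right\}, \quad \text{for } \mathbf{x} \in R^n. \tag{5.12}$$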
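The multivariate normal density for a positive definite covariance matrix translates directly into a few lines of R. The helper name dmvn below is hypothetical (it is not a base R function), and the numerical inputs are arbitrary; the check uses the fact that, for a diagonal covariance matrix, the joint density factors into univariate normal densities.

```r
# Multivariate normal density for a positive definite Sigma.
# dmvn is a hypothetical helper written for this sketch.
dmvn <- function(x, mu, Sigma) {
  n <- length(mu)
  d <- x - mu
  q <- drop(t(d) %*% solve(Sigma) %*% d)   # (x - mu)' Sigma^{-1} (x - mu)
  exp(-q / 2) / ((2 * pi)^(n / 2) * sqrt(det(Sigma)))
}

# Arbitrary example with a diagonal covariance matrix (variances 4 and 9).
mu    <- c(1, -2)
Sigma <- diag(c(4, 9))
x     <- c(0.5, 1)

joint   <- dmvn(x, mu, Sigma)
product <- dnorm(x[1], mean = 1, sd = 2) * dnorm(x[2], mean = -2, sd = 3)
print(c(joint = joint, product = product))
```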

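The following two theorems are very useful. The first says that a linear transformation of a multivariate normal random vector has a multivariate normal distribution.

Theorem 5.1. Suppose $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution. Let $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$, where $\mathbf{A}$ is an $m \times n$ matrix and $\mathbf{b} \in R^m$. Then $\mathbf{Y}$ has a $N_m(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$ distribution.

Proof: From (5.11), for $\mathbf{t} \in R^m$, the mgf of $\mathbf{Y}$ is

$$\begin{aligned} M_{\mathbf{Y}}(\mathbf{t}) &= E[\exp\{\mathbf{t}'\mathbf{Y}\}] \\ &= E[\exp\{\mathbf{t}'(\mathbf{A}\mathbf{X} + \mathbf{b})\}] \\ &= \exp\{\mathbf{t}'\mathbf{b}\}\, E[\exp\{(\mathbf{A}'\mathbf{t})'\mathbf{X}\}] \\ &= \exp\{\mathbf{t}'\mathbf{b}\} \exp\{(\mathbf{A}'\mathbf{t})'\boldsymbol{\mu} + (1/2)(\mathbf{A}'\mathbf{t})'\boldsymbol{\Sigma}(\mathbf{A}'\mathbf{t})\} \\ &= \exp\{\mathbf{t}'(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}) + (1/2)\mathbf{t}'\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'\mathbf{t}\}, \end{aligned}$$

which is the mgf of an $N_m(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$ distribution.

A simple corollary to this theorem gives marginal distributions of a multivariate normal random variable. Let $\mathbf{X}_1$ be any subvector of $\mathbf{X}$, say of dimension $m < n$. Because we can always rearrange means and correlations, there is no loss in generality in writing $\mathbf{X}$ as

$$\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}, \tag{5.13}$$

where $\mathbf{X}_2$ is of dimension $p = n - m$. In the same way, partition the mean and covariance matrix of $\mathbf{X}$; that is,

$$\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix} \tag{5.14}$$

with the same dimensions as in expression (5.13). Note, for instance, that $\boldsymbol{\Sigma}_{11}$ is the covariance matrix of $\mathbf{X}_1$ and $\boldsymbol{\Sigma}_{12}$ contains all the covariances between the components of $\mathbf{X}_1$ and $\mathbf{X}_2$. Now define $\mathbf{A}$ to be the matrix

$$\mathbf{A} = [\mathbf{I}_m \ \vdots\ \mathbf{O}_{mp}],$$

where $\mathbf{O}_{mp}$ is an $m \times p$ matrix of zeroes. Then $\mathbf{X}_1 = \mathbf{A}\mathbf{X}$. Hence, applying Theorem 5.1 to this transformation, along with some matrix algebra, we have the following corollary:

Corollary 5.1. Suppose $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution, partitioned as in expressions (5.13) and (5.14). Then $\mathbf{X}_1$ has a $N_m(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$ distribution.

This is a useful result because it says that any marginal distribution of $\mathbf{X}$ is also normal and, further, its mean and covariance matrix are those associated with that partial vector.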
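Theorem 5.1 and Corollary 5.1 reduce to matrix arithmetic on the parameters, which can be sketched in R as follows. The mean vector, covariance matrix, $\mathbf{A}$, and $\mathbf{b}$ below are invented for illustration.

```r
# Illustrative N_3(mu, Sigma) parameters (Sigma is positive definite).
mu    <- c(1, 0, -1)
Sigma <- matrix(c(4, 1, 0,
                  1, 2, 1,
                  0, 1, 3), nrow = 3, byrow = TRUE)

# Theorem 5.1: Y = A X + b is N_m(A mu + b, A Sigma A').
A <- matrix(c(1, 1,  0,
              0, 1, -1), nrow = 2, byrow = TRUE)
b <- c(0, 5)
mean_Y <- drop(A %*% mu + b)
cov_Y  <- A %*% Sigma %*% t(A)

# Corollary 5.1: A1 = [I_m : O_mp] picks out the first m = 2 components.
A1      <- cbind(diag(2), matrix(0, nrow = 2, ncol = 1))
mean_X1 <- drop(A1 %*% mu)
cov_X1  <- A1 %*% Sigma %*% t(A1)

print(mean_Y)
print(cov_Y)
print(cov_X1)
```

Here cov_X1 is simply the upper-left $2 \times 2$ block of Sigma, which is what the corollary asserts.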

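Example 5.1. In this example, we explore the multivariate normal case when $n = 2$. The distribution in this case is called the bivariate normal. We also use the customary notation of $(X, Y)$ instead of $(X_1, X_2)$. So, suppose $(X, Y)$ has a $N_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution, where

$$\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}. \tag{5.15}$$

Hence, $\mu_1$ and $\sigma_1^2$ are the mean and variance, respectively, of $X$; $\mu_2$ and $\sigma_2^2$ are the mean and variance, respectively, of $Y$; and $\sigma_{12}$ is the covariance between $X$ and $Y$. Recall that $\sigma_{12} = \rho\sigma_1\sigma_2$, where $\rho$ is the correlation coefficient between $X$ and $Y$. Substituting $\rho\sigma_1\sigma_2$ for $\sigma_{12}$ in $\boldsymbol{\Sigma}$, it is easy to see that the determinant of $\boldsymbol{\Sigma}$ is $\sigma_1^2\sigma_2^2(1 - \rho^2)$. Recall that $\rho^2 \le 1$. For the remainder of this example, assume that $\rho^2 < 1$. In this case, $\boldsymbol{\Sigma}$ is invertible (it is also positive definite). Further, since $\boldsymbol{\Sigma}$ is a $2 \times 2$ matrix, its inverse can easily be determined to be

$$\boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1 - \rho^2)} \begin{bmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{bmatrix}. \tag{5.16}$$

Using this expression, the pdf of $(X, Y)$, expression (5.12), can be written as

$$f(x, y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1 - \rho^2}}\, e^{-q/2}, \quad -\infty < x < \infty, \quad -\infty < y < \infty, \tag{5.17}$$

where

$$q = \frac{1}{1 - \rho^2}\left[\left(\frac{x - \mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x - \mu_1}{\sigma_1}\right)\left(\frac{y - \mu_2}{\sigma_2}\right) + \left(\frac{y - \mu_2}{\sigma_2}\right)^2\right]; \tag{5.18}$$

see Exercise 5.12.

Recall in general, that if $X$ and $Y$ are independent random variables then their correlation coefficient is 0. If they are normal, by Corollary 5.1, $X$ has a $N(\mu_1, \sigma_1^2)$ distribution and $Y$ has a $N(\mu_2, \sigma_2^2)$ distribution. Further, based on the expression (5.17) for the joint pdf of $(X, Y)$, we see that if the correlation coefficient is 0, then $X$ and $Y$ are independent. That is, for the bivariate normal case, independence is equivalent to $\rho = 0$. The generalization is true for the multivariate normal as shown by Theorem 5.2.

If two random variables are independent then their covariance is 0. In general, the converse is not true. However, as the following theorem shows, it is true for the multivariate normal distribution.

Theorem 5.2. Suppose $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution, partitioned as in the expressions (5.13) and (5.14). Then $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent if and only if $\boldsymbol{\Sigma}_{12} = \mathbf{O}$.

Proof: First note that $\boldsymbol{\Sigma}_{21} = \boldsymbol{\Sigma}_{12}'$. The joint mgf of $\mathbf{X}_1$ and $\mathbf{X}_2$ is given by

$$M_{\mathbf{X}_1, \mathbf{X}_2}(\mathbf{t}_1, \mathbf{t}_2) = \exp\left\{\mathbf{t}_1'\boldsymbol{\mu}_1 + \mathbf{t}_2'\boldsymbol{\mu}_2 + \frac{1}{2}\left(\mathbf{t}_1'\boldsymbol{\Sigma}_{11}\mathbf{t}_1 + \mathbf{t}_2'\boldsymbol{\Sigma}_{22}\mathbf{t}_2 + \mathbf{t}_2'\boldsymbol{\Sigma}_{21}\mathbf{t}_1 + \mathbf{t}_1'\boldsymbol{\Sigma}_{12}\mathbf{t}_2\right)\right\} \tag{5.19}$$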

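where $\mathbf{t}' = (\mathbf{t}_1', \mathbf{t}_2')$ is partitioned the same as $\boldsymbol{\mu}$. By Corollary 5.1, $\mathbf{X}_1$ has a $N_m(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$ distribution and $\mathbf{X}_2$ has a $N_p(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22})$ distribution. Hence, the product of their marginal mgfs is

$$M_{\mathbf{X}_1}(\mathbf{t}_1) M_{\mathbf{X}_2}(\mathbf{t}_2) = \exp\left\{\mathbf{t}_1'\boldsymbol{\mu}_1 + \mathbf{t}_2'\boldsymbol{\mu}_2 + \frac{1}{2}\left(\mathbf{t}_1'\boldsymbol{\Sigma}_{11}\mathbf{t}_1 + \mathbf{t}_2'\boldsymbol{\Sigma}_{22}\mathbf{t}_2\right)\right\}. \tag{5.20}$$

$\mathbf{X}_1$ and $\mathbf{X}_2$ are independent if and only if the expressions (5.19) and (5.20) are the same. If $\boldsymbol{\Sigma}_{12} = \mathbf{O}$ and, hence, $\boldsymbol{\Sigma}_{21} = \mathbf{O}$, then the expressions are the same and $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent. If $\mathbf{X}_1$ and $\mathbf{X}_2$ are independent, then the covariances between their components are all 0; i.e., $\boldsymbol{\Sigma}_{12} = \mathbf{O}$ and $\boldsymbol{\Sigma}_{21} = \mathbf{O}$.

Corollary 5.1 showed that the marginal distributions of a multivariate normal are themselves normal. This is true for conditional distributions, too. As the following proof shows, we can combine the results of Theorems 5.1 and 5.2 to obtain the following theorem.

Theorem 5.3. Suppose $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution, which is partitioned as in expressions (5.13) and (5.14). Assume that $\boldsymbol{\Sigma}$ is positive definite. Then the conditional distribution of $\mathbf{X}_1 \mid \mathbf{X}_2$ is

$$N_m\left(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2),\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\right). \tag{5.21}$$

Proof: Consider first the joint distribution of the random vector $\mathbf{W} = \mathbf{X}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\mathbf{X}_2$ and $\mathbf{X}_2$. This distribution is obtained from the transformation

$$\begin{bmatrix} \mathbf{W} \\ \mathbf{X}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{I}_m & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{O} & \mathbf{I}_p \end{bmatrix} \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix}.$$

Because this is a linear transformation, it follows from Theorem 5.1 that the joint distribution is multivariate normal, with $E[\mathbf{W}] = \boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\mu}_2$, $E[\mathbf{X}_2] = \boldsymbol{\mu}_2$, and covariance matrix

$$\begin{bmatrix} \mathbf{I}_m & -\boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1} \\ \mathbf{O} & \mathbf{I}_p \end{bmatrix} \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{I}_m & \mathbf{O}' \\ -\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} & \mathbf{I}_p \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21} & \mathbf{O}' \\ \mathbf{O} & \boldsymbol{\Sigma}_{22} \end{bmatrix}.$$

Hence, by Theorem 5.2 the random vectors $\mathbf{W}$ and $\mathbf{X}_2$ are independent. Thus the conditional distribution of $\mathbf{W} \mid \mathbf{X}_2$ is the same as the marginal distribution of $\mathbf{W}$; that is,

$$\mathbf{W} \mid \mathbf{X}_2 \text{ is } N_m\left(\boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\mu}_2,\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\right).$$

Further, because of this independence, $\mathbf{W} + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\mathbf{X}_2$ given $\mathbf{X}_2$ is distributed as

$$N_m\left(\boldsymbol{\mu}_1 - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\mu}_2 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\mathbf{X}_2,\ \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\right), \tag{5.22}$$

which is the desired result.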
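The block computations in the proof of Theorem 5.3 can be reproduced numerically. The sketch below uses an invented $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $\mathbf{X}_1$ the first component and $\mathbf{X}_2$ the last two; the names Tm, B, and x2 are illustrative only.

```r
# Illustrative N_3(mu, Sigma); X1 is the first component, X2 the last two.
mu    <- c(1, 0, -1)
Sigma <- matrix(c(4, 1, 0,
                  1, 2, 1,
                  0, 1, 3), nrow = 3, byrow = TRUE)
i1 <- 1
i2 <- 2:3
S11 <- Sigma[i1, i1, drop = FALSE]
S12 <- Sigma[i1, i2, drop = FALSE]
S21 <- Sigma[i2, i1, drop = FALSE]
S22 <- Sigma[i2, i2, drop = FALSE]

B <- S12 %*% solve(S22)                  # Sigma_12 Sigma_22^{-1}

# Transformation from the proof: (W', X2')' = Tm X.
Tm <- rbind(cbind(diag(1), -B),
            cbind(matrix(0, 2, 1), diag(2)))
cov_WX2 <- Tm %*% Sigma %*% t(Tm)        # off-diagonal blocks should vanish

# Conditional mean and covariance of X1 given X2 = x2 (x2 is arbitrary).
x2        <- c(1, 0)
cond_mean <- mu[i1] + drop(B %*% (x2 - mu[i2]))
cond_cov  <- S11 - B %*% S21

print(round(cov_WX2, 10))
print(c(cond_mean = cond_mean, cond_var = drop(cond_cov)))
```

The zero off-diagonal blocks of cov_WX2 are what allow Theorem 5.2 to be applied to $\mathbf{W}$ and $\mathbf{X}_2$ in the proof.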

Example 5.2 (Continuation of Example 5.1). Consider once more the bivariate normal distribution that was given in Example 5.1. For this case, reversing the roles so that $Y = X_1$ and $X = X_2$, expression (5.21) shows that the conditional distribution of $Y$ given $X = x$ is

$$N\left[\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1),\ \sigma_2^2(1 - \rho^2)\right]. \tag{5.23}$$

Thus, with a bivariate normal distribution, the conditional mean of $Y$, given that $X = x$, is linear in $x$ and is given by

$$E(Y \mid x) = \mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1).$$

Since the coefficient of $x$ in this linear conditional mean $E(Y \mid x)$ is $\rho\sigma_2/\sigma_1$, and since $\sigma_1$ and $\sigma_2$ represent the respective standard deviations, $\rho$ is the correlation coefficient of $X$ and $Y$. This follows from the result that the coefficient of $x$ in a general linear conditional mean $E(Y \mid x)$ is the product of the correlation coefficient and the ratio $\sigma_2/\sigma_1$.

Although the mean of the conditional distribution of $Y$, given $X = x$, depends upon $x$ (unless $\rho = 0$), the variance $\sigma_2^2(1 - \rho^2)$ is the same for all real values of $x$. Thus, by way of example, given that $X = x$, the conditional probability that $Y$ is within $(2.576)\sigma_2\sqrt{1 - \rho^2}$ units of the conditional mean is 0.99, whatever the value of $x$ may be. In this sense, most of the probability for the distribution of $X$ and $Y$ lies in the band

$$\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x - \mu_1) \pm 2.576\,\sigma_2\sqrt{1 - \rho^2}$$

about the graph of the linear conditional mean. For every fixed positive $\sigma_2$, the width of this band depends upon $\rho$. Because the band is narrow when $\rho^2$ is nearly 1, we see that $\rho$ does measure the intensity of the concentration of the probability for $X$ and $Y$ about the linear conditional mean.

In a similar manner we can show that the conditional distribution of $X$, given $Y = y$, is the normal distribution

$$N\left[\mu_1 + \rho\frac{\sigma_1}{\sigma_2}(y - \mu_2),\ \sigma_1^2(1 - \rho^2)\right].$$

Example 5.3. Let us assume that in a certain population of married couples the height $X_1$ of the husband and the height $X_2$ of the wife have a bivariate normal distribution with parameters $\mu_1 = 5.8$ feet, $\mu_2 = 5.3$ feet, $\sigma_1 = \sigma_2 = 0.2$ foot, and $\rho = 0.6$. The conditional pdf of $X_2$, given $X_1 = 6.3$, is normal, with mean $5.3 + (0.6)(6.3 - 5.8) = 5.6$ and standard deviation $(0.2)\sqrt{1 - 0.36} = 0.16$. Accordingly, given that the height of the husband is 6.3 feet, the probability that his wife has a height between 5.28 and 5.92 feet is

$$P(5.28 < X_2 < 5.92 \mid X_1 = 6.3) = \Phi(2) - \Phi(-2) = 0.954.$$

The interval (5.28, 5.92) could be thought of as a 95.4% prediction interval for the wife’s height, given $X_1 = 6.3$.
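The arithmetic of Example 5.3 can be reproduced in R with the conditional mean and standard deviation of a bivariate normal. This is a minimal sketch; the variable names are illustrative and the numbers are those of the example.

```r
mu1 <- 5.8; mu2 <- 5.3      # mean heights (feet) of husband and wife
s1  <- 0.2; s2  <- 0.2      # standard deviations (feet)
rho <- 0.6                  # correlation coefficient
x1  <- 6.3                  # observed height of the husband

# Conditional distribution of X2 given X1 = x1.
cond_mean <- mu2 + rho * (s2 / s1) * (x1 - mu1)   # 5.6 in the example
cond_sd   <- s2 * sqrt(1 - rho^2)                 # 0.16 in the example

# P(5.28 < X2 < 5.92 | X1 = 6.3)
p <- pnorm(5.92, cond_mean, cond_sd) - pnorm(5.28, cond_mean, cond_sd)
print(round(c(cond_mean, cond_sd, p), 3))
```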

Recall that if the random variable $X$ has a $N(\mu, \sigma^2)$ distribution, then the random variable $[(X - \mu)/\sigma]^2$ has a $\chi^2(1)$ distribution. The multivariate analog of this fact is given in the next theorem.

Theorem 5.4. Suppose $\mathbf{X}$ has a $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution, where $\boldsymbol{\Sigma}$ is positive definite. Then the random variable $W = (\mathbf{X} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{X} - \boldsymbol{\mu})$ has a $\chi^2(n)$ distribution.

Proof: Write $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{1/2}$, where $\boldsymbol{\Sigma}^{1/2}$ is defined as in (5.6). Then $\mathbf{Z} = \boldsymbol{\Sigma}^{-1/2}(\mathbf{X} - \boldsymbol{\mu})$ is $N_n(\mathbf{0}, \mathbf{I}_n)$. Let $W = \mathbf{Z}'\mathbf{Z} = \sum_{i=1}^n Z_i^2$. Because, for $i = 1, 2, \ldots, n$, $Z_i$ has a $N(0, 1)$ distribution, it follows from Theorem 4.1 that $Z_i^2$ has a $\chi^2(1)$ distribution. Because $Z_1, \ldots, Z_n$ are independent standard normal random variables, by Corollary 3.1 $\sum_{i=1}^n Z_i^2 = W$ has a $\chi^2(n)$ distribution.
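The two ways of writing $W$ in the proof, as a quadratic form in $\mathbf{X} - \boldsymbol{\mu}$ and as $\mathbf{Z}'\mathbf{Z}$, can be compared numerically. The sketch below uses an invented mean vector, covariance matrix, and point x; it only checks the algebra and then evaluates the corresponding $\chi^2(n)$ probability.

```r
# Illustrative parameters and an arbitrary point x.
mu    <- c(1, 0, -1)
Sigma <- matrix(c(4, 1, 0,
                  1, 2, 1,
                  0, 1, 3), nrow = 3, byrow = TRUE)
x <- c(2, 1, 0)
d <- x - mu

# W = (x - mu)' Sigma^{-1} (x - mu), computed directly ...
W_direct <- drop(t(d) %*% solve(Sigma) %*% d)

# ... and as z'z with z = Sigma^{-1/2} (x - mu), as in the proof.
e              <- eigen(Sigma, symmetric = TRUE)
Sigma_neg_half <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
z              <- drop(Sigma_neg_half %*% d)
W_from_z       <- sum(z^2)

# By Theorem 5.4, P(W <= w) is a chi-square probability with n = 3 df.
p <- pchisq(W_direct, df = length(mu))
print(c(W_direct = W_direct, W_from_z = W_from_z, p = p))
```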

5.1 ∗Applications

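In this section, we consider several applications of the multivariate normal distribution. These the reader may have already encountered in an applied course in statistics. The first is principal components, which results in a linear function of a multivariate normal random vector that has independent components and preserves the “total” variation in the problem.

Let the random vector $\mathbf{X}$ have the multivariate normal distribution $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ where $\boldsymbol{\Sigma}$ is positive definite. As in (5.4), write the spectral decomposition of $\boldsymbol{\Sigma}$ as $\boldsymbol{\Sigma} = \boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma}$. Recall that the columns, $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$, of $\boldsymbol{\Gamma}'$ are the eigenvectors corresponding to the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ which form the main diagonal of the matrix $\boldsymbol{\Lambda}$. Assume without loss of generality that the eigenvalues are decreasing; i.e., $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n > 0$. Define the random vector $\mathbf{Y} = \boldsymbol{\Gamma}(\mathbf{X} - \boldsymbol{\mu})$. Since $\boldsymbol{\Gamma}\boldsymbol{\Sigma}\boldsymbol{\Gamma}' = \boldsymbol{\Lambda}$, by Theorem 5.1 $\mathbf{Y}$ has a $N_n(\mathbf{0}, \boldsymbol{\Lambda})$ distribution. Hence the components $Y_1, Y_2, \ldots, Y_n$ are independent random variables and, for $i = 1, 2, \ldots, n$, $Y_i$ has a $N(0, \lambda_i)$ distribution. The random vector $\mathbf{Y}$ is called the vector of principal components.

We say the total variation, (TV), of a random vector is the sum of the variances of its components. For the random vector $\mathbf{X}$, because $\boldsymbol{\Gamma}$ is an orthogonal matrix

$$\mathrm{TV}(\mathbf{X}) = \sum_{i=1}^n \sigma_i^2 = \mathrm{tr}\,\boldsymbol{\Sigma} = \mathrm{tr}\,\boldsymbol{\Gamma}'\boldsymbol{\Lambda}\boldsymbol{\Gamma} = \mathrm{tr}\,\boldsymbol{\Lambda}\boldsymbol{\Gamma}\boldsymbol{\Gamma}' = \sum_{i=1}^n \lambda_i = \mathrm{TV}(\mathbf{Y}).$$

Hence, $\mathbf{X}$ and $\mathbf{Y}$ have the same total variation.

Next, consider the first component of $\mathbf{Y}$, which is given by $Y_1 = \mathbf{v}_1'(\mathbf{X} - \boldsymbol{\mu})$. This is a linear combination of the components of $\mathbf{X} - \boldsymbol{\mu}$ with the property $\|\mathbf{v}_1\|^2 = \sum_{j=1}^n v_{1j}^2 = 1$, because $\boldsymbol{\Gamma}'$ is orthogonal. Consider any other linear combination of $(\mathbf{X} - \boldsymbol{\mu})$, say $\mathbf{a}'(\mathbf{X} - \boldsymbol{\mu})$ such that $\|\mathbf{a}\|^2 = 1$. Because $\mathbf{a} \in R^n$ and $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ forms a basis for $R^n$, we must have $\mathbf{a} = \sum_{j=1}^n a_j \mathbf{v}_j$ for some set of scalars $a_1, \ldots, a_n$. Furthermore, because the basis $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ is orthonormal

$$\mathbf{a}'\mathbf{v}_i = \left(\sum_{j=1}^n a_j \mathbf{v}_j\right)'\mathbf{v}_i = \sum_{j=1}^n a_j \mathbf{v}_j'\mathbf{v}_i = a_i.$$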
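The quantities introduced so far for principal components can be computed with eigen. The sketch below uses an invented covariance matrix and unit vector; Gam stands for $\boldsymbol{\Gamma}$, and the columns of V are the eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$.

```r
# Illustrative positive definite covariance matrix.
Sigma <- matrix(c(4, 1, 0,
                  1, 2, 1,
                  0, 1, 3), nrow = 3, byrow = TRUE)
e      <- eigen(Sigma, symmetric = TRUE)
lambda <- e$values            # lambda_1 >= ... >= lambda_n
V      <- e$vectors           # columns v_1, ..., v_n (the columns of Gamma')
Gam    <- t(V)                # Gamma

# Covariance of Y = Gamma (X - mu): Gamma Sigma Gamma' = Lambda.
cov_Y <- Gam %*% Sigma %*% t(Gam)

# Total variation: the trace of Sigma equals the sum of the eigenvalues.
TV_X <- sum(diag(Sigma))
TV_Y <- sum(lambda)

# Coordinates of a unit vector a in the eigenvector basis: a_i = a' v_i.
a       <- c(1, 2, 2) / 3     # arbitrary vector with norm 1
a_coord <- drop(t(V) %*% a)

print(round(cov_Y, 10))
print(c(TV_X = TV_X, TV_Y = TV_Y, sum_sq_coord = sum(a_coord^2)))
```

Because the basis is orthonormal, the squared coordinates of a sum to $\|\mathbf{a}\|^2 = 1$.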

Some Special Distributions Using (5.5) and the fact that λi > 0, we have the inequality Var(a X)

= =

a Σa n λi (a vi )2 i=1

=

n i=1

λi a2i ≤ λ1

n

a2i = λ1 = Var(Y1 ).

(5.24)

i=1

Hence, Y1 has the maximum variance of any linear combination a (X − μ), such that a = 1. For this reason, Y1 is called the ﬁrst principal component of X. What about the other components, Y2 , . . . , Yn ? As the following theorem shows, they share a similar property relative to the order of their associated eigenvalue. For this reason, they are called the second, third, through the nth principal components, respectively. Theorem 5.5. Consider the situation described above. For j = 2, . . . , n and i = 1, 2, . . . , j − 1, Var[a X] ≤ λj = Var(Yj ), for all vectors a such that a ⊥ vi and

a = 1. The proof of this theorem is similar to that for the ﬁrst principal component and is left as Exercise 5.20. A second application concerning linear regression is oﬀered in Exercise 5.22. EXERCISES 5.1. Let X and Y have a bivariate normal distribution with respective parameters μx = 2.8, μy = 110, σx2 = 0.16, σy2 = 100, and ρ = 0.6. Compute (a) P (106 < Y < 124). (b) P (106 < Y < 124|X = 3.2). 5.2. Let X and Y have a bivariate normal distribution with parameters μ1 = 3, μ2 = 1, σ12 = 16, σ22 = 25, and ρ = 35 . Determine the following probabilities: (a) P (3 < Y < 8). (b) P (3 < Y < 8|X = 7). (c) P (−3 < X < 3). (d) P (−3 < X < 3|Y = −4). 5.3. If M (t1 , t2 ) is the mgf of a bivariate normal distribution, compute the covariance by using the formula ∂ 2 M (0, 0) ∂M (0, 0) ∂M (0, 0) − . ∂t1 ∂t2 ∂t1 ∂t2 Now let ψ(t1 , t2 ) = log M (t1 , t2 ). Show that ∂ 2 ψ(0, 0)/∂t1 ∂t2 gives this covariance directly.

5.4. Let $U$ and $V$ be independent random variables, each having a standard normal distribution. Show that the mgf $E(e^{t(UV)})$ of the random variable $UV$ is $(1 - t^2)^{-1/2}$, $-1 < t < 1$. Hint: Compare $E(e^{tUV})$ with the integral of a bivariate normal pdf that has means equal to zero.

5.5. Let $X$ and $Y$ have a bivariate normal distribution with parameters $\mu_1 = 5$, $\mu_2 = 10$, $\sigma_1^2 = 1$, $\sigma_2^2 = 25$, and $\rho > 0$. If $P(4 < Y < 16 \mid X = 5) = 0.954$, determine $\rho$.

5.6. Let $X$ and $Y$ have a bivariate normal distribution with parameters $\mu_1 = 20$, $\mu_2 = 40$, $\sigma_1^2 = 9$, $\sigma_2^2 = 4$, and $\rho = 0.6$. Find the shortest interval for which 0.90 is the conditional probability that $Y$ is in the interval, given that $X = 22$.

5.7. Say the correlation coefficient between the heights of husbands and wives is 0.70 and the mean male height is 5 feet 10 inches with standard deviation 2 inches, and the mean female height is 5 feet 4 inches with standard deviation $1\frac{1}{2}$ inches. Assuming a bivariate normal distribution, what is the best guess of the height of a woman whose husband's height is 6 feet? Find a 95% prediction interval for her height.

5.8. Let

$$f(x, y) = \frac{1}{2\pi} \exp\left[-\frac{1}{2}(x^2 + y^2)\right] \left\{1 + xy \exp\left[-\frac{1}{2}(x^2 + y^2 - 2)\right]\right\},$$

where $-\infty < x < \infty$, $-\infty < y < \infty$. If $f(x, y)$ is a joint pdf, it is not a normal bivariate pdf. Show that $f(x, y)$ actually is a joint pdf and that each marginal pdf is normal. Thus the fact that each marginal pdf is normal does not imply that the joint pdf is bivariate normal.

5.9. Let $X$, $Y$, and $Z$ have the joint pdf

$$\left(\frac{1}{2\pi}\right)^{3/2} \exp\left(-\frac{x^2 + y^2 + z^2}{2}\right) \left[1 + xyz \exp\left(-\frac{x^2 + y^2 + z^2}{2}\right)\right],$$

where $-\infty < x < \infty$, $-\infty < y < \infty$, and $-\infty < z < \infty$. While $X$, $Y$, and $Z$ are obviously dependent, show that $X$, $Y$, and $Z$ are pairwise independent and that each pair has a bivariate normal distribution.

5.10. Let $X$ and $Y$ have a bivariate normal distribution with parameters $\mu_1 = \mu_2 = 0$, $\sigma_1^2 = \sigma_2^2 = 1$, and correlation coefficient $\rho$. Find the distribution of the random variable $Z = aX + bY$ in which $a$ and $b$ are nonzero constants.

5.11. Establish formula (5.7) by a direct multiplication.

5.12. Show that the expression (5.12) becomes that of (5.17) in the bivariate case.

5.13. Show that expression (5.21) simplifies to expression (5.23) for the bivariate normal case.

5.14. Let $\mathbf{X} = (X_1, X_2, X_3)'$ have a multivariate normal distribution with mean vector $\mathbf{0}$ and variance-covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 1 \\ 0 & 1 & 2 \end{bmatrix}.$$

Find $P(X_1 > X_2 + X_3 + 2)$. Hint: Find the vector $\mathbf{a}$ so that $\mathbf{a}'\mathbf{X} = X_1 - X_2 - X_3$ and make use of Theorem 5.1.

5.15. Suppose $\mathbf{X}$ is distributed $N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Let $\overline{X} = n^{-1} \sum_{i=1}^{n} X_i$. (a) Write $\overline{X}$ as $\mathbf{a}'\mathbf{X}$ for an appropriate vector $\mathbf{a}$ and apply Theorem 5.1 to find the distribution of $\overline{X}$. (b) Determine the distribution of $\overline{X}$ if all of its component random variables $X_i$ have the same mean $\mu$.

5.16. Suppose $\mathbf{X}$ is distributed $N_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Determine the distribution of the random vector $(X_1 + X_2, X_1 - X_2)'$. Show that $X_1 + X_2$ and $X_1 - X_2$ are independent if $\text{Var}(X_1) = \text{Var}(X_2)$.

5.17. Suppose $\mathbf{X}$ is distributed $N_3(\mathbf{0}, \boldsymbol{\Sigma})$, where

$$\boldsymbol{\Sigma} = \begin{bmatrix} 3 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 3 \end{bmatrix}.$$

Find $P((X_1 - 2X_2 + X_3)^2 > 15.36)$.

5.18. Let $X_1, X_2, X_3$ be iid random variables each having a standard normal distribution. Let the random variables $Y_1, Y_2, Y_3$ be defined by

$$X_1 = Y_1 \cos Y_2 \sin Y_3, \quad X_2 = Y_1 \sin Y_2 \sin Y_3, \quad X_3 = Y_1 \cos Y_3,$$

where $0 \le Y_1 < \infty$, $0 \le Y_2 < 2\pi$, $0 \le Y_3 \le \pi$. Show that $Y_1, Y_2, Y_3$ are mutually independent.

5.19. Show that expression (5.5) is true.

5.20. Prove Theorem 5.5.

5.21. Suppose $\mathbf{X}$ has a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} 283 & 215 & 277 & 208 \\ 215 & 213 & 217 & 153 \\ 277 & 217 & 336 & 236 \\ 208 & 153 & 236 & 194 \end{bmatrix}.$$

(a) Find the total variation of $\mathbf{X}$. (b) Find the principal component vector $\mathbf{Y}$. (c) Show that the first principal component accounts for 90% of the total variation. (d) Show that the first principal component $Y_1$ is essentially a rescaled $\overline{X}$. Determine the variance of $(1/2)\overline{X}$ and compare it to that of $Y_1$. Note that if R is available, the command eigen(amat) obtains the spectral decomposition of the matrix amat.

5.22. Readers may have encountered the multiple regression model in a previous course in statistics. We can briefly write it as follows. Suppose we have a vector of $n$ observations $\mathbf{Y}$ which has the distribution $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I})$, where $\mathbf{X}$ is an $n \times p$ matrix of known values, which has full column rank $p$, and $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown parameters. The least squares estimator of $\boldsymbol{\beta}$ is

$$\widehat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}.$$

(a) Determine the distribution of $\widehat{\boldsymbol{\beta}}$. (b) Let $\widehat{\mathbf{Y}} = \mathbf{X}\widehat{\boldsymbol{\beta}}$. Determine the distribution of $\widehat{\mathbf{Y}}$. (c) Let $\widehat{\mathbf{e}} = \mathbf{Y} - \widehat{\mathbf{Y}}$. Determine the distribution of $\widehat{\mathbf{e}}$. (d) By writing the random vector $(\widehat{\mathbf{Y}}', \widehat{\mathbf{e}}')'$ as a linear function of $\mathbf{Y}$, show that the random vectors $\widehat{\mathbf{Y}}$ and $\widehat{\mathbf{e}}$ are independent. (e) Show that $\widehat{\boldsymbol{\beta}}$ solves the least squares problem; that is,

$$\|\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}\|^2 = \min_{\mathbf{b} \in R^p} \|\mathbf{Y} - \mathbf{X}\mathbf{b}\|^2.$$

6

t- and F -Distributions

It is the purpose of this section to deﬁne two additional distributions that are quite useful in certain problems of statistical inference. These are called, respectively, the (Student’s) t-distribution and the F -distribution.

6.1

The t-distribution

Let $W$ denote a random variable that is $N(0, 1)$; let $V$ denote a random variable that is $\chi^2(r)$; and let $W$ and $V$ be independent. Then the joint pdf of $W$ and $V$, say $h(w, v)$, is the product of the pdf of $W$ and that of $V$ or

$$h(w, v) = \begin{cases} \dfrac{1}{\sqrt{2\pi}}\, e^{-w^2/2}\, \dfrac{1}{\Gamma(r/2)\, 2^{r/2}}\, v^{r/2-1} e^{-v/2} & -\infty < w < \infty,\ 0 < v < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

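Define a new random variable $T$ by writing

$$T = \frac{W}{\sqrt{V/r}}.$$

The change-of-variable technique is used to obtain the pdf $g_1(t)$ of $T$. The equations

$$t = \frac{w}{\sqrt{v/r}} \quad \text{and} \quad u = v$$

define a transformation that maps $\mathcal{S} = \{(w, v) : -\infty < w < \infty,\ 0 < v < \infty\}$ one-to-one and onto $\mathcal{T} = \{(t, u) : -\infty < t < \infty,\ 0 < u < \infty\}$. Since $w = t\sqrt{u}/\sqrt{r}$, $v = u$, the absolute value of the Jacobian of the transformation is $|J| = \sqrt{u}/\sqrt{r}$. Accordingly, the joint pdf of $T$ and $U = V$ is given by

$$g(t, u) = h\left(\frac{t\sqrt{u}}{\sqrt{r}},\, u\right)|J| = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\Gamma(r/2)\, 2^{r/2}}\, u^{r/2-1} \exp\left[-\dfrac{u}{2}\left(1 + \dfrac{t^2}{r}\right)\right] \dfrac{\sqrt{u}}{\sqrt{r}} & |t| < \infty,\ 0 < u < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

The marginal pdf of $T$ is then

$$g_1(t) = \int_{-\infty}^{\infty} g(t, u)\, du = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi r}\,\Gamma(r/2)\, 2^{r/2}}\, u^{(r+1)/2-1} \exp\left[-\frac{u}{2}\left(1 + \frac{t^2}{r}\right)\right] du.$$

In this integral let $z = u[1 + (t^2/r)]/2$, and it is seen that

$$g_1(t) = \int_{0}^{\infty} \frac{1}{\sqrt{2\pi r}\,\Gamma(r/2)\, 2^{r/2}} \left(\frac{2z}{1 + t^2/r}\right)^{(r+1)/2-1} e^{-z} \left(\frac{2}{1 + t^2/r}\right) dz = \frac{\Gamma[(r+1)/2]}{\sqrt{\pi r}\,\Gamma(r/2)}\, \frac{1}{(1 + t^2/r)^{(r+1)/2}}, \quad -\infty < t < \infty. \tag{6.1}$$

Thus, if $W$ is $N(0, 1)$, if $V$ is $\chi^2(r)$, and if $W$ and $V$ are independent, then

$$T = \frac{W}{\sqrt{V/r}} \tag{6.2}$$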

has the immediately preceding pdf $g_1(t)$. The distribution of the random variable $T$ is usually called a t-distribution. It should be observed that a t-distribution is completely determined by the parameter $r$, the number of degrees of freedom of the random variable that has the chi-square distribution. Some approximate values of

$$P(T \le t) = \int_{-\infty}^{t} g_1(w)\, dw$$

for selected values of $r$ and $t$ can be found in Table IV in Appendix: Tables of Distributions. Note that the last line of this table, which is labeled $\infty$, contains the $N(0, 1)$ critical values. This is because as the degrees of freedom approach $\infty$, the t-distribution converges to the $N(0, 1)$ distribution.

The R computer package can also be used to obtain critical values as well as probabilities concerning the t-distribution. For instance, the command qt(.975,15) returns the 97.5th percentile of the t-distribution with 15 degrees of freedom; the command pt(2.0,15) returns the probability that a t-distributed random variable with 15 degrees of freedom is less than 2.0; and the command dt(2.0,15) returns the value of the pdf of this distribution at 2.0.

Remark 6.1. This distribution was first discovered by W. S. Gosset when he was working for an Irish brewery. Gosset published under the pseudonym Student. Thus this distribution is often known as Student's t-distribution.

Example 6.1 (Mean and Variance of the t-Distribution). Let the random variable $T$ have a t-distribution with $r$ degrees of freedom. Then, as in (6.2), we can write $T = W(V/r)^{-1/2}$, where $W$ has a $N(0, 1)$ distribution, $V$ has a $\chi^2(r)$ distribution, and $W$ and $V$ are independent random variables. Independence of $W$ and $V$ and expression (3.4), provided $(r/2) - (k/2) > 0$ (i.e., $k < r$), implies the following:

$$E(T^k) = E\left[W^k \left(\frac{V}{r}\right)^{-k/2}\right] = E(W^k)\, E\left[\left(\frac{V}{r}\right)^{-k/2}\right] \tag{6.3}$$

$$= E(W^k)\, \frac{2^{-k/2}\, \Gamma\left(\frac{r}{2} - \frac{k}{2}\right)}{\Gamma\left(\frac{r}{2}\right) r^{-k/2}} \quad \text{if } k < r. \tag{6.4}$$

For the mean of $T$, use $k = 1$. Because $E(W) = 0$, as long as the degrees of freedom of $T$ exceed 1, the mean of $T$ is 0. For the variance, use $k = 2$. In this case the condition $r > k$ becomes $r > 2$. Since $E(W^2) = 1$, by expression (6.4), the variance of $T$ is given by

$$\text{Var}(T) = E(T^2) = \frac{r}{r - 2}. \tag{6.5}$$

Therefore, a t-distribution with $r > 2$ degrees of freedom has a mean of 0 and a variance of $r/(r - 2)$.
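As a rough numerical check of (6.1) and (6.5), the following sketch evaluates the t pdf directly and integrates it with a trapezoidal rule. It is an illustration only: the function names `t_pdf` and `trapezoid`, the truncation of the integrals to $[-50, 50]$, and the step count are assumptions made here, not part of the text. The quantities computed correspond to what the R commands dt and pt would return for 15 degrees of freedom.

```python
import math

def t_pdf(t, r):
    """Density (6.1) of the t-distribution with r degrees of freedom."""
    const = math.gamma((r + 1) / 2) / (math.sqrt(math.pi * r) * math.gamma(r / 2))
    return const * (1 + t * t / r) ** (-(r + 1) / 2)

def trapezoid(f, lo, hi, steps=100_000):
    """Composite trapezoidal rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h

r = 15
# Total probability; the tails beyond +-50 are negligible for r = 15.
mass = trapezoid(lambda t: t_pdf(t, r), -50.0, 50.0)
# E(T^2), which equals Var(T) because E(T) = 0.
second_moment = trapezoid(lambda t: t * t * t_pdf(t, r), -50.0, 50.0)
# P(T <= 2.0), the analogue of pt(2.0, 15).
cdf_at_2 = trapezoid(lambda t: t_pdf(t, r), -50.0, 2.0)

print("integral of pdf:", mass)
print("E(T^2):", second_moment, "vs r/(r-2):", r / (r - 2))
print("P(T <= 2.0):", cdf_at_2, " pdf at 2.0:", t_pdf(2.0, r))
```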

6.2

The F -distribution

Next consider two independent chi-square random variables $U$ and $V$ having $r_1$ and $r_2$ degrees of freedom, respectively. The joint pdf $h(u, v)$ of $U$ and $V$ is then

$$h(u, v) = \begin{cases} \dfrac{1}{\Gamma(r_1/2)\Gamma(r_2/2)\, 2^{(r_1+r_2)/2}}\, u^{r_1/2-1} v^{r_2/2-1} e^{-(u+v)/2} & 0 < u, v < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

We define the new random variable

$$W = \frac{U/r_1}{V/r_2}$$

and we propose finding the pdf $g_1(w)$ of $W$. The equations

$$w = \frac{u/r_1}{v/r_2}, \quad z = v,$$

define a one-to-one transformation that maps the set $\mathcal{S} = \{(u, v) : 0 < u < \infty,\ 0 < v < \infty\}$ onto the set $\mathcal{T} = \{(w, z) : 0 < w < \infty,\ 0 < z < \infty\}$. Since $u = (r_1/r_2)zw$, $v = z$, the absolute value of the Jacobian of the transformation is $|J| = (r_1/r_2)z$. The joint pdf $g(w, z)$ of the random variables $W$ and $Z = V$ is then

$$g(w, z) = \frac{1}{\Gamma(r_1/2)\Gamma(r_2/2)\, 2^{(r_1+r_2)/2}} \left(\frac{r_1 z w}{r_2}\right)^{\frac{r_1-2}{2}} z^{\frac{r_2-2}{2}} \exp\left[-\frac{z}{2}\left(\frac{r_1 w}{r_2} + 1\right)\right] \frac{r_1 z}{r_2},$$

provided that $(w, z) \in \mathcal{T}$, and zero elsewhere. The marginal pdf $g_1(w)$ of $W$ is then

$$g_1(w) = \int_{-\infty}^{\infty} g(w, z)\, dz = \int_{0}^{\infty} \frac{(r_1/r_2)^{r_1/2}\, w^{r_1/2-1}}{\Gamma(r_1/2)\Gamma(r_2/2)\, 2^{(r_1+r_2)/2}}\, z^{(r_1+r_2)/2-1} \exp\left[-\frac{z}{2}\left(\frac{r_1 w}{r_2} + 1\right)\right] dz.$$

If we change the variable of integration by writing

$$y = \frac{z}{2}\left(\frac{r_1 w}{r_2} + 1\right),$$

it can be seen that

$$g_1(w) = \int_{0}^{\infty} \frac{(r_1/r_2)^{r_1/2}\, w^{r_1/2-1}}{\Gamma(r_1/2)\Gamma(r_2/2)\, 2^{(r_1+r_2)/2}} \left(\frac{2y}{r_1 w/r_2 + 1}\right)^{(r_1+r_2)/2-1} e^{-y} \left(\frac{2}{r_1 w/r_2 + 1}\right) dy$$

$$= \frac{\Gamma[(r_1+r_2)/2]\,(r_1/r_2)^{r_1/2}}{\Gamma(r_1/2)\Gamma(r_2/2)}\, \frac{w^{r_1/2-1}}{(1 + r_1 w/r_2)^{(r_1+r_2)/2}}, \quad 0 < w < \infty,$$

and zero elsewhere.

[...] $r_2 > 2k$; i.e., the denominator degrees of freedom must exceed twice $k$. Assuming this is true, it follows from (3.4) that the mean of $F$ is given by

$$E(F) = \frac{r_2}{r_1}\, r_1\, \frac{2^{-1}\, \Gamma\left(\frac{r_2}{2} - 1\right)}{\Gamma\left(\frac{r_2}{2}\right)} = \frac{r_2}{r_2 - 2}. \tag{6.8}$$

If $r_2$ is large, then $E(F)$ is about 1. In Exercise 6.6, a general expression for $E(F^k)$ is derived.
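The F pdf derived above and the mean in (6.8) can be checked numerically in the same spirit. The sketch below is a minimal illustration, assuming the parameter values $r_1 = 5$ and $r_2 = 10$ (the ones used in Exercise 6.9); the names `f_pdf` and `trapezoid`, the truncation point 200, and the step count are choices made here for illustration.

```python
import math

def f_pdf(w, r1, r2):
    """pdf of the F-distribution with r1 and r2 degrees of freedom.

    Returns 0 for w <= 0; at w = 0 exactly this is only correct when r1 > 2,
    which holds for the values used below.
    """
    if w <= 0:
        return 0.0
    const = (math.gamma((r1 + r2) / 2) * (r1 / r2) ** (r1 / 2)
             / (math.gamma(r1 / 2) * math.gamma(r2 / 2)))
    return const * w ** (r1 / 2 - 1) / (1 + r1 * w / r2) ** ((r1 + r2) / 2)

def trapezoid(f, lo, hi, steps=200_000):
    """Composite trapezoidal rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h

r1, r2 = 5, 10
upper = 200.0  # truncation point; the neglected tail is tiny for r2 = 10

mass = trapezoid(lambda w: f_pdf(w, r1, r2), 0.0, upper)
mean = trapezoid(lambda w: w * f_pdf(w, r1, r2), 0.0, upper)
cdf_at_333 = trapezoid(lambda w: f_pdf(w, r1, r2), 0.0, 3.33)

print("integral of pdf:", mass)
print("E(F):", mean, "vs r2/(r2-2):", r2 / (r2 - 2))
print("P(F <= 3.33):", cdf_at_333)
```

The last quantity should be close to 0.95, in line with the value 3.33 listed in the answers for Exercise 6.9.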

6.3

Student’s Theorem

Our final note in this section concerns an important result for inference for normal random variables. It is a corollary to the t-distribution derived above and is often referred to as Student's Theorem.

Theorem 6.1. Let $X_1, \ldots, X_n$ be iid random variables each having a normal distribution with mean $\mu$ and variance $\sigma^2$. Define the random variables

$$\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \quad \text{and} \quad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \overline{X})^2.$$

Then

(a) $\overline{X}$ has a $N\left(\mu, \frac{\sigma^2}{n}\right)$ distribution.

(b) $\overline{X}$ and $S^2$ are independent.

(c) $(n - 1)S^2/\sigma^2$ has a $\chi^2(n - 1)$ distribution.

(d) The random variable

$$T = \frac{\overline{X} - \mu}{S/\sqrt{n}} \tag{6.9}$$

has a Student t-distribution with $n - 1$ degrees of freedom.

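Proof: Note that we have proved part (a) in Corollary 4.2. Let $\mathbf{X} = (X_1, \ldots, X_n)'$. Because $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$ random variables, $\mathbf{X}$ has a multivariate normal distribution $N(\mu\mathbf{1}, \sigma^2\mathbf{I})$, where $\mathbf{1}$ denotes a vector whose components are all 1. Let $\mathbf{v}' = (1/n, \ldots, 1/n) = (1/n)\mathbf{1}'$. Note that $\overline{X} = \mathbf{v}'\mathbf{X}$. Define the random vector $\mathbf{Y}$ by $\mathbf{Y} = (X_1 - \overline{X}, \ldots, X_n - \overline{X})'$. Consider the following transformation:

$$\mathbf{W} = \begin{bmatrix} \overline{X} \\ \mathbf{Y} \end{bmatrix} = \begin{bmatrix} \mathbf{v}' \\ \mathbf{I} - \mathbf{1}\mathbf{v}' \end{bmatrix} \mathbf{X}. \tag{6.10}$$

Because $\mathbf{W}$ is a linear transformation of a multivariate normal random vector, by Theorem 5.1 it has a multivariate normal distribution with mean

$$E[\mathbf{W}] = \begin{bmatrix} \mathbf{v}' \\ \mathbf{I} - \mathbf{1}\mathbf{v}' \end{bmatrix} \mu\mathbf{1} = \begin{bmatrix} \mu \\ \mathbf{0}_n \end{bmatrix}, \tag{6.11}$$

where $\mathbf{0}_n$ denotes a vector whose components are all 0, and covariance matrix

$$\boldsymbol{\Sigma} = \begin{bmatrix} \mathbf{v}' \\ \mathbf{I} - \mathbf{1}\mathbf{v}' \end{bmatrix} \sigma^2 \mathbf{I} \begin{bmatrix} \mathbf{v}' \\ \mathbf{I} - \mathbf{1}\mathbf{v}' \end{bmatrix}' = \sigma^2 \begin{bmatrix} \frac{1}{n} & \mathbf{0}_n' \\ \mathbf{0}_n & \mathbf{I} - \mathbf{1}\mathbf{v}' \end{bmatrix}. \tag{6.12}$$

Because $\overline{X}$ is the first component of $\mathbf{W}$, we can also obtain part (a) by Theorem 5.1. Next, because the covariances are 0, $\overline{X}$ is independent of $\mathbf{Y}$. But $S^2 = (n - 1)^{-1}\mathbf{Y}'\mathbf{Y}$. Hence, $\overline{X}$ is independent of $S^2$, also. Thus part (b) is true.

Consider the random variable

$$V = \sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2.$$

Each term in this sum is the square of a $N(0, 1)$ random variable and, hence, has a $\chi^2(1)$ distribution (Theorem 4.1). Because the summands are independent, it follows from Corollary 3.1 that $V$ is a $\chi^2(n)$ random variable. Note the following identity:

$$V = \sum_{i=1}^{n} \left(\frac{(X_i - \overline{X}) + (\overline{X} - \mu)}{\sigma}\right)^2 = \sum_{i=1}^{n} \left(\frac{X_i - \overline{X}}{\sigma}\right)^2 + \left(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\right)^2 = \frac{(n - 1)S^2}{\sigma^2} + \left(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\right)^2. \tag{6.13}$$

By part (b), the two terms on the right side of the last equation are independent. Further, the second term is the square of a standard normal random variable and, hence, has a $\chi^2(1)$ distribution. Taking mgfs of both sides, we have

$$(1 - 2t)^{-n/2} = E\left[\exp\{t(n - 1)S^2/\sigma^2\}\right] (1 - 2t)^{-1/2}. \tag{6.14}$$

Solving for the mgf of $(n - 1)S^2/\sigma^2$ on the right side we obtain part (c). Finally, part (d) follows immediately from parts (a)–(c) upon writing $T$, (6.9), as

$$T = \frac{(\overline{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{(n - 1)S^2/(\sigma^2(n - 1))}}.$$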
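Student's Theorem can also be illustrated by simulation. The sketch below is not part of the proof; it simply draws repeated normal samples and compares sample moments with the values implied by parts (a) through (c). The sample size, mean, standard deviation, number of replications, seed, and the helper names `simulate`, `mean`, `var`, and `corr` are all assumptions chosen here for illustration.

```python
import random

def simulate(n=5, mu=10.0, sigma=2.0, reps=20_000, seed=1):
    """Draw `reps` normal samples of size n; return sample means and (n-1)S^2/sigma^2."""
    rng = random.Random(seed)
    xbars, scaled = [], []
    for _ in range(reps):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(x) / n
        s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
        xbars.append(xbar)
        scaled.append((n - 1) * s2 / sigma ** 2)
    return xbars, scaled

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / (len(v) - 1)

def corr(u, v):
    mu_u, mu_v = mean(u), mean(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v)) / (len(u) - 1)
    return cov / (var(u) * var(v)) ** 0.5

n, mu, sigma = 5, 10.0, 2.0
xbars, scaled = simulate(n=n, mu=mu, sigma=sigma)

# (a): the sample mean should have mean mu and variance sigma^2 / n.
print("mean, var of sample mean:", mean(xbars), var(xbars), "target:", mu, sigma ** 2 / n)
# (c): a chi-square(n-1) variable has mean n-1 and variance 2(n-1).
print("mean, var of (n-1)S^2/sigma^2:", mean(scaled), var(scaled), "target:", n - 1, 2 * (n - 1))
# (b): independence implies zero correlation (the converse need not hold).
print("correlation:", corr(xbars, scaled))
```

A near-zero correlation is only consistent with part (b); it does not by itself establish independence.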

EXERCISES

6.1. Let $T$ have a t-distribution with 10 degrees of freedom. Find $P(|T| > 2.228)$ from either Table IV or, if available, by using R.

6.2. Let $T$ have a t-distribution with 14 degrees of freedom. Determine $b$ so that $P(-b < T < b) = 0.90$. Use either Table IV or, if available, R.

6.3. Let $T$ have a t-distribution with $r > 4$ degrees of freedom. Use expression (6.4) to determine the kurtosis of $T$.

6.4. Assuming a computer is available, plot the pdfs of the random variables defined in parts (a)–(e) below. Obtain an overlay plot of all five pdfs, also. In R the domain values of the pdfs can easily be obtained by using the seq command. For instance, the command x [...] $r_2 > 2k$, continue with Example 6.2 and derive the $E(F^k)$.

6.7. Let $F$ have an F-distribution with parameters $r_1$ and $r_2$. Using the results of the last exercise, determine the kurtosis of $F$, assuming that $r_2 > 8$.

6.8. Let $F$ have an F-distribution with parameters $r_1$ and $r_2$. Argue that $1/F$ has an F-distribution with parameters $r_2$ and $r_1$.

6.9. If $F$ has an F-distribution with parameters $r_1 = 5$ and $r_2 = 10$, find $a$ and $b$ so that $P(F \le a) = 0.05$ and $P(F \le b) = 0.95$, and, accordingly, $P(a < F < b) = 0.90$. Hint: Write $P(F \le a) = P(1/F \ge 1/a) = 1 - P(1/F \le 1/a)$, and use the result of Exercise 6.8 and Table V or, if available, use R.

6.10. Let $T = W/\sqrt{V/r}$, where the independent variables $W$ and $V$ are, respectively, normal with mean zero and variance 1 and chi-square with $r$ degrees of freedom. Show that $T^2$ has an F-distribution with parameters $r_1 = 1$ and $r_2 = r$. Hint: What is the distribution of the numerator of $T^2$?

6.11. Show that the t-distribution with $r = 1$ degree of freedom and the Cauchy distribution are the same.

6.12. Show that

$$Y = \frac{1}{1 + (r_1/r_2)W},$$

where $W$ has an F-distribution with parameters $r_1$ and $r_2$, has a beta distribution.

6.13. Let $X_1$, $X_2$ be iid with common distribution having the pdf $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere. Show that $Z = X_1/X_2$ has an F-distribution.

6.14. Let $X_1$, $X_2$, and $X_3$ be three independent chi-square variables with $r_1$, $r_2$, and $r_3$ degrees of freedom, respectively. (a) Show that $Y_1 = X_1/X_2$ and $Y_2 = X_1 + X_2$ are independent and that $Y_2$ is $\chi^2(r_1 + r_2)$. (b) Deduce that

$$\frac{X_1/r_1}{X_2/r_2} \quad \text{and} \quad \frac{X_3/r_3}{(X_1 + X_2)/(r_1 + r_2)}$$

are independent F-variables.

7

Mixture Distributions

Recall the discussion on the contaminated normal distribution given in Section 4.1. This was an example of a mixture of normal distributions. In this section, we extend this to mixtures of distributions in general. Generally, we use continuous-type notation for the discussion, but discrete pmfs can be handled the same way. Suppose that we have $k$ distributions with respective pdfs $f_1(x), f_2(x), \ldots, f_k(x)$, with supports $\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_k$, means $\mu_1, \mu_2, \ldots, \mu_k$, and variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2$, with positive mixing probabilities $p_1, p_2, \ldots, p_k$, where $p_1 + p_2 + \cdots + p_k = 1$. Let $\mathcal{S} = \cup_{i=1}^{k} \mathcal{S}_i$ and consider the function

$$f(x) = p_1 f_1(x) + p_2 f_2(x) + \cdots + p_k f_k(x) = \sum_{i=1}^{k} p_i f_i(x), \quad x \in \mathcal{S}. \tag{7.1}$$

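Note that $f(x)$ is nonnegative and it is easy to see that it integrates to one over $(-\infty, \infty)$; hence, $f(x)$ is a pdf for some continuous-type random variable $X$. The mean of $X$ is given by

$$E(X) = \sum_{i=1}^{k} p_i \int_{-\infty}^{\infty} x f_i(x)\, dx = \sum_{i=1}^{k} p_i \mu_i = \overline{\mu}, \tag{7.2}$$

a weighted average of $\mu_1, \mu_2, \ldots, \mu_k$, and the variance equals

$$\text{var}(X) = \int_{-\infty}^{\infty} (x - \overline{\mu})^2 \sum_{i=1}^{k} p_i f_i(x)\, dx$$

$$= \sum_{i=1}^{k} p_i \int_{-\infty}^{\infty} [(x - \mu_i) + (\mu_i - \overline{\mu})]^2 f_i(x)\, dx$$

$$= \sum_{i=1}^{k} p_i \int_{-\infty}^{\infty} (x - \mu_i)^2 f_i(x)\, dx + \sum_{i=1}^{k} p_i (\mu_i - \overline{\mu})^2 \int_{-\infty}^{\infty} f_i(x)\, dx,$$

because the cross-product terms integrate to zero. That is,

$$\text{var}(X) = \sum_{i=1}^{k} p_i \sigma_i^2 + \sum_{i=1}^{k} p_i (\mu_i - \overline{\mu})^2. \tag{7.3}$$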
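Formulas (7.2) and (7.3) translate directly into a short computation. The following sketch is illustrative only: the function name `mixture_mean_var` and both sets of component parameters are assumptions made here, not values from the text.

```python
import math

def mixture_mean_var(p, mu, var):
    """Mean (7.2) and variance (7.3) of a mixture.

    p   -- mixing probabilities (positive, summing to 1)
    mu  -- component means
    var -- component variances
    """
    if not math.isclose(sum(p), 1.0):
        raise ValueError("mixing probabilities must sum to 1")
    mean = sum(pi * mi for pi, mi in zip(p, mu))
    within = sum(pi * vi for pi, vi in zip(p, var))               # weighted average of variances
    between = sum(pi * (mi - mean) ** 2 for pi, mi in zip(p, mu))  # weighted variance of the means
    return mean, within + between

# Hypothetical two-component mixture with different means.
m1, v1 = mixture_mean_var([0.5, 0.5], [0.0, 2.0], [1.0, 1.0])
# Hypothetical contaminated-normal-style mixture: equal means, one inflated variance.
m2, v2 = mixture_mean_var([0.9, 0.1], [0.0, 0.0], [1.0, 9.0])

print(m1, v1)
print(m2, v2)
```

In the first case the component means differ, so the second term of (7.3) contributes; in the second case the means coincide and the variance reduces to the weighted average of the component variances.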

Note that the variance is not simply the weighted average of the $k$ variances, but it also includes a nonnegative term involving the weighted variance of the means.

Remark 7.1. It is extremely important to note these characteristics are associated with a mixture of $k$ distributions and have nothing to do with a linear combination, say $\sum a_i X_i$, of $k$ random variables.

For the next example, we need the following distribution. We say that $X$ has a loggamma pdf with parameters $\alpha > 0$ and $\beta > 0$ if it has pdf

$$f_1(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{-(1+\beta)/\beta} (\log x)^{\alpha-1} & x > 1 \\ 0 & \text{elsewhere.} \end{cases} \tag{7.4}$$

The derivation of this pdf is given in Exercise 7.1, where its mean and variance are also derived. We denote this distribution of $X$ by $\log\Gamma(\alpha, \beta)$.

Example 7.1. Actuaries have found that a mixture of the loggamma and gamma distributions is an important model for claim distributions. Suppose, then, that $X_1$ is $\log\Gamma(\alpha_1, \beta_1)$, $X_2$ is $\Gamma(\alpha_2, \beta_2)$, and the mixing probabilities are $p$ and $(1 - p)$. Then the pdf of the mixture distribution is

$$f(x) = \frac{1-p}{\Gamma(\alpha_2)\beta_2^{\alpha_2}}\, x^{\alpha_2-1} e^{-x/\beta_2}, \quad 0 < x\ [\ldots]$$

[...] 0, find the unconditional pdf of $X$.

7.11. Let $X$ have the conditional Weibull pdf

$$f(x \mid \theta) = \theta\tau x^{\tau-1} e^{-\theta x^{\tau}}, \quad 0 < x < \infty,$$

and let the pdf (weighting function) $g(\theta)$ be gamma with parameters $\alpha$ and $\beta$. Show that the compound (marginal) pdf of $X$ is that of Burr.

7.12. If $X$ has a Pareto distribution with parameters $\alpha$ and $\beta$ and if $c$ is a positive constant, show that $Y = cX$ has a Pareto distribution with parameters $\alpha$ and $\beta/c$.

Answers to Selected Exercises

1.1 $\frac{40}{81}$.
1.4 $\frac{147}{512}$.
1.6 5.
1.9 $\frac{3}{16}$.
1.11 $\frac{65}{81}$.
1.14 $\frac{1}{3}\left(\frac{2}{3}\right)^{x-3}$, $x = 3, 4, 5, \ldots$.
1.15 $\frac{5}{72}$.
1.18 $\frac{1}{6}$.
1.19 $\frac{24}{625}$.
1.21 (a) $\frac{11}{6}$; (b) $\frac{x_1}{2}$; (c) $\frac{11}{6}$.
1.22 $\frac{25}{4}$.
1.27 (a) 0.0853; (b) 0.2637; (c) 0.0861, 0.2639.
2.1 0.09.
2.4 $4^x e^{-4}/x!$, $x = 0, 1, 2, \ldots$.
2.5 0.84.
2.8 About 6.7.
2.10 8.
2.11 2.
2.13 (a) $e^{-2}\exp\{(1 + e^{t_1})e^{t_2}\}$.
3.1 0.05.
3.2 0.831; 12.8.
3.3 0.90.
3.4 $\chi^2(4)$.
3.6 pdf is $3e^{-3y}$, $0 < y < \infty$.
3.7 2; 0.95.
3.15 $\frac{11}{16}$.
3.16 $\chi^2(2)$.
3.18 $\frac{\alpha}{\alpha+\beta}$; $\frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}$.
3.19 (a) 20; (b) 1260; (c) 495.
3.20 $\frac{10}{243}$.
3.24 (a) $(1 - 6t)^{-8}$, $t < \frac{1}{6}$; (b) $\Gamma(\alpha = 8, \beta = 6)$.
4.2 0.067; 0.685.
4.3 1.645.
4.4 71.4; 189.4.
4.8 0.598.
4.10 0.774.
4.11 $\sqrt{\frac{2}{\pi}}$; $\frac{\pi-2}{\pi}$.
4.12 0.90.
4.13 0.477.
4.14 0.461.
4.15 $N(0, 1)$.
4.16 0.433.
4.18 0; 3.
4.23 $N(0, 2)$.
4.28 0.24.
4.29 0.159.
4.30 0.159.
4.32 $\chi^2(2)$.
5.1 (a) 0.574; (b) 0.735.
5.2 (a) 0.264; (b) 0.440; (c) 0.433; (d) 0.643.
5.5 $\frac{4}{5}$.
5.6 (38.2, 43.4).
5.17 0.05.
6.1 0.05.
6.2 1.761.
6.9 $\frac{1}{4.74}$; 3.33.

Some Elementary Statistical Inferences 1

Sampling and Statistics

You may be familiar with the concepts of samples and statistics. We continue to develop your skills in this chapter while introducing the main tools of inference: confidence intervals and tests of hypotheses. In a typical statistical problem, we have a random variable $X$ of interest, but its pdf $f(x)$ or pmf $p(x)$ is not known. Our ignorance about $f(x)$ or $p(x)$ can roughly be classified in one of two ways:

1. $f(x)$ or $p(x)$ is completely unknown.
2. The form of $f(x)$ or $p(x)$ is known down to a parameter $\theta$, where $\theta$ may be a vector.

For now, we consider the second classification, although some of our discussion pertains to the first classification also. Some examples are the following:

(a) $X$ has an exponential distribution, $\text{Exp}(\theta)$, where $\theta$ is unknown.
(b) $X$ has a binomial distribution $b(n, p)$, where $n$ is known but $p$ is unknown.
(c) $X$ has a gamma distribution $\Gamma(\alpha, \beta)$, where both $\alpha$ and $\beta$ are unknown.
(d) $X$ has a normal distribution $N(\mu, \sigma^2)$, where both the mean $\mu$ and the variance $\sigma^2$ of $X$ are unknown.

We often denote this problem by saying that the random variable $X$ has a density or mass function of the form $f(x; \theta)$ or $p(x; \theta)$, where $\theta \in \Omega$ for a specified set $\Omega$. For example, in (a) above, $\Omega = \{\theta \mid \theta > 0\}$. We call $\theta$ a parameter of the distribution. Because $\theta$ is unknown, we want to estimate it.

From Chapter 4 of Introduction to Mathematical Statistics, Seventh Edition. Robert V. Hogg, Joseph W. McKean, Allen T. Craig. Copyright © 2013 by Pearson Education, Inc. All rights reserved.

In this process, our information about the unknown distribution of $X$ or the unknown parameters of the distribution of $X$ comes from a sample on $X$. The sample observations have the same distribution as $X$, and we denote them as the random variables $X_1, X_2, \ldots, X_n$, where $n$ denotes the sample size. When the sample is actually drawn, we use lower case letters $x_1, x_2, \ldots, x_n$ as the values or realizations of the sample. Often we assume that the sample observations $X_1, X_2, \ldots, X_n$ are also mutually independent, in which case we call the sample a random sample, which we now formally define:

Definition 1.1. If the random variables $X_1, X_2, \ldots, X_n$ are independent and identically distributed (iid), then these random variables constitute a random sample of size $n$ from the common distribution.

Often, functions of the sample are used to summarize the information in a sample. These are called statistics, which we define as follows:

Definition 1.2. Let $X_1, X_2, \ldots, X_n$ denote a sample on a random variable $X$. Let $T = T(X_1, X_2, \ldots, X_n)$ be a function of the sample. Then $T$ is called a statistic. Once the sample is drawn, then $t$ is called the realization of $T$, where $t = T(x_1, x_2, \ldots, x_n)$ and $x_1, x_2, \ldots, x_n$ is the realization of the sample.

Using this terminology, the problem we discuss in this chapter is phrased as: Let $X_1, X_2, \ldots, X_n$ denote a random sample on a random variable $X$ with a density or mass function of the form $f(x; \theta)$ or $p(x; \theta)$, where $\theta \in \Omega$ for a specified set $\Omega$. In this situation, it makes sense to consider a statistic $T$, which is an estimator of $\theta$. More formally, $T$ is called a point estimator of $\theta$. While we call $T$ an estimator of $\theta$, we call its realization $t$ an estimate of $\theta$. There are several properties of point estimators. We begin with a simple one, unbiasedness.

Definition 1.3 (Unbiasedness). Let $X_1, X_2, \ldots, X_n$ denote a sample on a random variable $X$ with pdf $f(x; \theta)$, $\theta \in \Omega$. Let $T = T(X_1, X_2, \ldots, X_n)$ be a statistic. We say that $T$ is an unbiased estimator of $\theta$ if $E(T) = \theta$.

The purpose of this chapter is an introduction to inference, so we briefly discuss the maximum likelihood estimator (mle) and then use it to obtain point estimators for some of the examples cited above. Our discussion is for the continuous case. For the discrete case, simply replace the pdf with the pmf. In our problem, the information in the sample and the parameter $\theta$ are involved in the joint distribution of the random sample; i.e., $\prod_{i=1}^{n} f(x_i; \theta)$. We want to view this as a function of $\theta$, so we write it as

$$L(\theta) = L(\theta; x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \theta). \tag{1.1}$$

This is called the likelihood function of the random sample. As an estimate of $\theta$, a measure of the center of $L(\theta)$ seems appropriate. An often-used estimate is that value of $\theta$ which provides a maximum of $L(\theta)$. If it is unique, this is called the maximum likelihood estimator (mle), and we denote it as $\widehat{\theta}$; i.e.,

$$\widehat{\theta} = \text{Argmax}\, L(\theta). \tag{1.2}$$

In practice, it is often much easier to work with the log of the likelihood, that is, the function $l(\theta) = \log L(\theta)$. Because the log is a strictly increasing function, the value which maximizes $l(\theta)$ is the same as the value which maximizes $L(\theta)$. Furthermore, for most of the models we discuss, the pdf (or pmf) is a differentiable function of $\theta$, and frequently $\widehat{\theta}$ solves the equation

$$\frac{\partial l(\theta)}{\partial \theta} = 0. \tag{1.3}$$

If $\theta$ is a vector of parameters, this results in a system of equations to be solved simultaneously; see Example 1.3. Under general conditions, mles have some good properties. One property that we need at the moment concerns the situation where, besides the parameter $\theta$, we are also interested in the parameter $\eta = g(\theta)$ for a specified function $g$. Then the mle of $\eta$ is $\widehat{\eta} = g(\widehat{\theta})$, where $\widehat{\theta}$ is the mle of $\theta$. We now proceed with some examples.

Example 1.1 (Exponential Distribution). Suppose the common pdf of the random sample $X_1, X_2, \ldots, X_n$ is the $\Gamma(1, \theta)$ density. The log of the likelihood function is given by

$$l(\theta) = \log \prod_{i=1}^{n} \frac{1}{\theta} e^{-x_i/\theta} = -n \log \theta - \theta^{-1} \sum_{i=1}^{n} x_i.$$

The first partial of the log-likelihood with respect to $\theta$ is

$$\frac{\partial l(\theta)}{\partial \theta} = -n\theta^{-1} + \theta^{-2} \sum_{i=1}^{n} x_i.$$

Setting this partial to 0 and solving for $\theta$, we obtain the solution $\overline{x}$. There is only one critical value and, furthermore, the second partial of the log-likelihood evaluated at $\overline{x}$ is strictly negative, verifying that it provides a maximum. Hence, for this example, the statistic $\widehat{\theta} = \overline{X}$ is the mle of $\theta$. Because $E(X) = \theta$, we have that $E(\overline{X}) = \theta$ and, hence, $\widehat{\theta}$ is an unbiased estimator of $\theta$.

Example 1.2 (Binomial Distribution). Let $X$ be one or zero if, respectively, the outcome of a Bernoulli experiment is success or failure. Let $\theta$, $0 < \theta < 1$, denote the probability of success. The pmf of $X$ is

$$p(x; \theta) = \theta^x (1 - \theta)^{1-x}, \quad x = 0 \text{ or } 1.$$

If $X_1, X_2, \ldots, X_n$ is a random sample on $X$, then the likelihood function is

$$L(\theta) = \prod_{i=1}^{n} p(x_i; \theta) = \theta^{\sum_{i=1}^{n} x_i} (1 - \theta)^{n - \sum_{i=1}^{n} x_i}, \quad x_i = 0 \text{ or } 1.$$

Taking logs, we have

$$l(\theta) = \sum_{i=1}^{n} x_i \log \theta + \left(n - \sum_{i=1}^{n} x_i\right) \log(1 - \theta), \quad x_i = 0 \text{ or } 1.$$

The partial derivative of $l(\theta)$ is

$$\frac{\partial l(\theta)}{\partial \theta} = \frac{\sum_{i=1}^{n} x_i}{\theta} - \frac{n - \sum_{i=1}^{n} x_i}{1 - \theta}.$$

Setting this to 0 and solving for $\theta$, we obtain $\widehat{\theta} = n^{-1} \sum_{i=1}^{n} X_i = \overline{X}$; i.e., the mle is the proportion of successes in the $n$ trials. Because $E(X) = \theta$, $\widehat{\theta}$ is an unbiased estimator of $\theta$.

Example 1.3 (Normal Distribution). Let $X$ have a $N(\mu, \sigma^2)$ distribution with the pdf given. In this case, $\boldsymbol{\theta}$ is the vector $\boldsymbol{\theta} = (\mu, \sigma)$. If $X_1, X_2, \ldots, X_n$ is a random sample on $X$, then the log of the likelihood function simplifies to

$$l(\mu, \sigma) = -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2} \sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)^2. \tag{1.4}$$

The two partial derivatives simplify to

$$\frac{\partial l(\mu, \sigma)}{\partial \mu} = -\sum_{i=1}^{n} \left(\frac{x_i - \mu}{\sigma}\right)\left(-\frac{1}{\sigma}\right) \tag{1.5}$$

$$\frac{\partial l(\mu, \sigma)}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{1.6}$$

Setting these to 0 and solving simultaneously, we see that the mles are

$$\widehat{\mu} = \overline{X} \tag{1.7}$$

$$\widehat{\sigma}^2 = n^{-1} \sum_{i=1}^{n} (X_i - \overline{X})^2. \tag{1.8}$$

Notice that we have used the property that the mle of $\sigma^2$ is the mle of $\sigma$ squared. We see that $\widehat{\mu}$ is an unbiased estimator of $\mu$, while $\widehat{\sigma}^2$ is a biased estimator of $\sigma^2$. By (8.4), though, the bias of $\widehat{\sigma}^2$ is $E(\widehat{\sigma}^2 - \sigma^2) = -\sigma^2/n$, which converges to 0 as $n \to \infty$.

In all three of these examples, standard differential calculus methods led us to the solution. For the next example, the support of the random variable involves $\theta$ and, hence, it is not surprising that for this case differential calculus is not useful.

Example 1.4 (Uniform Distribution). Let $X_1, \ldots, X_n$ be iid with the uniform $(0, \theta)$ density; i.e., $f(x) = 1/\theta$ for $0 < x < \theta$, 0 elsewhere. Because $\theta$ is in the support, differentiation is not helpful here. The likelihood function can be written as

$$L(\theta) = \theta^{-n} I(\max\{x_i\}, \theta), \quad \text{for all } \theta > 0,$$

where $I(a, b)$ is 1 or 0 if $a \le b$ or $a > b$, respectively. The function $L(\theta)$ is a decreasing function of $\theta$ for all $\theta \ge \max\{x_i\}$ and is 0 otherwise [sketch the graph of $L(\theta)$]. So the maximum occurs at the smallest value that $\theta$ can assume; i.e., the mle is $\widehat{\theta} = \max\{X_i\}$.
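The closed-form mles of Examples 1.1 through 1.4 are easy to compute and to sanity-check against the likelihood itself. The sketch below is a hedged illustration: the data values in `xs` and `bern`, the grid of candidate values for $\theta$, and all function names are hypothetical choices made here, not taken from the text.

```python
import math

def exp_loglik(theta, xs):
    """Log-likelihood of Example 1.1: -n log(theta) - sum(x_i)/theta."""
    return -len(xs) * math.log(theta) - sum(xs) / theta

def mle_exponential(xs):
    """Example 1.1: the mle of theta is the sample mean."""
    return sum(xs) / len(xs)

def mle_bernoulli(xs):
    """Example 1.2: the mle of theta is the proportion of successes."""
    return sum(xs) / len(xs)

def mle_normal(xs):
    """Example 1.3: returns (mu_hat, sigma2_hat); note the divisor n, not n - 1."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

def uniform_lik(theta, xs):
    """Likelihood of Example 1.4: theta^(-n) if max(x_i) <= theta, else 0."""
    return theta ** (-len(xs)) if max(xs) <= theta else 0.0

def mle_uniform(xs):
    """Example 1.4: the mle of theta is the sample maximum."""
    return max(xs)

xs = [0.8, 2.1, 1.3, 0.4, 3.4]   # hypothetical positive observations
bern = [1, 0, 1, 1, 0]           # hypothetical Bernoulli outcomes

theta_exp = mle_exponential(xs)
grid = [0.1 * k for k in range(1, 51)]  # candidate theta values 0.1, ..., 5.0
grid_is_dominated = all(exp_loglik(theta_exp, xs) >= exp_loglik(t, xs) for t in grid)

mu_hat, sigma2_hat = mle_normal(xs)
theta_unif = mle_uniform(xs)

print("exponential mle:", theta_exp, "maximizes on grid:", grid_is_dominated)
print("Bernoulli mle:", mle_bernoulli(bern))
print("normal mles:", mu_hat, sigma2_hat)
print("uniform mle:", theta_unif,
      "L just below:", uniform_lik(theta_unif - 0.1, xs),
      "L just above:", uniform_lik(theta_unif + 0.1, xs))
```

The uniform case shows why calculus does not help there: the likelihood jumps from 0 to its largest value at $\max\{x_i\}$ and decreases afterward.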

1.1

Histogram Estimates of pmfs and pdfs

Let $X_1, \ldots, X_n$ be a random sample on a random variable $X$ with cdf $F(x)$. In this section, we briefly discuss a histogram of the sample, which is an estimate of the pmf, $p(x)$, or the pdf, $f(x)$, of $X$ depending on whether $X$ is discrete or continuous. Other than $X$ being a discrete or continuous random variable, we make no assumptions on the form of the distribution of $X$. In particular, we do not assume a parametric form of the distribution as we did for the above discussion on maximum likelihood estimates; hence, the histogram that we present is often called a nonparametric estimator. We discuss the discrete situation first.

The Distribution of X Is Discrete

Assume that $X$ is a discrete random variable with pmf $p(x)$. Suppose first that the space of $X$ is finite, say, $\mathcal{D} = \{a_1, \ldots, a_m\}$. An intuitive estimator of $p(a_j)$ is the relative frequency of sample observations that are equal to $a_j$. For $j = 1, 2, \ldots, m$, define the statistics

$$I_j(X_i) = \begin{cases} 1 & X_i = a_j \\ 0 & X_i \ne a_j. \end{cases}$$

Then the intuitive estimate of $p(a_j)$ can be expressed by the average

$$\widehat{p}(a_j) = \frac{1}{n} \sum_{i=1}^{n} I_j(X_i). \tag{1.9}$$

Thus the estimates $\{\widehat{p}(a_1), \ldots, \widehat{p}(a_m)\}$ constitute the nonparametric estimate of the pmf $p(x)$. Note that $I_j(X_i)$ has a Bernoulli distribution with probability of success $p(a_j)$. As Exercise 1.6 shows, our estimator of the pmf is unbiased.

Suppose next that the space of $X$ is infinite, say, $\mathcal{D} = \{a_1, a_2, \ldots\}$. In practice, we select a value, say, $a_m$, and make the groupings

$$\{a_1\}, \{a_2\}, \ldots, \{a_m\}, \quad \tilde{a}_{m+1} = \{a_{m+1}, a_{m+2}, \ldots\}. \tag{1.10}$$

Let $\widehat{p}(\tilde{a}_{m+1})$ be the proportion of sample items that are greater than or equal to $a_{m+1}$. Then the estimates $\{\widehat{p}(a_1), \ldots, \widehat{p}(a_m), \widehat{p}(\tilde{a}_{m+1})\}$ form our estimate of $p(x)$. For the merging of groups, a rule of thumb is to select $m$ so that the frequency of the category $a_m$ exceeds twice the combined frequencies of the categories $a_{m+1}, a_{m+2}, \ldots$.

A histogram is a barplot of $\widehat{p}(a_j)$ versus $a_j$. There are two cases to consider. For the first case, suppose the values $a_j$ represent qualitative categories, for example, hair colors of a population. In this case, there is no ordinal information in the $a_j$s. The usual histogram for such data consists of nonabutting bars with heights $\widehat{p}(a_j)$ that are plotted in decreasing order of the $\widehat{p}(a_j)$s. Such histograms are usually called bar charts. An example is helpful here.

Example 1.5 (Hair Color of Scottish Schoolchildren). Kendall and Stuart (1979) presented data on the hair color of Scottish schoolchildren in the early 1900s. Five hair colors were recorded for this sample of size $n = 22{,}361$. The frequency distribution of this sample and the estimate of the pmf are

            Fair     Red      Medium   Dark     Black
Count       5789     1319     9418     5678     157
p̂(a_j)      0.259    0.059    0.421    0.254    0.007

The bar chart of this sample is shown in Figure 1.1.

[Figure 1.1: Bar chart of the Scottish hair color data discussed in Example 1.5. Title: Bar Chart of Haircolor of Scottish Schoolchildren. Vertical axis from 0.0 to 0.4; horizontal axis Haircolor, with bars in the order Medium, Fair, Dark, Red, Black.]

For the second case, assume that the values in the space $\mathcal{D}$ are ordinal in nature; i.e., the natural ordering of the $a_j$s is numerically meaningful. In this case, the usual histogram is an abutting bar chart with heights $\widehat{p}(a_j)$ that are plotted in the natural order of the $a_j$s, as in the following example.

Example 1.6 (Simulated Poisson Variates). The following 30 data points are simulated values drawn from a Poisson distribution with mean $\lambda = 2$; see Example 8.2 for the generation of Poisson variates.

2  1  1  1  1  5  1  1  3  0  2  1  1  3  4
2  1  2  2  6  5  2  3  2  4  1  3  1  3  0

The nonparametric estimate of the pmf is

j        0        1        2        3        4        5        ≥6
p̂(j)     0.067    0.367    0.233    0.167    0.067    0.067    0.033

The histogram for this data set is given in Figure 1.2. Note that counts are used for the vertical axis.

Figure 1.2: Histogram of the Poisson variates of Example 1.6 (title: Histogram of Poisson Variates; horizontal axis: Number of events, 0 to 6; vertical axis: counts, 0 to 10).

The Distribution of X Is Continuous

For this section, assume that the random sample $X_1, \ldots, X_n$ is from a continuous random variable X with continuous pdf f(t). We first sketch an estimate for this pdf at a specified value of x. Then we use this estimate to develop a histogram estimate of the pdf. For an arbitrary but fixed point x and a given h > 0, consider the interval (x − h, x + h). By the mean value theorem for integrals, we have for some ξ, |x − ξ| < h, that

$$P(x - h < X < x + h) = \int_{x-h}^{x+h} f(t)\,dt = f(\xi)\,2h \approx f(x)\,2h.$$

The nonparametric estimate of the left side is the proportion of the sample items that fall in the interval (x − h, x + h). This suggests the following nonparametric estimate of f(x) at a given x:

$$\hat{f}(x) = \frac{\#\{x - h < X_i < x + h\}}{2hn}. \tag{1.11}$$

To write this more formally, consider the indicator statistic

$$I_i(x) = \begin{cases} 1 & x - h < X_i < x + h \\ 0 & \text{otherwise,} \end{cases} \qquad i = 1, \ldots, n.$$

Then a nonparametric estimator of f(x) is

$$\hat{f}(x) = \frac{1}{2hn} \sum_{i=1}^{n} I_i(x). \tag{1.12}$$

Since the sample items are identically distributed,

$$E[\hat{f}(x)] = \frac{1}{2hn}\, n f(\xi)\, 2h = f(\xi) \to f(x),$$

as h → 0. Hence $\hat{f}(x)$ is approximately an unbiased estimator of the density f(x). In density estimation terminology, the indicator function $I_i$ is called a rectangular kernel with bandwidth 2h. See Chapter 6 of Lehmann (1999) for a discussion of density estimation.

Let $x_1, \ldots, x_n$ be the realized values of the random sample. Our histogram estimate of f(x) is obtained as follows. For the discrete case, there are natural classes for the histogram, namely, the domain values. For the continuous case, though, classes must be selected. One way of doing this is to select a positive integer m, an h > 0, and a value a such that $a < \min x_i$, so that the m intervals

$$(a - h, a + h],\ (a + h, a + 3h],\ (a + 3h, a + 5h],\ \ldots,\ (a + (2m-3)h, a + (2m-1)h] \tag{1.13}$$

cover the range of the sample $[\min x_i, \max x_i]$. These intervals form our classes.

For the histogram, over the interval $(a + (2i-3)h, a + (2i-1)h]$, i = 1, . . . , m, let the height of the bar be the density estimate given in expression (1.12) at the midpoint of the interval, i.e., $\hat{f}[a + 2(i-1)h]$. The height of the bar is thus proportional to the number of $x_j$s that fall in the interval, and this height is our histogram estimate of the density over the whole interval. To complete the histogram estimate of f(x), take it to be 0 for x ≤ a − h and for x > a + (2m − 1)h. Denote the intervals of the partition by $I_i = (a + (2i-3)h, a + (2i-1)h]$, i = 1, . . . , m. Then we can summarize our histogram estimate of the pdf by

$$\hat{f}(x) = \begin{cases} \dfrac{\#\{a + (2i-3)h < X_j \le a + (2i-1)h\}}{2hn} & x \in I_i,\ i = 1, \ldots, m \\ 0 & \text{elsewhere.} \end{cases} \tag{1.14}$$

Hence the estimator is nonnegative and, as Exercise 1.9 shows, it integrates to 1 over (−∞, ∞). So it satisfies the properties of a pdf.
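The pointwise estimator (1.12) is simple enough to sketch directly. The function name `kernel_density` and the five-point sample below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def kernel_density(x, sample, h):
    """Rectangular-kernel estimate (1.12): share of points in (x-h, x+h) over 2h."""
    inside = sum(1 for xi in sample if x - h < xi < x + h)
    return inside / (2 * h * len(sample))

sample = [1.0, 2.0, 2.5, 3.0, 4.5]   # hypothetical data, not from the text
# Three of the five points (2.0, 2.5, 3.0) fall in (1.5, 3.5),
# so the estimate at x = 2.5 is 3 / (2 * 1 * 5).
print(kernel_density(2.5, sample, h=1.0))
```

A smaller h makes the estimate more local but also noisier, since fewer sample points fall in the window.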

Example 1.7 (Histogram for Normally Generated Data). The following data are the rounded values of a generated set of data from a N(50, 100) distribution.

63 58 60 60 39 41 57 49 44 36
52 48 44 19 42 67 44 64 34 46

See Example 8.5 for details on generating a sample from a normal distribution. To construct a histogram for these data, we selected six intervals of length 10, by setting a = 15 and h = 5. The resulting histogram is displayed in Figure 1.3. Counts, not relative frequencies, are used for the vertical axis. Note that the two values of 60 are included in the interval (50, 60]. The sample size is too small to guess the probability model which generated the data.
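The class counts behind the histogram of Example 1.7 can be computed as follows. This is a minimal sketch: the function name `histogram_counts` is illustrative, and the classes are the half-open intervals of expression (1.13) with a = 15, h = 5, and m = 6.

```python
import math

# The 20 rounded normal variates of Example 1.7.
data = [63, 58, 60, 60, 39, 41, 57, 49, 44, 36,
        52, 48, 44, 19, 42, 67, 44, 64, 34, 46]

def histogram_counts(data, a, h, m):
    """Count the points in each class (a+(2i-3)h, a+(2i-1)h], i = 1..m."""
    counts = [0] * m
    for x in data:
        # Class index i satisfies a+(2i-3)h < x <= a+(2i-1)h.
        i = math.ceil((x - a + h) / (2 * h))
        if 1 <= i <= m:
            counts[i - 1] += 1
    return counts

a, h, m = 15, 5, 6
counts = histogram_counts(data, a, h, m)
# Bar heights on the density scale, as in expression (1.14).
heights = [c / (2 * h * len(data)) for c in counts]
print(counts)
```

The classes are (10, 20], (20, 30], . . . , (60, 70], and both values of 60 land in (50, 60]. Multiplying each height by the class width 2h and summing gives 1, which is the property asked for in Exercise 1.9.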

Figure 1.3: Histogram of the normal variates discussed in Example 1.7 (title: Twenty Simulated Normal Variates; horizontal axis: x, 10 to 70; vertical axis: counts, 0 to 8).

For the discrete case, except when classes are merged, the histogram is unique. For the continuous case, though, the histogram depends on the classes chosen. The resulting picture can be quite different if the classes are changed. If the histogram is not appealing, then a different set of classes may be used. This would seem to mean obtaining a new frequency distribution. There is a simple way, however. First choose classes so that the numbers in each class begin with the same digits. These digits are the stems and should be thought of as the classes. The trailing digits are called the leaves. A histogram can be constructed by writing the stems in a single column and then attaching the leaves. This is called a stem-leaf plot; see Tukey (1977). Consider, for instance, the data of Example 1.7. As stems we take 1, 2, . . . , 6. Then the stem-leaf plot is

1 | 9
2 |
3 | 469
4 | 12444689
5 | 278
6 | 00347

The leaf 9 in the first row represents the data point 19. The leaves are in order because this stem-leaf plot was computed by the R command stem. If the stem-leaf plot is done by hand, then the leaves should be attached in the order that the data are read, and, hence, they may not necessarily be in order. Notice that if we rotate it 90°, it is similar to the histogram given in Figure 1.3, except that the two values of 60 were placed in the interval with midpoint 55 in the histogram. For our histogram, although there appear to be enough stems (classes), suppose we think that there are too few. Then we can easily split each stem to obtain a new histogram. For example, the stem 4 splits into low-4 (leaves: 0–4) and high-4 (leaves: 5–9). Thus we need not obtain a new frequency distribution.

EXERCISES

1.1. Twenty motors were put on test under a high-temperature setting. The lifetimes in hours of the motors under these conditions are given below. Suppose we assume that the lifetime of a motor under these conditions, X, has a Γ(1, θ) distribution.

1 4 5 21 22 28 40 42 51 53
58 67 95 124 124 160 202 260 303 363

(a) Obtain a frequency distribution and a histogram or a stem-leaf plot of the data. Use the intervals [0, 50), [50, 100), . . . . Based on this plot, do you think that the Γ(1, θ) model is credible?
(b) Obtain the maximum likelihood estimate of θ and locate it on your plot.
(c) Obtain the sample median of the data, which is an estimate of the median lifetime of a motor. What parameter is it estimating (i.e., determine the median of X)?
(d) Based on the mle, what is another estimate of the median of X?

1.2. The weights of 26 professional baseball pitchers are given below; [see page 76 of Hettmansperger and McKean (2011) for the complete data set]. Suppose we assume that the weight of a professional baseball pitcher is normally distributed with mean μ and variance $\sigma^2$.

160 175 180 185 185 185 190 190 195 195 195 200 200
200 200 205 205 210 210 218 219 220 222 225 225 232

(a) Obtain a frequency distribution and a histogram or a stem-leaf plot of the data. Use 5-pound intervals. Based on this plot, is a normal probability model credible?
(b) Obtain the maximum likelihood estimates of μ, $\sigma^2$, σ, and μ/σ. Locate your estimate of μ on your plot in part (a).
(c) Using the binomial model, obtain the maximum likelihood estimate of the proportion p of professional baseball pitchers who weigh over 215 pounds.
(d) Determine the mle of p assuming that the weight of a professional baseball player follows the normal probability model $N(\mu, \sigma^2)$ with μ and σ unknown.

1.3. Suppose the number of customers X that enter a store between the hours 9:00 a.m. and 10:00 a.m. follows a Poisson distribution with parameter θ. Suppose a random sample of the number of customers that enter the store between 9:00 a.m. and 10:00 a.m. for 10 days results in the values

9 7 9 15 10 13 11 7 2 12

(a) Determine the maximum likelihood estimate of θ. Show that it is an unbiased estimator.
(b) Based on these data, obtain the realization of your estimator in part (a). Explain the meaning of this estimate in terms of the number of customers.

1.4. For Example 1.3, verify equations (1.4)-(1.8).

1.5. Let $X_1, X_2, \ldots, X_n$ be a random sample from a continuous-type distribution.
(a) Find $P(X_1 \le X_2)$, $P(X_1 \le X_2, X_1 \le X_3)$, . . . , $P(X_1 \le X_i,\ i = 2, 3, \ldots, n)$.
(b) Suppose the sampling continues until $X_1$ is no longer the smallest observation (i.e., $X_j < X_1 \le X_i$, i = 2, 3, . . . , j − 1). Let Y equal the number of trials, not including $X_1$, until $X_1$ is no longer the smallest observation (i.e., Y = j − 1). Show that the distribution of Y is

$$P(Y = y) = \frac{1}{y(y+1)}, \quad y = 1, 2, 3, \ldots.$$

(c) Compute the mean and variance of Y if they exist.

1.6. Show that the estimate of the pmf in expression (1.9) is an unbiased estimate. Find the variance of the estimator also.

1.7. The data set on Scottish schoolchildren discussed in Example 1.5 included the eye colors of the children also. The frequencies of their eye colors are

Blue    Light   Medium  Dark
2978    6697    7511    5175

Use these frequencies to obtain a bar chart and an estimate of the associated pmf.

1.8. Recall that for the parameter η = g(θ), the mle of η is $g(\hat{\theta})$, where $\hat{\theta}$ is the mle of θ. Assuming that the data in Example 1.6 were drawn from a Poisson distribution with mean λ, obtain the mle of λ and then use it to obtain the mle of the pmf. Compare the mle of the pmf to the nonparametric estimate. Note: For the domain value 6, obtain the mle of P(X ≥ 6).

1.9. Show that the nonparametric estimate of a pdf f(x) given in expression (1.14) integrates to 1 over (−∞, ∞).

1.10. Consider the histogram for the sample of size 20 in Example 1.7.
(a) Compute the nonparametric estimator (1.12) of the density at the point x = 45.
(b) Assuming a normal, $N(\mu, \sigma^2)$, distribution, compute the mles of μ and σ.
(c) Compute the mle of the density at the point x = 45 and compare it with your answer in part (a).
(d) Compute f(45), where f is the density from a N(50, 100) distribution, and compare the nonparametric and mle estimates with it.

1.11. For the nonparametric estimator (1.12) of a pdf,
(a) Obtain its mean and determine the bias of the estimator.
(b) Obtain its variance.

2 Confidence Intervals

Let us continue with the statistical problem that we were discussing in Section 1. Recall that the random variable of interest X has density f(x; θ), θ ∈ Ω, where θ is unknown. In that section, we discussed estimating θ by a statistic $\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)$, where $X_1, \ldots, X_n$ is a sample from the distribution of X. When the sample is drawn, it is unlikely that the value of $\hat{\theta}$ is the true value of the parameter. In fact, if $\hat{\theta}$ has a continuous distribution, then $P_\theta(\hat{\theta} = \theta) = 0$. What is needed is an estimate of the error of the estimation; i.e., by how much did $\hat{\theta}$ miss θ? In this section, we embody this estimate of error in terms of a confidence interval, which we now formally define:

Definition 2.1 (Confidence Interval). Let $X_1, X_2, \ldots, X_n$ be a sample on a random variable X, where X has pdf f(x; θ), θ ∈ Ω. Let 0 < α < 1 be specified. Let $L = L(X_1, X_2, \ldots, X_n)$ and $U = U(X_1, X_2, \ldots, X_n)$ be two statistics. We say that the interval (L, U) is a (1 − α)100% confidence interval for θ if

$$1 - \alpha = P_\theta[\theta \in (L, U)]. \tag{2.1}$$

That is, the probability that the interval includes θ is 1 − α, which is called the confidence coefficient of the interval.

Once the sample is drawn, the realized value of the confidence interval is (l, u), an interval of real numbers. Either the interval (l, u) traps θ or it does not. One way of thinking of a confidence interval is in terms of Bernoulli trials with probability of success 1 − α. If one makes, say, M independent confidence intervals over a period of time, then one would expect to have (1 − α)M successful confidence intervals (those that trap θ) over this period of time. Hence one feels (1 − α)100% confident that the true value of θ lies in the interval (l, u).

A measure of efficiency based on confidence intervals is their expected length. Suppose $(L_1, U_1)$ and $(L_2, U_2)$ are two confidence intervals for θ that have the same confidence coefficient. Then we say that $(L_1, U_1)$ is more efficient than $(L_2, U_2)$ if $E_\theta(U_1 - L_1) \le E_\theta(U_2 - L_2)$ for all θ ∈ Ω.

There are several procedures for obtaining confidence intervals. We explore one of them in this section. It is based on a pivot random variable. The pivot is usually a function of an estimator of θ and the parameter and, further, the distribution of the pivot is known. Using this information, an algebraic derivation can often be used to obtain a confidence interval. The next several examples illustrate the pivot method. A second way to obtain a confidence interval involves distribution free techniques, as used in Section 4.2 to determine confidence intervals for quantiles.

Example 2.1 (Confidence Interval for μ Under Normality). Suppose the random variables $X_1, \ldots, X_n$ are a random sample from a $N(\mu, \sigma^2)$ distribution. Let $\overline{X}$ and $S^2$ denote the sample mean and sample variance, respectively. Recall from the last section that $\overline{X}$ is the mle of μ and $[(n-1)/n]S^2$ is the mle of $\sigma^2$. The random variable $T = (\overline{X} - \mu)/(S/\sqrt{n})$ has a t-distribution with n − 1 degrees of freedom. The random variable T is our pivot variable.
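For 0 < α < 1, define $t_{\alpha/2,n-1}$ to be the upper α/2 critical point of a t-distribution with n − 1 degrees of freedom; i.e., $\alpha/2 = P(T > t_{\alpha/2,n-1})$. Using a simple algebraic derivation, we obtain

$$\begin{aligned}
1 - \alpha &= P(-t_{\alpha/2,n-1} < T < t_{\alpha/2,n-1}) \\
&= P\left(-t_{\alpha/2,n-1} < \frac{\overline{X} - \mu}{S/\sqrt{n}} < t_{\alpha/2,n-1}\right) \\
&= P\left(-t_{\alpha/2,n-1}\frac{S}{\sqrt{n}} < \overline{X} - \mu < t_{\alpha/2,n-1}\frac{S}{\sqrt{n}}\right) \\
&= P\left(\overline{X} - t_{\alpha/2,n-1}\frac{S}{\sqrt{n}} < \mu < \overline{X} + t_{\alpha/2,n-1}\frac{S}{\sqrt{n}}\right).
\end{aligned} \tag{2.2}$$

Once the sample is drawn, let $\overline{x}$ and s denote the realized values of the statistics $\overline{X}$ and S, respectively. Then a (1 − α)100% confidence interval for μ is given by

$$\left(\overline{x} - t_{\alpha/2,n-1}\, s/\sqrt{n},\ \overline{x} + t_{\alpha/2,n-1}\, s/\sqrt{n}\right). \tag{2.3}$$

This interval is often referred to as the (1 − α)100% t-interval for μ. The estimate of the standard deviation of $\overline{X}$, $s/\sqrt{n}$, is referred to as the standard error of $\overline{X}$.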
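A minimal sketch of the t-interval (2.3) follows. The function name `t_interval` and the ten data values are hypothetical; because the Python standard library has no t quantile function, the critical value is passed in by the caller, here $t_{0.025,9} = 2.262$ as read from a t-table.

```python
import math
import statistics

def t_interval(data, t_crit):
    """Interval (2.3); t_crit is the upper alpha/2 point of t with n-1 df."""
    n = len(data)
    xbar = statistics.mean(data)
    s = statistics.stdev(data)          # sample standard deviation (divisor n-1)
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical sample of size 10, not taken from the text.
data = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.6, 4.9, 5.2, 4.4]
lower, upper = t_interval(data, t_crit=2.262)   # 95% interval, 9 df
print(lower, upper)
```

The interval is centered at the sample mean, and its half-width is the critical value times the standard error $s/\sqrt{n}$.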

The t-distribution of the pivot random variable $T = (\overline{X} - \mu)/(S/\sqrt{n})$ in the last example depends on the normality of the sampled items; however, the result holds approximately even if the sampled items are not drawn from a normal distribution. The Central Limit Theorem (CLT) shows that, for large n, the distribution of T is approximately N(0, 1). In order to use this result now, we state the CLT.

Theorem 2.1 (Central Limit Theorem). Let $X_1, X_2, \ldots, X_n$ denote the observations of a random sample from a distribution that has mean μ and finite variance $\sigma^2$. Then the distribution function of the random variable $W_n = (\overline{X} - \mu)/(\sigma/\sqrt{n})$ converges to Φ, the distribution function of the N(0, 1) distribution, as n → ∞.

The result stays the same if we replace σ by the sample standard deviation S; that is, under the assumptions of Theorem 2.1, the distribution of

$$Z_n = \frac{\overline{X} - \mu}{S/\sqrt{n}} \tag{2.4}$$

is approximately N(0, 1). For the nonnormal case, as the next example shows, with this result we can obtain an approximate confidence interval for μ.

Example 2.2 (Large Sample Confidence Interval for the Mean μ). Suppose $X_1, X_2, \ldots, X_n$ is a random sample on a random variable X with mean μ and variance $\sigma^2$, but, unlike the last example, the distribution of X is not normal. However, from the above discussion we know that the distribution of $Z_n$, (2.4), is approximately N(0, 1). Hence, with $z_{\alpha/2}$ denoting the upper α/2 critical point of the N(0, 1) distribution,

$$1 - \alpha \approx P\left(-z_{\alpha/2} < \frac{\overline{X} - \mu}{S/\sqrt{n}} < z_{\alpha/2}\right).$$

Using the same algebraic derivation as in the last example, we obtain

$$1 - \alpha \approx P\left(\overline{X} - z_{\alpha/2}\frac{S}{\sqrt{n}} < \mu < \overline{X} + z_{\alpha/2}\frac{S}{\sqrt{n}}\right). \tag{2.5}$$

Again, letting $\overline{x}$ and s denote the realized values of the statistics $\overline{X}$ and S, respectively, after the sample is drawn, an approximate (1 − α)100% confidence interval for μ is given by

$$\left(\overline{x} - z_{\alpha/2}\, s/\sqrt{n},\ \overline{x} + z_{\alpha/2}\, s/\sqrt{n}\right). \tag{2.6}$$

This is called a large sample confidence interval for μ.

In practice, we often do not know if the population is normal. Which confidence interval should we use? Generally, for the same α, the intervals based on $t_{\alpha/2,n-1}$ are larger than those based on $z_{\alpha/2}$. Hence the interval (2.3) is generally more conservative than the interval (2.6). So in practice, statisticians generally prefer the interval (2.3).

Occasionally in practice, the standard deviation σ is assumed known. In this case, the confidence interval generally used for μ is (2.6) with s replaced by σ.
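The Bernoulli-trials reading of a confidence interval can be illustrated by simulation. The sketch below, under assumptions of our own choosing (exponential data with mean 1, n = 50, a fixed seed, and the function names `large_sample_interval` and `coverage`), builds the large sample interval (2.6) many times and records how often it traps the true mean.

```python
import math
import random
import statistics
from statistics import NormalDist

def large_sample_interval(data, alpha):
    """Approximate (1 - alpha)100% interval (2.6) for the mean."""
    n = len(data)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    xbar = statistics.mean(data)
    half_width = z * statistics.stdev(data) / math.sqrt(n)
    return xbar - half_width, xbar + half_width

def coverage(n, alpha, reps, seed=0):
    """Fraction of simulated intervals that trap the true mean (here 1.0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        data = [rng.expovariate(1.0) for _ in range(n)]   # nonnormal, mean 1
        lower, upper = large_sample_interval(data, alpha)
        hits += lower < 1.0 < upper
    return hits / reps

print(coverage(n=50, alpha=0.05, reps=2000))
```

Because the interval is only approximate for nonnormal data, the observed fraction is expected to be near, but not exactly at, the nominal 0.95.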

Example 2.3 (Large Sample Confidence Interval for p). Let X be a Bernoulli random variable with probability of success p, where X is 1 or 0 if the outcome is success or failure, respectively. Suppose $X_1, \ldots, X_n$ is a random sample from the distribution of X. Let $\hat{p} = \overline{X}$ be the sample proportion of successes. Note that $\hat{p} = n^{-1}\sum_{i=1}^{n} X_i$ is a sample average and that $\mathrm{Var}(\hat{p}) = p(1-p)/n$. It follows immediately from the CLT that the distribution of $Z = (\hat{p} - p)/\sqrt{p(1-p)/n}$ is approximately N(0, 1). We replace p(1 − p) with its estimate $\hat{p}(1 - \hat{p})$. Then proceeding as in the last example, an approximate (1 − α)100% confidence interval for p is given by

$$\left(\hat{p} - z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n},\ \hat{p} + z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}\right), \tag{2.7}$$

where $\sqrt{\hat{p}(1-\hat{p})/n}$ is called the standard error of $\hat{p}$.
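Interval (2.7) translates directly into code. In this sketch the function name `proportion_interval` and the counts (90 successes in 200 trials) are hypothetical; the normal critical value comes from `statistics.NormalDist` in the Python standard library.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, alpha):
    """Approximate (1 - alpha)100% interval (2.7) for a Bernoulli proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - alpha / 2)       # upper alpha/2 point of N(0, 1)
    se = math.sqrt(p_hat * (1 - p_hat) / n)       # standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# Hypothetical counts, not taken from the text.
lower, upper = proportion_interval(successes=90, n=200, alpha=0.05)
print(round(lower, 3), round(upper, 3))
```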

2.1 Confidence Intervals for Difference in Means

A practical problem of interest is the comparison of two distributions, that is, comparing the distributions of two random variables, say X and Y. In this section, we compare the means of X and Y. Denote the means of X and Y by $\mu_1$ and $\mu_2$, respectively. In particular, we obtain confidence intervals for the difference $\Delta = \mu_1 - \mu_2$. Assume that the variances of X and Y are finite and denote them as $\sigma_1^2 = \mathrm{Var}(X)$ and $\sigma_2^2 = \mathrm{Var}(Y)$. Let $X_1, \ldots, X_{n_1}$ be a random sample from the distribution of X and let $Y_1, \ldots, Y_{n_2}$ be a random sample from the distribution of Y. Assume that the samples were gathered independently of one another. Let $\overline{X} = n_1^{-1}\sum_{i=1}^{n_1} X_i$ and $\overline{Y} = n_2^{-1}\sum_{i=1}^{n_2} Y_i$ be the sample means. Let $\hat{\Delta} = \overline{X} - \overline{Y}$. The statistic $\hat{\Delta}$ is an unbiased estimator of Δ. This difference, $\hat{\Delta} - \Delta$, is the numerator of the pivot random variable. By independence of the samples,

$$\mathrm{Var}(\hat{\Delta}) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.$$

Let $S_1^2 = (n_1 - 1)^{-1}\sum_{i=1}^{n_1}(X_i - \overline{X})^2$ and $S_2^2 = (n_2 - 1)^{-1}\sum_{i=1}^{n_2}(Y_i - \overline{Y})^2$ be the sample variances. Then estimating the variances by the sample variances, consider the random variable

$$Z = \frac{\hat{\Delta} - \Delta}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}. \tag{2.8}$$

By the independence of the samples and Theorem 2.1, this pivot variable has an approximate N(0, 1) distribution. This leads to the approximate (1 − α)100% confidence interval for $\Delta = \mu_1 - \mu_2$ given by

$$\left((\overline{x} - \overline{y}) - z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\ (\overline{x} - \overline{y}) + z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right), \tag{2.9}$$

where $\sqrt{(s_1^2/n_1) + (s_2^2/n_2)}$ is the standard error of $\overline{X} - \overline{Y}$. This is a large sample (1 − α)100% confidence interval for $\mu_1 - \mu_2$.
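The large sample interval (2.9) needs only the two sample means, the two sample variances, and the sample sizes. The following sketch uses hypothetical summary statistics and an illustrative function name, `difference_interval`.

```python
import math
from statistics import NormalDist

def difference_interval(xbar, ybar, s1_sq, s2_sq, n1, n2, alpha):
    """Approximate (1 - alpha)100% interval (2.9) for mu1 - mu2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)       # upper alpha/2 point of N(0, 1)
    se = math.sqrt(s1_sq / n1 + s2_sq / n2)       # standard error of xbar - ybar
    diff = xbar - ybar
    return diff - z * se, diff + z * se

# Hypothetical summary statistics, not taken from the text.
lower, upper = difference_interval(xbar=10.0, ybar=8.0, s1_sq=4.0, s2_sq=9.0,
                                   n1=100, n2=225, alpha=0.05)
print(round(lower, 3), round(upper, 3))
```

With these numbers each sample contributes 0.04 to the variance of the difference, so the standard error is $\sqrt{0.08}$.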

The above confidence interval is approximate. In this situation we can obtain exact confidence intervals if we assume that the distributions of X and Y are normal with the same variance; i.e., $\sigma_1^2 = \sigma_2^2$. Thus the distributions can differ only in location, i.e., a location model. Assume then that X is distributed $N(\mu_1, \sigma^2)$ and Y is distributed $N(\mu_2, \sigma^2)$, where $\sigma^2$ is the common variance of X and Y. As above, let $X_1, \ldots, X_{n_1}$ be a random sample from the distribution of X, let $Y_1, \ldots, Y_{n_2}$ be a random sample from the distribution of Y, and assume that the samples are independent of one another. Let $n = n_1 + n_2$ be the total sample size. Our estimator of Δ is $\overline{X} - \overline{Y}$. Our goal is to show that a pivot random variable, defined below, has a t-distribution.

Because $\overline{X}$ is distributed $N(\mu_1, \sigma^2/n_1)$, $\overline{Y}$ is distributed $N(\mu_2, \sigma^2/n_2)$, and $\overline{X}$ and $\overline{Y}$ are independent, we have the result

$$\frac{(\overline{X} - \overline{Y}) - (\mu_1 - \mu_2)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \ \text{has a } N(0, 1) \text{ distribution.} \tag{2.10}$$

This serves as the numerator of our T-statistic. Let

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}. \tag{2.11}$$

Note that $S_p^2$ is a weighted average of $S_1^2$ and $S_2^2$. It is easy to see that $S_p^2$ is an unbiased estimator of $\sigma^2$. It is called the pooled estimator of $\sigma^2$. Also, because $(n_1 - 1)S_1^2/\sigma^2$ has a $\chi^2(n_1 - 1)$ distribution, $(n_2 - 1)S_2^2/\sigma^2$ has a $\chi^2(n_2 - 1)$ distribution, and $S_1^2$ and $S_2^2$ are independent, we have that $(n - 2)S_p^2/\sigma^2$ has a $\chi^2(n - 2)$ distribution. Finally, because $S_1^2$ is independent of $\overline{X}$ and $S_2^2$ is independent of $\overline{Y}$, and the random samples are independent of each other, it follows that $S_p^2$ is independent of expression (2.10). Therefore, we have that

$$T = \frac{[(\overline{X} - \overline{Y}) - (\mu_1 - \mu_2)] \Big/ \sigma\sqrt{n_1^{-1} + n_2^{-1}}}{\sqrt{(n-2)S_p^2 / [(n-2)\sigma^2]}} = \frac{(\overline{X} - \overline{Y}) - (\mu_1 - \mu_2)}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \tag{2.12}$$

has a t-distribution with n − 2 degrees of freedom. From this last result, it is easy to see that the following interval is an exact (1 − α)100% confidence interval for $\Delta = \mu_1 - \mu_2$:

$$\left((\overline{x} - \overline{y}) - t_{\alpha/2,n-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}},\ (\overline{x} - \overline{y}) + t_{\alpha/2,n-2}\, s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right). \tag{2.13}$$

A consideration of the difficulty encountered when the unknown variances of the two normal distributions are not equal is assigned to one of the exercises.

Example 2.4. Suppose $X_1, \ldots, X_{10}$ is a random sample from a $N(\mu_1, \sigma^2)$ distribution, $Y_1, \ldots, Y_7$ is a random sample from a $N(\mu_2, \sigma^2)$ distribution, and the samples are independent. Suppose the realizations of the samples result in the sample means $\overline{x} = 4.2$ and $\overline{y} = 3.4$ and the sample variances $s_1^2 = 49$ and $s_2^2 = 32$. Then, using (2.13), a 90% confidence interval for $\mu_1 - \mu_2$ is (−4.81, 6.41).

Remark 2.1. Suppose X and Y are not normally distributed but that their distributions differ only in location. The above interval, (2.13), is then approximate and not exact.
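The arithmetic of Example 2.4 can be retraced with a short sketch of interval (2.13). The function name `pooled_t_interval` is illustrative, and the critical value $t_{0.05,15} = 1.753$ is supplied from a t-table because the Python standard library has no t quantile function.

```python
import math

def pooled_t_interval(xbar, ybar, s1_sq, s2_sq, n1, n2, t_crit):
    """Interval (2.13); t_crit is the upper alpha/2 point of t with n1+n2-2 df."""
    # Pooled estimator (2.11) of the common variance.
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    half_width = t_crit * math.sqrt(sp_sq) * math.sqrt(1 / n1 + 1 / n2)
    diff = xbar - ybar
    return diff - half_width, diff + half_width

# Summary statistics of Example 2.4; 90% confidence, 10 + 7 - 2 = 15 df.
lower, upper = pooled_t_interval(4.2, 3.4, 49, 32, 10, 7, t_crit=1.753)
print(round(lower, 2), round(upper, 2))   # -4.81 6.41
```

Here $s_p^2 = (9 \cdot 49 + 6 \cdot 32)/15 = 42.2$, so the half-width is $1.753\sqrt{42.2}\sqrt{1/10 + 1/7} \approx 5.61$ around the difference 0.8.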

2.2 Confidence Interval for Difference in Proportions

Let X and Y be two independent random variables with Bernoulli distributions $b(1, p_1)$ and $b(1, p_2)$, respectively. Let us now turn to the problem of finding a confidence interval for the difference $p_1 - p_2$. Let $X_1, \ldots, X_{n_1}$ be a random sample from the distribution of X and let $Y_1, \ldots, Y_{n_2}$ be a random sample from the distribution of Y. As above, assume that the samples are independent of one another and let $n = n_1 + n_2$ be the total sample size. Our estimator of $p_1 - p_2$ is the difference in sample proportions, which, of course, is given by $\overline{X} - \overline{Y}$. We use the traditional notation and write $\hat{p}_1$ and $\hat{p}_2$ instead of $\overline{X}$ and $\overline{Y}$, respectively. Hence, from the above discussion, an interval such as (2.9) serves as an approximate confidence interval for $p_1 - p_2$. Here, $\sigma_1^2 = p_1(1 - p_1)$ and $\sigma_2^2 = p_2(1 - p_2)$. In the interval, we estimate these by $\hat{p}_1(1 - \hat{p}_1)$ and $\hat{p}_2(1 - \hat{p}_2)$, respectively. Thus our approximate (1 − α)100% confidence interval for $p_1 - p_2$ is

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}. \tag{2.14}$$

Example 2.5. If, in the preceding discussion, we take $n_1 = 100$, $n_2 = 400$, $y_1 = 30$, and $y_2 = 80$, where $y_1$ and $y_2$ are the observed numbers of successes in the two samples, then the observed values of $Y_1/n_1 - Y_2/n_2$ and its standard error are 0.1 and $\sqrt{(0.3)(0.7)/100 + (0.2)(0.8)/400} = 0.05$, respectively. Thus the interval (0, 0.2) is an approximate 95.4% confidence interval for $p_1 - p_2$.
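The numbers in Example 2.5 can be checked with a sketch of interval (2.14). The function name `proportion_difference_interval` is illustrative; the critical value z = 2 is the one implied by the example, since Φ(2) − Φ(−2) ≈ 0.954.

```python
import math
from statistics import NormalDist

def proportion_difference_interval(y1, n1, y2, n2, z):
    """Interval (2.14) for p1 - p2, plus the standard error used."""
    p1, p2 = y1 / n1, y2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se, se

# Counts from Example 2.5.
lower, upper, se = proportion_difference_interval(30, 100, 80, 400, z=2.0)
# Confidence coefficient attached to z = 2 under the normal approximation.
confidence = NormalDist().cdf(2.0) - NormalDist().cdf(-2.0)
print(round(se, 4), round(lower, 4), round(upper, 4), round(confidence, 3))
```

Up to floating-point rounding, the standard error is 0.05 and the interval is (0, 0.2), in agreement with the example.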

EXERCISES

2.1. Let the observed value of the mean $\overline{X}$ and of the sample variance of a random sample of size 20 from a distribution that is $N(\mu, \sigma^2)$ be 81.2 and 26.5, respectively. Find respectively 90%, 95% and 99% confidence intervals for μ. Note how the lengths of the confidence intervals increase as the confidence increases.

2.2. Consider the data on the lifetimes of motors given in Exercise 1.1. Obtain a large sample confidence interval for the mean lifetime of a motor.

2.3. As in the last exercise, refer to Exercise 1.1. Using expression (4.8), obtain a confidence interval (with confidence close to 90%) for the median lifetime of a motor.

2.4. Suppose we assume that $X_1, X_2, \ldots, X_n$ is a random sample from a Γ(1, θ) distribution.
(a) Show that the random variable $(2/\theta)\sum_{i=1}^{n} X_i$ has a $\chi^2$-distribution with 2n degrees of freedom.
(b) Using the random variable in part (a) as a pivot random variable, find a (1 − α)100% confidence interval for θ.
(c) Obtain the confidence interval in part (b) for the data of Exercise 1.1 and compare it with the interval you obtained in Exercise 2.2.

2.5. In Exercise 1.2, the weights of 26 professional baseball pitchers were given. From the same data set, the weights of 33 professional baseball hitters (not pitchers) are given below. Assume that the data sets are independent of one another.

155 155 160 160 160 166 170 175 175 175 180 185 185
185 185 185 185 185 190 190 190 190 190 195 195 195
195 200 205 207 210 211 230

Use expression (2.13) to find a 95% confidence interval for the difference in mean weights between the pitchers and the hitters. Which group (on the average) appears to be heavier? Why would this be so? (The sample means and variances for the weights of the pitchers and hitters are, respectively, Pitchers 201, 305.68 and Hitters 185.4, 298.13.)

2.6. In the baseball data set discussed in the last exercise, it was found that out of the 59 baseball players, 15 were left-handed. Is this odd, since the proportion of left-handed males in America is about 11%? Answer by using (2.7) to construct a 95% approximate confidence interval for p, the proportion of left-handed baseball players.

2.7. Let $\overline{X}$ be the mean of a random sample of size n from a distribution that is N(μ, 9). Find n such that $P(\overline{X} - 1 < \mu < \overline{X} + 1) = 0.90$, approximately.

2.8. Let a random sample of size 17 from the normal distribution $N(\mu, \sigma^2)$ yield $\overline{x} = 4.7$ and $s^2 = 5.76$. Determine a 90% confidence interval for μ.

2.9. Let $\overline{X}$ denote the mean of a random sample of size n from a distribution that has mean μ and variance $\sigma^2 = 10$. Find n so that the probability is approximately 0.954 that the random interval $(\overline{X} - \tfrac{1}{2}, \overline{X} + \tfrac{1}{2})$ includes μ.

2.10. Let $X_1, X_2, \ldots, X_9$ be a random sample of size 9 from a distribution that is $N(\mu, \sigma^2)$.
(a) If σ is known, find the length of a 95% confidence interval for μ if this interval is based on the random variable $\sqrt{9}(\overline{X} - \mu)/\sigma$.
(b) If σ is unknown, find the expected value of the length of a 95% confidence interval for μ if this interval is based on the random variable $\sqrt{9}(\overline{X} - \mu)/S$.
Hint: Write $E(S) = (\sigma/\sqrt{n-1})\,E[((n-1)S^2/\sigma^2)^{1/2}]$.
(c) Compare these two answers.

2.11. Let $X_1, X_2, \ldots, X_n, X_{n+1}$ be a random sample of size n + 1, n > 1, from a distribution that is $N(\mu, \sigma^2)$. Let $\overline{X} = \sum_{1}^{n} X_i/n$ and $S^2 = \sum_{1}^{n}(X_i - \overline{X})^2/(n-1)$. Find the constant c so that the statistic $c(\overline{X} - X_{n+1})/S$ has a t-distribution. If n = 8, determine k such that $P(\overline{X} - kS < X_9 < \overline{X} + kS) = 0.80$. The observed interval $(\overline{x} - ks, \overline{x} + ks)$ is often called an 80% prediction interval for $X_9$.

2.12. Let Y be b(300, p). If the observed value of Y is y = 75, find an approximate 90% confidence interval for p.

2.13. Let $\overline{X}$ be the mean of a random sample of size n from a distribution that is $N(\mu, \sigma^2)$, where the positive variance $\sigma^2$ is known. Because Φ(2) − Φ(−2) = 0.954, find, for each μ, $c_1(\mu)$ and $c_2(\mu)$ such that $P[c_1(\mu) < \overline{X} < c_2(\mu)] = 0.954$. Note that $c_1(\mu)$ and $c_2(\mu)$ are increasing functions of μ. Solve for the respective functions $d_1(\overline{x})$ and $d_2(\overline{x})$; thus, we also have that $P[d_2(\overline{X}) < \mu < d_1(\overline{X})] = 0.954$. Compare this with the answer obtained previously in the text.

2.14. Let $\overline{X}$ denote the mean of a random sample of size 25 from a gamma-type distribution with α = 4 and β > 0. Use the Central Limit Theorem to find an approximate 0.954 confidence interval for μ, the mean of the gamma distribution.
Hint: Use the random variable $(\overline{X} - 4\beta)/(4\beta^2/25)^{1/2} = 5\overline{X}/(2\beta) - 10$.

2.15. Let $\overline{x}$ be the observed mean of a random sample of size n from a distribution having mean μ and known variance $\sigma^2$. Find n so that $\overline{x} - \sigma/4$ to $\overline{x} + \sigma/4$ is an approximate 95% confidence interval for μ.

2.16. Assume a binomial model for a certain random variable. If we desire a 90% confidence interval for p that is at most 0.02 in length, find n.
Hint: Note that $(y/n)(1 - y/n) \le (\tfrac{1}{2})(1 - \tfrac{1}{2})$.

2.17. It is known that a random variable X has a Poisson distribution with parameter μ. A sample of 200 observations from this distribution has a mean equal to 3.4. Construct an approximate 90% confidence interval for μ.

2.18. Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$, where both parameters μ and $\sigma^2$ are unknown. A confidence interval for $\sigma^2$ can be found as follows. We know that $(n-1)S^2/\sigma^2$ is a random variable with a $\chi^2(n-1)$ distribution. Thus we can find constants a and b so that $P((n-1)S^2/\sigma^2 < b) = 0.975$ and $P(a < (n-1)S^2/\sigma^2 < b) = 0.95$.
(a) Show that this second probability statement can be written as $P((n-1)S^2/b < \sigma^2 < (n-1)S^2/a) = 0.95$.
(b) If n = 9 and $s^2 = 7.93$, find a 95% confidence interval for $\sigma^2$.
(c) If μ is known, how would you modify the preceding procedure for finding a confidence interval for $\sigma^2$?

2.19. Let $X_1, X_2, \ldots, X_n$ be a random sample from a gamma distribution with known parameter α = 3 and unknown β > 0. Discuss the construction of a confidence interval for β.
Hint: What is the distribution of $2\sum_{1}^{n} X_i/\beta$? Follow the procedure outlined in Exercise 2.18.

2.20. When 100 tacks were thrown on a table, 60 of them landed point up. Obtain a 95% confidence interval for the probability that a tack of this type lands point up. Assume independence.

2.21. Let two independent random samples, each of size 10, from two normal distributions $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$ yield $\overline{x} = 4.8$, $s_1^2 = 8.64$, $\overline{y} = 5.6$, $s_2^2 = 7.88$. Find a 95% confidence interval for $\mu_1 - \mu_2$.

2.22. Let two independent random variables, $Y_1$ and $Y_2$, with binomial distributions that have parameters $n_1 = n_2 = 100$, $p_1$, and $p_2$, respectively, be observed to be equal to $y_1 = 50$ and $y_2 = 40$. Determine an approximate 90% confidence interval for $p_1 - p_2$.

2.23. Discuss the problem of finding a confidence interval for the difference $\mu_1 - \mu_2$ between the two means of two normal distributions if the variances $\sigma_1^2$ and $\sigma_2^2$ are known but not necessarily equal.

2.24. Discuss Exercise 2.23 when it is assumed that the variances are unknown and unequal. This is a very difficult problem, and the discussion should point out exactly where the difficulty lies. If, however, the variances are unknown but their ratio $\sigma_1^2/\sigma_2^2$ is a known constant k, then a statistic that is a T random variable can again be used. Why?

2.25. To illustrate Exercise 2.24, let $X_1, X_2, \ldots, X_9$ and $Y_1, Y_2, \ldots, Y_{12}$ represent two independent random samples from the respective normal distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$. It is given that $\sigma_1^2 = 3\sigma_2^2$, but $\sigma_2^2$ is unknown. Define a random variable that has a t-distribution that can be used to find a 95% confidence interval for $\mu_1 - \mu_2$.

2.26. Let $\overline{X}$ and $\overline{Y}$ be the means of two independent random samples, each of size n, from the respective distributions $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, where the common variance is known. Find n such that $P(\overline{X} - \overline{Y} - \sigma/5 < \mu_1 - \mu_2 < \overline{X} - \overline{Y} + \sigma/5) = 0.90$.

2.27. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be two independent random samples from the respective normal distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, where the four parameters are unknown. To construct a confidence interval for the ratio, $\sigma_1^2/\sigma_2^2$, of the variances, form the quotient of the two independent $\chi^2$ variables, each divided by its degrees of freedom, namely,

$$F = \frac{\left[(m-1)S_2^2/\sigma_2^2\right]/(m-1)}{\left[(n-1)S_1^2/\sigma_1^2\right]/(n-1)} = \frac{S_2^2/\sigma_2^2}{S_1^2/\sigma_1^2},$$

where $S_1^2$ and $S_2^2$ are the respective sample variances.
(a) What kind of distribution does F have?
(b) From the appropriate table, a and b can be found so that P(F < b) = 0.975 and P(a < F < b) = 0.95.
(c) Rewrite the second probability statement as

$$P\left(a\frac{S_1^2}{S_2^2} < \frac{\sigma_1^2}{\sigma_2^2} < b\frac{S_1^2}{S_2^2}\right) = 0.95.$$

The observed values, $s_1^2$ and $s_2^2$, can be inserted in these inequalities to provide a 95% confidence interval for $\sigma_1^2/\sigma_2^2$.

3

Conﬁdence Intervals for Parameters of Discrete Distributions

In this section, we outline a procedure that can be used to obtain exact confidence intervals for the parameters of discrete random variables. Let $X_1, X_2, \ldots, X_n$ be a random sample on a discrete random variable $X$ with pmf $p(x; \theta)$, $\theta \in \Omega$, where $\Omega$ is an interval of real numbers. Let $T = T(X_1, X_2, \ldots, X_n)$ be an estimator of $\theta$ with cdf $F_T(t; \theta)$. Assume that $F_T(t; \theta)$ is a nonincreasing and continuous function of $\theta$ for every $t$ in the support of $T$. Let $\alpha_1 > 0$ and $\alpha_2 > 0$ be given such that $\alpha = \alpha_1 + \alpha_2 < 0.50$. Let $\underline{\theta}$ and $\overline{\theta}$ be the solutions of the equations
$$F_T(T-; \underline{\theta}) = 1 - \alpha_2 \quad\text{and}\quad F_T(T; \overline{\theta}) = \alpha_1, \tag{3.1}$$

where $T-$ is the statistic whose support lags by one value of $T$'s support. For instance, if $t_i < t_{i+1}$ are consecutive support values of $T$, then $T = t_{i+1}$ if and only if $T- = t_i$. Under these conditions, the interval $(\underline{\theta}, \overline{\theta})$ is a confidence interval for $\theta$ with confidence coefficient of at least $1 - \alpha$. We sketch a proof of this below, but, for now, we present two examples. In general, iterative algorithms are needed to solve equations (3.1). In practice, the function $F_T(T; \theta)$ is often strictly decreasing and continuous in $\theta$, so a simple algorithm often suffices. We illustrate the examples below by using the simple bisection algorithm, which we now briefly discuss.

Remark 3.1 (Bisection Algorithm). Suppose we want to solve the equation $g(x) = d$, where $g(x)$ is strictly decreasing. Assume on a given step of the algorithm that $a < b$ bracket the solution; i.e., $g(a) > d$ and $g(b) < d$. Let $c = (a + b)/2$. Then on the next step of the algorithm, the new bracket values $a$ and $b$ are determined by

    if g(c) > d then {a ← c and b ← b}
    if g(c) < d then {a ← a and b ← c}.

The algorithm continues until $|a - b| < \epsilon$, where $\epsilon > 0$ is a specified tolerance.
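The update rule of Remark 3.1 translates almost line for line into code. The following Python sketch is only an illustration under stated assumptions: the function name `bisect_decreasing`, the default tolerance, and the test function $g(x) = 10 - x^3$ are hypothetical choices and do not come from the text, which relies on R functions listed in its appendix.

```python
def bisect_decreasing(g, d, a, b, tol=1e-8, max_iter=200):
    """Solve g(x) = d for a strictly decreasing g.

    Requires a < b with g(a) > d > g(b), i.e. a and b bracket the solution.
    """
    if not (a < b and g(a) > d > g(b)):
        raise ValueError("a and b do not bracket the solution")
    for _ in range(max_iter):
        if b - a < tol:          # stopping rule |a - b| < epsilon
            break
        c = (a + b) / 2.0
        gc = g(c)
        if gc > d:               # solution lies to the right of c
            a = c
        elif gc < d:             # solution lies to the left of c
            b = c
        else:                    # exact hit
            return c
    return (a + b) / 2.0


# Hypothetical test problem: 10 - x**3 = 2 has the solution x = 2.
root = bisect_decreasing(lambda x: 10.0 - x**3, 2.0, 0.0, 5.0)
print(root)
```

Each pass halves the bracket, so roughly $\log_2((b - a)/\epsilon)$ iterations are needed.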


Example 3.1 (Confidence Interval for a Bernoulli Proportion). Let $X$ have a Bernoulli distribution with $\theta$ as the probability of success. Let $\Omega = (0, 1)$. Suppose $X_1, X_2, \ldots, X_n$ is a random sample on $X$. As our point estimator of $\theta$, we consider $\bar{X}$, which is the sample proportion of successes. The distribution of $n\bar{X}$ is binomial$(n, \theta)$. Thus
$$\begin{aligned}
F_{\bar{X}}(\bar{x}; \theta) &= P(n\bar{X} \le n\bar{x}) \\
&= \sum_{j=0}^{n\bar{x}} \binom{n}{j} \theta^j (1-\theta)^{n-j} \\
&= 1 - \sum_{j=n\bar{x}+1}^{n} \binom{n}{j} \theta^j (1-\theta)^{n-j} \\
&= 1 - \int_0^{\theta} \frac{n!}{(n\bar{x})!\,[n-(n\bar{x}+1)]!}\, z^{n\bar{x}} (1-z)^{n-(n\bar{x}+1)}\, dz,
\end{aligned} \tag{3.2}$$

where the last equality, involving the incomplete $\beta$-function, follows from Exercise 3.1. By the fundamental theorem of calculus and expression (3.2),
$$\frac{d}{d\theta} F_{\bar{X}}(\bar{x}; \theta) = -\frac{n!}{(n\bar{x})!\,[n-(n\bar{x}+1)]!}\, \theta^{n\bar{x}} (1-\theta)^{n-(n\bar{x}+1)} < 0;$$
hence, $F_{\bar{X}}(\bar{x}; \theta)$ is a strictly decreasing function of $\theta$, for each $\bar{x}$. Next, let $\alpha_1, \alpha_2 > 0$ be specified constants such that $\alpha_1 + \alpha_2 < 1/2$ and let $\underline{\theta}$ and $\overline{\theta}$ solve the equations
$$F_{\bar{X}}(\bar{X}-; \underline{\theta}) = 1 - \alpha_2 \quad\text{and}\quad F_{\bar{X}}(\bar{X}; \overline{\theta}) = \alpha_1. \tag{3.3}$$
Then $(\underline{\theta}, \overline{\theta})$ is a confidence interval for $\theta$ with confidence coefficient at least $1 - \alpha$, where $\alpha = \alpha_1 + \alpha_2$. These equations can be solved iteratively, as discussed in the following numerical illustration.

Numerical Illustration. Suppose $n = 30$, $\bar{x} = 0.60$, and $\alpha_1 = \alpha_2 = 0.05$. Because the support of the binomial consists of integers and $n\bar{x} = 18$, we can write the first equation in (3.3) as
$$\sum_{j=0}^{17} \binom{n}{j} \underline{\theta}^j (1-\underline{\theta})^{n-j} = 0.95.$$
Let bin$(n, p)$ denote a random variable with binomial distribution with parameters $n$ and $p$. Because $P(\text{bin}(30, 0.4) \le 17) = 0.9787$ and $P(\text{bin}(30, 0.45) \le 17) = 0.9286$, the values 0.4 and 0.45 bracket the solution to the first equation. We used the R command pbinom to do these computations. Using these bracket values as input to the R function binomci.r (see Appendix: R Functions), the solution to the first equation is $\underline{\theta} = 0.434$. In the same way, because $P(\text{bin}(30, 0.7) \le 18) = 0.1593$ and $P(\text{bin}(30, 0.8) \le 18) = 0.0094$, the values 0.7 and 0.8 bracket the solution to the second equation. This leads to the solution $\overline{\theta} = 0.750$. Thus the confidence interval is (0.434, 0.750), with a confidence of at least 90%. For comparison, the asymptotic 90% confidence interval of expression (2.7) is (0.453, 0.747); see Exercise 3.2.
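The numerical illustration can be reproduced with a few lines of code. The Python sketch below is a stand-in for the R function binomci.r, whose source is not shown in this section; the names `binom_cdf`, `solve_decreasing`, and `bernoulli_ci` are hypothetical, and the bracket (0, 1) is used instead of the tighter brackets quoted above, on the assumption that the cdf is strictly decreasing in $\theta$ on the whole interval.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(bin(n, p) <= k); returns 0 when k < 0."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def solve_decreasing(g, d, a, b, tol=1e-7):
    """Bisection for g(x) = d, where g is decreasing on (a, b)."""
    while b - a > tol:
        c = (a + b) / 2
        if g(c) > d:
            a = c
        else:
            b = c
    return (a + b) / 2


def bernoulli_ci(successes, n, alpha1=0.05, alpha2=0.05):
    """Solve the two equations in (3.3) for the lower and upper limits."""
    lower = solve_decreasing(
        lambda t: binom_cdf(successes - 1, n, t), 1 - alpha2, 0.0, 1.0
    )
    upper = solve_decreasing(
        lambda t: binom_cdf(successes, n, t), alpha1, 0.0, 1.0
    )
    return lower, upper


lower, upper = bernoulli_ci(18, 30)
print(round(lower, 3), round(upper, 3))  # compare with (0.434, 0.750) in the text
```

With `successes = 0` the first equation has no interior solution and the sketch returns a lower limit of essentially 0, which is the natural convention.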


Example 3.2 (Confidence Interval for the Mean of a Poisson Distribution). Let $X_1, X_2, \ldots, X_n$ be a random sample on a random variable $X$ which has a Poisson distribution with mean $\theta$. Let $\bar{X} = n^{-1}\sum_{i=1}^{n} X_i$ be our point estimator of $\theta$. As with the Bernoulli confidence interval in the last example, we can work with $n\bar{X}$, which, in this case, has a Poisson distribution with mean $n\theta$. The cdf of $\bar{X}$ is
$$\begin{aligned}
F_{\bar{X}}(\bar{x}; \theta) &= \sum_{j=0}^{n\bar{x}} e^{-n\theta} \frac{(n\theta)^j}{j!} \\
&= \frac{1}{\Gamma(n\bar{x}+1)} \int_{n\theta}^{\infty} x^{n\bar{x}} e^{-x}\, dx,
\end{aligned} \tag{3.4}$$

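where the integral equation is obtained in Exercise 3.4. From expression (3.4), we immediately have
$$\frac{d}{d\theta} F_{\bar{X}}(\bar{x}; \theta) = \frac{-n}{\Gamma(n\bar{x}+1)} (n\theta)^{n\bar{x}} e^{-n\theta} < 0.$$
Therefore, $F_{\bar{X}}(\bar{x}; \theta)$ is a strictly decreasing function of $\theta$ for every fixed $\bar{x}$. Hence, as discussed above, for $\alpha_1, \alpha_2 > 0$ such that $\alpha_1 + \alpha_2 < 1/2$, the confidence interval is given by $(\underline{\theta}, \overline{\theta})$, where
$$\sum_{j=0}^{n\bar{X}-1} e^{-n\underline{\theta}} \frac{(n\underline{\theta})^j}{j!} = 1 - \alpha_2 \tag{3.5}$$
$$\sum_{j=0}^{n\bar{X}} e^{-n\overline{\theta}} \frac{(n\overline{\theta})^j}{j!} = \alpha_1. \tag{3.6}$$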

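The confidence coefficient of the interval $(\underline{\theta}, \overline{\theta})$ is at least $1 - \alpha = 1 - (\alpha_1 + \alpha_2)$. As with the Bernoulli proportion, these equations can be solved iteratively.

Numerical Illustration. Suppose $n = 25$ and the realized value of $\bar{X}$ is $\bar{x} = 5$; hence, $n\bar{x} = 125$. We select $\alpha_1 = \alpha_2 = 0.05$. Then, by (3.5) and (3.6), our confidence interval solves the equations
$$\sum_{j=0}^{124} e^{-n\underline{\theta}} \frac{(n\underline{\theta})^j}{j!} = 0.95 \tag{3.7}$$
$$\sum_{j=0}^{125} e^{-n\overline{\theta}} \frac{(n\overline{\theta})^j}{j!} = 0.05. \tag{3.8}$$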
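Equations (3.7) and (3.8) have no closed-form solution, but each left side is strictly decreasing in $\theta$, so bisection applies. The Python sketch below is a hedged stand-in for the R function poissonci.r, whose source is not reproduced in this section; the names `poisson_cdf` and `poisson_ci` and the bracket-expansion step are assumptions made for illustration.

```python
from math import exp


def poisson_cdf(k, mean):
    """P(Poisson(mean) <= k), summed term by term; returns 0 when k < 0."""
    if k < 0:
        return 0.0
    term = exp(-mean)            # j = 0 term
    total = term
    for j in range(1, k + 1):
        term *= mean / j         # e^{-mean} mean^j / j!
        total += term
    return total


def poisson_ci(total_count, n, alpha1=0.05, alpha2=0.05, tol=1e-7):
    """Solve (3.5) and (3.6), where total_count is the observed n * xbar."""

    def solve(k, target):
        lo, hi = 0.0, 1.0
        while poisson_cdf(k, n * hi) > target:   # widen until hi is past the root
            hi *= 2.0
        while hi - lo > tol:                     # bisection on a decreasing function
            mid = (lo + hi) / 2.0
            if poisson_cdf(k, n * mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    lower = solve(total_count - 1, 1 - alpha2) if total_count > 0 else 0.0
    upper = solve(total_count, alpha1)
    return lower, upper


lower, upper = poisson_ci(125, 25)
print(round(lower, 3), round(upper, 3))  # compare with (4.287, 5.8) in the text
```

Summing the pmf term by term avoids computing large factorials directly; for very large means a log-scale computation would be more robust.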

As with the Bernoulli confidence interval, we use the simple bisection algorithm to solve these equations. Let poi$(x, \theta)$ denote the probability that a Poisson random variable with mean $\theta$ is less than or equal to $x$. Using a simple computer package, we have poi$(124, 25 \cdot 4) = 0.9912$ and poi$(124, 25 \cdot 4.4) = 0.9145$. Thus, $\theta = 4$ and $\theta = 4.4$ bracket the solution to the first equation. The simple R function poissonci.r given in Appendix: R Functions returns the solution $\underline{\theta} = 4.287$. For the second equation, poi$(125, 25 \cdot 5.5) = 0.1528$ and poi$(125, 25 \cdot 6) = 0.0204$. Using the bracket values 5.5 and 6, poissonci.r obtains the solution $\overline{\theta} = 5.8$. So the confidence interval is (4.287, 5.8), with confidence at least 90%. Note that the confidence interval is right-skewed, similar to the Poisson distribution.

A brief sketch of the theory behind this confidence interval follows. Consider the general setup in the first paragraph of this section, where $T$ is an estimator of the unknown parameter $\theta$ and $F_T(t; \theta)$ is the cdf of $T$. Define
$$\overline{\theta} = \sup\{\theta : F_T(T; \theta) \ge \alpha_1\} \tag{3.9}$$
$$\underline{\theta} = \inf\{\theta : F_T(T-; \theta) \le 1 - \alpha_2\}. \tag{3.10}$$
Hence, we have
$$\theta > \overline{\theta} \;\Rightarrow\; F_T(T; \theta) \le \alpha_1, \qquad \theta < \underline{\theta} \;\Rightarrow\; F_T(T-; \theta) \ge 1 - \alpha_2.$$
These implications lead to
$$\begin{aligned}
P[\underline{\theta} < \theta < \overline{\theta}] &= 1 - P[\{\theta < \underline{\theta}\} \cup \{\theta > \overline{\theta}\}] \\
&= 1 - P[\theta < \underline{\theta}] - P[\theta > \overline{\theta}] \\
&\ge 1 - P[F_T(T-; \theta) \ge 1 - \alpha_2] - P[F_T(T; \theta) \le \alpha_1] \\
&\ge 1 - \alpha_1 - \alpha_2,
\end{aligned}$$

where the last inequality is evident from equations (3.9) and (3.10). A rigorous proof can be based on Exercise 8.13; see page 425 of Shao (1998) for details.

EXERCISES

3.1. Show that
$$\int_0^p \frac{n!}{(k-1)!\,(n-k)!}\, z^{k-1} (1-z)^{n-k}\, dz = \sum_{w=k}^{n} \binom{n}{w} p^w (1-p)^{n-w},$$
where $0 < p < 1$, and $k$ and $n$ are positive integers such that $k \le n$.

3.2. In Example 3.1, verify the result for the asymptotic confidence interval for $\theta$.

3.3. Suppose $X_1, X_2, \ldots, X_{10}$ is a random sample on a random variable $X$ which has a Poisson distribution with mean $\theta$. Say the realized value of the sample mean is 0.5; i.e., $n\bar{x} = 5$. Suppose we want to compute the confidence interval $(\underline{\theta}, \overline{\theta})$ as determined by equations (3.5) and (3.6). Using Table I in Appendix: Tables of Distributions, show that 0.2 and 0.3 bracket $\underline{\theta}$ and that 0.9 and 1.0 bracket $\overline{\theta}$. If R is available, use the R function poissonci.r to compute the solutions to the equations.

3.4. This exercise obtains a useful identity for the cdf of a Poisson distribution.

(a) Show that this identity is true:
$$\frac{\lambda^n}{\Gamma(n)} \int_1^{\infty} x^{n-1} e^{-x\lambda}\, dx = \sum_{j=0}^{n-1} e^{-\lambda} \frac{\lambda^j}{j!},$$
for $\lambda > 0$ and $n$ a positive integer. Hint: Just consider a Poisson process on the unit interval with mean $\lambda$. Let $W_n$ be the waiting time until the $n$th event. Then the left side is $P(W_n > 1)$. Why?

(b) Obtain the identity used in Example 3.2, by making the transformation $z = \lambda x$ in the above integral.

4

Order Statistics

In this section the notion of an order statistic is defined and some of its simple properties are investigated. These statistics have in recent times come to play an important role in statistical inference partly because some of their properties do not depend upon the distribution from which the random sample is obtained. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution of the continuous type having a pdf $f(x)$ that has support $S = (a, b)$, where $-\infty \le a < b \le \infty$. Let $Y_1$ be the smallest of these $X_i$, $Y_2$ the next $X_i$ in order of magnitude, . . . , and $Y_n$ the largest of $X_i$. That is, $Y_1 < Y_2 < \cdots < Y_n$ represent $X_1, X_2, \ldots, X_n$ when the latter are arranged in ascending order of magnitude. We call $Y_i$, $i = 1, 2, \ldots, n$, the $i$th order statistic of the random sample $X_1, X_2, \ldots, X_n$. Then the joint pdf of $Y_1, Y_2, \ldots, Y_n$ is given in the following theorem.

Theorem 4.1. Using the above notation, let $Y_1 < Y_2 < \cdots < Y_n$ denote the $n$ order statistics based on the random sample $X_1, X_2, \ldots, X_n$ from a continuous distribution with pdf $f(x)$ and support $(a, b)$. Then the joint pdf of $Y_1, Y_2, \ldots, Y_n$ is given by
$$g(y_1, y_2, \ldots, y_n) = \begin{cases} n!\, f(y_1) f(y_2) \cdots f(y_n) & a < y_1 < y_2 < \cdots < y_n < b \\ 0 & \text{elsewhere.} \end{cases} \tag{4.1}$$

Proof: Note that the support of $X_1, X_2, \ldots, X_n$ can be partitioned into $n!$ mutually disjoint sets which map onto the support of $Y_1, Y_2, \ldots, Y_n$, namely, $\{(y_1, y_2, \ldots, y_n) : a < y_1 < y_2 < \cdots < y_n < b\}$. One of these $n!$ sets is $a < x_1 < x_2 < \cdots < x_n < b$, and the others can be found by permuting the $n$ $x$s in all possible ways. The transformation associated with the one listed is $x_1 = y_1, x_2 = y_2, \ldots, x_n = y_n$, which has a Jacobian equal to 1. The Jacobian of each of the other transformations is $\pm 1$. Thus
$$\begin{aligned}
g(y_1, y_2, \ldots, y_n) &= \sum_{i=1}^{n!} |J_i|\, f(y_1) f(y_2) \cdots f(y_n) \\
&= \begin{cases} n!\, f(y_1) f(y_2) \cdots f(y_n) & a < y_1 < y_2 < \cdots < y_n < b \\ 0 & \text{elsewhere,} \end{cases}
\end{aligned}$$
as was to be proved.

Example 4.1. Let $X$ denote a random variable of the continuous type with a pdf $f(x)$ that is positive and continuous, with support $S = (a, b)$, $-\infty \le a < b \le \infty$. The distribution function $F(x)$ of $X$ may be written
$$F(x) = \int_a^x f(w)\, dw, \quad a < x < b.$$

If $x \le a$, $F(x) = 0$; and if $b \le x$, $F(x) = 1$. Thus there is a unique median $m$ of the distribution with $F(m) = \frac{1}{2}$. Let $X_1, X_2, X_3$ denote a random sample from this distribution and let $Y_1 < Y_2 < Y_3$ denote the order statistics of the sample. Note that $Y_2$ is the sample median. We compute the probability that $Y_2 \le m$. The joint pdf of the three order statistics is
$$g(y_1, y_2, y_3) = \begin{cases} 6 f(y_1) f(y_2) f(y_3) & a < y_1 < y_2 < y_3 < b \\ 0 & \text{elsewhere.} \end{cases}$$
The pdf of $Y_2$ is then
$$\begin{aligned}
h(y_2) &= 6 f(y_2) \int_{y_2}^{b} \int_a^{y_2} f(y_1) f(y_3)\, dy_1\, dy_3 \\
&= \begin{cases} 6 f(y_2) F(y_2) [1 - F(y_2)] & a < y_2 < b \\ 0 & \text{elsewhere.} \end{cases}
\end{aligned}$$
Accordingly,
$$\begin{aligned}
P(Y_2 \le m) &= 6 \int_a^m \{F(y_2) f(y_2) - [F(y_2)]^2 f(y_2)\}\, dy_2 \\
&= 6 \left\{ \frac{[F(y_2)]^2}{2} - \frac{[F(y_2)]^3}{3} \right\}_a^m = \frac{1}{2}.
\end{aligned}$$

Hence, for this situation, the median of the sample median $Y_2$ is the population median $m$.

Once it is observed that
$$\int_a^x [F(w)]^{\alpha-1} f(w)\, dw = \frac{[F(x)]^{\alpha}}{\alpha}, \quad \alpha > 0,$$
and that
$$\int_y^b [1 - F(w)]^{\beta-1} f(w)\, dw = \frac{[1 - F(y)]^{\beta}}{\beta}, \quad \beta > 0,$$
it is easy to express the marginal pdf of any order statistic, say $Y_k$, in terms of $F(x)$ and $f(x)$. This is done by evaluating the integral
$$g_k(y_k) = \int_a^{y_k} \cdots \int_a^{y_2} \int_{y_k}^{b} \cdots \int_{y_{n-1}}^{b} n!\, f(y_1) f(y_2) \cdots f(y_n)\, dy_n \cdots dy_{k+1}\, dy_1 \cdots dy_{k-1}.$$
The result is
$$g_k(y_k) = \begin{cases} \dfrac{n!}{(k-1)!\,(n-k)!} [F(y_k)]^{k-1} [1 - F(y_k)]^{n-k} f(y_k) & a < y_k < b \\ 0 & \text{elsewhere.} \end{cases} \tag{4.2}$$

Example 4.2. Let $Y_1 < Y_2 < Y_3 < Y_4$ denote the order statistics of a random sample of size 4 from a distribution having pdf
$$f(x) = \begin{cases} 2x & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}$$
We express the pdf of $Y_3$ in terms of $f(x)$ and $F(x)$ and then compute $P(\frac{1}{2} < Y_3)$. Here $F(x) = x^2$, provided that $0 < x < 1$, so that
$$g_3(y_3) = \begin{cases} \dfrac{4!}{2!\,1!} (y_3^2)^2 (1 - y_3^2)(2 y_3) & 0 < y_3 < 1 \\ 0 & \text{elsewhere.} \end{cases}$$
Thus
$$\begin{aligned}
P(\tfrac{1}{2} < Y_3) &= \int_{1/2}^{\infty} g_3(y_3)\, dy_3 \\
&= \int_{1/2}^{1} 24 (y_3^5 - y_3^7)\, dy_3 = \frac{243}{256}.
\end{aligned}$$
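The value 243/256 in Example 4.2 can be checked in two independent ways. The Python sketch below integrates the marginal pdf (4.2) numerically and then cross-checks with a binomial count, using the fact that $Y_3 > \frac{1}{2}$ exactly when at most two of the four observations are at most $\frac{1}{2}$. The function names `order_stat_pdf` and `simpson` are hypothetical, and Simpson's rule is an arbitrary choice of quadrature, not something the text prescribes.

```python
from fractions import Fraction
from math import comb, factorial


def order_stat_pdf(y, k, n, F, f):
    """Marginal pdf g_k(y) of the k-th order statistic, formula (4.2), for y in the support."""
    coef = factorial(n) // (factorial(k - 1) * factorial(n - k))
    return coef * F(y) ** (k - 1) * (1 - F(y)) ** (n - k) * f(y)


def simpson(h, a, b, m=1000):
    """Composite Simpson's rule with m (even) subintervals."""
    step = (b - a) / m
    total = h(a) + h(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * h(a + i * step)
    return total * step / 3


def F(x):
    return x * x        # cdf of the Example 4.2 distribution on (0, 1)


def f(x):
    return 2 * x        # pdf of the Example 4.2 distribution on (0, 1)


# Route 1: integrate g_3 over (1/2, 1).
prob_integral = simpson(lambda y: order_stat_pdf(y, 3, 4, F, f), 0.5, 1.0)

# Route 2: at most 2 of the 4 observations fall at or below 1/2, where F(1/2) = 1/4.
p = Fraction(1, 4)
prob_binomial = sum(comb(4, j) * p**j * (1 - p) ** (4 - j) for j in range(3))

print(prob_integral, prob_binomial)
```

The binomial route uses exact rational arithmetic, so it returns the fraction itself rather than a floating-point approximation.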

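Finally, the joint pdf of any two order statistics, say $Y_i < Y_j$, is easily expressed in terms of $F(x)$ and $f(x)$. We have
$$g_{ij}(y_i, y_j) = \int_a^{y_i} \cdots \int_a^{y_2} \int_{y_i}^{y_j} \cdots \int_{y_{j-2}}^{y_j} \int_{y_j}^{b} \cdots \int_{y_{n-1}}^{b} n!\, f(y_1) \times \cdots \times f(y_n)\, dy_n \cdots dy_{j+1}\, dy_{j-1} \cdots dy_{i+1}\, dy_1 \cdots dy_{i-1}.$$
Since, for $\gamma > 0$,
$$\begin{aligned}
\int_x^y [F(y) - F(w)]^{\gamma-1} f(w)\, dw &= -\frac{[F(y) - F(w)]^{\gamma}}{\gamma} \bigg|_x^y \\
&= \frac{[F(y) - F(x)]^{\gamma}}{\gamma},
\end{aligned}$$
it is found that
$$g_{ij}(y_i, y_j) = \begin{cases} \dfrac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!} [F(y_i)]^{i-1} [F(y_j) - F(y_i)]^{j-i-1} [1 - F(y_j)]^{n-j} f(y_i) f(y_j) & a < y_i < y_j < b \\ 0 & \text{elsewhere.} \end{cases} \tag{4.3}$$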

Remark 4.1 (Heuristic Derivation). There is an easy method of remembering the pdf of a vector of order statistics such as the one given in formula (4.3). The probability $P(y_i < Y_i < y_i + \Delta_i,\ y_j < Y_j < y_j + \Delta_j)$, where $\Delta_i$ and $\Delta_j$ are small, can be approximated by the following multinomial probability. In $n$ independent trials, $i - 1$ outcomes must be less than $y_i$ [an event that has probability $p_1 = F(y_i)$ on each trial]; $j - i - 1$ outcomes must be between $y_i + \Delta_i$ and $y_j$ [an event with approximate probability $p_2 = F(y_j) - F(y_i)$ on each trial]; $n - j$ outcomes must be greater than $y_j + \Delta_j$ [an event with approximate probability $p_3 = 1 - F(y_j)$ on each trial]; one outcome must be between $y_i$ and $y_i + \Delta_i$ [an event with approximate probability $p_4 = f(y_i)\Delta_i$ on each trial]; and, finally, one outcome must be between $y_j$ and $y_j + \Delta_j$ [an event with approximate probability $p_5 = f(y_j)\Delta_j$ on each trial]. This multinomial probability is
$$\frac{n!}{(i-1)!\,(j-i-1)!\,(n-j)!\,1!\,1!}\, p_1^{i-1} p_2^{j-i-1} p_3^{n-j} p_4 p_5,$$
which is $g_{i,j}(y_i, y_j)\Delta_i \Delta_j$, where $g_{i,j}(y_i, y_j)$ is given in expression (4.3).

Certain functions of the order statistics $Y_1, Y_2, \ldots, Y_n$ are important statistics themselves. A few of these are (a) $Y_n - Y_1$, which is called the range of the random sample; (b) $(Y_1 + Y_n)/2$, which is called the midrange of the random sample; and (c) if $n$ is odd, $Y_{(n+1)/2}$, which is called the median of the random sample.

Example 4.3. Let Y1 , Y2 , Y3 be the order statistics of a random sample of size 3 from a distribution having pdf 1 0 0, provided that x ≥ 0, and f (x) = 0 elsewhere. Show that the independence of $Z_1 = Y_1$ and $Z_2 = Y_2 - Y_1$ characterizes the gamma pdf $f(x)$, which has parameters $\alpha = 1$ and $\beta > 0$. That is, show that $Z_1$ and $Z_2$ are independent if and only if $f(x)$ is the pdf of a $\Gamma(1, \beta)$ distribution. Hint: Use the change-of-variable technique to find the joint pdf of $Z_1$ and $Z_2$ from that of $Y_1$ and $Y_2$. Accept the fact that the functional equation $h(0)h(x + y) \equiv h(x)h(y)$ has the solution $h(x) = c_1 e^{c_2 x}$, where $c_1$ and $c_2$ are constants.

4.17.
Let Y1 < Y2 < Y3 < Y4 be the order statistics of a random sample of size n = 4 from a distribution with pdf f (x) = 2x, 0 < x < 1, zero elsewhere. (a) Find the joint pdf of Y3 and Y4 . (b) Find the conditional pdf of Y3 , given Y4 = y4 . (c) Evaluate E(Y3 |y4 ). 4.18. Two numbers are selected at random from the interval (0, 1). If these values are uniformly and independently distributed, by cutting the interval at these numbers, compute the probability that the three resulting line segments can form a triangle.


4.19. Let $X$ and $Y$ denote independent random variables with respective probability density functions $f(x) = 2x$, $0 < x < 1$, zero elsewhere, and $g(y) = 3y^2$, $0 < y < 1$, zero elsewhere. Let $U = \min(X, Y)$ and $V = \max(X, Y)$. Find the joint pdf of $U$ and $V$. Hint: Here the two inverse transformations are given by $x = u$, $y = v$ and $x = v$, $y = u$.

4.20. Let the joint pdf of $X$ and $Y$ be $f(x, y) = \frac{12}{7} x(x + y)$, $0 < x < 1$, $0 < y < 1$, zero elsewhere. Let $U = \min(X, Y)$ and $V = \max(X, Y)$. Find the joint pdf of $U$ and $V$.

4.21. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution of either type. A measure of spread is Gini's mean difference
$$G = \sum_{j=2}^{n} \sum_{i=1}^{j-1} |X_i - X_j| \Big/ \binom{n}{2}. \tag{4.11}$$
(a) If $n = 10$, find $a_1, a_2, \ldots, a_{10}$ so that $G = \sum_{i=1}^{10} a_i Y_i$, where $Y_1, Y_2, \ldots, Y_{10}$ are the order statistics of the sample.

(b) Show that $E(G) = 2\sigma/\sqrt{\pi}$ if the sample arises from the normal distribution $N(\mu, \sigma^2)$.

4.22. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size $n$ from the exponential distribution with pdf $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere.

(a) Show that $Z_1 = nY_1$, $Z_2 = (n-1)(Y_2 - Y_1)$, $Z_3 = (n-2)(Y_3 - Y_2)$, . . . , $Z_n = Y_n - Y_{n-1}$ are independent and that each $Z_i$ has the exponential distribution.

(b) Demonstrate that all linear functions of $Y_1, Y_2, \ldots, Y_n$, such as $\sum_{i=1}^{n} a_i Y_i$, can be expressed as linear functions of independent random variables.

4.23. In the Program Evaluation and Review Technique (PERT), we are interested in the total time to complete a project that is comprised of a large number of subprojects. For illustration, let $X_1, X_2, X_3$ be three independent random times for three subprojects. If these subprojects are in series (the first one must be completed before the second starts, etc.), then we are interested in the sum $Y = X_1 + X_2 + X_3$. If these are in parallel (can be worked on simultaneously), then we are interested in $Z = \max(X_1, X_2, X_3)$. In the case each of these random variables has the uniform distribution with pdf $f(x) = 1$, $0 < x < 1$, zero elsewhere, find (a) the pdf of $Y$ and (b) the pdf of $Z$.

4.24. Let $Y_n$ denote the $n$th order statistic of a random sample of size $n$ from a distribution of the continuous type. Find the smallest value of $n$ for which the inequality $P(\xi_{0.9} < Y_n) \ge 0.75$ is true.

4.25. Let $Y_1 < Y_2 < Y_3 < Y_4 < Y_5$ denote the order statistics of a random sample of size 5 from a distribution of the continuous type. Compute:


(a) $P(Y_1 < \xi_{0.5} < Y_5)$. (b) $P(Y_1 < \xi_{0.25} < Y_3)$. (c) $P(Y_4 < \xi_{0.80} < Y_5)$.

4.26. Compute $P(Y_3 < \xi_{0.5} < Y_7)$ if $Y_1 < \cdots < Y_9$ are the order statistics of a random sample of size 9 from a distribution of the continuous type.

4.27. Find the smallest value of $n$ for which $P(Y_1 < \xi_{0.5} < Y_n) \ge 0.99$, where $Y_1 < \cdots < Y_n$ are the order statistics of a random sample of size $n$ from a distribution of the continuous type.

4.28. Let $Y_1 < Y_2$ denote the order statistics of a random sample of size 2 from a distribution that is $N(\mu, \sigma^2)$, where $\sigma^2$ is known.

(a) Show that $P(Y_1 < \mu < Y_2) = \frac{1}{2}$ and compute the expected value of the random length $Y_2 - Y_1$.

(b) If $\bar{X}$ is the mean of this sample, find the constant $c$ that solves the equation $P(\bar{X} - c\sigma < \mu < \bar{X} + c\sigma) = \frac{1}{2}$, and compare the length of this random interval with the expected value of that of part (a).

4.29. Let $y_1 < y_2 < y_3$ be the observed values of the order statistics of a random sample of size $n = 3$ from a continuous type distribution. Without knowing these values, a statistician is given these values in a random order, and she wants to select the largest; but once she refuses an observation, she cannot go back. Clearly, if she selects the first one, her probability of getting the largest is 1/3. Instead, she decides to use the following algorithm: She looks at the first but refuses it and then takes the second if it is larger than the first, or else she takes the third. Show that this algorithm has probability of 1/2 of selecting the largest.

4.30. Refer to Exercise 1.1. Using expression (4.8), obtain a confidence interval (with confidence close to 90%) for the median lifetime of a motor. What does the interval mean?

4.31. Let $Y_1 < Y_2 < \cdots < Y_n$ denote the order statistics of a random sample of size $n$ from a distribution that has pdf $f(x) = 3x^2/\theta^3$, $0 < x < \theta$, zero elsewhere.

(a) Show that $P(c < Y_n/\theta < 1) = 1 - c^{3n}$, where $0 < c < 1$.

(b) If $n$ is 4 and if the observed value of $Y_4$ is 2.3, what is a 95% confidence interval for $\theta$?

4.32. In Exercises 1.2 and 2.5 samples of the weights of professional baseball pitchers and hitters are displayed. Obtain comparison (on the same real line) boxplots of the two data sets. Comment on the plots. In particular, how similar are the interquartile ranges?


5

Introduction to Hypothesis Testing

Point estimation and conﬁdence intervals are useful statistical inference procedures. Another type of inference that is frequently used concerns tests of hypotheses. As in Sections 1 through 3, suppose our interest centers on a random variable X that has density function f (x; θ), where θ ∈ Ω. Suppose we think, due to theory or a preliminary experiment, that θ ∈ ω0 or θ ∈ ω1 , where ω0 and ω1 are disjoint subsets of Ω and ω0 ∪ ω1 = Ω. We label these hypotheses as H0 : θ ∈ ω0 versus H1 : θ ∈ ω1 .

(5.1)

The hypothesis H0 is referred to as the null hypothesis, while H1 is referred to as the alternative hypothesis. Often the null hypothesis represents no change or no diﬀerence from the past, while the alternative represents change or diﬀerence. The alternative is often referred to as the research worker’s hypothesis. The decision rule to take H0 or H1 is based on a sample X1 , . . . , Xn from the distribution of X and, hence, the decision could be wrong. For instance, we could decide that θ ∈ ω1 when really θ ∈ ω0 or we could decide that θ ∈ ω0 when, in fact, θ ∈ ω1 . We label these errors Type I and Type II errors, respectively, later in this section. A careful analysis of these errors can lead in certain situations to optimal decision rules. In this section, though, we simply want to introduce the elements of hypothesis testing. To set ideas, consider the following example. Example 5.1 (Zea mays Data). In 1878 Charles Darwin recorded some data on the heights of Zea mays plants to determine what eﬀect cross-fertilization or selffertilization had on the height of Zea mays. The experiment was to select one cross-fertilized plant and one self-fertilized plant, grow them in the same pot, and then later measure their heights. An interesting hypothesis for this example would be that the cross-fertilized plants are generally taller than the self-fertilized plants. This is the alternative hypothesis, i.e., the research worker’s hypothesis. The null hypothesis is that the plants generally grow to the same height regardless of whether they were self- or cross-fertilized. Data for 15 pots were recorded. We represent the data as (Y1 , Z1 ), . . . , (Y15 , Z15 ), where Yi and Zi are the heights of the cross-fertilized and self-fertilized plants, respectively, in the ith pot. Let Xi = Yi − Zi . 
Due to growing in the same pot, $Y_i$ and $Z_i$ may be dependent random variables, but it seems appropriate to assume independence between pots, i.e., independence between the paired random vectors. So we assume that $X_1, \ldots, X_{15}$ form a random sample. As a tentative model, consider
$$X_i = \mu + e_i, \quad i = 1, \ldots, 15,$$
where the random variables $e_i$ are iid with continuous density $f(x)$. For this model, there is no loss in generality in assuming that the mean of $e_i$ is 0, for, otherwise, we can simply redefine $\mu$. Hence, $E(X_i) = \mu$. Further, the density of $X_i$ is $f_X(x; \mu) = f(x - \mu)$. In practice, the goodness of the model is always a concern and diagnostics based on the data would be run to confirm the quality of the model. If $\mu = E(X_i) = 0$, then $E(Y_i) = E(Z_i)$; i.e., on average, the cross-fertilized plants grow to the same height as the self-fertilized plants. If, instead, $\mu > 0$, then $E(Y_i) > E(Z_i)$; i.e., on average the cross-fertilized plants are taller than the self-fertilized plants. Under this model, our hypotheses are
$$H_0: \mu = 0 \text{ versus } H_1: \mu > 0. \tag{5.2}$$

Table 5.1: 2 × 2 Decision Table for a Hypothesis Test

                      True State of Nature
Decision        H0 is True          H1 is True
Reject H0       Type I Error        Correct Decision
Accept H0       Correct Decision    Type II Error

Hence, $\omega_0 = \{0\}$ represents no difference in the treatments, while $\omega_1 = (0, \infty)$ represents that the mean height of cross-fertilized Zea mays exceeds the mean height of self-fertilized Zea mays.

To complete the testing structure for the general problem described at the beginning of this section, we need to discuss decision rules. Recall that $X_1, \ldots, X_n$ is a random sample from the distribution of a random variable $X$ which has density $f(x; \theta)$, where $\theta \in \Omega$. Consider testing the hypotheses $H_0: \theta \in \omega_0$ versus $H_1: \theta \in \omega_1$, where $\omega_0 \cup \omega_1 = \Omega$. Denote the space of the sample by $\mathcal{D}$; that is, $\mathcal{D} = \text{space}\{(X_1, \ldots, X_n)\}$. A test of $H_0$ versus $H_1$ is based on a subset $C$ of $\mathcal{D}$. This set $C$ is called the critical region and its corresponding decision rule (test) is
$$\begin{array}{ll}
\text{Reject } H_0 \text{ (Accept } H_1) & \text{if } (X_1, \ldots, X_n) \in C \\
\text{Retain } H_0 \text{ (Reject } H_1) & \text{if } (X_1, \ldots, X_n) \in C^c.
\end{array} \tag{5.3}$$

For a given critical region, the 2 × 2 decision table shown in Table 5.1 summarizes the results of the hypothesis test in terms of the true state of nature. Besides the correct decisions, two errors can occur. A Type I error occurs if $H_0$ is rejected when it is true, while a Type II error occurs if $H_0$ is accepted when $H_1$ is true. The goal, of course, is to select a critical region from all possible critical regions which minimizes the probabilities of these errors. In general, this is not possible. The probabilities of these errors often have a seesaw effect. This can be seen immediately in an extreme case. Simply let $C = \emptyset$. With this critical region, we would never reject $H_0$, so the probability of Type I error would be 0, but the probability of Type II error is 1. Often we consider Type I error to be the worse of the two errors. We then proceed by selecting critical regions which bound the probability of Type I error and then among these critical regions we try to select one which minimizes the probability of Type II error.

Definition 5.1. We say a critical region $C$ is of size $\alpha$ if
$$\alpha = \max_{\theta \in \omega_0} P_\theta[(X_1, \ldots, X_n) \in C]. \tag{5.4}$$


Over all critical regions of size $\alpha$, we want to consider critical regions which have lower probabilities of Type II error. We also can look at the complement of a Type II error, namely, rejecting $H_0$ when $H_1$ is true, which is a correct decision, as marked in Table 5.1. Since this is a correct decision, we want its probability to be as large as possible. That is, for $\theta \in \omega_1$, we want to maximize
$$1 - P_\theta[\text{Type II Error}] = P_\theta[(X_1, \ldots, X_n) \in C].$$
The probability on the right side of this equation is called the power of the test at $\theta$. It is the probability that the test detects the alternative $\theta$ when $\theta \in \omega_1$ is the true parameter. So minimizing the probability of Type II error is equivalent to maximizing power. We define the power function of a critical region to be
$$\gamma_C(\theta) = P_\theta[(X_1, \ldots, X_n) \in C]; \quad \theta \in \omega_1. \tag{5.5}$$

Hence, given two critical regions $C_1$ and $C_2$, which are both of size $\alpha$, $C_1$ is better than $C_2$ if $\gamma_{C_1}(\theta) \ge \gamma_{C_2}(\theta)$ for all $\theta \in \omega_1$. In this section, we want to illustrate these concepts of hypothesis testing with several examples.

Example 5.2 (Test for a Binomial Proportion of Success). Let $X$ be a Bernoulli random variable with probability of success $p$. Suppose we want to test, at size $\alpha$,
$$H_0: p = p_0 \text{ versus } H_1: p < p_0, \tag{5.6}$$
where $p_0$ is specified. As an illustration, suppose "success" is dying from a certain disease and $p_0$ is the probability of dying with some standard treatment. A new treatment is used on several (randomly chosen) patients, and it is hoped that the probability of dying under this new treatment is less than $p_0$. Let $X_1, \ldots, X_n$ be a random sample from the distribution of $X$ and let $S = \sum_{i=1}^{n} X_i$ be the total number of successes in the sample. An intuitive decision rule (critical region) is
$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } S \le k, \tag{5.7}$$
where $k$ is such that $\alpha = P_{H_0}[S \le k]$. Since $S$ has a $b(n, p_0)$ distribution under $H_0$, $k$ is determined by $\alpha = P_{p_0}[S \le k]$. Because the binomial distribution is discrete, however, it is likely that there is no integer $k$ which solves this equation. For example, suppose $n = 20$, $p_0 = 0.7$, and $\alpha = 0.15$. Then under $H_0$, $S$ has a binomial $b(20, 0.7)$ distribution. Hence, computationally, $P_{H_0}[S \le 11] = 0.1133$ and $P_{H_0}[S \le 12] = 0.2277$. Hence, erring on the conservative side, we would probably choose $k$ to be 11 and $\alpha = 0.1133$. As $n$ increases, this is less of a problem; see, also, the later discussion on p-values. In general, the power of the test for the hypotheses (5.6) is
$$\gamma(p) = P_p[S \le k], \quad p < p_0. \tag{5.8}$$
The curve labeled Test 1 in Figure 5.1 is the power function for the case $n = 20$, $p_0 = 0.7$, and $\alpha = 0.1133$. Notice that the power function is decreasing. The


power is higher to detect the alternative $p = 0.2$ than $p = 0.6$. In general, the monotonicity of the power function for binomial tests of these hypotheses can be proven. Using this monotonicity, we extend our test to the more general null hypothesis $H_0: p \ge p_0$ rather than simply $H_0: p = p_0$. Using the same decision rule as we used for the hypotheses (5.6), the definition of the size of a test (5.4), and the monotonicity of the power curve, we have
$$\max_{p \ge p_0} P_p[S \le k] = P_{p_0}[S \le k] = \alpha,$$
i.e., the same size as for the original null hypothesis.
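The sizes and power values quoted in Example 5.2 follow directly from the binomial cdf. The Python sketch below is an illustration only: the names `binom_cdf` and `power` are hypothetical, and it does not reproduce the R function binpower.r mentioned in the text. It evaluates $\gamma(p) = P_p[S \le k]$ for $n = 20$ with the cutoffs $k = 11$ and $k = 12$.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(S <= k) when S has a b(n, p) distribution."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))


def power(p, n=20, k=11):
    """Power function (5.8): probability of rejecting H0 (S <= k) when the success probability is p."""
    return binom_cdf(k, n, p)


size_k11 = power(0.7, k=11)   # size of the test that rejects when S <= 11
size_k12 = power(0.7, k=12)   # size of the test that rejects when S <= 12
print(round(size_k11, 4), round(size_k12, 4))

# The power decreases in p, and the larger cutoff gives more power at every p.
for p in (0.2, 0.4, 0.6):
    print(p, round(power(p, k=11), 4), round(power(p, k=12), 4))
```

Evaluating `power` on a fine grid of $p$ values gives the points needed to draw power curves like those described for Figure 5.1.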

Figure 5.1: Power curves for tests 1 and 2; see Example 5.2.

Denote by Test 1 the test for the situation with $n = 20$, $p_0 = 0.70$, and size $\alpha = 0.1133$. Suppose we have a second test (Test 2) with an increased size. How does the power function of Test 2 compare to Test 1? As an example, suppose for Test 2, we select $\alpha = 0.2277$. Hence, for Test 2, we reject $H_0$ if $S \le 12$. Figure 5.1 displays the resulting power function. Note that while Test 2 has a higher probability of committing a Type I error, it also has a higher power at each alternative $p < 0.7$. Exercise 5.7 shows this is true for these binomial tests. It is true in general; that is, if the size of the test increases, power does too. For this example, the R function binpower.r of Appendix: R Functions produces a version of Figure 5.1.

Remark 5.1 (Nomenclature). Since in Example 5.2, the first null hypothesis $H_0: p = p_0$ completely specifies the underlying distribution, it is called a simple hypothesis. Most hypotheses, such as $H_1: p < p_0$, are composite hypotheses, because they are composed of many simple hypotheses and hence do not completely specify the distribution. As we study more and more statistics, we find out that often other names are used for the size, $\alpha$, of the critical region. Frequently, $\alpha$ is also called the significance level of the test associated with that critical region. Moreover, sometimes $\alpha$ is called the "maximum of probabilities of committing an error of Type I" and the "maximum of the power of the test when $H_0$ is true." It is disconcerting to the student to discover that there are so many names for the same thing. However, all of them are used in the statistical literature, and we feel obligated to point out this fact.

The test in the last example is based on the exact distribution of its test statistic, i.e., the binomial distribution. Often we cannot obtain the distribution of the test statistic in closed form. As with approximate confidence intervals, however, we can frequently appeal to the Central Limit Theorem to obtain an approximate test; see Theorem 2.1. Such is the case for the next example.

Example 5.3 (Large Sample Test for the Mean). Let $X$ be a random variable with mean $\mu$ and finite variance $\sigma^2$. We want to test the hypotheses
$$H_0: \mu = \mu_0 \text{ versus } H_1: \mu > \mu_0, \tag{5.9}$$

where μ0 is specified. To illustrate, suppose μ0 is the mean level on a standardized test of students who have been taught a course by a standard method of teaching. Suppose it is hoped that a new method which incorporates computers has a mean level μ > μ0, where μ = E(X) and X is the score of a student taught by the new method. This conjecture is tested by having n students (randomly selected) taught under this new method. Let X1, . . . , Xn be a random sample from the distribution of X and denote the sample mean and variance by X̄ and S², respectively. Because X̄ is an unbiased estimator of μ, an intuitive decision rule is given by

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \bar{X} \text{ is much larger than } \mu_0. \tag{5.10}$$

In general, the distribution of the sample mean cannot be obtained in closed form. In Example 5.4, under the strong assumption of normality for the distribution of X, we obtain an exact test. For now, the Central Limit Theorem (Theorem 2.1) shows that the distribution of (X̄ − μ)/(S/√n) is approximately N(0, 1). Using this, we obtain a test with an approximate size α, with the decision rule

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \ge z_\alpha. \tag{5.11}$$

The test is intuitive. To reject H0, X̄ must exceed μ0 by at least zα S/√n. To approximate the power function of the test, we use the Central Limit Theorem. Upon substituting σ for S, it readily follows that the approximate power function is

$$\begin{aligned}
\gamma(\mu) &= P_\mu\left(\bar{X} \ge \mu_0 + z_\alpha \sigma/\sqrt{n}\right) \\
&= P_\mu\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \ge \frac{\mu_0 - \mu}{\sigma/\sqrt{n}} + z_\alpha\right) \\
&\approx 1 - \Phi\left(z_\alpha + \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma}\right) \\
&= \Phi\left(-z_\alpha - \frac{\sqrt{n}(\mu_0 - \mu)}{\sigma}\right).
\end{aligned} \tag{5.12}$$
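As a rough illustration, the approximate power function (5.12) can be evaluated numerically once a value for σ is assumed. The sketch below uses only the Python standard library; the function name `approx_power_one_sided` and the inputs μ0 = 50, σ = 10, n = 25, and α = 0.05 are hypothetical choices made for this illustration and do not come from the text.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf() is Phi, inv_cdf() gives quantiles


def approx_power_one_sided(mu, mu0, sigma, n, alpha):
    """Approximate power (5.12) of the large sample test of H0: mu = mu0 vs H1: mu > mu0."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)  # upper-alpha critical point z_alpha
    return STD_NORMAL.cdf(-z_alpha - sqrt(n) * (mu0 - mu) / sigma)


# Hypothetical inputs, chosen only for illustration.
mu0, sigma, n, alpha = 50.0, 10.0, 25, 0.05
for mu in (48.0, 50.0, 52.0, 54.0, 56.0):
    power = approx_power_one_sided(mu, mu0, sigma, n, alpha)
    print(f"mu = {mu:5.1f}   approximate power = {power:.4f}")
```

At μ = μ0 the function returns α, and the printed values increase with μ, which is the monotonicity that Exercise 5.1 asks the reader to prove.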

So if we have some reasonable idea of what σ equals, we can compute the approximate power function. As Exercise 5.1 shows, this approximate power function is strictly increasing in μ, so, as in the last example, we can change the null hypothesis to

$$H_0: \mu \le \mu_0 \ \text{ versus } \ H_1: \mu > \mu_0. \tag{5.13}$$

Our asymptotic test has approximate size α for these hypotheses.

Example 5.4 (Test for μ Under Normality). Let X have a N(μ, σ²) distribution. As in Example 5.3, consider the hypotheses

$$H_0: \mu = \mu_0 \ \text{ versus } \ H_1: \mu > \mu_0, \tag{5.14}$$

where μ0 is specified. Assume that the desired size of the test is α, for 0 < α < 1. Suppose X1, . . . , Xn is a random sample from a N(μ, σ²) distribution. Let X̄ and S² denote the sample mean and variance, respectively. Our intuitive rejection rule is to reject H0 in favor of H1 if X̄ is much larger than μ0. Unlike Example 5.3, we now know the distribution of the statistic X̄. In particular, under H0 the statistic T = (X̄ − μ0)/(S/√n) has a t-distribution with n − 1 degrees of freedom. Using the distribution of T, it follows that this rejection rule has exact level α:

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \ge t_{\alpha, n-1}, \tag{5.15}$$

where tα,n−1 is the upper α critical point of a t-distribution with n − 1 degrees of freedom; i.e., α = P(T > tα,n−1). This is often called the t-test of H0: μ = μ0. Note the differences between this rejection rule and the large sample rule, (5.11). The large sample rule has approximate level α, while this has exact level α. Of course, we now have to assume that X has a normal distribution. In practice, we may not be willing to assume that the population is normal. Usually t-critical values are larger than z-critical values; hence, the t-test is conservative relative to the large sample test. So, in practice, many statisticians often use the t-test.

Example 5.5 (Example 5.1, Continued). The data for Darwin's experiment on Zea mays are recorded in Table 5.2. A boxplot and a normal q−q plot of the 15 differences, xi = yi − zi, are found in Figure 5.2. Based on these plots, we can see that there seem to be two outliers, Pots 2 and 15. In these two pots, the self-fertilized Zea mays are much taller than their cross-fertilized pairs. Except for these two outliers, the differences, yi − zi, are positive, indicating that the cross-fertilization leads to taller plants. We proceed to conduct a test of hypotheses (5.2), as discussed in Example 5.4. We use the decision rule given by (5.15) with α = 0.05. As Exercise 5.2 shows, the values of the sample mean and standard deviation for the differences, xi, are x̄ = 2.62 and sx = 4.72. Hence the t-test statistic is 2.15, which exceeds the t-critical value, t.05,14 = 1.76. Thus we reject H0 and conclude that cross-fertilized Zea mays are on the average taller than self-fertilized Zea mays. Because of the outliers, normality of the error distribution is somewhat dubious, and we use the test in a conservative manner, as discussed at the end of Example 5.4.

Table 5.2: Plant Growth

Pot   Cross    Self
 1    23.500   17.375
 2    12.000   20.375
 3    21.000   20.000
 4    22.000   20.000
 5    19.125   18.375
 6    21.500   18.625
 7    22.125   18.625
 8    20.375   15.250
 9    18.250   16.500
10    21.625   18.000
11    23.250   16.250
12    21.000   18.000
13    22.125   12.750
14    23.000   15.500
15    12.000   18.000
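The summary statistics quoted in Example 5.5 can be reproduced directly from Table 5.2. The following is a minimal sketch using the Python standard library; the variable names are our own, the null value 0 for the mean difference is inferred from the reported statistic, and the critical value 1.76 is copied from the example rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Heights from Table 5.2, Pots 1 through 15.
cross = [23.500, 12.000, 21.000, 22.000, 19.125, 21.500, 22.125, 20.375,
         18.250, 21.625, 23.250, 21.000, 22.125, 23.000, 12.000]
self_fert = [17.375, 20.375, 20.000, 20.000, 18.375, 18.625, 18.625, 15.250,
             16.500, 18.000, 16.250, 18.000, 12.750, 15.500, 18.000]

# Paired differences x_i = y_i - z_i (cross minus self).
diffs = [y - z for y, z in zip(cross, self_fert)]
n = len(diffs)

x_bar = mean(diffs)                 # sample mean of the differences
s_x = stdev(diffs)                  # sample standard deviation (divisor n - 1)
t_stat = x_bar / (s_x / sqrt(n))    # statistic (5.15) with mu0 = 0 (assumed null value)

T_CRITICAL = 1.76                   # t_{.05,14}, taken from Example 5.5
print(f"x_bar = {x_bar:.2f}, s_x = {s_x:.2f}, t = {t_stat:.2f}")
print("reject H0" if t_stat >= T_CRITICAL else "do not reject H0")
```

Working from the paired differences rather than from the two columns separately matches the design of the experiment, in which each pot supplies one cross-fertilized and one self-fertilized plant.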

[Figure 5.2 has two panels, A and B, each with vertical axis "Difference = Cross − Self"; the q−q panel has horizontal axis "Normal quantiles".]

Figure 5.2: Boxplot and normal q−q plot for the data of Example 5.5.

EXERCISES

5.1. Show that the approximate power function given in expression (5.12) of Example 5.3 is a strictly increasing function of μ. Show then that the test discussed in this example has approximate size α for testing H0: μ ≤ μ0 versus H1: μ > μ0.

5.2. For the Darwin data tabled in Example 5.5, verify that the Student t-test statistic is 2.15.

5.3. Let X have a pdf of the form f(x; θ) = θx^(θ−1), 0 < x < 1, zero elsewhere, where θ ∈ {θ : θ = 1, 2}. To test the simple hypothesis H0: θ = 1 against the alternative simple hypothesis H1: θ = 2, use a random sample X1, X2 of size n = 2 and define the critical region to be C = {(x1, x2) : 3/4 ≤ x1x2}. Find the power function of the test.

5.4. Let X have a binomial distribution with the number of trials n = 10 and with p either 1/4 or 1/2. The simple hypothesis H0: p = 1/2 is rejected, and the alternative simple hypothesis H1: p = 1/4 is accepted, if the observed value of X1, a random sample of size 1, is less than or equal to 3. Find the significance level and the power of the test.

5.5. Let X1, X2 be a random sample of size n = 2 from the distribution having pdf f(x; θ) = (1/θ)e^(−x/θ), 0 < x < ∞, zero elsewhere. We reject H0: θ = 2 and accept H1: θ = 1 if the observed values of X1, X2, say x1, x2, are such that

$$\frac{f(x_1; 2) f(x_2; 2)}{f(x_1; 1) f(x_2; 1)} \le \frac{1}{2}.$$

Here Ω = {θ : θ = 1, 2}. Find the significance level of the test and the power of the test when H0 is false.

5.6. Consider the tests Test 1 and Test 2 for the situation discussed in Example 5.2. Consider the test which rejects H0 if S ≤ 10. Find the level of significance for this test and sketch its power curve as in Figure 5.1.

5.7. Consider the situation described in Example 5.2. Suppose we have two tests A and B defined as follows. For Test A, H0 is rejected if S ≤ kA, while for Test B, H0 is rejected if S ≤ kB. If Test A has a higher level of significance than Test B, show that Test A has higher power than Test B at each alternative.

5.8. Let us say the life of a tire in miles, say X, is normally distributed with mean θ and standard deviation 5000. Past experience indicates that θ = 30,000. The manufacturer claims that the tires made by a new process have mean θ > 30,000. It is possible that θ = 35,000. Check his claim by testing H0: θ = 30,000 against H1: θ > 30,000. We observe n independent values of X, say x1, . . . , xn, and we reject H0 (thus accept H1) if and only if x̄ ≥ c. Determine n and c so that the power function γ(θ) of the test has the values γ(30,000) = 0.01 and γ(35,000) = 0.98.

5.9. Let X have a Poisson distribution with mean θ. Consider the simple hypothesis H0: θ = 1/2 and the alternative composite hypothesis H1: θ < 1/2. Thus Ω = {θ : 0 < θ ≤ 1/2}. Let X1, . . . , X12 denote a random sample of size 12 from this distribution. We reject H0 if and only if the observed value of Y = X1 + · · · + X12 ≤ 2. If γ(θ) is the power function of the test, find the powers γ(1/2), γ(1/3), γ(1/4), γ(1/6), and γ(1/12). Sketch the graph of γ(θ). What is the significance level of the test?

5.10. Let Y have a binomial distribution with parameters n and p. We reject H0: p = 1/2 and accept H1: p > 1/2 if Y ≥ c. Find n and c to give a power function γ(p) which is such that γ(1/2) = 0.10 and γ(2/3) = 0.95, approximately.

5.11. Let Y1 < Y2 < Y3 < Y4 be the order statistics of a random sample of size n = 4 from a distribution with pdf f(x; θ) = 1/θ, 0 < x < θ, zero elsewhere, where 0 < θ. The hypothesis H0: θ = 1 is rejected and H1: θ > 1 is accepted if the observed Y4 ≥ c.
(a) Find the constant c so that the significance level is α = 0.05.
(b) Determine the power function of the test.

5.12. Let X1, X2, . . . , X8 be a random sample of size n = 8 from a Poisson distribution with mean μ. Reject the simple null hypothesis H0: μ = 0.5 and accept H1: μ > 0.5 if the observed sum $\sum_{i=1}^{8} x_i \ge 8$.
(a) Compute the significance level α of the test.
(b) Find the power function γ(μ) of the test as a sum of Poisson probabilities.
(c) Using Table I of Appendix: Tables of Distributions, determine γ(0.75), γ(1), and γ(1.25).

5.13. Let p denote the probability that, for a particular tennis player, the first serve is good. Since p = 0.40, this player decided to take lessons in order to increase p. When the lessons are completed, the hypothesis H0: p = 0.40 is tested against H1: p > 0.40 based on n = 25 trials. Let y equal the number of first serves that are good, and let the critical region be defined by C = {y : y ≥ 13}.
(a) Determine α = P(Y ≥ 13; p = 0.40).
(b) Find β = P(Y < 13) when p = 0.60; that is, β = P(Y ≤ 12; p = 0.60), so that 1 − β is the power at p = 0.60.

6

Additional Comments About Statistical Tests

All of the alternative hypotheses considered in Section 5 were one-sided hypotheses. For illustration, in Exercise 5.8 we tested H0: μ = 30,000 against the one-sided alternative H1: μ > 30,000, where μ is the mean of a normal distribution having standard deviation σ = 5000. Perhaps in this situation, though, we think the manufacturer's process has changed but are unsure of the direction. That is, we are interested in the alternative H1: μ ≠ 30,000. In this section, we further explore hypothesis testing, and we begin with the construction of a test for a two-sided alternative involving the mean of a random variable.

Example 6.1 (Large Sample Two-Sided Test for the Mean). In order to see how to construct a test for a two-sided alternative, reconsider Example 5.3, where we constructed a large sample one-sided test for the mean of a random variable. As in Example 5.3, let X be a random variable with mean μ and finite variance σ². Here, though, we want to test

$$H_0: \mu = \mu_0 \ \text{ versus } \ H_1: \mu \ne \mu_0, \tag{6.1}$$

where μ0 is specified. Let X1, . . . , Xn be a random sample from the distribution of X and denote the sample mean and variance by X̄ and S², respectively. For the one-sided test, we rejected H0 if X̄ was too large; hence, for the hypotheses (6.1), we use the decision rule

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \bar{X} \le h \text{ or } \bar{X} \ge k, \tag{6.2}$$

where h and k are such that α = PH0[X̄ ≤ h or X̄ ≥ k]. Clearly, h < k; hence, we have

$$\alpha = P_{H_0}[\bar{X} \le h \text{ or } \bar{X} \ge k] = P_{H_0}[\bar{X} \le h] + P_{H_0}[\bar{X} \ge k].$$

Since, at least for large samples, the distribution of X̄ is approximately symmetric about μ0 under H0, an intuitive rule is to divide α equally between the two terms on the right side of the above expression; that is, h and k are chosen by

$$P_{H_0}[\bar{X} \le h] = \alpha/2 \ \text{ and } \ P_{H_0}[\bar{X} \ge k] = \alpha/2. \tag{6.3}$$

From Theorem 2.1, it follows that (X̄ − μ0)/(S/√n) is approximately N(0, 1). This and (6.3) lead to the approximate decision rule

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \left|\frac{\bar{X} - \mu_0}{S/\sqrt{n}}\right| \ge z_{\alpha/2}. \tag{6.4}$$

Upon substituting σ for S, it readily follows that the approximate power function is

$$\begin{aligned}
\gamma(\mu) &= P_\mu\left(\bar{X} \le \mu_0 - z_{\alpha/2}\sigma/\sqrt{n}\right) + P_\mu\left(\bar{X} \ge \mu_0 + z_{\alpha/2}\sigma/\sqrt{n}\right) \\
&= \Phi\left(\frac{\sqrt{n}(\mu_0 - \mu)}{\sigma} - z_{\alpha/2}\right) + 1 - \Phi\left(\frac{\sqrt{n}(\mu_0 - \mu)}{\sigma} + z_{\alpha/2}\right),
\end{aligned} \tag{6.5}$$

where Φ(z) is the cdf of a standard normal random variable. So if we have some reasonable idea of what σ equals, we can compute the approximate power function. Note that the derivative of the power function is

$$\gamma'(\mu) = \frac{\sqrt{n}}{\sigma}\left[\phi\left(\frac{\sqrt{n}(\mu_0 - \mu)}{\sigma} + z_{\alpha/2}\right) - \phi\left(\frac{\sqrt{n}(\mu_0 - \mu)}{\sigma} - z_{\alpha/2}\right)\right], \tag{6.6}$$

where φ(z) is the pdf of a standard normal random variable. Note that γ(μ) has a critical value at μ0. As Exercise 6.2 shows, this gives the minimum of γ(μ). Further, γ(μ) is strictly decreasing for μ < μ0 and strictly increasing for μ > μ0.

Consider again the situation at the beginning of this section. Suppose we want to test

$$H_0: \mu = 30{,}000 \ \text{ versus } \ H_1: \mu \ne 30{,}000. \tag{6.7}$$

Suppose n = 20 and α = 0.01. Then the rejection rule (6.4) becomes

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \left|\frac{\bar{X} - 30{,}000}{S/\sqrt{20}}\right| \ge 2.575. \tag{6.8}$$

Figure 6.1 shows the power curve for this test when σ = 5000, as in Exercise 5.8, is substituted in for S. For comparison, the power curve for the test with level α = 0.05 is also shown; see Exercise 6.1.

The two-sided test for the mean is approximate. If we assume that X has a normal distribution, then, as Exercise 6.3 shows, the following test has exact size α for testing H0: μ = μ0 versus H1: μ ≠ μ0:

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \left|\frac{\bar{X} - \mu_0}{S/\sqrt{n}}\right| \ge t_{\alpha/2, n-1}. \tag{6.9}$$

It too has a bowl-shaped power curve similar to Figure 6.1, although it is not as easy to show; see Lehmann (1986).
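The approximate power function (6.5) behind Figure 6.1 is easy to evaluate numerically. The sketch below, written with the Python standard library, uses μ0 = 30,000, σ = 5000, and n = 20 from the discussion above; the function name `approx_power_two_sided` and the grid of μ values are our own choices for illustration.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def approx_power_two_sided(mu, mu0, sigma, n, alpha):
    """Approximate power (6.5) of the large sample test of H0: mu = mu0 vs H1: mu != mu0."""
    z_half = PHI.inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 critical point
    shift = sqrt(n) * (mu0 - mu) / sigma      # standardized distance from the null
    return PHI.cdf(shift - z_half) + 1.0 - PHI.cdf(shift + z_half)


mu0, sigma, n = 30_000.0, 5_000.0, 20
for mu in (26_000.0, 28_000.0, 30_000.0, 32_000.0, 34_000.0):
    p01 = approx_power_two_sided(mu, mu0, sigma, n, 0.01)
    p05 = approx_power_two_sided(mu, mu0, sigma, n, 0.05)
    print(f"mu = {mu:8.0f}   size 0.01: {p01:.4f}   size 0.05: {p05:.4f}")
```

The output is symmetric about μ0, equals the size at μ = μ0, and is larger for the size 0.05 test at every μ, which is the bowl shape described for Figure 6.1.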

[Figure 6.1 plots the power γ(μ) against μ from 26,000 to 34,000, with one curve labeled "Test of size α = 0.05" and one labeled "Test of size α = 0.01".]

Figure 6.1: Power curves for the tests of the hypotheses (6.7).

There exists a relationship between two-sided tests and confidence intervals. Consider the two-sided t-test (6.9). Here, we use the rejection rule with "if and only if" replacing "if." Hence, in terms of acceptance, we have

$$\text{Accept } H_0 \text{ if and only if } \mu_0 - t_{\alpha/2,n-1} S/\sqrt{n} < \bar{X} < \mu_0 + t_{\alpha/2,n-1} S/\sqrt{n}.$$

But this is easily shown to be

$$\text{Accept } H_0 \text{ if and only if } \mu_0 \in \left(\bar{X} - t_{\alpha/2,n-1} S/\sqrt{n},\ \bar{X} + t_{\alpha/2,n-1} S/\sqrt{n}\right); \tag{6.10}$$

that is, we accept H0 at significance level α if and only if μ0 is in the (1 − α)100% confidence interval for μ. Equivalently, we reject H0 at significance level α if and only if μ0 is not in the (1 − α)100% confidence interval for μ. This is true for all the two-sided tests and hypotheses discussed in this text. There is also a similar relationship between one-sided tests and one-sided confidence intervals.

Once we recognize this relationship between confidence intervals and tests of hypothesis, we can use all those statistics that we used to construct confidence intervals to test hypotheses, not only against two-sided alternatives but one-sided ones as well. Without listing all of these in a table, we present enough of them so that the principle can be understood.

Example 6.2. Let independent random samples be taken from N(μ1, σ²) and N(μ2, σ²), respectively. Say these have the respective sample characteristics n1, X̄, S1² and n2, Ȳ, S2². Let n = n1 + n2 denote the combined sample size and let Sp² = [(n1 − 1)S1² + (n2 − 1)S2²]/(n − 2), expression (2.11), be the pooled estimator of the common variance. At α = 0.05, reject H0: μ1 = μ2 and accept the one-sided alternative H1: μ1 > μ2 if

$$T = \frac{\bar{X} - \bar{Y} - 0}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \ge t_{.05, n-2},$$

because, under H0: μ1 = μ2, T has a t(n − 2)-distribution.

Example 6.3. Say X is b(1, p). Consider testing H0: p = p0 against H1: p < p0. Let X1, . . . , Xn be a random sample from the distribution of X and let p̂ = X̄. To test H0 versus H1, we use either

$$Z_1 = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} \le c \quad \text{or} \quad Z_2 = \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1 - \hat{p})/n}} \le c.$$

If n is large, both Z1 and Z2 have approximate standard normal distributions provided that H0: p = p0 is true. Hence, if c is set at −1.645, then the approximate significance level is α = 0.05. Some statisticians use Z1 and others Z2. We do not have strong preferences one way or the other because the two methods provide about the same numerical results. As one might suspect, using Z1 provides better probabilities for power calculations if the true p is close to p0, while Z2 is better if H0 is clearly false. However, with a two-sided alternative hypothesis, Z2 does provide a better relationship with the confidence interval for p. That is, |Z2| < zα/2 is equivalent to p0 being in the interval from

$$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \quad \text{to} \quad \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},$$

which is the interval that provides a (1 − α)100% approximate confidence interval for p as considered in Section 2.

In closing this section, we introduce the concepts of randomized tests and p-values through an example and remarks that follow the example.

Example 6.4. Let X1, X2, . . . , X10 be a random sample of size n = 10 from a Poisson distribution with mean θ. A critical region for testing H0: θ = 0.1 against H1: θ > 0.1 is given by $Y = \sum_{i=1}^{10} X_i \ge 3$. The statistic Y has a Poisson distribution with mean 10θ. Thus, with θ = 0.1 so that the mean of Y is 1, the significance level of the test is

P(Y ≥ 3) = 1 − P(Y ≤ 2) = 1 − 0.920 = 0.080.

If the critical region defined by $\sum_{i=1}^{10} x_i \ge 4$ is used, the significance level is

α = P(Y ≥ 4) = 1 − P(Y ≤ 3) = 1 − 0.981 = 0.019.

For instance, if a significance level of about α = 0.05, say, is desired, most statisticians would use one of these tests; that is, they would adjust the significance level to that of one of these convenient tests. However, a significance level of α = 0.05 can be achieved in the following way. Let W have a Bernoulli distribution with probability of success equal to

$$P(W = 1) = \frac{0.050 - 0.019}{0.080 - 0.019} = \frac{31}{61}.$$

Assume that W is selected independently of the sample. Consider the rejection rule

$$\text{Reject } H_0 \text{ if } \sum_{i=1}^{10} x_i \ge 4 \text{ or if } \sum_{i=1}^{10} x_i = 3 \text{ and } W = 1.$$

The significance level of this rule is

$$\begin{aligned}
P_{H_0}(Y \ge 4) + P_{H_0}(\{Y = 3\} \cap \{W = 1\}) &= P_{H_0}(Y \ge 4) + P_{H_0}(Y = 3)P(W = 1) \\
&= 0.019 + 0.061\left(\frac{31}{61}\right) = 0.05;
\end{aligned}$$

hence, the decision rule has exactly level 0.05. The process of performing the auxiliary experiment to decide whether to reject or not when Y = 3 is sometimes referred to as a randomized test.

Remark 6.1 (Observed Significance Level). Not many statisticians like randomized tests in practice, because the use of them means that two statisticians could make the same assumptions, observe the same data, apply the same test, and yet make different decisions. Hence, they usually adjust their significance level so as not to randomize. As a matter of fact, many statisticians report what are commonly called observed significance levels or p-values (for probability values). For illustration, if in Example 6.4 the observed Y is y = 4, the p-value is 0.019; and if it is y = 3, the p-value is 0.080. That is, the p-value is the observed "tail" probability of a statistic being at least as extreme as the particular observed value when H0 is true. Hence, more generally, if Y = u(X1, X2, . . . , Xn) is the statistic to be used in a test of H0 and if the critical region is of the form

u(x1, x2, . . . , xn) ≤ c,

an observed value u(x1, x2, . . . , xn) = d means that the

p-value = PH0(Y ≤ d).

That is, if G(y) is the distribution function of Y = u(X1, X2, . . . , Xn), provided that H0 is true, the p-value is equal to G(d) in this case. However, G(Y), in the continuous case, is uniformly distributed on the unit interval, so an observed value G(d) ≤ 0.05 is equivalent to selecting c so that

PH0[u(X1, X2, . . . , Xn) ≤ c] = 0.05

and observing that d ≤ c. Most computer programs automatically print out the p-value of a test.

Example 6.5. Let X1, X2, . . . , X25 be a random sample from N(μ, σ² = 4). To test H0: μ = 77 against the one-sided alternative hypothesis H1: μ < 77, say we observe the 25 values and determine that x̄ = 76.1. The variance of X̄ is σ²/n = 4/25 = 0.16; so we know that Z = (X̄ − 77)/0.4 is N(0, 1) provided that μ = 77. Since the observed value of this test statistic is z = (76.1 − 77)/0.4 = −2.25, the p-value of the test is Φ(−2.25) = 1 − 0.988 = 0.012. Accordingly, if we were using a significance level of α = 0.05, we would reject H0 and accept H1: μ < 77 because 0.012 < 0.05.
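The numbers in Examples 6.4 and 6.5 can be checked with a few lines of code. The sketch below uses only the Python standard library; the helper names `poisson_pmf` and `poisson_upper_tail` are our own. Because it works with unrounded Poisson probabilities, the randomization probability it prints differs slightly from 31/61, which is built from the three-decimal table values.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist


def poisson_pmf(k, mean):
    """P(Y = k) for a Poisson random variable with the given mean."""
    return exp(-mean) * mean ** k / factorial(k)


def poisson_upper_tail(k, mean):
    """P(Y >= k) for a Poisson random variable with the given mean."""
    return 1.0 - sum(poisson_pmf(j, mean) for j in range(k))


# Example 6.4: under H0 (theta = 0.1), Y has a Poisson distribution with mean 10 * 0.1 = 1.
mean_y = 10 * 0.1
level_3 = poisson_upper_tail(3, mean_y)   # size of the region Y >= 3; also the p-value for y = 3
level_4 = poisson_upper_tail(4, mean_y)   # size of the region Y >= 4; also the p-value for y = 4

target = 0.05
prob_w = (target - level_4) / (level_3 - level_4)             # P(W = 1) for the auxiliary experiment
randomized_level = level_4 + poisson_pmf(3, mean_y) * prob_w  # size of the randomized test

print(f"P(Y >= 3) = {level_3:.3f}, P(Y >= 4) = {level_4:.3f}")
print(f"P(W = 1) = {prob_w:.4f}, randomized level = {randomized_level:.4f}")

# Example 6.5: one-sided p-value for H1: mu < 77 with known variance 4 and n = 25.
z = (76.1 - 77) / sqrt(4 / 25)
p_value = NormalDist().cdf(z)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

The randomized level equals the target by construction, since level_3 − level_4 is exactly P(Y = 3).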

EXERCISES

6.1. For the test at level 0.05 of the hypotheses given by (6.1) with μ0 = 30,000 and n = 20, obtain the power function (use σ = 5000). Evaluate the power function for the following values: μ = 25,000; 27,500; 30,000; 32,500; and 35,000. Then sketch this power function and see if it agrees with Figure 6.1.

6.2. Consider the power function γ(μ) and its derivative γ′(μ) given by (6.5) and (6.6). Show that γ′(μ) is strictly negative for μ < μ0 and strictly positive for μ > μ0.

6.3. Show that the test defined by (6.9) has exact size α for testing H0: μ = μ0 versus H1: μ ≠ μ0.

6.4. Consider the one-sided t-test for H0: μ = μ0 versus H1: μ > μ0 constructed in Example 5.4 and the two-sided t-test for H0: μ = μ0 versus H1: μ ≠ μ0 given in (6.9). Assume that both tests are of size α. Show that for μ > μ0, the power function of the one-sided test is larger than the power function of the two-sided test.

6.5. Assume that the weight of cereal in a "10-ounce box" is N(μ, σ²). To test H0: μ = 10.1 against H1: μ > 10.1, we take a random sample of size n = 16 and observe that x̄ = 10.4 and s = 0.4.
(a) Do we accept or reject H0 at the 5% significance level?
(b) What is the approximate p-value of this test?

6.6. Each of 51 golfers hit three golf balls of brand X and three golf balls of brand Y in a random order. Let Xi and Yi equal the averages of the distances traveled by the brand X and brand Y golf balls hit by the ith golfer, i = 1, 2, . . . , 51. Let Wi = Xi − Yi, i = 1, 2, . . . , 51. Test H0: μW = 0 against H1: μW > 0, where μW is the mean of the differences. If w̄ = 2.07 and s²W = 84.63, would H0 be accepted or rejected at an α = 0.05 significance level? What is the p-value of this test?

6.7. Among the data collected for the World Health Organization air quality monitoring project is a measure of suspended particles in μg/m³. Let X and Y equal the concentration of suspended particles in μg/m³ in the city center (commercial district) for Melbourne and Houston, respectively. Using n = 13 observations of X and m = 16 observations of Y, we test H0: μX = μY against H1: μX < μY.
(a) Define the test statistic and critical region, assuming that the unknown variances are equal. Let α = 0.05.
(b) If x̄ = 72.9, sx = 25.6, ȳ = 81.7, and sy = 28.3, calculate the value of the test statistic and state your conclusion.

6.8. Let p equal the proportion of drivers who use a seat belt in a country that does not have a mandatory seat belt law. It was claimed that p = 0.14. An advertising campaign was conducted to increase this proportion. Two months after the campaign, y = 104 out of a random sample of n = 590 drivers were wearing their seat belts. Was the campaign successful?
(a) Define the null and alternative hypotheses.
(b) Define a critical region with an α = 0.01 significance level.
(c) Determine the approximate p-value and state your conclusion.

6.9. In Exercise 2.18 we found a confidence interval for the variance σ² using the variance S² of a random sample of size n arising from N(μ, σ²), where the mean μ is unknown. In testing H0: σ² = σ0² against H1: σ² > σ0², use the critical region defined by (n − 1)S²/σ0² ≥ c. That is, reject H0 and accept H1 if S² ≥ cσ0²/(n − 1). If n = 13 and the significance level α = 0.025, determine c.

6.10. In Exercise 2.27, in finding a confidence interval for the ratio of the variances of two normal distributions, we used a statistic S1²/S2², which has an F-distribution when those two variances are equal. If we denote that statistic by F, we can test H0: σ1² = σ2² against H1: σ1² > σ2² using the critical region F ≥ c. If n = 13, m = 11, and α = 0.05, find c.

7

Chi-Square Tests

In this section we introduce tests of statistical hypotheses called chi-square tests. A test of this sort was originally proposed by Karl Pearson in 1900, and it provided one of the earlier methods of statistical inference.

Let the random variable Xi be N(μi, σi²), i = 1, 2, . . . , n, and let X1, X2, . . . , Xn be mutually independent. Thus the joint pdf of these variables is

$$\frac{1}{\sigma_1 \sigma_2 \cdots \sigma_n (2\pi)^{n/2}} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right], \quad -\infty < x_i < \infty.$$

The random variable that is defined by the exponent (apart from the coefficient −1/2) is $\sum_{i=1}^{n} [(X_i - \mu_i)/\sigma_i]^2$, and this random variable has a χ²(n) distribution. This joint normal distribution can be generalized to n random variables that are dependent; the resulting distribution is called a multivariate normal distribution. A certain exponent in its joint pdf (apart from a coefficient of −1/2) defines a random variable that is χ²(n). This fact is the mathematical basis of the chi-square tests.

Let us now discuss some random variables that have approximate chi-square distributions. Let X1 be b(n, p1). Consider the random variable

$$Y = \frac{X_1 - np_1}{\sqrt{np_1(1 - p_1)}},$$

which has, as n → ∞, an approximate N(0, 1) distribution (see Theorem 2.1). Furthermore, the distribution of Y² is approximately χ²(1). Let X2 = n − X1 and let p2 = 1 − p1. Let Q1 = Y². Then Q1 may be written as

$$\begin{aligned}
Q_1 &= \frac{(X_1 - np_1)^2}{np_1(1 - p_1)} \\
&= \frac{(X_1 - np_1)^2}{np_1} + \frac{(X_1 - np_1)^2}{n(1 - p_1)} \\
&= \frac{(X_1 - np_1)^2}{np_1} + \frac{(X_2 - np_2)^2}{np_2}
\end{aligned} \tag{7.1}$$

because (X1 − np1)² = (n − X2 − n + np2)² = (X2 − np2)². This result can be generalized as follows. Let X1, X2, . . . , Xk−1 have a multinomial distribution with the parameters n and p1, . . . , pk−1. Let Xk = n − (X1 + · · · + Xk−1) and let pk = 1 − (p1 + · · · + pk−1). Define Qk−1 by

$$Q_{k-1} = \sum_{i=1}^{k} \frac{(X_i - np_i)^2}{np_i}.$$

It is proved in a more advanced course that, as n → ∞, Qk−1 has an approximate χ²(k − 1) distribution. Some writers caution the user of this approximation to be certain that n is large enough so that each npi, i = 1, 2, . . . , k, is at least equal to 5. In any case, it is important to realize that Qk−1 does not have a chi-square distribution, only an approximate chi-square distribution.

The random variable Qk−1 may serve as the basis of the tests of certain statistical hypotheses which we now discuss. Let the sample space A of a random experiment be the union of a finite number k of mutually disjoint sets A1, A2, . . . , Ak. Furthermore, let P(Ai) = pi, i = 1, 2, . . . , k, where pk = 1 − p1 − · · · − pk−1, so that pi is the probability that the outcome of the random experiment is an element of the set Ai. The random experiment is to be repeated n independent times and Xi represents the number of times the outcome is an element of set Ai. That is, X1, X2, . . . , Xk = n − X1 − · · · − Xk−1 are the frequencies with which the outcome is, respectively, an element of A1, A2, . . . , Ak. Then the joint pmf of X1, X2, . . . , Xk−1 is the multinomial pmf with the parameters n, p1, . . . , pk−1. Consider the simple hypothesis (concerning this multinomial pmf) H0: p1 = p10, p2 = p20, . . . , pk−1 = pk−1,0 (pk = pk0 = 1 − p10 − · · · − pk−1,0), where p10, . . . , pk−1,0 are specified numbers. It is desired to test H0 against all alternatives. If the hypothesis H0 is true, the random variable

$$Q_{k-1} = \sum_{i=1}^{k} \frac{(X_i - np_{i0})^2}{np_{i0}}$$

has an approximate chi-square distribution with k − 1 degrees of freedom. Since, when H0 is true, npi0 is the expected value of Xi, one would feel intuitively that observed values of Qk−1 should not be too large if H0 is true. With this in mind, we may use Table II of Appendix: Tables of Distributions, with k − 1 degrees of freedom, and find c so that P(Qk−1 ≥ c) = α, where α is the desired significance level of the test. If, then, the hypothesis H0 is rejected when the observed value of Qk−1 is at least as great as c, the test of H0 has a significance level that is approximately equal to α. This is frequently called a goodness-of-fit test. Some illustrative examples follow.

Example 7.1. One of the first six positive integers is to be chosen by a random experiment (perhaps by the cast of a die). Let Ai = {x : x = i}, i = 1, 2, . . . , 6. The hypothesis H0: P(Ai) = pi0 = 1/6, i = 1, 2, . . . , 6, is tested, at the approximate 5% significance level, against all alternatives. To make the test, the random experiment is repeated under the same conditions, 60 independent times. In this example, k = 6 and npi0 = 60(1/6) = 10, i = 1, 2, . . . , 6. Let Xi denote the frequency with which the random experiment terminates with the outcome in Ai, i = 1, 2, . . . , 6, and let $Q_5 = \sum_{i=1}^{6} (X_i - 10)^2/10$. If H0 is true, Table II, with k − 1 = 6 − 1 = 5 degrees of freedom, shows that we have P(Q5 ≥ 11.1) = 0.05. Now suppose that the experimental frequencies of A1, A2, . . . , A6 are, respectively, 13, 19, 11, 8, 5, and 4. The observed value of Q5 is

$$\frac{(13 - 10)^2}{10} + \frac{(19 - 10)^2}{10} + \frac{(11 - 10)^2}{10} + \frac{(8 - 10)^2}{10} + \frac{(5 - 10)^2}{10} + \frac{(4 - 10)^2}{10} = 15.6.$$

Since 15.6 > 11.1, the hypothesis P(Ai) = 1/6, i = 1, 2, . . . , 6, is rejected at the (approximate) 5% significance level.
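The arithmetic of Example 7.1 can be sketched in a few lines of Python. The function name `pearson_statistic` is our own, and the critical value 11.1 is copied from the example (Table II with 5 degrees of freedom) rather than computed, since the standard library has no chi-square quantile function.

```python
def pearson_statistic(observed, null_probs):
    """Q_{k-1}: sum over cells of (X_i - n p_i0)^2 / (n p_i0)."""
    n = sum(observed)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(observed, null_probs))


# Example 7.1: 60 casts, six equally likely outcomes under H0.
observed = [13, 19, 11, 8, 5, 4]
null_probs = [1 / 6] * 6

q5 = pearson_statistic(observed, null_probs)
CRITICAL_VALUE = 11.1   # upper 5% point of chi-square(5), taken from Example 7.1

print(f"Q5 = {q5:.1f}")
print("reject H0" if q5 >= CRITICAL_VALUE else "do not reject H0")
```

The same function applies to any simple multinomial hypothesis: pass the observed cell frequencies and the hypothesized cell probabilities, then compare the result with the table value for k − 1 degrees of freedom.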

Example 7.2. A point is to be selected from the unit interval {x : 0 < x < 1} by a random process. Let A1 = {x : 0 < x ≤ 1/4}, A2 = {x : 1/4 < x ≤ 1/2}, A3 = {x : 1/2 < x ≤ 3/4}, and A4 = {x : 3/4 < x < 1}. Let the probabilities pi, i = 1, 2, 3, 4, assigned to these sets under the hypothesis be determined by the pdf 2x, 0 < x < 1, zero elsewhere. Then these probabilities are, respectively,

$$p_{10} = \int_0^{1/4} 2x\,dx = \frac{1}{16}, \quad p_{20} = \frac{3}{16}, \quad p_{30} = \frac{5}{16}, \quad p_{40} = \frac{7}{16}.$$

Thus the hypothesis to be tested is that p1, p2, p3, and p4 = 1 − p1 − p2 − p3 have the preceding values in a multinomial distribution with k = 4. This hypothesis is to be tested at an approximate 0.025 significance level by repeating the random experiment n = 80 independent times under the same conditions. Here the npi0 for i = 1, 2, 3, 4, are, respectively, 5, 15, 25, and 35. Suppose the observed frequencies of A1, A2, A3, and A4 are 6, 18, 20, and 36, respectively. Then the observed value of $Q_3 = \sum_{i=1}^{4} (X_i - np_{i0})^2/(np_{i0})$ is

$$\frac{(6 - 5)^2}{5} + \frac{(18 - 15)^2}{15} + \frac{(20 - 25)^2}{25} + \frac{(36 - 35)^2}{35} = \frac{64}{35} = 1.83,$$

approximately. From Table II, with 4 − 1 = 3 degrees of freedom, the value corresponding to a 0.025 significance level is c = 9.35. Since the observed value of Q3 is less than 9.35, the hypothesis is accepted at the (approximate) 0.025 level of significance.

Thus far we have used the chi-square test when the hypothesis H0 is a simple hypothesis. More often we encounter hypotheses H0 in which the multinomial probabilities p1, p2, . . . , pk are not completely specified by the hypothesis H0. That is, under H0, these probabilities are functions of unknown parameters. For an illustration, suppose that a certain random variable Y can take on any real value. Let us partition the space {y : −∞ < y < ∞} into k mutually disjoint sets A1, A2, . . . , Ak so that the events A1, A2, . . . , Ak are mutually exclusive and exhaustive. Let H0 be the hypothesis that Y is N(μ, σ²) with μ and σ² unspecified. Then each

$$p_i = \int_{A_i} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-(y - \mu)^2/2\sigma^2\right] dy, \quad i = 1, 2, \ldots, k,$$

is a function of the unknown parameters μ and σ². Suppose that we take a random sample Y1, . . . , Yn of size n from this distribution. If we let Xi denote the frequency of Ai, i = 1, 2, . . . , k, so that X1 + X2 + · · · + Xk = n, the random variable

$$Q_{k-1} = \sum_{i=1}^{k} \frac{(X_i - np_i)^2}{np_i}$$

cannot be computed once $X_1, \ldots, X_k$ have been observed, since each $p_i$, and hence $Q_{k-1}$, is a function of $\mu$ and $\sigma^2$. Accordingly, choose the values of $\mu$ and $\sigma^2$ that minimize $Q_{k-1}$. These values depend upon the observed $X_1 = x_1, \ldots, X_k = x_k$ and are called minimum chi-square estimates of $\mu$ and $\sigma^2$. These point estimates of $\mu$ and $\sigma^2$ enable us to compute numerically the estimates of each $p_i$. Accordingly, if these values are used, $Q_{k-1}$ can be computed once $Y_1, Y_2, \ldots, Y_n$, and hence $X_1, X_2, \ldots, X_k$, are observed. However, a very important aspect of the fact, which we accept without proof, is that now $Q_{k-1}$ is approximately $\chi^2(k - 3)$. That is, the number of degrees of freedom of the approximate chi-square distribution of $Q_{k-1}$ is reduced by one for each parameter estimated by the observed data. This statement applies not only to the problem at hand but also to more general situations. Two examples are now given. The first of these examples deals with the test of the hypothesis that two multinomial distributions are the same.

Remark 7.1. In many cases, such as that involving the mean $\mu$ and the variance $\sigma^2$ of a normal distribution, minimum chi-square estimates are difficult to compute. Other estimates, such as the maximum likelihood estimates, $\hat{\mu} = \overline{Y}$ and $\hat{\sigma}^2 = V = (n - 1)S^2/n$, are used to evaluate $p_i$ and $Q_{k-1}$. In general, $Q_{k-1}$ is not minimized by maximum likelihood estimates, and thus its computed value is somewhat greater than it would be if minimum chi-square estimates were used. Hence, when comparing it to a critical value listed in the chi-square table with $k - 3$ degrees of freedom, there is a greater chance of rejection than there would be if the actual minimum of $Q_{k-1}$ were used. Accordingly, the approximate significance level of such a test is somewhat higher than the value found in the table. This modification should be kept in mind and, if at all possible, each $p_i$ should be estimated using the frequencies $X_1, \ldots, X_k$ rather than directly using the observations $Y_1, Y_2, \ldots, Y_n$ of the random sample.

Example 7.3. In this example, we consider two multinomial distributions with parameters $n_j, p_{1j}, p_{2j}, \ldots, p_{kj}$, $j = 1, 2$, respectively. Let $X_{ij}$, $i = 1, 2, \ldots, k$, $j = 1, 2$, represent the corresponding frequencies. If $n_1$ and $n_2$ are large and the


observations from one distribution are independent of those from the other, the random variable

$$\sum_{j=1}^{2} \sum_{i=1}^{k} \frac{(X_{ij} - n_j p_{ij})^2}{n_j p_{ij}}$$

is the sum of two independent random variables, each of which we treat as though it were $\chi^2(k - 1)$; that is, the random variable is approximately $\chi^2(2k - 2)$. Consider the hypothesis

$$H_0 : p_{11} = p_{12},\ p_{21} = p_{22},\ \ldots,\ p_{k1} = p_{k2},$$

where each $p_{i1} = p_{i2}$, $i = 1, 2, \ldots, k$, is unspecified. Thus we need point estimates of these parameters. The maximum likelihood estimator of $p_{i1} = p_{i2}$, based upon the frequencies $X_{ij}$, is $(X_{i1} + X_{i2})/(n_1 + n_2)$, $i = 1, 2, \ldots, k$. Note that we need only $k - 1$ point estimates, because we have a point estimate of $p_{k1} = p_{k2}$ once we have point estimates of the first $k - 1$ probabilities. In accordance with the fact that has been stated, the random variable

$$\sum_{j=1}^{2} \sum_{i=1}^{k} \frac{\{X_{ij} - n_j[(X_{i1} + X_{i2})/(n_1 + n_2)]\}^2}{n_j[(X_{i1} + X_{i2})/(n_1 + n_2)]}$$

has an approximate $\chi^2$ distribution with $2k - 2 - (k - 1) = k - 1$ degrees of freedom. Thus we are able to test the hypothesis that two multinomial distributions are the same; this hypothesis is rejected when the computed value of this random variable is at least as great as an appropriate number from Table II of Appendix: Tables of Distributions, with $k - 1$ degrees of freedom. This test is often called the chi-square test for homogeneity (the null hypothesis is equivalent to homogeneous distributions).

The second example deals with the subject of contingency tables.

Example 7.4. Let the result of a random experiment be classified by two attributes (such as the color of the hair and the color of the eyes). That is, one attribute of the outcome is one and only one of certain mutually exclusive and exhaustive events, say $A_1, A_2, \ldots, A_a$; and the other attribute of the outcome is also one and only one of certain mutually exclusive and exhaustive events, say $B_1, B_2, \ldots, B_b$. Let $p_{ij} = P(A_i \cap B_j)$, $i = 1, 2, \ldots, a$; $j = 1, 2, \ldots, b$. The random experiment is repeated $n$ independent times and $X_{ij}$ denotes the frequency of the event $A_i \cap B_j$. Since there are $k = ab$ such events as $A_i \cap B_j$, the random variable

$$Q_{ab-1} = \sum_{j=1}^{b} \sum_{i=1}^{a} \frac{(X_{ij} - np_{ij})^2}{np_{ij}}$$

has an approximate chi-square distribution with $ab - 1$ degrees of freedom, provided that $n$ is large. Suppose that we wish to test the independence of the $A$ and the $B$ attributes, i.e., the hypothesis $H_0 : P(A_i \cap B_j) = P(A_i)P(B_j)$, $i = 1, 2, \ldots, a$; $j = 1, 2, \ldots, b$. Let us denote $P(A_i)$ by $p_{i\cdot}$ and $P(B_j)$ by $p_{\cdot j}$. It follows that

$$p_{i\cdot} = \sum_{j=1}^{b} p_{ij}, \quad p_{\cdot j} = \sum_{i=1}^{a} p_{ij}, \quad \text{and} \quad 1 = \sum_{j=1}^{b} \sum_{i=1}^{a} p_{ij} = \sum_{j=1}^{b} p_{\cdot j} = \sum_{i=1}^{a} p_{i\cdot}.$$

Then the hypothesis can be formulated as $H_0 : p_{ij} = p_{i\cdot}\, p_{\cdot j}$, $i = 1, 2, \ldots, a$; $j = 1, 2, \ldots, b$. To test $H_0$, we can use $Q_{ab-1}$ with $p_{ij}$ replaced by $p_{i\cdot}\, p_{\cdot j}$. But if $p_{i\cdot}$, $i = 1, 2, \ldots, a$, and $p_{\cdot j}$, $j = 1, 2, \ldots, b$, are unknown, as they frequently are in applications, we cannot compute $Q_{ab-1}$ once the frequencies are observed. In such a case, we estimate these unknown parameters by

$$\hat{p}_{i\cdot} = \frac{X_{i\cdot}}{n}, \quad \text{where } X_{i\cdot} = \sum_{j=1}^{b} X_{ij}, \text{ for } i = 1, 2, \ldots, a,$$

and

$$\hat{p}_{\cdot j} = \frac{X_{\cdot j}}{n}, \quad \text{where } X_{\cdot j} = \sum_{i=1}^{a} X_{ij}, \text{ for } j = 1, 2, \ldots, b.$$

Since $\sum_i p_{i\cdot} = \sum_j p_{\cdot j} = 1$, we have estimated only $a - 1 + b - 1 = a + b - 2$ parameters. So if these estimates are used in $Q_{ab-1}$, with $p_{ij} = p_{i\cdot}\, p_{\cdot j}$, then, according to the rule that has been stated in this section, the random variable

$$\sum_{j=1}^{b} \sum_{i=1}^{a} \frac{[X_{ij} - n(X_{i\cdot}/n)(X_{\cdot j}/n)]^2}{n(X_{i\cdot}/n)(X_{\cdot j}/n)}$$

has an approximate chi-square distribution with $ab - 1 - (a + b - 2) = (a - 1)(b - 1)$ degrees of freedom provided that $H_0$ is true. The hypothesis $H_0$ is then rejected if the computed value of this statistic exceeds the constant $c$, where $c$ is selected from Table II of Appendix: Tables of Distributions, so that the test has the desired significance level $\alpha$. This is the chi-square test for independence.

In each of the four examples of this section, we have indicated that the statistic used to test the hypothesis $H_0$ has an approximate chi-square distribution, provided that $n$ is sufficiently large and $H_0$ is true. To compute the power of any of these tests for values of the parameters not described by $H_0$, we need the distribution of the statistic when $H_0$ is not true. In each of these cases, the statistic has an approximate distribution called a noncentral chi-square distribution.

EXERCISES

7.1. A number is to be selected from the interval $\{x : 0 < x < 2\}$ by a random process. Let $A_i = \{x : (i - 1)/2 < x \le i/2\}$, $i = 1, 2, 3$, and let $A_4 = \{x : 3/2 < x < 2\}$. For $i = 1, 2, 3, 4$, suppose a certain hypothesis assigns probabilities $p_{i0}$ to these sets in accordance with $p_{i0} = \int_{A_i} \left(\tfrac{1}{2}\right)(2 - x)\, dx$, $i = 1, 2, 3, 4$. This hypothesis (concerning the multinomial pdf with $k = 4$) is to be tested at the 5% level of significance by a chi-square test. If the observed frequencies of the sets $A_i$, $i = 1, 2, 3, 4$, are, respectively, 30, 30, 10, 10, would $H_0$ be accepted at the (approximate) 5% level of significance?


7.2. Define the sets $A_1 = \{x : -\infty < x \le 0\}$, $A_i = \{x : i - 2 < x \le i - 1\}$, $i = 2, \ldots, 7$, and $A_8 = \{x : 6 < x < \infty\}$. A certain hypothesis assigns probabilities $p_{i0}$ to these sets $A_i$ in accordance with

$$p_{i0} = \int_{A_i} \frac{1}{2\sqrt{2\pi}} \exp\left[-\frac{(x - 3)^2}{2(4)}\right] dx, \quad i = 1, 2, \ldots, 7, 8.$$

This hypothesis (concerning the multinomial pdf with $k = 8$) is to be tested, at the 5% level of significance, by a chi-square test. If the observed frequencies of the sets $A_i$, $i = 1, 2, \ldots, 8$, are, respectively, 60, 96, 140, 210, 172, 160, 88, and 74, would $H_0$ be accepted at the (approximate) 5% level of significance?

7.3. A die was cast $n = 120$ independent times and the following data resulted:

    Spots Up     1     2     3     4     5     6
    Frequency    b    20    20    20    20    40 − b

If we use a chi-square test, for what values of $b$ would the hypothesis that the die is unbiased be rejected at the 0.025 significance level?

7.4. Consider the problem from genetics of crossing two types of peas. The Mendelian theory states that the probabilities of the classifications (a) round and yellow, (b) wrinkled and yellow, (c) round and green, and (d) wrinkled and green are $\frac{9}{16}$, $\frac{3}{16}$, $\frac{3}{16}$, and $\frac{1}{16}$, respectively. If, from 160 independent observations, the observed frequencies of these respective classifications are 86, 35, 26, and 13, are these data consistent with the Mendelian theory? That is, test, with $\alpha = 0.01$, the hypothesis that the respective probabilities are $\frac{9}{16}$, $\frac{3}{16}$, $\frac{3}{16}$, and $\frac{1}{16}$.

7.5. Two different teaching procedures were used on two different groups of students. Each group contained 100 students of about the same ability. At the end of the term, an evaluating team assigned a letter grade to each student. The results were tabulated as follows.

                     Grade
    Group     A     B     C     D     F     Total
    I        15    25    32    17    11      100
    II        9    18    29    28    16      100

If we consider these data to be independent observations from two respective multinomial distributions with $k = 5$, test at the 5% significance level the hypothesis that the two distributions are the same (and hence the two teaching procedures are equally effective).

7.6. Let the result of a random experiment be classified as one of the mutually exclusive and exhaustive ways $A_1, A_2, A_3$ and also as one of the mutually exclusive and exhaustive ways $B_1, B_2, B_3, B_4$. Two hundred independent trials of the experiment result in the following data:

            B1    B2    B3    B4
    A1      10    21    15     6
    A2      11    27    21    13
    A3       6    19    27    24

Test, at the 0.05 significance level, the hypothesis of independence of the $A$ attribute and the $B$ attribute, namely, $H_0 : P(A_i \cap B_j) = P(A_i)P(B_j)$, $i = 1, 2, 3$ and $j = 1, 2, 3, 4$, against the alternative of dependence.

7.7. A certain genetic model suggests that the probabilities of a particular trinomial distribution are, respectively, $p_1 = p^2$, $p_2 = 2p(1 - p)$, and $p_3 = (1 - p)^2$, where $0 < p < 1$. If $X_1, X_2, X_3$ represent the respective frequencies in $n$ independent trials, explain how we could check on the adequacy of the genetic model.

7.8. Let the result of a random experiment be classified as one of the mutually exclusive and exhaustive ways $A_1, A_2, A_3$ and also as one of the mutually exclusive and exhaustive ways $B_1, B_2, B_3, B_4$. Say that 180 independent trials of the experiment result in the following frequencies:

            B1          B2         B3         B4
    A1      15 − 3k     15 − k     15 + k     15 + 3k
    A2      15          15         15         15
    A3      15 + 3k     15 + k     15 − k     15 − 3k

where $k$ is one of the integers 0, 1, 2, 3, 4, 5. What is the smallest value of $k$ that leads to the rejection of the independence of the $A$ attribute and the $B$ attribute at the $\alpha = 0.05$ significance level?

7.9. It is proposed to fit the Poisson distribution to the following data:

    x            0     1     2     3
    Frequency   20    40    16    18

3 1. Using this fact and the generation of exponential (λ) variates discussed above, the following algorithm generates a realization of X (assume that the uniforms generated are independent of one another).


1. Set $X = 0$ and $T = 0$.
2. Generate $U$ uniform (0, 1) and let $Y = -(1/\lambda) \log(1 - U)$.
3. Set $T = T + Y$.
4. If $T > 1$, output $X$; else set $X = X + 1$ and go to step 2.
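The four steps above can be sketched in Python as follows. This is only an illustrative translation, not the poisrand routine of the appendix; the function names `poisson_variate` and `poisson_sample` and the `seed` argument are assumptions introduced here.

```python
import math
import random


def poisson_variate(lam, rng):
    """Return one Poisson(lam) realization by counting exponential waiting times."""
    x = 0      # step 1: number of events counted so far
    t = 0.0    # step 1: running sum of waiting times
    while True:
        u = rng.random()                       # step 2: U ~ uniform(0, 1)
        y = -(1.0 / lam) * math.log(1.0 - u)   # step 2: exponential waiting time
        t += y                                 # step 3
        if t > 1.0:                            # step 4: the unit interval is exhausted
            return x
        x += 1


def poisson_sample(n, lam, seed=None):
    """Return a list of n independent Poisson(lam) realizations (hypothetical helper)."""
    rng = random.Random(seed)
    return [poisson_variate(lam, rng) for _ in range(n)]
```

For instance, `sum(poisson_sample(1000, 5)) / 1000` gives a sample average that should lie near $\lambda = 5$, although the exact value changes from run to run unless `seed` is fixed.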

The program poisrand found in Appendix: R Functions provides an R coding of this algorithm for generating $n$ simulations of a Poisson distribution with parameter $\lambda$. As an illustration, we obtained 1000 realizations from a Poisson distribution with $\lambda = 5$ by running R with the command temp = poisrand(1000,5). This stores the realizations in the vector temp. The sample average of these realizations is found by the command mean(temp). In our case, the realized mean was 4.895.

Example 8.3 (Monte Carlo Integration). Suppose we want to obtain the integral $\int_a^b g(x)\, dx$ for a continuous function $g$ over the closed and bounded interval $[a, b]$. If an antiderivative of $g$ cannot be obtained in closed form, then numerical integration is in order. A simple numerical technique is the method of Monte Carlo. We can write the integral as

$$\int_a^b g(x)\, dx = (b - a) \int_a^b g(x) \frac{1}{b - a}\, dx = (b - a)E[g(X)],$$

where $X$ has the uniform $(a, b)$ distribution. The Monte Carlo technique is then to generate a random sample $X_1, \ldots, X_n$ of size $n$ from the uniform $(a, b)$ distribution and compute $Y_i = (b - a)g(X_i)$. Then $\overline{Y}$ is an unbiased estimator of $\int_a^b g(x)\, dx$.

Example 8.4 (Estimation of π by Monte Carlo Integration). For a numerical example, reconsider the estimation of $\pi$. Instead of the experiment described in Example 8.1, we use the method of Monte Carlo integration. Let $g(x) = 4\sqrt{1 - x^2}$ for $0 < x < 1$. Then

$$\pi = \int_0^1 g(x)\, dx = E[g(X)],$$

where $X$ has the uniform (0, 1) distribution. Hence we need to generate a random sample $X_1, \ldots, X_n$ from the uniform (0, 1) distribution and form $Y_i = 4\sqrt{1 - X_i^2}$. Then $\overline{Y}$ is an unbiased estimator of $\pi$. Note that $\overline{Y}$ is estimating a mean, so the large sample confidence interval (2.6) derived in Example 2.2 for means can be used to estimate the error of estimation. Recall that this 95% confidence interval is given by

$$(\overline{y} - 1.96\, s/\sqrt{n},\ \overline{y} + 1.96\, s/\sqrt{n}),$$

where $s$ is the value of the sample standard deviation. The table below gives the results for estimates of $\pi$ for various runs of different sample sizes along with the confidence intervals.

    n                    100         1000        10,000      100,000
    ȳ                    3.217849    3.103322    3.135465    3.142066
    ȳ − 1.96(s/√n)       3.054664    3.046330    3.118080    3.136535
    ȳ + 1.96(s/√n)       3.381034    3.160314    3.152850    3.147597
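A computation of the kind summarized in the table can be sketched in Python. This is a minimal illustration under stated assumptions, not the piest2 routine of the appendix; the function name `mc_pi` and its `seed` argument are hypothetical.

```python
import math
import random


def mc_pi(n, seed=None):
    """Monte Carlo estimate of pi with an approximate 95% confidence interval.

    Uses Y_i = 4 * sqrt(1 - X_i^2) with X_i ~ uniform(0, 1), as in Example 8.4.
    Returns (estimate, lower endpoint, upper endpoint).
    """
    rng = random.Random(seed)
    ys = [4.0 * math.sqrt(1.0 - rng.random() ** 2) for _ in range(n)]
    ybar = sum(ys) / n
    # Sample standard deviation with divisor n - 1.
    s = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    half_width = 1.96 * s / math.sqrt(n)
    return ybar, ybar - half_width, ybar + half_width
```

Calling `mc_pi(10000)` returns one estimate together with its interval; as in the table, the interval narrows at the rate $1/\sqrt{n}$, so a hundredfold increase in $n$ shrinks the half-width by about a factor of ten.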


Note that for each experiment the confidence interval trapped $\pi$. See Appendix: R Functions, piest2, for the actual code used for the computation.

Numerical integration techniques have made great strides over the last 20 years. But the simplicity of integration by Monte Carlo still makes it a powerful technique.

As Theorem 8.1 shows, if we can obtain $F_X^{-1}(u)$ in closed form, then we can easily generate observations with cdf $F_X$. In many cases where this is not possible, techniques have been developed to generate observations. Note that the normal distribution serves as an example of such a case, and, in the next example, we show how to generate normal observations. In Section 8.1, we discuss an algorithm which can be adapted for many of these cases.

Example 8.5 (Generating Normal Observations). To simulate normal variables, Box and Muller (1958) suggested the following procedure. Let $Y_1, Y_2$ be a random sample from the uniform distribution over $0 < y < 1$. Define $X_1$ and $X_2$ by

$$X_1 = (-2 \log Y_1)^{1/2} \cos(2\pi Y_2),$$
$$X_2 = (-2 \log Y_1)^{1/2} \sin(2\pi Y_2).$$

This transformation is one-to-one and maps $\{(y_1, y_2) : 0 < y_1 < 1,\ 0 < y_2 < 1\}$ onto $\{(x_1, x_2) : -\infty < x_1 < \infty,\ -\infty < x_2 < \infty\}$ except for sets involving $x_1 = 0$ and $x_2 = 0$, which have probability zero. The inverse transformation is given by

$$y_1 = \exp\left\{-\frac{x_1^2 + x_2^2}{2}\right\}, \qquad y_2 = \frac{1}{2\pi} \arctan\frac{x_2}{x_1}.$$

This has the Jacobian

$$J = \begin{vmatrix} (-x_1)\exp\left\{-\dfrac{x_1^2 + x_2^2}{2}\right\} & (-x_2)\exp\left\{-\dfrac{x_1^2 + x_2^2}{2}\right\} \\[2ex] \dfrac{-x_2/x_1^2}{(2\pi)(1 + x_2^2/x_1^2)} & \dfrac{1/x_1}{(2\pi)(1 + x_2^2/x_1^2)} \end{vmatrix} = \frac{-(1 + x_2^2/x_1^2)\exp\left\{-\dfrac{x_1^2 + x_2^2}{2}\right\}}{(2\pi)(1 + x_2^2/x_1^2)} = \frac{-\exp\left\{-\dfrac{x_1^2 + x_2^2}{2}\right\}}{2\pi}.$$

Since the joint pdf of $Y_1$ and $Y_2$ is 1 on $0 < y_1 < 1$, $0 < y_2 < 1$, and zero elsewhere, the joint pdf of $X_1$ and $X_2$ is

$$\frac{\exp\left\{-\dfrac{x_1^2 + x_2^2}{2}\right\}}{2\pi}, \quad -\infty < x_1 < \infty,\ -\infty < x_2 < \infty.$$

That is, $X_1$ and $X_2$ are independent, standard normal random variables. One of the most commonly used normal generators is a variant of the above procedure called the Marsaglia and Bray (1964) algorithm; see Exercise 8.21.
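The Box and Muller transformation of Example 8.5 can be sketched in Python as below. The function name `box_muller` is an assumption made for illustration. One small departure from the text is labeled in the code: the first uniform is taken as `1 - random()` so that it lies in $(0, 1]$ and the logarithm is always defined.

```python
import math
import random


def box_muller(rng):
    """Return one pair (x1, x2) of independent N(0, 1) realizations."""
    # Assumption for numerical safety: rng.random() is in [0, 1), so 1 - rng.random()
    # is in (0, 1] and log(y1) never sees zero.
    y1 = 1.0 - rng.random()
    y2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(y1))   # (-2 log Y1)^(1/2)
    angle = 2.0 * math.pi * y2                # 2 pi Y2
    return radius * math.cos(angle), radius * math.sin(angle)
```

Each call consumes two uniforms and delivers two normal variates, so a sample of size $n$ needs only about $n$ uniform draws in total.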


Observations from a contaminated normal distribution can easily be generated using a normal generator and a uniform generator. We close this section by estimating via Monte Carlo the significance level of a t-test when the underlying distribution is a contaminated normal.

Example 8.6. Let $X$ be a random variable with mean $\mu$ and consider the hypotheses

$$H_0 : \mu = 0 \text{ versus } H_1 : \mu > 0. \tag{8.3}$$

Suppose we decide to base this test on a sample of size $n = 20$ from the distribution of $X$, using the t-test with rejection rule

$$\text{Reject } H_0 : \mu = 0 \text{ in favor of } H_1 : \mu > 0 \text{ if } t > t_{.05,19} = 1.729, \tag{8.4}$$

where $t = \overline{x}/(s/\sqrt{20})$ and $\overline{x}$ and $s$ are the sample mean and standard deviation, respectively. If $X$ has a normal distribution, then this test has level 0.05. But what if $X$ does not have a normal distribution? In particular, for this example, suppose $X$ has the contaminated normal distribution with $\epsilon = 0.25$ and $\sigma_c = 25$; that is, 75% of the time an observation is generated by a standard normal distribution, while 25% of the time it is generated by a normal distribution with mean 0 and standard deviation 25. Hence the mean of $X$ is 0, so $H_0$ is true. To obtain the exact significance level of the test would be quite complicated. We would have to obtain the distribution of $t$ when $X$ has this contaminated normal distribution. As an alternative, we estimate the level (and the error of estimation) by simulation. Let $N$ be the number of simulations. The following algorithm gives the steps of our simulation:

1. Set $k = 1$, $I = 0$.
2. Simulate a random sample of size 20 from the distribution of $X$.
3. Based on this sample, compute the test statistic $t$.
4. If $t > 1.729$, increase $I$ by 1.
5. If $k = N$, go to step 6; else increase $k$ by 1 and go to step 2.
6. Compute $\hat{\alpha} = I/N$ and the approximate error $= 1.96\sqrt{\hat{\alpha}(1 - \hat{\alpha})/N}$.

Then $\hat{\alpha}$ is our simulated estimate of $\alpha$ and the half-width of a confidence interval for $\alpha$ serves as our estimate of the error of estimation. The routine empalphacn, found in Appendix: R Functions, provides R code for this algorithm. When we ran it for $N = 10{,}000$ we obtained the results:

    No. Simulat.    Empirical α̂    Error     95% CI for α
    10,000          0.0412          0.0039    (0.0373, 0.0451)

Based on these results, the t-test appears to be slightly conservative when the sample is drawn from this contaminated normal distribution.
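The six simulation steps of Example 8.6 can be sketched in Python as follows. This is an illustrative rewrite, not the empalphacn routine of the appendix; the names `contaminated_normal` and `empirical_level`, their default arguments, and the `seed` parameter are assumptions.

```python
import math
import random


def contaminated_normal(rng, eps=0.25, sigma_c=25.0):
    """Draw from N(0, 1) with probability 1 - eps, else from N(0, sigma_c^2)."""
    sd = sigma_c if rng.random() < eps else 1.0
    return rng.gauss(0.0, sd)


def empirical_level(n_sims, n=20, crit=1.729, seed=None):
    """Estimate the level of the one-sided t-test and its error of estimation."""
    rng = random.Random(seed)
    rejections = 0                                   # the counter I of step 1
    for _ in range(n_sims):                          # steps 2-5
        sample = [contaminated_normal(rng) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        t = xbar / (s / math.sqrt(n))
        if t > crit:
            rejections += 1
    alpha_hat = rejections / n_sims                  # step 6
    error = 1.96 * math.sqrt(alpha_hat * (1.0 - alpha_hat) / n_sims)
    return alpha_hat, error
```

A call such as `empirical_level(10000)` returns the pair $(\hat{\alpha}, \text{error})$; the numbers differ from run to run and need not reproduce the table above exactly.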


8.1 Accept–Reject Generation Algorithm

In this section, we develop the accept–reject procedure that can often be used to simulate random variables whose inverse cdf cannot be obtained in closed form. Let $X$ be a continuous random variable with pdf $f(x)$. For this discussion, we call this pdf the target pdf. Suppose it is relatively easy to generate an observation of the random variable $Y$ which has pdf $g(x)$ and that for some constant $M$ we have

$$f(x) \le M g(x), \quad -\infty < x < \infty. \tag{8.5}$$

We call $g(x)$ the instrumental pdf. For clarity, we write the accept–reject procedure as an algorithm:

Algorithm 8.1 (Accept–Reject Algorithm). Let $f(x)$ be a pdf. Suppose that $Y$ is a random variable with pdf $g(y)$, $U$ is a random variable with a uniform(0, 1) distribution, $Y$ and $U$ are independent, and (8.5) holds. The following algorithm generates a random variable $X$ with pdf $f(x)$.

1. Generate $Y$ and $U$.
2. If $U \le \dfrac{f(Y)}{M g(Y)}$, then take $X = Y$. Otherwise return to step 1.
3. $X$ has pdf $f(x)$.

Proof of the validity of the algorithm: Let $-\infty < x < \infty$. Then

$$\begin{aligned}
P[X \le x] &= P\left[Y \le x \,\Big|\, U \le \frac{f(Y)}{M g(Y)}\right] \\
&= \frac{P\left[Y \le x,\ U \le \dfrac{f(Y)}{M g(Y)}\right]}{P\left[U \le \dfrac{f(Y)}{M g(Y)}\right]} \\
&= \frac{\displaystyle\int_{-\infty}^{x} \left[\int_0^{f(y)/M g(y)} du\right] g(y)\, dy}{\displaystyle\int_{-\infty}^{\infty} \left[\int_0^{f(y)/M g(y)} du\right] g(y)\, dy} \\
&= \frac{\displaystyle\int_{-\infty}^{x} \frac{f(y)}{M g(y)}\, g(y)\, dy}{\displaystyle\int_{-\infty}^{\infty} \frac{f(y)}{M g(y)}\, g(y)\, dy} && (8.6) \\
&= \int_{-\infty}^{x} f(y)\, dy. && (8.7)
\end{aligned}$$
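Steps 1 and 2 of Algorithm 8.1 can be sketched generically in Python. Everything named here is an assumption made for illustration: the function `accept_reject`, and the toy target $f(x) = 6x(1 - x)$ on $(0, 1)$ with a uniform(0, 1) instrumental pdf, for which $f(x) \le 1.5\, g(x)$ so that $M = 1.5$. This toy pair does not appear in the text.

```python
import random


def accept_reject(f, g_pdf, g_draw, M, rng):
    """Return one realization with pdf f, given f(x) <= M * g_pdf(x) for all x."""
    while True:
        y = g_draw(rng)      # step 1: Y from the instrumental pdf
        u = rng.random()     # step 1: U ~ uniform(0, 1), independent of Y
        if u <= f(y) / (M * g_pdf(y)):   # step 2: accept, otherwise try again
            return y


# Hypothetical illustration: target f(x) = 6x(1 - x) on (0, 1), maximum 1.5 at x = 1/2.
def target_pdf(x):
    return 6.0 * x * (1.0 - x) if 0.0 < x < 1.0 else 0.0


def uniform_pdf(x):
    return 1.0 if 0.0 <= x < 1.0 else 0.0


def uniform_draw(rng):
    return rng.random()
```

With these definitions, `accept_reject(target_pdf, uniform_pdf, uniform_draw, 1.5, random.Random())` delivers one observation. Because both pdfs are properly normed here, each pass through the loop accepts with probability $1/M = 2/3$.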

Hence, by differentiating both sides of (8.7), we find that the pdf of $X$ is $f(x)$.

As Exercise 8.14 shows, from step (8.6) of the proof, we can ignore normalizing constants of the two pdfs $f(x)$ and $g(x)$. For example, if $f(x) = kh(x)$ and $g(x) = ct(x)$ for constants $c$ and $k$, then we can use the rule

$$h(x) \le M_2\, t(x), \quad -\infty < x < \infty, \tag{8.8}$$

and change the ratio in step 2 of the algorithm to $U \le h(Y)/[M_2\, t(Y)]$. This often simplifies the use of the accept–reject algorithm.

As an example of the accept–reject algorithm, consider simulating a $\Gamma(\alpha, \beta)$ distribution. There are several approaches to generating gamma observations; see, for instance, Kennedy and Gentle (1980). We present the approach discussed in Robert and Casella (1999). Recall if $X$ has a $\Gamma(\alpha, 1)$ distribution, then the random variable $\beta X$ has a $\Gamma(\alpha, \beta)$ distribution. So without loss of generality, we can assume that $\beta = 1$. If $\alpha$ is an integer, $X = \sum_{i=1}^{\alpha} Y_i$, where the $Y_i$s are iid $\Gamma(1, 1)$. In this case, by expression (8.2), we see that the inverse cdf of $Y_i$ is easily written in closed form, and, hence, $X$ is easy to generate. Thus the only remaining case is when $\alpha$ is not an integer.

Assume then that $X$ has a $\Gamma(\alpha, 1)$ distribution, where $\alpha$ is not an integer. Let $Y$ have a $\Gamma([\alpha], 1/b)$ distribution, where $b < 1$ is chosen later and, as usual, $[\alpha]$ means the greatest integer less than or equal to $\alpha$. To establish rule (8.8), consider the ratio, with $h(x)$ and $t(x)$ proportional to the pdfs of $X$ and $Y$, respectively, given by

$$\frac{h(x)}{t(x)} = b^{-[\alpha]} x^{\alpha - [\alpha]} e^{-(1-b)x}, \tag{8.9}$$

where we have ignored some of the normalizing constants. We next determine the constant $b$. As Exercise 8.15 shows, the derivative of expression (8.9) is

$$\frac{d}{dx}\, b^{-[\alpha]} x^{\alpha - [\alpha]} e^{-(1-b)x} = b^{-[\alpha]} e^{-(1-b)x} \left[(\alpha - [\alpha]) - x(1 - b)\right] x^{\alpha - [\alpha] - 1}, \tag{8.10}$$

which has a critical value at $x = (\alpha - [\alpha])/(1 - b)$, where the ratio attains its maximum. Hence, using the maximum of $h(x)/t(x)$,

$$\frac{h(x)}{t(x)} \le b^{-[\alpha]} \left[\frac{\alpha - [\alpha]}{(1 - b)e}\right]^{\alpha - [\alpha]}. \tag{8.11}$$

Now, we need to find our choice of $b$. Differentiating the right side of this inequality with respect to $b$, we get, as Exercise 8.16 shows,

$$\frac{d}{db}\, b^{-[\alpha]} (1 - b)^{[\alpha] - \alpha} = -b^{-[\alpha]} (1 - b)^{[\alpha] - \alpha} \left[\frac{[\alpha] - \alpha b}{b(1 - b)}\right], \tag{8.12}$$

which has a critical value at $b = [\alpha]/\alpha < 1$. As shown in that exercise, this value of $b$ provides a minimum of the right side of expression (8.11). Thus, if we take $b = [\alpha]/\alpha < 1$, then inequality (8.11) holds and it is the tightest inequality possible. The final value of $M$ is the right side of expression (8.11) evaluated at $b = [\alpha]/\alpha < 1$.

The following example offers a simpler derivation for a normal generator where the instrumental pdf is the pdf of a Cauchy random variable.

Example 8.7. Suppose that $X$ is a normally distributed random variable with pdf $\phi(x) = (2\pi)^{-1/2} \exp\{-x^2/2\}$ and $Y$ has a Cauchy distribution with pdf $g(x) =$


$\pi^{-1}(1 + x^2)^{-1}$. As Exercise 8.9 shows, the Cauchy distribution is easy to simulate because its inverse cdf is a known function. Ignoring normalizing constants, the ratio to bound is

$$\frac{f(x)}{g(x)} \propto (1 + x^2) \exp\{-x^2/2\}, \quad -\infty < x < \infty. \tag{8.13}$$

As Exercise 8.17 shows, the derivative of this ratio is $-x \exp\{-x^2/2\}(x^2 - 1)$, which vanishes at $x = 0$ and $x = \pm 1$. The values $\pm 1$ provide maxima to (8.13). Hence,

$$(1 + x^2) \exp\{-x^2/2\} \le 2 \exp\{-1/2\} = 1.213,$$

so $M = 1.213$.

One result of the proof of Algorithm 8.1 is that the probability of acceptance in the algorithm is $M^{-1}$. This follows immediately from the denominator factor in step (8.6) of the proof. Note, however, that this holds only for properly normed pdfs. For instance, in the last example, the maximum value of the ratio of properly normed pdfs is

$$\frac{\pi}{\sqrt{2\pi}}\, 2 \exp\{-1/2\} = 1.52.$$

Hence, $1/M = 1.52^{-1} = 0.66$. Therefore, the probability that the algorithm accepts is 0.66.

EXERCISES

8.1. Prove the converse of Theorem 8.1. That is, let $X$ be a random variable with a continuous cdf $F(x)$. Assume that $F(x)$ is strictly increasing on the space of $X$. Consider the random variable $Z = F(X)$. Show that $Z$ has a uniform distribution on the interval (0, 1).

8.2. Recall that $\log 2 = \int_0^1 \frac{1}{x + 1}\, dx$. Hence, by using a uniform(0, 1) generator, approximate $\log 2$. Obtain an error of estimation in terms of a large sample 95% confidence interval. If you have access to the statistical package R, write an R function for the estimate and the error of estimation. Obtain your estimate for 10,000 simulations and compare it to the true value.

8.3. Similar to Exercise 8.2 but now approximate $\int_0^{1.96} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2} t^2\right\} dt$.

8.4. Suppose $X$ is a random variable with the pdf $f_X(x) = b^{-1} f((x - a)/b)$, where $b > 0$. Suppose we can generate observations from $f(z)$. Explain how we can generate observations from $f_X(x)$.

8.5. Determine a method to generate random observations for the logistic pdf, (4.9). If access is available, write an R function which returns a random sample of observations from a logistic distribution.


8.6. Determine a method to generate random observations for the following pdf:

$$f(x) = \begin{cases} 4x^3 & 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

If access is available, write an R function which returns a random sample of observations from this pdf.

8.7. Determine a method to generate random observations for the Laplace pdf, (4.10). If access is available, write an R function which returns a random sample of observations from a Laplace distribution.

8.8. Determine a method to generate random observations for the extreme-valued pdf which is given by

$$f(x) = \exp\{x - e^x\}, \quad -\infty < x < \infty. \tag{8.14}$$

If access is available, write an R function which returns a random sample of observations from an extreme-valued distribution.

8.9. Determine a method to generate random observations for the Cauchy distribution with pdf

$$f(x) = \frac{1}{\pi(1 + x^2)}, \quad -\infty < x < \infty. \tag{8.15}$$

If access is available, write an R function which returns a random sample of observations from a Cauchy distribution.

8.10. Suppose we are interested in a particular Weibull distribution with pdf 1 2 −x3 /θ 3 0 0, consider the following accept–reject algorithm:

1. Generate $U_1$ and $U_2$ iid uniform(0, 1) random variables. Set $V_1 = U_1^{1/\alpha}$ and $V_2 = U_2^{1/\beta}$.
2. Set $W = V_1 + V_2$. If $W \le 1$, set $X = V_1/W$; else go to step 1.
3. Deliver $X$.

Show that $X$ has a beta distribution with parameters $\alpha$ and $\beta$. See Kennedy and Gentle (1980).

8.21. Consider the following algorithm:

1. Generate $U$ and $V$ independent uniform $(-1, 1)$ random variables.
2. Set $W = U^2 + V^2$.
3. If $W > 1$ go to step 1.
4. Set $Z = \sqrt{(-2 \log W)/W}$ and let $X_1 = UZ$ and $X_2 = VZ$.

Show that the random variables $X_1$ and $X_2$ are iid with a common $N(0, 1)$ distribution. This algorithm was proposed by Marsaglia and Bray (1964).


9 Bootstrap Procedures

In the last section, we introduced the method of Monte Carlo and discussed several of its applications. In the last few years, however, Monte Carlo procedures have become increasingly used in statistical inference. In this section, we present the bootstrap, one of these procedures. We concentrate on confidence intervals and tests for one- and two-sample problems in this section.

9.1 Percentile Bootstrap Confidence Intervals

Let $X$ be a random variable of the continuous type with pdf $f(x; \theta)$, for $\theta \in \Omega$. Suppose $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ is a random sample on $X$ and $\hat{\theta} = \hat{\theta}(\mathbf{X})$ is a point estimator of $\theta$. The vector notation, $\mathbf{X}$, proves useful in this section. In Sections 2 and 3, we discussed the problem of obtaining confidence intervals for $\theta$ in certain situations. In this section, we discuss a general method called the percentile bootstrap procedure, which is a resampling procedure. It was proposed by Efron (1979). Informative discussions of such procedures can be found in the books by Efron and Tibshirani (1993) and Davison and Hinkley (1997).

To motivate the procedure, suppose for the moment that

$$\hat{\theta} \text{ has a } N(\theta, \sigma_{\hat{\theta}}^2) \text{ distribution.} \tag{9.1}$$

Then as in Section 2, a $(1 - \alpha)100\%$ confidence interval for $\theta$ is $(\hat{\theta}_L, \hat{\theta}_U)$, where

$$\hat{\theta}_L = \hat{\theta} - z^{(1-\alpha/2)} \sigma_{\hat{\theta}} \quad \text{and} \quad \hat{\theta}_U = \hat{\theta} - z^{(\alpha/2)} \sigma_{\hat{\theta}}, \tag{9.2}$$

and $z^{(\gamma)}$ denotes the $\gamma 100$th percentile of a standard normal random variable; i.e., $z^{(\gamma)} = \Phi^{-1}(\gamma)$, where $\Phi$ is the cdf of a $N(0, 1)$ random variable (see also Exercise 9.4). We have gone to a superscript notation here to avoid confusion with the usual subscript notation on critical values.

Now suppose that $\hat{\theta}$ and $\sigma_{\hat{\theta}}$ are realizations from the sample and $\hat{\theta}_L$ and $\hat{\theta}_U$ are calculated as in (9.2). Next suppose that $\hat{\theta}^*$ is a random variable with a $N(\hat{\theta}, \sigma_{\hat{\theta}}^2)$ distribution. Then, by (9.2),

$$P(\hat{\theta}^* \le \hat{\theta}_L) = P\left(\frac{\hat{\theta}^* - \hat{\theta}}{\sigma_{\hat{\theta}}} \le -z^{(1-\alpha/2)}\right) = \alpha/2. \tag{9.3}$$

Likewise, $P(\hat{\theta}^* \le \hat{\theta}_U) = 1 - (\alpha/2)$. Therefore, $\hat{\theta}_L$ and $\hat{\theta}_U$ are the $\frac{\alpha}{2}100$th and $(1 - \frac{\alpha}{2})100$th percentiles of the distribution of $\hat{\theta}^*$. That is, the percentiles of the $N(\hat{\theta}, \sigma_{\hat{\theta}}^2)$ distribution form the $(1 - \alpha)100\%$ confidence interval for $\theta$.

We want our final procedure to be quite general, so the normality assumption (9.1) is definitely not desired and, in Remark 9.1, we do show that this assumption is not necessary. So, in general, let $H(t)$ denote the cdf of $\hat{\theta}$. In practice, though, we do not know the function $H(t)$. Hence the above confidence interval defined by statement (9.3) cannot be obtained. But suppose we


could take an infinite number of samples $\mathbf{X}_1, \mathbf{X}_2, \ldots$; obtain $\hat{\theta}^* = \hat{\theta}(\mathbf{X}^*)$ for each sample $\mathbf{X}^*$; and then form the histogram of these estimates $\hat{\theta}^*$. The percentiles of this histogram would be the confidence interval defined by expression (9.3). Since we only have one sample, this is impossible. It is, however, the idea behind bootstrap procedures.

Bootstrap procedures simply resample from the empirical distribution defined by the one sample. The sampling is done at random and with replacement and the resamples are all of size $n$, the size of the original sample. That is, suppose $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ denotes the realization of the sample. Let $\hat{F}_n$ denote the empirical distribution function of the sample. Recall that $\hat{F}_n$ is a discrete cdf which puts mass $n^{-1}$ at each point $x_i$ and that $\hat{F}_n(x)$ is an estimator of $F(x)$. Then a bootstrap sample is a random sample, say $\mathbf{x}^* = (x_1^*, x_2^*, \ldots, x_n^*)$, drawn from $\hat{F}_n$. As Exercise 9.1 shows, $E(x_i^*) = \overline{x}$ and $V(x_i^*) = n^{-1} \sum_{i=1}^{n} (x_i - \overline{x})^2$. At first glance, resampling the sample seems like it would not work. But our only information on sampling variability is within the sample itself, and by resampling the sample we are simulating this variability.

We now give an algorithm which obtains a bootstrap confidence interval. For clarity, we present a formal algorithm, which can be readily coded into languages such as R. Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ be the realization of a random sample drawn from a cdf $F(x; \theta)$, $\theta \in \Omega$. Let $\hat{\theta}$ be a point estimator of $\theta$. Let $B$, an integer, denote the number of bootstrap replications, i.e., the number of resamples. In practice, $B$ is often 3000 or more.

1. Set $j = 1$.
2. While $j \le B$, do steps 2–5.
3. Let $\mathbf{x}_j^*$ be a random sample of size $n$ drawn from the sample $\mathbf{x}$. That is, the observations $\mathbf{x}_j^*$ are drawn at random from $x_1, x_2, \ldots, x_n$, with replacement.
4. Let $\hat{\theta}_j^* = \hat{\theta}(\mathbf{x}_j^*)$.
5. Replace $j$ by $j + 1$.
6. Let $\hat{\theta}_{(1)}^* \le \hat{\theta}_{(2)}^* \le \cdots \le \hat{\theta}_{(B)}^*$ denote the ordered values of $\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*$. Let $m = [(\alpha/2)B]$, where $[\cdot]$ denotes the greatest integer function. Form the interval

$$(\hat{\theta}_{(m)}^*,\ \hat{\theta}_{(B+1-m)}^*); \tag{9.4}$$

that is, obtain the $\frac{\alpha}{2}100\%$ and $(1 - \frac{\alpha}{2})100\%$ percentiles of the sampling distribution of $\hat{\theta}_1^*, \hat{\theta}_2^*, \ldots, \hat{\theta}_B^*$.

The interval in (9.4) is called the percentile bootstrap confidence interval for $\theta$. In step 6, the subscripted parenthetical notation is a common notation for order statistics (Section 4), which is handy in this section.
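The percentile bootstrap algorithm can be sketched in Python as follows. This is a minimal illustration, not the percentciboot.s program of the appendix; the function name `percentile_bootstrap_ci`, its defaults, and the `seed` argument are assumptions.

```python
import math
import random


def percentile_bootstrap_ci(x, estimator, B=3000, alpha=0.10, seed=None):
    """Return the percentile bootstrap interval (theta*_(m), theta*_(B+1-m))."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(B):                                       # steps 1, 2 and 5
        resample = [x[rng.randrange(n)] for _ in range(n)]   # step 3: with replacement
        estimates.append(estimator(resample))                # step 4
    estimates.sort()                                         # step 6: order statistics
    m = math.floor((alpha / 2.0) * B)
    if m < 1:
        raise ValueError("B is too small for the requested alpha")
    # The text indexes order statistics from 1; Python lists index from 0.
    return estimates[m - 1], estimates[B - m]
```

For instance, with the sample mean as the estimator, `percentile_bootstrap_ci(data, lambda v: sum(v) / len(v), B=3000, alpha=0.10)` returns the two endpoints of a 90% interval; they vary from run to run unless `seed` is fixed.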


Example 9.1. In this example, we sample from a known distribution, but, in practice, the distribution is usually unknown. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $\Gamma(1, \beta)$ distribution. Since the mean of this distribution is $\beta$, the sample average $\overline{X}$ is an unbiased estimator of $\beta$. In this example, $\overline{X}$ serves as our point estimator of $\beta$. The following 20 data points are the realizations (rounded) of a random sample of size $n = 20$ from a $\Gamma(1, 100)$ distribution:

    131.7   182.7    73.3    10.7   150.4    42.3    22.2    17.9   264.0   154.4
      4.3   265.6    61.9    10.8    48.8    22.5     8.8   150.6   103.0    85.9

The value of $\overline{X}$ for this sample is $\overline{x} = 90.59$, which is our point estimate of $\beta$. For illustration, we generated one bootstrap sample of these data. This ordered bootstrap sample is

      4.3     4.3     4.3    10.8    10.8    10.8    10.8    17.9    22.5    42.3
     48.8    48.8    85.9   131.7   131.7   150.4   154.4   154.4   264.0   265.6

As Exercise 9.1 shows, in general, the sample mean of a bootstrap sample is an unbiased estimator of the original sample mean $\overline{x}$. The sample mean of this particular bootstrap sample is $\overline{x}^* = 78.725$. We wrote an R function to generate bootstrap samples and the percentile confidence interval above; see the program percentciboot.s of Appendix: R Functions. Figure 9.1 displays a histogram of 3000 $\overline{x}^*$s for the above sample. The sample mean of these 3000 values is 90.13, close to $\overline{x} = 90.59$. Our program also obtained a 90% (bootstrap percentile) confidence interval given by (61.655, 120.48), which the reader can locate on the figure. It did trap $\mu = 100$.

Figure 9.1: Histogram of the 3000 bootstrap $\overline{x}^*$s (horizontal axis: $\overline{x}^*$; vertical axis: Frequency). The 90% bootstrap confidence interval is (61.655, 120.48).

Exercise 9.2 shows that if we are sampling from a $\Gamma(1, \beta)$ distribution, then the interval $(2n\overline{X}/[\chi^2_{2n}]^{(1-(\alpha/2))},\ 2n\overline{X}/[\chi^2_{2n}]^{(\alpha/2)})$ is an exact $(1 - \alpha)100\%$ confidence interval for $\beta$. Note that, in keeping with our superscript notation for critical


Some Elementary Statistical Inferences values, [χ22n ](γ) denotes the γ100% percentile of a χ2 distribution with 2n degrees of freedom. The 90% conﬁdence interval for our sample is (64.99, 136.69). Remark 9.1. Brieﬂy, we show that the normal assumption on the distribution of (9.1), is transparent to the argument. Suppose H is the cdf of θ and that H θ, depends on θ. Then, using Theorem 8.1, we can ﬁnd an increasing transformation is N (φ, σ 2 ), where φ = m(θ) φ = m(θ) such that the distribution of φ = m(θ) c 2 and σc is some variance. For example, take the transformation to be m(θ) = Fc−1 (H(θ)), where Fc (x) is the cdf of a N (φ, σc2 ) distribution. Then, as above, (φ − z (1−α/2) σc , φ − z (α/2) σc ) is a (1 − α)100% conﬁdence interval for φ. But note that # " 1 − α = P φ − z (1−α/2) σc < φ < φ − z (α/2) σc ) # " (9.5) = P m−1 (φ − z (1−α/2) σc ) < θ < m−1 (φ − z (α/2) σc ) . (1−α/2) σc ), m−1 (φ−z (α/2) σc )) is a (1−α)100% conﬁdence interval Hence, (m−1 (φ−z for θ. Now suppose H is the cdf H with a realization θ substituted in for θ, i.e., σ 2 ) distribution above. Suppose θ∗ is a random variable with analogous to the N (θ, θ and φ∗ = m(θ∗ ). We have Let φ = m(θ) cdf H. # " P θ∗ ≤ m−1 (φ − z (1−α/2) σc )

# " P φ∗ ≤ φ − z (1−α/2) σc φ∗ − φ (1−α/2) = α/2, ≤ −z = P σc

=

similar to (9.3). Therefore, m−1 (φ − z (1−α/2) σc ) is the α2 100th percentile of the Likewise, m−1 (φ − z (α/2) σc ) is the (1 − α )100th percentile of the cdf H. cdf H. 2 form the Therefore, in the general case too, the percentiles of the distribution of H conﬁdence interval for θ. What about the validity of a bootstrap conﬁdence interval? Davison and Hinkley (1997) discuss the theory behind the bootstrap in Chapter 2 of their book. Under some general conditions, they show that the bootstrap conﬁdence interval is asymptotically valid. One way of improving the bootstrap is to use a pivot random variable, a variable whose distribution is free of other parameters. in the last example, √ For instance, σX , where σ ˆX = S/ n and S = [ (Xi −X)2 /(n−1)]1/2 ; instead of using X, use X/ˆ that is, adjust X by its standard error. This is discussed in Exercise 9.5. Other improvements are discussed in the two books cited earlier.
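The percentile interval discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' percentciboot.s program: the function name `percentile_ci`, the seed, and the small data vector are hypothetical, and the quantile convention (take the $m$th and $(B+1-m)$th ordered bootstrap estimates with $m = [(\alpha/2)B]$) is one common choice.

```python
import random
import statistics


def percentile_ci(sample, estimator, alpha=0.10, B=3000, seed=0):
    """Bootstrap percentile confidence interval (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(sample)
    # Each bootstrap sample is drawn with replacement from the data.
    estimates = sorted(estimator(rng.choices(sample, k=n)) for _ in range(B))
    # Assumed convention: m = [(alpha/2) B]; use the m-th and (B+1-m)-th
    # ordered values (1-based), i.e. indices m-1 and B-m (0-based).
    m = max(int((alpha / 2) * B), 1)
    return estimates[m - 1], estimates[B - m]


# Hypothetical data, used only to exercise the function.
data = [12.0, 85.3, 41.7, 160.2, 98.4, 23.9, 210.5, 67.1, 133.8, 54.6]
lower, upper = percentile_ci(data, statistics.fmean)
print(f"90% percentile interval for the mean: ({lower:.2f}, {upper:.2f})")
```

Passing `statistics.median` instead of `statistics.fmean` gives the corresponding interval for the median, since the algorithm does not depend on the particular estimator.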

9.2 Bootstrap Testing Procedures

Bootstrap procedures can also be used eﬀectively in testing hypotheses. We begin by discussing these procedures for two-sample problems, which cover many of the nuances of the use of the bootstrap in testing.

Consider a two-sample location problem; that is, $\mathbf{X} = (X_1, X_2, \ldots, X_{n_1})$ is a random sample from a distribution with cdf $F(x)$ and $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_{n_2})$ is a random sample from a distribution with the cdf $F(x - \Delta)$, where $\Delta \in R$. The parameter $\Delta$ is the shift in locations between the two samples. Hence $\Delta$ can be written as the difference in location parameters. In particular, assuming that the means $\mu_Y$ and $\mu_X$ exist, we have $\Delta = \mu_Y - \mu_X$. We consider the one-sided hypotheses given by

$$H_0: \Delta = 0 \ \text{ versus } \ H_1: \Delta > 0. \qquad (9.6)$$

As our test statistic, we take the difference in sample means, i.e.,

$$V = \bar{Y} - \bar{X}. \qquad (9.7)$$

Our decision rule is to reject $H_0$ if $V \ge c$. As is often done in practice, we base our decision on the p-value of the test. Recall that if the samples result in the values $x_1, x_2, \ldots, x_{n_1}$ and $y_1, y_2, \ldots, y_{n_2}$ with realized sample means $\bar{x}$ and $\bar{y}$, respectively, then the p-value of the test is

$$p = P_{H_0}[V \ge \bar{y} - \bar{x}]. \qquad (9.8)$$

Our goal is a bootstrap estimate of the p-value. But, unlike the last section, the bootstraps here have to be performed when $H_0$ is true. An easy way to do this is to combine the samples into one large sample and then to resample at random and with replacement the combined sample into two samples, one of size $n_1$ (new $x$s) and one of size $n_2$ (new $y$s). Hence the resampling is performed under one distribution; i.e., $H_0$ is true. Let $B$ be a positive integer and let $v = \bar{y} - \bar{x}$. Our bootstrap algorithm is

1. Combine the samples into one sample: $\mathbf{z}' = (\mathbf{x}', \mathbf{y}')$.
2. Set $j = 1$.
3. While $j \le B$, do steps 4–6.
4. Obtain a random sample with replacement of size $n_1$ from $\mathbf{z}$. Call the sample $\mathbf{x}^{*\prime} = (x_1^*, x_2^*, \ldots, x_{n_1}^*)$. Compute $\bar{x}_j^*$.
5. Obtain a random sample with replacement of size $n_2$ from $\mathbf{z}$. Call the sample $\mathbf{y}^{*\prime} = (y_1^*, y_2^*, \ldots, y_{n_2}^*)$. Compute $\bar{y}_j^*$.
6. Compute $v_j^* = \bar{y}_j^* - \bar{x}_j^*$, and replace $j$ by $j + 1$.
7. The bootstrap estimated p-value is given by

$$p^* = \frac{\#_{j=1}^{B}\{v_j^* \ge v\}}{B}. \qquad (9.9)$$

Note that the theory cited above for the bootstrap conﬁdence intervals covers this testing situation also. Hence, this bootstrap p-value is valid.
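The two-sample algorithm above can be sketched in Python as follows. This is a minimal illustration, not the authors' boottesttwo.s program; the function name `boot_test_two` and the seed are assumptions, and the data vectors are the X and Y variates listed in Example 9.2. Because the resampling is random, the estimated p-value varies somewhat with the seed.

```python
import random
from statistics import fmean


def boot_test_two(x, y, B=3000, seed=0):
    """Bootstrap p-value for H0: Delta = 0 versus H1: Delta > 0 (sketch)."""
    rng = random.Random(seed)
    v = fmean(y) - fmean(x)          # observed test statistic
    z = list(x) + list(y)            # step 1: combine the samples
    count = 0
    for _ in range(B):               # steps 2-6
        x_star = rng.choices(z, k=len(x))   # size n1, with replacement
        y_star = rng.choices(z, k=len(y))   # size n2, with replacement
        if fmean(y_star) - fmean(x_star) >= v:
            count += 1
    return v, count / B              # step 7: proportion of v* >= v


# Data of Example 9.2.
x = [94.2, 111.3, 90.0, 99.7, 116.8, 92.2, 166.0, 95.7,
     109.3, 106.0, 111.7, 111.9, 111.6, 146.4, 103.9]
y = [125.5, 107.1, 67.9, 98.2, 128.6, 123.5, 116.5, 143.2,
     120.3, 118.6, 105.0, 111.8, 129.3, 130.8, 139.8]

v, p_star = boot_test_two(x, y)
print(f"v = {v:.2f}, bootstrap p-value = {p_star:.3f}")
```

Both resamples are drawn from the combined vector `z`, which is what forces the resampling to take place under $H_0$.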

Example 9.2. For illustration, we generated data sets from a contaminated normal distribution. Let $W$ be a random variable with the contaminated normal distribution with proportion of contamination $\epsilon = 0.20$ and $\sigma_c = 4$. Thirty independent observations $W_1, W_2, \ldots, W_{30}$ were generated from this distribution. Then we let $X_i = 10W_i + 100$ for $1 \le i \le 15$ and $Y_i = 10W_{i+15} + 120$ for $1 \le i \le 15$. Hence the true shift parameter is $\Delta = 20$. The actual (rounded) data are

X variates
 94.2  111.3   90.0   99.7  116.8   92.2  166.0   95.7
109.3  106.0  111.7  111.9  111.6  146.4  103.9

Y variates
125.5  107.1   67.9   98.2  128.6  123.5  116.5  143.2
120.3  118.6  105.0  111.8  129.3  130.8  139.8

Based on the comparison boxplots below, the scales of the two data sets appear to be the same, while the y-variates (Sample 2) appear to be shifted to the right of x-variates (Sample 1).

[Comparison boxplots of Sample 1 and Sample 2 on a common horizontal axis running from 60 to 160.]

There are three outliers in the data sets. Our test statistic for these data is $v = \bar{y} - \bar{x} = 117.74 - 111.11 = 6.63$. Computing with the R program boottesttwo.s found in Appendix: R Functions, we performed the bootstrap algorithm given above for $B = 3000$ bootstrap replications. The bootstrap p-value was $p^* = 0.169$. This means that $(0.169)(3000) = 507$ of the bootstrap test statistics exceeded the value of the test statistic. Furthermore, these bootstrap values were generated under $H_0$. In practice, $H_0$ would generally not be rejected for a p-value this high. In Figure 9.2, we display a histogram of the 3000 values of the bootstrap test statistic that were obtained. The relative area to the right of the value of the test statistic, 6.63, is approximately equal to $p^*$. For comparison purposes, we used the two-sample "pooled" t-test discussed in Example 6.2 to test these hypotheses. As the reader can obtain in Exercise 9.7, for these data, $t = 0.93$ with a p-value of 0.18, which is quite close to the bootstrap p-value.

[Histogram: vertical axis Frequency (0 to 800), horizontal axis $v^*$ (-30 to 20).]

Figure 9.2: Histogram of the 3000 bootstrap $v^*$s. Locate the value of the test statistic $v = \bar{y} - \bar{x} = 6.63$ on the horizontal axis. The area (proportional to overall area) to the right is the p-value of the bootstrap test.

The above test uses the difference in sample means as the test statistic. Certainly other test statistics could be used. Exercise 9.6 asks the reader to obtain the bootstrap test based on the difference in sample medians. Often, as with confidence intervals, standardizing the test statistic by a scale estimator improves the bootstrap test.

The bootstrap test described above for the two-sample problem is analogous to permutation tests. In the permutation test, the test statistic is calculated for all possible samples of $x$s and $y$s drawn without replacement from the combined data. Often, it is approximated by Monte Carlo methods, in which case it is quite similar to the bootstrap test except, in the case of the bootstrap, the sampling is done with replacement; see Exercise 9.9. Usually, the permutation tests and the bootstrap tests give very similar solutions; see Efron and Tibshirani (1993) for discussion.

As our second testing situation, consider a one-sample location problem. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a continuous cdf $F(x)$ with finite mean $\mu$. Suppose we want to test the hypotheses

$$H_0: \mu = \mu_0 \ \text{ versus } \ H_1: \mu > \mu_0,$$

where $\mu_0$ is specified. As a test statistic we use $\bar{X}$ with the decision rule: Reject $H_0$ in favor of $H_1$ if $\bar{X}$ is too large. Let $x_1, x_2, \ldots, x_n$ be the realization of the random sample. We base our decision on the p-value of the test, namely,

$$p = P_{H_0}[\bar{X} \ge \bar{x}],$$

where $\bar{x}$ is the realized value of the sample average when the sample is drawn. Our bootstrap test is to obtain a bootstrap estimate of this p-value. At first glance, one might proceed by bootstrapping the statistic $\bar{X}$. But note that the p-value must be estimated under $H_0$. One way of assuring that $H_0$ is true is, instead of bootstrapping $x_1, x_2, \ldots, x_n$, to bootstrap the values

$$z_i = x_i - \bar{x} + \mu_0, \quad i = 1, 2, \ldots, n. \qquad (9.10)$$

Our bootstrap procedure is to randomly sample with replacement from $z_1, z_2, \ldots, z_n$. Letting $z^*$ be such an observation, it is easy to see that $E(z^*) = \mu_0$; see Exercise 9.10. Hence, using the $z_i$s, the bootstrap resampling is performed under $H_0$. To be precise, here is our algorithm to compute this bootstrap test. Let $B$ be a positive integer.

1. Form the vector of shifted observations: $\mathbf{z}' = (z_1, z_2, \ldots, z_n)$, where $z_i = x_i - \bar{x} + \mu_0$.
2. Set $j = 1$.
3. While $j \le B$, do steps 4–5.
4. Obtain a random sample with replacement of size $n$ from $\mathbf{z}$. Call the sample $\mathbf{z}_j^*$. Compute its sample mean $\bar{z}_j^*$.
5. $j$ is replaced by $j + 1$.
6. The bootstrap estimated p-value is given by

$$p^* = \frac{\#_{j=1}^{B}\{\bar{z}_j^* \ge \bar{x}\}}{B}. \qquad (9.11)$$

The theory discussed for the bootstrap confidence intervals remains valid for this testing situation also.

Example 9.3. To illustrate the bootstrap test described in the last paragraph, consider the following data set. We generated $n = 20$ observations $X_i = 10W_i + 100$, where $W_i$ has a contaminated normal distribution with proportion of contamination 20% and $\sigma_c = 4$. Suppose we are interested in testing

$$H_0: \mu = 90 \ \text{ versus } \ H_1: \mu > 90.$$

Because the true mean of $X_i$ is 100, the null hypothesis is false. The data generated are

119.7  104.1   92.8   85.4  108.6   93.4   67.1   88.4  101.0   97.2
 95.4   77.2  100.0  114.2  150.3  102.3  105.8  107.5    0.9   94.1

The sample mean of these values is $\bar{x} = 95.27$, which exceeds 90, but is it significantly over 90? We wrote an R function to perform the algorithm described above, bootstrapping the values $z_i = x_i - 95.27 + 90$; see the program boottestonemean of Appendix: R Functions. We obtained 3000 values $\bar{z}_j^*$, which are displayed in the histogram in Figure 9.3. The mean of these 3000 values is 89.96, which is quite close to 90. Of these 3000 values, 563 exceeded $\bar{x} = 95.27$; hence, the p-value of the bootstrap test is 0.188. The fraction of the total area which is to the right of 95.27 in Figure 9.3 is approximately equal to 0.188. Such a high p-value is usually deemed nonsignificant; hence, the null hypothesis would not be rejected.

[Histogram: vertical axis Frequency (0 to 1000), horizontal axis $\bar{z}^*$ (60 to 110).]

Figure 9.3: Histogram of the 3000 bootstrap $\bar{z}^*$s discussed in Example 9.3. The bootstrap p-value is the area (relative to the total area) under the histogram to the right of 95.27.

For comparison, the reader is asked to show in Exercise 9.11 that the value of the one-sample t-test is $t = 0.84$, which has a p-value of 0.20. A test based on the median is discussed in Exercise 9.12.
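The one-sample algorithm of this section can be sketched in Python as follows, using the data of Example 9.3. This is an illustration only, not the authors' boottestonemean program; the function name `boot_test_one_mean` and the seed are assumptions, and the estimated p-value depends on the random draws.

```python
import random
from statistics import fmean


def boot_test_one_mean(x, mu0, B=3000, seed=0):
    """Bootstrap p-value for H0: mu = mu0 versus H1: mu > mu0 (sketch)."""
    rng = random.Random(seed)
    xbar = fmean(x)
    # Shift the data so that the resampling distribution has mean mu0 (9.10).
    z = [xi - xbar + mu0 for xi in x]
    zbars = [fmean(rng.choices(z, k=len(z))) for _ in range(B)]
    # Proportion of bootstrap means at least as large as the observed mean (9.11).
    p_star = sum(zb >= xbar for zb in zbars) / B
    return xbar, p_star, fmean(zbars)


# Data of Example 9.3.
x = [119.7, 104.1, 92.8, 85.4, 108.6, 93.4, 67.1, 88.4, 101.0, 97.2,
     95.4, 77.2, 100.0, 114.2, 150.3, 102.3, 105.8, 107.5, 0.9, 94.1]

xbar, p_star, mean_zbar = boot_test_one_mean(x, mu0=90)
print(f"xbar = {xbar:.2f}, bootstrap p-value = {p_star:.3f}, "
      f"mean of bootstrap means = {mean_zbar:.2f}")
```

The third returned value is the average of the bootstrap means; it should be close to $\mu_0 = 90$, which is a quick check that the resampling really is carried out under $H_0$.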

EXERCISES

9.1. Let $x_1, x_2, \ldots, x_n$ be the values of a random sample. A bootstrap sample, $\mathbf{x}^{*\prime} = (x_1^*, x_2^*, \ldots, x_n^*)$, is a random sample of $x_1, x_2, \ldots, x_n$ drawn with replacement.
(a) Show that $x_1^*, x_2^*, \ldots, x_n^*$ are iid with common cdf $F_n$, the empirical cdf of $x_1, x_2, \ldots, x_n$.
(b) Show that $E(x_i^*) = \bar{x}$.
(c) If $n$ is odd, show that median $\{x_i^*\} = x_{((n+1)/2)}$.
(d) Show that $V(x_i^*) = n^{-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.

9.2. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $\Gamma(1, \beta)$ distribution.
(a) Show that the confidence interval $\left(2n\bar{X}/(\chi^2_{2n})^{(1-(\alpha/2))},\ 2n\bar{X}/(\chi^2_{2n})^{(\alpha/2)}\right)$ is an exact $(1 - \alpha)100\%$ confidence interval for $\beta$.
(b) Using part (a), show that the 90% confidence interval for the data of Example 9.1 is (64.99, 136.69).

9.3. Consider the situation discussed in Example 9.1. Suppose we want to estimate the median of $X_i$ using the sample median.
(a) Determine the median for a $\Gamma(1, \beta)$ distribution.
(b) The algorithm for the bootstrap percentile confidence intervals is general and hence can be used for the median. Rewrite the R code in program percentciboot.s of Appendix: R Functions so the median is the estimator. Using the sample given in the example, obtain a 90% bootstrap percentile confidence interval for the median. Did it trap the true median in this case?

9.4. Suppose $X_1, X_2, \ldots, X_n$ is a random sample drawn from a $N(\mu, \sigma^2)$ distribution. As discussed in Example 2.1, the pivot random variable for a confidence interval is

$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}, \qquad (9.12)$$

where $\bar{X}$ and $S$ are the sample mean and standard deviation, respectively. Recall that $t$ has a Student t-distribution with $n - 1$ degrees of freedom; hence, its distribution is free of all parameters for this normal situation. In the notation of this section, $t_{n-1}^{(\gamma)}$ denotes the $\gamma 100\%$ percentile of a t-distribution with $n - 1$ degrees of freedom. Using this notation, show that a $(1 - \alpha)100\%$ confidence interval for $\mu$ is

$$\left(\bar{x} - t_{n-1}^{(1-\alpha/2)}\frac{s}{\sqrt{n}},\ \bar{x} - t_{n-1}^{(\alpha/2)}\frac{s}{\sqrt{n}}\right). \qquad (9.13)$$

9.5. Frequently, the bootstrap percentile confidence interval can be improved if the estimator $\hat{\theta}$ is standardized by an estimate of scale. To illustrate this, consider a bootstrap for a confidence interval for the mean. Let $x_1^*, x_2^*, \ldots, x_n^*$ be a bootstrap sample drawn from the sample $x_1, x_2, \ldots, x_n$. Consider the bootstrap pivot [analog of (9.12)]:

$$t^* = \frac{\bar{x}^* - \bar{x}}{s^*/\sqrt{n}}, \qquad (9.14)$$

where $\bar{x}^* = n^{-1}\sum_{i=1}^{n} x_i^*$ and

$$s^{*2} = (n - 1)^{-1}\sum_{i=1}^{n}(x_i^* - \bar{x}^*)^2.$$

(a) Rewrite the percentile bootstrap confidence interval algorithm using the mean and collecting $t_j^*$ for $j = 1, 2, \ldots, B$. Form the interval

$$\left(\bar{x} - t^{*(1-\alpha/2)}\frac{s}{\sqrt{n}},\ \bar{x} - t^{*(\alpha/2)}\frac{s}{\sqrt{n}}\right), \qquad (9.15)$$

where $t^{*(\gamma)} = t^*_{([\gamma B])}$; that is, order the $t_j^*$s and pick off the quantiles.

(b) Rewrite the R program percentciboot.s of Appendix: R Functions and use it to find a 90% confidence interval for $\mu$ for the data in Example 9.3. Use 3000 bootstraps.
(c) Compare your confidence interval in the last part with the nonstandardized bootstrap confidence interval based on the program percentciboot.s of Appendix: R Functions.

9.6. Consider the algorithm for a two-sample bootstrap test given in Section 9.2.
(a) Rewrite the algorithm for the bootstrap test based on the difference in medians.
(b) Consider the data in Example 9.2. By substituting the difference in medians for the difference in means in the R program boottesttwo.s of Appendix: R Functions, obtain the bootstrap test for the algorithm of part (a).
(c) Obtain the estimated p-value of your test for $B = 3000$ and compare it to the estimated p-value of 0.063 which the authors obtained.

9.7. Consider the data of Example 9.2. The two-sample t-test of Example 6.2 can be used to test these hypotheses. The test is not exact here (why?), but it is an approximate test. Show that the value of the test statistic is $t = 0.93$, with an approximate p-value of 0.18.

9.8. In Example 9.3, suppose we are testing the two-sided hypotheses, $H_0: \mu = 90$ versus $H_1: \mu \ne 90$.
(a) Determine the bootstrap p-value for this situation.
(b) Rewrite the R program boottestonemean of Appendix: R Functions to obtain this p-value.
(c) Compute the p-value based on 3000 bootstraps.

9.9. Consider the following permutation test for the two-sample problem with hypotheses (9.6). Let $\mathbf{x}' = (x_1, x_2, \ldots, x_{n_1})$ and $\mathbf{y}' = (y_1, y_2, \ldots, y_{n_2})$ be the realizations of the two random samples. The test statistic is the difference in sample means $\bar{y} - \bar{x}$. The estimated p-value of the test is calculated as follows:
1. Combine the data into one sample $\mathbf{z}' = (\mathbf{x}', \mathbf{y}')$.
2. Obtain all possible samples of size $n_1$ drawn without replacement from $\mathbf{z}$. Each such sample automatically gives another sample of size $n_2$, i.e., all elements of $\mathbf{z}$ not in the sample of size $n_1$. There are $M = \binom{n_1 + n_2}{n_1}$ such samples.
3. For each such sample $j$:
   (a) Label the sample of size $n_1$ by $\mathbf{x}^*$ and label the sample of size $n_2$ by $\mathbf{y}^*$.
   (b) Calculate $v_j^* = \bar{y}^* - \bar{x}^*$.
4. The estimated p-value is $p^* = \#\{v_j^* \ge \bar{y} - \bar{x}\}/M$.
(a) Suppose we have two samples each of size 3 which result in the realizations: $\mathbf{x}' = (10, 15, 21)$ and $\mathbf{y}' = (20, 25, 30)$. Determine the test statistic and the permutation test described above along with the p-value.
(b) If we ignore distinct samples, then we can approximate the permutation test by using the bootstrap algorithm with resampling performed at random and without replacement. Modify the bootstrap program boottesttwo.s of Appendix: R Functions to do this and obtain this approximate permutation test based on 3000 resamples for the data of Example 9.2.
(c) In general, what is the probability of having distinct samples in the approximate permutation test described in the last part? Assume that the original data are distinct values.

9.10. Let $z^*$ be drawn at random from the discrete distribution which has mass $n^{-1}$ at each point $z_i = x_i - \bar{x} + \mu_0$, where $(x_1, x_2, \ldots, x_n)$ is the realization of a random sample. Determine $E(z^*)$ and $V(z^*)$.

9.11. For the situation described in Example 9.3, show that the value of the one-sample t-test is $t = 0.84$ and its associated p-value is 0.20.

9.12. For the situation described in Example 9.3, obtain the bootstrap test based on medians. Use the same hypotheses; i.e., $H_0: \mu = 90$ versus $H_1: \mu > 90$.

9.13. Consider Darwin's experiment on Zea mays discussed in Examples 5.1 and 5.5.
(a) Obtain a bootstrap test for this experimental data. Keep in mind that the data are recorded in pairs. Hence your resampling procedure must keep this dependence intact and still be under $H_0$.
(b) Provided computational facilities exist, write an R program that executes your bootstrap test and compare its p-value with that found in Example 5.5.
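The exact permutation scheme described in Exercise 9.9 (steps 1 to 4) can be sketched in Python as follows. This is an illustrative sketch, not code from the text; the function name `permutation_p_value` is an assumption, and exact rational arithmetic is used only so that ties with the observed statistic are counted without rounding error. Full enumeration is practical only for small $n_1 + n_2$.

```python
from fractions import Fraction
from itertools import combinations


def permutation_p_value(x, y):
    """Exact permutation p-value for the difference in means (sketch)."""
    z = list(x) + list(y)                      # step 1: combined sample
    n1, n2 = len(x), len(y)
    total = Fraction(sum(z))
    v = Fraction(sum(y)) / n2 - Fraction(sum(x)) / n1   # observed statistic
    count = 0
    M = 0
    # Step 2: every choice of n1 positions gives x*; the rest form y*.
    for idx in combinations(range(len(z)), n1):
        sum_x = Fraction(sum(z[i] for i in idx))
        v_star = (total - sum_x) / n2 - sum_x / n1      # step 3(b)
        M += 1
        if v_star >= v:
            count += 1
    return v, Fraction(count, M), M            # step 4: #{v* >= v} / M


# Realizations given in Exercise 9.9(a).
v, p_star, M = permutation_p_value([10, 15, 21], [20, 25, 30])
print(f"v = {float(v):.3f}, M = {M}, p* = {float(p_star):.3f}")
```

Positions rather than values are enumerated, so repeated data values are treated as distinct elements of $\mathbf{z}$, in line with counting $M = \binom{n_1+n_2}{n_1}$ samples.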

10 *Tolerance Limits for Distributions

We propose now to investigate a problem that has something of the same ﬂavor as that treated in Section 4. Speciﬁcally, can we compute the probability that a certain random interval includes (or covers) a preassigned percentage of the probability of the distribution under consideration? And, by appropriate selection of the random interval, can we be led to an additional distribution free method of statistical inference?

Let $X$ be a random variable with distribution function $F(x)$ of the continuous type. Let $Z = F(X)$. Then, as shown in Exercise 8.1, $Z$ has a uniform(0, 1) distribution. That is, $Z = F(X)$ has a pdf that equals 1 for $0 < z < 1$ and is zero elsewhere. [...]

[...] for all $\epsilon > 0$,

$$\lim_{n\to\infty} P[|X_n - X| \ge \epsilon] = 0,$$

or equivalently,

$$\lim_{n\to\infty} P[|X_n - X| < \epsilon] = 1.$$

If so, we write $X_n \xrightarrow{P} X$.

If $X_n \xrightarrow{P} X$, we often say that the mass of the difference $X_n - X$ is converging to 0. In statistics, often the limiting random variable $X$ is a constant; i.e., $X$ is a degenerate random variable with all its mass at some constant $a$. In this case, we write $X_n \xrightarrow{P} a$. Also, as Exercise 1.1 shows, convergence of the real sequence $a_n \to a$ is equivalent to $a_n \xrightarrow{P} a$.

One way of showing convergence in probability is to use Chebyshev's Theorem. An illustration of this is given in the following proof. To emphasize the fact that we are working with sequences of random variables, we may place a subscript $n$ on random variables; for example, we write $\bar{X}$ as $\bar{X}_n$.

Theorem 1.1 (Weak Law of Large Numbers). Let $\{X_n\}$ be a sequence of iid random variables having common mean $\mu$ and variance $\sigma^2 < \infty$. Let $\bar{X}_n = n^{-1}\sum_{i=1}^{n} X_i$. Then $\bar{X}_n \xrightarrow{P} \mu$.

Proof: The mean and variance of $\bar{X}_n$ are $\mu$ and $\sigma^2/n$, respectively. Hence, by Chebyshev's Theorem, we have for every $\epsilon > 0$,

$$P[|\bar{X}_n - \mu| \ge \epsilon] = P\left[|\bar{X}_n - \mu| \ge (\epsilon\sqrt{n}/\sigma)(\sigma/\sqrt{n})\right] \le \frac{\sigma^2}{n\epsilon^2} \to 0.$$

This theorem says that all the mass of the distribution of $\bar{X}_n$ is converging to $\mu$, as $n \to \infty$. In a sense, for $n$ large, $\bar{X}_n$ is close to $\mu$. But how close? For instance, if we were to estimate $\mu$ by $\bar{X}_n$, what can we say about the error of estimation? We answer this in Section 3.

Actually, in a more advanced course, a Strong Law of Large Numbers is proved; see page 124 of Chung (1974). One result of this theorem is that we can weaken the hypothesis of Theorem 1.1 to the assumption that the random variables $X_i$ are independent and each has finite mean $\mu$. Thus the Strong Law of Large Numbers is a first moment theorem, while the Weak Law requires the existence of the second moment.

There are several theorems concerning convergence in probability which will be useful in the sequel. Together the next two theorems say that convergence in probability is closed under linearity.
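As a numerical illustration of the Chebyshev bound used in the proof of the Weak Law of Large Numbers, the following Python sketch estimates $P[|\bar{X}_n - \mu| \ge \epsilon]$ by simulation and compares it with $\sigma^2/(n\epsilon^2)$. The choice of the uniform(0, 1) distribution ($\mu = 1/2$, $\sigma^2 = 1/12$), the value $\epsilon = 0.1$, the number of replications, and the function name `tail_frequency` are all assumptions made for the illustration.

```python
import random
from statistics import fmean


def tail_frequency(n, eps, reps=1000, seed=0):
    """Simulated frequency of |mean of n uniform(0,1) draws - 1/2| >= eps."""
    rng = random.Random(seed)
    mu = 0.5
    hits = 0
    for _ in range(reps):
        xbar = fmean([rng.random() for _ in range(n)])
        if abs(xbar - mu) >= eps:
            hits += 1
    return hits / reps


sigma2 = 1 / 12      # variance of the uniform(0, 1) distribution
eps = 0.1
results = {}
for n in (10, 100, 1000):
    freq = tail_frequency(n, eps)
    bound = sigma2 / (n * eps ** 2)      # Chebyshev bound from the proof
    results[n] = (freq, bound)
    print(f"n = {n:5d}: simulated frequency = {freq:.4f}, bound = {bound:.4f}")
```

The simulated frequencies are typically far below the bound; Chebyshev's inequality is crude, but it is enough to force the probability to 0 as $n \to \infty$.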

Theorem 1.2. Suppose $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$. Then $X_n + Y_n \xrightarrow{P} X + Y$.

Proof: Let $\epsilon > 0$ be given. Using the triangle inequality, we can write

$$|X_n - X| + |Y_n - Y| \ge |(X_n + Y_n) - (X + Y)| \ge \epsilon.$$

Since $P$ is monotone relative to set containment, we have

$$P[|(X_n + Y_n) - (X + Y)| \ge \epsilon] \le P[|X_n - X| + |Y_n - Y| \ge \epsilon] \le P[|X_n - X| \ge \epsilon/2] + P[|Y_n - Y| \ge \epsilon/2].$$

By the hypothesis of the theorem, the last two terms converge to 0, which gives us the desired result.

Theorem 1.3. Suppose $X_n \xrightarrow{P} X$ and $a$ is a constant. Then $aX_n \xrightarrow{P} aX$.

Proof: If $a = 0$, the result is immediate. Suppose $a \ne 0$. Let $\epsilon > 0$. The result follows from these equalities:

$$P[|aX_n - aX| \ge \epsilon] = P[|a||X_n - X| \ge \epsilon] = P[|X_n - X| \ge \epsilon/|a|],$$

and by hypothesis the last term goes to 0.

Theorem 1.4. Suppose $X_n \xrightarrow{P} a$ and the real function $g$ is continuous at $a$. Then $g(X_n) \xrightarrow{P} g(a)$.

Proof: Let $\epsilon > 0$. Then since $g$ is continuous at $a$, there exists a $\delta > 0$ such that if $|x - a| < \delta$, then $|g(x) - g(a)| < \epsilon$. Thus

$$|g(x) - g(a)| \ge \epsilon \implies |x - a| \ge \delta.$$

Substituting $X_n$ for $x$ in the above implication, we obtain

$$P[|g(X_n) - g(a)| \ge \epsilon] \le P[|X_n - a| \ge \delta].$$

By the hypothesis, the last term goes to 0 as $n \to \infty$, which gives us the result.

This theorem gives us many useful results. For instance, if $X_n \xrightarrow{P} a$, then

$X_n^2 \xrightarrow{P} a^2$,
$1/X_n \xrightarrow{P} 1/a$, provided $a \ne 0$,
$\sqrt{X_n} \xrightarrow{P} \sqrt{a}$, provided $a \ge 0$.

Actually, in a more advanced class, it is shown that if $X_n \xrightarrow{P} X$ and $g$ is a continuous function, then $g(X_n) \xrightarrow{P} g(X)$; see page 104 of Tucker (1967). We make use of this in the next theorem.

Theorem 1.5. Suppose $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$. Then $X_n Y_n \xrightarrow{P} XY$.

Proof: Using the above results, we have

$$X_n Y_n = \frac{1}{2}X_n^2 + \frac{1}{2}Y_n^2 - \frac{1}{2}(X_n - Y_n)^2 \xrightarrow{P} \frac{1}{2}X^2 + \frac{1}{2}Y^2 - \frac{1}{2}(X - Y)^2 = XY.$$

Let us return to our discussion of sampling and statistics. Consider the situation where we have a random variable $X$ whose distribution has an unknown parameter $\theta \in \Omega$. We seek a statistic based on a sample to estimate $\theta$. We now introduce consistency:

Definition 1.2 (Consistency). Let $X$ be a random variable with cdf $F(x, \theta)$, $\theta \in \Omega$. Let $X_1, \ldots, X_n$ be a sample from the distribution of $X$ and let $T_n$ denote a statistic. We say $T_n$ is a consistent estimator of $\theta$ if $T_n \xrightarrow{P} \theta$.

If $X_1, \ldots, X_n$ is a random sample from a distribution with finite mean $\mu$ and variance $\sigma^2$, then by the Weak Law of Large Numbers, the sample mean, $\bar{X}_n$, is a consistent estimator of $\mu$.

Example 1.1 (Sample Variance). Let $X_1, \ldots, X_n$ denote a random sample from a distribution with mean $\mu$ and variance $\sigma^2$. We can show that the sample variance is an unbiased estimator of $\sigma^2$. We now show that it is a consistent estimator of $\sigma^2$. Recall that Theorem 1.1 showed that $\bar{X}_n \xrightarrow{P} \mu$. To show that the sample variance converges in probability to $\sigma^2$, assume further that $E[X_1^4] < \infty$, so that $\mathrm{Var}(S^2) < \infty$. Using the preceding results, we can show the following:

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 = \frac{n}{n-1}\left(\frac{1}{n}\sum_{i=1}^{n}X_i^2 - \bar{X}_n^2\right) \xrightarrow{P} 1 \cdot [E(X_1^2) - \mu^2] = \sigma^2.$$

Hence the sample variance is a consistent estimator of $\sigma^2$. From the discussion above, we have immediately that $S_n \xrightarrow{P} \sigma$; that is, the sample standard deviation is a consistent estimator of the population standard deviation.

Unlike the last example, sometimes we can obtain the convergence by using the distribution function. We illustrate this with the following example:

Example 1.2 (Maximum of a Sample from a Uniform Distribution). Suppose $X_1, \ldots, X_n$ is a random sample from a uniform$(0, \theta)$ distribution. Suppose $\theta$ is unknown. An intuitive estimate of $\theta$ is the maximum of the sample. Let $Y_n = \max\{X_1, \ldots, X_n\}$. Exercise 1.4 shows that the cdf of $Y_n$ is

$$F_{Y_n}(t) = \begin{cases} 1 & t > \theta \\ (t/\theta)^n & 0 < t \le \theta \\ 0 & t \le 0. \end{cases} \qquad (1.1)$$

[...]

[...] $\epsilon > 0$ be given. Then

$$\lim_{n\to\infty} P[|X_n - b| \le \epsilon] = \lim_{n\to\infty} F_{X_n}(b + \epsilon) - \lim_{n\to\infty} F_{X_n}[(b - \epsilon) - 0] = 1 - 0 = 1,$$

which is the desired result.

A result that will prove quite useful is the following:

Theorem 2.3. Suppose $X_n$ converges to $X$ in distribution and $Y_n$ converges in probability to 0. Then $X_n + Y_n$ converges to $X$ in distribution.

The proof is similar to that of Theorem 2.2 and is left to Exercise 2.12. We often use this last result as follows. Suppose it is difficult to show that $X_n$ converges to $X$ in distribution, but it is easy to show that $Y_n$ converges in distribution to $X$ and that $X_n - Y_n$ converges to 0 in probability. Hence, by this last theorem, $X_n = Y_n + (X_n - Y_n) \xrightarrow{D} X$, as desired.

The next two theorems state general results. A proof of the first result can be found in a more advanced text, while the second, Slutsky's Theorem, follows similarly to that of Theorem 2.1.

Theorem 2.4. Suppose $X_n$ converges to $X$ in distribution and $g$ is a continuous function on the support of $X$. Then $g(X_n)$ converges to $g(X)$ in distribution.

An often-used application of this theorem occurs when we have a sequence of random variables $Z_n$ which converges in distribution to a standard normal random variable $Z$. Because the distribution of $Z^2$ is $\chi^2(1)$, it follows by Theorem 2.4 that $Z_n^2$ converges in distribution to a $\chi^2(1)$ distribution.

Theorem 2.5 (Slutsky's Theorem). Let $X_n$, $X$, $A_n$, and $B_n$ be random variables and let $a$ and $b$ be constants. If $X_n \xrightarrow{D} X$, $A_n \xrightarrow{P} a$, and $B_n \xrightarrow{P} b$, then

$$A_n + B_n X_n \xrightarrow{D} a + bX.$$

2.1 Bounded in Probability

Another useful concept, related to convergence in distribution, is boundedness in probability of a sequence of random variables. First consider any random variable $X$ with cdf $F_X(x)$. Then given $\epsilon > 0$, we can bound $X$ in the following way. Because the lower limit of $F_X$ is 0 and its upper limit is 1, we can find $\eta_1$ and $\eta_2$ such that $F_X(x) < \epsilon/2$ for $x \le \eta_1$ and $F_X(x) > 1 - (\epsilon/2)$ for $x \ge \eta_2$. Let $\eta = \max\{|\eta_1|, |\eta_2|\}$; then

$$P[|X| \le \eta] = F_X(\eta) - F_X(-\eta - 0) \ge 1 - (\epsilon/2) - (\epsilon/2) = 1 - \epsilon. \qquad (2.7)$$

Thus random variables which are not bounded [e.g., $X$ is $N(0, 1)$] are still bounded in this way. This is a useful concept for sequences of random variables, which we define next.

Definition 2.2 (Bounded in Probability). We say that the sequence of random variables $\{X_n\}$ is bounded in probability if, for all $\epsilon > 0$, there exist a constant $B_\epsilon > 0$ and an integer $N_\epsilon$ such that

$$n \ge N_\epsilon \implies P[|X_n| \le B_\epsilon] \ge 1 - \epsilon.$$

Next, consider a sequence of random variables $\{X_n\}$ which converge in distribution to a random variable $X$ which has cdf $F$. Let $\epsilon > 0$ be given and choose $\eta$ so that (2.7) holds for $X$. We can always choose $\eta$ so that $\eta$ and $-\eta$ are continuity points of $F$. We then have

$$\lim_{n\to\infty} P[|X_n| \le \eta] \ge \lim_{n\to\infty} F_{X_n}(\eta) - \lim_{n\to\infty} F_{X_n}(-\eta - 0) = F_X(\eta) - F_X(-\eta) \ge 1 - \epsilon.$$

To be precise, we can then choose $N$ so large that $P[|X_n| \le \eta] \ge 1 - \epsilon$, for $n \ge N$. We have thus proved the following theorem:

Theorem 2.6. Let $\{X_n\}$ be a sequence of random variables and let $X$ be a random variable. If $X_n \to X$ in distribution, then $\{X_n\}$ is bounded in probability.

As the following example shows, the converse of this theorem is not true.

Example 2.5. Take $\{X_n\}$ to be the following sequence of degenerate random variables. For $n = 2m$ even, $X_{2m} = 2 + (1/(2m))$ with probability 1. For $n = 2m - 1$ odd, $X_{2m-1} = 1 + (1/(2m))$ with probability 1. Then the sequence $\{X_2, X_4, X_6, \ldots\}$ converges in distribution to the degenerate random variable $Y = 2$, while the sequence $\{X_1, X_3, X_5, \ldots\}$ converges in distribution to the degenerate random variable $W = 1$. Since the distributions of $Y$ and $W$ are not the same, the sequence $\{X_n\}$ does not converge in distribution. Because all of the mass of the sequence $\{X_n\}$ is in the interval $[1, 5/2]$, however, the sequence $\{X_n\}$ is bounded in probability.

One way of thinking of a sequence which is bounded in probability (or one which is converging to a random variable in distribution) is that the probability mass of $|X_n|$ is not escaping to $\infty$. At times we can use boundedness in probability instead of convergence in distribution. A property we will need later is given in the following theorem.

Theorem 2.7. Let $\{X_n\}$ be a sequence of random variables bounded in probability and let $\{Y_n\}$ be a sequence of random variables which converge to 0 in probability. Then $X_n Y_n \xrightarrow{P} 0$.

Proof: Let $\epsilon > 0$ be given. Choose $B_\epsilon > 0$ and an integer $N_\epsilon$ such that

$$n \ge N_\epsilon \implies P[|X_n| \le B_\epsilon] \ge 1 - \epsilon.$$

Then

$$\lim_{n\to\infty} P[|X_n Y_n| \ge \epsilon] \le \lim_{n\to\infty} P[|X_n Y_n| \ge \epsilon, |X_n| \le B_\epsilon] + \lim_{n\to\infty} P[|X_n Y_n| \ge \epsilon, |X_n| > B_\epsilon] \le \lim_{n\to\infty} P[|Y_n| \ge \epsilon/B_\epsilon] + \epsilon = \epsilon, \qquad (2.8)$$

from which the desired result follows.

2.2 Δ-Method

A common problem is the situation where we know the distribution of a random variable, but we want to determine the distribution of a function of it. This is also true in asymptotic theory, and Theorems 2.4 and 2.5 are illustrations of this. Another such result is called the Δ-method. To establish this result, we need a convenient form of the mean value theorem with remainder, sometimes called Young's Theorem; see Hardy (1992) or Lehmann (1999). Suppose $g(x)$ is differentiable at $x$. Then we can write

$$g(y) = g(x) + g'(x)(y - x) + o(|y - x|), \qquad (2.9)$$

where the notation $o$ means

$a = o(b)$ if and only if $\frac{a}{b} \to 0$, as $b \to 0$.

The little-o notation is used in terms of convergence in probability, also. We often write $o_p(X_n)$, which means

$Y_n = o_p(X_n)$ if and only if $\frac{Y_n}{X_n} \xrightarrow{P} 0$, as $n \to \infty$. (2.10)

There is a corresponding $O_p$ notation, which is given by

$Y_n = O_p(X_n)$ if and only if $\frac{Y_n}{X_n}$ is bounded in probability as $n \to \infty$. (2.11)

The following theorem illustrates the little-o notation, but it also serves as a lemma for Theorem 2.9.

Theorem 2.8. Suppose $\{Y_n\}$ is a sequence of random variables which is bounded in probability. Suppose $X_n = o_p(Y_n)$. Then $X_n \xrightarrow{P} 0$, as $n \to \infty$.

Proof: Let $\epsilon > 0$ be given. Because the sequence $\{Y_n\}$ is bounded in probability, there exist positive constants $N_\epsilon$ and $B_\epsilon$ such that

$$n \ge N_\epsilon \implies P[|Y_n| \le B_\epsilon] \ge 1 - \epsilon. \qquad (2.12)$$

Also, because $X_n = o_p(Y_n)$, we have

$$\frac{X_n}{Y_n} \xrightarrow{P} 0, \qquad (2.13)$$

as $n \to \infty$. We then have

$$P[|X_n| \ge \epsilon] = P[|X_n| \ge \epsilon, |Y_n| \le B_\epsilon] + P[|X_n| \ge \epsilon, |Y_n| > B_\epsilon] \le P\left[\frac{|X_n|}{|Y_n|} \ge \frac{\epsilon}{B_\epsilon}\right] + P[|Y_n| > B_\epsilon].$$

By (2.13) and (2.12), respectively, the first and second terms on the right side can be made arbitrarily small by choosing $n$ sufficiently large. Hence the result is true.

We can now prove the theorem about the asymptotic procedure, which is often called the Δ-method.

Theorem 2.9. Let $\{X_n\}$ be a sequence of random variables such that

$$\sqrt{n}(X_n - \theta) \xrightarrow{D} N(0, \sigma^2). \qquad (2.14)$$

Suppose the function $g(x)$ is differentiable at $\theta$ and $g'(\theta) \ne 0$. Then

$$\sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{D} N(0, \sigma^2 (g'(\theta))^2). \qquad (2.15)$$

Proof: Using expression (2.9), we have

$$g(X_n) = g(\theta) + g'(\theta)(X_n - \theta) + o_p(|X_n - \theta|),$$

where $o_p$ is interpreted as in (2.10). Rearranging, we have

$$\sqrt{n}(g(X_n) - g(\theta)) = g'(\theta)\sqrt{n}(X_n - \theta) + o_p(\sqrt{n}|X_n - \theta|).$$

Because (2.14) holds, Theorem 2.6 implies that $\sqrt{n}|X_n - \theta|$ is bounded in probability. Therefore, by Theorem 2.8, $o_p(\sqrt{n}|X_n - \theta|) \to 0$, in probability. Hence, by (2.14) and Theorem 2.1, the result follows.

Illustrations of the Δ-method can be found in Example 2.8 and the exercises.
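The conclusion (2.15) can be checked numerically with a short simulation. The Python sketch below is only an illustration under assumed choices that do not come from the text: $X_n$ is taken to be the mean of $n$ uniform(0, 1) draws (so $\theta = 1/2$ and $\sigma^2 = 1/12$), $g(x) = x^2$ (so $g'(\theta) = 1$), and the function name `delta_method_draws`, the sample size, and the number of replications are hypothetical.

```python
import math
import random
from statistics import fmean, pvariance


def delta_method_draws(n=400, reps=2000, seed=0):
    """Simulate sqrt(n) * (g(Xbar_n) - g(theta)) for g(x) = x**2 (sketch)."""
    rng = random.Random(seed)
    theta = 0.5                       # mean of uniform(0, 1)
    draws = []
    for _ in range(reps):
        xbar = fmean([rng.random() for _ in range(n)])
        draws.append(math.sqrt(n) * (xbar ** 2 - theta ** 2))
    return draws


draws = delta_method_draws()
sim_var = pvariance(draws)
# Asymptotic variance from (2.15): sigma^2 * (g'(theta))^2 with g'(x) = 2x.
asym_var = (1 / 12) * (2 * 0.5) ** 2
print(f"simulated variance = {sim_var:.4f}, asymptotic variance = {asym_var:.4f}")
```

For moderately large $n$ the simulated variance should be close to $\sigma^2 (g'(\theta))^2 = 1/12$, and the simulated mean close to 0, in agreement with the limiting normal distribution.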

2.3 Moment Generating Function Technique

To find the limiting distribution function of a random variable $X_n$ by using the definition obviously requires that we know $F_{X_n}(x)$ for each positive integer $n$. But it is often difficult to obtain $F_{X_n}(x)$ in closed form. Fortunately, if it exists, the mgf that corresponds to the cdf $F_{X_n}(x)$ often provides a convenient method of determining the limiting cdf. The following theorem, which is essentially Curtiss' (1942) modification of a theorem of Lévy and Cramér, explains how the mgf may be used in problems of limiting distributions. A proof of the theorem is beyond our scope. It can readily be found, for instance, on page 171 of Breiman (1968) (a proof based on characteristic functions).

Theorem 2.10. Let $\{X_n\}$ be a sequence of random variables with mgf $M_{X_n}(t)$ that exists for $-h < t < h$ for all $n$. Let $X$ be a random variable with mgf $M(t)$, which exists for $|t| \le h_1 \le h$. If $\lim_{n\to\infty} M_{X_n}(t) = M(t)$ for $|t| \le h_1$, then $X_n \xrightarrow{D} X$.

In this and the subsequent sections are several illustrations of the use of Theorem 2.10. In some of these examples it is convenient to use a certain limit that is established in some courses in advanced calculus. We refer to a limit of the form

$$\lim_{n\to\infty}\left[1 + \frac{b}{n} + \frac{\psi(n)}{n}\right]^{cn},$$

where $b$ and $c$ do not depend upon $n$ and where $\lim_{n\to\infty}\psi(n) = 0$. Then

$$\lim_{n\to\infty}\left[1 + \frac{b}{n} + \frac{\psi(n)}{n}\right]^{cn} = \lim_{n\to\infty}\left(1 + \frac{b}{n}\right)^{cn} = e^{bc}. \qquad (2.16)$$

For example,

$$\lim_{n\to\infty}\left(1 - \frac{t^2}{n} + \frac{t^2}{n^{3/2}}\right)^{-n/2} = \lim_{n\to\infty}\left(1 - \frac{t^2}{n} + \frac{t^2/\sqrt{n}}{n}\right)^{-n/2}.$$

Here $b = -t^2$, $c = -\frac{1}{2}$, and $\psi(n) = t^2/\sqrt{n}$. Accordingly, for every fixed value of $t$, the limit is $e^{t^2/2}$.

Example 2.6. Let $Y_n$ have a distribution that is $b(n, p)$. Suppose that the mean $\mu = np$ is the same for every $n$; that is, $p = \mu/n$, where $\mu$ is a constant. We shall find the limiting distribution of the binomial distribution, when $p = \mu/n$, by finding the limit of $M_{Y_n}(t)$. Now

$$M_{Y_n}(t) = E(e^{tY_n}) = [(1 - p) + pe^t]^n = \left[1 + \frac{\mu(e^t - 1)}{n}\right]^n$$

for all real values of $t$. Hence we have

$$\lim_{n\to\infty} M_{Y_n}(t) = e^{\mu(e^t - 1)}$$

for all real values of $t$. Since there exists a distribution, namely the Poisson distribution with mean $\mu$, that has mgf $e^{\mu(e^t - 1)}$, then, in accordance with the theorem and under the conditions stated, it is seen that $Y_n$ has a limiting Poisson distribution with mean $\mu$.

Whenever a random variable has a limiting distribution, we may, if we wish, use the limiting distribution as an approximation to the exact distribution function. The result of this example enables us to use the Poisson distribution as an approximation to the binomial distribution when $n$ is large and $p$ is small. To illustrate the use of the approximation, let $Y$ have a binomial distribution with $n = 50$ and $p = \frac{1}{25}$. Then

$$\Pr(Y \le 1) = \left(\tfrac{24}{25}\right)^{50} + 50\left(\tfrac{1}{25}\right)\left(\tfrac{24}{25}\right)^{49} = 0.400,$$

approximately. Since $\mu = np = 2$, the Poisson approximation to this probability is $e^{-2} + 2e^{-2} = 0.406$.

Example 2.7. Let $Z_n$ be $\chi^2(n)$. Then the mgf of $Z_n$ is $(1 - 2t)^{-n/2}$, $t < \frac{1}{2}$. The mean and the variance of $Z_n$ are, respectively, $n$ and $2n$. The limiting distribution of the random variable $Y_n = (Z_n - n)/\sqrt{2n}$ will be investigated. Now the mgf of $Y_n$ is

$$M_{Y_n}(t) = E\left\{\exp\left[t\left(\frac{Z_n - n}{\sqrt{2n}}\right)\right]\right\} = e^{-tn/\sqrt{2n}}\,E\big(e^{tZ_n/\sqrt{2n}}\big) = \exp\left[-\left(t\sqrt{\frac{2}{n}}\right)\left(\frac{n}{2}\right)\right]\left(1 - 2\,\frac{t}{\sqrt{2n}}\right)^{-n/2}, \quad t < \frac{\sqrt{2n}}{2}.$$
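The Poisson approximation to the binomial probability in Example 2.6 can be checked numerically. The sketch below is an illustration only; the helper names `binom_cdf` and `poisson_cdf` are hypothetical, and the code uses just the Python standard library.

```python
import math

def binom_cdf(k, n, p):
    """P(Y <= k) for Y ~ b(n, p), summed directly from the pmf."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def poisson_cdf(k, mu):
    """P(Y <= k) for Y ~ Poisson(mu), summed directly from the pmf."""
    return sum(math.exp(-mu) * mu**j / math.factorial(j) for j in range(k + 1))

exact = binom_cdf(1, 50, 1 / 25)   # n = 50, p = 1/25 as in Example 2.6
approx = poisson_cdf(1, 2.0)       # mu = np = 2
print(exact, approx)
```

The two values agree with the figures 0.400 and 0.406 quoted in the example to three decimal places.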

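0, $\epsilon \le |X_{nj} - X_j| \le \|\mathbf{X}_n - \mathbf{X}\|$. Hence

$$\lim_{n\to\infty} P[|X_{nj} - X_j| \ge \epsilon] \le \lim_{n\to\infty} P[\|\mathbf{X}_n - \mathbf{X}\| \ge \epsilon] = 0,$$

which is the desired result.

Conversely, if $X_{nj} \xrightarrow{P} X_j$ for all $j = 1, \ldots, p$, then by the second part of the inequality (4.3),

$$\epsilon \le \|\mathbf{X}_n - \mathbf{X}\| \le \sum_{j=1}^{p} |X_{nj} - X_j|,$$

for any $\epsilon > 0$. Hence

$$\lim_{n\to\infty} P[\|\mathbf{X}_n - \mathbf{X}\| \ge \epsilon] \le \lim_{n\to\infty} P\Big[\sum_{j=1}^{p} |X_{nj} - X_j| \ge \epsilon\Big] \le \sum_{j=1}^{p} \lim_{n\to\infty} P[|X_{nj} - X_j| \ge \epsilon/p] = 0.$$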

Based on this result, many of the theorems involving convergence in probability can easily be extended to the multivariate setting. Some of these results are given in the exercises. This is true of statistical results, too. For example, in Section 2, we showed that if $X_1, \ldots, X_n$ is a random sample from the distribution of a random variable $X$ with mean $\mu$ and variance $\sigma^2$, then $\overline{X}_n$ and $S_n^2$ are consistent estimates of $\mu$ and $\sigma^2$. By the last theorem, we have that $(\overline{X}_n, S_n^2)$ is a consistent estimate of $(\mu, \sigma^2)$.

As another simple application, consider the multivariate analog of the sample mean and sample variance. Let $\{\mathbf{X}_n\}$ be a sequence of iid random vectors with common mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$. Denote the vector of means by

$$\overline{\mathbf{X}}_n = \frac{1}{n}\sum_{i=1}^{n} \mathbf{X}_i. \tag{4.5}$$

Of course, $\overline{\mathbf{X}}_n$ is just the vector of sample means, $(\overline{X}_1, \ldots, \overline{X}_p)'$. By the Weak Law of Large Numbers, Theorem 1.1, $\overline{X}_j \to \mu_j$, in probability, for each $j$. Hence, by Theorem 4.1, $\overline{\mathbf{X}}_n \to \boldsymbol{\mu}$, in probability.

How about the analog of the sample variances? Let $\mathbf{X}_i = (X_{i1}, \ldots, X_{ip})'$. Define the sample variances and covariances by

$$S_{n,j}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_{ij} - \overline{X}_j)^2, \quad \text{for } j = 1, \ldots, p, \tag{4.6}$$

$$S_{n,jk} = \frac{1}{n-1}\sum_{i=1}^{n} (X_{ij} - \overline{X}_j)(X_{ik} - \overline{X}_k), \quad \text{for } j \ne k = 1, \ldots, p. \tag{4.7}$$

Assuming finite fourth moments, the Weak Law of Large Numbers shows that all these componentwise sample variances and sample covariances converge in probability to distribution variances and covariances, respectively. As in our discussion after the Weak Law of Large Numbers, the Strong Law of Large Numbers implies that this convergence is true under the weaker assumption of the existence of finite second moments. If we define the $p \times p$ matrix $\mathbf{S}$ to be the matrix with the $j$th diagonal entry $S_{n,j}^2$ and $(j, k)$th entry $S_{n,jk}$, then $\mathbf{S} \to \boldsymbol{\Sigma}$, in probability.

The definition of convergence in distribution remains the same. We state it here in terms of vector notation.

Definition 4.2. Let $\{\mathbf{X}_n\}$ be a sequence of random vectors with $\mathbf{X}_n$ having distribution function $F_n(\mathbf{x})$ and $\mathbf{X}$ be a random vector with distribution function $F(\mathbf{x})$. Then $\{\mathbf{X}_n\}$ converges in distribution to $\mathbf{X}$ if

$$\lim_{n\to\infty} F_n(\mathbf{x}) = F(\mathbf{x}), \tag{4.8}$$

for all points $\mathbf{x}$ at which $F(\mathbf{x})$ is continuous. We write $\mathbf{X}_n \xrightarrow{D} \mathbf{X}$.

In the multivariate case, there are analogs to many of the theorems in Section 2. We state two important theorems without proof.

Theorem 4.2. Let $\{\mathbf{X}_n\}$ be a sequence of random vectors which converge in distribution to a random vector $\mathbf{X}$ and let $g(\mathbf{x})$ be a function which is continuous on the support of $\mathbf{X}$. Then $g(\mathbf{X}_n)$ converges in distribution to $g(\mathbf{X})$.

We can apply this theorem to show that convergence in distribution implies marginal convergence. Simply take $g(\mathbf{x}) = x_j$, where $\mathbf{x} = (x_1, \ldots, x_p)'$. Since $g$ is continuous, the desired result follows.

It is often difficult to determine convergence in distribution by using the definition. As in the univariate case, convergence in distribution is equivalent to convergence of moment generating functions, which we state in the following theorem.

Theorem 4.3. Let $\{\mathbf{X}_n\}$ be a sequence of random vectors with $\mathbf{X}_n$ having distribution function $F_n(\mathbf{x})$ and moment generating function $M_n(\mathbf{t})$. Let $\mathbf{X}$ be a random vector with distribution function $F(\mathbf{x})$ and moment generating function $M(\mathbf{t})$. Then $\{\mathbf{X}_n\}$ converges in distribution to $\mathbf{X}$ if and only if, for some $h > 0$,

$$\lim_{n\to\infty} M_n(\mathbf{t}) = M(\mathbf{t}), \tag{4.9}$$

for all $\mathbf{t}$ such that $\|\mathbf{t}\| < h$.

The proof of this theorem can be found in, for instance, Tucker (1967). Also, the usual proof is for characteristic functions instead of moment generating functions. As we mentioned previously, characteristic functions always exist, so convergence in distribution is completely characterized by convergence of corresponding characteristic functions.

The moment generating function of $\mathbf{X}_n$ is $E[\exp\{\mathbf{t}'\mathbf{X}_n\}]$. Note that $\mathbf{t}'\mathbf{X}_n$ is a random variable. We can frequently use this and univariate theory to derive results in the multivariate case. A perfect example of this is the multivariate central limit theorem.

Theorem 4.4 (Multivariate Central Limit Theorem). Let $\{\mathbf{X}_n\}$ be a sequence of iid random vectors with common mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$ which is positive definite. Assume the common moment generating function $M(\mathbf{t})$ exists in an open neighborhood of $\mathbf{0}$. Let

$$\mathbf{Y}_n = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} (\mathbf{X}_i - \boldsymbol{\mu}) = \sqrt{n}\,(\overline{\mathbf{X}} - \boldsymbol{\mu}).$$

Then $\mathbf{Y}_n$ converges in distribution to a $N_p(\mathbf{0}, \boldsymbol{\Sigma})$ distribution.

Proof: Let $\mathbf{t} \in R^p$ be a vector in the stipulated neighborhood of $\mathbf{0}$. The moment generating function of $\mathbf{Y}_n$ is

$$M_n(\mathbf{t}) = E\left[\exp\left\{\mathbf{t}'\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(\mathbf{X}_i - \boldsymbol{\mu})\right\}\right] = E\left[\exp\left\{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\mathbf{t}'(\mathbf{X}_i - \boldsymbol{\mu})\right\}\right] = E\left[\exp\left\{\frac{1}{\sqrt{n}}\sum_{i=1}^{n} W_i\right\}\right], \tag{4.10}$$

where $W_i = \mathbf{t}'(\mathbf{X}_i - \boldsymbol{\mu})$. Note that $W_1, \ldots, W_n$ are iid with mean 0 and variance $\mathrm{Var}(W_i) = \mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}$. Hence, by the simple Central Limit Theorem,

$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} W_i \xrightarrow{D} N(0, \mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}). \tag{4.11}$$

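Expression (4.10), though, is the mgf of $(1/\sqrt{n})\sum_{i=1}^{n} W_i$ evaluated at 1. Therefore, by (4.11), we must have

$$M_n(\mathbf{t}) = E\left[\exp\left\{(1)\frac{1}{\sqrt{n}}\sum_{i=1}^{n} W_i\right\}\right] \to e^{1^2\,\mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}/2} = e^{\mathbf{t}'\boldsymbol{\Sigma}\mathbf{t}/2}.$$

Because the last quantity is the moment generating function of a $N_p(\mathbf{0}, \boldsymbol{\Sigma})$ distribution, we have the desired result.

Suppose $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$ is a random sample from a distribution with mean vector $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$. Let $\overline{\mathbf{X}}_n$ be the vector of sample means. Then, from the Central Limit Theorem, we say that

$$\overline{\mathbf{X}}_n \text{ has an approximate } N_p\left(\boldsymbol{\mu}, \tfrac{1}{n}\boldsymbol{\Sigma}\right) \text{ distribution.} \tag{4.12}$$

A result that we use frequently concerns linear transformations. Its proof is obtained by using moment generating functions and is left as an exercise.

Theorem 4.5. Let $\{\mathbf{X}_n\}$ be a sequence of $p$-dimensional random vectors. Suppose $\mathbf{X}_n \xrightarrow{D} N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Let $\mathbf{A}$ be an $m \times p$ matrix of constants and let $\mathbf{b}$ be an $m$-dimensional vector of constants. Then $\mathbf{A}\mathbf{X}_n + \mathbf{b} \xrightarrow{D} N(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$.

A result that will prove to be quite useful is the extension of the Δ-method; see Theorem 2.9. A proof can be found in Chapter 3 of Serfling (1980).

Theorem 4.6. Let $\{\mathbf{X}_n\}$ be a sequence of $p$-dimensional random vectors. Suppose

$$\sqrt{n}\,(\mathbf{X}_n - \boldsymbol{\mu}_0) \xrightarrow{D} N_p(\mathbf{0}, \boldsymbol{\Sigma}).$$

Let $\mathbf{g}$ be a transformation $\mathbf{g}(\mathbf{x}) = (g_1(\mathbf{x}), \ldots, g_k(\mathbf{x}))'$ such that $1 \le k \le p$ and the $k \times p$ matrix of partial derivatives,

$$\mathbf{B} = \left[\frac{\partial g_i}{\partial \mu_j}\right], \quad i = 1, \ldots, k;\ j = 1, \ldots, p,$$

are continuous and do not vanish in a neighborhood of $\boldsymbol{\mu}_0$. Let $\mathbf{B}_0 = \mathbf{B}$ at $\boldsymbol{\mu}_0$. Then

$$\sqrt{n}\,(\mathbf{g}(\mathbf{X}_n) - \mathbf{g}(\boldsymbol{\mu}_0)) \xrightarrow{D} N_k(\mathbf{0}, \mathbf{B}_0\boldsymbol{\Sigma}\mathbf{B}_0'). \tag{4.13}$$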
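To make the limiting covariance $\mathbf{B}_0\boldsymbol{\Sigma}\mathbf{B}_0'$ of Theorem 4.6 concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the ratio transformation $g(x_1, x_2) = x_1/x_2$ (so $k = 1$, $p = 2$), the values of $\boldsymbol{\mu}_0$ and $\boldsymbol{\Sigma}$, and the helper names `jacobian` and `delta_cov`. The Jacobian is approximated by central differences, which is a numerical convenience rather than anything prescribed by the theorem.

```python
def jacobian(g, mu, h=1e-6):
    """Approximate the k x p matrix B = [dg_i/dmu_j] at mu by central differences."""
    k, p = len(g(mu)), len(mu)
    B = [[0.0] * p for _ in range(k)]
    for j in range(p):
        up, dn = list(mu), list(mu)
        up[j] += h
        dn[j] -= h
        gu, gd = g(up), g(dn)
        for i in range(k):
            B[i][j] = (gu[i] - gd[i]) / (2 * h)
    return B

def delta_cov(B, sigma):
    """Return the k x k matrix B * Sigma * B' using plain lists."""
    k, p = len(B), len(sigma)
    BS = [[sum(B[i][l] * sigma[l][j] for l in range(p)) for j in range(p)]
          for i in range(k)]
    return [[sum(BS[i][l] * B[j][l] for l in range(p)) for j in range(k)]
            for i in range(k)]

# Hypothetical inputs: ratio of two components, with made-up mu0 and Sigma.
g = lambda x: [x[0] / x[1]]
mu0 = [1.0, 2.0]
sigma = [[1.0, 0.5], [0.5, 2.0]]

B0 = jacobian(g, mu0)          # analytically [1/mu2, -mu1/mu2^2] = [0.5, -0.25]
cov = delta_cov(B0, sigma)     # limiting variance of sqrt(n) * (g(X_n) - g(mu0))
print(B0, cov)
```

For these inputs the analytic value is $0.25 \cdot 1 + 2(0.5)(-0.25)(0.5) + 0.0625 \cdot 2 = 0.25$, which the numerical result should match closely.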

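EXERCISES

4.1. Let $\{\mathbf{X}_n\}$ be a sequence of $p$-dimensional random vectors. Show that

$$\mathbf{X}_n \xrightarrow{D} N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \ \text{ if and only if } \ \mathbf{a}'\mathbf{X}_n \xrightarrow{D} N_1(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}),$$

for all vectors $\mathbf{a} \in R^p$.

4.2. Let $X_1, \ldots, X_n$ be a random sample from a uniform$(a, b)$ distribution. Let $Y_1 = \min X_i$ and let $Y_2 = \max X_i$. Show that $(Y_1, Y_2)'$ converges in probability to the vector $(a, b)'$.

4.3. Let $\mathbf{X}_n$ and $\mathbf{Y}_n$ be $p$-dimensional random vectors. Show that if

$$\mathbf{X}_n - \mathbf{Y}_n \xrightarrow{P} \mathbf{0} \ \text{ and } \ \mathbf{X}_n \xrightarrow{D} \mathbf{X},$$

where $\mathbf{X}$ is a $p$-dimensional random vector, then $\mathbf{Y}_n \xrightarrow{D} \mathbf{X}$.

4.4. Let $\mathbf{X}_n$ and $\mathbf{Y}_n$ be $p$-dimensional random vectors such that $\mathbf{X}_n$ and $\mathbf{Y}_n$ are independent for each $n$ and their mgfs exist. Show that if

$$\mathbf{X}_n \xrightarrow{D} \mathbf{X} \ \text{ and } \ \mathbf{Y}_n \xrightarrow{D} \mathbf{Y},$$

where $\mathbf{X}$ and $\mathbf{Y}$ are $p$-dimensional random vectors, then $(\mathbf{X}_n, \mathbf{Y}_n) \xrightarrow{D} (\mathbf{X}, \mathbf{Y})$.

4.5. Suppose $\mathbf{X}_n$ has a $N_p(\boldsymbol{\mu}_n, \boldsymbol{\Sigma}_n)$ distribution. Show that

$$\mathbf{X}_n \xrightarrow{D} N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \ \text{ iff } \ \boldsymbol{\mu}_n \to \boldsymbol{\mu} \text{ and } \boldsymbol{\Sigma}_n \to \boldsymbol{\Sigma}.$$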

Answers to Selected Exercises

1.7 No; $Y_n - \frac{1}{n}$.
2.1 Degenerate at $\mu$.
2.2 Gamma($\alpha = 1$, $\beta = 1$).
2.3 Gamma($\alpha = 1$, $\beta = 1$).
2.4 Gamma($\alpha = 2$, $\beta = 1$).
2.7 Degenerate at $\beta$.
2.9 0.682.
2.10 (b) 0.815.
2.13 Degenerate at $\mu_2 + \frac{\sigma_2}{\sigma_1}(x - \mu_1)$.
2.14 (b) $N(0, 1)$.
2.16 (b) $N(0, 1)$.
2.19 $\frac{1}{5}$.
3.2 0.954.
3.3 0.604.
3.4 0.840.
3.5 0.728.
3.7 0.08.
3.9 0.267.

Maximum Likelihood Methods

1 Maximum Likelihood Estimation

We can introduce maximum likelihood estimates (mle) as a point estimation procedure. We continue this development by showing that these likelihood procedures give rise to a formal theory of statistical inference (confidence and testing procedures). Under certain conditions (regularity conditions), these procedures are asymptotically optimal.

Consider a random variable $X$ whose pdf $f(x; \theta)$ depends on an unknown parameter $\theta$ which is in a set $\Omega$. Our general discussion is for the continuous case, but the results extend to the discrete case also. For information, we have a random sample $X_1, \ldots, X_n$ on $X$; that is, $X_1, \ldots, X_n$ are iid random variables with common pdf $f(x; \theta)$, $\theta \in \Omega$. For now, we assume that $\theta$ is a scalar, but we can extend the results to vectors. The parameter $\theta$ is unknown. The basis of our inferential procedures is the likelihood function given by

$$L(\theta; \mathbf{x}) = \prod_{i=1}^{n} f(x_i; \theta), \quad \theta \in \Omega, \tag{1.1}$$

where $\mathbf{x} = (x_1, \ldots, x_n)'$. Because we treat $L$ as a function of $\theta$ in this chapter, we have transposed the $x_i$ and $\theta$ in the argument of the likelihood function. In fact, we often write it as $L(\theta)$. Actually, the log of this function is usually more convenient to use and we denote it by

$$l(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta), \quad \theta \in \Omega. \tag{1.2}$$

Note that there is no loss of information in using $l(\theta)$ because the log is a one-to-one function. Most of our discussion in this chapter remains the same if $X$ is a random vector. Although we generally consider $X$ to be a univariate random variable, for several of our examples it is a random vector.

From Chapter 6 of Introduction to Mathematical Statistics, Seventh Edition. Robert V. Hogg, Joseph W. McKean, Allen T. Craig. Copyright © 2013 by Pearson Education, Inc. All rights reserved.

Our point estimator of $\theta$ is $\hat{\theta} = \hat{\theta}(X_1, \ldots, X_n)$, where $\hat{\theta}$ maximizes the function $L(\theta)$. We call $\hat{\theta}$ the maximum likelihood estimator (mle) of $\theta$. The binomial and normal probability models are motivating examples. Later we give several more examples, but first we offer a theoretical justification for considering the mle. Let $\theta_0$ denote the true value of $\theta$. Theorem 1.1 shows that the maximum of $L(\theta)$ asymptotically separates the true model at $\theta_0$ from models at $\theta \ne \theta_0$. To prove this theorem, we make certain assumptions, usually called regularity conditions.

Assumptions 1.1 (Regularity Conditions). Regularity conditions (R0)–(R2) are given by

(R0) The pdfs are distinct; i.e., $\theta \ne \theta' \Rightarrow f(x_i; \theta) \ne f(x_i; \theta')$.
(R1) The pdfs have common support for all $\theta$.
(R2) The point $\theta_0$ is an interior point in $\Omega$.

The first assumption states that the parameter identifies the pdf. The second assumption implies that the support of $X_i$ does not depend on $\theta$. This is restrictive, and some examples and exercises cover models in which (R1) is not true.

Theorem 1.1. Let $\theta_0$ be the true parameter. Under assumptions (R0) and (R1),

$$\lim_{n\to\infty} P_{\theta_0}[L(\theta_0, \mathbf{X}) > L(\theta, \mathbf{X})] = 1, \quad \text{for all } \theta \ne \theta_0. \tag{1.3}$$

Proof: By taking logs, the inequality $L(\theta_0, \mathbf{X}) > L(\theta, \mathbf{X})$ is equivalent to

$$\frac{1}{n}\sum_{i=1}^{n} \log\left[\frac{f(X_i; \theta)}{f(X_i; \theta_0)}\right] < 0.$$

Since the summands are iid with finite expectation and the function $\phi(x) = -\log(x)$ is strictly convex, it follows from the Law of Large Numbers and Jensen's inequality that, when $\theta_0$ is the true parameter,

$$\frac{1}{n}\sum_{i=1}^{n} \log\left[\frac{f(X_i; \theta)}{f(X_i; \theta_0)}\right] \xrightarrow{P} E_{\theta_0}\left[\log\frac{f(X_1; \theta)}{f(X_1; \theta_0)}\right] < \log E_{\theta_0}\left[\frac{f(X_1; \theta)}{f(X_1; \theta_0)}\right].$$

But

$$E_{\theta_0}\left[\frac{f(X_1; \theta)}{f(X_1; \theta_0)}\right] = \int \frac{f(x; \theta)}{f(x; \theta_0)}\, f(x; \theta_0)\, dx = 1.$$

Because $\log 1 = 0$, the theorem follows. Note that common support is needed to obtain the last equalities.

Theorem 1.1 says that asymptotically the likelihood function is maximized at the true value $\theta_0$. So in considering estimates of $\theta_0$, it seems natural to consider the value of $\theta$ which maximizes the likelihood.
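Theorem 1.1 can be illustrated by simulation. The Python sketch below is an illustration under assumed choices, none of which come from the text: a $N(\theta, 1)$ model, true value $\theta_0 = 0$, a single alternative $\theta = 0.5$, and the hypothetical helper names `log_f` and `prob_true_wins`. It estimates $P_{\theta_0}[L(\theta_0, \mathbf{X}) > L(\theta, \mathbf{X})]$ for increasing $n$.

```python
import random

def log_f(x, theta):
    """Log of the N(theta, 1) density, dropping the constant that cancels in ratios."""
    return -0.5 * (x - theta) ** 2

def prob_true_wins(n, theta0=0.0, theta=0.5, reps=1000, seed=7):
    """Monte Carlo estimate of P[L(theta0; X) > L(theta; X)] when theta0 is true."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(reps):
        xs = [rng.gauss(theta0, 1.0) for _ in range(n)]
        l0 = sum(log_f(x, theta0) for x in xs)  # log-likelihood at the true value
        l1 = sum(log_f(x, theta) for x in xs)   # log-likelihood at the alternative
        if l0 > l1:
            wins += 1
    return wins / reps

p10, p50, p200 = (prob_true_wins(n) for n in (10, 50, 200))
print(p10, p50, p200)
```

In this assumed normal model the event reduces to $\overline{X} < 0.25$, so the exact probability is $\Phi(0.25\sqrt{n})$, which increases toward 1 as $n$ grows; the simulated proportions should follow the same pattern.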

Definition 1.1 (Maximum Likelihood Estimator). We say that $\hat{\theta} = \hat{\theta}(\mathbf{X})$ is a maximum likelihood estimator (mle) of $\theta$ if

$$\hat{\theta} = \operatorname{Argmax} L(\theta; \mathbf{X}). \tag{1.4}$$

The notation Argmax means that $L(\theta; \mathbf{X})$ achieves its maximum value at $\hat{\theta}$. To determine the mle, we often take the log of the likelihood and determine its critical value; that is, letting $l(\theta) = \log L(\theta)$, the mle solves the equation

$$\frac{\partial l(\theta)}{\partial \theta} = 0. \tag{1.5}$$

This is an example of an estimating equation, which we often label as an EE.

Example 1.1 (Laplace Distribution). Let $X_1, \ldots, X_n$ be iid with density

$$f(x; \theta) = \frac{1}{2} e^{-|x - \theta|}, \quad -\infty < x < \infty,\ -\infty < \theta < \infty. \tag{1.6}$$

This pdf is referred to as either the Laplace or the double exponential distribution. The log of the likelihood simplifies to

$$l(\theta) = -n \log 2 - \sum_{i=1}^{n} |x_i - \theta|.$$

The first partial derivative is

$$l'(\theta) = \sum_{i=1}^{n} \operatorname{sgn}(x_i - \theta), \tag{1.7}$$

where $\operatorname{sgn}(t) = 1$, $0$, or $-1$ depending on whether $t > 0$, $t = 0$, or $t < 0$. Note that we have used $\frac{d}{dt}|t| = \operatorname{sgn}(t)$, which is true unless $t = 0$. Setting equation (1.7) to 0, the solution for $\theta$ is $\operatorname{med}\{x_1, x_2, \ldots, x_n\}$, because the median makes half the terms of the sum in expression (1.7) nonpositive and half nonnegative. Recall that we denote the median of a sample by $Q_2$ (the second quartile of the sample). Hence, $\hat{\theta} = Q_2$ is the mle of $\theta$ for the Laplace pdf (1.6).

There is no guarantee that the mle exists or, if it does, whether it is unique. This is often clear from the application as in the next two examples. Other examples are given in the exercises.

Example 1.2 (Logistic Distribution). Let $X_1, \ldots, X_n$ be iid with density

$$f(x; \theta) = \frac{\exp\{-(x - \theta)\}}{(1 + \exp\{-(x - \theta)\})^2}, \quad -\infty < x < \infty,\ -\infty < \theta < \infty. \tag{1.8}$$

The log of the likelihood simplifies to

$$l(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta) = n\theta - n\overline{x} - 2\sum_{i=1}^{n} \log(1 + \exp\{-(x_i - \theta)\}).$$

Using this, the first partial derivative is

$$l'(\theta) = n - 2\sum_{i=1}^{n} \frac{\exp\{-(x_i - \theta)\}}{1 + \exp\{-(x_i - \theta)\}}. \tag{1.9}$$

Setting this equation to 0 and rearranging terms results in the equation

$$\sum_{i=1}^{n} \frac{\exp\{-(x_i - \theta)\}}{1 + \exp\{-(x_i - \theta)\}} = \frac{n}{2}. \tag{1.10}$$

Although this does not simplify, we can show that equation (1.10) has a unique solution. The derivative of the left side of equation (1.10) simplifies to

$$\frac{\partial}{\partial \theta}\sum_{i=1}^{n} \frac{\exp\{-(x_i - \theta)\}}{1 + \exp\{-(x_i - \theta)\}} = \sum_{i=1}^{n} \frac{\exp\{-(x_i - \theta)\}}{(1 + \exp\{-(x_i - \theta)\})^2} > 0.$$

Thus the left side of equation (1.10) is a strictly increasing function of $\theta$. Finally, the left side of (1.10) approaches 0 as $\theta \to -\infty$ and approaches $n$ as $\theta \to \infty$. Thus equation (1.10) has a unique solution. Also, the second derivative of $l(\theta)$ is strictly negative for all $\theta$; so the solution is a maximum. Having shown that the mle exists and is unique, we can use a numerical method to obtain the solution. In this case, Newton's procedure is useful. We discuss this in general in the next section, at which time we reconsider this example.

Example 1.3. We can discuss the mle of the probability of success $\theta$ for a random sample $X_1, X_2, \ldots, X_n$ from the Bernoulli distribution with pmf

$$p(x) = \begin{cases} \theta^x (1 - \theta)^{1 - x} & x = 0, 1 \\ 0 & \text{elsewhere,} \end{cases}$$

where $0 \le \theta \le 1$. Recall that the mle is $\overline{X}$, the proportion of sample successes. Now suppose that we know in advance that, instead of $0 \le \theta \le 1$, $\theta$ is restricted by the inequalities $0 \le \theta \le 1/3$. If the observations were such that $\overline{x} > 1/3$, then $\overline{x}$ would not be a satisfactory estimate. Since $\frac{\partial l(\theta)}{\partial \theta} > 0$, provided $\theta < \overline{x}$, under the restriction $0 \le \theta \le 1/3$, we can maximize $l(\theta)$ by taking $\hat{\theta} = \min\{\overline{x}, \frac{1}{3}\}$.

The following is an appealing property of maximum likelihood estimates.

Theorem 1.2. Let $X_1, \ldots, X_n$ be iid with the pdf $f(x; \theta)$, $\theta \in \Omega$. For a specified function $g$, let $\eta = g(\theta)$ be a parameter of interest. Suppose $\hat{\theta}$ is the mle of $\theta$. Then $g(\hat{\theta})$ is the mle of $\eta = g(\theta)$.

Proof: First suppose $g$ is a one-to-one function. The likelihood of interest is $L(g(\theta))$, but because $g$ is one-to-one,

$$\max L(g(\theta)) = \max_{\eta = g(\theta)} L(\eta) = \max_{\eta} L(g^{-1}(\eta)).$$

But the maximum occurs when $g^{-1}(\eta) = \hat{\theta}$; i.e., take $\hat{\eta} = g(\hat{\theta})$.

Suppose $g$ is not one-to-one. For each $\eta$ in the range of $g$, define the set (preimage)

$$g^{-1}(\eta) = \{\theta : g(\theta) = \eta\}.$$

The maximum occurs at $\hat{\theta}$ and the domain of $g$ is $\Omega$, which covers $\hat{\theta}$. Hence, $\hat{\theta}$ is in one of these preimages and, in fact, it can only be in one preimage. Hence to maximize $L(\eta)$, choose $\hat{\eta}$ so that $g^{-1}(\hat{\eta})$ is that unique preimage containing $\hat{\theta}$. Then $\hat{\eta} = g(\hat{\theta})$.

Consider another example, where $X_1, \ldots, X_n$ are iid Bernoulli random variables with probability of success $p$. Recall that $\hat{p} = \overline{X}$ is the mle of $p$. In the large sample confidence interval for $p$, an estimate of $p(1 - p)$ is required. By Theorem 1.2, the mle of this quantity is $\hat{p}(1 - \hat{p})$.

We close this section by showing that maximum likelihood estimators, under regularity conditions, are consistent estimators. Recall that $\mathbf{X}' = (X_1, \ldots, X_n)$.

Theorem 1.3. Assume that $X_1, \ldots, X_n$ satisfy the regularity conditions (R0) through (R2), where $\theta_0$ is the true parameter, and further that $f(x; \theta)$ is differentiable with respect to $\theta$ in $\Omega$. Then the likelihood equation,

$$\frac{\partial}{\partial \theta} L(\theta) = 0,$$

or equivalently

$$\frac{\partial}{\partial \theta} l(\theta) = 0,$$

has a solution $\hat{\theta}_n$ such that $\hat{\theta}_n \xrightarrow{P} \theta_0$.

Proof: Because $\theta_0$ is an interior point in $\Omega$, $(\theta_0 - a, \theta_0 + a) \subset \Omega$, for some $a > 0$. Define $S_n$ to be the event

$$S_n = \{\mathbf{X} : l(\theta_0; \mathbf{X}) > l(\theta_0 - a; \mathbf{X})\} \cap \{\mathbf{X} : l(\theta_0; \mathbf{X}) > l(\theta_0 + a; \mathbf{X})\}.$$

By Theorem 1.1, $P(S_n) \to 1$. So we can restrict attention to the event $S_n$. But on $S_n$, $l(\theta)$ has a local maximum, say, $\hat{\theta}_n$, such that $\theta_0 - a < \hat{\theta}_n < \theta_0 + a$ and $l'(\hat{\theta}_n) = 0$. That is,

$$S_n \subset \{\mathbf{X} : |\hat{\theta}_n(\mathbf{X}) - \theta_0| < a\} \cap \{\mathbf{X} : l'(\hat{\theta}_n(\mathbf{X})) = 0\}.$$

Therefore,

$$1 = \lim_{n\to\infty} P(S_n) \le \lim_{n\to\infty} P\left[\{\mathbf{X} : |\hat{\theta}_n(\mathbf{X}) - \theta_0| < a\} \cap \{\mathbf{X} : l'(\hat{\theta}_n(\mathbf{X})) = 0\}\right] \le 1.$$

It follows that for the sequence of solutions $\hat{\theta}_n$, $P[|\hat{\theta}_n - \theta_0| < a] \to 1$.

The only contentious point in the proof is that the sequence of solutions might depend on $a$. But we can always choose a solution "closest" to $\theta_0$ in the following way. For each $n$, the set of all solutions in the interval is bounded; hence, the infimum over solutions closest to $\theta_0$ exists.

Note that this theorem is vague in that it discusses solutions of the equation. If, however, we know that the mle is the unique solution of the equation $l'(\theta) = 0$, then it is consistent. We state this as a corollary:

Corollary 1.1. Assume that $X_1, \ldots, X_n$ satisfy the regularity conditions (R0) through (R2), where $\theta_0$ is the true parameter, and that $f(x; \theta)$ is differentiable with respect to $\theta$ in $\Omega$. Suppose the likelihood equation has the unique solution $\hat{\theta}_n$. Then $\hat{\theta}_n$ is a consistent estimator of $\theta_0$.
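Example 1.2 showed that the left side of the logistic likelihood equation (1.10) is strictly increasing in $\theta$ and runs from 0 to $n$, so the equation has exactly one root. The Python sketch below locates that root by bisection, a simple substitute chosen here for illustration; it is not the Newton procedure that the text defers to the next section. The function names `lhs` and `logistic_mle` and the data in `sample` are hypothetical. The bracket $[\min x_i, \max x_i]$ works because each summand is at most $1/2$ when $\theta \le x_i$ and at least $1/2$ when $\theta \ge x_i$.

```python
import math

def lhs(theta, xs):
    """Left side of (1.10), using exp{-(x-theta)}/(1+exp{-(x-theta)}) = 1/(1+exp(x-theta))."""
    return sum(1.0 / (1.0 + math.exp(x - theta)) for x in xs)

def logistic_mle(xs, tol=1e-10):
    """Solve lhs(theta) = n/2 by bisection on [min(xs), max(xs)]."""
    target = len(xs) / 2.0
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lhs(mid, xs) < target:
            lo = mid   # lhs is increasing, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

sample = [0.0, 0.5, 1.0, 1.5, 2.0]  # hypothetical data, symmetric about 1.0
theta_hat = logistic_mle(sample)
print(theta_hat)
```

Because the made-up sample is symmetric about 1.0, the summands pair off to 1 at $\theta = 1$ and the root is $\hat{\theta} = 1$, which gives a simple correctness check.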

EXERCISES

1.1. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $\Gamma(\alpha = 3, \beta = \theta)$ distribution, $0 < \theta < \infty$. Determine the mle of $\theta$.

1.2. Let $X_1, X_2, \ldots, X_n$ represent a random sample from each of the distributions having the following pdfs:

(a) $f(x; \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, $0 < \theta < \infty$, zero elsewhere.
(b) $f(x; \theta) = e^{-(x - \theta)}$, $\theta \le x < \infty$, $-\infty < \theta < \infty$, zero elsewhere. Note this is a nonregular case.

In each case find the mle $\hat{\theta}$ of $\theta$.

1.3. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from a distribution with pdf $f(x; \theta) = 1$, $\theta - \frac{1}{2} \le x \le \theta + \frac{1}{2}$, $-\infty < \theta < \infty$, zero elsewhere. Note this is a nonregular case. Show that every statistic $u(X_1, X_2, \ldots, X_n)$ such that

$$Y_n - \tfrac{1}{2} \le u(X_1, X_2, \ldots, X_n) \le Y_1 + \tfrac{1}{2}$$

is a mle of $\theta$. In particular, $(4Y_1 + 2Y_n + 1)/6$, $(Y_1 + Y_n)/2$, and $(2Y_1 + 4Y_n - 1)/6$ are three such statistics. Thus, uniqueness is not, in general, a property of a mle.

1.4. Suppose $X_1, \ldots, X_n$ are iid with pdf $f(x; \theta) = 2x/\theta^2$, $0 < x \le \theta$, zero elsewhere. Note this is a nonregular case. Find:

(a) The mle $\hat{\theta}$ for $\theta$.
(b) The constant $c$ so that $E(c\hat{\theta}) = \theta$.
(c) The mle for the median of the distribution.

1.5. Suppose $X_1, X_2, \ldots, X_n$ are iid with pdf $f(x; \theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, zero elsewhere. Find the mle of $P(X \le 2)$.

1.6. Let the table

x          0   1   2   3   4   5
Frequency  6  10  14  13   6   1

represent a summary of a sample of size 50 from a binomial distribution having $n = 5$. Find the mle of $P(X \ge 3)$.

1.7. Let $X_1, X_2, X_3, X_4, X_5$ be a random sample from a Cauchy distribution with median $\theta$, that is, with pdf

$$f(x; \theta) = \frac{1}{\pi}\,\frac{1}{1 + (x - \theta)^2}, \quad -\infty < x < \infty,$$

where $-\infty < \theta < \infty$. If $x_1 = -1.94$, $x_2 = 0.59$, $x_3 = -5.98$, $x_4 = -0.08$, and $x_5 = -0.77$, find by numerical methods the mle of $\theta$.

1.8. Let the table

x          0   1   2   3   4   5
Frequency  7  14  12  13   6   3

represent a summary of a random sample of size 55 from a Poisson distribution. Find the maximum likelihood estimate of $P(X = 2)$.

1.9. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Bernoulli distribution with parameter $p$. If $p$ is restricted so that we know that $\frac{1}{2} \le p \le 1$, find the mle of this parameter.

1.10. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $N(\theta, \sigma^2)$ distribution, where $\sigma^2$ is fixed but $-\infty < \theta < \infty$.

(a) Show that the mle of $\theta$ is $\overline{X}$.
(b) If $\theta$ is restricted by $0 \le \theta < \infty$, show that the mle of $\theta$ is $\hat{\theta} = \max\{0, \overline{X}\}$.

1.11. Let $X_1, X_2, \ldots, X_n$ be a random sample from the Poisson distribution with $0 < \theta \le 2$. Show that the mle of $\theta$ is $\hat{\theta} = \min\{\overline{X}, 2\}$.

1.12. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with one of two pdfs. If $\theta = 1$, then $f(x; \theta = 1) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, $-\infty < x < \infty$. If $\theta = 2$, then $f(x; \theta = 2) = 1/[\pi(1 + x^2)]$, $-\infty < x < \infty$. Find the mle of $\theta$.

2 Rao–Cramér Lower Bound and Efficiency

In this section, we establish a remarkable inequality called the Rao–Cramér lower bound, which gives a lower bound on the variance of any unbiased estimate. We then show that, under regularity conditions, the variances of the maximum likelihood estimates achieve this lower bound asymptotically.

As in the last section, let $X$ be a random variable with pdf $f(x; \theta)$, $\theta \in \Omega$, where the parameter space $\Omega$ is an open interval. In addition to the regularity conditions (1.1) of Section 1, for the following derivations, we require two more regularity conditions, namely,

Assumptions 2.1 (Additional Regularity Conditions). Regularity conditions (R3) and (R4) are given by

(R3) The pdf $f(x; \theta)$ is twice differentiable as a function of $\theta$.
(R4) The integral $\int f(x; \theta)\, dx$ can be differentiated twice under the integral sign as a function of $\theta$.

Note that conditions (R1)–(R4) mean that the parameter $\theta$ does not appear in the endpoints of the interval in which $f(x; \theta) > 0$ and that we can interchange integration and differentiation with respect to $\theta$. Our derivation is for the continuous case, but the discrete case can be handled in a similar manner. We begin with the identity

$$1 = \int_{-\infty}^{\infty} f(x; \theta)\, dx.$$

Taking the derivative with respect to $\theta$ results in

$$0 = \int_{-\infty}^{\infty} \frac{\partial f(x; \theta)}{\partial \theta}\, dx.$$

The latter expression can be rewritten as

$$0 = \int_{-\infty}^{\infty} \frac{\partial f(x; \theta)/\partial \theta}{f(x; \theta)}\, f(x; \theta)\, dx,$$

or, equivalently,

$$0 = \int_{-\infty}^{\infty} \frac{\partial \log f(x; \theta)}{\partial \theta}\, f(x; \theta)\, dx. \tag{2.1}$$

Writing this last equation as an expectation, we have established

$$E\left[\frac{\partial \log f(X; \theta)}{\partial \theta}\right] = 0; \tag{2.2}$$

that is, the mean of the random variable $\frac{\partial \log f(X; \theta)}{\partial \theta}$ is 0. If we differentiate (2.1) again, it follows that

$$0 = \int_{-\infty}^{\infty} \frac{\partial^2 \log f(x; \theta)}{\partial \theta^2}\, f(x; \theta)\, dx + \int_{-\infty}^{\infty} \frac{\partial \log f(x; \theta)}{\partial \theta}\,\frac{\partial \log f(x; \theta)}{\partial \theta}\, f(x; \theta)\, dx. \tag{2.3}$$

The second term of the right side of this equation can be written as an expectation, which we call Fisher information and we denote it by $I(\theta)$; that is,

$$I(\theta) = \int_{-\infty}^{\infty} \frac{\partial \log f(x; \theta)}{\partial \theta}\,\frac{\partial \log f(x; \theta)}{\partial \theta}\, f(x; \theta)\, dx = E\left[\left(\frac{\partial \log f(X; \theta)}{\partial \theta}\right)^2\right]. \tag{2.4}$$

From equation (2.3), we see that $I(\theta)$ can be computed from

$$I(\theta) = -\int_{-\infty}^{\infty} \frac{\partial^2 \log f(x; \theta)}{\partial \theta^2}\, f(x; \theta)\, dx = -E\left[\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2}\right]. \tag{2.5}$$

Using equation (2.2), Fisher information is the variance of the random variable $\frac{\partial \log f(X; \theta)}{\partial \theta}$; i.e.,

$$I(\theta) = \mathrm{Var}\left(\frac{\partial \log f(X; \theta)}{\partial \theta}\right). \tag{2.6}$$

Usually, expression (2.5) is easier to compute than expression (2.4).

Remark 2.1. Note that the information is the weighted mean of either

$$\left[\frac{\partial \log f(x; \theta)}{\partial \theta}\right]^2 \quad \text{or} \quad -\frac{\partial^2 \log f(x; \theta)}{\partial \theta^2},$$

where the weights are given by the pdf $f(x; \theta)$. That is, the greater these derivatives are on the average, the more information that we get about $\theta$. Clearly, if they were equal to zero [so that $\theta$ would not be in $\log f(x; \theta)$], there would be zero information about $\theta$. The important function

$$\frac{\partial \log f(x; \theta)}{\partial \theta}$$

is called the score function. Recall that it determines the estimating equations for the mle; that is, the mle $\hat{\theta}$ solves

$$\sum_{i=1}^{n} \frac{\partial \log f(x_i; \theta)}{\partial \theta} = 0$$

for $\theta$.

Example 2.1 (Information for a Bernoulli Random Variable). Let $X$ be Bernoulli $b(1, \theta)$. Thus

$$\log f(x; \theta) = x \log \theta + (1 - x)\log(1 - \theta),$$

$$\frac{\partial \log f(x; \theta)}{\partial \theta} = \frac{x}{\theta} - \frac{1 - x}{1 - \theta},$$

$$\frac{\partial^2 \log f(x; \theta)}{\partial \theta^2} = -\frac{x}{\theta^2} - \frac{1 - x}{(1 - \theta)^2}.$$

Clearly,

$$I(\theta) = -E\left[\frac{-X}{\theta^2} - \frac{1 - X}{(1 - \theta)^2}\right] = \frac{\theta}{\theta^2} + \frac{1 - \theta}{(1 - \theta)^2} = \frac{1}{\theta} + \frac{1}{1 - \theta} = \frac{1}{\theta(1 - \theta)},$$

which is larger for $\theta$ values close to zero or one.

Example 2.2 (Information for a Location Family). Consider a random sample $X_1, \ldots, X_n$ such that

$$X_i = \theta + e_i, \quad i = 1, \ldots, n, \tag{2.7}$$

where $e_1, e_2, \ldots, e_n$ are iid with common pdf $f(x)$ and with support $(-\infty, \infty)$. Then the common pdf of $X_i$ is $f_X(x; \theta) = f(x - \theta)$. We call model (2.7) a location model. Assume that $f(x)$ satisfies the regularity conditions. Then the information is

$$I(\theta) = \int_{-\infty}^{\infty} \left(\frac{f'(x - \theta)}{f(x - \theta)}\right)^2 f(x - \theta)\, dx = \int_{-\infty}^{\infty} \left(\frac{f'(z)}{f(z)}\right)^2 f(z)\, dz, \tag{2.8}$$

where the last equality follows from the transformation $z = x - \theta$. Hence, in the location model, the information does not depend on $\theta$.

As an illustration, reconsider Example 1.1 concerning the Laplace distribution. Let $X_1, X_2, \ldots, X_n$ be a random sample from this distribution. Then it follows that $X_i$ can be expressed as

$$X_i = \theta + e_i, \tag{2.9}$$

where $e_1, \ldots, e_n$ are iid with common pdf $f(z) = 2^{-1}\exp\{-|z|\}$, for $-\infty < z < \infty$. As we did in Example 1.1, use $\frac{d}{dz}|z| = \operatorname{sgn}(z)$. Then $f'(z) = -2^{-1}\operatorname{sgn}(z)\exp\{-|z|\}$ and, hence, $[f'(z)/f(z)]^2 = [-\operatorname{sgn}(z)]^2 = 1$, so that

$$I(\theta) = \int_{-\infty}^{\infty} \left(\frac{f'(z)}{f(z)}\right)^2 f(z)\, dz = \int_{-\infty}^{\infty} f(z)\, dz = 1. \tag{2.10}$$

Note that the Laplace pdf does not satisfy the regularity conditions, but this argument can be made rigorous; see Huber (1981).

From (2.6), for a sample of size 1, say $X_1$, Fisher information is the variance of the random variable $\frac{\partial \log f(X_1; \theta)}{\partial \theta}$. What about a sample of size $n$? Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution having pdf $f(x; \theta)$. The likelihood $L(\theta)$ is the pdf of the random sample, and the random variable whose variance is the information in the sample is given by

$$\frac{\partial \log L(\theta, \mathbf{X})}{\partial \theta} = \sum_{i=1}^{n} \frac{\partial \log f(X_i; \theta)}{\partial \theta}.$$

The summands are iid with common variance $I(\theta)$. Hence the information in the sample is

$$\mathrm{Var}\left(\frac{\partial \log L(\theta, \mathbf{X})}{\partial \theta}\right) = nI(\theta). \tag{2.11}$$

Thus the information in a random sample of size $n$ is $n$ times the information in a sample of size 1. So, in Example 2.1, the Fisher information in a random sample of size $n$ from a Bernoulli $b(1, \theta)$ distribution is $n/[\theta(1 - \theta)]$. We are now ready to obtain the Rao–Cramér lower bound, which we state as a theorem.
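The two expressions for Fisher information, (2.4) and (2.5), can be compared directly for the Bernoulli model of Example 2.1, where the expectation is a two-term sum over $x \in \{0, 1\}$. The Python sketch below is an illustration only; the function names are hypothetical, and the formulas for the score and its derivative are the ones derived in Example 2.1.

```python
def pmf(x, theta):
    """Bernoulli b(1, theta) pmf for x in {0, 1}."""
    return theta**x * (1 - theta) ** (1 - x)

def score(x, theta):
    """First derivative of log f(x; theta) with respect to theta."""
    return x / theta - (1 - x) / (1 - theta)

def second_deriv(x, theta):
    """Second derivative of log f(x; theta) with respect to theta."""
    return -x / theta**2 - (1 - x) / (1 - theta) ** 2

def info_from_score(theta):
    """Expression (2.4): E[(score)^2]."""
    return sum(score(x, theta) ** 2 * pmf(x, theta) for x in (0, 1))

def info_from_second(theta):
    """Expression (2.5): -E[second derivative]."""
    return -sum(second_deriv(x, theta) * pmf(x, theta) for x in (0, 1))

for theta in (0.1, 0.3, 0.5):
    closed_form = 1.0 / (theta * (1 - theta))
    print(theta, info_from_score(theta), info_from_second(theta), closed_form)

# Information in a sample of size n is n * I(theta), as in (2.11).
n = 20
sample_info = n * info_from_second(0.3)
```

Both routes should reproduce $1/[\theta(1 - \theta)]$ up to floating-point rounding, and multiplying by $n$ gives the sample information $n/[\theta(1 - \theta)]$.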

Theorem 2.1 (Rao–Cramér Lower Bound). Let $X_1, \ldots, X_n$ be iid with common pdf $f(x; \theta)$ for $\theta \in \Omega$. Assume that the regularity conditions (R0)–(R4) hold. Let $Y = u(X_1, X_2, \ldots, X_n)$ be a statistic with mean $E(Y) = E[u(X_1, X_2, \ldots, X_n)] = k(\theta)$. Then

$$\mathrm{Var}(Y) \ge \frac{[k'(\theta)]^2}{nI(\theta)}. \tag{2.12}$$

Proof: The proof is for the continuous case, but the proof for the discrete case is quite similar. Write the mean of $Y$ as

$$k(\theta) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_1, \ldots, x_n)\, f(x_1; \theta) \cdots f(x_n; \theta)\, dx_1 \cdots dx_n.$$

Differentiating with respect to $\theta$, we obtain

$$k'(\theta) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_1, x_2, \ldots, x_n) \left[\sum_{i=1}^{n} \frac{1}{f(x_i; \theta)}\,\frac{\partial f(x_i; \theta)}{\partial \theta}\right] f(x_1; \theta) \cdots f(x_n; \theta)\, dx_1 \cdots dx_n$$

$$= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x_1, x_2, \ldots, x_n) \left[\sum_{i=1}^{n} \frac{\partial \log f(x_i; \theta)}{\partial \theta}\right] f(x_1; \theta) \cdots f(x_n; \theta)\, dx_1 \cdots dx_n. \tag{2.13}$$

Define the random variable $Z$ by $Z = \sum_{i=1}^{n} [\partial \log f(X_i; \theta)/\partial \theta]$. We know from (2.2) and (2.11) that $E(Z) = 0$ and $\mathrm{Var}(Z) = nI(\theta)$, respectively. Also, equation (2.13) can be expressed in terms of expectation as $k'(\theta) = E(YZ)$. Hence we have

$$k'(\theta) = E(YZ) = E(Y)E(Z) + \rho\,\sigma_Y \sqrt{nI(\theta)},$$

where $\rho$ is the correlation coefficient between $Y$ and $Z$. Using $E(Z) = 0$, this simplifies to

$$\rho = \frac{k'(\theta)}{\sigma_Y \sqrt{nI(\theta)}}.$$

Because $\rho^2 \le 1$, we have

$$\frac{[k'(\theta)]^2}{\sigma_Y^2\, nI(\theta)} \le 1,$$

which, upon rearrangement, is the desired result.

Corollary 2.1. Under the assumptions of Theorem 2.1, if $Y = u(X_1, \ldots, X_n)$ is an unbiased estimator of $\theta$, so that $k(\theta) = \theta$, then the Rao–Cramér inequality becomes

$$\mathrm{Var}(Y) \ge \frac{1}{nI(\theta)}.$$

Consider the Bernoulli model with probability of success $\theta$ which was treated in Example 2.1. In the example we showed that $1/[nI(\theta)] = \theta(1 - \theta)/n$. Recall that the mle of $\theta$ is $\overline{X}$. The mean and variance of a Bernoulli($\theta$) distribution are $\theta$ and $\theta(1 - \theta)$, respectively. Hence the mean and variance of $\overline{X}$ are $\theta$ and $\theta(1 - \theta)/n$, respectively. That is, in this case the variance of the mle has attained the Rao–Cramér lower bound.

We now make the following definitions.

Definition 2.1 (Efficient Estimator). Let $Y$ be an unbiased estimator of a parameter $\theta$ in the case of point estimation. The statistic $Y$ is called an efficient estimator of $\theta$ if and only if the variance of $Y$ attains the Rao–Cramér lower bound.

Definition 2.2 (Efficiency). In cases in which we can differentiate with respect to a parameter under an integral or summation symbol, the ratio of the Rao–Cramér lower bound to the actual variance of any unbiased estimator of a parameter is called the efficiency of that estimator.

Example 2.3 (Poisson($\theta$) Distribution). Let $X_1, X_2, \ldots, X_n$ denote a random sample from a Poisson distribution that has the mean $\theta > 0$. It is known that $\overline{X}$ is an mle of $\theta$; we shall show that it is also an efficient estimator of $\theta$. We have

$$\frac{\partial \log f(x; \theta)}{\partial \theta} = \frac{\partial}{\partial \theta}(x \log \theta - \theta - \log x!) = \frac{x}{\theta} - 1 = \frac{x - \theta}{\theta}.$$

Accordingly,

$$E\left[\left(\frac{\partial \log f(X; \theta)}{\partial \theta}\right)^2\right] = \frac{E(X - \theta)^2}{\theta^2} = \frac{\sigma^2}{\theta^2} = \frac{\theta}{\theta^2} = \frac{1}{\theta}.$$

The Rao–Cramér lower bound in this case is $1/[n(1/\theta)] = \theta/n$. But $\theta/n$ is the variance of $\overline{X}$. Hence $\overline{X}$ is an efficient estimator of $\theta$.

Example 2.4 (Beta($\theta$, 1) Distribution). Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n > 2$ from a distribution with pdf

$$f(x; \theta) = \begin{cases} \theta x^{\theta - 1} & \text{for } 0 < x < 1 \\ 0 & \text{elsewhere,} \end{cases} \tag{2.14}$$

where the parameter space is $\Omega = (0, \infty)$. This is the beta distribution, with parameters $\theta$ and 1, which we denote by beta($\theta$, 1). The derivative of the log of $f$ is

$$\frac{\partial \log f}{\partial \theta} = \log x + \frac{1}{\theta}. \tag{2.15}$$

From this we have $\partial^2 \log f/\partial \theta^2 = -\theta^{-2}$. Hence the information is $I(\theta) = \theta^{-2}$.


Next, we find the mle of $\theta$ and investigate its efficiency. The log of the likelihood function is

$$l(\theta) = \theta\sum_{i=1}^{n}\log x_i - \sum_{i=1}^{n}\log x_i + n\log\theta.$$

The first partial of $l(\theta)$ is

$$\frac{\partial l(\theta)}{\partial\theta} = \sum_{i=1}^{n}\log x_i + \frac{n}{\theta}. \tag{2.16}$$

Setting this to 0 and solving for $\theta$, the mle is $\hat\theta = -n/\sum_{i=1}^{n}\log X_i$. To obtain the distribution of $\hat\theta$, let $Y_i = -\log X_i$. A straight transformation argument shows that the distribution of $Y_i$ is $\Gamma(1, 1/\theta)$. Because the $X_i$s are independent, it follows that $W = \sum_{i=1}^{n} Y_i$ is $\Gamma(n, 1/\theta)$. The moments of $W$ are

$$E[W^k] = \frac{(n+k-1)!}{\theta^k (n-1)!}, \tag{2.17}$$

for $k > -n$. So, in particular for $k = -1$, we get

$$E[\hat\theta] = nE[W^{-1}] = \theta\,\frac{n}{n-1}.$$

Hence, $\hat\theta$ is biased, but the bias vanishes as $n \to \infty$. Also, note that the estimator $[(n-1)/n]\hat\theta$ is unbiased. For $k = -2$, we get

$$E[\hat\theta^2] = n^2 E[W^{-2}] = \theta^2\,\frac{n^2}{(n-1)(n-2)},$$

and, hence, after simplifying $E(\hat\theta^2) - [E(\hat\theta)]^2$, we obtain

$$\mathrm{Var}(\hat\theta) = \theta^2\,\frac{n^2}{(n-1)^2(n-2)}.$$

From this, we can obtain the variance of the unbiased estimator $[(n-1)/n]\hat\theta$, i.e.,

$$\mathrm{Var}\left(\frac{n-1}{n}\,\hat\theta\right) = \frac{\theta^2}{n-2}.$$

From above, the information is $I(\theta) = \theta^{-2}$ and, hence, the variance of an unbiased efficient estimator is $\theta^2/n$. Because $\theta^2/(n-2) > \theta^2/n$, the unbiased estimator $[(n-1)/n]\hat\theta$ is not efficient. Notice, though, that its efficiency (as in Definition 2.2) converges to 1 as $n \to \infty$. Later in this section, we say that $[(n-1)/n]\hat\theta$ is asymptotically efficient.

In the above examples, we were able to obtain the mles in closed form along with their distributions and, hence, moments. This is often not the case. Maximum likelihood estimators, however, have an asymptotic normal distribution. In fact, mles are asymptotically efficient. To prove these assertions, we need the additional regularity condition given by


Assumptions 2.2 (Additional Regularity Condition). Regularity condition (R5) is

(R5) The pdf $f(x;\theta)$ is three times differentiable as a function of $\theta$. Further, for all $\theta \in \Omega$, there exist a constant $c$ and a function $M(x)$ such that

$$\left|\frac{\partial^3}{\partial\theta^3}\log f(x;\theta)\right| \le M(x),$$

with $E_{\theta_0}[M(X)] < \infty$, for all $\theta_0 - c < \theta < \theta_0 + c$ and all $x$ in the support of $X$.

Theorem 2.2. Assume $X_1, \ldots, X_n$ are iid with pdf $f(x;\theta_0)$ for $\theta_0 \in \Omega$ such that the regularity conditions (R0)–(R5) are satisfied. Suppose further that the Fisher information satisfies $0 < I(\theta_0) < \infty$. Then any consistent sequence of solutions of the mle equations satisfies

$$\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{D} N\left(0, \frac{1}{I(\theta_0)}\right). \tag{2.18}$$

Proof: Expanding the function $l'(\theta)$ into a Taylor series of order 2 about $\theta_0$ and evaluating it at $\hat\theta_n$, we get

$$l'(\hat\theta_n) = l'(\theta_0) + (\hat\theta_n - \theta_0)\,l''(\theta_0) + \frac{1}{2}(\hat\theta_n - \theta_0)^2\, l'''(\theta_n^*), \tag{2.19}$$

where $\theta_n^*$ is between $\theta_0$ and $\hat\theta_n$. But $l'(\hat\theta_n) = 0$. Hence, rearranging terms, we obtain

$$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{n^{-1/2}\, l'(\theta_0)}{-n^{-1} l''(\theta_0) - (2n)^{-1}(\hat\theta_n - \theta_0)\, l'''(\theta_n^*)}. \tag{2.20}$$

By the Central Limit Theorem,

$$\frac{1}{\sqrt{n}}\, l'(\theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\partial \log f(X_i;\theta_0)}{\partial\theta} \xrightarrow{D} N(0, I(\theta_0)), \tag{2.21}$$

because the summands are iid with $\mathrm{Var}(\partial \log f(X_i;\theta_0)/\partial\theta) = I(\theta_0) < \infty$. Also, by the Law of Large Numbers,

$$-\frac{1}{n}\, l''(\theta_0) = -\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log f(X_i;\theta_0)}{\partial\theta^2} \xrightarrow{P} I(\theta_0). \tag{2.22}$$

To complete the proof then, we need only show that the second term in the denominator of expression (2.20) goes to zero in probability. Because $\hat\theta_n - \theta_0 \xrightarrow{P} 0$, this follows provided that $n^{-1} l'''(\theta_n^*)$ is bounded in probability. Let $c_0$ be the constant defined in condition (R5). Note that $|\hat\theta_n - \theta_0| < c_0$ implies that $|\theta_n^* - \theta_0| < c_0$, which in turn by condition (R5) implies the following string of inequalities:

$$\left|-\frac{1}{n}\, l'''(\theta_n^*)\right| \le \frac{1}{n}\sum_{i=1}^{n}\left|\frac{\partial^3 \log f(X_i;\theta)}{\partial\theta^3}\right| \le \frac{1}{n}\sum_{i=1}^{n} M(X_i). \tag{2.23}$$


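By condition (R5), $E_{\theta_0}[M(X)] < \infty$; hence, $\frac{1}{n}\sum_{i=1}^{n} M(X_i) \xrightarrow{P} E_{\theta_0}[M(X)]$, by the Law of Large Numbers. For the bound, we select $1 + E_{\theta_0}[M(X)]$. Let $\epsilon > 0$ be given. Choose $N_1$ and $N_2$ so that

$$n \ge N_1 \;\Rightarrow\; P\left[|\hat\theta_n - \theta_0| < c_0\right] \ge 1 - \frac{\epsilon}{2} \tag{2.24}$$

$$n \ge N_2 \;\Rightarrow\; P\left[\left|\frac{1}{n}\sum_{i=1}^{n} M(X_i) - E_{\theta_0}[M(X)]\right| < 1\right] \ge 1 - \frac{\epsilon}{2}. \tag{2.25}$$

It follows from (2.23)–(2.25) that

$$n \ge \max\{N_1, N_2\} \;\Rightarrow\; P\left[\left|-\frac{1}{n}\, l'''(\theta_n^*)\right| \le 1 + E_{\theta_0}[M(X)]\right] \ge 1 - \frac{\epsilon}{2};$$

hence, $n^{-1} l'''(\theta_n^*)$ is bounded in probability.

We next generalize Definitions 2.1 and 2.2 concerning efficiency to the asymptotic case.

Definition 2.3. Let $X_1, \ldots, X_n$ be independent and identically distributed with probability density function $f(x;\theta)$. Suppose $\hat\theta_{1n} = \hat\theta_{1n}(X_1, \ldots, X_n)$ is an estimator of $\theta_0$ such that $\sqrt{n}(\hat\theta_{1n} - \theta_0) \xrightarrow{D} N\left(0, \sigma^2_{\hat\theta_{1n}}\right)$. Then

(a) The asymptotic efficiency of $\hat\theta_{1n}$ is defined to be

$$e(\hat\theta_{1n}) = \frac{1/I(\theta_0)}{\sigma^2_{\hat\theta_{1n}}}. \tag{2.26}$$

(b) The estimator $\hat\theta_{1n}$ is said to be asymptotically efficient if the ratio in part (a) is 1.

(c) Let $\hat\theta_{2n}$ be another estimator such that $\sqrt{n}(\hat\theta_{2n} - \theta_0) \xrightarrow{D} N\left(0, \sigma^2_{\hat\theta_{2n}}\right)$. Then the asymptotic relative efficiency (ARE) of $\hat\theta_{1n}$ to $\hat\theta_{2n}$ is the reciprocal of the ratio of their respective asymptotic variances; i.e.,

$$e(\hat\theta_{1n}, \hat\theta_{2n}) = \frac{\sigma^2_{\hat\theta_{2n}}}{\sigma^2_{\hat\theta_{1n}}}. \tag{2.27}$$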

Hence, by Theorem 2.2, under regularity conditions, maximum likelihood estimators are asymptotically efficient estimators. This is a nice optimality result. Also, if two estimators are asymptotically normal with the same asymptotic mean, then intuitively the estimator with the smaller asymptotic variance would be selected over the other as a better estimator. In this case, the ARE of the selected estimator to the nonselected one is greater than 1.
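Definition 2.3(c) can be explored by simulation. The sketch below is an illustration only, under the assumption of a Laplace location model with error pdf $\frac{1}{2}\exp\{-|z|\}$. It approximates the ARE of the sample median to the sample mean by the ratio of their simulated variances, in the order prescribed by (2.27). The function names, sample size, replication count, and seed are hypothetical choices.

```python
import random
import statistics


def laplace_sample(rng, theta, n):
    """Draw n observations X_i = theta + e_i with Laplace errors.
    The difference of two independent unit exponentials has pdf (1/2)exp(-|z|)."""
    return [theta + rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]


def simulated_are(n=101, reps=2000, theta=0.0, seed=1):
    """Ratio Var(sample mean) / Var(sample median) over repeated samples,
    a finite-sample stand-in for e(median, mean) in (2.27)."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        x = laplace_sample(rng, theta, n)
        means.append(sum(x) / n)
        medians.append(statistics.median(x))
    return statistics.variance(means) / statistics.variance(medians)


print("simulated ARE of median to mean:", simulated_are())
```

A ratio above 1 favors the median. For moderate $n$ the simulated ratio is only a finite-sample approximation to the asymptotic value, so it need not match the limit closely.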


Example 2.5 (ARE of the Sample Median to the Sample Mean). We obtain this ARE under the Laplace and normal distributions. Consider first the Laplace location model as given in expression (2.9); i.e.,

$$X_i = \theta + e_i, \quad i = 1, \ldots, n. \tag{2.28}$$

By Example 1.1, we know that the mle of $\theta$ is the sample median, $Q_2$. By (2.10), the information $I(\theta_0) = 1$ for this distribution; hence, $Q_2$ is asymptotically normal with mean $\theta$ and variance $1/n$. On the other hand, by the Central Limit Theorem, the sample mean $\bar{X}$ is asymptotically normal with mean $\theta$ and variance $\sigma^2/n$, where $\sigma^2 = \mathrm{Var}(X_i) = \mathrm{Var}(e_i + \theta) = \mathrm{Var}(e_i) = E(e_i^2)$. But

$$E(e_i^2) = \int_{-\infty}^{\infty} z^2\, 2^{-1}\exp\{-|z|\}\,dz = \int_{0}^{\infty} z^{3-1}\exp\{-z\}\,dz = \Gamma(3) = 2.$$

Therefore, $\mathrm{ARE}(Q_2, \bar{X}) = \frac{2}{1} = 2$. Thus, if the sample comes from a Laplace distribution, then asymptotically the sample median is twice as efficient as the sample mean.

Next suppose the location model (2.28) holds, except now the pdf of $e_i$ is $N(0, 1)$. Under this model $Q_2$ is asymptotically normal with mean $\theta$ and variance $(\pi/2)/n$. Because the variance of $\bar{X}$ is $1/n$, in this case, $\mathrm{ARE}(Q_2, \bar{X}) = \frac{1}{\pi/2} = 2/\pi \approx 0.637$. Since $\pi/2 \approx 1.57$, asymptotically, $\bar{X}$ is 1.57 times more efficient than $Q_2$ if the sample arises from the normal distribution.

Theorem 2.2 is also a practical result, for it gives us a way of doing inference. The asymptotic standard deviation of the mle $\hat\theta$ is $[nI(\theta_0)]^{-1/2}$. Because $I(\theta)$ is a continuous function of $\theta$ and $\hat\theta_n$ is consistent, it follows that

$$I(\hat\theta_n) \xrightarrow{P} I(\theta_0).$$

Thus we have a consistent estimate of the asymptotic standard deviation of the mle. Based on this result and the discussion of confidence intervals, for a specified $0 < \alpha < 1$, the following interval is an approximate $(1-\alpha)100\%$ confidence interval for $\theta$:

$$\left(\hat\theta_n - z_{\alpha/2}\frac{1}{\sqrt{nI(\hat\theta_n)}},\;\; \hat\theta_n + z_{\alpha/2}\frac{1}{\sqrt{nI(\hat\theta_n)}}\right). \tag{2.29}$$

Remark 2.2. If we use the asymptotic distributions to construct confidence intervals for $\theta$, the fact that $\mathrm{ARE}(Q_2, \bar{X}) = 2$ when the underlying distribution is the Laplace means that $n$ would need to be twice as large for $\bar{X}$ to get the same length confidence interval as we would if we used $Q_2$.

A simple corollary to Theorem 2.2 yields the asymptotic distribution of a function $g(\hat\theta_n)$ of the mle.


Corollary 2.2. Under the assumptions of Theorem 2.2, suppose $g(x)$ is a continuous function of $x$ which is differentiable at $\theta_0$ such that $g'(\theta_0) \ne 0$. Then

$$\sqrt{n}\left(g(\hat\theta_n) - g(\theta_0)\right) \xrightarrow{D} N\left(0, \frac{g'(\theta_0)^2}{I(\theta_0)}\right). \tag{2.30}$$

The proof of this corollary follows immediately from the Δ-method and Theorem 2.2.

The proof of Theorem 2.2 contains an asymptotic representation of $\hat\theta$ which proves useful; hence, we state it as another corollary.

Corollary 2.3. Under the assumptions of Theorem 2.2,

$$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{1}{I(\theta_0)}\,\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\partial \log f(X_i;\theta_0)}{\partial\theta} + R_n, \tag{2.31}$$

where $R_n \xrightarrow{P} 0$.

The proof is just a rearrangement of equation (2.20) and the ensuing results in the proof of Theorem 2.2.

Example 2.6 (Example 2.4, Continued). Let $X_1, \ldots, X_n$ be a random sample having the common pdf (2.14). Recall that $I(\theta) = \theta^{-2}$ and that the mle is $\hat\theta = -n/\sum_{i=1}^{n}\log X_i$. Hence, $\hat\theta$ is approximately normally distributed with mean $\theta$ and variance $\theta^2/n$. Based on this, an approximate $(1-\alpha)100\%$ confidence interval for $\theta$ is

$$\hat\theta \pm z_{\alpha/2}\frac{\hat\theta}{\sqrt{n}}.$$

Recall that we were able to obtain the exact distribution of $\hat\theta$ in this case. As Exercise 2.12 shows, based on this distribution of $\hat\theta$, an exact confidence interval for $\theta$ can be constructed.

In obtaining the mle of $\theta$, we are often in the situation of Example 1.2; that is, we can verify the existence of the mle, but the solution of the equation $l'(\hat\theta) = 0$ cannot be obtained in closed form. In such situations, numerical methods are used. One iterative method that exhibits rapid (quadratic) convergence is Newton's method. The sketch in Figure 2.1 helps recall this method. Suppose $\hat\theta^{(0)}$ is an initial guess at the solution. The next guess (one-step estimate) is the point $\hat\theta^{(1)}$, which is the horizontal intercept of the tangent line to the curve $l'(\theta)$ at the point $(\hat\theta^{(0)}, l'(\hat\theta^{(0)}))$. A little algebra finds

$$\hat\theta^{(1)} = \hat\theta^{(0)} - \frac{l'(\hat\theta^{(0)})}{l''(\hat\theta^{(0)})}. \tag{2.32}$$

We then substitute $\hat\theta^{(1)}$ for $\hat\theta^{(0)}$ and repeat the process. On the figure, trace the second step estimate $\hat\theta^{(2)}$; the process is continued until convergence.
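The iteration (2.32) is short to code. The following is a minimal sketch, not the book's `mlelogistic` routine: it assumes the logistic location model of Example 1.2, with density $f(x;\theta) = e^{-(x-\theta)}/(1+e^{-(x-\theta)})^2$, for which $l'(\theta) = n - 2\sum_i e^{-(x_i-\theta)}/(1+e^{-(x_i-\theta)})$ and $l''(\theta) = -2\sum_i f(x_i;\theta)$. The function names, the stopping rule, and the data values are hypothetical.

```python
import math


def d1(theta, xs):
    """First derivative l'(theta) of the logistic location log-likelihood.
    Uses e^{-z} / (1 + e^{-z}) = 1 / (1 + e^{z}) with z = x - theta."""
    return len(xs) - 2.0 * sum(1.0 / (1.0 + math.exp(x - theta)) for x in xs)


def d2(theta, xs):
    """Second derivative l''(theta) = -2 * sum of e^{-z} / (1 + e^{-z})^2."""
    total = 0.0
    for x in xs:
        p = 1.0 / (1.0 + math.exp(x - theta))
        total += p * (1.0 - p)  # equals e^{-z} / (1 + e^{-z})^2
    return -2.0 * total


def newton_mle(xs, theta0, tol=1e-10, max_iter=50):
    """Repeat the update (2.32) until the step size falls below tol."""
    theta = theta0
    for _ in range(max_iter):
        step = d1(theta, xs) / d2(theta, xs)
        theta -= step
        if abs(step) < tol:
            break
    return theta


xs = [-1.2, 0.3, 0.8, 1.5, 2.1, 2.4, 3.9]  # made-up data for illustration
start = sum(xs) / len(xs)                  # sample mean as the initial guess
theta_hat = newton_mle(xs, start)
print("estimate:", theta_hat, " l'(estimate):", d1(theta_hat, xs))
```

Stopping after a single pass through the loop gives the one-step estimate $\hat\theta^{(1)}$; the printed value of $l'$ at the final estimate should be essentially zero.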


Figure 2.1: Beginning with the starting value $\hat\theta^{(0)}$, the one-step estimate is $\hat\theta^{(1)}$, which is the intersection of the tangent line to the curve $l'(\theta)$ at $\hat\theta^{(0)}$ and the horizontal axis. In the figure, $dl(\theta) = l'(\theta)$.

Example 2.7 (Example 1.2, Continued). Recall Example 1.2, where the random sample $X_1, \ldots, X_n$ has the common logistic density

$$f(x;\theta) = \frac{\exp\{-(x-\theta)\}}{(1+\exp\{-(x-\theta)\})^2}, \quad -\infty < x < \infty,\; -\infty < \theta < \infty. \tag{2.33}$$

We showed that the likelihood equation has a unique solution, though it cannot be obtained in closed form. To use formula (2.32), we need the first and second partial derivatives of $l(\theta)$ and an initial guess. Expression (1.9) of Example 1.2 gives the first partial derivative, from which the second partial is

$$l''(\theta) = -2\sum_{i=1}^{n}\frac{\exp\{-(x_i-\theta)\}}{(1+\exp\{-(x_i-\theta)\})^2}.$$

The logistic distribution is similar to the normal distribution; hence, we can use $\bar{X}$ as our initial guess of $\theta$. The subroutine mlelogistic in Appendix B is an R routine which obtains the k-step estimates.

We close this section with a remarkable fact. The estimate $\hat\theta^{(1)}$ in equation (2.32) is called the one-step estimator. As Exercise 2.13 shows, this estimator has the same asymptotic distribution as the mle [i.e., (2.18)], provided that the


initial guess $\hat\theta^{(0)}$ is a consistent estimator of $\theta$. That is, the one-step estimate is an asymptotically efficient estimate of $\theta$. This is also true of the other iterative steps.

EXERCISES

2.1. Prove that $\bar{X}$, the mean of a random sample of size $n$ from a distribution that is $N(\theta, \sigma^2)$, $-\infty < \theta < \infty$, is, for every known $\sigma^2 > 0$, an efficient estimator of $\theta$.

2.2. Given $f(x;\theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere, with $\theta > 0$, formally compute the reciprocal of

$$nE\left\{\left[\frac{\partial \log f(X;\theta)}{\partial\theta}\right]^2\right\}.$$

Compare this with the variance of $(n+1)Y_n/n$, where $Y_n$ is the largest observation of a random sample of size $n$ from this distribution. Comment.

2.3. Given the pdf

$$f(x;\theta) = \frac{1}{\pi[1+(x-\theta)^2]}, \quad -\infty < x < \infty,\; -\infty < \theta < \infty,$$

show that the Rao–Cramér lower bound is $2/n$, where $n$ is the size of a random sample from this Cauchy distribution. What is the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta)$ if $\hat\theta$ is the mle of $\theta$?

2.4. Consider Example 2.2, where we discussed the location model.

(a) Write the location model when $e_i$ has the logistic pdf given in expression (4.9).

(b) Using expression (2.8), show that the information $I(\theta) = 1/3$ for the model in part (a). Hint: In the integral of expression (2.8), use the substitution $u = (1+e^{-z})^{-1}$. Then $du = f(z)\,dz$.

2.5. Using the same location model as in part (a) of Exercise 2.4, obtain the ARE of the sample median to the mle of the model. Hint: The mle of $\theta$ for this model is discussed in Example 2.7. Furthermore, $Q_2$ is asymptotically normal with asymptotic mean $\theta$ and asymptotic variance $1/(4f^2(0)n)$.

2.6. Consider a location model (Example 2.2) when the error pdf is the contaminated normal with $\epsilon$ as the proportion of contamination and with $\sigma_c^2$ as the variance of the contaminated part. Show that the ARE of the sample median to the sample mean is given by

$$e(Q_2, \bar{X}) = \frac{2[1+\epsilon(\sigma_c^2-1)][1-\epsilon+(\epsilon/\sigma_c)]^2}{\pi}. \tag{2.34}$$

Use the hint in Exercise 2.5 for the median.


(a) If $\sigma_c^2 = 9$, use (2.34) to fill in the following table:

$\epsilon$              0      0.05    0.10    0.15
$e(Q_2, \bar{X})$

(b) Notice from the table that the sample median becomes the "better" estimator when $\epsilon$ increases from 0.10 to 0.15. Determine the value for $\epsilon$ where this occurs [this involves a third-degree polynomial in $\epsilon$, so one way of obtaining the root is to use the Newton algorithm discussed around expression (2.32)].

2.7. Let $X$ have a gamma distribution with $\alpha = 4$ and $\beta = \theta > 0$.

(a) Find the Fisher information $I(\theta)$.

(b) If $X_1, X_2, \ldots, X_n$ is a random sample from this distribution, show that the mle of $\theta$ is an efficient estimator of $\theta$.

(c) What is the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta)$?

2.8. Let $X$ be $N(0, \theta)$, $0 < \theta < \infty$.

(a) Find the Fisher information $I(\theta)$.

(b) If $X_1, X_2, \ldots, X_n$ is a random sample from this distribution, show that the mle of $\theta$ is an efficient estimator of $\theta$.

(c) What is the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta)$?

2.9. If $X_1, X_2, \ldots, X_n$ is a random sample from a distribution with pdf

$$f(x;\theta) = \begin{cases} \dfrac{3\theta^3}{(x+\theta)^4} & 0 < x < \infty,\; 0 < \theta < \infty \\ 0 & \text{elsewhere,} \end{cases}$$

show that $Y = 2\bar{X}$ is an unbiased estimator of $\theta$ and determine its efficiency.

2.10. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $N(0, \theta)$ distribution. We want to estimate the standard deviation $\sqrt{\theta}$. Find the constant $c$ so that $Y = c\sum_{i=1}^{n}|X_i|$ is an unbiased estimator of $\sqrt{\theta}$ and determine its efficiency.

2.11. Let $\bar{X}$ be the mean of a random sample of size $n$ from a $N(\theta, \sigma^2)$ distribution, $-\infty < \theta < \infty$, $\sigma^2 > 0$. Assume that $\sigma^2$ is known. Show that $\bar{X}^2 - \frac{\sigma^2}{n}$ is an unbiased estimator of $\theta^2$ and find its efficiency.

2.12. Recall that $\hat\theta = -n/\sum_{i=1}^{n}\log X_i$ is the mle of $\theta$ for a beta($\theta$, 1) distribution. Also, $W = -\sum_{i=1}^{n}\log X_i$ has the gamma distribution $\Gamma(n, 1/\theta)$.

(a) Show that $2\theta W$ has a $\chi^2(2n)$ distribution.

(b) Using part (a), find $c_1$ and $c_2$ so that

$$P\left(c_1 < \frac{2\theta n}{\hat\theta} < c_2\right) = 1 - \alpha,$$

for $0 < \alpha < 1$. Next, obtain a $(1-\alpha)100\%$ confidence interval for $\theta$.


(c) For $\alpha = 0.05$ and $n = 10$, compare the length of this interval with the length of the interval found in Example 2.6.

2.13. By using expressions (2.21) and (2.22), obtain the result for the one-step estimate discussed at the end of this section.

2.14. Let $S^2$ be the sample variance of a random sample of size $n > 1$ from $N(\mu, \theta)$, $0 < \theta < \infty$, where $\mu$ is known. We know $E(S^2) = \theta$.

(a) What is the efficiency of $S^2$?

(b) Under these conditions, what is the mle $\hat\theta$ of $\theta$?

(c) What is the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta)$?

3 Maximum Likelihood Tests

The last section presented an inference for pointwise estimation and confidence intervals based on likelihood theory. In this section, we present a corresponding inference for testing hypotheses. As in the last section, let $X_1, \ldots, X_n$ be iid with pdf $f(x;\theta)$ for $\theta \in \Omega$. In this section, $\theta$ is a scalar, but in Sections 4 and 5 extensions to the vector-valued case are discussed. Consider the two-sided hypotheses

$$H_0: \theta = \theta_0 \text{ versus } H_1: \theta \ne \theta_0, \tag{3.1}$$

where $\theta_0$ is a specified value. Recall that the likelihood function and its log are given by

$$L(\theta) = \prod_{i=1}^{n} f(X_i;\theta), \qquad l(\theta) = \sum_{i=1}^{n}\log f(X_i;\theta).$$

Let $\hat\theta$ denote the maximum likelihood estimate of $\theta$. To motivate the test, consider Theorem 1.1, which says that if $\theta_0$ is the true value of $\theta$, then, asymptotically, $L(\theta_0)$ is the maximum value of $L(\theta)$. Consider the ratio of two likelihood functions, namely,

$$\Lambda = \frac{L(\theta_0)}{L(\hat\theta)}. \tag{3.2}$$

Note that $\Lambda \le 1$, but if $H_0$ is true, $\Lambda$ should be large (close to 1), while if $H_1$ is true, $\Lambda$ should be smaller. For a specified significance level $\alpha$, this leads to the intuitive decision rule

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \Lambda \le c, \tag{3.3}$$

where $c$ is such that $\alpha = P_{\theta_0}[\Lambda \le c]$. We call it the likelihood ratio test (LRT). Theorem 3.1 derives the asymptotic distribution of $\Lambda$ under $H_0$, but first we look at two examples.


Example 3.1 (Likelihood Ratio Test for the Exponential Distribution). Suppose $X_1, \ldots, X_n$ are iid with pdf $f(x;\theta) = \theta^{-1}\exp\{-x/\theta\}$, for $x, \theta > 0$. Let the hypotheses be given by (3.1). The likelihood function simplifies to

$$L(\theta) = \theta^{-n}\exp\{-(n/\theta)\bar{X}\}.$$

As we know, the mle of $\theta$ is $\bar{X}$. After some simplification, the likelihood ratio test statistic simplifies to

$$\Lambda = e^{n}\left(\frac{\bar{X}}{\theta_0}\right)^{n}\exp\{-n\bar{X}/\theta_0\}. \tag{3.4}$$

The decision rule is to reject $H_0$ if $\Lambda \le c$. But further simplification of the test is possible. Other than the constant $e^n$, the test statistic is of the form

$$g(t) = t^n\exp\{-nt\}, \quad t > 0,$$

where $t = \bar{x}/\theta_0$. Using differential calculus, it is easy to show that $g(t)$ has a unique critical value at 1, i.e., $g'(1) = 0$, and further that $t = 1$ provides a maximum, because $g''(1) < 0$. As Figure 3.1 depicts, $g(t) \le c$ if and only if $t \le c_1$ or $t \ge c_2$. This leads to

$$\Lambda \le c \text{ if and only if } \frac{\bar{X}}{\theta_0} \le c_1 \text{ or } \frac{\bar{X}}{\theta_0} \ge c_2.$$

Note that under the null hypothesis, $H_0$, the statistic $(2/\theta_0)\sum_{i=1}^{n} X_i$ has a $\chi^2$ distribution with $2n$ degrees of freedom. Based on this, the following decision rule results in a level $\alpha$ test:

$$\text{Reject } H_0 \text{ if } (2/\theta_0)\sum_{i=1}^{n} X_i \le \chi^2_{1-\alpha/2}(2n) \text{ or } (2/\theta_0)\sum_{i=1}^{n} X_i \ge \chi^2_{\alpha/2}(2n), \tag{3.5}$$

where $\chi^2_{1-\alpha/2}(2n)$ is the lower $\alpha/2$ quantile of a $\chi^2$ distribution with $2n$ degrees of freedom and $\chi^2_{\alpha/2}(2n)$ is the upper $\alpha/2$ quantile of a $\chi^2$ distribution with $2n$ degrees of freedom. Other choices of $c_1$ and $c_2$ can be made, but these are usually the choices used in practice. Exercise 3.1 investigates the power curve for this test.
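The quantities in Example 3.1 are easy to compute directly. The sketch below is illustrative only: it evaluates $\Lambda$ from (3.4) and turns the equal-tailed rule (3.5) into a p-value, using the closed form of the $\chi^2$ cdf for an even number of degrees of freedom, $P[\chi^2(2n) \le x] = 1 - e^{-x/2}\sum_{k=0}^{n-1}(x/2)^k/k!$. The function names and the data are hypothetical.

```python
import math


def lrt_exponential(xs, theta0):
    """Return (Lambda, -2 log Lambda) for H0: theta = theta0, using (3.4)."""
    n = len(xs)
    t = (sum(xs) / n) / theta0                # t = xbar / theta0
    log_lambda = n + n * math.log(t) - n * t  # log of e^n * t^n * exp(-n t)
    return math.exp(log_lambda), -2.0 * log_lambda


def chi2_cdf_even_df(x, n):
    """cdf at x of a chi-square distribution with 2n degrees of freedom."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k      # builds (x/2)^k / k! incrementally
        total += term
    return 1.0 - math.exp(-half) * total


def exact_two_sided_p(xs, theta0):
    """Equal-tailed p-value: rule (3.5) rejects at level alpha exactly
    when this value is at most alpha."""
    stat = 2.0 * sum(xs) / theta0
    lower = chi2_cdf_even_df(stat, len(xs))
    return min(1.0, 2.0 * min(lower, 1.0 - lower))


xs = [0.5, 2.5, 4.0, 3.0, 1.7, 6.2]  # made-up data for illustration
lam, minus2loglam = lrt_exponential(xs, theta0=1.0)
print("Lambda:", lam, " -2 log Lambda:", minus2loglam)
print("equal-tailed p-value:", exact_two_sided_p(xs, theta0=1.0))
```

Working on the log scale keeps $\Lambda$ numerically stable for large $n$, where $t^n$ and $e^{-nt}$ individually overflow or underflow.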

Figure 3.1: Plot for Example 3.1, showing that the function $g(t) \le c$ if and only if $t \le c_1$ or $t \ge c_2$.

Example 3.2 (Likelihood Ratio Test for the Mean of a Normal pdf). Consider a random sample $X_1, X_2, \ldots, X_n$ from a $N(\theta, \sigma^2)$ distribution where $-\infty < \theta < \infty$ and $\sigma^2 > 0$ is known. Consider the hypotheses

$$H_0: \theta = \theta_0 \text{ versus } H_1: \theta \ne \theta_0,$$

where $\theta_0$ is specified. The likelihood function is

$$L(\theta) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\left\{-(2\sigma^2)^{-1}\sum_{i=1}^{n}(x_i-\theta)^2\right\} = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\left\{-(2\sigma^2)^{-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\right\}\exp\{-(2\sigma^2)^{-1}n(\bar{x}-\theta)^2\}.$$

Of course, in $\Omega = \{\theta : -\infty < \theta < \infty\}$, the mle is $\hat\theta = \bar{X}$ and thus

$$\Lambda = \frac{L(\theta_0)}{L(\hat\theta)} = \exp\{-(2\sigma^2)^{-1}n(\bar{X}-\theta_0)^2\}.$$

Then $\Lambda \le c$ is equivalent to $-2\log\Lambda \ge -2\log c$. However,

$$-2\log\Lambda = \left(\frac{\bar{X}-\theta_0}{\sigma/\sqrt{n}}\right)^2,$$

which has a $\chi^2(1)$ distribution under $H_0$. Thus, the likelihood ratio test with significance level $\alpha$ states that we reject $H_0$ and accept $H_1$ when

$$-2\log\Lambda = \left(\frac{\bar{X}-\theta_0}{\sigma/\sqrt{n}}\right)^2 \ge \chi^2_{\alpha}(1). \tag{3.6}$$

In Exercise 3.3, the power function of this decision rule is obtained. Note also that this test is the same as the z-test for a normal mean with $s$ replaced by $\sigma$; see Exercise 3.2.

Other examples are given in the exercises. In the last two examples the likelihood ratio tests simplify and we are able to get the test in closed form. Often, though, this is impossible. In such cases, similarly to Example 2.7, we can obtain the mle by iterative routines and, hence, also the test statistic $\Lambda$. In Example 3.2, $-2\log\Lambda$ had an exact $\chi^2(1)$ null distribution. While not true in general, as the following theorem shows, under regularity conditions, the asymptotic null distribution of $-2\log\Lambda$ is $\chi^2$ with one degree of freedom. Hence in all cases an asymptotic test can be constructed.

Theorem 3.1. Assume the same regularity conditions as for Theorem 2.2. Under the null hypothesis, $H_0: \theta = \theta_0$,

$$-2\log\Lambda \xrightarrow{D} \chi^2(1). \tag{3.7}$$


Proof: Expand the function $l(\theta)$ into a Taylor series about $\theta_0$ of order 1 and evaluate it at the mle, $\hat\theta$. This results in

$$l(\hat\theta) = l(\theta_0) + (\hat\theta-\theta_0)\,l'(\theta_0) + \frac{1}{2}(\hat\theta-\theta_0)^2\, l''(\theta_n^*), \tag{3.8}$$

where $\theta_n^*$ is between $\hat\theta$ and $\theta_0$. Because $\hat\theta \xrightarrow{P} \theta_0$, it follows that $\theta_n^* \xrightarrow{P} \theta_0$. This, in addition to the fact that the function $l''(\theta)$ is continuous, and equation (2.22) of Theorem 2.2 imply that

$$-\frac{1}{n}\, l''(\theta_n^*) \xrightarrow{P} I(\theta_0). \tag{3.9}$$

By Corollary 2.3,

$$\frac{1}{\sqrt{n}}\, l'(\theta_0) = \sqrt{n}(\hat\theta-\theta_0)I(\theta_0) + R_n, \tag{3.10}$$

where $R_n \to 0$, in probability. If we substitute (3.9) and (3.10) into expression (3.8) and do some simplification, we have

$$-2\log\Lambda = 2(l(\hat\theta) - l(\theta_0)) = \left\{\sqrt{nI(\theta_0)}\,(\hat\theta-\theta_0)\right\}^2 + R_n^*, \tag{3.11}$$

where $R_n^* \to 0$, in probability. By Theorem 2.2, the first term on the right side of the above equation converges in distribution to a $\chi^2$-distribution with one degree of freedom.

Define the test statistic $\chi^2_L = -2\log\Lambda$. For the hypotheses (3.1), this theorem suggests the decision rule

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \chi^2_L \ge \chi^2_{\alpha}(1). \tag{3.12}$$

By the last theorem, this test has asymptotic level $\alpha$. If we cannot obtain the test statistic or its distribution in closed form, we can use this asymptotic test.

Besides the likelihood ratio test, in practice two other likelihood-related tests are employed. A natural test statistic is based on the asymptotic distribution of $\hat\theta$. Consider the statistic

$$\chi^2_W = \left\{\sqrt{nI(\hat\theta)}\,(\hat\theta-\theta_0)\right\}^2. \tag{3.13}$$

Because $I(\theta)$ is a continuous function, $I(\hat\theta) \to I(\theta_0)$ in probability under the null hypothesis, (3.1). It follows, under $H_0$, that $\chi^2_W$ has an asymptotic $\chi^2$-distribution with one degree of freedom. This suggests the decision rule

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \chi^2_W \ge \chi^2_{\alpha}(1). \tag{3.14}$$

As with the test based on $\chi^2_L$, this test has asymptotic level $\alpha$. Actually, the relationship between the two test statistics is strong, because as equation (3.11) shows, under $H_0$,

$$\chi^2_W - \chi^2_L \xrightarrow{P} 0. \tag{3.15}$$

The test (3.14) is often referred to as a Wald-type test, after Abraham Wald, who was a prominent statistician of the 20th century.


The third test is called a scores-type test, which is often referred to as Rao's score test, after another prominent statistician, C. R. Rao. The scores are the components of the vector

$$S(\theta) = \left(\frac{\partial \log f(X_1;\theta)}{\partial\theta}, \ldots, \frac{\partial \log f(X_n;\theta)}{\partial\theta}\right)'. \tag{3.16}$$

In our notation, we have

$$\frac{1}{\sqrt{n}}\, l'(\theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\partial \log f(X_i;\theta_0)}{\partial\theta}. \tag{3.17}$$

Define the statistic

$$\chi^2_R = \left(\frac{l'(\theta_0)}{\sqrt{nI(\theta_0)}}\right)^2. \tag{3.18}$$

Under $H_0$, it follows from expression (3.10) that

$$\chi^2_R = \chi^2_W + R_{0n}, \tag{3.19}$$

where $R_{0n}$ converges to 0 in probability. Hence the following decision rule defines an asymptotic level $\alpha$ test under $H_0$:

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \chi^2_R \ge \chi^2_{\alpha}(1). \tag{3.20}$$
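The three statistics $\chi^2_L$, $\chi^2_W$, and $\chi^2_R$ can be computed side by side. The following is a minimal sketch under the assumption of the beta($\theta$, 1) pdf (2.14), for which $l(\theta) = n\log\theta + (\theta-1)\sum\log x_i$, $\hat\theta = -n/\sum\log x_i$, and $I(\theta) = \theta^{-2}$. The upper tail of the $\chi^2(1)$ distribution is obtained from the standard normal via `math.erfc`. The function names and the data are hypothetical.

```python
import math


def three_tests(xs, theta0):
    """Return (chi2_L, chi2_W, chi2_R) for H0: theta = theta0 under beta(theta, 1)."""
    n = len(xs)
    s = sum(math.log(x) for x in xs)  # sum of log x_i, negative on (0, 1)
    theta_hat = -n / s                # mle

    def loglik(theta):
        return n * math.log(theta) + (theta - 1.0) * s

    lrt = 2.0 * (loglik(theta_hat) - loglik(theta0))       # (3.12)
    wald = n * (theta_hat - theta0) ** 2 / theta_hat ** 2  # (3.13) with I = theta^-2
    score = (s + n / theta0) ** 2 * theta0 ** 2 / n        # (3.18) with l'(theta0) = s + n/theta0
    return lrt, wald, score


def chi2_1_sf(x):
    """P[chi-square(1) >= x] = P[|Z| >= sqrt(x)] for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))


xs = [0.91, 0.45, 0.78, 0.62, 0.97, 0.33, 0.85, 0.71, 0.56, 0.99]  # made-up data
for name, stat in zip(("LRT", "Wald", "score"), three_tests(xs, theta0=1.0)):
    print(name, stat, "approx. p-value:", chi2_1_sf(stat))
```

For this particular model the Wald-type and scores-type statistics coincide algebraically, so the sketch prints the same value for both; in general the three statistics differ in finite samples and agree only asymptotically under $H_0$.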

Example 3.3 (Example 2.6, Continued). As in Example 2.6, let $X_1, \ldots, X_n$ be a random sample having the common beta($\theta$, 1) pdf (2.14). We use this pdf to illustrate the three test statistics discussed above for the hypotheses

$$H_0: \theta = 1 \text{ versus } H_1: \theta \ne 1. \tag{3.21}$$

Under $H_0$, $f(x;\theta)$ is the uniform(0, 1) pdf. Recall that $\hat\theta = -n/\sum_{i=1}^{n}\log X_i$ is the mle of $\theta$. After some simplification, the value of the likelihood function at the mle is

$$L(\hat\theta) = \left(-\sum_{i=1}^{n}\log X_i\right)^{-n}\exp\left\{-\sum_{i=1}^{n}\log X_i\right\}\exp\{n(\log n - 1)\}.$$

Also, $L(1) = 1$. Hence the likelihood ratio test statistic is $\Lambda = 1/L(\hat\theta)$, so that

$$\chi^2_L = -2\log\Lambda = 2\left\{-\sum_{i=1}^{n}\log X_i - n\log\left(-\sum_{i=1}^{n}\log X_i\right) - n + n\log n\right\}.$$

Recall that the information for this pdf is $I(\theta) = \theta^{-2}$. For the Wald-type test, we would estimate this consistently by $\hat\theta^{-2}$. The Wald-type test simplifies to

$$\chi^2_W = \left\{\sqrt{\frac{n}{\hat\theta^2}}\,(\hat\theta - 1)\right\}^2 = n\left\{1 - \frac{1}{\hat\theta}\right\}^2. \tag{3.22}$$


Finally, for the scores-type test, recall from (2.15) that $l'(1)$ is

$$l'(1) = \sum_{i=1}^{n}\log X_i + n.$$

Hence the scores-type test statistic is

$$\chi^2_R = \left\{\frac{\sum_{i=1}^{n}\log X_i + n}{\sqrt{n}}\right\}^2. \tag{3.23}$$

It is easy to show that expressions (3.22) and (3.23) are the same. From Example 2.4, we know the exact distribution of the maximum likelihood estimate. Exercise 3.7 uses this distribution to obtain an exact test.

Example 3.4 (Likelihood Tests for the Laplace Location Model). Consider the location model

$$X_i = \theta + e_i, \quad i = 1, \ldots, n,$$

where $-\infty < \theta < \infty$ and the random errors $e_i$ are iid, each having the Laplace pdf. Technically, the Laplace distribution does not satisfy all of the regularity conditions (R0)–(R5), but the results below can be derived rigorously; see, for example, Hettmansperger and McKean (2011). Consider testing the hypotheses

$$H_0: \theta = \theta_0 \text{ versus } H_1: \theta \ne \theta_0,$$

where $\theta_0$ is specified. Here $\Omega = (-\infty, \infty)$ and $\omega = \{\theta_0\}$. By Example 1.1, we know that the mle of $\theta$ under $\Omega$ is $Q_2 = \mathrm{med}\{X_1, \ldots, X_n\}$, the sample median. It follows that

$$L(\hat\Omega) = 2^{-n}\exp\left\{-\sum_{i=1}^{n}|x_i - Q_2|\right\},$$

while

$$L(\hat\omega) = 2^{-n}\exp\left\{-\sum_{i=1}^{n}|x_i - \theta_0|\right\}.$$

Hence the negative of twice the log of the likelihood ratio test statistic is

$$-2\log\Lambda = 2\left[\sum_{i=1}^{n}|x_i - \theta_0| - \sum_{i=1}^{n}|x_i - Q_2|\right]. \tag{3.24}$$

Thus the size $\alpha$ asymptotic likelihood ratio test for $H_0$ versus $H_1$ rejects $H_0$ in favor of $H_1$ if

$$2\left[\sum_{i=1}^{n}|x_i - \theta_0| - \sum_{i=1}^{n}|x_i - Q_2|\right] \ge \chi^2_{\alpha}(1).$$

By (2.10), the Fisher information for this model is $I(\theta) = 1$. Thus, the Wald-type test statistic simplifies to

$$\chi^2_W = [\sqrt{n}(Q_2 - \theta_0)]^2.$$


For the scores test, we have

$$\frac{\partial \log f(x_i - \theta)}{\partial\theta} = \frac{\partial}{\partial\theta}\left[\log\frac{1}{2} - |x_i - \theta|\right] = \mathrm{sgn}(x_i - \theta).$$

Hence the score vector for this model is $S(\theta) = (\mathrm{sgn}(X_1-\theta), \ldots, \mathrm{sgn}(X_n-\theta))'$. From the above discussion [see equation (3.17)], the scores test statistic can be written as

$$\chi^2_R = (S^*)^2/n,$$

where

$$S^* = \sum_{i=1}^{n}\mathrm{sgn}(X_i - \theta_0).$$

As Exercise 3.4 shows, under $H_0$, $S^*$ is a linear function of a random variable with a $b(n, 1/2)$ distribution.

Which of the three tests should we use? Based on the above discussion, all three tests are asymptotically equivalent under the null hypothesis. Similarly to the concept of asymptotic relative efficiency (ARE), we can derive an equivalent concept of efficiency for tests; see advanced books such as Hettmansperger and McKean (2011). However, all three tests have the same asymptotic efficiency. Hence, asymptotic theory offers little help in separating the tests. There have been finite sample comparisons in the literature, but these studies have not selected any of these as a "best" test overall; see Chapter 7 of Lehmann (1999) for more discussion.

EXERCISES

3.1. Consider the decision rule (3.5) derived in Example 3.1. Obtain the distribution of the test statistic under a general alternative and use it to obtain the power function of the test. If computational facilities are available, sketch this power curve for the case when $\theta_0 = 1$, $n = 10$, and $\alpha = 0.05$.

3.2. Show that the test with decision rule (3.6) is like the test of $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$, except that here $\sigma^2$ is known.

3.3. Consider the decision rule (3.6) derived in Example 3.2. Obtain an equivalent test statistic which has a standard normal distribution under $H_0$. Next obtain the distribution of this test statistic under a general alternative and use it to obtain the power function of the test. If computational facilities are available, sketch this power curve for the case when $\theta_0 = 0$, $n = 10$, $\sigma^2 = 1$, and $\alpha = 0.05$.

3.4. Consider Example 3.4.

(a) Show that we can write $S^* = 2T - n$, where $T = \#\{X_i > \theta_0\}$.

(b) Show that the scores test for this model is equivalent to rejecting $H_0$ if $T < c_1$ or $T > c_2$.


(c) Show that under $H_0$, $T$ has the binomial distribution $b(n, 1/2)$; hence, determine $c_1$ and $c_2$ so the test has size $\alpha$.

(d) Determine the power function for the test based on $T$ as a function of $\theta$.

3.5. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $N(\mu_0, \sigma^2 = \theta)$ distribution, where $0 < \theta < \infty$ and $\mu_0$ is known. Show that the likelihood ratio test of $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ can be based upon the statistic $W = \sum_{i=1}^{n}(X_i - \mu_0)^2/\theta_0$. Determine the null distribution of $W$ and give, explicitly, the rejection rule for a level $\alpha$ test.

3.6. For the test described in Exercise 3.5, obtain the distribution of the test statistic under general alternatives. If computational facilities are available, sketch this power curve for the case when $\theta_0 = 1$, $n = 10$, $\mu = 0$, and $\alpha = 0.05$.

3.7. Using the results of Example 2.4, find an exact size $\alpha$ test for the hypotheses (3.21).

3.8. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with mean $\theta > 0$.

(a) Show that the likelihood ratio test of $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ is based upon the statistic $Y = \sum_{i=1}^{n} X_i$. Obtain the null distribution of $Y$.

(b) For $\theta_0 = 2$ and $n = 5$, find the significance level of the test that rejects $H_0$ if $Y \le 4$ or $Y \ge 17$.

3.9. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Bernoulli $b(1, \theta)$ distribution, where $0 < \theta < 1$.

(a) Show that the likelihood ratio test of $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ is based upon the statistic $Y = \sum_{i=1}^{n} X_i$. Obtain the null distribution of $Y$.

(b) For $n = 100$ and $\theta_0 = 1/2$, find $c_1$ so that the test that rejects $H_0$ when $Y \le c_1$ or $Y \ge c_2 = 100 - c_1$ has the approximate significance level of $\alpha = 0.05$. Hint: Use the Central Limit Theorem.

3.10. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $\Gamma(\alpha = 3, \beta = \theta)$ distribution, where $0 < \theta < \infty$.

(a) Show that the likelihood ratio test of $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ is based upon the statistic $W = \sum_{i=1}^{n} X_i$. Obtain the null distribution of $2W/\theta_0$.

(b) For $\theta_0 = 3$ and $n = 5$, find $c_1$ and $c_2$ so that the test that rejects $H_0$ when $W \le c_1$ or $W \ge c_2$ has significance level 0.05.

3.11. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with pdf $f(x;\theta) = \theta\exp\{-|x|^{\theta}\}/[2\Gamma(1/\theta)]$, $-\infty < x < \infty$, where $\theta > 0$. Suppose $\Omega = \{\theta : \theta = 1, 2\}$. Consider the hypotheses $H_0: \theta = 2$ (a normal distribution) versus $H_1: \theta = 1$ (a double exponential distribution). Show that the likelihood ratio test can be based on the statistic $W = \sum_{i=1}^{n}(X_i^2 - |X_i|)$.


3.12. Let $X_1, X_2, \ldots, X_n$ be a random sample from the beta distribution with $\alpha = \beta = \theta$ and $\Omega = \{\theta : \theta = 1, 2\}$. Show that the likelihood ratio test statistic $\Lambda$ for testing $H_0: \theta = 1$ versus $H_1: \theta = 2$ is a function of the statistic $W = \sum_{i=1}^{n}\log X_i + \sum_{i=1}^{n}\log(1 - X_i)$.

3.13. Consider a location model

$$X_i = \theta + e_i, \quad i = 1, \ldots, n, \tag{3.25}$$

where $e_1, e_2, \ldots, e_n$ are iid with pdf $f(z)$. There is a nice geometric interpretation for estimating $\theta$. Let $\mathbf{X} = (X_1, \ldots, X_n)'$ and $\mathbf{e} = (e_1, \ldots, e_n)'$ be the vectors of observations and random error, respectively, and let $\boldsymbol{\mu} = \theta\mathbf{1}$, where $\mathbf{1}$ is a vector with all components equal to 1. Let $V$ be the subspace of vectors of the form $\boldsymbol{\mu}$; i.e., $V = \{\mathbf{v} : \mathbf{v} = a\mathbf{1}, \text{ for some } a \in R\}$. Then in vector notation we can write the model as

$$\mathbf{X} = \boldsymbol{\mu} + \mathbf{e}, \quad \boldsymbol{\mu} \in V. \tag{3.26}$$

Then we can summarize the model by saying, "Except for the random error vector $\mathbf{e}$, $\mathbf{X}$ would reside in $V$." Hence, it makes sense intuitively to estimate $\boldsymbol{\mu}$ by a vector in $V$ which is "closest" to $\mathbf{X}$. That is, given a norm $\|\cdot\|$ in $R^n$, choose

$$\hat{\boldsymbol{\mu}} = \mathrm{Argmin}\,\|\mathbf{X} - \mathbf{v}\|, \quad \mathbf{v} \in V. \tag{3.27}$$

(a) If the error pdf is the Laplace, show that the minimization in (3.27) is equivalent to maximizing the likelihood when the norm is the $l_1$ norm given by

$$\|\mathbf{v}\|_1 = \sum_{i=1}^{n}|v_i|. \tag{3.28}$$

(b) If the error pdf is the $N(0, 1)$, show that the minimization in (3.27) is equivalent to maximizing the likelihood when the norm is given by the square of the $l_2$ norm

$$\|\mathbf{v}\|_2^2 = \sum_{i=1}^{n} v_i^2. \tag{3.29}$$

3.14. Continuing with the last exercise, besides estimation there is also a nice geometric interpretation for testing. For the model (3.26), consider the hypotheses

$$H_0: \theta = \theta_0 \text{ versus } H_1: \theta \ne \theta_0, \tag{3.30}$$

where $\theta_0$ is specified. Given a norm $\|\cdot\|$ on $R^n$, denote by $d(\mathbf{X}, V)$ the distance between $\mathbf{X}$ and the subspace $V$; i.e., $d(\mathbf{X}, V) = \|\mathbf{X} - \hat{\boldsymbol{\mu}}\|$, where $\hat{\boldsymbol{\mu}}$ is defined in equation (3.27). If $H_0$ is true, then $\hat{\boldsymbol{\mu}}$ should be close to $\boldsymbol{\mu} = \theta_0\mathbf{1}$ and, hence, $\|\mathbf{X} - \theta_0\mathbf{1}\|$ should be close to $d(\mathbf{X}, V)$. Denote the difference by

$$RD = \|\mathbf{X} - \theta_0\mathbf{1}\| - \|\mathbf{X} - \hat{\boldsymbol{\mu}}\|. \tag{3.31}$$

Small values of $RD$ indicate that the null hypothesis is true, while large values indicate $H_1$. So our rejection rule when using $RD$ is

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } RD > c. \tag{3.32}$$


(a) If the error pdf is the Laplace, (1.6), show that expression (3.31) is equivalent to the likelihood ratio test when the norm is given by (3.28).

(b) If the error pdf is the $N(0, 1)$, show that expression (3.31) is equivalent to the likelihood ratio test when the norm is given by the square of the $l_2$ norm, (3.29).

3.15. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with pmf $p(x;\theta) = \theta^x(1-\theta)^{1-x}$, $x = 0, 1$, where $0 < \theta < 1$. We wish to test $H_0: \theta = 1/3$ versus $H_1: \theta \ne 1/3$.

(a) Find $\Lambda$ and $-2\log\Lambda$.

(b) Determine the Wald-type test.

(c) What is Rao's score statistic?

3.16. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with mean $\theta > 0$. Test $H_0: \theta = 2$ against $H_1: \theta \ne 2$ using

(a) $-2\log\Lambda$.

(b) A Wald-type statistic.

(c) Rao's score statistic.

3.17. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $\Gamma(\alpha, \beta)$ distribution where $\alpha$ is known and $\beta > 0$. Determine the likelihood ratio test for $H_0: \beta = \beta_0$ against $H_1: \beta \ne \beta_0$.

3.18. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from a uniform distribution on $(0, \theta)$, where $\theta > 0$.

(a) Show that $\Lambda$ for testing $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$ is $\Lambda = (Y_n/\theta_0)^n$, $Y_n \le \theta_0$, and $\Lambda = 0$ if $Y_n > \theta_0$.

(b) When $H_0$ is true, show that $-2\log\Lambda$ has an exact $\chi^2(2)$ distribution, not $\chi^2(1)$. Note that the regularity conditions are not satisfied.

4

Multiparameter Case: Estimation

In this section, we discuss the case where θ is a vector of p parameters. There are analogs to the theorems in the previous sections in which θ is a scalar, and we present their results but, for the most part, without proofs. The interested reader can ﬁnd additional information in other books; see, for instance, Lehmann and Casella (1998) and Rao (1973).

Let $X_1, \ldots, X_n$ be iid with common pdf $f(x; \theta)$, where $\theta \in \Omega \subset R^p$. As before, the likelihood function and its log are given by

$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta), \qquad l(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta), \tag{4.1}$$

for $\theta \in \Omega$. The theory requires additional regularity conditions. In keeping with our number scheme in the last three sections, we have labeled these (R6)–(R9). In this section, when we say "under regularity conditions," we mean all of the conditions of (1.1), (2.1), (2.2), which are relevant to the argument. The discrete case follows in the same way as the continuous case, so in general we state material in terms of the continuous case.

Note that the proof of Theorem 1.1 does not depend on whether the parameter is a scalar or a vector. Therefore, with probability going to 1, $L(\theta)$ is maximized at the true value of $\theta$. Hence, as an estimate of $\theta$ we consider the value which maximizes $L(\theta)$ or equivalently solves the vector equation $(\partial/\partial\theta)\, l(\theta) = 0$. If it exists, this value is called the maximum likelihood estimator (mle) and we denote it by $\hat{\theta}$. Often we are interested in a function of $\theta$, say, the parameter $\eta = g(\theta)$. Because the second part of the proof of Theorem 1.2 remains true for $\theta$ as a vector, $\hat{\eta} = g(\hat{\theta})$ is the mle of $\eta$.

Example 4.1 (Maximum Likelihood Estimates Under the Normal Model). Suppose $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$. In this case, $\theta = (\mu, \sigma^2)'$ and $\Omega$ is the product space $(-\infty, \infty) \times (0, \infty)$. The log of the likelihood simplifies to

$$l(\mu, \sigma^2) = -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{4.2}$$

Taking partial derivatives of (4.2) with respect to $\mu$ and $\sigma$ and setting them to 0, we get the simultaneous equations

$$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0$$

$$\frac{\partial l}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (x_i - \mu)^2 = 0.$$

Solving these equations, we obtain $\hat{\mu} = \bar{X}$ and $\hat{\sigma} = \sqrt{(1/n) \sum_{i=1}^{n} (X_i - \bar{X})^2}$ as solutions. A check of the second partials shows that these maximize $l(\mu, \sigma^2)$, so these are the mles. Also, by Theorem 1.2, $(1/n) \sum_{i=1}^{n} (X_i - \bar{X})^2$ is the mle of $\sigma^2$. We know that these are consistent estimates of $\mu$ and $\sigma^2$, respectively, that $\hat{\mu}$ is an unbiased estimate of $\mu$, and that $\hat{\sigma}^2$ is a biased estimate of $\sigma^2$ whose bias vanishes as $n \to \infty$.
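The closed-form solution in Example 4.1 can be checked numerically. The following is a minimal sketch, not part of the text: the function names `normal_mle` and `score` and the data values are hypothetical, chosen only for illustration. It computes the mles and confirms that both partial derivatives of (4.2) vanish there.

```python
import math


def normal_mle(xs):
    """Closed-form mles (mu_hat, sigma2_hat) for an iid N(mu, sigma^2) sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # The mle of sigma^2 uses the divisor n, not n - 1.
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat


def score(xs, mu, sigma):
    """Partial derivatives of l(mu, sigma) in (4.2) with respect to mu and sigma."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    d_mu = sum(x - mu for x in xs) / sigma ** 2
    d_sigma = -n / sigma + ss / sigma ** 3
    return d_mu, d_sigma


# Hypothetical data, used only to illustrate the computation.
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]
mu_hat, sigma2_hat = normal_mle(sample)
d_mu, d_sigma = score(sample, mu_hat, math.sqrt(sigma2_hat))
print(mu_hat, sigma2_hat, d_mu, d_sigma)
```

Both derivatives should be zero up to floating-point rounding, which is all this check establishes; it does not replace the second-derivative check mentioned in the example.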

Example 4.2 (General Laplace pdf). Let $X_1, X_2, \ldots, X_n$ be a random sample from the Laplace pdf $f_X(x) = (2b)^{-1} \exp\{-|x - a|/b\}$, $-\infty < x < \infty$, where the parameters $(a, b)$ are in the space $\Omega = \{(a, b) : -\infty < a < \infty,\ b > 0\}$. Recall in the last sections we looked at the special case where $b = 1$. As we now show, the mle of $a$ is the sample median, regardless of the value of $b$. The log of the likelihood function is

$$l(a, b) = -n \log 2 - n \log b - \sum_{i=1}^{n} \left| \frac{x_i - a}{b} \right|.$$

The partial of $l(a, b)$ with respect to $a$ is

$$\frac{\partial l(a, b)}{\partial a} = \frac{1}{b} \sum_{i=1}^{n} \operatorname{sgn}\left\{ \frac{x_i - a}{b} \right\} = \frac{1}{b} \sum_{i=1}^{n} \operatorname{sgn}\{x_i - a\},$$

where the second equality follows because $b > 0$. Setting this partial to 0, we obtain the mle of $a$ to be $Q_2 = \operatorname{med}\{X_1, X_2, \ldots, X_n\}$, just as in Example 1.1. Hence the mle of $a$ is invariant to the parameter $b$. Taking the partial of $l(a, b)$ with respect to $b$, we obtain

$$\frac{\partial l(a, b)}{\partial b} = -\frac{n}{b} + \frac{1}{b^2} \sum_{i=1}^{n} |x_i - a|.$$

Setting to 0 and solving the two equations simultaneously, we obtain, as the mle of $b$, the statistic

$$\hat{b} = \frac{1}{n} \sum_{i=1}^{n} |X_i - Q_2|.$$

Recall that the Fisher information in the scalar case was the variance of the random variable $(\partial/\partial\theta) \log f(X; \theta)$. The analog in the multiparameter case is the variance-covariance matrix of the gradient of $\log f(X; \theta)$, that is, the variance-covariance matrix of the random vector given by

$$\nabla \log f(X; \theta) = \left( \frac{\partial \log f(X; \theta)}{\partial \theta_1}, \ldots, \frac{\partial \log f(X; \theta)}{\partial \theta_p} \right)'. \tag{4.3}$$

Fisher information is then defined by the $p \times p$ matrix

$$I(\theta) = \operatorname{Cov}\left( \nabla \log f(X; \theta) \right). \tag{4.4}$$

The $(j, k)$th entry of $I(\theta)$ is given by

$$I_{jk} = \operatorname{cov}\left( \frac{\partial}{\partial \theta_j} \log f(X; \theta),\ \frac{\partial}{\partial \theta_k} \log f(X; \theta) \right); \quad j, k = 1, \ldots, p. \tag{4.5}$$

As in the scalar case, we can simplify this by using the identity $1 = \int f(x; \theta)\, dx$. Under the regularity conditions, as discussed in the second paragraph of this section, the partial derivative of this identity with respect to $\theta_j$ results in

$$0 = \int \frac{\partial}{\partial \theta_j} f(x; \theta)\, dx = \int \left[ \frac{\partial}{\partial \theta_j} \log f(x; \theta) \right] f(x; \theta)\, dx = E\left[ \frac{\partial}{\partial \theta_j} \log f(X; \theta) \right]. \tag{4.6}$$

Next, on both sides of the first equality above, take the partial derivative with respect to $\theta_k$. After simplification, this results in

$$0 = \int \left[ \frac{\partial^2}{\partial \theta_j \partial \theta_k} \log f(x; \theta) \right] f(x; \theta)\, dx + \int \left[ \frac{\partial}{\partial \theta_j} \log f(x; \theta) \right] \left[ \frac{\partial}{\partial \theta_k} \log f(x; \theta) \right] f(x; \theta)\, dx;$$

that is,

$$E\left[ \frac{\partial}{\partial \theta_j} \log f(X; \theta)\, \frac{\partial}{\partial \theta_k} \log f(X; \theta) \right] = -E\left[ \frac{\partial^2}{\partial \theta_j \partial \theta_k} \log f(X; \theta) \right]. \tag{4.7}$$

Using (4.6) and (4.7) together, we obtain

$$I_{jk} = -E\left[ \frac{\partial^2}{\partial \theta_j \partial \theta_k} \log f(X; \theta) \right]. \tag{4.8}$$

Information for a random sample follows in the same way as the scalar case. The pdf of the sample is the likelihood function $L(\theta; \mathbf{X})$. Replace $f(X; \theta)$ by $L(\theta; \mathbf{X})$ in the vector given in expression (4.3). Because $\log L$ is a sum, this results in the random vector

$$\nabla \log L(\theta; \mathbf{X}) = \sum_{i=1}^{n} \nabla \log f(X_i; \theta). \tag{4.9}$$

Because the summands are iid with common covariance matrix $I(\theta)$, we have

$$\operatorname{Cov}\left( \nabla \log L(\theta; \mathbf{X}) \right) = n I(\theta). \tag{4.10}$$

As in the scalar case, the information in a random sample of size $n$ is $n$ times the information in a sample of size 1. The diagonal entries of $I(\theta)$ are

$$I_{ii}(\theta) = \operatorname{Var}\left[ \frac{\partial \log f(X; \theta)}{\partial \theta_i} \right] = -E\left[ \frac{\partial^2}{\partial \theta_i^2} \log f(X; \theta) \right].$$

This is similar to the case when $\theta$ is a scalar, except now $I_{ii}(\theta)$ is a function of the vector $\theta$. Recall in the scalar case that $(nI(\theta))^{-1}$ was the Rao-Cramér lower bound for an unbiased estimate of $\theta$. There is an analog to this in the multiparameter case. In particular, if $Y_j = u_j(X_1, \ldots, X_n)$ is an unbiased estimate of $\theta_j$, then it can be shown that

$$\operatorname{Var}(Y_j) \geq \frac{1}{n} \left[ I^{-1}(\theta) \right]_{jj}; \tag{4.11}$$

see, for example, Lehmann (1983). As in the scalar case, we shall call an unbiased estimate efficient if its variance attains this lower bound.

Example 4.3 (Information Matrix for the Normal pdf). The log of a $N(\mu, \sigma^2)$ pdf is given by

$$\log f(x; \mu, \sigma^2) = -\frac{1}{2} \log 2\pi - \log \sigma - \frac{1}{2\sigma^2} (x - \mu)^2. \tag{4.12}$$

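The first and second partial derivatives are

$$\frac{\partial \log f}{\partial \mu} = \frac{1}{\sigma^2} (x - \mu)$$

$$\frac{\partial^2 \log f}{\partial \mu^2} = -\frac{1}{\sigma^2}$$

$$\frac{\partial \log f}{\partial \sigma} = -\frac{1}{\sigma} + \frac{1}{\sigma^3} (x - \mu)^2$$

$$\frac{\partial^2 \log f}{\partial \sigma^2} = \frac{1}{\sigma^2} - \frac{3}{\sigma^4} (x - \mu)^2$$

$$\frac{\partial^2 \log f}{\partial \mu\, \partial \sigma} = -\frac{2}{\sigma^3} (x - \mu).$$

Upon taking the negative of the expectations of the second partial derivatives, the information matrix for a normal density is

$$I(\mu, \sigma) = \begin{bmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & \dfrac{2}{\sigma^2} \end{bmatrix}. \tag{4.13}$$

We may want the information matrix for $(\mu, \sigma^2)$. This can be obtained by taking partial derivatives with respect to $\sigma^2$ instead of $\sigma$; however, in Example 4.6, we obtain it via a transformation. From Example 4.1, the maximum likelihood estimates of $\mu$ and $\sigma^2$ are $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = (1/n) \sum_{i=1}^{n} (X_i - \bar{X})^2$, respectively. Based on the information matrix, we note that $\bar{X}$ is an efficient estimate of $\mu$ for finite samples. In Example 4.6, we consider the sample variance.

Example 4.4 (Information Matrix for a Location and Scale Family). Suppose $X_1, X_2, \ldots, X_n$ is a random sample with common pdf $f_X(x) = b^{-1} f\left( \frac{x-a}{b} \right)$, $-\infty < x < \infty$, where $(a, b)$ is in the space $\Omega = \{(a, b) : -\infty < a < \infty,\ b > 0\}$ and $f(z)$ is a pdf such that $f(z) > 0$ for $-\infty < z < \infty$. As Exercise 4.8 shows, we can model $X_i$ as

$$X_i = a + b e_i, \tag{4.14}$$

where the $e_i$s are iid with pdf $f(z)$. This is called a location and scale model (LASP). Example 4.2 illustrated this model when $f(z)$ had the Laplace pdf. In Exercise 4.9, the reader is asked to show that the partial derivatives are

$$\frac{\partial}{\partial a} \log\left\{ \frac{1}{b} f\left( \frac{x-a}{b} \right) \right\} = -\frac{1}{b}\, \frac{f'\left( \frac{x-a}{b} \right)}{f\left( \frac{x-a}{b} \right)}$$

$$\frac{\partial}{\partial b} \log\left\{ \frac{1}{b} f\left( \frac{x-a}{b} \right) \right\} = -\frac{1}{b} \left[ 1 + \frac{\frac{x-a}{b}\, f'\left( \frac{x-a}{b} \right)}{f\left( \frac{x-a}{b} \right)} \right].$$

Using (4.5) and (4.6), we then obtain

$$I_{11} = \int_{-\infty}^{\infty} \frac{1}{b^2} \left[ \frac{f'\left( \frac{x-a}{b} \right)}{f\left( \frac{x-a}{b} \right)} \right]^2 \frac{1}{b} f\left( \frac{x-a}{b} \right) dx.$$

Now make the substitution $z = (x - a)/b$, $dz = (1/b)\,dx$. Then we have

$$I_{11} = \frac{1}{b^2} \int_{-\infty}^{\infty} \left[ \frac{f'(z)}{f(z)} \right]^2 f(z)\, dz; \tag{4.15}$$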

hence, information on the location parameter $a$ does not depend on $a$. As Exercise 4.9 shows, upon making this substitution, the other entries in the information matrix are

$$I_{22} = \frac{1}{b^2} \int_{-\infty}^{\infty} \left[ 1 + \frac{z f'(z)}{f(z)} \right]^2 f(z)\, dz \tag{4.16}$$

$$I_{12} = \frac{1}{b^2} \int_{-\infty}^{\infty} z \left[ \frac{f'(z)}{f(z)} \right]^2 f(z)\, dz. \tag{4.17}$$

Thus, the information matrix can be written as $(1/b)^2$ times a matrix whose entries are free of the parameters $a$ and $b$. As Exercise 4.10 shows, the off-diagonal entries of the information matrix are 0 if the pdf $f(z)$ is symmetric about 0.

Example 4.5 (Multinomial Distribution). Consider a random trial which can result in one, and only one, of $k$ outcomes or categories. Let $X_j$ be 1 or 0 depending on whether the $j$th outcome occurs or does not, for $j = 1, \ldots, k$. Suppose the probability that outcome $j$ occurs is $p_j$; hence, $\sum_{j=1}^{k} p_j = 1$. Let $\mathbf{X} = (X_1, \ldots, X_{k-1})'$ and $\mathbf{p} = (p_1, \ldots, p_{k-1})'$. The distribution of $\mathbf{X}$ is multinomial. Recall that the pmf is given by

$$f(\mathbf{x}, \mathbf{p}) = \left( \prod_{j=1}^{k-1} p_j^{x_j} \right) \left( 1 - \sum_{j=1}^{k-1} p_j \right)^{1 - \sum_{j=1}^{k-1} x_j}, \tag{4.18}$$

where the parameter space is $\Omega = \{\mathbf{p} : 0 < p_j < 1,\ j = 1, \ldots, k-1;\ \sum_{j=1}^{k-1} p_j < 1\}$.

We first obtain the information matrix. The first partial of the log of $f$ with respect to $p_i$ simplifies to

$$\frac{\partial \log f}{\partial p_i} = \frac{x_i}{p_i} - \frac{1 - \sum_{j=1}^{k-1} x_j}{1 - \sum_{j=1}^{k-1} p_j}.$$

The second partial derivatives are given by

$$\frac{\partial^2 \log f}{\partial p_i^2} = -\frac{x_i}{p_i^2} - \frac{1 - \sum_{j=1}^{k-1} x_j}{\left( 1 - \sum_{j=1}^{k-1} p_j \right)^2}$$

$$\frac{\partial^2 \log f}{\partial p_i\, \partial p_h} = -\frac{1 - \sum_{j=1}^{k-1} x_j}{\left( 1 - \sum_{j=1}^{k-1} p_j \right)^2}, \quad i \neq h < k.$$

Recall for this distribution that marginally each random variable $X_j$ has a Bernoulli distribution with mean $p_j$. Recalling that $p_k = 1 - (p_1 + \cdots + p_{k-1})$, the expectations of the negatives of the second partial derivatives are straightforward and result in the information matrix

$$I(\mathbf{p}) = \begin{bmatrix} \dfrac{1}{p_1} + \dfrac{1}{p_k} & \dfrac{1}{p_k} & \cdots & \dfrac{1}{p_k} \\ \dfrac{1}{p_k} & \dfrac{1}{p_2} + \dfrac{1}{p_k} & \cdots & \dfrac{1}{p_k} \\ \vdots & \vdots & & \vdots \\ \dfrac{1}{p_k} & \dfrac{1}{p_k} & \cdots & \dfrac{1}{p_{k-1}} + \dfrac{1}{p_k} \end{bmatrix}. \tag{4.19}$$

This is a patterned matrix with inverse [see page 170 of Graybill (1969)],

$$I^{-1}(\mathbf{p}) = \begin{bmatrix} p_1(1 - p_1) & -p_1 p_2 & \cdots & -p_1 p_{k-1} \\ -p_1 p_2 & p_2(1 - p_2) & \cdots & -p_2 p_{k-1} \\ \vdots & \vdots & & \vdots \\ -p_1 p_{k-1} & -p_2 p_{k-1} & \cdots & p_{k-1}(1 - p_{k-1}) \end{bmatrix}. \tag{4.20}$$
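The inverse in (4.20) is quoted from a reference rather than derived, so a quick numerical check may be reassuring. The sketch below is only an illustration: the helper names `info_matrix`, `info_inverse`, and `matmul` and the probabilities are hypothetical. It builds (4.19) and (4.20) for $k = 4$ and confirms that their product is the identity matrix up to rounding.

```python
def info_matrix(p):
    """I(p) of (4.19) for p = (p_1, ..., p_{k-1}); p_k is the remaining probability."""
    pk = 1.0 - sum(p)
    m = len(p)
    return [[(1.0 / p[i] if i == j else 0.0) + 1.0 / pk for j in range(m)]
            for i in range(m)]


def info_inverse(p):
    """Matrix (4.20): p_i (1 - p_i) on the diagonal and -p_i p_j off the diagonal."""
    m = len(p)
    return [[p[i] * (1.0 - p[i]) if i == j else -p[i] * p[j] for j in range(m)]
            for i in range(m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][r] * b[r][j] for r in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Hypothetical probabilities: k = 4 categories with p_4 = 0.4.
p = [0.2, 0.3, 0.1]
product = matmul(info_matrix(p), info_inverse(p))
max_err = max(abs(product[i][j] - (1.0 if i == j else 0.0))
              for i in range(len(p)) for j in range(len(p)))
print(max_err)
```

A single numerical case is not a proof, but it does catch transcription slips in either matrix.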

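Next, we obtain the mles for a random sample $\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_n$. The likelihood function is given by

$$L(\mathbf{p}) = \prod_{i=1}^{n} \left[ \left( \prod_{j=1}^{k-1} p_j^{x_{ji}} \right) \left( 1 - \sum_{j=1}^{k-1} p_j \right)^{1 - \sum_{j=1}^{k-1} x_{ji}} \right]. \tag{4.21}$$

Let $t_j = \sum_{i=1}^{n} x_{ji}$, for $j = 1, \ldots, k-1$. With simplification, the log of $L$ reduces to

$$l(\mathbf{p}) = \sum_{j=1}^{k-1} t_j \log p_j + \left( n - \sum_{j=1}^{k-1} t_j \right) \log\left( 1 - \sum_{j=1}^{k-1} p_j \right).$$

The first partial of $l(\mathbf{p})$ with respect to $p_h$ leads to the system of equations

$$\frac{\partial l(\mathbf{p})}{\partial p_h} = \frac{t_h}{p_h} - \frac{n - \sum_{j=1}^{k-1} t_j}{1 - \sum_{j=1}^{k-1} p_j} = 0, \quad h = 1, \ldots, k-1.$$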

It is easily seen that $p_h = t_h/n$ satisfies these equations. Hence the maximum likelihood estimates are

$$\hat{p}_h = \frac{\sum_{i=1}^{n} X_{ih}}{n}, \quad h = 1, \ldots, k-1. \tag{4.22}$$

Each random variable $\sum_{i=1}^{n} X_{ih}$ is binomial$(n, p_h)$ with variance $n p_h (1 - p_h)$. Therefore, the maximum likelihood estimates are efficient estimates.

As a final note on information, suppose the information matrix is diagonal. Then the lower bound of the variance of the $j$th estimator (4.11) is $1/(n I_{jj}(\theta))$. Because $I_{jj}(\theta)$ is defined in terms of partial derivatives [see (4.5)], this is the information in treating all $\theta_i$, except $\theta_j$, as known. For instance, in Example 4.3, for the normal pdf the information matrix is diagonal; hence, the information for $\mu$ could have been obtained by treating $\sigma^2$ as known. Example 4.4 discusses the information for a general location and scale family. For this general family, of which the normal is a member, the information matrix is diagonal provided the underlying pdf is symmetric.

In the next theorem, we summarize the asymptotic behavior of the maximum likelihood estimator of the vector $\theta$. It shows that the mles are asymptotically efficient estimates.

Theorem 4.1. Let $X_1, \ldots, X_n$ be iid with pdf $f(x; \theta)$ for $\theta \in \Omega$. Assume the regularity conditions hold. Then

1. The likelihood equation,
$$\frac{\partial}{\partial \theta}\, l(\theta) = 0,$$
has a solution $\hat{\theta}_n$ such that $\hat{\theta}_n \xrightarrow{P} \theta$.

2. For any sequence which satisfies (1),
$$\sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{D} N_p\left( 0, I^{-1}(\theta) \right).$$

The proof of this theorem can be found in other books; see, for example, Lehmann and Casella (1998). As in the scalar case, the theorem does not assure that the maximum likelihood estimates are unique. But if the sequence of solutions is unique, then the solutions are both consistent and asymptotically normal. In applications, we can often verify uniqueness.

We immediately have the following corollary.

Corollary 4.1. Let $X_1, \ldots, X_n$ be iid with pdf $f(x; \theta)$ for $\theta \in \Omega$. Assume the regularity conditions hold. Let $\hat{\theta}_n$ be a sequence of consistent solutions of the likelihood equation. Then $\hat{\theta}_n$ are asymptotically efficient estimates; that is, for $j = 1, \ldots, p$,

$$\sqrt{n}\,(\hat{\theta}_{n,j} - \theta_j) \xrightarrow{D} N\left( 0, \left[ I^{-1}(\theta) \right]_{jj} \right).$$

Let $g$ be a transformation $g(\theta) = (g_1(\theta), \ldots, g_k(\theta))'$ such that $1 \leq k \leq p$ and that the $k \times p$ matrix of partial derivatives

$$B = \left[ \frac{\partial g_i}{\partial \theta_j} \right], \quad i = 1, \ldots, k,\ j = 1, \ldots, p,$$

has continuous elements and does not vanish in a neighborhood of $\theta$. Let $\hat{\eta} = g(\hat{\theta})$. Then $\hat{\eta}$ is the mle of $\eta = g(\theta)$. By Theorem 4.6,

$$\sqrt{n}\,(\hat{\eta} - \eta) \xrightarrow{D} N_k\left( 0, B I^{-1}(\theta) B' \right). \tag{4.23}$$

Hence the information matrix for $\sqrt{n}\,(\hat{\eta} - \eta)$ is

$$I(\eta) = \left[ B I^{-1}(\theta) B' \right]^{-1}, \tag{4.24}$$

provided the inverse exists. For a simple example of this result, reconsider Example 4.3.

Example 4.6 (Information for the Variance of a Normal Distribution). Suppose $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$. Recall from Example 4.3 that the information matrix was $I(\mu, \sigma) = \operatorname{diag}\{\sigma^{-2}, 2\sigma^{-2}\}$. Consider the transformation $g(\mu, \sigma) = \sigma^2$. Hence the matrix of partials $B$ is the row vector $[0 \ \ 2\sigma]$. Thus the information for $\sigma^2$ is

$$I(\sigma^2) = \left\{ \begin{bmatrix} 0 & 2\sigma \end{bmatrix} \begin{bmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & \dfrac{2}{\sigma^2} \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ 2\sigma \end{bmatrix} \right\}^{-1} = \frac{1}{2\sigma^4}.$$

The Rao-Cramér lower bound for the variance of an estimator of $\sigma^2$ is $(2\sigma^4)/n$. Recall that the sample variance is unbiased for $\sigma^2$, but its variance is $(2\sigma^4)/(n - 1)$. Hence, it is not efficient for finite samples, but it is asymptotically efficient.
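The arithmetic in Example 4.6 is an instance of (4.24) with $k = 1$, and it can be reproduced in a few lines. This is a sketch under stated assumptions: the function name `info_for_transform` and the values of $\sigma$ and $n$ are hypothetical. It computes $[B I^{-1}(\theta) B']^{-1}$ and then compares the resulting lower bound with the variance of the sample variance quoted in the example.

```python
def info_for_transform(info_inv, B):
    """Scalar case of (4.24): [B I^{-1}(theta) B']^{-1} for a 1 x p row vector B."""
    p = len(B)
    quad = sum(B[i] * info_inv[i][j] * B[j] for i in range(p) for j in range(p))
    return 1.0 / quad


sigma = 1.5  # hypothetical value of sigma
n = 20       # hypothetical sample size

# Inverse of I(mu, sigma) = diag(1/sigma^2, 2/sigma^2) from (4.13).
info_inv = [[sigma ** 2, 0.0], [0.0, sigma ** 2 / 2.0]]
B = [0.0, 2.0 * sigma]  # partial derivatives of g(mu, sigma) = sigma^2

info_sigma2 = info_for_transform(info_inv, B)  # should equal 1 / (2 sigma^4)
rc_bound = 1.0 / (n * info_sigma2)             # lower bound 2 sigma^4 / n
var_s2 = 2.0 * sigma ** 4 / (n - 1)            # variance of the sample variance
efficiency = rc_bound / var_s2                 # ratio (n - 1) / n
print(info_sigma2, rc_bound, var_s2, efficiency)
```

The ratio of the bound to the actual variance is $(n-1)/n$, which is below 1 for every finite $n$ and tends to 1 as $n \to \infty$, matching the closing remark of the example.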

EXERCISES

4.1. Let $X_1$, $X_2$, and $X_3$ have a multinomial distribution in which $n = 25$, $k = 4$, and the unknown probabilities are $\theta_1$, $\theta_2$, and $\theta_3$, respectively. Here we can, for convenience, let $X_4 = 25 - X_1 - X_2 - X_3$ and $\theta_4 = 1 - \theta_1 - \theta_2 - \theta_3$. If the observed values of the random variables are $x_1 = 4$, $x_2 = 11$, and $x_3 = 7$, find the maximum likelihood estimates of $\theta_1$, $\theta_2$, and $\theta_3$.

4.2. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent random samples from $N(\theta_1, \theta_3)$ and $N(\theta_2, \theta_4)$ distributions, respectively.

(a) If $\Omega \subset R^3$ is defined by
$$\Omega = \{(\theta_1, \theta_2, \theta_3) : -\infty < \theta_i < \infty,\ i = 1, 2;\ 0 < \theta_3 = \theta_4 < \infty\},$$
find the mles of $\theta_1$, $\theta_2$, and $\theta_3$.

(b) If $\Omega \subset R^2$ is defined by
$$\Omega = \{(\theta_1, \theta_3) : -\infty < \theta_1 = \theta_2 < \infty;\ 0 < \theta_3 = \theta_4 < \infty\},$$
find the mles of $\theta_1$ and $\theta_3$.

4.3. Let $X_1, X_2, \ldots, X_n$ be iid, each with the distribution having pdf $f(x; \theta_1, \theta_2) = (1/\theta_2) e^{-(x - \theta_1)/\theta_2}$, $\theta_1 \leq x < \infty$, $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$, zero elsewhere. Find the maximum likelihood estimators of $\theta_1$ and $\theta_2$.

4.4. The Pareto distribution is a frequently used model in the study of incomes and has the distribution function

$$F(x; \theta_1, \theta_2) = \begin{cases} 1 - (\theta_1/x)^{\theta_2} & \theta_1 \leq x \\ 0 & \text{elsewhere,} \end{cases}$$

where $\theta_1 > 0$ and $\theta_2 > 0$. If $X_1, X_2, \ldots, X_n$ is a random sample from this distribution, find the maximum likelihood estimators of $\theta_1$ and $\theta_2$. (Hint: This exercise deals with a nonregular case.)

4.5. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size $n$ from the uniform distribution of the continuous type over the closed interval $[\theta - \rho, \theta + \rho]$. Find the maximum likelihood estimators for $\theta$ and $\rho$. Are these two unbiased estimators?

4.6. Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu, \sigma^2)$.

(a) If the constant $b$ is defined by the equation $P(X \leq b) = 0.90$, find the mle of $b$.

(b) If $c$ is a given constant, find the mle of $P(X \leq c)$.

4.7. Consider two Bernoulli distributions with unknown parameters $p_1$ and $p_2$. If $Y$ and $Z$ equal the numbers of successes in two independent random samples, each of size $n$, from the respective distributions, determine the mles of $p_1$ and $p_2$ if we know that $0 \leq p_1 \leq p_2 \leq 1$.

4.8. Show that if $X_i$ follows the model (4.14), then its pdf is $b^{-1} f((x - a)/b)$.

4.9. Verify the partial derivatives and the entries of the information matrix for the location and scale family as given in Example 4.4.

4.10. Suppose the pdf of $X$ is of a location and scale family as defined in Example 4.4. Show that if $f(z) = f(-z)$, then the entry $I_{12}$ of the information matrix is 0. Then argue that in this case the mles of $a$ and $b$ are asymptotically independent.

4.11. Suppose $X_1, X_2, \ldots, X_n$ are iid $N(\mu, \sigma^2)$. Show that $X_i$ follows a location and scale family as given in Example 4.4. Obtain the entries of the information matrix as given in this example and show that they agree with the information matrix determined in Example 4.3.

5

Multiparameter Case: Testing

In the multiparameter case, hypotheses of interest often specify $\theta$ to be in a subregion of the space. For example, suppose $X$ has a $N(\mu, \sigma^2)$ distribution. The full space is $\Omega = \{(\mu, \sigma^2) : \sigma^2 > 0,\ -\infty < \mu < \infty\}$. This is a two-dimensional space. We may be interested though in testing that $\mu = \mu_0$, where $\mu_0$ is a specified value. Here we are not concerned about the parameter $\sigma^2$. Under $H_0$, the parameter space is the one-dimensional space $\omega = \{(\mu_0, \sigma^2) : \sigma^2 > 0\}$. We say that $H_0$ is defined in terms of one constraint on the space $\Omega$.

In general, let $X_1, \ldots, X_n$ be iid with pdf $f(x; \theta)$ for $\theta \in \Omega \subset R^p$. As in the last section, we assume that the regularity conditions listed in (1.1), (2.1), (2.2), are satisfied. In this section, we invoke these by the phrase under regularity conditions. The hypotheses of interest are

$$H_0: \theta \in \omega \ \text{ versus } \ H_1: \theta \in \Omega \cap \omega^c, \tag{5.1}$$

where $\omega \subset \Omega$ is defined in terms of $q$, $0 < q \leq p$, independent constraints of the form $g_1(\theta) = a_1, \ldots, g_q(\theta) = a_q$. The functions $g_1, \ldots, g_q$ must be continuously differentiable. This implies that $\omega$ is a $(p - q)$-dimensional space. Based on Theorem 1.1, the true parameter maximizes the likelihood function, so an intuitive test statistic is given by the likelihood ratio

$$\Lambda = \frac{\max_{\theta \in \omega} L(\theta)}{\max_{\theta \in \Omega} L(\theta)}. \tag{5.2}$$

Large values (close to 1) of $\Lambda$ suggest that $H_0$ is true, while small values indicate $H_1$ is true. For a specified level $\alpha$, $0 < \alpha < 1$, this suggests the decision rule

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } \Lambda \leq c, \tag{5.3}$$

where $c$ is such that $\alpha = \max_{\theta \in \omega} P_{\theta}[\Lambda \leq c]$. As in the scalar case, this test often has optimal properties; see Section 3. To determine $c$, we need to determine the distribution of $\Lambda$ or a function of $\Lambda$ when $H_0$ is true.

Let $\hat{\theta}$ denote the maximum likelihood estimator when the parameter space is the full space $\Omega$ and let $\hat{\theta}_0$ denote the maximum likelihood estimator when the parameter space is the reduced space $\omega$. For convenience, define $L(\hat{\Omega}) = L(\hat{\theta})$ and $L(\hat{\omega}) = L(\hat{\theta}_0)$. Then we can write the likelihood ratio test (LRT) statistic as

$$\Lambda = \frac{L(\hat{\omega})}{L(\hat{\Omega})}. \tag{5.4}$$

Example 5.1 (LRT for the Mean of a Normal pdf). Let $X_1, \ldots, X_n$ be a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$. Suppose we are interested in testing

$$H_0: \mu = \mu_0 \ \text{ versus } \ H_1: \mu \neq \mu_0, \tag{5.5}$$

where $\mu_0$ is specified. Let $\Omega = \{(\mu, \sigma^2) : -\infty < \mu < \infty,\ \sigma^2 > 0\}$ denote the full model parameter space. The reduced model parameter space is the one-dimensional subspace $\omega = \{(\mu_0, \sigma^2) : \sigma^2 > 0\}$. By Example 4.1, the mles of $\mu$ and $\sigma^2$ under $\Omega$ are $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = (1/n) \sum_{i=1}^{n} (X_i - \bar{X})^2$, respectively. Under $\Omega$, the maximum value of the likelihood function is

$$L(\hat{\Omega}) = \frac{1}{(2\pi)^{n/2}} \frac{1}{(\hat{\sigma}^2)^{n/2}} \exp\{-(n/2)\}. \tag{5.6}$$

Following Example 4.1, it is easy to show that under the reduced parameter space $\omega$, $\hat{\sigma}_0^2 = (1/n) \sum_{i=1}^{n} (X_i - \mu_0)^2$. Thus the maximum value of the likelihood function under $\omega$ is

$$L(\hat{\omega}) = \frac{1}{(2\pi)^{n/2}} \frac{1}{(\hat{\sigma}_0^2)^{n/2}} \exp\{-(n/2)\}. \tag{5.7}$$

The likelihood ratio test statistic is the ratio of $L(\hat{\omega})$ to $L(\hat{\Omega})$; i.e.,

$$\Lambda = \left( \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \mu_0)^2} \right)^{n/2}. \tag{5.8}$$

The likelihood ratio test rejects $H_0$ if $\Lambda \leq c$, but this is equivalent to rejecting $H_0$ if $\Lambda^{-2/n} \geq c'$. Next, consider the identity

$$\sum_{i=1}^{n} (X_i - \mu_0)^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 + n(\bar{X} - \mu_0)^2. \tag{5.9}$$

Substituting (5.9) for $\sum_{i=1}^{n} (X_i - \mu_0)^2$, after simplification, the test becomes reject $H_0$ if

$$1 + \frac{n(\bar{X} - \mu_0)^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \geq c',$$

or equivalently, reject $H_0$ if

$$\left\{ \frac{\sqrt{n}\,(\bar{X} - \mu_0)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)}} \right\}^2 \geq c'' = (c' - 1)(n - 1).$$

Let $T$ denote the expression within braces on the left side of this inequality. Then the decision rule is equivalent to

$$\text{Reject } H_0 \text{ in favor of } H_1 \text{ if } |T| \geq c^*, \tag{5.10}$$

where $\alpha = P_{H_0}[|T| \geq c^*]$. Of course, this is a two-sided version of the t-test. If we take $c^*$ to be $t_{\alpha/2, n-1}$, the upper $\alpha/2$-critical value of a t-distribution with $n - 1$ degrees of freedom, then our test has exact level $\alpha$. Other examples of likelihood ratio tests for normal distributions can be found in the exercises.

We are not always as fortunate as in Example 5.1 to obtain the likelihood ratio test in a simple form. Often it is difficult or perhaps impossible to obtain its finite sample distribution. But, as the next theorem shows, we can always obtain an asymptotic test based on it.

Theorem 5.1. Let $X_1, \ldots, X_n$ be iid with pdf $f(x; \theta)$ for $\theta \in \Omega \subset R^p$. Assume the regularity conditions hold. Let $\hat{\theta}_n$ be a sequence of consistent solutions of the likelihood equation when the parameter space is the full space $\Omega$. Let $\hat{\theta}_{0,n}$ be a sequence of consistent solutions of the likelihood equation when the parameter space is the reduced space $\omega$, which has dimension $p - q$. Let $\Lambda$ denote the likelihood ratio test statistic given in (5.4). Under $H_0$, (5.1),

$$-2 \log \Lambda \xrightarrow{D} \chi^2(q). \tag{5.11}$$

A proof of this theorem can be found in Rao (1973). There are analogs of the Wald-type and scores-type tests, also. The Wald-type test statistic is formulated in terms of the constraints, which deﬁne H0 , evaluated at the mle under Ω. We do not formally state it here, but as the following example shows, it is often a straightforward formulation. The interested reader can ﬁnd a discussion of these tests in Lehmann (1999). A careful reading of the development of this chapter shows that much of it remains the same if X is a random vector. The next example demonstrates this.
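Before turning to that example, the algebra of Example 5.1 can be checked numerically. The following sketch is illustrative only: the function name `lrt_and_t` and the data are hypothetical. It computes $\Lambda$ from (5.8) and the statistic $T$, and then verifies the identity $\Lambda^{-2/n} = 1 + T^2/(n-1)$ that underlies the equivalence with the two-sided t-test.

```python
import math


def lrt_and_t(xs, mu0):
    """Return (Lambda, T) for testing H0: mu = mu0 with a normal sample."""
    n = len(xs)
    xbar = sum(xs) / n
    ss_full = sum((x - xbar) ** 2 for x in xs)  # sum of squares about the sample mean
    ss_null = sum((x - mu0) ** 2 for x in xs)   # sum of squares about mu0
    lam = (ss_full / ss_null) ** (n / 2.0)      # likelihood ratio statistic (5.8)
    t = math.sqrt(n) * (xbar - mu0) / math.sqrt(ss_full / (n - 1))
    return lam, t


# Hypothetical data, used only to illustrate the computation.
sample = [1.2, -0.4, 0.9, 2.1, 0.3, 1.7, -0.2, 0.8]
n = len(sample)
lam, t = lrt_and_t(sample, 0.0)
lhs = lam ** (-2.0 / n)
rhs = 1.0 + t ** 2 / (n - 1)
print(lam, t, lhs, rhs)
```

Because $\Lambda$ is a decreasing function of $|T|$, rejecting for small $\Lambda$ and rejecting for large $|T|$ define the same critical region.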

Example 5.2 (Application of a Multinomial Distribution). As an example, consider a poll for a presidential race with $k$ candidates. Those polled are asked to select the person for which they would vote if the election were held tomorrow. Assuming that those polled are selected independently of one another and that each can select one and only one candidate, the multinomial model seems appropriate. In this problem, suppose we are interested in comparing how the two "leaders" are doing. In fact, say the null hypothesis of interest is that they are equally favorable. This can be modeled with a multinomial model which has the three categories: (1) and (2) for the two leading candidates and (3) for all other candidates. Our observation is a vector $(X_1, X_2)$, where $X_i$ is 1 or 0 depending on whether category $i$ is selected or not. If both are 0, then category (3) has been selected. Let $p_i$ denote the probability that category $i$ is selected. Then the pmf of $(X_1, X_2)$ is the trinomial density,

$$f(x_1, x_2; p_1, p_2) = p_1^{x_1} p_2^{x_2} (1 - p_1 - p_2)^{1 - x_1 - x_2}, \tag{5.12}$$

for $x_i = 0, 1$, $i = 1, 2$; $x_1 + x_2 \leq 1$, where the parameter space is $\Omega = \{(p_1, p_2) : 0 < p_i < 1,\ p_1 + p_2 < 1\}$. Suppose $(X_{11}, X_{21}), \ldots, (X_{1n}, X_{2n})$ is a random sample from this distribution. We shall consider the hypotheses

$$H_0: p_1 = p_2 \ \text{ versus } \ H_1: p_1 \neq p_2. \tag{5.13}$$

We first derive the likelihood ratio test. Let $T_j = \sum_{i=1}^{n} X_{ji}$ for $j = 1, 2$. From Example 4.5, we know that the maximum likelihood estimates are $\hat{p}_j = T_j/n$, for $j = 1, 2$. The value of the likelihood function (4.21) at the mles under $\Omega$ is

$$L(\hat{\Omega}) = \hat{p}_1^{\,n\hat{p}_1}\, \hat{p}_2^{\,n\hat{p}_2}\, (1 - \hat{p}_1 - \hat{p}_2)^{n(1 - \hat{p}_1 - \hat{p}_2)}.$$

Under the null hypothesis, let $p$ be the common value of $p_1$ and $p_2$. The pmf of $(X_1, X_2)$ is

$$f(x_1, x_2; p) = p^{x_1 + x_2} (1 - 2p)^{1 - x_1 - x_2}; \quad x_1, x_2 = 0, 1;\ x_1 + x_2 \leq 1, \tag{5.14}$$

where the parameter space is $\omega = \{p : 0 < p < 1/2\}$. The likelihood under $\omega$ is

$$L(p) = p^{t_1 + t_2} (1 - 2p)^{n - t_1 - t_2}. \tag{5.15}$$

Differentiating $\log L(p)$ with respect to $p$ and setting the derivative to 0 results in the following maximum likelihood estimate, under $\omega$:

$$\hat{p}_0 = \frac{t_1 + t_2}{2n} = \frac{\hat{p}_1 + \hat{p}_2}{2}, \tag{5.16}$$

where $\hat{p}_1$ and $\hat{p}_2$ are the mles under $\Omega$. The likelihood function evaluated at the mle under $\omega$ simplifies to

$$L(\hat{\omega}) = \left( \frac{\hat{p}_1 + \hat{p}_2}{2} \right)^{n(\hat{p}_1 + \hat{p}_2)} (1 - \hat{p}_1 - \hat{p}_2)^{n(1 - \hat{p}_1 - \hat{p}_2)}. \tag{5.17}$$

The reciprocal of the likelihood ratio test statistic then simplifies to

$$\Lambda^{-1} = \left( \frac{2\hat{p}_1}{\hat{p}_1 + \hat{p}_2} \right)^{n\hat{p}_1} \left( \frac{2\hat{p}_2}{\hat{p}_1 + \hat{p}_2} \right)^{n\hat{p}_2}. \tag{5.18}$$

Based on Theorem 5.1, an asymptotic level $\alpha$ test rejects $H_0$ if $2 \log \Lambda^{-1} > \chi^2_{\alpha}(1)$.

This is an example where Wald's test can easily be formulated. The constraint under $H_0$ is $p_1 - p_2 = 0$. Hence, the Wald-type statistic is $W = \hat{p}_1 - \hat{p}_2$, which can be expressed as $W = [1, -1]\,[\hat{p}_1; \hat{p}_2]'$. Recall that the information matrix and its inverse were found for $k$ categories in Example 4.5. From Theorem 4.1, we then have

$$\begin{bmatrix} \hat{p}_1 \\ \hat{p}_2 \end{bmatrix} \text{ is approximately } N_2\left( \begin{bmatrix} p_1 \\ p_2 \end{bmatrix},\ \frac{1}{n} \begin{bmatrix} p_1(1 - p_1) & -p_1 p_2 \\ -p_1 p_2 & p_2(1 - p_2) \end{bmatrix} \right). \tag{5.19}$$

As shown in Example 4.5, the finite sample moments are the same as the asymptotic moments. Hence the variance of $W$ is

$$\operatorname{Var}(W) = [1, -1]\, \frac{1}{n} \begin{bmatrix} p_1(1 - p_1) & -p_1 p_2 \\ -p_1 p_2 & p_2(1 - p_2) \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = \frac{p_1 + p_2 - (p_1 - p_2)^2}{n}.$$

Because $W$ is asymptotically normal, an asymptotic level $\alpha$ test for the hypotheses (5.13) is to reject $H_0$ if $\chi^2_W \geq \chi^2_{\alpha}(1)$, where

$$\chi^2_W = \frac{(\hat{p}_1 - \hat{p}_2)^2}{\left( \hat{p}_1 + \hat{p}_2 - (\hat{p}_1 - \hat{p}_2)^2 \right)/n}.$$
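The two statistics derived in this example are easy to compute side by side. The sketch below is a hypothetical illustration: the function name `trinomial_tests` and the poll counts are invented for the purpose. It evaluates $2 \log \Lambda^{-1}$ from (5.18) and $\chi^2_W$, and compares both with 3.841, the upper 0.05 critical value of a $\chi^2(1)$ distribution.

```python
import math


def trinomial_tests(t1, t2, n):
    """Return (2 log Lambda^{-1}, chi2_W) for H0: p1 = p2 in the trinomial model.

    t1 and t2 are the counts for the two leading categories out of n trials;
    this sketch assumes both counts are positive.
    """
    p1, p2 = t1 / n, t2 / n
    # 2 log Lambda^{-1} from (5.18), using n * p_j = t_j.
    lrt = 2.0 * (t1 * math.log(2.0 * p1 / (p1 + p2))
                 + t2 * math.log(2.0 * p2 / (p1 + p2)))
    # Wald-type chi-square statistic.
    wald = (p1 - p2) ** 2 / ((p1 + p2 - (p1 - p2) ** 2) / n)
    return lrt, wald


CHI2_05_1 = 3.841  # upper 0.05 critical value of chi-square(1)

# Hypothetical poll: 240 and 180 votes for the two leaders among 500 respondents.
lrt, wald = trinomial_tests(t1=240, t2=180, n=500)
print(lrt, wald, lrt > CHI2_05_1, wald > CHI2_05_1)
```

For these invented counts the two statistics are close to each other, as expected for two tests that are asymptotically equivalent under $H_0$, and both exceed the critical value.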

It also follows that an asymptotic $(1 - \alpha)100\%$ confidence interval for the difference $p_1 - p_2$ is

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2} \left( \frac{\hat{p}_1 + \hat{p}_2 - (\hat{p}_1 - \hat{p}_2)^2}{n} \right)^{1/2}.$$

Returning to the polling situation discussed at the beginning of this example, we would say the race is too close to call if 0 is in this confidence interval.

Example 5.3 (Two-Sample Binomial Proportions). In Example 5.2, we developed tests for $p_1 = p_2$ based on a single sample from a multinomial distribution. Now consider the situation where $X_1, X_2, \ldots, X_{n_1}$ is a random sample from a $b(1, p_1)$ distribution, $Y_1, Y_2, \ldots, Y_{n_2}$ is a random sample from a $b(1, p_2)$ distribution, and the $X_i$s and $Y_j$s are mutually independent. The hypotheses of interest are

$$H_0: p_1 = p_2 \ \text{ versus } \ H_1: p_1 \neq p_2. \tag{5.20}$$

This situation occurs in practice when, for instance, we are comparing the president's rating from one month to the next. The full and reduced model parameter spaces are given respectively by $\Omega = \{(p_1, p_2) : 0 < p_i < 1,\ i = 1, 2\}$ and $\omega = \{(p, p) : 0 < p < 1\}$. The likelihood function for the full model simplifies to

$$L(p_1, p_2) = p_1^{n_1 \bar{x}} (1 - p_1)^{n_1 - n_1 \bar{x}}\, p_2^{n_2 \bar{y}} (1 - p_2)^{n_2 - n_2 \bar{y}}. \tag{5.21}$$

It follows immediately that the mles of $p_1$ and $p_2$ are $\bar{x}$ and $\bar{y}$, respectively. Note, for the reduced model, that we can combine the samples into one large sample from a $b(n, p)$ distribution, where $n = n_1 + n_2$ is the combined sample size. Hence, for the reduced model, the mle of $p$ is

$$\hat{p} = \frac{\sum_{i=1}^{n_1} x_i + \sum_{i=1}^{n_2} y_i}{n_1 + n_2} = \frac{n_1 \bar{x} + n_2 \bar{y}}{n}, \tag{5.22}$$

i.e., a weighted average of the individual sample proportions. Using this, the reader is asked to derive the LRT for the hypotheses (5.20) in Exercise 5.9. We next derive the Wald-type test. Let $\hat{p}_1 = \bar{x}$ and $\hat{p}_2 = \bar{y}$. From the Central Limit Theorem, we have

$$\frac{\sqrt{n_i}\,(\hat{p}_i - p_i)}{\sqrt{p_i(1 - p_i)}} \xrightarrow{D} Z_i, \quad i = 1, 2,$$

where $Z_1$ and $Z_2$ are iid $N(0, 1)$ random variables. Assume for $i = 1, 2$ that, as $n \to \infty$, $n_i/n \to \lambda_i$, where $0 < \lambda_i < 1$ and $\lambda_1 + \lambda_2 = 1$. As Exercise 5.10 shows,

$$\sqrt{n}\,[(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)] \xrightarrow{D} N\left( 0,\ \frac{1}{\lambda_1} p_1(1 - p_1) + \frac{1}{\lambda_2} p_2(1 - p_2) \right). \tag{5.23}$$

It follows that the random variable

$$Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}} \tag{5.24}$$

has an approximate $N(0, 1)$ distribution. Under $H_0$, $p_1 - p_2 = 0$. We could use $Z$ as a test statistic, provided we replace the parameters $p_1(1 - p_1)$ and $p_2(1 - p_2)$ in its denominator with a consistent estimate. Recall that $\hat{p}_i \to p_i$, $i = 1, 2$, in probability. Thus under $H_0$, the statistic

$$Z^* = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}} \tag{5.25}$$

has an approximate $N(0, 1)$ distribution. Hence an approximate level $\alpha$ test is to reject $H_0$ if $|z^*| \geq z_{\alpha/2}$. Another consistent estimator of the denominator is discussed in Exercise 5.11.
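The statistic $Z^*$ of (5.25) and its two-sided approximate p-value take only a few lines to compute. This is a minimal sketch with hypothetical names (`two_sample_z`, `normal_cdf`) and invented counts; the standard normal cdf is obtained from `math.erf`, so no external library is assumed.

```python
import math


def two_sample_z(x_successes, n1, y_successes, n2):
    """Wald-type statistic Z* of (5.25) for H0: p1 = p2."""
    p1_hat = x_successes / n1
    p2_hat = y_successes / n2
    # Unpooled estimate of the standard deviation of p1_hat - p2_hat.
    se = math.sqrt(p1_hat * (1.0 - p1_hat) / n1 + p2_hat * (1.0 - p2_hat) / n2)
    return (p1_hat - p2_hat) / se


def normal_cdf(z):
    """Standard normal cdf expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical counts: 45 successes in 100 trials versus 30 successes in 80 trials.
z_star = two_sample_z(45, 100, 30, 80)
p_value = 2.0 * (1.0 - normal_cdf(abs(z_star)))  # two-sided approximate p-value
print(z_star, p_value)
```

The denominator here estimates the two variances separately, which is the choice made in (5.25); a pooled alternative that is valid under $H_0$ is the subject of Exercise 5.11.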

EXERCISES

5.1. In Example 5.1 let $n = 10$, and let the experimental value of the random variables yield $\bar{x} = 0.6$ and $\sum_{1}^{10} (x_i - \bar{x})^2 = 3.6$. If the test derived in that example is used, do we accept or reject $H_0: \theta_1 = 0$ at the 5% significance level?

5.2. Let $X_1, X_2, \ldots, X_n$ be a random sample from the distribution $N(\theta_1, \theta_2)$. Show that the likelihood ratio principle for testing $H_0: \theta_2 = \theta_2'$ specified, and $\theta_1$ unspecified against $H_1: \theta_2 \neq \theta_2'$, $\theta_1$ unspecified, leads to a test that rejects when $\sum_{1}^{n} (x_i - \bar{x})^2 \leq c_1$ or $\sum_{1}^{n} (x_i - \bar{x})^2 \geq c_2$, where $c_1 < c_2$ are selected appropriately.

5.3. Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be independent random samples from the distributions $N(\theta_1, \theta_3)$ and $N(\theta_2, \theta_4)$, respectively.

(a) Show that the likelihood ratio for testing $H_0: \theta_1 = \theta_2$, $\theta_3 = \theta_4$ against all alternatives is given by

$$\frac{\left[ \sum_{1}^{n} (x_i - \bar{x})^2 / n \right]^{n/2} \left[ \sum_{1}^{m} (y_i - \bar{y})^2 / m \right]^{m/2}}{\left\{ \left[ \sum_{1}^{n} (x_i - u)^2 + \sum_{1}^{m} (y_i - u)^2 \right] / (m + n) \right\}^{(n+m)/2}},$$

where $u = (n\bar{x} + m\bar{y})/(n + m)$.

(b) Show that the likelihood ratio test for testing $H_0: \theta_3 = \theta_4$, $\theta_1$ and $\theta_2$ unspecified, against $H_1: \theta_3 \neq \theta_4$, $\theta_1$ and $\theta_2$ unspecified, can be based on the random variable

$$F = \frac{\sum_{1}^{n} (X_i - \bar{X})^2 / (n - 1)}{\sum_{1}^{m} (Y_i - \bar{Y})^2 / (m - 1)}.$$

5.4. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent random samples from the two normal distributions $N(0, \theta_1)$ and $N(0, \theta_2)$.

(a) Find the likelihood ratio $\Lambda$ for testing the composite hypothesis $H_0: \theta_1 = \theta_2$ against the composite alternative $H_1: \theta_1 \neq \theta_2$.

(b) This $\Lambda$ is a function of what $F$-statistic that would actually be used in this test?

5.5. Let $X$ and $Y$ be two independent random variables with respective pdfs

$$f(x; \theta_i) = \begin{cases} \dfrac{1}{\theta_i} e^{-x/\theta_i} & 0 < x < \infty,\ 0 < \theta_i < \infty \\ 0 & \text{elsewhere,} \end{cases}$$

for $i = 1, 2$. To test $H_0: \theta_1 = \theta_2$ against $H_1: \theta_1 \neq \theta_2$, two independent samples of sizes $n_1$ and $n_2$, respectively, were taken from these distributions. Find the likelihood ratio $\Lambda$ and show that $\Lambda$ can be written as a function of a statistic having an $F$-distribution, under $H_0$.

5.6. Consider the two uniform distributions with respective pdfs

$$f(x; \theta_i) = \begin{cases} \dfrac{1}{2\theta_i} & -\theta_i < x < \theta_i,\ 0 < \theta_i < \infty \\ 0 & \text{elsewhere,} \end{cases}$$

for $i = 1, 2$. The null hypothesis is $H_0: \theta_1 = \theta_2$, while the alternative is $H_1: \theta_1 \neq \theta_2$. Let $X_1 < X_2 < \cdots < X_{n_1}$ and $Y_1 < Y_2 < \cdots < Y_{n_2}$ be the order statistics of two independent random samples from the respective distributions. Using the likelihood ratio $\Lambda$, find the statistic used to test $H_0$ against $H_1$. Find the distribution of $-2 \log \Lambda$ when $H_0$ is true. Note that in this nonregular case, the number of degrees of freedom is two times the difference of the dimensions of $\Omega$ and $\omega$.

5.7. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be a random sample from a bivariate normal distribution with $\mu_1$, $\mu_2$, $\sigma_1^2 = \sigma_2^2 = \sigma^2$, $\rho = \frac{1}{2}$, where $\mu_1$, $\mu_2$, and $\sigma^2 > 0$ are unknown real numbers. Find the likelihood ratio $\Lambda$ for testing $H_0: \mu_1 = \mu_2 = 0$, $\sigma^2$ unknown against all alternatives. The likelihood ratio $\Lambda$ is a function of what statistic that has a well-known distribution?

5.8. Let $n$ independent trials of an experiment be such that $x_1, x_2, \ldots, x_k$ are the respective numbers of times that the experiment ends in the mutually exclusive and exhaustive events $C_1, C_2, \ldots, C_k$. If $p_i = P(C_i)$ is constant throughout the $n$ trials, then the probability of that particular sequence of trials is $L = p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$.

(a) Recalling that $p_1 + p_2 + \cdots + p_k = 1$, show that the likelihood ratio for testing $H_0: p_i = p_{i0} > 0$, $i = 1, 2, \ldots, k$, against all alternatives is given by

$$\Lambda = \prod_{i=1}^{k} \frac{(p_{i0})^{x_i}}{(x_i/n)^{x_i}}.$$

(b) Show that

$$-2 \log \Lambda = \sum_{i=1}^{k} \frac{x_i (x_i - n p_{i0})^2}{(n p_i')^2},$$

where $p_i'$ is between $p_{i0}$ and $x_i/n$. Hint: Expand $\log p_{i0}$ in a Taylor's series with the remainder in the term involving $(p_{i0} - x_i/n)^2$.

(c) For large $n$, argue that $x_i/(n p_i')^2$ is approximated by $1/(n p_{i0})$ and hence

$$-2 \log \Lambda \approx \sum_{i=1}^{k} \frac{(x_i - n p_{i0})^2}{n p_{i0}}$$

when $H_0$ is true.

Theorem 5.1 says that the right-hand member of this last equation defines a statistic that has an approximate chi-square distribution with $k - 1$ degrees of freedom. Note that dimension of $\Omega$ − dimension of $\omega$ = $(k - 1) - 0 = k - 1$.

5.9. Finish the derivation of the LRT found in Example 5.3. Simplify as much as possible.

5.10. Show that expression (5.23) of Example 5.3 is true.

5.11. As discussed in Example 5.3, $Z$, (5.25), can be used as a test statistic provided we have a consistent estimator of $p_1(1 - p_1)$ and $p_2(1 - p_2)$ when $H_0$ is true. In the example, we discussed an estimator which is consistent under both $H_0$ and $H_1$. Under $H_0$, though, $p_1(1 - p_1) = p_2(1 - p_2) = p(1 - p)$, where $p = p_1 = p_2$. Show that the statistic (5.22) is a consistent estimator of $p$, under $H_0$. Thus determine another test of $H_0$.

5.12. A machine shop that manufactures toggle levers has both a day and a night shift. A toggle lever is defective if a standard nut cannot be screwed onto the threads. Let $p_1$ and $p_2$ be the proportion of defective levers among those manufactured by the day and night shifts, respectively. We shall test the null hypothesis, $H_0 : p_1 = p_2$, against a two-sided alternative hypothesis based on two random samples, each of 1000 levers taken from the production of the respective shifts. Use the test statistic $Z^*$ given in Example 5.3.
(a) Sketch a standard normal pdf illustrating the critical region having $\alpha = 0.05$.
(b) If $y_1 = 37$ and $y_2 = 53$ defectives were observed for the day and night shifts, respectively, calculate the value of the test statistic and the approximate p-value (note that this is a two-sided test). Locate the calculated test statistic on your figure in part (a) and state your conclusion. Obtain the approximate p-value of the test.

5.13. For the situation given in part (b) of Exercise 5.12, calculate the tests defined in Exercises 5.9 and 5.11. Obtain the approximate p-values of all three tests. Discuss the results.

6 The EM Algorithm

In practice, we are often in the situation where part of the data is missing. For example, we may be observing lifetimes of mechanical parts which have been put on test and some of these parts are still functioning when the statistical analysis is carried out. In this section, we introduce the EM algorithm, which frequently can be used in these situations to obtain maximum likelihood estimates. Our presentation is brief. For further information, the interested reader can consult the literature in this area, including the monograph by McLachlan and Krishnan (1997). Although, for convenience, we write in terms of continuous random variables, the theory in this section holds for the discrete case as well.

Suppose we consider a sample of $n$ items, where $n_1$ of the items are observed, while $n_2 = n - n_1$ items are not observable. Denote the observed items by $\mathbf{X} = (X_1, X_2, \ldots, X_{n_1})$ and the unobserved items by $\mathbf{Z} = (Z_1, Z_2, \ldots, Z_{n_2})$. Assume that the $X_i$'s are iid with pdf $f(x|\theta)$, where $\theta \in \Omega$. Assume that the $Z_j$'s and the $X_i$'s are mutually independent. The conditional notation will prove useful here. Let $g(\mathbf{x}|\theta)$ denote the joint pdf of $\mathbf{X}$. Let $h(\mathbf{x}, \mathbf{z}|\theta)$ denote the joint pdf of the observed and unobserved items. Let $k(\mathbf{z}|\theta, \mathbf{x})$ denote the conditional pdf of the missing data given the observed data. By the definition of a conditional pdf, we have the identity

$$k(\mathbf{z}|\theta, \mathbf{x}) = \frac{h(\mathbf{x}, \mathbf{z}|\theta)}{g(\mathbf{x}|\theta)}. \tag{6.1}$$

The observed likelihood function is $L(\theta|\mathbf{x}) = g(\mathbf{x}|\theta)$. The complete likelihood function is defined by

$$L^c(\theta|\mathbf{x}, \mathbf{z}) = h(\mathbf{x}, \mathbf{z}|\theta). \tag{6.2}$$

Our goal is to maximize the likelihood function $L(\theta|\mathbf{x})$ by using the complete likelihood $L^c(\theta|\mathbf{x}, \mathbf{z})$ in this process. Because $k(\mathbf{z}|\theta_0, \mathbf{x})$ integrates to 1 over $\mathbf{z}$, (6.1) gives the following basic identity for an arbitrary but fixed $\theta_0 \in \Omega$:

$$\begin{aligned}
\log L(\theta|\mathbf{x}) &= \int \log L(\theta|\mathbf{x})\, k(\mathbf{z}|\theta_0, \mathbf{x})\, d\mathbf{z} \\
&= \int \log g(\mathbf{x}|\theta)\, k(\mathbf{z}|\theta_0, \mathbf{x})\, d\mathbf{z} \\
&= \int [\log h(\mathbf{x}, \mathbf{z}|\theta) - \log k(\mathbf{z}|\theta, \mathbf{x})]\, k(\mathbf{z}|\theta_0, \mathbf{x})\, d\mathbf{z} \\
&= \int \log[h(\mathbf{x}, \mathbf{z}|\theta)]\, k(\mathbf{z}|\theta_0, \mathbf{x})\, d\mathbf{z} - \int \log[k(\mathbf{z}|\theta, \mathbf{x})]\, k(\mathbf{z}|\theta_0, \mathbf{x})\, d\mathbf{z} \\
&= E_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{Z})\,|\,\theta_0, \mathbf{x}] - E_{\theta_0}[\log k(\mathbf{Z}|\theta, \mathbf{x})\,|\,\theta_0, \mathbf{x}],
\end{aligned} \tag{6.3}$$

where the expectations are taken under the conditional pdf $k(\mathbf{z}|\theta_0, \mathbf{x})$. Define the first term on the right side of (6.3) to be the function

$$Q(\theta|\theta_0, \mathbf{x}) = E_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{Z})\,|\,\theta_0, \mathbf{x}]. \tag{6.4}$$

The expectation which defines the function $Q$ is called the E step of the EM algorithm. Recall that we want to maximize $\log L(\theta|\mathbf{x})$. As discussed below, we need only maximize $Q(\theta|\theta_0, \mathbf{x})$. This maximization is called the M step of the EM algorithm. Denote by $\theta^{(0)}$ an initial estimate of $\theta$, perhaps based on the observed likelihood. Let $\theta^{(1)}$ be the argument which maximizes $Q(\theta|\theta^{(0)}, \mathbf{x})$. This is the first-step estimate of $\theta$. Proceeding this way, we obtain a sequence of estimates $\theta^{(m)}$. We formally define this algorithm as follows:

Algorithm 6.1 (EM Algorithm). Let $\theta^{(m)}$ denote the estimate on the $m$th step. To compute the estimate on the $(m + 1)$st step, do

1. Expectation Step: Compute

$$Q(\theta|\theta^{(m)}, \mathbf{x}) = E_{\theta^{(m)}}[\log L^c(\theta|\mathbf{x}, \mathbf{Z})\,|\,\theta^{(m)}, \mathbf{x}], \tag{6.5}$$

where the expectation is taken under the conditional pdf $k(\mathbf{z}|\theta^{(m)}, \mathbf{x})$.

2. Maximization Step: Let

$$\theta^{(m+1)} = \operatorname{Argmax}\, Q(\theta|\theta^{(m)}, \mathbf{x}). \tag{6.6}$$

Under strong assumptions, it can be shown that $\theta^{(m)}$ converges in probability to the maximum likelihood estimate, as $m \to \infty$. We will not show these results, but as the next theorem shows, $\theta^{(m+1)}$ never decreases the likelihood relative to $\theta^{(m)}$.

Theorem 6.1. The sequence of estimates $\theta^{(m)}$, defined by Algorithm 6.1, satisfies

$$L(\theta^{(m+1)}|\mathbf{x}) \geq L(\theta^{(m)}|\mathbf{x}). \tag{6.7}$$

Proof: Because $\theta^{(m+1)}$ maximizes $Q(\theta|\theta^{(m)}, \mathbf{x})$, we have

$$Q(\theta^{(m+1)}|\theta^{(m)}, \mathbf{x}) \geq Q(\theta^{(m)}|\theta^{(m)}, \mathbf{x});$$

that is,

$$E_{\theta^{(m)}}[\log L^c(\theta^{(m+1)}|\mathbf{x}, \mathbf{Z})] \geq E_{\theta^{(m)}}[\log L^c(\theta^{(m)}|\mathbf{x}, \mathbf{Z})], \tag{6.8}$$

where the expectation is taken under the pdf $k(\mathbf{z}|\theta^{(m)}, \mathbf{x})$. By expression (6.3), we can complete the proof by showing that

$$E_{\theta^{(m)}}[\log k(\mathbf{Z}|\theta^{(m+1)}, \mathbf{x})] \leq E_{\theta^{(m)}}[\log k(\mathbf{Z}|\theta^{(m)}, \mathbf{x})]. \tag{6.9}$$

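Keep in mind that these expectations are taken under the conditional pdf of $\mathbf{Z}$ given $\theta^{(m)}$ and $\mathbf{x}$. An application of Jensen's inequality yields

$$\begin{aligned}
E_{\theta^{(m)}}\left[\log \frac{k(\mathbf{Z}|\theta^{(m+1)}, \mathbf{x})}{k(\mathbf{Z}|\theta^{(m)}, \mathbf{x})}\right]
&\leq \log E_{\theta^{(m)}}\left[\frac{k(\mathbf{Z}|\theta^{(m+1)}, \mathbf{x})}{k(\mathbf{Z}|\theta^{(m)}, \mathbf{x})}\right] \\
&= \log \int \frac{k(\mathbf{z}|\theta^{(m+1)}, \mathbf{x})}{k(\mathbf{z}|\theta^{(m)}, \mathbf{x})}\, k(\mathbf{z}|\theta^{(m)}, \mathbf{x})\, d\mathbf{z} \\
&= \log(1) = 0.
\end{aligned} \tag{6.10}$$

This last result establishes (6.9) and, hence, finishes the proof.

As an example, suppose $X_1, X_2, \ldots, X_{n_1}$ are iid with pdf $f(x - \theta)$, for $-\infty < x < \infty$, where $-\infty < \theta < \infty$. Denote the cdf of $X_i$ by $F(x - \theta)$. Let $Z_1, Z_2, \ldots, Z_{n_2}$ denote the censored observations. For these observations, we only know that $Z_j > a$, for some $a$ which is known, and that the $Z_j$'s are independent of the $X_i$'s. Then the observed and complete likelihoods are given by

$$L(\theta|\mathbf{x}) = [1 - F(a - \theta)]^{n_2} \prod_{i=1}^{n_1} f(x_i - \theta) \tag{6.11}$$

$$L^c(\theta|\mathbf{x}, \mathbf{z}) = \prod_{i=1}^{n_1} f(x_i - \theta) \prod_{i=1}^{n_2} f(z_i - \theta). \tag{6.12}$$

By expression (6.1), the conditional distribution of $\mathbf{Z}$ given $\mathbf{X}$ is the ratio of (6.12) to (6.11); that is,

$$\begin{aligned}
k(\mathbf{z}|\theta, \mathbf{x}) &= \frac{\prod_{i=1}^{n_1} f(x_i - \theta) \prod_{i=1}^{n_2} f(z_i - \theta)}{[1 - F(a - \theta)]^{n_2} \prod_{i=1}^{n_1} f(x_i - \theta)} \\
&= [1 - F(a - \theta)]^{-n_2} \prod_{i=1}^{n_2} f(z_i - \theta), \quad a < z_i,\ i = 1, \ldots, n_2.
\end{aligned} \tag{6.13}$$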

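Thus, $\mathbf{Z}$ and $\mathbf{X}$ are independent, and $Z_1, \ldots, Z_{n_2}$ are iid with the common pdf $f(z - \theta)/[1 - F(a - \theta)]$, for $z > a$. Based on these observations and expression (6.13), we have the following derivation:

$$\begin{aligned}
Q(\theta|\theta_0, \mathbf{x}) &= E_{\theta_0}[\log L^c(\theta|\mathbf{x}, \mathbf{Z})] \\
&= E_{\theta_0}\left[\sum_{i=1}^{n_1} \log f(x_i - \theta) + \sum_{i=1}^{n_2} \log f(Z_i - \theta)\right] \\
&= \sum_{i=1}^{n_1} \log f(x_i - \theta) + n_2 E_{\theta_0}[\log f(Z - \theta)] \\
&= \sum_{i=1}^{n_1} \log f(x_i - \theta) + n_2 \int_a^{\infty} \log f(z - \theta)\, \frac{f(z - \theta_0)}{1 - F(a - \theta_0)}\, dz.
\end{aligned} \tag{6.14}$$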

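This last result is the E step of the EM algorithm. For the M step, we need the partial derivative of $Q(\theta|\theta_0, \mathbf{x})$ with respect to $\theta$. This is easily found to be

$$\frac{\partial Q}{\partial \theta} = -\left\{\sum_{i=1}^{n_1} \frac{f'(x_i - \theta)}{f(x_i - \theta)} + n_2 \int_a^{\infty} \frac{f'(z - \theta)}{f(z - \theta)}\, \frac{f(z - \theta_0)}{1 - F(a - \theta_0)}\, dz\right\}. \tag{6.15}$$

Taking $\theta_0$ to be the initial estimate, the first-step EM estimate would be the value of $\theta$, say $\theta^{(1)}$, which solves $\partial Q/\partial \theta = 0$. In the next example, we obtain the solution for a normal model.

Example 6.1. Assume the censoring model given above, but now assume that $X$ has a $N(\theta, 1)$ distribution. Then $f(x) = \phi(x) = (2\pi)^{-1/2} \exp\{-x^2/2\}$. It is easy to show that $f'(x)/f(x) = -x$. Letting $\Phi(z)$ denote, as usual, the cdf of a standard normal random variable, by (6.15) the partial derivative of $Q(\theta|\theta_0, \mathbf{x})$ with respect to $\theta$ for this model simplifies to

$$\begin{aligned}
\frac{\partial Q}{\partial \theta} &= \sum_{i=1}^{n_1} (x_i - \theta) + n_2 \int_a^{\infty} (z - \theta)\, \frac{1}{\sqrt{2\pi}}\, \frac{\exp\{-(z - \theta_0)^2/2\}}{1 - \Phi(a - \theta_0)}\, dz \\
&= n_1(\bar{x} - \theta) + n_2 \int_a^{\infty} (z - \theta_0)\, \frac{1}{\sqrt{2\pi}}\, \frac{\exp\{-(z - \theta_0)^2/2\}}{1 - \Phi(a - \theta_0)}\, dz - n_2(\theta - \theta_0) \\
&= n_1(\bar{x} - \theta) + \frac{n_2\, \phi(a - \theta_0)}{1 - \Phi(a - \theta_0)} - n_2(\theta - \theta_0).
\end{aligned}$$

Solving $\partial Q/\partial \theta = 0$ for $\theta$ determines the EM step estimates. In particular, given that $\theta^{(m)}$ is the EM estimate on the $m$th step, the $(m + 1)$st step estimate is

$$\theta^{(m+1)} = \frac{n_1}{n}\, \bar{x} + \frac{n_2}{n}\, \theta^{(m)} + \frac{n_2}{n}\, \frac{\phi(a - \theta^{(m)})}{1 - \Phi(a - \theta^{(m)})}, \tag{6.16}$$

where $n = n_1 + n_2$.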
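The iteration (6.16) is simple enough to run directly. The following is a minimal Python sketch of it, not the text's own program: the function names (`em_update`, `observed_loglik`, `em_censored_normal`), the stopping tolerance, and the small data set are assumptions made for illustration, and the data are invented. It also evaluates the log of the observed likelihood (6.11) so that the monotonicity stated in Theorem 6.1 can be checked numerically.

```python
from math import log
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal: .pdf is phi, .cdf is Phi


def em_update(theta, x, n2, a):
    """One EM step (6.16) for N(theta, 1) data with n2 values censored at a."""
    n1 = len(x)
    n = n1 + n2
    # phi(a - theta) / [1 - Phi(a - theta)], the term produced by the E step
    ratio = STD_NORMAL.pdf(a - theta) / (1.0 - STD_NORMAL.cdf(a - theta))
    return (n1 / n) * fmean(x) + (n2 / n) * theta + (n2 / n) * ratio


def observed_loglik(theta, x, n2, a):
    """Log of the observed likelihood (6.11) for the normal model."""
    censored_part = n2 * log(1.0 - STD_NORMAL.cdf(a - theta))
    observed_part = sum(log(STD_NORMAL.pdf(xi - theta)) for xi in x)
    return censored_part + observed_part


def em_censored_normal(x, n2, a, theta0, tol=1e-10, max_iter=500):
    """Iterate (6.16) from theta0; return the list of successive estimates."""
    path = [theta0]
    for _ in range(max_iter):
        path.append(em_update(path[-1], x, n2, a))
        if abs(path[-1] - path[-2]) < tol:
            break
    return path


# Hypothetical data, invented for illustration only:
# six observed values and three values known only to exceed a = 1.5.
x_obs = [0.3, -0.5, 1.1, 0.8, 0.1, 1.4]
path = em_censored_normal(x_obs, n2=3, a=1.5, theta0=fmean(x_obs))
theta_hat = path[-1]
print(round(theta_hat, 4), "after", len(path) - 1, "steps")
```

Starting from the mean of the observed values is one reasonable choice of initial estimate; since the censored values are all larger than $a$, the iterates move upward from that starting point.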

For our second example, consider a mixture problem involving normal distributions. Suppose $Y_1$ has a $N(\mu_1, \sigma_1^2)$ distribution and $Y_2$ has a $N(\mu_2, \sigma_2^2)$ distribution. Let $W$ be a Bernoulli random variable independent of $Y_1$ and $Y_2$ and with probability of success $\epsilon = P(W = 1)$. Suppose the random variable we observe is $X = (1 - W)Y_1 + W Y_2$. In this case, the vector of parameters is given by $\boldsymbol{\theta} = (\mu_1, \mu_2, \sigma_1, \sigma_2, \epsilon)$. The pdf of the mixture random variable $X$ is

$$f(x) = (1 - \epsilon) f_1(x) + \epsilon f_2(x), \quad -\infty < x < \infty, \tag{6.17}$$

where $f_j(x) = \sigma_j^{-1} \phi[(x - \mu_j)/\sigma_j]$, $j = 1, 2$, and $\phi(z)$ is the pdf of a standard normal random variable. Suppose we observe a random sample $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ from this mixture distribution with pdf $f(x)$. Then the log of the likelihood function is

$$l(\boldsymbol{\theta}|\mathbf{x}) = \sum_{i=1}^{n} \log[(1 - \epsilon) f_1(x_i) + \epsilon f_2(x_i)]. \tag{6.18}$$

In this mixture problem, the unobserved data are the random variables which identify the distribution membership. For $i = 1, 2, \ldots, n$, define the random variables

$$W_i = \begin{cases} 0 & \text{if } X_i \text{ has pdf } f_1(x) \\ 1 & \text{if } X_i \text{ has pdf } f_2(x). \end{cases}$$

These variables, of course, constitute the random sample on the Bernoulli random variable $W$. Accordingly, assume that $W_1, W_2, \ldots, W_n$ are iid Bernoulli random variables with probability of success $\epsilon$. The complete likelihood function is

$$L^c(\boldsymbol{\theta}|\mathbf{x}, \mathbf{w}) = \prod_{W_i = 0} f_1(x_i) \prod_{W_i = 1} f_2(x_i).$$

Hence the log of the complete likelihood function is

$$\begin{aligned}
l^c(\boldsymbol{\theta}|\mathbf{x}, \mathbf{w}) &= \sum_{W_i = 0} \log f_1(x_i) + \sum_{W_i = 1} \log f_2(x_i) \\
&= \sum_{i=1}^{n} [(1 - w_i) \log f_1(x_i) + w_i \log f_2(x_i)].
\end{aligned} \tag{6.19}$$

For the E step of the algorithm, we need the conditional expectation of $W_i$ given $\mathbf{x}$ under $\boldsymbol{\theta}_0$; that is,

$$E_{\boldsymbol{\theta}_0}[W_i|\boldsymbol{\theta}_0, \mathbf{x}] = P[W_i = 1|\boldsymbol{\theta}_0, \mathbf{x}].$$

An estimate of this expectation is the likelihood of $x_i$ being drawn from distribution $f_2(x)$, which is given by

$$\gamma_i = \frac{\epsilon f_{2,0}(x_i)}{(1 - \epsilon) f_{1,0}(x_i) + \epsilon f_{2,0}(x_i)}, \tag{6.20}$$

where the subscript 0 signifies that the parameters at $\boldsymbol{\theta}_0$ are being used (with $\epsilon$ also taken at its value in $\boldsymbol{\theta}_0$). Expression (6.20) is intuitively evident; see McLachlan and Krishnan (1997) for more discussion. Replacing $w_i$ by $\gamma_i$ in expression (6.19), the M step of the algorithm is to maximize

$$Q(\boldsymbol{\theta}|\boldsymbol{\theta}_0, \mathbf{x}) = \sum_{i=1}^{n} [(1 - \gamma_i) \log f_1(x_i) + \gamma_i \log f_2(x_i)]. \tag{6.21}$$
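One full EM iteration for this mixture can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the text's own program: the names `normal_pdf`, `mixture_loglik`, and `em_step`, the starting values, and the data are all hypothetical. The E step computes the weights $\gamma_i$ of (6.20); the M step uses the weighted means and weighted variances that maximize (6.21), and takes the average of the $\gamma_i$ as the new mixing probability.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_loglik(x, mu1, mu2, s1, s2, eps):
    """Observed log-likelihood (6.18)."""
    return sum(
        math.log((1.0 - eps) * normal_pdf(xi, mu1, s1) + eps * normal_pdf(xi, mu2, s2))
        for xi in x
    )


def em_step(x, mu1, mu2, s1, s2, eps):
    """One EM iteration: weights (6.20), then the maximizer of (6.21)."""
    gamma = []
    for xi in x:
        part1 = (1.0 - eps) * normal_pdf(xi, mu1, s1)
        part2 = eps * normal_pdf(xi, mu2, s2)
        gamma.append(part2 / (part1 + part2))
    w2 = sum(gamma)        # total weight assigned to component 2
    w1 = len(x) - w2       # total weight assigned to component 1
    new_mu1 = sum((1.0 - g) * xi for g, xi in zip(gamma, x)) / w1
    new_mu2 = sum(g * xi for g, xi in zip(gamma, x)) / w2
    new_s1 = math.sqrt(sum((1.0 - g) * (xi - new_mu1) ** 2 for g, xi in zip(gamma, x)) / w1)
    new_s2 = math.sqrt(sum(g * (xi - new_mu2) ** 2 for g, xi in zip(gamma, x)) / w2)
    return new_mu1, new_mu2, new_s1, new_s2, w2 / len(x)


# Hypothetical data and starting values, invented for illustration only.
x = [98.1, 101.5, 95.2, 104.9, 99.7, 121.3, 118.8, 125.6, 130.2, 116.4, 122.9, 127.5]
params = (100.0, 120.0, 5.0, 5.0, 0.5)  # (mu1, mu2, sigma1, sigma2, eps)
logliks = [mixture_loglik(x, *params)]
for _ in range(25):
    params = em_step(x, *params)
    logliks.append(mixture_loglik(x, *params))
print([round(p, 3) for p in params])
```

Tracking `logliks` is a cheap safeguard: by Theorem 6.1 the sequence should never decrease, so a drop usually signals a coding error.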

This maximization is easy to obtain by taking partial derivatives of $Q(\boldsymbol{\theta}|\boldsymbol{\theta}_0, \mathbf{x})$ with respect to the parameters. For example,

$$\frac{\partial Q}{\partial \mu_1} = \sum_{i=1}^{n} (1 - \gamma_i)(-1/2\sigma_1^2)(-2)(x_i - \mu_1).$$

Setting this to 0 and solving for $\mu_1$ yields the estimate of $\mu_1$. The estimates of the other mean and the variances can be obtained similarly. These estimates are

$$\hat{\mu}_1 = \frac{\sum_{i=1}^{n} (1 - \gamma_i) x_i}{\sum_{i=1}^{n} (1 - \gamma_i)}, \qquad
\hat{\sigma}_1^2 = \frac{\sum_{i=1}^{n} (1 - \gamma_i)(x_i - \hat{\mu}_1)^2}{\sum_{i=1}^{n} (1 - \gamma_i)},$$

$$\hat{\mu}_2 = \frac{\sum_{i=1}^{n} \gamma_i x_i}{\sum_{i=1}^{n} \gamma_i}, \qquad
\hat{\sigma}_2^2 = \frac{\sum_{i=1}^{n} \gamma_i (x_i - \hat{\mu}_2)^2}{\sum_{i=1}^{n} \gamma_i}.$$

Since $\gamma_i$ is an estimate of $P[W_i = 1|\boldsymbol{\theta}_0, \mathbf{x}]$, the average $n^{-1} \sum_{i=1}^{n} \gamma_i$ is an estimate of $\epsilon = P[W_i = 1]$. This average is our estimate of $\epsilon$.

EXERCISES

6.1. Rao (page 368, 1973) considers a problem in the estimation of linkages in genetics. McLachlan and Krishnan (1997) also discuss this problem and we present their model. For our purposes, it can be described as a multinomial model with the four categories $C_1$, $C_2$, $C_3$, and $C_4$. For a sample of size $n$, let $\mathbf{X} = (X_1, X_2, X_3, X_4)$ denote the observed frequencies of the four categories. Hence, $n = \sum_{i=1}^{4} X_i$. The probability model is

$$\begin{array}{cccc}
C_1 & C_2 & C_3 & C_4 \\
\frac{1}{2} + \frac{1}{4}\theta & \frac{1}{4} - \frac{1}{4}\theta & \frac{1}{4} - \frac{1}{4}\theta & \frac{1}{4}\theta
\end{array}$$

where the parameter $\theta$ satisfies $0 \leq \theta \leq 1$. In this exercise, we obtain the mle of $\theta$.
(a) Show that the likelihood function is given by

$$L(\theta|\mathbf{x}) = \frac{n!}{x_1!\, x_2!\, x_3!\, x_4!} \left[\frac{1}{2} + \frac{1}{4}\theta\right]^{x_1} \left[\frac{1}{4} - \frac{1}{4}\theta\right]^{x_2 + x_3} \left[\frac{1}{4}\theta\right]^{x_4}. \tag{6.22}$$

(b) Show that the log of the likelihood function can be expressed as a constant (not involving parameters) plus the term

$$x_1 \log[2 + \theta] + [x_2 + x_3] \log[1 - \theta] + x_4 \log \theta.$$

(c) Obtain the partial derivative with respect to $\theta$ of the last expression, set the result to 0, and solve for the mle. (This will result in a quadratic equation which has one positive and one negative root.)

6.2. In this exercise, we set up an EM algorithm to determine the mle for the situation described in Exercise 6.1. Split category $C_1$ into the two subcategories $C_{11}$ and $C_{12}$ with probabilities $1/2$ and $\theta/4$, respectively. Let $Z_{11}$ and $Z_{12}$ denote the respective "frequencies." Then $X_1 = Z_{11} + Z_{12}$. Of course, we cannot observe $Z_{11}$ and $Z_{12}$. Let $\mathbf{Z} = (Z_{11}, Z_{12})$.
(a) Obtain the complete likelihood $L^c(\theta|\mathbf{x}, \mathbf{z})$.
(b) Using the last result and (6.22), show that the conditional pmf $k(\mathbf{z}|\theta, \mathbf{x})$ is binomial with parameters $x_1$ and probability of success $\theta/(2 + \theta)$.
(c) Obtain the E step of the EM algorithm given an initial estimate $\theta^{(0)}$ of $\theta$. That is, obtain

$$Q(\theta|\theta^{(0)}, \mathbf{x}) = E_{\theta^{(0)}}[\log L^c(\theta|\mathbf{x}, \mathbf{Z})\,|\,\theta^{(0)}, \mathbf{x}].$$

Recall that this expectation is taken using the conditional pmf $k(\mathbf{z}|\theta^{(0)}, \mathbf{x})$. Keep in mind the next step; i.e., we need only terms that involve $\theta$.
(d) For the M step of the EM algorithm, solve the equation $\partial Q(\theta|\theta^{(0)}, \mathbf{x})/\partial \theta = 0$. Show that the solution is

$$\theta^{(1)} = \frac{x_1 \theta^{(0)} + 2x_4 + x_4 \theta^{(0)}}{n\theta^{(0)} + 2(x_2 + x_3 + x_4)}. \tag{6.23}$$

6.3. For the setup of Exercise 6.2, show that the following estimator of $\theta$ is unbiased:

$$\tilde{\theta} = n^{-1}(X_1 - X_2 - X_3 + X_4). \tag{6.24}$$

6.4. Rao (page 368, 1973) presents data for the situation described in Exercise 6.1. The observed frequencies are $\mathbf{x} = (125, 18, 20, 34)$.
(a) Using computational packages (for example, R), with (6.24) as the initial estimate, write a program that obtains the stepwise EM estimates $\theta^{(k)}$.
(b) Using the data from Rao, compute the EM estimate of $\theta$ with your program. List the sequence of EM estimates, $\{\theta^{(k)}\}$, that you obtained. Did your sequence of estimates converge?
(c) Show that the mle using the likelihood approach in Exercise 6.1 is the positive root of the equation $197\theta^2 - 15\theta - 68 = 0$. Compare it with your EM solution. They should be the same within roundoff error.

6.5. Suppose $X_1, X_2, \ldots, X_{n_1}$ is a random sample from a $N(\theta, 1)$ distribution. Besides these $n_1$ observable items, suppose there are $n_2$ missing items, which we denote by $Z_1, Z_2, \ldots, Z_{n_2}$. Show that the first-step EM estimate is

$$\theta^{(1)} = \frac{n_1 \bar{x} + n_2 \theta^{(0)}}{n},$$

where $\theta^{(0)}$ is an initial estimate of $\theta$ and $n = n_1 + n_2$. Note that if $\theta^{(0)} = \bar{x}$, then $\theta^{(k)} = \bar{x}$ for all $k$.

6.6. Consider the situation described in Example 6.1. But suppose we have left censoring. That is, if $Z_1, Z_2, \ldots, Z_{n_2}$ are the censored items, then all we know is that each $Z_j < a$. Obtain the EM algorithm estimate of $\theta$.

6.7. Suppose the following data follow the model of Example 6.1.

    2.01   0.74   0.68   1.50+  1.50+  1.47   1.50+  1.52
    0.07  −0.04  −0.21   0.05   0.67  −0.09   0.14

where the superscript + denotes that the observation was censored at 1.50. Write a computer program to obtain the EM algorithm estimate of $\theta$.

6.8. The following data are observations of the random variable $X = (1 - W)Y_1 + W Y_2$, where $W$ has a Bernoulli distribution with probability of success 0.70; $Y_1$ has a $N(100, 20^2)$ distribution; $Y_2$ has a $N(120, 25^2)$ distribution; $W$ and $Y_1$ are independent; and $W$ and $Y_2$ are independent.

    119.0   96.0  146.2  138.6  143.4   98.2  124.5
    114.1  136.2  136.4  184.8   79.8  151.9  114.2
    145.7   95.9   97.3  136.4  109.2  103.2

Program the EM algorithm for this mixing problem as discussed at the end of the section. Use a dotplot to obtain initial estimates of the parameters. Compute the estimates. How close are they to the true parameters?

Answers to Selected Exercises 1.1 X/3.


1.2

6n (a) −n/ log( i=1 Xi ). (b)Y1 = min{X1 , . . . , Xn }.

1.4

(a) Yn = max{X1 , . . . , Xn }. (b) (2n + 1)/(2n).

(c)

1/2Yn .

1.5 1 − exp{−2/X}. 53 , 1.6 p = 125 5 ' 5 ( x=3 x

px (1 − p)5−x .

1.8 x = 2.109; x2 e−x /2.

1.9 max 12 , X . 2.7 (a)

4 θ2 .

2.8 (a)

1 2θ 2 .

3.15 (a)

'

( 1 nx 3x

4.5 (Y1 + Yn )/2, (Yn − Y1 )/2; no. 4.6

(n−1)/nS

2 3(1−x)

n−nx .

3.16 (a) nx log(2/x) − n(2 − x). nα 3.17 x/α β0 n . × exp − i=1 xi β10 − αx 4.1

4 11 7 25 , 25 , 25 .

4.2 (a) x, y, %n 1 (x − x)2 n+m m i=1 i 2 & + i=1 (yi − y) . (b) nx+my , n+m n 1 2 i=1 (xi − θ1 ) n+m m + i=1 (yi − θ1 )2 . n 4.3 θ1 = min{Xi }, n1 i=1 (Xi − θ1 ). 4.4

(a) X + 1.282 n−1 n S; c−X . (b) Φ √

θ1 = min{X 6 i }, n n . n/ log X / θ i 1 i=1

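4.7 If $y_1/n_1 \leq y_2/n_2$, then $\hat{p}_1 = y_1/n_1$ and $\hat{p}_2 = y_2/n_2$; else, $\hat{p}_1 = \hat{p}_2 = (y_1 + y_2)/(n_1 + n_2)$.

5.1 $t = 3 > 2.262$; reject $H_0$.

5.4 (b) $c\, \dfrac{\sum_{i=1}^{n} X_i^2}{\sum_{i=1}^{m} Y_i^2}$.

5.5 $c\, \dfrac{\bar{X}}{\bar{Y}}$.

5.6 $c\, \dfrac{[\max\{-X_1, X_{n_1}\}]^{n_1}\, [\max\{-Y_1, Y_{n_2}\}]^{n_2}}{[\max\{-X_1, -Y_1, X_{n_1}, Y_{n_2}\}]^{n_1 + n_2}}$, $\chi^2(2)$.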

6.8 The R function mixnormal found in Appendix B produced these results (first row are initial estimates, second row are the estimates after 500 iterations):

    μ1        μ2        σ1       σ2       π
    105.00    130.00    15.00    25.00    0.600
     98.76    133.96     9.88    21.50    0.704

Sufficiency

1 Measures of Quality of Estimators

From Chapter 7 of Introduction to Mathematical Statistics, Seventh Edition. Robert V. Hogg, Joseph W. McKean, Allen T. Craig. Copyright © 2013 by Pearson Education, Inc. All rights reserved.

In this chapter, we present some optimal point estimates and tests for certain situations. We first consider point estimation. In this chapter, we find it convenient to use the letter $f$ to denote a pmf as well as a pdf. It is clear from the context whether we are discussing the distributions of discrete or continuous random variables.

Suppose $f(x; \theta)$ for $\theta \in \Omega$ is the pdf (pmf) of a continuous (discrete) random variable $X$. Consider a point estimator $Y_n = u(X_1, \ldots, X_n)$ based on a sample $X_1, \ldots, X_n$. You should know several properties of point estimators. Recall that $Y_n$ is a consistent estimator of $\theta$ if $Y_n$ converges to $\theta$ in probability; i.e., $Y_n$ is close to $\theta$ for large sample sizes. This is definitely a desirable property of a point estimator. Another property is unbiasedness, which says that $Y_n$ is an unbiased estimator of $\theta$ if $E(Y_n) = \theta$. Recall that maximum likelihood estimators may not be unbiased, although generally they are asymptotically unbiased. If two estimators of $\theta$ are unbiased, it would seem that we would choose the one with the smaller variance. This would be especially true if they were both approximately normal because the one with the smaller asymptotic variance (and hence asymptotic standard error) would tend to produce shorter asymptotic confidence intervals for $\theta$. This leads to the following definition:

Definition 1.1. For a given positive integer $n$, $Y = u(X_1, X_2, \ldots, X_n)$ is called a minimum variance unbiased estimator (MVUE) of the parameter $\theta$ if $Y$ is unbiased, that is, $E(Y) = \theta$, and if the variance of $Y$ is less than or equal to the variance of every other unbiased estimator of $\theta$.

Example 1.1. As an illustration, let $X_1, X_2, \ldots, X_9$ denote a random sample from a distribution that is $N(\theta, \sigma^2)$, where $-\infty < \theta < \infty$. Because the statistic $\bar{X} = (X_1 + X_2 + \cdots + X_9)/9$ is $N(\theta, \sigma^2/9)$, $\bar{X}$ is an unbiased estimator of $\theta$. The statistic $X_1$ is $N(\theta, \sigma^2)$, so $X_1$ is also an unbiased estimator of $\theta$. Although the variance $\sigma^2/9$ of $\bar{X}$ is less than the variance $\sigma^2$ of $X_1$, we cannot say, with $n = 9$, that $\bar{X}$ is the minimum variance unbiased estimator (MVUE) of $\theta$; that definition requires that the comparison be made with every unbiased estimator of $\theta$. To be sure, it is quite impossible to tabulate all other unbiased estimators of this parameter $\theta$, so other methods must be developed for making the comparisons of the variances. A beginning on this problem is made in this chapter.

Let us now discuss the problem of point estimation of a parameter from a slightly different standpoint. Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n$ from a distribution that has the pdf $f(x; \theta)$, $\theta \in \Omega$. The distribution may be of either the continuous or the discrete type. Let $Y = u(X_1, X_2, \ldots, X_n)$ be a statistic on which we wish to base a point estimate of the parameter $\theta$. Let $\delta(y)$ be that function of the observed value of the statistic $Y$ which is the point estimate of $\theta$. Thus the function $\delta$ decides the value of our point estimate of $\theta$ and $\delta$ is called a decision function or a decision rule. One value of the decision function, say $\delta(y)$, is called a decision. Thus a numerically determined point estimate of a parameter $\theta$ is a decision. Now a decision may be correct or it may be wrong. It would be useful to have a measure of the seriousness of the difference, if any, between the true value of $\theta$ and the point estimate $\delta(y)$. Accordingly, with each pair, $[\theta, \delta(y)]$, $\theta \in \Omega$, we associate a nonnegative number $L[\theta, \delta(y)]$ that reflects this seriousness. We call the function $L$ the loss function. The expected (mean) value of the loss function is called the risk function. If $f_Y(y; \theta)$, $\theta \in \Omega$, is the pdf of $Y$, the risk function $R(\theta, \delta)$ is given by

$$R(\theta, \delta) = E\{L[\theta, \delta(y)]\} = \int_{-\infty}^{\infty} L[\theta, \delta(y)]\, f_Y(y; \theta)\, dy$$

if $Y$ is a random variable of the continuous type. It would be desirable to select a decision function that minimizes the risk $R(\theta, \delta)$ for all values of $\theta$, $\theta \in \Omega$. But this is usually impossible because the decision function $\delta$ that minimizes $R(\theta, \delta)$ for one value of $\theta$ may not minimize $R(\theta, \delta)$ for another value of $\theta$. Accordingly, we need either to restrict our decision function to a certain class or to consider methods of ordering the risk functions. The following example, while very simple, dramatizes these difficulties.

Example 1.2. Let $X_1, X_2, \ldots, X_{25}$ be a random sample from a distribution that is $N(\theta, 1)$, for $-\infty < \theta < \infty$. Let $Y = \bar{X}$, the mean of the random sample, and let $L[\theta, \delta(y)] = [\theta - \delta(y)]^2$. We shall compare the two decision functions given by $\delta_1(y) = y$ and $\delta_2(y) = 0$ for $-\infty < y < \infty$. The corresponding risk functions are

$$R(\theta, \delta_1) = E[(\theta - Y)^2] = \frac{1}{25}$$

and

$$R(\theta, \delta_2) = E[(\theta - 0)^2] = \theta^2.$$
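The two risk functions can be compared numerically. The short Python sketch below is only an illustration: the function names, the list of $\theta$ values, the number of replications, and the seed are assumptions, not part of the text. It evaluates both risks exactly and, as a sanity check, estimates $R(\theta, \delta_1)$ by simulating sample means of 25 draws from $N(\theta, 1)$.

```python
import random


def risk_delta1(theta, n=25):
    """R(theta, delta_1): variance of the mean of n draws from N(theta, 1)."""
    return 1.0 / n


def risk_delta2(theta):
    """R(theta, delta_2) for the constant rule delta_2(y) = 0."""
    return theta ** 2


def simulated_risk_delta1(theta, n=25, reps=20000, seed=1):
    """Monte Carlo estimate of E[(theta - Ybar)^2] under N(theta, 1) sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        ybar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
        total += (theta - ybar) ** 2
    return total / reps


for theta in (0.0, 0.1, 0.5, 2.0):
    smaller = "delta_2" if risk_delta2(theta) < risk_delta1(theta) else "delta_1"
    print(theta, risk_delta1(theta), risk_delta2(theta), smaller)
```

Neither rule has the smaller risk at every $\theta$ in this list, which is the difficulty the example is meant to dramatize.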

If, in fact, $\theta = 0$, then $\delta_2(y) = 0$ is an excellent decision and we have $R(0, \delta_2) = 0$. However, if $\theta$ differs from zero by very much, it is equally clear that $\delta_2 = 0$ is a poor decision. For example, if, in fact, $\theta = 2$, $R(2, \delta_2) = 4 > R(2, \delta_1) = \frac{1}{25}$. In general, we see that $R(\theta, \delta_2) < R(\theta, \delta_1)$, provided that $-\frac{1}{5} < \theta < \frac{1}{5}$, and that otherwise $R(\theta, \delta_2) \geq R(\theta, \delta_1)$. That is, one of these decision functions is better than the other for some values of $\theta$, while the other decision function is better for other values of $\theta$. If, however, we had restricted our consideration to decision functions $\delta$ such that $E[\delta(Y)] = \theta$ for all values of $\theta$, $\theta \in \Omega$, then the decision function $\delta_2(y) = 0$ is not allowed. Under this restriction and with the given $L[\theta, \delta(y)]$, the risk function is the variance of the unbiased estimator $\delta(Y)$, and we are confronted with the problem of finding the MVUE. Later in this chapter we show that the solution is $\delta(y) = y = \bar{x}$.

Suppose, however, that we do not want to restrict ourselves to decision functions $\delta$, such that $E[\delta(Y)] = \theta$ for all values of $\theta$, $\theta \in \Omega$. Instead, let us say that the decision function that minimizes the maximum of the risk function is the best decision function. Because, in this example, $R(\theta, \delta_2) = \theta^2$ is unbounded, $\delta_2(y) = 0$ is not, in accordance with this criterion, a good decision function. On the other hand, with $-\infty < \theta < \infty$, we have

$$\max_{\theta} R(\theta, \delta_1) = \max_{\theta} \left(\tfrac{1}{25}\right) = \tfrac{1}{25}.$$

Accordingly, $\delta_1(y) = y = \bar{x}$ seems to be a very good decision in accordance with this criterion because $\frac{1}{25}$ is small. As a matter of fact, it can be proved that $\delta_1$ is the best decision function, as measured by the minimax criterion, when the loss function is $L[\theta, \delta(y)] = [\theta - \delta(y)]^2$.

In this example we illustrated the following:

1. Without some restriction on the decision function, it is difficult to find a decision function that has a risk function which is uniformly less than the risk function of another decision function.

2. One principle of selecting a best decision function is called the minimax principle. This principle may be stated as follows: If the decision function given by $\delta_0(y)$ is such that, for all $\theta \in \Omega$,

$$\max_{\theta} R[\theta, \delta_0(y)] \leq \max_{\theta} R[\theta, \delta(y)]$$

for every other decision function $\delta(y)$, then $\delta_0(y)$ is called a minimax decision function.

With the restriction $E[\delta(Y)] = \theta$ and the loss function $L[\theta, \delta(y)] = [\theta - \delta(y)]^2$, the decision function that minimizes the risk function yields an unbiased estimator with minimum variance. If, however, the restriction $E[\delta(Y)] = \theta$ is replaced by some other condition, the decision function $\delta(Y)$, if it exists, which minimizes $E\{[\theta - \delta(Y)]^2\}$ uniformly in $\theta$ is sometimes called the minimum mean-squared-error estimator. Exercises 1.6–1.8 provide examples of this type of estimator.

There are two additional observations about decision rules and loss functions that should be made at this point. First, since $Y$ is a statistic, the decision rule $\delta(Y)$ is also a statistic, and we could have started directly with a decision rule based on the observations in a random sample, say, $\delta_1(X_1, X_2, \ldots, X_n)$. The risk function is then given by

$$\begin{aligned}
R(\theta, \delta_1) &= E\{L[\theta, \delta_1(X_1, \ldots, X_n)]\} \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} L[\theta, \delta_1(x_1, \ldots, x_n)]\, f(x_1; \theta) \cdots f(x_n; \theta)\, dx_1 \cdots dx_n
\end{aligned}$$

if the random sample arises from a continuous-type distribution. We do not do this, because, as we show in this chapter, it is rather easy to find a good statistic, say $Y$, upon which to base all of the statistical inferences associated with a particular model. Thus we thought it more appropriate to start with a statistic that would be familiar, like the mle $Y = \bar{X}$ in Example 1.2. The second decision rule of that example could be written $\delta_2(X_1, X_2, \ldots, X_n) = 0$, a constant no matter what values of $X_1, X_2, \ldots, X_n$ are observed.

The second observation is that we have only used one loss function, namely, the squared-error loss function $L(\theta, \delta) = (\theta - \delta)^2$. The absolute-error loss function $L(\theta, \delta) = |\theta - \delta|$ is another popular one. The loss function defined by

$$L(\theta, \delta) = \begin{cases} 0, & |\theta - \delta| \leq a, \\ b, & |\theta - \delta| > a, \end{cases}$$

where $a$ and $b$ are positive constants, is sometimes referred to as the goalpost loss function. The reason for this terminology is that football fans recognize that it is similar to kicking a field goal: There is no loss (actually a three-point gain) if within $a$ units of the middle but $b$ units of loss (zero points awarded) if outside that restriction. In addition, loss functions can be asymmetric as well as symmetric, as the three previous ones have been. That is, for example, it might be more costly to underestimate the value of $\theta$ than to overestimate it. (Many of us think about this type of loss function when estimating the time it takes us to reach an airport to catch a plane.) Some of these loss functions are considered when studying Bayesian estimates.

Let us close this section with an interesting illustration that raises a question leading to the likelihood principle, which many statisticians believe is a quality characteristic that estimators should enjoy. Suppose that two statisticians, A and B, observe 10 independent trials of a random experiment ending in success or failure. Let the probability of success on each trial be $\theta$, where $0 < \theta < 1$. Let us say that each statistician observes one success in these 10 trials. Suppose, however, that A had decided to take $n = 10$ such observations in advance and found only one success, while B had decided to take as many observations as needed to get the first success, which happened on the 10th trial. The model of A is that $Y$ is $b(n = 10, \theta)$ and $y = 1$ is observed. On the other hand, B is considering the random variable $Z$ that has a geometric pmf $g(z) = (1 - \theta)^{z-1}\theta$, $z = 1, 2, 3, \ldots$, and $z = 10$ is observed. In either case, an estimate of $\theta$ could be the relative frequency of success given by

$$\frac{y}{n} = \frac{1}{z} = \frac{1}{10}.$$

Let us observe, however, that one of the corresponding estimators, $Y/n$ and $1/Z$, is biased. We have

$$E\left(\frac{Y}{10}\right) = \frac{1}{10} E(Y) = \frac{1}{10}(10\theta) = \theta,$$

while

$$\begin{aligned}
E\left(\frac{1}{Z}\right) &= \sum_{z=1}^{\infty} \frac{1}{z} (1 - \theta)^{z-1} \theta \\
&= \theta + \tfrac{1}{2}(1 - \theta)\theta + \tfrac{1}{3}(1 - \theta)^2 \theta + \cdots > \theta.
\end{aligned}$$

That is, $1/Z$ is a biased estimator while $Y/10$ is unbiased. Thus A is using an unbiased estimator while B is not. Should we adjust B's estimator so that it, too, is unbiased?

It is interesting to note that if we maximize the two respective likelihood functions, namely,

$$L_1(\theta) = \binom{10}{y} \theta^y (1 - \theta)^{10-y}$$

and

$$L_2(\theta) = (1 - \theta)^{z-1} \theta,$$

with $n = 10$, $y = 1$, and $z = 10$, we get exactly the same answer, $\hat{\theta} = \frac{1}{10}$. This must be the case, because in each situation we are maximizing $(1 - \theta)^9 \theta$. Many statisticians believe that this is the way it should be and accordingly adopt the likelihood principle: Suppose two different sets of data from possibly two different random experiments lead to respective likelihood functions, $L_1(\theta)$ and $L_2(\theta)$, that are proportional to each other. These two data sets provide the same information about the parameter $\theta$ and a statistician should obtain the same estimate of $\theta$ from either.

In our special illustration, we note that $L_1(\theta) \propto L_2(\theta)$, and the likelihood principle states that statisticians A and B should make the same inference. Thus believers in the likelihood principle would not adjust the second estimator to make it unbiased.
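Both claims in this illustration are easy to check numerically. The Python sketch below is an illustration under stated assumptions: the function names, the grid of $\theta$ values, and the truncation point of the series are choices made here, not part of the text. It sums the series for $E(1/Z)$, which can be compared with the closed form $-\theta \log \theta/(1 - \theta)$ obtained from $\sum_{z \geq 1} q^z/z = -\log(1 - q)$, and it maximizes both likelihoods over a grid.

```python
import math


def expected_inverse_geometric(theta, terms=5000):
    """Partial sum of E(1/Z) = sum over z >= 1 of (1/z) (1 - theta)^(z-1) theta."""
    return sum((1.0 - theta) ** (z - 1) * theta / z for z in range(1, terms + 1))


def binomial_likelihood(theta, n=10, y=1):
    """Statistician A: y successes in n fixed trials."""
    return math.comb(n, y) * theta ** y * (1.0 - theta) ** (n - y)


def geometric_likelihood(theta, z=10):
    """Statistician B: first success on trial z."""
    return (1.0 - theta) ** (z - 1) * theta


grid = [i / 1000 for i in range(1, 1000)]          # theta = 0.001, ..., 0.999
mle_a = max(grid, key=binomial_likelihood)
mle_b = max(grid, key=geometric_likelihood)
print("grid maximizers:", mle_a, mle_b)
print("E(1/Z) at theta = 0.1:", expected_inverse_geometric(0.1))
```

The ratio `binomial_likelihood(t) / geometric_likelihood(t)` equals the constant $\binom{10}{1} = 10$ for every `t` in the grid, which is the proportionality $L_1(\theta) \propto L_2(\theta)$ that the likelihood principle refers to.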

EXERCISES

1.1. Show that the mean $\bar{X}$ of a random sample of size $n$ from a distribution having pdf $f(x; \theta) = (1/\theta)e^{-(x/\theta)}$, $0 < x < \infty$, $0 < \theta < \infty$, zero elsewhere, is an unbiased estimator of $\theta$ and has variance $\theta^2/n$.

1.2. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a normal distribution with mean zero and variance $\theta$, $0 < \theta < \infty$. Show that $\sum_{1}^{n} X_i^2/n$ is an unbiased estimator of $\theta$ and has variance $2\theta^2/n$.

1.3. Let $Y_1 < Y_2 < Y_3$ be the order statistics of a random sample of size 3 from the uniform distribution having pdf $f(x; \theta) = 1/\theta$, $0 < x < \theta$, $0 < \theta < \infty$, zero elsewhere. Show that $4Y_1$, $2Y_2$, and $\frac{4}{3}Y_3$ are all unbiased estimators of $\theta$. Find the variance of each of these unbiased estimators.

1.4. Let $Y_1$ and $Y_2$ be two independent unbiased estimators of $\theta$. Assume that the variance of $Y_1$ is twice the variance of $Y_2$. Find the constants $k_1$ and $k_2$ so that $k_1 Y_1 + k_2 Y_2$ is an unbiased estimator with the smallest possible variance for such a linear combination.

1.5. In Example 1.2 of this section, take $L[\theta, \delta(y)] = |\theta - \delta(y)|$. Show that $R(\theta, \delta_1) = \frac{1}{5}\sqrt{2/\pi}$ and $R(\theta, \delta_2) = |\theta|$. Of these two decision functions $\delta_1$ and $\delta_2$, which yields the smaller maximum risk?

1.6. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a Poisson distribution with parameter $\theta$, $0 < \theta < \infty$. Let $Y = \sum_{1}^{n} X_i$ and let $L[\theta, \delta(y)] = [\theta - \delta(y)]^2$. If we restrict our considerations to decision functions of the form $\delta(y) = b + y/n$, where $b$ does not depend on $y$, show that $R(\theta, \delta) = b^2 + \theta/n$. What decision function of this form yields a uniformly smaller risk than every other decision function of this form? With this solution, say $\delta$, and $0 < \theta < \infty$, determine $\max_{\theta} R(\theta, \delta)$ if it exists.

1.7. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that is $N(\mu, \theta)$, $0 < \theta < \infty$, where $\mu$ is unknown. Let $Y = \sum_{1}^{n} (X_i - \bar{X})^2/n$ and let $L[\theta, \delta(y)] = [\theta - \delta(y)]^2$. If we consider decision functions of the form $\delta(y) = by$, where $b$ does not depend upon $y$, show that $R(\theta, \delta) = (\theta^2/n^2)[(n^2 - 1)b^2 - 2n(n - 1)b + n^2]$. Show that $b = n/(n + 1)$ yields a minimum risk decision function of this form. Note that $nY/(n + 1)$ is not an unbiased estimator of $\theta$. With $\delta(y) = ny/(n + 1)$ and $0 < \theta < \infty$, determine $\max_{\theta} R(\theta, \delta)$ if it exists.

1.8. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that is $b(1, \theta)$, $0 \leq \theta \leq 1$. Let $Y = \sum_{1}^{n} X_i$ and let $L[\theta, \delta(y)] = [\theta - \delta(y)]^2$. Consider decision functions of the form $\delta(y) = by$, where $b$ does not depend upon $y$. Prove that $R(\theta, \delta) = b^2 n\theta(1 - \theta) + (bn - 1)^2 \theta^2$. Show that

$$\max_{\theta} R(\theta, \delta) = \frac{b^4 n^2}{4[b^2 n - (bn - 1)^2]},$$

provided that the value $b$ is such that $b^2 n > (bn - 1)^2$. Prove that $b = 1/n$ does not minimize $\max_{\theta} R(\theta, \delta)$.

1.9. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with mean $\theta > 0$.
(a) Statistician A observes the sample to be the values $x_1, x_2, \ldots, x_n$ with sum $y = \sum x_i$. Find the mle of $\theta$.
(b) Statistician B loses the sample values $x_1, x_2, \ldots, x_n$ but remembers the sum $y_1$ and the fact that the sample arose from a Poisson distribution. Thus B decides to create some fake observations, which he calls $z_1, z_2, \ldots, z_n$ (as he knows they will probably not equal the original $x$-values) as follows. He notes that the conditional probability of independent Poisson random variables $Z_1, Z_2, \ldots, Z_n$ being equal to $z_1, z_2, \ldots, z_n$, given $\sum z_i = y_1$, is

$$\frac{\dfrac{\theta^{z_1} e^{-\theta}}{z_1!}\, \dfrac{\theta^{z_2} e^{-\theta}}{z_2!} \cdots \dfrac{\theta^{z_n} e^{-\theta}}{z_n!}}{\dfrac{(n\theta)^{y_1} e^{-n\theta}}{y_1!}} = \frac{y_1!}{z_1!\, z_2! \cdots z_n!} \left(\frac{1}{n}\right)^{z_1} \left(\frac{1}{n}\right)^{z_2} \cdots \left(\frac{1}{n}\right)^{z_n}$$

since $Y_1 = \sum Z_i$ has a Poisson distribution with mean $n\theta$. The latter distribution is multinomial with $y_1$ independent trials, each terminating in one of $n$ mutually exclusive and exhaustive ways, each of which has the same probability $1/n$. Accordingly, B runs such a multinomial experiment with $y_1$ independent trials and obtains $z_1, z_2, \ldots, z_n$. Find the likelihood function using these $z$-values. Is it proportional to that of statistician A? Hint: Here the likelihood function is the product of this conditional pdf and the pdf of $Y_1 = \sum Z_i$.

2 A Sufficient Statistic for a Parameter

Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from a distribution that has pdf $f(x;\theta)$, $\theta \in \Omega$. We can construct statistics to make statistical inferences as illustrated by point and interval estimation and tests of statistical hypotheses. We note that a statistic, for example, $Y = u(X_1, X_2, \ldots, X_n)$, is a form of data reduction. To illustrate, instead of listing all of the individual observations $X_1, X_2, \ldots, X_n$, we might prefer to give only the sample mean $\overline{X}$ or the sample variance $S^2$. Thus statisticians look for ways of reducing a set of data so that these data can be more easily understood without losing the meaning associated with the entire set of observations.

It is interesting to note that a statistic $Y = u(X_1, X_2, \ldots, X_n)$ really partitions the sample space of $X_1, X_2, \ldots, X_n$. For illustration, suppose we say that the sample was observed and $\overline{x} = 8.32$. There are many points in the sample space which have that same mean of 8.32, and we can consider them as belonging to the set $\{(x_1, x_2, \ldots, x_n) : \overline{x} = 8.32\}$. As a matter of fact, all points on the hyperplane $x_1 + x_2 + \cdots + x_n = (8.32)n$ yield the mean of $\overline{x} = 8.32$, so this hyperplane is the set. However, there are many values that $\overline{X}$ can take, and thus there are many such sets. So, in this sense, the sample mean $\overline{X}$, or any statistic $Y = u(X_1, X_2, \ldots, X_n)$, partitions the sample space into a collection of sets.

Often in the study of statistics the parameter $\theta$ of the model is unknown; thus, we need to make some statistical inference about it. In this section we consider a statistic denoted by $Y_1 = u_1(X_1, X_2, \ldots, X_n)$, which we call a sufficient statistic and which we find is good for making those inferences. This sufficient statistic partitions the sample space in such a way that, given
$$(X_1, X_2, \ldots, X_n) \in \{(x_1, x_2, \ldots, x_n) : u_1(x_1, x_2, \ldots, x_n) = y_1\},$$


the conditional probability of $X_1, X_2, \ldots, X_n$ does not depend upon $\theta$. Intuitively, this means that once the set determined by $Y_1 = y_1$ is fixed, the distribution of another statistic, say $Y_2 = u_2(X_1, X_2, \ldots, X_n)$, does not depend upon the parameter $\theta$ because the conditional distribution of $X_1, X_2, \ldots, X_n$ does not depend upon $\theta$. Hence it is impossible to use $Y_2$, given $Y_1 = y_1$, to make a statistical inference about $\theta$. So, in a sense, $Y_1$ exhausts all the information about $\theta$ that is contained in the sample. This is why we call $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ a sufficient statistic.

To understand clearly the definition of a sufficient statistic for a parameter $\theta$, we start with an illustration.

Example 2.1. Let $X_1, X_2, \ldots, X_n$ denote a random sample from the distribution that has pmf
$$f(x;\theta) = \begin{cases} \theta^x(1-\theta)^{1-x} & x = 0, 1;\ 0 < \theta < 1 \\ 0 & \text{elsewhere.} \end{cases}$$
The statistic $Y_1 = X_1 + X_2 + \cdots + X_n$ has the pmf
$$f_{Y_1}(y_1;\theta) = \begin{cases} \binom{n}{y_1}\theta^{y_1}(1-\theta)^{n-y_1} & y_1 = 0, 1, \ldots, n \\ 0 & \text{elsewhere.} \end{cases}$$
What is the conditional probability
$$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid Y_1 = y_1) = P(A \mid B),$$
say, where $y_1 = 0, 1, 2, \ldots, n$? Unless the sum of the integers $x_1, x_2, \ldots, x_n$ (each of which equals zero or 1) is equal to $y_1$, the conditional probability obviously equals zero because $A \cap B = \emptyset$. But in the case $y_1 = \sum x_i$, we have that $A \subset B$, so that $A \cap B = A$ and $P(A \mid B) = P(A)/P(B)$; thus, the conditional probability equals
$$\frac{\theta^{x_1}(1-\theta)^{1-x_1}\,\theta^{x_2}(1-\theta)^{1-x_2}\cdots\theta^{x_n}(1-\theta)^{1-x_n}}{\binom{n}{y_1}\theta^{y_1}(1-\theta)^{n-y_1}} = \frac{\theta^{\sum x_i}(1-\theta)^{n-\sum x_i}}{\binom{n}{\sum x_i}\theta^{\sum x_i}(1-\theta)^{n-\sum x_i}} = \frac{1}{\binom{n}{\sum x_i}}.$$
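The cancellation of $\theta$ in this ratio can be checked by brute force. The following is a minimal sketch, not part of the original text: the helper name `conditional_prob` and the particular sample `xs` are illustrative assumptions. It computes the pmf of the sum by enumerating every 0/1 arrangement and compares the conditional probability with $1/\binom{n}{\sum x_i}$ for several values of $\theta$.

```python
from itertools import product
from math import comb

def conditional_prob(xs, theta):
    """P(X_1=x_1,...,X_n=x_n | Y_1 = sum(xs)) for iid Bernoulli(theta) trials."""
    n, y1 = len(xs), sum(xs)
    # Joint pmf of the particular arrangement xs.
    joint = theta ** y1 * (1 - theta) ** (n - y1)
    # pmf of Y_1 at y1, summed by brute force over all 0/1 arrangements.
    marginal = sum(
        theta ** sum(zs) * (1 - theta) ** (n - sum(zs))
        for zs in product((0, 1), repeat=n)
        if sum(zs) == y1
    )
    return joint / marginal

xs = (1, 0, 1, 1, 0)  # hypothetical observed sample with three ones
for theta in (0.2, 0.5, 0.9):
    # The second and third columns should agree for every theta.
    print(theta, conditional_prob(xs, theta), 1 / comb(len(xs), sum(xs)))
```

Whatever value of $\theta$ is supplied, the conditional probability should agree with the reciprocal of the binomial coefficient up to rounding error.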

Since $y_1 = x_1 + x_2 + \cdots + x_n$ equals the number of ones in the $n$ independent trials, this is the conditional probability of selecting a particular arrangement of $y_1$ ones and $(n - y_1)$ zeros. Note that this conditional probability does not depend upon the value of the parameter $\theta$.

In general, let $f_{Y_1}(y_1;\theta)$ be the pmf of the statistic $Y_1 = u_1(X_1, X_2, \ldots, X_n)$, where $X_1, X_2, \ldots, X_n$ is a random sample arising from a distribution of the discrete type having pmf $f(x;\theta)$, $\theta \in \Omega$. The conditional probability of $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, given $Y_1 = y_1$, equals
$$\frac{f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta)}{f_{Y_1}[u_1(x_1, x_2, \ldots, x_n);\theta]},$$


provided that $x_1, x_2, \ldots, x_n$ are such that the fixed $y_1 = u_1(x_1, x_2, \ldots, x_n)$, and equals zero otherwise. We say that $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ is a sufficient statistic for $\theta$ if and only if this ratio does not depend upon $\theta$. While, with distributions of the continuous type, we cannot use the same argument, we do, in this case, accept the fact that if this ratio does not depend upon $\theta$, then the conditional distribution of $X_1, X_2, \ldots, X_n$, given $Y_1 = y_1$, does not depend upon $\theta$. Thus, in both cases, we use the same definition of a sufficient statistic for $\theta$.

Definition 2.1. Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n$ from a distribution that has pdf or pmf $f(x;\theta)$, $\theta \in \Omega$. Let $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ be a statistic whose pdf or pmf is $f_{Y_1}(y_1;\theta)$. Then $Y_1$ is a sufficient statistic for $\theta$ if and only if
$$\frac{f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta)}{f_{Y_1}[u_1(x_1, x_2, \ldots, x_n);\theta]} = H(x_1, x_2, \ldots, x_n),$$
where $H(x_1, x_2, \ldots, x_n)$ does not depend upon $\theta \in \Omega$.

Remark 2.1. In most cases, $X_1, X_2, \ldots, X_n$ represent the observations of a random sample; that is, they are iid. It is not necessary, however, in more general situations, that these random variables be independent; as a matter of fact, they do not need to be identically distributed. Thus, more generally, the definition of sufficiency of a statistic $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ would be extended to read that
$$\frac{f(x_1, x_2, \ldots, x_n;\theta)}{f_{Y_1}[u_1(x_1, x_2, \ldots, x_n);\theta]} = H(x_1, x_2, \ldots, x_n)$$
does not depend upon $\theta \in \Omega$, where $f(x_1, x_2, \ldots, x_n;\theta)$ is the joint pdf or pmf of $X_1, X_2, \ldots, X_n$. There are even a few situations in which we need an extension.

We now give two examples that are illustrative of the definition.

Example 2.2. Let $X_1, X_2, \ldots, X_n$ be a random sample from a gamma distribution with $\alpha = 2$ and $\beta = \theta > 0$. Because the mgf associated with this distribution is given by $M(t) = (1 - \theta t)^{-2}$, $t < 1/\theta$, the mgf of $Y_1 = \sum_{i=1}^{n} X_i$ is
$$E[e^{t(X_1 + X_2 + \cdots + X_n)}] = E(e^{tX_1})E(e^{tX_2})\cdots E(e^{tX_n}) = [(1 - \theta t)^{-2}]^n = (1 - \theta t)^{-2n}.$$
Thus $Y_1$ has a gamma distribution with $\alpha = 2n$ and $\beta = \theta$, so that its pdf is
$$f_{Y_1}(y_1;\theta) = \begin{cases} \dfrac{1}{\Gamma(2n)\theta^{2n}}\, y_1^{2n-1}e^{-y_1/\theta} & 0 < y_1 < \infty \\ 0 & \text{elsewhere.} \end{cases}$$
Thus we have
$$\frac{\left[\dfrac{x_1^{2-1}e^{-x_1/\theta}}{\Gamma(2)\theta^2}\right]\left[\dfrac{x_2^{2-1}e^{-x_2/\theta}}{\Gamma(2)\theta^2}\right]\cdots\left[\dfrac{x_n^{2-1}e^{-x_n/\theta}}{\Gamma(2)\theta^2}\right]}{\dfrac{(x_1 + x_2 + \cdots + x_n)^{2n-1}e^{-(x_1 + x_2 + \cdots + x_n)/\theta}}{\Gamma(2n)\theta^{2n}}} = \frac{\Gamma(2n)}{[\Gamma(2)]^n}\,\frac{x_1x_2\cdots x_n}{(x_1 + x_2 + \cdots + x_n)^{2n-1}},$$


where $0 < x_i < \infty$, $i = 1, 2, \ldots, n$. Since this ratio does not depend upon $\theta$, the sum $Y_1$ is a sufficient statistic for $\theta$.

Example 2.3. Let $Y_1 < Y_2 < \cdots < Y_n$ denote the order statistics of a random sample of size $n$ from the distribution with pdf
$$f(x;\theta) = e^{-(x-\theta)}I_{(\theta,\infty)}(x).$$
Here we use the indicator function of a set $A$ defined by
$$I_A(x) = \begin{cases} 1 & x \in A \\ 0 & x \notin A. \end{cases}$$
This means, of course, that $f(x;\theta) = e^{-(x-\theta)}$, $\theta < x < \infty$, zero elsewhere. The pdf of $Y_1 = \min(X_i)$ is
$$f_{Y_1}(y_1;\theta) = ne^{-n(y_1-\theta)}I_{(\theta,\infty)}(y_1).$$
Note that $\theta < \min\{x_i\}$ if and only if $\theta < x_i$, for all $i = 1, \ldots, n$. Notationally this can be expressed as $I_{(\theta,\infty)}(\min x_i) = \prod_{i=1}^{n} I_{(\theta,\infty)}(x_i)$. Thus we have that
$$\frac{\prod_{i=1}^{n} e^{-(x_i-\theta)}I_{(\theta,\infty)}(x_i)}{ne^{-n(\min x_i-\theta)}I_{(\theta,\infty)}(\min x_i)} = \frac{e^{-x_1-x_2-\cdots-x_n}}{ne^{-n\min x_i}}.$$

Since this ratio does not depend upon $\theta$, the first order statistic $Y_1$ is a sufficient statistic for $\theta$.

If we are to show by means of the definition that a certain statistic $Y_1$ is or is not a sufficient statistic for a parameter $\theta$, we must first of all know the pdf of $Y_1$, say $f_{Y_1}(y_1;\theta)$. In many instances it may be quite difficult to find this pdf. Fortunately, this problem can be avoided if we prove the following factorization theorem of Neyman.

Theorem 2.1 (Neyman). Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that has pdf or pmf $f(x;\theta)$, $\theta \in \Omega$. The statistic $Y_1 = u_1(X_1, \ldots, X_n)$ is a sufficient statistic for $\theta$ if and only if we can find two nonnegative functions, $k_1$ and $k_2$, such that
$$f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta) = k_1[u_1(x_1, x_2, \ldots, x_n);\theta]\,k_2(x_1, x_2, \ldots, x_n), \qquad (2.1)$$
where $k_2(x_1, x_2, \ldots, x_n)$ does not depend upon $\theta$.

Proof. We shall prove the theorem when the random variables are of the continuous type. Assume that the factorization is as stated in the theorem. In our proof we shall make the one-to-one transformation $y_1 = u_1(x_1, x_2, \ldots, x_n)$, $y_2 = u_2(x_1, x_2, \ldots, x_n)$, $\ldots$, $y_n = u_n(x_1, x_2, \ldots, x_n)$ having the inverse functions $x_1 = w_1(y_1, y_2, \ldots, y_n)$, $x_2 = w_2(y_1, y_2, \ldots, y_n)$, $\ldots$, $x_n = w_n(y_1, y_2, \ldots, y_n)$ and Jacobian $J$; see the note after the proof. The pdf of the statistic $Y_1, Y_2, \ldots, Y_n$ is then given by
$$g(y_1, y_2, \ldots, y_n;\theta) = k_1(y_1;\theta)\,k_2(w_1, w_2, \ldots, w_n)\,|J|,$$


where $w_i = w_i(y_1, y_2, \ldots, y_n)$, $i = 1, 2, \ldots, n$. The pdf of $Y_1$, say $f_{Y_1}(y_1;\theta)$, is given by
$$f_{Y_1}(y_1;\theta) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} g(y_1, y_2, \ldots, y_n;\theta)\,dy_2\cdots dy_n = k_1(y_1;\theta)\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty} |J|\,k_2(w_1, w_2, \ldots, w_n)\,dy_2\cdots dy_n.$$

Now the function $k_2$ does not depend upon $\theta$, nor is $\theta$ involved in either the Jacobian $J$ or the limits of integration. Hence the $(n - 1)$-fold integral in the right-hand member of the preceding equation is a function of $y_1$ alone, for example, $m(y_1)$. Thus
$$f_{Y_1}(y_1;\theta) = k_1(y_1;\theta)\,m(y_1).$$
If $m(y_1) = 0$, then $f_{Y_1}(y_1;\theta) = 0$. If $m(y_1) > 0$, we can write
$$k_1[u_1(x_1, x_2, \ldots, x_n);\theta] = \frac{f_{Y_1}[u_1(x_1, \ldots, x_n);\theta]}{m[u_1(x_1, \ldots, x_n)]},$$
and the assumed factorization becomes
$$f(x_1;\theta)\cdots f(x_n;\theta) = f_{Y_1}[u_1(x_1, \ldots, x_n);\theta]\,\frac{k_2(x_1, \ldots, x_n)}{m[u_1(x_1, \ldots, x_n)]}.$$

Since neither the function $k_2$ nor the function $m$ depends upon $\theta$, then in accordance with the definition, $Y_1$ is a sufficient statistic for the parameter $\theta$.

Conversely, if $Y_1$ is a sufficient statistic for $\theta$, the factorization can be realized by taking the function $k_1$ to be the pdf of $Y_1$, namely, the function $f_{Y_1}$. This completes the proof of the theorem.

Note that the assumption of a one-to-one transformation made in the proof is not needed; see Lehmann (1986) for a more rigorous proof. This theorem characterizes sufficiency and, as the following examples show, is usually much easier to work with than the definition of sufficiency.

Example 2.4. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that is $N(\theta, \sigma^2)$, $-\infty < \theta < \infty$, where the variance $\sigma^2 > 0$ is known. If $\overline{x} = \sum_{1}^{n} x_i/n$, then
$$\sum_{i=1}^{n}(x_i - \theta)^2 = \sum_{i=1}^{n}[(x_i - \overline{x}) + (\overline{x} - \theta)]^2 = \sum_{i=1}^{n}(x_i - \overline{x})^2 + n(\overline{x} - \theta)^2$$
because
$$2\sum_{i=1}^{n}(x_i - \overline{x})(\overline{x} - \theta) = 2(\overline{x} - \theta)\sum_{i=1}^{n}(x_i - \overline{x}) = 0.$$


Thus the joint pdf of $X_1, X_2, \ldots, X_n$ may be written
$$\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left[-\sum_{i=1}^{n}(x_i - \theta)^2/2\sigma^2\right] = \left\{\exp[-n(\overline{x} - \theta)^2/2\sigma^2]\right\}\left\{\frac{\exp\left[-\sum_{i=1}^{n}(x_i - \overline{x})^2/2\sigma^2\right]}{(\sigma\sqrt{2\pi})^n}\right\}.$$
Because the first factor of the right-hand member of this equation depends upon $x_1, x_2, \ldots, x_n$ only through $\overline{x}$, and the second factor does not depend upon $\theta$, the factorization theorem implies that the mean $\overline{X}$ of the sample is, for any particular value of $\sigma^2$, a sufficient statistic for $\theta$, the mean of the normal distribution.

We could have used the definition in the preceding example because we know that $\overline{X}$ is $N(\theta, \sigma^2/n)$. Let us now consider an example in which the use of the definition is inappropriate.

Example 2.5. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution with pdf
$$f(x;\theta) = \begin{cases} \theta x^{\theta-1} & 0 < x < 1 \\ 0 & \text{elsewhere,} \end{cases}$$
where $0 < \theta$. By the factorization theorem, $u_1(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} X_i$ is a sufficient statistic for $\theta$. The joint pdf of $X_1, X_2, \ldots, X_n$ is
$$\theta^n\left(\prod_{i=1}^{n} x_i\right)^{\theta-1} = \left[\theta^n\left(\prod_{i=1}^{n} x_i\right)^{\theta}\right]\left(\frac{1}{\prod_{i=1}^{n} x_i}\right),$$
where $0 < x_i < 1$, $i = 1, 2, \ldots, n$. In the factorization theorem, let
$$k_1[u_1(x_1, x_2, \ldots, x_n);\theta] = \theta^n\left(\prod_{i=1}^{n} x_i\right)^{\theta} \quad\text{and}\quad k_2(x_1, x_2, \ldots, x_n) = \frac{1}{\prod_{i=1}^{n} x_i}.$$
Since $k_2(x_1, x_2, \ldots, x_n)$ does not depend upon $\theta$, the product $\prod_{i=1}^{n} X_i$ is a sufficient statistic for $\theta$.
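One consequence of the factorization in Example 2.5 is that two samples with the same product have joint pdfs whose ratio is free of $\theta$, while samples with different products generally do not. The sketch below illustrates this numerically; the function names `joint_pdf` and `pdf_ratios` and the sample values are assumptions made for illustration and do not come from the text.

```python
from math import prod

def joint_pdf(xs, theta):
    """Joint pdf theta^n * (prod x_i)^(theta - 1), valid for 0 < x_i < 1."""
    return theta ** len(xs) * prod(xs) ** (theta - 1)

def pdf_ratios(xs, zs, thetas):
    """Ratio of the joint pdfs at xs and zs for each theta in thetas."""
    return [joint_pdf(xs, t) / joint_pdf(zs, t) for t in thetas]

thetas = (0.5, 1.0, 2.0, 5.0)
# Both samples have product 0.09, so the ratio should not move with theta.
same_product = pdf_ratios((0.2, 0.5, 0.9), (0.3, 0.5, 0.6), thetas)
# Products 0.09 and 0.045 differ, so the ratio behaves like 2^(theta - 1).
diff_product = pdf_ratios((0.2, 0.5, 0.9), (0.1, 0.5, 0.9), thetas)
print(same_product)
print(diff_product)
```

The first list should be constant up to rounding; the second should change with $\theta$, which is what one expects when the two samples fall in different sets of the partition induced by the product.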

There is a tendency for some readers to apply incorrectly the factorization theorem in those instances in which the domain of positive probability density depends upon the parameter θ. This is due to the fact that they do not give proper consideration to the domain of the function k2 (x1 , x2 , . . . , xn ). This is illustrated in the next example.


Example 2.6. In Example 2.3 with $f(x;\theta) = e^{-(x-\theta)}I_{(\theta,\infty)}(x)$, it was found that the first order statistic $Y_1$ is a sufficient statistic for $\theta$. To illustrate our point about not considering the domain of the function, take $n = 3$ and note that
$$e^{-(x_1-\theta)}e^{-(x_2-\theta)}e^{-(x_3-\theta)} = [e^{-3\max x_i + 3\theta}][e^{-x_1-x_2-x_3+3\max x_i}]$$
or a similar expression. Certainly, in the latter formula, there is no $\theta$ in the second factor and it might be assumed that $Y_3 = \max X_i$ is a sufficient statistic for $\theta$. Of course, this is incorrect because we should have written the joint pdf of $X_1, X_2, X_3$ as
$$\prod_{i=1}^{3}[e^{-(x_i-\theta)}I_{(\theta,\infty)}(x_i)] = [e^{3\theta}I_{(\theta,\infty)}(\min x_i)]\left[\exp\left\{-\sum_{i=1}^{3} x_i\right\}\right]$$
because $I_{(\theta,\infty)}(\min x_i) = I_{(\theta,\infty)}(x_1)I_{(\theta,\infty)}(x_2)I_{(\theta,\infty)}(x_3)$. A similar statement cannot be made with $\max x_i$. Thus $Y_1 = \min X_i$ is the sufficient statistic for $\theta$, not $Y_3 = \max X_i$.
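The role of the indicator function can be made concrete with a small numerical check. This is a hedged sketch with made-up sample values; `joint_pdf` and the samples `a`, `b`, `c` are illustrative names only. Samples `a` and `b` share a minimum, whereas `a` and `c` share a maximum.

```python
from math import exp

def joint_pdf(xs, theta):
    """Joint pdf of the sample: product of exp(-(x_i - theta)) * I(theta < x_i)."""
    if min(xs) <= theta:  # the indicator I_(theta, inf)(min x_i) is zero here
        return 0.0
    return exp(-sum(x - theta for x in xs))

a = (1.0, 2.0, 3.0)
b = (1.0, 2.5, 4.0)  # same minimum as a
c = (2.0, 2.5, 3.0)  # same maximum as a

for theta in (0.0, 0.5, 1.5):
    print(theta, joint_pdf(a, theta), joint_pdf(b, theta), joint_pdf(c, theta))
```

For `a` and `b` the two pdfs vanish for exactly the same values of $\theta$ and are otherwise in a fixed ratio. For `a` and `c`, taking $\theta = 1.5$ makes the pdf at `a` zero while the pdf at `c` stays positive, so the two likelihoods are not proportional even though the maxima coincide.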

EXERCISES

2.1. Let $X_1, X_2, \ldots, X_n$ be iid $N(0, \theta)$, $0 < \theta < \infty$. Show that $\sum_{1}^{n} X_i^2$ is a sufficient statistic for $\theta$.

2.2. Prove that the sum of the observations of a random sample of size $n$ from a Poisson distribution having parameter $\theta$, $0 < \theta < \infty$, is a sufficient statistic for $\theta$.

2.3. Show that the $n$th order statistic of a random sample of size $n$ from the uniform distribution having pdf $f(x;\theta) = 1/\theta$, $0 < x < \theta$, $0 < \theta < \infty$, zero elsewhere, is a sufficient statistic for $\theta$. Generalize this result by considering the pdf $f(x;\theta) = Q(\theta)M(x)$, $0 < x < \theta$, $0 < \theta < \infty$, zero elsewhere. Here, of course,
$$\int_0^\theta M(x)\,dx = \frac{1}{Q(\theta)}.$$

2.4. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a geometric distribution that has pmf $f(x;\theta) = (1 - \theta)^x\theta$, $x = 0, 1, 2, \ldots$, $0 < \theta < 1$, zero elsewhere. Show that $\sum_{1}^{n} X_i$ is a sufficient statistic for $\theta$.

2.5. Show that the sum of the observations of a random sample of size $n$ from a gamma distribution that has pdf $f(x;\theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, $0 < \theta < \infty$, zero elsewhere, is a sufficient statistic for $\theta$.

2.6. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a beta distribution with parameters $\alpha = \theta$ and $\beta = 5$. Show that the product $X_1X_2\cdots X_n$ is a sufficient statistic for $\theta$.

2.7. Show that the product of the sample observations is a sufficient statistic for $\theta > 0$ if the random sample is taken from a gamma distribution with parameters $\alpha = \theta$ and $\beta = 6$.


2.8. What is the sufficient statistic for $\theta$ if the sample arises from a beta distribution in which $\alpha = \beta = \theta > 0$?

2.9. We consider a random sample $X_1, X_2, \ldots, X_n$ from a distribution with pdf $f(x;\theta) = (1/\theta)\exp(-x/\theta)$, $0 < x < \infty$, zero elsewhere, where $0 < \theta$. Possibly, in a life-testing situation, however, we only observe the first $r$ order statistics $Y_1 < Y_2 < \cdots < Y_r$.

(a) Record the joint pdf of these order statistics and denote it by $L(\theta)$.

(b) Under these conditions, find the mle, $\hat\theta$, by maximizing $L(\theta)$.

(c) Find the mgf and pdf of $\hat\theta$.

(d) With a slight extension of the definition of sufficiency, is $\hat\theta$ a sufficient statistic?

3 Properties of a Sufficient Statistic

Suppose $X_1, X_2, \ldots, X_n$ is a random sample on a random variable with pdf or pmf $f(x;\theta)$, where $\theta \in \Omega$. In this section we discuss how sufficiency is used to determine MVUEs. First note that a sufficient statistic is not unique in any sense. For if $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ is a sufficient statistic and $Y_2 = g(Y_1)$, where $g(x)$ is a one-to-one function, is a statistic, then, writing $y_1 = u_1(x_1, x_2, \ldots, x_n)$ and $y_2 = g(y_1)$,
$$f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta) = k_1[y_1;\theta]\,k_2(x_1, x_2, \ldots, x_n) = k_1[g^{-1}(y_2);\theta]\,k_2(x_1, x_2, \ldots, x_n);$$
hence, by the factorization theorem, $Y_2$ is also sufficient. However, as the theorem below shows, sufficiency can lead to a best point estimate.

If $X_1$ and $X_2$ are random variables such that the variance of $X_2$ exists, then
$$E[X_2] = E[E(X_2 \mid X_1)] \quad\text{and}\quad \operatorname{Var}(X_2) \ge \operatorname{Var}[E(X_2 \mid X_1)].$$
For the adaptation in the context of sufficient statistics, we let the sufficient statistic $Y_1$ be $X_1$ and $Y_2$, an unbiased statistic of $\theta$, be $X_2$. Thus, with $E(Y_2 \mid y_1) = \varphi(y_1)$, we have
$$\theta = E(Y_2) = E[\varphi(Y_1)] \quad\text{and}\quad \operatorname{Var}(Y_2) \ge \operatorname{Var}[\varphi(Y_1)].$$


That is, through this conditioning, the function $\varphi(Y_1)$ of the sufficient statistic $Y_1$ is an unbiased estimator of $\theta$ having a smaller variance than that of the unbiased estimator $Y_2$. We summarize this discussion more formally in the following theorem, which can be attributed to Rao and Blackwell.

Theorem 3.1 (Rao–Blackwell). Let $X_1, X_2, \ldots, X_n$, $n$ a fixed positive integer, denote a random sample from a distribution (continuous or discrete) that has pdf or pmf $f(x;\theta)$, $\theta \in \Omega$. Let $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ be a sufficient statistic for $\theta$, and let $Y_2 = u_2(X_1, X_2, \ldots, X_n)$, not a function of $Y_1$ alone, be an unbiased estimator of $\theta$. Then $E(Y_2 \mid y_1) = \varphi(y_1)$ defines a statistic $\varphi(Y_1)$. This statistic $\varphi(Y_1)$ is a function of the sufficient statistic for $\theta$; it is an unbiased estimator of $\theta$; and its variance is less than or equal to that of $Y_2$.

This theorem tells us that in our search for an MVUE of a parameter, we may, if a sufficient statistic for the parameter exists, restrict that search to functions of the sufficient statistic. For if we begin with an unbiased estimator $Y_2$ alone, then we can always improve on this by computing $E(Y_2 \mid y_1) = \varphi(y_1)$ so that $\varphi(Y_1)$ is an unbiased estimator with a smaller variance than that of $Y_2$.

After Theorem 3.1, many students believe that it is necessary to find first some unbiased estimator $Y_2$ in their search for $\varphi(Y_1)$, an unbiased estimator of $\theta$ based upon the sufficient statistic $Y_1$. This is not the case at all, and Theorem 3.1 simply convinces us that we can restrict our search for a best estimator to functions of $Y_1$. Furthermore, there is a connection between sufficient statistics and maximum likelihood estimates, as shown in the following theorem:

Theorem 3.2. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that has pdf or pmf $f(x;\theta)$, $\theta \in \Omega$. If a sufficient statistic $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ for $\theta$ exists and if a maximum likelihood estimator $\hat\theta$ of $\theta$ also exists uniquely, then $\hat\theta$ is a function of $Y_1 = u_1(X_1, X_2, \ldots, X_n)$.

Proof. Let $f_{Y_1}(y_1;\theta)$ be the pdf or pmf of $Y_1$. Then by the definition of sufficiency, the likelihood function
$$L(\theta; x_1, x_2, \ldots, x_n) = f(x_1;\theta)f(x_2;\theta)\cdots f(x_n;\theta) = f_{Y_1}[u_1(x_1, x_2, \ldots, x_n);\theta]\,H(x_1, x_2, \ldots, x_n),$$
where $H(x_1, x_2, \ldots, x_n)$ does not depend upon $\theta$. Thus $L$ and $f_{Y_1}$, as functions of $\theta$, are maximized simultaneously. Since there is one and only one value of $\theta$ that maximizes $L$ and hence $f_{Y_1}[u_1(x_1, x_2, \ldots, x_n);\theta]$, that value of $\theta$ must be a function of $u_1(x_1, x_2, \ldots, x_n)$. Thus the mle $\hat\theta$ is a function of the sufficient statistic $Y_1 = u_1(X_1, X_2, \ldots, X_n)$.

We know that, generally, mles are asymptotically unbiased estimators of $\theta$. Hence, one way to proceed is to find a sufficient statistic and then find the mle. Based on this, we can often obtain an unbiased estimator which is a function of the sufficient statistic. This process is illustrated in the following example.


Example 3.1. Let $X_1, \ldots, X_n$ be iid with pdf
$$f(x;\theta) = \begin{cases} \theta e^{-\theta x} & 0 < x < \infty,\ \theta > 0 \\ 0 & \text{elsewhere.} \end{cases}$$
Suppose we want an MVUE of $\theta$. The joint pdf (likelihood function) is
$$L(\theta; x_1, \ldots, x_n) = \theta^n e^{-\theta\sum_{i=1}^{n} x_i}, \quad\text{for } x_i > 0,\ i = 1, \ldots, n.$$
Hence, by the factorization theorem, the statistic $Y_1 = \sum_{i=1}^{n} X_i$ is sufficient. The log of the likelihood function is
$$l(\theta) = n\log\theta - \theta\sum_{i=1}^{n} x_i.$$
Taking the partial with respect to $\theta$ of $l(\theta)$ and setting it to 0 results in the mle of $\theta$, which is given by
$$Y_2 = \frac{1}{\overline{X}}.$$
Note that $Y_2 = n/Y_1$ is a function of the sufficient statistic $Y_1$. Also, since $Y_2$ is the mle of $\theta$, it is asymptotically unbiased. Hence, as a first step, we shall determine its expectation. In this problem, $X_i$ are iid $\Gamma(1, 1/\theta)$ random variables; hence, $Y_1 = \sum_{i=1}^{n} X_i$ is $\Gamma(n, 1/\theta)$. Therefore,
$$E(Y_2) = E\left[\frac{1}{\overline{X}}\right] = nE\left[\frac{1}{\sum_{i=1}^{n} X_i}\right] = n\int_0^\infty \frac{\theta^n}{\Gamma(n)}\,t^{-1}t^{n-1}e^{-\theta t}\,dt;$$
making the change of variable $z = \theta t$ and simplifying results in
$$E(Y_2) = E\left[\frac{1}{\overline{X}}\right] = \theta\,\frac{n}{(n-1)!}\,\Gamma(n-1) = \theta\,\frac{n}{n-1}.$$
Thus the statistic $[(n - 1)Y_2]/n = (n - 1)/\sum_{i=1}^{n} X_i$ is an MVUE of $\theta$.

In the next two sections, we discover that, in most instances, if there is one function $\varphi(Y_1)$ that is unbiased, $\varphi(Y_1)$ is the only unbiased estimator based on the sufficient statistic $Y_1$.

Remark 3.1. Since the unbiased estimator $\varphi(Y_1)$, where $\varphi(Y_1) = E(Y_2 \mid y_1)$, has a variance smaller than that of the unbiased estimator $Y_2$ of $\theta$, students sometimes reason as follows. Let the function $\Upsilon(y_3) = E[\varphi(Y_1) \mid Y_3 = y_3]$, where $Y_3$ is another statistic, which is not sufficient for $\theta$. By the Rao–Blackwell theorem, we have $E[\Upsilon(Y_3)] = \theta$ and $\Upsilon(Y_3)$ has a smaller variance than does $\varphi(Y_1)$. Accordingly, $\Upsilon(Y_3)$ must be better than $\varphi(Y_1)$ as an unbiased estimator of $\theta$. But this is not true, because $Y_3$ is not sufficient; thus, $\theta$ is present in the conditional distribution of $Y_1$, given $Y_3 = y_3$, and the conditional mean $\Upsilon(y_3)$. So although indeed $E[\Upsilon(Y_3)] = \theta$, $\Upsilon(Y_3)$ is not even a statistic because it involves the unknown parameter $\theta$ and hence cannot be used as an estimate.
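The bias calculation in Example 3.1 can be checked by simulation. The sketch below is an illustration under assumed settings ($\theta = 2$, $n = 5$, a fixed seed); the function name `simulate` is hypothetical. It relies on the standard-library call `random.expovariate(lambd)`, whose argument is the rate, so it matches the pdf $\theta e^{-\theta x}$.

```python
import random

def simulate(theta, n, reps, seed=0):
    """Average the mle n/Y_1 and the adjusted estimator (n-1)/Y_1 over reps samples."""
    rng = random.Random(seed)
    mle_sum = 0.0
    adj_sum = 0.0
    for _ in range(reps):
        # Y_1 is the sum of n exponential observations with rate theta.
        y1 = sum(rng.expovariate(theta) for _ in range(n))
        mle_sum += n / y1
        adj_sum += (n - 1) / y1
    return mle_sum / reps, adj_sum / reps

mle_mean, adj_mean = simulate(theta=2.0, n=5, reps=20000)
print(mle_mean)  # should be near theta * n / (n - 1) = 2.5
print(adj_mean)  # should be near theta = 2.0
```

The simulated averages only approximate the expectations, but with this many replications the mle should sit visibly above $\theta$ while the adjusted estimator should be close to it.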


Example 3.2. Let $X_1, X_2, X_3$ be a random sample from an exponential distribution with mean $\theta > 0$, so that the joint pdf is
$$\left(\frac{1}{\theta}\right)^3 e^{-(x_1+x_2+x_3)/\theta}, \quad 0 < x_i < \infty,$$
$i = 1, 2, 3$, zero elsewhere. From the factorization theorem, we see that $Y_1 = X_1 + X_2 + X_3$ is a sufficient statistic for $\theta$. Of course,
$$E(Y_1) = E(X_1 + X_2 + X_3) = 3\theta,$$
and thus $Y_1/3 = \overline{X}$ is a function of the sufficient statistic that is an unbiased estimator of $\theta$.

In addition, let $Y_2 = X_2 + X_3$ and $Y_3 = X_3$. The one-to-one transformation defined by
$$x_1 = y_1 - y_2, \quad x_2 = y_2 - y_3, \quad x_3 = y_3$$
has Jacobian equal to 1 and the joint pdf of $Y_1, Y_2, Y_3$ is
$$g(y_1, y_2, y_3;\theta) = \left(\frac{1}{\theta}\right)^3 e^{-y_1/\theta}, \quad 0 < y_3 < y_2 < y_1 < \infty,$$
zero elsewhere. The marginal pdf of $Y_1$ and $Y_3$ is found by integrating out $y_2$ to obtain
$$g_{13}(y_1, y_3;\theta) = \left(\frac{1}{\theta}\right)^3 (y_1 - y_3)e^{-y_1/\theta}, \quad 0 < y_3 < y_1 < \infty,$$
zero elsewhere. The pdf of $Y_3$ alone is
$$g_3(y_3;\theta) = \frac{1}{\theta}e^{-y_3/\theta}, \quad 0 < y_3 < \infty,$$
zero elsewhere, since $Y_3 = X_3$ is an observation of a random sample from this exponential distribution. Accordingly, the conditional pdf of $Y_1$, given $Y_3 = y_3$, is
$$g_{1|3}(y_1 \mid y_3) = \frac{g_{13}(y_1, y_3;\theta)}{g_3(y_3;\theta)} = \left(\frac{1}{\theta}\right)^2 (y_1 - y_3)e^{-(y_1-y_3)/\theta}, \quad 0 < y_3 < y_1 < \infty,$$
zero elsewhere. Thus
$$E\left(\frac{Y_1}{3}\,\Big|\,y_3\right) = E\left(\frac{Y_1 - Y_3}{3}\,\Big|\,y_3\right) + E\left(\frac{Y_3}{3}\,\Big|\,y_3\right) = \frac{1}{3}\left(\frac{1}{\theta}\right)^2\int_{y_3}^{\infty}(y_1 - y_3)^2 e^{-(y_1-y_3)/\theta}\,dy_1 + \frac{y_3}{3} = \frac{1}{3}\,\frac{\Gamma(3)\theta^3}{\theta^2} + \frac{y_3}{3} = \frac{2\theta}{3} + \frac{y_3}{3} = \Upsilon(y_3).$$


Of course, $E[\Upsilon(Y_3)] = \theta$ and $\operatorname{var}[\Upsilon(Y_3)] \le \operatorname{var}(Y_1/3)$, but $\Upsilon(Y_3)$ is not a statistic, as it involves $\theta$ and cannot be used as an estimator of $\theta$. This illustrates the preceding remark.
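The point of Example 3.2 can also be seen in a simulation. The following sketch is illustrative only: the function name `simulate_example`, the choice $\theta = 2$, and the seed are assumptions. Note that computing $\Upsilon(Y_3) = (2\theta + Y_3)/3$ requires passing the true $\theta$ into the code, which is exactly why it cannot serve as an estimator in practice.

```python
import random
from statistics import fmean, pvariance

def simulate_example(theta, reps, seed=0):
    """Compare Y_1/3 with Upsilon(Y_3) = (2*theta + Y_3)/3 for exponential samples."""
    rng = random.Random(seed)
    ybar, upsilon = [], []
    for _ in range(reps):
        # expovariate takes a rate, so rate 1/theta gives mean theta.
        x1, x2, x3 = (rng.expovariate(1 / theta) for _ in range(3))
        ybar.append((x1 + x2 + x3) / 3)       # Y_1 / 3, a genuine statistic
        upsilon.append((2 * theta + x3) / 3)  # needs the unknown theta
    return fmean(ybar), pvariance(ybar), fmean(upsilon), pvariance(upsilon)

m1, v1, m2, v2 = simulate_example(theta=2.0, reps=20000)
print(m1, v1)  # mean near theta, variance near theta^2 / 3
print(m2, v2)  # mean near theta, variance near theta^2 / 9
```

Both quantities average to roughly $\theta$, and $\Upsilon(Y_3)$ shows the smaller variance, but only because the unknown parameter was supplied to it.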

EXERCISES

3.1. In each of Exercises 2.1–2.4, show that the mle of $\theta$ is a function of the sufficient statistic for $\theta$.

3.2. Let $Y_1 < Y_2 < Y_3 < Y_4 < Y_5$ be the order statistics of a random sample of size 5 from the uniform distribution having pdf $f(x;\theta) = 1/\theta$, $0 < x < \theta$, $0 < \theta < \infty$, zero elsewhere. Show that $2Y_3$ is an unbiased estimator of $\theta$. Determine the joint pdf of $Y_3$ and the sufficient statistic $Y_5$ for $\theta$. Find the conditional expectation $E(2Y_3 \mid y_5) = \varphi(y_5)$. Compare the variances of $2Y_3$ and $\varphi(Y_5)$.

3.3. If $X_1, X_2$ is a random sample of size 2 from a distribution having pdf $f(x;\theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, $0 < \theta < \infty$, zero elsewhere, find the joint pdf of the sufficient statistic $Y_1 = X_1 + X_2$ for $\theta$ and $Y_2 = X_2$. Show that $Y_2$ is an unbiased estimator of $\theta$ with variance $\theta^2$. Find $E(Y_2 \mid y_1) = \varphi(y_1)$ and the variance of $\varphi(Y_1)$.

3.4. Let $f(x, y) = (2/\theta^2)e^{-(x+y)/\theta}$, $0 < x < y < \infty$, zero elsewhere, be the joint pdf of the random variables $X$ and $Y$.

(a) Show that the mean and the variance of $Y$ are, respectively, $3\theta/2$ and $5\theta^2/4$.

(b) Show that $E(Y \mid x) = x + \theta$. In accordance with the theory, the expected value of $X + \theta$ is that of $Y$, namely, $3\theta/2$, and the variance of $X + \theta$ is less than that of $Y$. Show that the variance of $X + \theta$ is in fact $\theta^2/4$.

3.5. In each of Exercises 2.1–2.3, compute the expected value of the given sufficient statistic and, in each case, determine an unbiased estimator of $\theta$ that is a function of that sufficient statistic alone.

3.6. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with mean $\theta$. Find the conditional expectation $E\left(X_1 + 2X_2 + 3X_3 \mid \sum_{1}^{n} X_i\right)$.

4 Completeness and Uniqueness

Let $X_1, X_2, \ldots, X_n$ be a random sample from the Poisson distribution that has pmf
$$f(x;\theta) = \begin{cases} \dfrac{\theta^x e^{-\theta}}{x!} & x = 0, 1, 2, \ldots;\ \theta > 0 \\ 0 & \text{elsewhere.} \end{cases}$$
From Exercise 2.2, we know that $Y_1 = \sum_{i=1}^{n} X_i$ is a sufficient statistic for $\theta$ and its pmf is
$$g_1(y_1;\theta) = \begin{cases} \dfrac{(n\theta)^{y_1}e^{-n\theta}}{y_1!} & y_1 = 0, 1, 2, \ldots \\ 0 & \text{elsewhere.} \end{cases}$$
Let us consider the family $\{g_1(y_1;\theta) : \theta > 0\}$ of probability mass functions. Suppose that the function $u(Y_1)$ of $Y_1$ is such that $E[u(Y_1)] = 0$ for every $\theta > 0$. We shall show that this requires $u(y_1)$ to be zero at every point $y_1 = 0, 1, 2, \ldots$. That is, $E[u(Y_1)] = 0$ for $\theta > 0$ requires
$$0 = u(0) = u(1) = u(2) = u(3) = \cdots.$$
We have for all $\theta > 0$ that
$$0 = E[u(Y_1)] = \sum_{y_1=0}^{\infty} u(y_1)\frac{(n\theta)^{y_1}e^{-n\theta}}{y_1!} = e^{-n\theta}\left[u(0) + u(1)\frac{n\theta}{1!} + u(2)\frac{(n\theta)^2}{2!} + \cdots\right].$$
Since $e^{-n\theta}$ does not equal zero, we have shown that
$$0 = u(0) + [nu(1)]\theta + \left[\frac{n^2u(2)}{2}\right]\theta^2 + \cdots.$$
However, if such an infinite (power) series converges to zero for all $\theta > 0$, then each of the coefficients must equal zero. That is,
$$u(0) = 0, \quad nu(1) = 0, \quad \frac{n^2u(2)}{2} = 0, \ \ldots,$$
and thus $0 = u(0) = u(1) = u(2) = \cdots$, as we wanted to show. Of course, the condition $E[u(Y_1)] = 0$ for all $\theta > 0$ does not place any restriction on $u(y_1)$ when $y_1$ is not a nonnegative integer. So we see that, in this illustration, $E[u(Y_1)] = 0$ for all $\theta > 0$ requires that $u(y_1)$ equals zero except on a set of points that has probability zero for each pmf $g_1(y_1;\theta)$, $0 < \theta$. From the following definition we observe that the family $\{g_1(y_1;\theta) : 0 < \theta\}$ is complete.

Definition 4.1. Let the random variable $Z$ of either the continuous type or the discrete type have a pdf or pmf that is one member of the family $\{h(z;\theta) : \theta \in \Omega\}$. If the condition $E[u(Z)] = 0$, for every $\theta \in \Omega$, requires that $u(z)$ be zero except on a set of points that has probability zero for each $h(z;\theta)$, $\theta \in \Omega$, then the family $\{h(z;\theta) : \theta \in \Omega\}$ is called a complete family of probability density or mass functions.

Remark 4.1. The existence of $E[u(X)]$ implies that the integral (or sum) converges absolutely. This absolute convergence was tacitly assumed in our definition of completeness and it is needed to prove that certain families of probability density functions are complete.

In order to show that certain families of probability density functions of the continuous type are complete, we must appeal to the same type of theorem in analysis that we used when we claimed that the moment generating function uniquely determines a distribution. This is illustrated in the next example.


Example 4.1. Consider the family of pdfs $\{h(z;\theta) : 0 < \theta < \infty\}$. Suppose $Z$ has a pdf in this family given by
$$h(z;\theta) = \begin{cases} \dfrac{1}{\theta}e^{-z/\theta} & 0 < z < \infty \\ 0 & \text{elsewhere.} \end{cases}$$
Let us say that $E[u(Z)] = 0$ for every $\theta > 0$. That is,
$$\frac{1}{\theta}\int_0^\infty u(z)e^{-z/\theta}\,dz = 0, \quad \theta > 0.$$
Readers acquainted with the theory of transformations recognize the integral in the left-hand member as being essentially the Laplace transform of $u(z)$. In that theory we learn that the only function $u(z)$ transforming to a function of $\theta$ which is identically equal to zero is $u(z) = 0$, except (in our terminology) on a set of points that has probability zero for each $h(z;\theta)$, $\theta > 0$. That is, the family $\{h(z;\theta) : 0 < \theta < \infty\}$ is complete.

Let the parameter $\theta$ in the pdf or pmf $f(x;\theta)$, $\theta \in \Omega$, have a sufficient statistic $Y_1 = u_1(X_1, X_2, \ldots, X_n)$, where $X_1, X_2, \ldots, X_n$ is a random sample from this distribution. Let the pdf or pmf of $Y_1$ be $f_{Y_1}(y_1;\theta)$, $\theta \in \Omega$. It has been seen that if there is any unbiased estimator $Y_2$ (not a function of $Y_1$ alone) of $\theta$, then there is at least one function of $Y_1$ that is an unbiased estimator of $\theta$, and our search for a best estimator of $\theta$ may be restricted to functions of $Y_1$. Suppose it has been verified that a certain function $\varphi(Y_1)$, not a function of $\theta$, is such that $E[\varphi(Y_1)] = \theta$ for all values of $\theta$, $\theta \in \Omega$. Let $\psi(Y_1)$ be another function of the sufficient statistic $Y_1$ alone, so that we also have $E[\psi(Y_1)] = \theta$ for all values of $\theta$, $\theta \in \Omega$. Hence
$$E[\varphi(Y_1) - \psi(Y_1)] = 0, \quad \theta \in \Omega.$$
If the family $\{f_{Y_1}(y_1;\theta) : \theta \in \Omega\}$ is complete, the function $\varphi(y_1) - \psi(y_1) = 0$, except on a set of points that has probability zero. That is, for every other unbiased estimator $\psi(Y_1)$ of $\theta$, we have
$$\varphi(y_1) = \psi(y_1)$$
except possibly at certain special points. Thus, in this sense [namely $\varphi(y_1) = \psi(y_1)$, except on a set of points with probability zero], $\varphi(Y_1)$ is the unique function of $Y_1$, which is an unbiased estimator of $\theta$. In accordance with the Rao–Blackwell theorem, $\varphi(Y_1)$ has a smaller variance than every other unbiased estimator of $\theta$. That is, the statistic $\varphi(Y_1)$ is the MVUE of $\theta$. This fact is stated in the following theorem of Lehmann and Scheffé.

Theorem 4.1 (Lehmann and Scheffé). Let $X_1, X_2, \ldots, X_n$, $n$ a fixed positive integer, denote a random sample from a distribution that has pdf or pmf $f(x;\theta)$, $\theta \in \Omega$, let $Y_1 = u_1(X_1, X_2, \ldots, X_n)$ be a sufficient statistic for $\theta$, and let the family $\{f_{Y_1}(y_1;\theta) : \theta \in \Omega\}$ be complete. If there is a function of $Y_1$ that is an unbiased estimator of $\theta$, then this function of $Y_1$ is the unique MVUE of $\theta$. Here "unique" is used in the sense described in the preceding paragraph.


The statement that Y1 is a sufficient statistic for a parameter θ, θ ∈ Ω, and that the family {fY1(y1; θ) : θ ∈ Ω} of probability density functions is complete is lengthy and somewhat awkward. We shall adopt the less descriptive, but more convenient, terminology that Y1 is a complete sufficient statistic for θ. In the next section, we study a fairly large class of probability density functions for which a complete sufficient statistic Y1 for θ can be determined by inspection.

Example 4.2 (Uniform Distribution). Let X1, X2, . . . , Xn be a random sample from the uniform distribution with pdf f(x; θ) = 1/θ, 0 < x < θ, θ > 0, and zero elsewhere. As Exercise 2.3 shows, Yn = max{X1, X2, . . . , Xn} is a sufficient statistic for θ. It is easy to show that the pdf of Yn is

$$g(y_n;\theta) = \begin{cases} \dfrac{n y_n^{n-1}}{\theta^n} & 0 < y_n < \theta \\ 0 & \text{elsewhere.} \end{cases} \tag{4.1}$$

To show that Yn is complete, suppose for any function u(t) and any θ that E[u(Yn)] = 0; i.e.,

$$0 = \int_0^{\theta} u(t)\,\frac{n t^{n-1}}{\theta^n}\,dt.$$

Since θ > 0, this equation is equivalent to

$$0 = \int_0^{\theta} u(t)\,t^{n-1}\,dt.$$

Taking partial derivatives of both sides with respect to θ and using the Fundamental Theorem of Calculus, we have $0 = u(\theta)\theta^{n-1}$. Since θ > 0, u(θ) = 0, for all θ > 0. Thus Yn is a complete and sufficient statistic for θ. It is easy to show that

$$E(Y_n) = \int_0^{\theta} y\,\frac{n y^{n-1}}{\theta^n}\,dy = \frac{n}{n+1}\,\theta.$$

Therefore, the MVUE of θ is ((n + 1)/n)Yn.
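The conclusion of Example 4.2 can be checked by simulation. The following Python sketch is an illustration only and is not part of the original development: the function name simulate_uniform_estimators, the values θ = 3 and n = 5, and the comparison estimator 2X̄ (twice the sample mean, which is also unbiased for θ) are assumptions made for the example.

```python
import random
import statistics

def simulate_uniform_estimators(theta, n, reps, seed=0):
    """Simulate two unbiased estimators of theta for uniform(0, theta) samples.

    mvue   : ((n + 1) / n) * max(X_i), the estimator from Example 4.2
    moment : 2 * mean(X_i), a second unbiased estimator used for comparison
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    mvue, moment = [], []
    for _ in range(reps):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        mvue.append((n + 1) / n * max(sample))
        moment.append(2.0 * statistics.fmean(sample))
    return mvue, moment

mvue, moment = simulate_uniform_estimators(theta=3.0, n=5, reps=20000)
print("mean of MVUE      :", statistics.fmean(mvue))
print("mean of 2*Xbar    :", statistics.fmean(moment))
print("variance of MVUE  :", statistics.pvariance(mvue))    # theory: theta^2 / (n (n + 2))
print("variance of 2*Xbar:", statistics.pvariance(moment))  # theory: theta^2 / (3 n)
```

Both simulated means should be close to θ, while the estimator based on Yn should show the smaller variance, as the theorem of Lehmann and Scheffé leads one to expect.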

EXERCISES

4.1. If $az^2 + bz + c = 0$ for more than two values of z, then a = b = c = 0. Use this result to show that the family {b(2, θ) : 0 < θ < 1} is complete.

4.2. Show that each of the following families is not complete by finding at least one nonzero function u(x) such that E[u(X)] = 0, for all θ > 0.

(a)
$$f(x;\theta) = \begin{cases} \dfrac{1}{2\theta} & -\theta < x < \theta \\ 0 & \text{elsewhere,} \end{cases}$$
where 0 < θ < ∞.


(b) N(0, θ), where 0 < θ < ∞.

4.3. Let X1, X2, . . . , Xn represent a random sample from the discrete distribution having the pmf
$$f(x;\theta) = \begin{cases} \theta^{x}(1-\theta)^{1-x} & x = 0, 1,\ 0 < \theta < 1 \\ 0 & \text{elsewhere.} \end{cases}$$
Show that $Y_1 = \sum_{i=1}^{n} X_i$ is a complete sufficient statistic for θ. Find the unique function of Y1 that is the MVUE of θ.
Hint: Display E[u(Y1)] = 0, show that the constant term u(0) is equal to zero, divide both members of the equation by θ ≠ 0, and repeat the argument.

4.4. Consider the family of probability density functions {h(z; θ) : θ ∈ Ω}, where h(z; θ) = 1/θ, 0 < z < θ, zero elsewhere.
(a) Show that the family is complete provided that Ω = {θ : 0 < θ < ∞}.
Hint: For convenience, assume that u(z) is continuous and note that the derivative of E[u(Z)] with respect to θ is equal to zero also.
(b) Show that this family is not complete if Ω = {θ : 1 < θ < ∞}.
Hint: Concentrate on the interval 0 < z < 1 and find a nonzero function u(z) on that interval such that E[u(Z)] = 0 for all θ > 1.

4.5. Show that the first order statistic Y1 of a random sample of size n from the distribution having pdf $f(x;\theta) = e^{-(x-\theta)}$, θ < x < ∞, −∞ < θ < ∞, zero elsewhere, is a complete sufficient statistic for θ. Find the unique function of this statistic which is the MVUE of θ.

4.6. Let a random sample of size n be taken from a distribution of the discrete type with pmf f(x; θ) = 1/θ, x = 1, 2, . . . , θ, zero elsewhere, where θ is an unknown positive integer.
(a) Show that the largest observation, say Y, of the sample is a complete sufficient statistic for θ.
(b) Prove that $[Y^{n+1} - (Y-1)^{n+1}]/[Y^{n} - (Y-1)^{n}]$ is the unique MVUE of θ.

4.7. Let X have the pdf fX(x; θ) = 1/(2θ), for −θ < x < θ, zero elsewhere, where θ > 0.
(a) Is the statistic Y = |X| a sufficient statistic for θ? Why?
(b) Let fY(y; θ) be the pdf of Y. Is the family {fY(y; θ) : θ > 0} complete? Why?

4.8. Let X have the pmf $p(x;\theta) = \frac{1}{2}\binom{n}{|x|}\theta^{|x|}(1-\theta)^{n-|x|}$, for x = ±1, ±2, . . . , ±n, $p(0,\theta) = (1-\theta)^{n}$, and zero elsewhere, where 0 < θ < 1.
(a) Show that this family {p(x; θ) : 0 < θ < 1} is not complete.


(b) Let Y = |X|. Show that Y is a complete and sufficient statistic for θ.

4.9. Let X1, . . . , Xn be iid with pdf f(x; θ) = 1/(3θ), −θ < x < 2θ, zero elsewhere, where θ > 0.
(a) Find the mle θ̂ of θ.
(b) Is θ̂ a sufficient statistic for θ? Why?
(c) Is (n + 1)θ̂/n the unique MVUE of θ? Why?

4.10. Let Y1 < Y2 < · · · < Yn be the order statistics of a random sample of size n from a distribution with pdf f(x; θ) = 1/θ, 0 < x < θ, zero elsewhere. By Example 4.2, the statistic Yn is a complete sufficient statistic for θ and it has pdf
$$g(y_n;\theta) = \frac{n y_n^{n-1}}{\theta^n}, \quad 0 < y_n < \theta,$$
and zero elsewhere.
(a) Find the distribution function Hn(z; θ) of Z = n(θ − Yn).
(b) Find the $\lim_{n\to\infty} H_n(z;\theta)$ and thus the limiting distribution of Z.

5

The Exponential Class of Distributions

In this section we discuss an important class of distributions, called the exponential class. As we show, this class possesses complete and sufficient statistics which are readily determined from the distribution. Consider a family {f(x; θ) : θ ∈ Ω} of probability density or mass functions, where Ω is the interval set Ω = {θ : γ < θ < δ}, where γ and δ are known constants (they may be ±∞), and where

$$f(x;\theta) = \begin{cases} \exp[p(\theta)K(x) + H(x) + q(\theta)] & x \in S \\ 0 & \text{elsewhere,} \end{cases} \tag{5.1}$$

where S is the support of X. In this section we are concerned with a particular class of the family called the regular exponential class.

Definition 5.1 (Regular Exponential Class). A pdf of the form (5.1) is said to be a member of the regular exponential class of probability density or mass functions if
1. S, the support of X, does not depend upon θ
2. p(θ) is a nontrivial continuous function of θ ∈ Ω
3. Finally,
(a) if X is a continuous random variable, then each of K′(x) ≢ 0 and H(x) is a continuous function of x ∈ S,


(b) if X is a discrete random variable, then K(x) is a nontrivial function of x ∈ S.

For example, each member of the family {f(x; θ) : 0 < θ < ∞}, where f(x; θ) is N(0, θ), represents a regular case of the exponential class of the continuous type because

$$f(x;\theta) = \frac{1}{\sqrt{2\pi\theta}}\,e^{-x^2/2\theta} = \exp\left(-\frac{1}{2\theta}x^2 - \log\sqrt{2\pi\theta}\right), \quad -\infty < x < \infty.$$

On the other hand, consider the uniform density function given by

$$f(x;\theta) = \begin{cases} \exp\{-\log\theta\} & x \in (0,\theta) \\ 0 & \text{elsewhere.} \end{cases}$$

This can be written in the form (5.1), but the support is the interval (0, θ), which depends on θ. Hence the uniform family is not a regular exponential family.

Let X1, X2, . . . , Xn denote a random sample from a distribution that represents a regular case of the exponential class. The joint pdf or pmf of X1, X2, . . . , Xn is

$$\exp\left[p(\theta)\sum_{1}^{n} K(x_i) + \sum_{1}^{n} H(x_i) + nq(\theta)\right]$$

for xi ∈ S, i = 1, 2, . . . , n and zero elsewhere. At points in the support S of X, this joint pdf or pmf may be written as the product of the two nonnegative functions

$$\exp\left[p(\theta)\sum_{1}^{n} K(x_i) + nq(\theta)\right]\exp\left[\sum_{1}^{n} H(x_i)\right].$$

In accordance with the factorization theorem, Theorem 2.1, $Y_1 = \sum_{1}^{n} K(X_i)$ is a sufficient statistic for the parameter θ.

Besides the fact that Y1 is a sufficient statistic, we can obtain the general form of the distribution of Y1 and its mean and variance. We summarize these results in a theorem. The details of the proof are given in Exercises 5.5 and 5.8. Exercise 5.6 obtains the mgf of Y1 in the case that p(θ) = θ.

Theorem 5.1. Let X1, X2, . . . , Xn denote a random sample from a distribution that represents a regular case of the exponential class, with pdf or pmf given by (5.1). Consider the statistic $Y_1 = \sum_{i=1}^{n} K(X_i)$. Then

1. The pdf or pmf of Y1 has the form
$$f_{Y_1}(y_1;\theta) = R(y_1)\exp[p(\theta)y_1 + nq(\theta)], \tag{5.2}$$
for $y_1 \in S_{Y_1}$ and some function R(y1). Neither $S_{Y_1}$ nor R(y1) depends on θ.

2. $E(Y_1) = -n\,\dfrac{q'(\theta)}{p'(\theta)}$.

3. $\mathrm{Var}(Y_1) = n\,\dfrac{1}{p'(\theta)^3}\,\{p''(\theta)q'(\theta) - q''(\theta)p'(\theta)\}$.
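The moment formulas in parts 2 and 3 of Theorem 5.1 lend themselves to a quick numerical check. The Python sketch below is an illustration under stated assumptions: the helper name exp_class_moments is hypothetical, the derivatives of p and q are approximated by central differences rather than computed exactly, and the test family is the N(0, θ) case described earlier in this section, for which K(x) = x², p(θ) = −1/(2θ), and q(θ) = −log √(2πθ).

```python
import math

def exp_class_moments(p, q, theta, n, h=1e-3):
    """Mean and variance of Y1 = sum K(X_i) using the formulas of Theorem 5.1.

    p, q are the functions p(theta) and q(theta) of form (5.1). Their first
    and second derivatives are approximated by central differences with
    step h (a convenience assumed here; exact derivatives could be used).
    """
    d1 = lambda f: (f(theta + h) - f(theta - h)) / (2 * h)
    d2 = lambda f: (f(theta + h) - 2 * f(theta) + f(theta - h)) / h**2
    p1, p2, q1, q2 = d1(p), d2(p), d1(q), d2(q)
    mean = -n * q1 / p1
    var = n * (p2 * q1 - q2 * p1) / p1**3
    return mean, var

# N(0, theta): p(theta) = -1/(2 theta), q(theta) = -log sqrt(2 pi theta)
p = lambda t: -1.0 / (2.0 * t)
q = lambda t: -0.5 * math.log(2.0 * math.pi * t)

mean, var = exp_class_moments(p, q, theta=2.0, n=5)
print(mean, var)  # compare with n * theta and 2 * n * theta^2
```

For this family Y1 is the sum of the squared observations, so the values can be compared with nθ and 2nθ², the mean and variance of a sum of n squared N(0, θ) variables.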

Example 5.1. Let X have a Poisson distribution with parameter θ ∈ (0, ∞). Then the support of X is the set S = {0, 1, 2, . . .}, which does not depend on θ. Further, the pmf of X on its support is

$$f(x,\theta) = e^{-\theta}\,\frac{\theta^{x}}{x!} = \exp\{(\log\theta)x + \log(1/x!) + (-\theta)\}.$$

Hence the Poisson distribution is a member of the regular exponential class, with p(θ) = log(θ), q(θ) = −θ, and K(x) = x. Therefore, if X1, X2, . . . , Xn denotes a random sample on X, then the statistic $Y_1 = \sum_{i=1}^{n} X_i$ is sufficient. But since p′(θ) = 1/θ and q′(θ) = −1, Theorem 5.1 verifies that the mean of Y1 is nθ. It is easy to verify that the variance of Y1 is nθ also. Finally, we can show that the function R(y1) in Theorem 5.1 is given by $R(y_1) = n^{y_1}(1/y_1!)$.

For the regular case of the exponential class, we have shown that the statistic $Y_1 = \sum_{1}^{n} K(X_i)$ is sufficient for θ. We now use the form of the pdf of Y1 given in Theorem 5.1 to establish the completeness of Y1.

Theorem 5.2. Let f(x; θ), γ < θ < δ, be a pdf or pmf of a random variable X whose distribution is a regular case of the exponential class. Then if X1, X2, . . . , Xn (where n is a fixed positive integer) is a random sample from the distribution of X, the statistic $Y_1 = \sum_{1}^{n} K(X_i)$ is a sufficient statistic for θ and the family {fY1(y1; θ) : γ < θ < δ} of probability density functions of Y1 is complete. That is, Y1 is a complete sufficient statistic for θ.

Proof: We have shown above that Y1 is sufficient. For completeness, suppose that E[u(Y1)] = 0. Expression (5.2) of Theorem 5.1 gives the pdf of Y1. Hence we have the equation

$$\int_{S_{Y_1}} u(y_1)R(y_1)\exp\{p(\theta)y_1 + nq(\theta)\}\,dy_1 = 0$$

or equivalently, since exp{nq(θ)} ≠ 0,

$$\int_{S_{Y_1}} u(y_1)R(y_1)\exp\{p(\theta)y_1\}\,dy_1 = 0$$

for all θ. However, p(θ) is a nontrivial continuous function of θ, and thus this integral is essentially a type of Laplace transform of u(y1)R(y1). The only function of y1 transforming to the 0 function is the zero function (except for a set of points with probability zero in our context). That is, u(y1)R(y1) ≡ 0. However, R(y1) ≠ 0 for all $y_1 \in S_{Y_1}$ because it is a factor in the pdf of Y1. Hence u(y1) ≡ 0 (except for a set of points with probability zero). Therefore, Y1 is a complete sufficient statistic for θ.


This theorem has useful implications. In a regular case of form (5.1), we can see by inspection that the sufficient statistic is $Y_1 = \sum_{1}^{n} K(X_i)$. If we can see how to form a function of Y1, say ϕ(Y1), so that E[ϕ(Y1)] = θ, then the statistic ϕ(Y1) is unique and is the MVUE of θ.

Example 5.2. Let X1, X2, . . . , Xn denote a random sample from a normal distribution that has pdf

$$f(x;\theta) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\theta)^2}{2\sigma^2}\right], \quad -\infty < x < \infty,\ -\infty < \theta < \infty,$$

or

$$f(x;\theta) = \exp\left(\frac{\theta}{\sigma^2}\,x - \frac{x^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2} - \frac{\theta^2}{2\sigma^2}\right).$$

Here σ² is any fixed positive number. This is a regular case of the exponential class with

$$p(\theta) = \frac{\theta}{\sigma^2}, \quad K(x) = x, \quad H(x) = -\frac{x^2}{2\sigma^2} - \log\sqrt{2\pi\sigma^2}, \quad q(\theta) = -\frac{\theta^2}{2\sigma^2}.$$

Accordingly, Y1 = X1 + X2 + · · · + Xn = nX̄ is a complete sufficient statistic for the mean θ of a normal distribution for every fixed value of the variance σ². Since E(Y1) = nθ, then ϕ(Y1) = Y1/n = X̄ is the only function of Y1 that is an unbiased estimator of θ; and being a function of the sufficient statistic Y1, it has a minimum variance. That is, X̄ is the unique MVUE of θ. Incidentally, since Y1 is a one-to-one function of X̄, X̄ itself is also a complete sufficient statistic for θ.

Example 5.3 (Example 5.1, Continued). Reconsider the discussion concerning the Poisson distribution with parameter θ found in Example 5.1. Based on this discussion, the statistic $Y_1 = \sum_{i=1}^{n} X_i$ was sufficient. It follows from Theorem 5.2 that its family of distributions is complete. Since E(Y1) = nθ, it follows that $\overline{X} = n^{-1}Y_1$ is the unique MVUE of θ.
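The Poisson claims in Examples 5.1 and 5.3 can be checked numerically. The Python sketch below is illustrative only: it relies on the standard fact, not derived in this section, that a sum of n independent Poisson(θ) variables is Poisson(nθ); the function names and the values θ = 1.7 and n = 4 are choices made for the example.

```python
import math

def poisson_pmf(k, mu):
    """pmf of a Poisson(mu) random variable at the integer k >= 0."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def exp_class_form(y, theta, n):
    """R(y) * exp[p(theta) * y + n * q(theta)] with the Poisson choices
    p(theta) = log(theta), q(theta) = -theta, and R(y) = n**y / y!."""
    R = n**y / math.factorial(y)
    return R * math.exp(math.log(theta) * y + n * (-theta))

theta, n = 1.7, 4
# Y1 = X1 + ... + Xn is Poisson(n * theta); compare its pmf with form (5.2).
max_gap = max(abs(poisson_pmf(y, n * theta) - exp_class_form(y, theta, n))
              for y in range(40))
# E(Y1 / n), summed over a range that holds essentially all of the mass.
mean_xbar = sum((y / n) * poisson_pmf(y, n * theta) for y in range(100))
print(max_gap, mean_xbar)
```

The first number should be zero up to rounding error, which is consistent with R(y1) as stated in Example 5.1, and the second should agree with θ, which is consistent with X̄ being unbiased.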

EXERCISES

5.1. Write the pdf
$$f(x;\theta) = \frac{1}{6\theta^4}\,x^3 e^{-x/\theta}, \quad 0 < x < \infty,\ 0 < \theta < \infty,$$
zero elsewhere, in the exponential form. If X1, X2, . . . , Xn is a random sample from this distribution, find a complete sufficient statistic Y1 for θ and the unique function ϕ(Y1) of this statistic that is the MVUE of θ. Is ϕ(Y1) itself a complete sufficient statistic?

5.2. Let X1, X2, . . . , Xn denote a random sample of size n > 1 from a distribution with pdf $f(x;\theta) = \theta e^{-\theta x}$, 0 < x < ∞, zero elsewhere, and θ > 0. Then $Y = \sum_{1}^{n} X_i$ is a sufficient statistic for θ. Prove that (n − 1)/Y is the MVUE of θ.


5.3. Let X1, X2, . . . , Xn denote a random sample of size n from a distribution with pdf $f(x;\theta) = \theta x^{\theta-1}$, 0 < x < 1, zero elsewhere, and θ > 0.
(a) Show that the geometric mean $(X_1X_2\cdots X_n)^{1/n}$ of the sample is a complete sufficient statistic for θ.
(b) Find the maximum likelihood estimator of θ, and observe that it is a function of this geometric mean.

5.4. Let X̄ denote the mean of the random sample X1, X2, . . . , Xn from a gamma-type distribution with parameters α > 0 and β = θ > 0. Compute E[X1|x̄].
Hint: Can you find directly a function ψ(X̄) of X̄ such that E[ψ(X̄)] = θ? Is E(X1|x̄) = ψ(x̄)? Why?

5.5. Let X be a random variable with the pdf of a regular case of the exponential class, given by f(x; θ) = exp[θK(x) + H(x) + q(θ)], a < x < b, γ < θ < δ. Show that E[K(X)] = −q′(θ)/p′(θ), provided these derivatives exist, by differentiating both members of the equality
$$\int_a^b \exp[p(\theta)K(x) + H(x) + q(\theta)]\,dx = 1$$
with respect to θ. By a second differentiation, find the variance of K(X).

5.6. Given that f(x; θ) = exp[θK(x) + H(x) + q(θ)], a < x < b, γ < θ < δ, represents a regular case of the exponential class, show that the moment-generating function M(t) of Y = K(X) is M(t) = exp[q(θ) − q(θ + t)], γ < θ + t < δ.

5.7. In the preceding exercise, given that E(Y) = E[K(X)] = θ, prove that Y is N(θ, 1).
Hint: Consider M′(0) = θ and solve the resulting differential equation.

5.8. If X1, X2, . . . , Xn is a random sample from a distribution that has a pdf which is a regular case of the exponential class, show that the pdf of $Y_1 = \sum_{1}^{n} K(X_i)$ is of the form $f_{Y_1}(y_1;\theta) = R(y_1)\exp[p(\theta)y_1 + nq(\theta)]$.
Hint: Let Y2 = X2, . . . , Yn = Xn be n − 1 auxiliary random variables. Find the joint pdf of Y1, Y2, . . . , Yn and then the marginal pdf of Y1.

5.9. Let Y denote the median and let X̄ denote the mean of a random sample of size n = 2k + 1 from a distribution that is N(μ, σ²). Compute E(Y|X̄ = x̄).
Hint: See Exercise 5.4.

5.10. Let X1, X2, . . . , Xn be a random sample from a distribution with pdf $f(x;\theta) = \theta^2 x e^{-\theta x}$, 0 < x < ∞, where θ > 0.
(a) Argue that $Y = \sum_{1}^{n} X_i$ is a complete sufficient statistic for θ.
(b) Compute E(1/Y) and find the function of Y which is the unique MVUE of θ.

5.11. Let X1, X2, . . . , Xn, n > 2, be a random sample from the binomial distribution b(1, θ).


(a) Show that Y1 = X1 + X2 + · · · + Xn is a complete sufficient statistic for θ.
(b) Find the function ϕ(Y1) which is the MVUE of θ.
(c) Let Y2 = (X1 + X2)/2 and compute E(Y2).
(d) Determine E(Y2|Y1 = y1).

5.12. Let X1, X2, . . . , Xn be a random sample from a distribution with pmf $p(x;\theta) = \theta^{x}(1-\theta)$, x = 0, 1, 2, . . ., zero elsewhere, where 0 ≤ θ ≤ 1.
(a) Find the mle, θ̂, of θ.
(b) Show that $\sum_{1}^{n} X_i$ is a complete sufficient statistic for θ.
(c) Determine the MVUE of θ.

6

Functions of a Parameter

Up to this point we have sought an MVUE of a parameter θ. Not always, however, are we interested in θ but rather in a function of θ. There are several techniques we can use to find the MVUE. One is by inspection of the expected value of a sufficient statistic. This is how we found the MVUEs in Examples 5.2 and 5.3 of the last section. In this section and its exercises, we offer more examples of the inspection technique. The second technique is based on the conditional expectation of an unbiased estimate given a sufficient statistic. The third example illustrates this technique.

Recall that under regularity conditions we obtained the asymptotic distribution theory for maximum likelihood estimators (mles). This allows certain asymptotic inferences (confidence intervals and tests) for these estimators. Such a simple theory is not available for MVUEs. As Theorem 3.2 shows, though, sometimes we can determine the relationship between the mle and the MVUE. In these situations, we can often obtain the asymptotic distribution for the MVUE based on the asymptotic distribution of the mle. We illustrate this for some of the following examples.

Example 6.1. Let X1, X2, . . . , Xn denote the observations of a random sample of size n > 1 from a distribution that is b(1, θ), 0 < θ < 1. We know that if $Y = \sum_{1}^{n} X_i$, then Y/n is the unique minimum variance unbiased estimator of θ. Now suppose we want to estimate the variance of Y/n, which is θ(1 − θ)/n. Let δ = θ(1 − θ). Because Y is a sufficient statistic for θ, it is known that we can restrict our search to functions of Y. The maximum likelihood estimate of δ, which is given by δ̃ = (Y/n)(1 − Y/n), is a function of the sufficient statistic and seems to be a reasonable starting point. The expectation of this statistic is given by

$$E[\tilde{\delta}] = E\left[\frac{Y}{n}\left(1 - \frac{Y}{n}\right)\right] = \frac{1}{n}E(Y) - \frac{1}{n^2}E(Y^2).$$
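The expectation just displayed can be evaluated exactly for particular values of n and θ. The Python sketch below, which uses exact rational arithmetic, is an illustration only; the helper expect, the values n = 7 and θ = 3/10, and the rescaling factor n/(n − 1) tried on the mle are choices made for this example.

```python
from fractions import Fraction
from math import comb

def expect(h, n, theta):
    """Exact E[h(Y)] for Y ~ b(n, theta), with theta given as a Fraction."""
    return sum(h(y) * comb(n, y) * theta**y * (1 - theta)**(n - y)
               for y in range(n + 1))

n, theta = 7, Fraction(3, 10)
delta = theta * (1 - theta)          # the target theta (1 - theta)

# The mle (Y/n)(1 - Y/n) and the same statistic rescaled by n / (n - 1).
mle = lambda y: Fraction(y, n) * (1 - Fraction(y, n))
rescaled = lambda y: Fraction(n, n - 1) * mle(y)

print(expect(mle, n, theta), delta)        # expectation of the mle vs. the target
print(expect(rescaled, n, theta), delta)   # expectation after rescaling vs. the target
```

With these values the first expectation equals (n − 1)/n times δ and the second equals δ exactly, which suggests that rescaling the mle by n/(n − 1) removes its bias.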


Now E(Y) = nθ and E(Y²) = nθ(1 − θ) + n²θ². Hence

$$E\left[\frac{Y}{n}\left(1 - \frac{Y}{n}\right)\right] = (n-1)\,\frac{\theta(1-\theta)}{n}.$$

If we multiply both members of this equation by n/(n − 1), we find that the statistic δ̂ = (n/(n − 1))(Y/n)(1 − Y/n) = (n/(n − 1))δ̃ is the unique MVUE of δ. Hence the MVUE of δ/n, the variance of Y/n, is δ̂/n.

It is interesting to compare the mle δ̃ with δ̂. Recall that the mle δ̃ is a consistent estimate of δ and that √n(δ̃ − δ) is asymptotically normal. Because

$$\hat{\delta} - \tilde{\delta} = \tilde{\delta}\,\frac{1}{n-1} \xrightarrow{P} \delta \cdot 0 = 0,$$

it follows that δ̂ is also a consistent estimator of δ. Further,

$$\sqrt{n}(\hat{\delta} - \delta) - \sqrt{n}(\tilde{\delta} - \delta) = \frac{\sqrt{n}}{n-1}\,\tilde{\delta} \xrightarrow{P} 0. \tag{6.1}$$

Hence √n(δ̂ − δ) has the same asymptotic distribution as √n(δ̃ − δ). Using the Δ-method, we can obtain the asymptotic distribution of √n(δ̃ − δ). Let g(θ) = θ(1 − θ). Then g′(θ) = 1 − 2θ. Hence, by Theorem (6.1), √n(δ̃ − δ), and therefore √n(δ̂ − δ), has the asymptotic distribution given by

$$\sqrt{n}(\hat{\delta} - \delta) \xrightarrow{D} N(0,\ \theta(1-\theta)(1-2\theta)^2),$$

provided θ ≠ 1/2; see Exercise 6.11 for the case θ = 1/2.

In the next example, we consider the uniform (0, θ) distribution and obtain the MVUE for all differentiable functions of θ. This example was sent to us by Professor Bradford Crain of Portland State University.

Example 6.2. Suppose X1, X2, . . . , Xn are iid random variables with the common uniform (0, θ) distribution. Let Yn = max{X1, X2, . . . , Xn}. In Example 4.2, we showed that Yn is a complete and sufficient statistic of θ and the pdf of Yn is given by (4.1). Let g(θ) be any differentiable function of θ. Then the MVUE of g(θ) is the statistic u(Yn), which satisfies the equation

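$$g(\theta) = \int_0^{\theta} u(y)\,\frac{n y^{n-1}}{\theta^n}\,dy, \quad \theta > 0,$$

or equivalently,

$$g(\theta)\theta^n = \int_0^{\theta} u(y)\,n y^{n-1}\,dy, \quad \theta > 0.$$

Differentiating both sides of this equation with respect to θ, we obtain

$$n\theta^{n-1}g(\theta) + \theta^n g'(\theta) = u(\theta)\,n\theta^{n-1}.$$

Solving for u(θ), we obtain

$$u(\theta) = g(\theta) + \frac{\theta g'(\theta)}{n}.$$

Therefore, the MVUE of g(θ) is

$$u(Y_n) = g(Y_n) + \frac{Y_n}{n}\,g'(Y_n). \tag{6.2}$$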
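Formula (6.2) can be checked numerically for a particular g. The Python sketch below is an illustration under stated assumptions: the helper names mvue_uniform and expect_under_max are hypothetical, the expectation is approximated by a midpoint rule applied to the pdf (4.1), and g(θ) = θ² with n = 4 and θ = 2.5 is simply a convenient test case.

```python
def mvue_uniform(g, dg, n):
    """Return the function u(y) = g(y) + y * g'(y) / n of expression (6.2).

    g  : the function of theta to be estimated
    dg : its derivative, supplied by the caller
    """
    return lambda y: g(y) + y * dg(y) / n

def expect_under_max(u, theta, n, steps=20000):
    """Approximate E[u(Y_n)] for the maximum of n uniform(0, theta) draws,
    using a midpoint rule on the pdf n * y**(n-1) / theta**n over (0, theta)."""
    width = theta / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * width
        total += u(y) * n * y**(n - 1) / theta**n * width
    return total

n, theta = 4, 2.5
g = lambda t: t**2          # test case chosen for this sketch
dg = lambda t: 2.0 * t
u = mvue_uniform(g, dg, n)
print(expect_under_max(u, theta, n), g(theta))
```

The two printed values should agree to within the accuracy of the numerical integration, which is consistent with u(Yn) being unbiased for g(θ).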

For example, if g(θ) = θ, then

$$u(Y_n) = Y_n + \frac{Y_n}{n} = \frac{n+1}{n}\,Y_n,$$

which agrees with the result obtained in Example 4.2. Other examples are given in Exercise 6.4.

A somewhat different but also very important problem in point estimation is considered in the next example. In the example the distribution of a random variable X is described by a pdf f(x; θ) that depends upon θ ∈ Ω. The problem is to estimate the fractional part of the probability for this distribution, which is at, or to the left of, a fixed point c. Thus we seek an MVUE of F(c; θ), where F(x; θ) is the cdf of X.

Example 6.3. Let X1, X2, . . . , Xn be a random sample of size n > 1 from a distribution that is N(θ, 1). Suppose that we wish to find an MVUE of the function of θ defined by

$$P(X \le c) = \int_{-\infty}^{c} \frac{1}{\sqrt{2\pi}}\,e^{-(x-\theta)^2/2}\,dx = \Phi(c - \theta),$$

where c is a fixed constant. There are many unbiased estimators of Φ(c − θ). We first exhibit one of these, say u(X1), a function of X1 alone. We shall then compute the conditional expectation, E[u(X1)|X̄ = x̄] = ϕ(x̄), of this unbiased statistic, given the sufficient statistic X̄, the mean of the sample. In accordance with the theorems of Rao–Blackwell and Lehmann–Scheffé, ϕ(X̄) is the unique MVUE of Φ(c − θ).

Consider the function u(x1), where

$$u(x_1) = \begin{cases} 1 & x_1 \le c \\ 0 & x_1 > c. \end{cases}$$

The expected value of the random variable u(X1) is given by E[u(X1)] = 1 · P[X1 − θ ≤ c − θ] = Φ(c − θ). That is, u(X1) is an unbiased estimator of Φ(c − θ).

We shall next discuss the joint distribution of X1 and X̄ and the conditional distribution of X1, given X̄ = x̄. This conditional distribution enables us to compute the conditional expectation E[u(X1)|X̄ = x̄] = ϕ(x̄). In accordance with Exercise 6.7, the joint distribution of X1 and X̄ is bivariate normal with mean vector (θ, θ), variances $\sigma_1^2 = 1$ and $\sigma_2^2 = 1/n$, and correlation coefficient $\rho = 1/\sqrt{n}$. Thus the conditional pdf of X1, given X̄ = x̄, is normal with linear conditional mean

$$\theta + \frac{\rho\sigma_1}{\sigma_2}(\bar{x} - \theta) = \bar{x}$$

and with variance

$$\sigma_1^2(1 - \rho^2) = \frac{n-1}{n}.$$

The conditional expectation of u(X1), given X̄ = x̄, is then

$$\varphi(\bar{x}) = \int_{-\infty}^{\infty} u(x_1)\sqrt{\frac{n}{n-1}}\,\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{n(x_1-\bar{x})^2}{2(n-1)}\right]dx_1 = \int_{-\infty}^{c} \sqrt{\frac{n}{n-1}}\,\frac{1}{\sqrt{2\pi}}\exp\left[-\frac{n(x_1-\bar{x})^2}{2(n-1)}\right]dx_1.$$

The change of variable $z = \sqrt{n}(x_1 - \bar{x})/\sqrt{n-1}$ enables us to write this conditional expectation as

$$\varphi(\bar{x}) = \int_{-\infty}^{c'} \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}\,dz = \Phi(c') = \Phi\left[\frac{\sqrt{n}(c - \bar{x})}{\sqrt{n-1}}\right],$$

where $c' = \sqrt{n}(c - \bar{x})/\sqrt{n-1}$. Thus the unique MVUE of Φ(c − θ) is, for every fixed constant c, given by $\varphi(\overline{X}) = \Phi[\sqrt{n}(c - \overline{X})/\sqrt{n-1}]$. In this example the mle of Φ(c − θ) is Φ(c − X̄). These two estimators are close because $\sqrt{n/(n-1)} \to 1$, as n → ∞.
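The unbiasedness of the estimator found in Example 6.3 can be checked by numerical integration. The Python sketch below is an illustration under stated assumptions: it uses the fact that X̄ is N(θ, 1/n), approximates expectations by a midpoint rule over ten standard deviations on either side of θ, and the names Phi and expect_over_xbar and the values n = 5, θ = 0.3, c = 1 are choices made for this example.

```python
import math

def Phi(z):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expect_over_xbar(h, theta, n, steps=20000, span=10.0):
    """Approximate E[h(Xbar)] when Xbar ~ N(theta, 1/n), by a midpoint rule
    over theta +/- span standard deviations of Xbar."""
    sd = 1.0 / math.sqrt(n)
    lo, width = theta - span * sd, 2.0 * span * sd / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        density = math.exp(-0.5 * ((x - theta) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += h(x) * density * width
    return total

n, theta, c = 5, 0.3, 1.0
mvue = lambda x: Phi(math.sqrt(n) * (c - x) / math.sqrt(n - 1))  # estimator from Example 6.3
mle = lambda x: Phi(c - x)                                       # the mle, for comparison

target = Phi(c - theta)
print(target, expect_over_xbar(mvue, theta, n), expect_over_xbar(mle, theta, n))
```

The expectation of the first estimator should match Φ(c − θ) up to integration error, while the expectation of the mle differs slightly for small n, a gap that shrinks as n grows.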

Remark 6.1. We should like to draw the attention of the reader to a rather important fact. This has to do with the adoption of a principle, such as the principle of unbiasedness and minimum variance. A principle is not a theorem; and seldom does a principle yield satisfactory results in all cases. So far, this principle has provided quite satisfactory results. To see that this is not always the case, let X have a Poisson distribution with parameter θ, 0 < θ < ∞. We may look upon X as a random sample of size 1 from this distribution. Thus X is a complete sufficient statistic for θ. We seek the estimator of $e^{-2\theta}$ that is unbiased and has minimum variance. Consider $Y = (-1)^X$. We have

$$E(Y) = E[(-1)^X] = \sum_{x=0}^{\infty} \frac{(-\theta)^x e^{-\theta}}{x!} = e^{-2\theta}.$$

Accordingly, $(-1)^X$ is the MVUE of $e^{-2\theta}$. Here this estimator leaves much to be desired. We are endeavoring to elicit some information about the number $e^{-2\theta}$, where $0 < e^{-2\theta} < 1$; yet our point estimate is either −1 or +1, each of which is a very poor estimate of a number between 0 and 1. We do not wish to leave the reader with the impression that an MVUE is bad. That is not the case at all. We merely wish to point out that if one tries hard enough, one can find instances where such a statistic is not good. Incidentally, the maximum likelihood estimator of $e^{-2\theta}$ is, in the case where the sample size equals 1, $e^{-2X}$, which is probably a much better estimator in practice than is the unbiased estimator $(-1)^X$.
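The comparison made in Remark 6.1 can be quantified for a particular θ. The Python sketch below is illustrative only: the helper poisson_expect, the truncation of the Poisson series at 100 terms, the value θ = 1, and the use of mean squared error as the yardstick are assumptions made for this example.

```python
import math

def poisson_expect(h, theta, terms=100):
    """Approximate E[h(X)] for X ~ Poisson(theta) from the first `terms` pmf values."""
    total, pmf = 0.0, math.exp(-theta)      # pmf starts at P(X = 0)
    for x in range(terms):
        total += h(x) * pmf
        pmf *= theta / (x + 1)              # P(X = x + 1) obtained from P(X = x)
    return total

theta = 1.0
target = math.exp(-2.0 * theta)

unbiased = lambda x: (-1.0) ** x            # the MVUE discussed in Remark 6.1
mle = lambda x: math.exp(-2.0 * x)          # the mle for a sample of size 1

for name, est in [("(-1)^X", unbiased), ("exp(-2X)", mle)]:
    mean = poisson_expect(est, theta)
    mse = poisson_expect(lambda x: (est(x) - target) ** 2, theta)
    print(name, "mean:", mean, "mse:", mse)
```

At this value of θ the unbiased estimator has the correct mean but the larger mean squared error, in line with the remark that the mle is probably the better estimator in practice.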

EXERCISES

6.1. Let X1, X2, . . . , Xn denote a random sample from a distribution that is N(θ, 1), −∞ < θ < ∞. Find the MVUE of θ².
Hint: First determine $E(\overline{X}^2)$.

6.2. Let X1, X2, . . . , Xn denote a random sample from a distribution that is N(0, θ). Then $Y = \sum X_i^2$ is a complete sufficient statistic for θ. Find the MVUE of θ².

6.3. In the notation of Example 6.3 of this section, does P(−c ≤ X ≤ c) have an MVUE? Here c > 0.

6.4. Let X1, X2, . . . , Xn be a random sample from a uniform (0, θ) distribution. Continuing with Example 6.2, find the MVUEs for the following functions of θ.
(a) $g(\theta) = \theta^2/12$, i.e., the variance of the distribution.
(b) $g(\theta) = 1/\theta$, i.e., the pdf of the distribution.
(c) For t real, $g(\theta) = (e^{t\theta} - 1)/(t\theta)$, i.e., the mgf of the distribution.

6.5. Let X1, X2, . . . , Xn be a random sample from a Poisson distribution with parameter θ > 0.
(a) Find the MVUE of $P(X \le 1) = (1 + \theta)e^{-\theta}$.
Hint: Let u(x1) = 1, x1 ≤ 1, zero elsewhere, and find E[u(X1)|Y = y], where $Y = \sum_{1}^{n} X_i$.
(b) Express the MVUE as a function of the mle of θ.
(c) Determine the asymptotic distribution of the mle of θ.
(d) Obtain the mle of P(X ≤ 1). Then determine its asymptotic distribution.

6.6. Let X1, X2, . . . , Xn denote a random sample from a Poisson distribution with parameter θ > 0. From Remark 6.1, we know that $E[(-1)^{X_1}] = e^{-2\theta}$.
(a) Show that $E[(-1)^{X_1}|Y_1 = y_1] = (1 - 2/n)^{y_1}$, where Y1 = X1 + X2 + · · · + Xn.
Hint: First show that the conditional pdf of X1, X2, . . . , Xn−1, given Y1 = y1, is multinomial, and hence that of X1, given Y1 = y1, is b(y1, 1/n).
(b) Show that the mle of $e^{-2\theta}$ is $e^{-2\overline{X}}$.


(c) Since $y_1 = n\bar{x}$, show that $(1 - 2/n)^{y_1}$ is approximately equal to $e^{-2\bar{x}}$ when n is large.

6.7. As in Example 6.3, let X1, X2, . . . , Xn be a random sample of size n > 1 from a distribution that is N(θ, 1). Show that the joint distribution of X1 and X̄ is bivariate normal with mean vector (θ, θ), variances $\sigma_1^2 = 1$ and $\sigma_2^2 = 1/n$, and correlation coefficient $\rho = 1/\sqrt{n}$.

6.8. Let a random sample of size n be taken from a distribution that has the pdf $f(x;\theta) = (1/\theta)\exp(-x/\theta)I_{(0,\infty)}(x)$. Find the mle and MVUE of P(X ≤ 2).

6.9. Let X1, X2, . . . , Xn be a random sample with the common pdf $f(x) = \theta^{-1}e^{-x/\theta}$, for x > 0, zero elsewhere; that is, f(x) is a Γ(1, θ) pdf.
(a) Show that the statistic $\overline{X} = n^{-1}\sum_{i=1}^{n} X_i$ is a complete and sufficient statistic for θ.
(b) Determine the MVUE of θ.
(c) Determine the mle of θ.
(d) Often, though, this pdf is written as $f(x) = \tau e^{-\tau x}$, for x > 0, zero elsewhere. Thus τ = 1/θ. Determine the mle of τ.
(e) Show that the statistic $\overline{X} = n^{-1}\sum_{i=1}^{n} X_i$ is a complete and sufficient statistic for τ. Show that (n − 1)/(nX̄) is the MVUE of τ = 1/θ. Hence, as usual, the reciprocal of the mle of θ is the mle of 1/θ, but, in this situation, the reciprocal of the MVUE of θ is not the MVUE of 1/θ.
(f) Compute the variances of each of the unbiased estimators in parts (b) and (e).

6.10. Consider the situation of the last exercise, but suppose we have the following two independent random samples: (1) X1, X2, . . . , Xn is a random sample with the common pdf $f_X(x) = \theta^{-1}e^{-x/\theta}$, for x > 0, zero elsewhere, and (2) Y1, Y2, . . . , Yn is a random sample with common pdf $f_Y(y) = \tau e^{-\tau y}$, for y > 0, zero elsewhere. Assume that τ = 1/θ. The last exercise suggests that, for some constant c, Z = cX̄/Ȳ might be an unbiased estimator of θ². Find this constant c and the variance of Z.
Hint: Show that X̄/(θ²Ȳ) has an F-distribution.

6.11. Obtain the asymptotic distribution of the MVUE in Example 6.1 for the case θ = 1/2.

7

The Case of Several Parameters

In many of the interesting problems we encounter, the pdf or pmf may not depend upon a single parameter θ, but perhaps upon two (or more) parameters. In general, our parameter space Ω is a subset of $R^p$, but in many of our examples p is 2.


Definition 7.1. Let X1, X2, . . . , Xn denote a random sample from a distribution that has pdf or pmf f(x; θ), where θ ∈ Ω ⊂ $R^p$. Let S denote the support of X. Let Y be an m-dimensional random vector of statistics Y = (Y1, . . . , Ym)′, where Yi = ui(X1, X2, . . . , Xn), for i = 1, . . . , m. Denote the pdf or pmf of Y by fY(y; θ) for y ∈ $R^m$. The random vector of statistics Y is jointly sufficient for θ if and only if

$$\frac{\prod_{i=1}^{n} f(x_i;\theta)}{f_Y(y;\theta)} = H(x_1, x_2, \ldots, x_n), \quad \text{for all } x_i \in S,$$

where H(x1, x2, . . . , xn) does not depend upon θ.

In general, m ≠ p, i.e., the number of sufficient statistics does not have to be the same as the number of parameters, but in most of our examples this is the case.

As may be anticipated, the factorization theorem can be extended. In our notation it can be stated in the following manner. The vector of statistics Y is jointly sufficient for the parameter θ ∈ Ω if and only if we can find two nonnegative functions k1 and k2 such that

$$\prod_{i=1}^{n} f(x_i;\theta) = k_1(y;\theta)\,k_2(x_1, \ldots, x_n), \quad \text{for all } x_i \in S, \tag{7.1}$$

where the function k2(x1, x2, . . . , xn) does not depend upon θ.

Example 7.1. Let X1, X2, . . . , Xn be a random sample from a distribution having pdf

$$f(x;\theta_1,\theta_2) = \begin{cases} \dfrac{1}{2\theta_2} & \theta_1 - \theta_2 < x < \theta_1 + \theta_2 \\ 0 & \text{elsewhere,} \end{cases}$$

where −∞ < θ1 < ∞, 0 < θ2 < ∞. Let Y1 < Y2 < · · · < Yn be the order statistics. The joint pdf of Y1 and Yn is given by

$$f_{Y_1,Y_n}(y_1, y_n;\theta_1,\theta_2) = \frac{n(n-1)}{(2\theta_2)^n}(y_n - y_1)^{n-2}, \quad \theta_1 - \theta_2 < y_1 < y_n < \theta_1 + \theta_2,$$

and equals zero elsewhere. Accordingly, the joint pdf of X1, X2, . . . , Xn can be written, for all points in its support (all xi such that θ1 − θ2 < xi < θ1 + θ2),

$$\left(\frac{1}{2\theta_2}\right)^{n} = \frac{n(n-1)[\max(x_i) - \min(x_i)]^{n-2}}{(2\theta_2)^n} \cdot \frac{1}{n(n-1)[\max(x_i) - \min(x_i)]^{n-2}}.$$

Since min(xi) ≤ xj ≤ max(xi), j = 1, 2, . . . , n, the last factor does not depend upon the parameters. Either the definition or the factorization theorem assures us that Y1 and Yn are joint sufficient statistics for θ1 and θ2.

The concept of a complete family of probability density functions is generalized as follows: Let

{f(v1, v2, . . . , vk; θ) : θ ∈ Ω}


denote a family of pdfs of k random variables V1, V2, . . . , Vk that depends upon the p-dimensional vector of parameters θ ∈ Ω. Let u(v1, v2, . . . , vk) be a function of v1, v2, . . . , vk (but not a function of any or all of the parameters). If E[u(V1, V2, . . . , Vk)] = 0 for all θ ∈ Ω implies that u(v1, v2, . . . , vk) = 0 at all points (v1, v2, . . . , vk), except on a set of points that has probability zero for all members of the family of probability density functions, we shall say that the family of probability density functions is a complete family.

In the case where θ is a vector, we generally consider best estimators of functions of θ, that is, parameters δ, where δ = g(θ) for a specified function g. For example, suppose we are sampling from a N(θ1, θ2) distribution, where θ2 is the variance. Let θ = (θ1, θ2) and consider the two parameters δ1 = g1(θ) = θ1 and δ2 = g2(θ) = √θ2. Hence we are interested in best estimates of δ1 and δ2.

The Rao–Blackwell, Lehmann–Scheffé theory outlined in Sections 3 and 4 extends naturally to this vector case. Briefly, suppose δ = g(θ) is the parameter of interest and Y is a vector of sufficient and complete statistics for θ. Let T be a statistic which is a function of Y, such as T = T(Y). If E(T) = δ, then T is the unique MVUE of δ.

The remainder of our treatment of the case of several parameters is restricted to probability density functions that represent what we shall call regular cases of the exponential class. Here m = p.

Definition 7.2. Let X be a random variable with pdf or pmf f(x; θ), where the vector of parameters θ ∈ Ω ⊂ $R^m$. Let S denote the support of X. If X is continuous, assume that S = (a, b), where a or b may be −∞ or ∞, respectively. If X is discrete, assume that S = {a1, a2, . . .}. Suppose f(x; θ) is of the form

$$f(x;\theta) = \begin{cases} \exp\left[\sum_{j=1}^{m} p_j(\theta)K_j(x) + H(x) + q(\theta_1, \theta_2, \ldots, \theta_m)\right] & \text{for all } x \in S \\ 0 & \text{elsewhere.} \end{cases} \tag{7.2}$$

Then we say this pdf or pmf is a member of the exponential class. We say it is a regular case of the exponential family if, in addition,
1. the support does not depend on the vector of parameters θ,
2. the space Ω contains a nonempty, m-dimensional open rectangle,
3. the pj(θ), j = 1, . . . , m, are nontrivial, functionally independent, continuous functions of θ,
4. and, depending on whether X is continuous or discrete, one of the following holds, respectively:
(a) if X is a continuous random variable, then the m derivatives K′j(x), for j = 1, 2, . . . , m, are continuous for a < x < b and no one is a linear homogeneous function of the others, and H(x) is a continuous function of x, a < x < b.


(b) if X is discrete, the Kj(x), j = 1, 2, . . . , m, are nontrivial functions of x on the support S and no one is a linear homogeneous function of the others.

Let X1, . . . , Xn be a random sample on X where the pdf or pmf of X is a regular case of the exponential class with the same notation as in Definition 7.2. It follows from (7.2) that the joint pdf or pmf of the sample is given by

$$\prod_{i=1}^{n} f(x_i;\theta) = \exp\left[\sum_{j=1}^{m} p_j(\theta)\sum_{i=1}^{n} K_j(x_i) + nq(\theta)\right]\exp\left[\sum_{i=1}^{n} H(x_i)\right], \tag{7.3}$$

for all xi ∈ S. In accordance with the factorization theorem, the statistics

$$Y_1 = \sum_{i=1}^{n} K_1(x_i), \quad Y_2 = \sum_{i=1}^{n} K_2(x_i), \quad \ldots, \quad Y_m = \sum_{i=1}^{n} K_m(x_i)$$

are joint sufficient statistics for the m-dimensional vector of parameters θ. It is left as an exercise to prove that the joint pdf of Y = (Y1, . . . , Ym)′ is of the form

$$R(y)\exp\left[\sum_{j=1}^{m} p_j(\theta)y_j + nq(\theta)\right], \tag{7.4}$$

at points of positive probability density. These points of positive probability density and the function R(y) do not depend upon the vector of parameters θ. Moreover, in accordance with a theorem in analysis, it can be asserted that in a regular case of the exponential class, the family of probability density functions of these joint sufficient statistics Y1, Y2, . . . , Ym is complete when n > m. In accordance with a convention previously adopted, we shall refer to Y1, Y2, . . . , Ym as joint complete sufficient statistics for the vector of parameters θ.

Example 7.2. Let X1, X2, . . . , Xn denote a random sample from a distribution that is N(θ1, θ2), −∞ < θ1 < ∞, 0 < θ2 < ∞. Thus the pdf f(x; θ1, θ2) of the distribution may be written as

$$f(x;\theta_1,\theta_2) = \exp\left(\frac{-1}{2\theta_2}x^2 + \frac{\theta_1}{\theta_2}x - \frac{\theta_1^2}{2\theta_2} - \ln\sqrt{2\pi\theta_2}\right).$$

Therefore, we can take K1(x) = x² and K2(x) = x. Consequently, the statistics

$$Y_1 = \sum_{1}^{n} X_i^2 \quad \text{and} \quad Y_2 = \sum_{1}^{n} X_i$$

are joint complete sufficient statistics for θ1 and θ2. Since the relations

$$Z_1 = \frac{Y_2}{n} = \overline{X}, \quad Z_2 = \frac{Y_1 - Y_2^2/n}{n-1} = \frac{\sum (X_i - \overline{X})^2}{n-1}$$

418

Suﬃciency deﬁne a one-to-one transformation, Z1 and Z2 are also joint complete suﬃcient statistics for θ1 and θ2 . Moreover, E(Z1 ) = θ1

and

E(Z2 ) = θ2 .
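The transformation in Example 7.2 can be checked numerically. The following is a minimal sketch, not part of the original text: the function name `z_from_sufficient`, the seed, and the illustrative $N(3, 4)$ sample are all assumptions made for the illustration. It computes $(Z_1, Z_2)$ from $(Y_1, Y_2)$ alone and compares them with the usual sample mean and sample variance.

```python
import random
import statistics


def z_from_sufficient(xs):
    """Map the sufficient statistics (Y1, Y2) of Example 7.2 to (Z1, Z2)."""
    n = len(xs)
    y1 = sum(x * x for x in xs)          # Y1 = sum of X_i^2
    y2 = sum(xs)                         # Y2 = sum of X_i
    z1 = y2 / n                          # Z1 = Y2 / n
    z2 = (y1 - y2 ** 2 / n) / (n - 1)    # Z2 = (Y1 - Y2^2 / n) / (n - 1)
    return z1, z2


rng = random.Random(0)                               # arbitrary seed
sample = [rng.gauss(3.0, 2.0) for _ in range(50)]    # illustrative N(3, 4) draws
z1, z2 = z_from_sufficient(sample)

print(z1, statistics.mean(sample))       # the two values agree up to rounding
print(z2, statistics.variance(sample))   # the two values agree up to rounding
```

Only $Y_1$ and $Y_2$ enter the computation of $Z_1$ and $Z_2$, which is the sense in which the pair $(\bar{X}, S^2)$ carries the same information as $(Y_1, Y_2)$.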

From completeness, we have that $Z_1$ and $Z_2$ are the only functions of $Y_1$ and $Y_2$ that are unbiased estimators of $\theta_1$ and $\theta_2$, respectively. Hence $Z_1$ and $Z_2$ are the unique minimum variance unbiased estimators of $\theta_1$ and $\theta_2$, respectively. The MVUE of the standard deviation $\sqrt{\theta_2}$ is derived in Exercise 7.5.

In this section we have extended the concepts of sufficiency and completeness to the case where $\boldsymbol{\theta}$ is a $p$-dimensional vector. We now extend these concepts to the case where $\mathbf{X}$ is a $k$-dimensional random vector. We only consider the regular exponential class.

Suppose $\mathbf{X}$ is a $k$-dimensional random vector with pdf or pmf $f(\mathbf{x}; \boldsymbol{\theta})$, where $\boldsymbol{\theta} \in \Omega \subset R^p$. Let $S \subset R^k$ denote the support of $\mathbf{X}$. Suppose $f(\mathbf{x}; \boldsymbol{\theta})$ is of the form

$$f(\mathbf{x}; \boldsymbol{\theta}) = \begin{cases} \exp\left[\sum_{j=1}^{m} p_j(\boldsymbol{\theta}) K_j(\mathbf{x}) + H(\mathbf{x}) + q(\boldsymbol{\theta})\right] & \text{for all } \mathbf{x} \in S \\ 0 & \text{elsewhere.} \end{cases} \tag{7.5}$$

Then we say this pdf or pmf is a member of the exponential class. If, in addition, $p = m$, the support does not depend on the vector of parameters $\boldsymbol{\theta}$, and conditions similar to those of Definition 7.2 hold, then we say this pdf is a regular case of the exponential class.

Suppose that $\mathbf{X}_1, \ldots, \mathbf{X}_n$ constitute a random sample on $\mathbf{X}$. Then the statistics

$$Y_j = \sum_{i=1}^{n} K_j(\mathbf{X}_i), \quad \text{for } j = 1, \ldots, m, \tag{7.6}$$

are sufficient and complete statistics for $\boldsymbol{\theta}$. Let $\mathbf{Y} = (Y_1, \ldots, Y_m)'$. Suppose $\delta = g(\boldsymbol{\theta})$ is a parameter of interest. If $T = h(\mathbf{Y})$ for some function $h$ and $E(T) = \delta$, then $T$ is the unique minimum variance unbiased estimator of $\delta$.

Example 7.3 (Multinomial). Previously, we considered the mles of the multinomial distribution. In this example we determine the MVUEs of several of the parameters. Consider a random trial which can result in one, and only one, of $k$ outcomes or categories. Let $X_j$ be 1 or 0 depending on whether the $j$th outcome does or does not occur, for $j = 1, \ldots, k$. Suppose the probability that outcome $j$ occurs is $p_j$; hence, $\sum_{j=1}^{k} p_j = 1$. Let $\mathbf{X} = (X_1, \ldots, X_{k-1})'$ and $\mathbf{p} = (p_1, \ldots, p_{k-1})'$. The distribution of $\mathbf{X}$ is multinomial and can be found in expression (4.18), which can be reexpressed as

$$f(\mathbf{x}, \mathbf{p}) = \exp\left\{\sum_{j=1}^{k-1} \left(\log\left[\frac{p_j}{1 - \sum_{i \neq k} p_i}\right]\right) x_j + \log\left(1 - \sum_{i \neq k} p_i\right)\right\}.$$

Because this is a regular case of the exponential family, the following statistics, resulting from a random sample $\mathbf{X}_1, \ldots, \mathbf{X}_n$ from the distribution of $\mathbf{X}$, are jointly sufficient and complete for the parameters $\mathbf{p} = (p_1, \ldots, p_{k-1})'$:

$$Y_j = \sum_{i=1}^{n} X_{ij}, \quad \text{for } j = 1, \ldots, k-1.$$

Each random variable $X_{ij}$ is Bernoulli with parameter $p_j$ and the variables $X_{ij}$ are independent for $i = 1, \ldots, n$. Hence the variables $Y_j$ are binomial$(n, p_j)$ for $j = 1, \ldots, k$. Thus the MVUE of $p_j$ is the statistic $n^{-1} Y_j$.

Next, we shall find the MVUE of $p_j p_l$, for $j \neq l$. Exercise 7.8 shows that the mle of $p_j p_l$ is $n^{-2} Y_j Y_l$. Recall that the conditional distribution of $Y_j$, given $Y_l$, is $b[n - Y_l, p_j/(1 - p_l)]$. As an initial guess at the MVUE, consider this mle. We have

$$\begin{aligned}
E[n^{-2} Y_j Y_l] &= \frac{1}{n^2} E[E(Y_j Y_l \mid Y_l)] = \frac{1}{n^2} E[Y_l E(Y_j \mid Y_l)] \\
&= \frac{1}{n^2} E\left[Y_l (n - Y_l) \frac{p_j}{1 - p_l}\right] = \frac{1}{n^2} \frac{p_j}{1 - p_l} \left\{E[n Y_l] - E[Y_l^2]\right\} \\
&= \frac{1}{n^2} \frac{p_j}{1 - p_l} \left\{n^2 p_l - n p_l (1 - p_l) - n^2 p_l^2\right\} \\
&= \frac{1}{n^2} \frac{p_j}{1 - p_l} \, n p_l (n - 1)(1 - p_l) = \frac{n - 1}{n} p_j p_l.
\end{aligned}$$

Hence the MVUE of $p_j p_l$ is $\dfrac{1}{n(n-1)} Y_j Y_l$.
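The expectation just derived can be confirmed by brute force for small $n$. The sketch below is an illustration only: the helper `expected_product` and the values $n = 6$, $p_j = 0.2$, $p_l = 0.3$ are assumptions, not taken from the text. It enumerates the joint distribution of $(Y_j, Y_l)$, lumping all remaining categories into one cell, and computes $E(Y_j Y_l)$ exactly.

```python
from math import comb


def expected_product(n, pj, pl):
    """Exact E[Yj * Yl] for multinomial counts, by direct enumeration."""
    rest = 1.0 - pj - pl  # total probability of all remaining categories
    total = 0.0
    for yj in range(n + 1):
        for yl in range(n + 1 - yj):
            # multinomial probability of the counts (yj, yl, n - yj - yl)
            prob = (comb(n, yj) * comb(n - yj, yl)
                    * pj ** yj * pl ** yl * rest ** (n - yj - yl))
            total += yj * yl * prob
    return total


n, pj, pl = 6, 0.2, 0.3                      # illustrative values
e_prod = expected_product(n, pj, pl)
mle_mean = e_prod / n ** 2                   # mean of the mle n^{-2} Yj Yl
mvue_mean = e_prod / (n * (n - 1))           # mean of Yj Yl / [n(n - 1)]

print(mle_mean, (n - 1) / n * pj * pl)       # both equal (n - 1)/n * pj * pl
print(mvue_mean, pj * pl)                    # both equal pj * pl
```

The mle is biased downward by the factor $(n-1)/n$, and rescaling by $n/(n-1)$ removes the bias, in agreement with the algebra above.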

Example 7.4 (Multivariate Normal). Let $\mathbf{X}$ have the multivariate normal distribution $N_k(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ is a positive definite $k \times k$ matrix. In this case $\boldsymbol{\theta}$ is a $\{k + [k(k+1)/2]\}$-dimensional vector whose first $k$ components consist of the mean vector $\boldsymbol{\mu}$ and whose last $k(k+1)/2$ components consist of the componentwise variances $\sigma_i^2$ and the covariances $\sigma_{ij}$, for $j \geq i$. The density of $\mathbf{X}$ can be written as

$$f_{\mathbf{X}}(\mathbf{x}) = \exp\left\{-\frac{1}{2} \mathbf{x}' \boldsymbol{\Sigma}^{-1} \mathbf{x} + \boldsymbol{\mu}' \boldsymbol{\Sigma}^{-1} \mathbf{x} - \frac{1}{2} \boldsymbol{\mu}' \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu} - \frac{1}{2} \log |\boldsymbol{\Sigma}| - \frac{k}{2} \log 2\pi\right\}, \tag{7.7}$$

for $\mathbf{x} \in R^k$. Hence, by (7.5), the multivariate normal pdf is a regular case of the exponential class of distributions. We need only identify the functions $K(\mathbf{x})$. The second term in the exponent on the right side of (7.7) can be written as $(\boldsymbol{\mu}' \boldsymbol{\Sigma}^{-1}) \mathbf{x}$; hence, $K_1(\mathbf{x}) = \mathbf{x}$. The first term is easily seen to be a linear combination of the products $x_i x_j$, $i, j = 1, 2, \ldots, k$, which are the entries of the matrix $\mathbf{x}\mathbf{x}'$. Hence we can take $K_2(\mathbf{x}) = \mathbf{x}\mathbf{x}'$. Now, let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be a random sample on $\mathbf{X}$. Based on (7.7) then, a set of sufficient and complete statistics is given by

$$\mathbf{Y}_1 = \sum_{i=1}^{n} \mathbf{X}_i \quad \text{and} \quad \mathbf{Y}_2 = \sum_{i=1}^{n} \mathbf{X}_i \mathbf{X}_i'. \tag{7.8}$$

Note that $\mathbf{Y}_1$ is a vector of $k$ statistics and that $\mathbf{Y}_2$ is a $k \times k$ symmetric matrix. Because the matrix is symmetric, we can eliminate the bottom half [elements $(i, j)$ with $i > j$] of the matrix, which results in $\{k + [k(k+1)/2]\}$ complete sufficient statistics, i.e., as many complete sufficient statistics as there are parameters.

Based on marginal distributions, it is easy to show that $\bar{X}_j = n^{-1} \sum_{i=1}^{n} X_{ij}$ is the MVUE of $\mu_j$ and that $(n-1)^{-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2$ is the MVUE of $\sigma_j^2$. The MVUEs of the covariance parameters are obtained in Exercise 7.9.

For our last example, we consider a case where the set of parameters is the cdf.

Example 7.5. Let $X_1, X_2, \ldots, X_n$ be a random sample having the common continuous cdf $F(x)$. Let $Y_1 < Y_2 < \cdots < Y_n$ denote the corresponding order statistics. Note that given $Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n$, the conditional distribution of $X_1, X_2, \ldots, X_n$ is discrete with probability $\frac{1}{n!}$ on each of the $n!$ permutations of the vector $(y_1, y_2, \ldots, y_n)$ [because $F(x)$ is continuous, we can assume that each of the values $y_1, y_2, \ldots, y_n$ is distinct]. That is, the conditional distribution does not depend on $F(x)$. Hence, by the definition of sufficiency, the order statistics are sufficient for $F(x)$. Furthermore, while the proof is beyond the scope of this chapter, it can be shown that the order statistics are also complete; see page 72 of Lehmann and Casella (1998).

Let $T = T(x_1, x_2, \ldots, x_n)$ be any statistic which is symmetric in its arguments; i.e., $T(x_1, x_2, \ldots, x_n) = T(x_{i_1}, x_{i_2}, \ldots, x_{i_n})$ for any permutation $(x_{i_1}, x_{i_2}, \ldots, x_{i_n})$ of $(x_1, x_2, \ldots, x_n)$. Then $T$ is a function of the order statistics. This is useful in determining MVUEs for this situation; see Exercises 7.12 and 7.13.

EXERCISES

7.1. Let $Y_1 < Y_2 < Y_3$ be the order statistics of a random sample of size 3 from the distribution with pdf

$$f(x; \theta_1, \theta_2) = \begin{cases} \dfrac{1}{\theta_2} \exp\left(-\dfrac{x - \theta_1}{\theta_2}\right) & \theta_1 < x < \infty,\ -\infty < \theta_1 < \infty,\ 0 < \theta_2 < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

Find the joint pdf of $Z_1 = Y_1$, $Z_2 = Y_2$, and $Z_3 = Y_1 + Y_2 + Y_3$. The corresponding transformation maps the space $\{(y_1, y_2, y_3) : \theta_1 < y_1 < y_2 < y_3 < \infty\}$ onto the space $\{(z_1, z_2, z_3) : \theta_1 < z_1 < z_2 < (z_3 - z_1)/2 < \infty\}$. Show that $Z_1$ and $Z_3$ are joint sufficient statistics for $\theta_1$ and $\theta_2$.

7.2. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution that has a pdf of the form (7.2) of this section. Show that $Y_1 = \sum_{i=1}^{n} K_1(X_i), \ldots, Y_m = \sum_{i=1}^{n} K_m(X_i)$ have a joint pdf of the form (7.4) of this section.

7.3. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ denote a random sample of size $n$ from a bivariate normal distribution with means $\mu_1$ and $\mu_2$, positive variances $\sigma_1^2$ and $\sigma_2^2$, and correlation coefficient $\rho$. Show that $\sum_{1}^{n} X_i$, $\sum_{1}^{n} Y_i$, $\sum_{1}^{n} X_i^2$, $\sum_{1}^{n} Y_i^2$, and $\sum_{1}^{n} X_i Y_i$ are joint complete sufficient statistics for the five parameters. Are $\bar{X} = \sum_{1}^{n} X_i / n$, $\bar{Y} = \sum_{1}^{n} Y_i / n$, $S_1^2 = \sum_{1}^{n} (X_i - \bar{X})^2 / (n-1)$, $S_2^2 = \sum_{1}^{n} (Y_i - \bar{Y})^2 / (n-1)$, and $\sum_{1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) / [(n-1) S_1 S_2]$ also joint complete sufficient statistics for these parameters?

7.4. Let the pdf $f(x; \theta_1, \theta_2)$ be of the form

$$\exp[p_1(\theta_1, \theta_2) K_1(x) + p_2(\theta_1, \theta_2) K_2(x) + H(x) + q_1(\theta_1, \theta_2)], \quad a < x < b,$$

zero elsewhere. Suppose that $K_1'(x) = c K_2'(x)$. Show that $f(x; \theta_1, \theta_2)$ can be written in the form

$$\exp[p(\theta_1, \theta_2) K_2(x) + H(x) + q(\theta_1, \theta_2)], \quad a < x < b,$$

zero elsewhere. This is the reason why it is required that no one $K_j'(x)$ be a linear homogeneous function of the others, that is, so that the number of sufficient statistics equals the number of parameters.

7.5. In Example 7.2, find the MVUE of the standard deviation $\sqrt{\theta_2}$.

7.6. Let $X_1, X_2, \ldots, X_n$ be a random sample from the uniform distribution with pdf $f(x; \theta_1, \theta_2) = 1/(2\theta_2)$, $\theta_1 - \theta_2 < x < \theta_1 + \theta_2$, where $-\infty < \theta_1 < \infty$ and $\theta_2 > 0$, and the pdf is equal to zero elsewhere.
(a) Show that $Y_1 = \min(X_i)$ and $Y_n = \max(X_i)$, the joint sufficient statistics for $\theta_1$ and $\theta_2$, are complete.
(b) Find the MVUEs of $\theta_1$ and $\theta_2$.

7.7. Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta_1, \theta_2)$.
(a) If the constant $b$ is defined by the equation $P(X \leq b) = 0.90$, find the mle and the MVUE of $b$.
(b) If $c$ is a given constant, find the mle and the MVUE of $P(X \leq c)$.

7.8. In the notation of Example 7.3, show that the mle of $p_j p_l$ is $n^{-2} Y_j Y_l$.

7.9. Refer to Example 7.4 on sufficiency for the multivariate normal model.
(a) Determine the MVUE of the covariance parameters $\sigma_{ij}$.
(b) Let $g = \sum_{i=1}^{k} a_i \mu_i$, where $a_1, \ldots, a_k$ are specified constants. Find the MVUE for $g$.

7.10. In a personal communication, LeRoy Folks noted that the inverse Gaussian pdf

$$f(x; \theta_1, \theta_2) = \left(\frac{\theta_2}{2\pi x^3}\right)^{1/2} \exp\left[\frac{-\theta_2 (x - \theta_1)^2}{2\theta_1^2 x}\right], \quad 0 < x < \infty, \tag{7.9}$$

where $\theta_1 > 0$ and $\theta_2 > 0$, is often used to model lifetimes. Find the complete sufficient statistics for $(\theta_1, \theta_2)$ if $X_1, X_2, \ldots, X_n$ is a random sample from the distribution having this pdf.

7.11. Let $X_1, X_2, \ldots, X_n$ be a random sample from a $N(\theta_1, \theta_2)$ distribution.
(a) Show that $E[(X_1 - \theta_1)^4] = 3\theta_2^2$.
(b) Find the MVUE of $3\theta_2^2$.

7.12. Let $X_1, \ldots, X_n$ be a random sample from a distribution of the continuous type with cdf $F(x)$. Suppose the mean, $\mu = E(X_1)$, exists. Using Example 7.5, show that the sample mean, $\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$, is the MVUE of $\mu$.

7.13. Let $X_1, \ldots, X_n$ be a random sample from a distribution of the continuous type with cdf $F(x)$. Let $\theta = P(X_1 \leq a) = F(a)$, where $a$ is known. Show that the proportion $n^{-1} \#\{X_i \leq a\}$ is the MVUE of $\theta$.

8 Minimal Sufficiency and Ancillary Statistics

In the study of statistics, it is clear that we want to reduce the data contained in the entire sample as much as possible without losing relevant information about the important characteristics of the underlying distribution. That is, a large collection of numbers in the sample is not as meaningful as a few good summary statistics of those data. Sufficient statistics, if they exist, are valuable because we know that the statisticians with those summary measures have as much information as the statistician with the entire sample. Sometimes, however, there are several sets of joint sufficient statistics, and thus we would like to find the simplest one of these sets. For illustration, in a sense, the observations $X_1, X_2, \ldots, X_n$, $n > 2$, of a random sample from $N(\theta_1, \theta_2)$ could be thought of as joint sufficient statistics for $\theta_1$ and $\theta_2$. We know, however, that we can use $\bar{X}$ and $S^2$ as joint sufficient statistics for those parameters, which is a great simplification over using $X_1, X_2, \ldots, X_n$, particularly if $n$ is large.

In most instances in this chapter, we have been able to find a single sufficient statistic for one parameter or two joint sufficient statistics for two parameters. Possibly the most complicated cases considered so far are the multinomial distribution of Example 7.3; the multivariate normal distribution of Example 7.4, in which we find $k + k(k+1)/2$ joint sufficient statistics for $k + k(k+1)/2$ parameters; and the use of the order statistics of a random sample for some completely unknown distribution of the continuous type, as in Example 7.5.

What we would like to do is to change from one set of joint sufficient statistics to another, always reducing the number of statistics involved until we cannot go any further without losing the sufficiency of the resulting statistics. Those statistics that are there at the end of this reduction are called minimal sufficient statistics. These are sufficient for the parameters and are functions of every other set of sufficient statistics for those same parameters. Often, if there are $k$ parameters, we can find $k$ joint sufficient statistics that are minimal. In particular, if there is one parameter, we can often find a single sufficient statistic which is minimal. Most of the earlier examples that we have considered illustrate this point, but this is not always the case, as shown by the following example.

Example 8.1. Let $X_1, X_2, \ldots, X_n$ be a random sample from the uniform distribution over the interval $(\theta - 1, \theta + 1)$ having pdf

$$f(x; \theta) = \left(\tfrac{1}{2}\right) I_{(\theta - 1, \theta + 1)}(x), \quad \text{where } -\infty < \theta < \infty.$$

The joint pdf of $X_1, X_2, \ldots, X_n$ equals the product of $\left(\tfrac{1}{2}\right)^n$ and certain indicator functions, namely,

$$\left(\tfrac{1}{2}\right)^n \prod_{i=1}^{n} I_{(\theta - 1, \theta + 1)}(x_i) = \left(\tfrac{1}{2}\right)^n \{I_{(\theta - 1, \theta + 1)}[\min(x_i)]\}\{I_{(\theta - 1, \theta + 1)}[\max(x_i)]\},$$

because $\theta - 1 < \min(x_i) \leq x_j \leq \max(x_i) < \theta + 1$, $j = 1, 2, \ldots, n$. Thus the order statistics $Y_1 = \min(X_i)$ and $Y_n = \max(X_i)$ are the sufficient statistics for $\theta$. These two statistics actually are minimal for this one parameter, as we cannot reduce the number of them to less than two and still have sufficiency.

There is an observation that helps us see that almost all the sufficient statistics that we have studied thus far are minimal. We have noted that the mle $\hat{\theta}$ of $\theta$ is a function of one or more sufficient statistics, when the latter exist. Suppose that this mle $\hat{\theta}$ is also sufficient. Since this sufficient statistic $\hat{\theta}$ is a function of the other sufficient statistics, by Theorem 3.2, it must be minimal. For example, we have

1. The mle $\hat{\theta} = \bar{X}$ of $\theta$ in $N(\theta, \sigma^2)$, $\sigma^2$ known, is a minimal sufficient statistic for $\theta$.
2. The mle $\hat{\theta} = \bar{X}$ of $\theta$ in a Poisson distribution with mean $\theta$ is a minimal sufficient statistic for $\theta$.
3. The mle $\hat{\theta} = Y_n = \max(X_i)$ of $\theta$ in the uniform distribution over $(0, \theta)$ is a minimal sufficient statistic for $\theta$.
4. The maximum likelihood estimators $\hat{\theta}_1 = \bar{X}$ and $\hat{\theta}_2 = [(n-1)/n] S^2$ of $\theta_1$ and $\theta_2$ in $N(\theta_1, \theta_2)$ are joint minimal sufficient statistics for $\theta_1$ and $\theta_2$.

From these examples we see that the minimal sufficient statistics do not need to be unique, for any one-to-one transformation of them also provides minimal sufficient statistics. The linkage between minimal sufficient statistics and the mle, however, does not hold in many interesting instances. We illustrate this in the next two examples.

Example 8.2. Consider the model given in Example 8.1. There we noted that $Y_1 = \min(X_i)$ and $Y_n = \max(X_i)$ are joint sufficient statistics. Also, we have

$$\theta - 1 < Y_1 < Y_n < \theta + 1$$

or, equivalently,

$$Y_n - 1 < \theta < Y_1 + 1.$$

Hence, to maximize the likelihood function so that it equals $\left(\tfrac{1}{2}\right)^n$, $\theta$ can be any value between $Y_n - 1$ and $Y_1 + 1$. For example, many statisticians take the mle to be the mean of these two endpoints, namely,

$$\hat{\theta} = \frac{Y_n - 1 + Y_1 + 1}{2} = \frac{Y_1 + Y_n}{2},$$

which is the midrange. We recognize, however, that this mle is not unique. Some might argue that since $\hat{\theta}$ is an mle of $\theta$ and since it is a function of the joint sufficient statistics, $Y_1$ and $Y_n$, for $\theta$, it is a minimal sufficient statistic. This is not the case at all, for $\hat{\theta}$ is not even sufficient. Note that the mle must itself be a sufficient statistic for the parameter before it can be considered the minimal sufficient statistic.

Note that we can model the situation in the last example by

$$X_i = \theta + W_i, \tag{8.1}$$

where $W_1, W_2, \ldots, W_n$ are iid with the common uniform$(-1, 1)$ pdf. Hence this is an example of a location model. We discuss these models in general next.

Example 8.3. Consider a location model given by

$$X_i = \theta + W_i, \tag{8.2}$$

where $W_1, W_2, \ldots, W_n$ are iid with the common pdf $f(w)$ and common continuous cdf $F(w)$. From Example 7.5, we know that the order statistics $Y_1 < Y_2 < \cdots < Y_n$ are a set of complete and sufficient statistics for this situation. Can we obtain a smaller set of minimal sufficient statistics? Consider the following four situations:

(a) Suppose $f(w)$ is the $N(0, 1)$ pdf. Then we know that $\bar{X}$ is both the MVUE and mle of $\theta$. Also, $\bar{X} = n^{-1} \sum_{i=1}^{n} Y_i$, i.e., a function of the order statistics. Hence $\bar{X}$ is minimal sufficient.
(b) Suppose $f(w) = \exp\{-w\}$, for $w > 0$, zero elsewhere. Then the statistic $Y_1$ is a sufficient statistic as well as the mle, and thus is minimal sufficient.
(c) Suppose $f(w)$ is the logistic pdf. The mle of $\theta$ exists and it is easy to compute. As shown on page 38 of Lehmann and Casella (1998), though, the order statistics are minimal sufficient for this situation. That is, no reduction is possible.
(d) Suppose $f(w)$ is the Laplace pdf. We can show that the median, $Q_2$, is the mle of $\theta$, but it is not a sufficient statistic. Further, similar to the logistic pdf, it can be shown that the order statistics are minimal sufficient for this situation.

In general, the situation described in parts (c) and (d), where the mle is obtained rather easily while the set of minimal sufficient statistics is the set of order statistics and no reduction is possible, is the norm for location models.

There is also a relationship between a minimal sufficient statistic and completeness that is explained more fully in Lehmann and Scheffé (1950). Let us say simply and without explanation that for the cases in this chapter, complete sufficient statistics are minimal sufficient statistics. The converse is not true, however, as we see by noting that in Example 8.1, we have

$$E\left[\frac{Y_n - Y_1}{2} - \frac{n - 1}{n + 1}\right] = 0, \quad \text{for all } \theta.$$
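A small simulation can make this zero-expectation identity concrete. The sketch below is an assumption-laden illustration rather than part of the text: the function `mean_of_g`, the seed, the sample size $n = 5$, and the three values of $\theta$ are arbitrary choices. It estimates $E[(Y_n - Y_1)/2 - (n-1)/(n+1)]$ by Monte Carlo for uniform$(\theta - 1, \theta + 1)$ samples.

```python
import random


def mean_of_g(theta, n, reps, rng):
    """Monte Carlo estimate of E[(Yn - Y1)/2 - (n - 1)/(n + 1)]."""
    total = 0.0
    for _ in range(reps):
        xs = [rng.uniform(theta - 1.0, theta + 1.0) for _ in range(n)]
        total += (max(xs) - min(xs)) / 2.0 - (n - 1) / (n + 1)
    return total / reps


rng = random.Random(12345)  # arbitrary seed
estimates = {theta: mean_of_g(theta, n=5, reps=20000, rng=rng)
             for theta in (-3.0, 0.0, 7.5)}

for theta, est in estimates.items():
    print(theta, est)  # each estimate should be close to zero
```

Each estimate is near zero whatever the value of $\theta$, although the function of $(Y_1, Y_n)$ being averaged is not identically zero.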

That is, there is a nonzero function of those minimal sufficient statistics, $Y_1$ and $Y_n$, whose expectation is zero for all $\theta$.

There are other statistics that almost seem opposites of sufficient statistics. That is, while sufficient statistics contain all the information about the parameters, these other statistics, called ancillary statistics, have distributions free of the parameters and seemingly contain no information about those parameters. As an illustration, we know that the variance $S^2$ of a random sample from $N(\theta, 1)$ has a distribution that does not depend upon $\theta$ and hence is an ancillary statistic. Another example is the ratio $Z = X_1/(X_1 + X_2)$, where $X_1, X_2$ is a random sample from a gamma distribution with known parameter $\alpha > 0$ and unknown parameter $\beta = \theta$, because $Z$ has a beta distribution that is free of $\theta$. There are many examples of ancillary statistics, and we provide some rules that make them rather easy to find with certain models, which we present in the next three examples.

Example 8.4 (Location-Invariant Statistics). In Example 8.3, we introduced the location model. Recall that a random sample $X_1, X_2, \ldots, X_n$ follows this model if

$$X_i = \theta + W_i, \quad i = 1, \ldots, n, \tag{8.3}$$

where $-\infty < \theta < \infty$ is a parameter and $W_1, W_2, \ldots, W_n$ are iid random variables with the pdf $f(w)$, which does not depend on $\theta$. Then the common pdf of $X_i$ is $f(x - \theta)$. Let $Z = u(X_1, X_2, \ldots, X_n)$ be a statistic such that

$$u(x_1 + d, x_2 + d, \ldots, x_n + d) = u(x_1, x_2, \ldots, x_n),$$

for all real $d$. Hence

$$Z = u(W_1 + \theta, W_2 + \theta, \ldots, W_n + \theta) = u(W_1, W_2, \ldots, W_n)$$

is a function of $W_1, W_2, \ldots, W_n$ alone (not of $\theta$). Hence $Z$ must have a distribution that does not depend upon $\theta$. We call $Z = u(X_1, X_2, \ldots, X_n)$ a location-invariant statistic.

Assuming a location model, the following are some examples of location-invariant statistics: the sample variance $= S^2$, the sample range $= \max\{X_i\} - \min\{X_i\}$, the mean deviation from the sample median $= (1/n) \sum |X_i - \text{median}(X_i)|$, $X_1 + X_2 - X_3 - X_4$, $X_1 + X_3 - 2X_2$, $(1/n) \sum [X_i - \min(X_i)]$, and so on. To see that the range is location-invariant, note that

$$\begin{aligned}
\max\{X_i\} - \theta &= \max\{X_i - \theta\} = \max\{W_i\} \\
\min\{X_i\} - \theta &= \min\{X_i - \theta\} = \min\{W_i\}.
\end{aligned}$$

So, range $= \max\{X_i\} - \min\{X_i\} = \max\{X_i\} - \theta - (\min\{X_i\} - \theta) = \max\{W_i\} - \min\{W_i\}$. Hence the distribution of the range only depends on the distribution of the $W_i$s and, thus, it is location-invariant. For the location invariance of other statistics, see Exercise 8.4.
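The defining property $u(x_1 + d, \ldots, x_n + d) = u(x_1, \ldots, x_n)$ is easy to probe numerically. The following sketch is illustrative only: the helper names, the five data values, and the shifts $d$ are assumptions, and a finite check of this kind suggests invariance but does not prove it. It tests several of the statistics listed in Example 8.4 and, for contrast, the sample mean, which is not location-invariant.

```python
import statistics


def sample_range(xs):
    return max(xs) - min(xs)


def mean_abs_dev_from_median(xs):
    med = statistics.median(xs)
    return sum(abs(x - med) for x in xs) / len(xs)


def mean_excess_over_min(xs):
    lo = min(xs)
    return sum(x - lo for x in xs) / len(xs)


def is_location_invariant(u, xs, shifts, tol=1e-9):
    """Check u(x1 + d, ..., xn + d) == u(x1, ..., xn) for each d in shifts."""
    base = u(xs)
    return all(abs(u([x + d for x in xs]) - base) <= tol for d in shifts)


data = [2.1, -0.4, 3.3, 0.8, 1.7]   # illustrative sample
shifts = [-10.0, 0.5, 42.0]         # illustrative shifts d

invariant = {
    "variance": is_location_invariant(statistics.variance, data, shifts),
    "range": is_location_invariant(sample_range, data, shifts),
    "mad_median": is_location_invariant(mean_abs_dev_from_median, data, shifts),
    "mean_excess": is_location_invariant(mean_excess_over_min, data, shifts),
    "mean": is_location_invariant(statistics.mean, data, shifts),
}
print(invariant)
```

The first four entries come out `True` and the last `False`: shifting the data moves the sample mean by $d$ but leaves the spread-type statistics unchanged.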

Example 8.5 (Scale-Invariant Statistics). Consider a random sample $X_1, \ldots, X_n$ which follows a scale model, i.e., a model of the form

$$X_i = \theta W_i, \quad i = 1, \ldots, n, \tag{8.4}$$

where $\theta > 0$ and $W_1, W_2, \ldots, W_n$ are iid random variables with pdf $f(w)$, which does not depend on $\theta$. Then the common pdf of $X_i$ is $\theta^{-1} f(x/\theta)$. We call $\theta$ a scale parameter. Suppose that $Z = u(X_1, X_2, \ldots, X_n)$ is a statistic such that

$$u(cx_1, cx_2, \ldots, cx_n) = u(x_1, x_2, \ldots, x_n)$$

for all $c > 0$. Then

$$Z = u(X_1, X_2, \ldots, X_n) = u(\theta W_1, \theta W_2, \ldots, \theta W_n) = u(W_1, W_2, \ldots, W_n).$$

Since neither the joint pdf of $W_1, W_2, \ldots, W_n$ nor $Z$ contains $\theta$, the distribution of $Z$ must not depend upon $\theta$. We say that $Z$ is a scale-invariant statistic.

The following are some examples of scale-invariant statistics: $X_1/(X_1 + X_2)$, $X_1^2 / \sum_{1}^{n} X_i^2$, $\min(X_i)/\max(X_i)$, and so on. The scale invariance of the first statistic follows from

$$\frac{X_1}{X_1 + X_2} = \frac{(\theta W_1)/\theta}{[(\theta W_1) + (\theta W_2)]/\theta} = \frac{W_1}{W_1 + W_2}.$$

The scale invariance of the other statistics is asked for in Exercise 8.5.

Example 8.6 (Location- and Scale-Invariant Statistics). Finally, consider a random sample $X_1, X_2, \ldots, X_n$ which follows a location and scale model as in Example 7.5. That is,

$$X_i = \theta_1 + \theta_2 W_i, \quad i = 1, \ldots, n, \tag{8.5}$$

where $W_i$ are iid with the common pdf $f(t)$ which is free of $\theta_1$ and $\theta_2$. In this case, the pdf of $X_i$ is $\theta_2^{-1} f((x - \theta_1)/\theta_2)$. Consider the statistic $Z = u(X_1, X_2, \ldots, X_n)$, where

$$u(cx_1 + d, \ldots, cx_n + d) = u(x_1, \ldots, x_n).$$

Then

$$Z = u(X_1, \ldots, X_n) = u(\theta_1 + \theta_2 W_1, \ldots, \theta_1 + \theta_2 W_n) = u(W_1, \ldots, W_n).$$

Since neither the joint pdf of $W_1, \ldots, W_n$ nor $Z$ contains $\theta_1$ and $\theta_2$, the distribution of $Z$ must not depend upon $\theta_1$ nor $\theta_2$. Statistics such as $Z = u(X_1, X_2, \ldots, X_n)$ are called location- and scale-invariant statistics. The following are four examples of such statistics:

(a) $T_1 = [\max(X_i) - \min(X_i)]/S$;
(b) $T_2 = \sum_{i=1}^{n-1} (X_{i+1} - X_i)^2 / S^2$;
(c) $T_3 = (X_i - \bar{X})/S$;
(d) $T_4 = |X_i - X_j|/S$, $i \neq j$.

Let $\bar{X} - \theta_1 = n^{-1} \sum_{i=1}^{n} (X_i - \theta_1)$. Then the location and scale invariance of the statistic in (d) follows from the two identities

$$\begin{aligned}
(n-1) S^2 &= \theta_2^2 \sum_{i=1}^{n} \left(\frac{X_i - \theta_1}{\theta_2} - \frac{\bar{X} - \theta_1}{\theta_2}\right)^2 = \theta_2^2 \sum_{i=1}^{n} (W_i - \bar{W})^2 \\
X_i - X_j &= \theta_2 \left(\frac{X_i - \theta_1}{\theta_2} - \frac{X_j - \theta_1}{\theta_2}\right) = \theta_2 (W_i - W_j).
\end{aligned}$$

See Exercise 8.6 for the other statistics.

Thus, these location-invariant, scale-invariant, and location- and scale-invariant statistics provide good illustrations, with the appropriate model for the pdf, of ancillary statistics. Since an ancillary statistic and a complete (minimal) sufficient statistic are such opposites, we might believe that there is, in some sense, no relationship between the two. This is true, and in the next section we show that they are independent statistics.

EXERCISES

8.1. Let $X_1, X_2, \ldots, X_n$ be a random sample from each of the following distributions involving the parameter $\theta$. In each case find the mle of $\theta$ and show that it is a sufficient statistic for $\theta$ and hence a minimal sufficient statistic.
(a) $b(1, \theta)$, where $0 \leq \theta \leq 1$.
(b) Poisson with mean $\theta > 0$.
(c) Gamma with $\alpha = 3$ and $\beta = \theta > 0$.
(d) $N(\theta, 1)$, where $-\infty < \theta < \infty$.
(e) $N(0, \theta)$, where $0 < \theta < \infty$.

8.2. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size $n$ from the uniform distribution over the closed interval $[-\theta, \theta]$ having pdf $f(x; \theta) = [1/(2\theta)] I_{[-\theta, \theta]}(x)$.
(a) Show that $Y_1$ and $Y_n$ are joint sufficient statistics for $\theta$.
(b) Argue that the mle of $\theta$ is $\hat{\theta} = \max(-Y_1, Y_n)$.
(c) Demonstrate that the mle $\hat{\theta}$ is a sufficient statistic for $\theta$ and thus is a minimal sufficient statistic for $\theta$.

8.3. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size $n$ from a distribution with pdf

$$f(x; \theta_1, \theta_2) = \frac{1}{\theta_2} e^{-(x - \theta_1)/\theta_2} I_{(\theta_1, \infty)}(x),$$

where $-\infty < \theta_1 < \infty$ and $0 < \theta_2 < \infty$. Find the joint minimal sufficient statistics for $\theta_1$ and $\theta_2$.

8.4. Continuing with Example 8.4, show that the following statistics are location-invariant:
(a) The sample variance $= S^2$.
(b) The mean deviation from the sample median $= (1/n) \sum |X_i - \text{median}(X_i)|$.
(c) $(1/n) \sum [X_i - \min(X_i)]$.

8.5. In Example 8.5, a scale model was presented and scale invariance was defined. Using the notation of this example, show that the following statistics are scale-invariant:
(a) $X_1^2 / \sum_{1}^{n} X_i^2$.
(b) $\min\{X_i\} / \max\{X_i\}$.

8.6. Obtain the location and scale invariance of the other statistics listed in Example 8.6, i.e., the statistics
(a) $T_1 = [\max(X_i) - \min(X_i)]/S$.
(b) $T_2 = \sum_{i=1}^{n-1} (X_{i+1} - X_i)^2 / S^2$.
(c) $T_3 = (X_i - \bar{X})/S$.

8.7. With random samples from each of the distributions given in Exercises 8.1(d), 8.2, and 8.3, define at least two ancillary statistics that are different from the examples given in the text. These examples illustrate, respectively, location-invariant, scale-invariant, and location- and scale-invariant statistics.

9 Sufficiency, Completeness, and Independence

We have noted that if we have a sufficient statistic $Y_1$ for a parameter $\theta$, $\theta \in \Omega$, then $h(z|y_1)$, the conditional pdf of another statistic $Z$, given $Y_1 = y_1$, does not depend upon $\theta$. If, moreover, $Y_1$ and $Z$ are independent, the pdf $g_2(z)$ of $Z$ is such that $g_2(z) = h(z|y_1)$, and hence $g_2(z)$ must not depend upon $\theta$ either. So the independence of a statistic $Z$ and the sufficient statistic $Y_1$ for a parameter $\theta$ means that the distribution of $Z$ does not depend upon $\theta \in \Omega$. That is, $Z$ is an ancillary statistic.

It is interesting to investigate a converse of that property. Suppose that the distribution of an ancillary statistic $Z$ does not depend upon $\theta$; then are $Z$ and the sufficient statistic $Y_1$ for $\theta$ independent? To begin our search for the answer, we know that the joint pdf of $Y_1$ and $Z$ is $g_1(y_1; \theta) h(z|y_1)$, where $g_1(y_1; \theta)$ and $h(z|y_1)$ represent the marginal pdf of $Y_1$ and the conditional pdf of $Z$ given $Y_1 = y_1$, respectively. Thus the marginal pdf of $Z$ is

$$\int_{-\infty}^{\infty} g_1(y_1; \theta) h(z|y_1) \, dy_1 = g_2(z),$$

which, by hypothesis, does not depend upon $\theta$. Because

$$\int_{-\infty}^{\infty} g_2(z) g_1(y_1; \theta) \, dy_1 = g_2(z),$$

it follows, by taking the difference of the last two integrals, that

$$\int_{-\infty}^{\infty} [g_2(z) - h(z|y_1)] g_1(y_1; \theta) \, dy_1 = 0 \tag{9.1}$$

for all $\theta \in \Omega$. Since $Y_1$ is a sufficient statistic for $\theta$, $h(z|y_1)$ does not depend upon $\theta$. By assumption, $g_2(z)$ and hence $g_2(z) - h(z|y_1)$ do not depend upon $\theta$. Now if the family $\{g_1(y_1; \theta) : \theta \in \Omega\}$ is complete, Equation (9.1) would require that

$$g_2(z) - h(z|y_1) = 0 \quad \text{or} \quad g_2(z) = h(z|y_1).$$

That is, the joint pdf of $Y_1$ and $Z$ must be equal to

$$g_1(y_1; \theta) h(z|y_1) = g_1(y_1; \theta) g_2(z).$$

Accordingly, $Y_1$ and $Z$ are independent, and we have proved the following theorem, which was considered in special cases by Neyman and Hogg and proved in general by Basu.

Theorem 9.1. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution having a pdf $f(x; \theta)$, $\theta \in \Omega$, where $\Omega$ is an interval set. Suppose that the statistic $Y_1$ is a complete and sufficient statistic for $\theta$. Let $Z = u(X_1, X_2, \ldots, X_n)$ be any other statistic (not a function of $Y_1$ alone). If the distribution of $Z$ does not depend upon $\theta$, then $Z$ is independent of the sufficient statistic $Y_1$.

In the discussion above, it is interesting to observe that if $Y_1$ is a sufficient statistic for $\theta$, then the independence of $Y_1$ and $Z$ implies that the distribution of $Z$ does not depend upon $\theta$ whether $\{g_1(y_1; \theta) : \theta \in \Omega\}$ is or is not complete. Conversely, to prove the independence from the fact that $g_2(z)$ does not depend upon $\theta$, we definitely need the completeness. Accordingly, if we are dealing with situations in which we know that the family $\{g_1(y_1; \theta) : \theta \in \Omega\}$ is complete (such as a regular case of the exponential class), we can say that the statistic $Z$ is independent of the sufficient statistic $Y_1$ if and only if the distribution of $Z$ does not depend upon $\theta$ (i.e., $Z$ is an ancillary statistic).

It should be remarked that the theorem (including the special formulation of it for regular cases of the exponential class) extends immediately to probability density functions that involve $m$ parameters for which there exist $m$ joint sufficient statistics. For example, let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution having the pdf $f(x; \theta_1, \theta_2)$ that represents a regular case of the exponential class so that there are two joint complete sufficient statistics for $\theta_1$ and $\theta_2$. Then any other statistic $Z = u(X_1, X_2, \ldots, X_n)$ is independent of the joint complete sufficient statistics if and only if the distribution of $Z$ does not depend upon $\theta_1$ or $\theta_2$.

We present an example of the theorem that provides an alternative proof of the independence of $\bar{X}$ and $S^2$, the mean and the variance of a random sample of size $n$ from a distribution that is $N(\mu, \sigma^2)$. This proof is given as if we were unaware that $(n-1)S^2/\sigma^2$ is $\chi^2(n-1)$, because that fact and the independence were established.

Example 9.1. Let $X_1, X_2, \ldots, X_n$ denote a random sample of size $n$ from a distribution that is $N(\mu, \sigma^2)$. We know that the mean $\bar{X}$ of the sample is, for every known $\sigma^2$, a complete sufficient statistic for the parameter $\mu$, $-\infty < \mu < \infty$. Consider the statistic

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2,$$

which is location-invariant. Thus $S^2$ must have a distribution that does not depend upon $\mu$; and hence, by the theorem, $S^2$ and $\bar{X}$, the complete sufficient statistic for $\mu$, are independent.

Example 9.2. Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the distribution having pdf

$$f(x; \theta) = \begin{cases} e^{-(x - \theta)} & \theta < x < \infty,\ -\infty < \theta < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

Here the pdf is of the form $f(x - \theta)$, where $f(w) = e^{-w}$, $0 < w < \infty$, zero elsewhere. Moreover, we know (Exercise 4.5) that the first order statistic $Y_1 = \min(X_i)$ is a complete sufficient statistic for $\theta$. Hence $Y_1$ must be independent of each location-invariant statistic $u(X_1, X_2, \ldots, X_n)$, enjoying the property that

$$u(x_1 + d, x_2 + d, \ldots, x_n + d) = u(x_1, x_2, \ldots, x_n)$$

for all real $d$. Illustrations of such statistics are $S^2$, the sample range, and

$$\frac{1}{n} \sum_{i=1}^{n} [X_i - \min(X_i)].$$
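The independence asserted in Example 9.2 can be explored by simulation. The sketch below rests on stated assumptions: the helper `correlation`, the seed, and the choices $\theta = 1.5$, $n = 5$ are arbitrary, and a sample correlation near zero is only consistent with independence, not a proof of it. It compares the correlation of $Y_1$ with the sample range (location-invariant) against the correlation of $Y_1$ with $Y_n$ (not location-invariant).

```python
import random


def correlation(a, b):
    """Pearson sample correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


rng = random.Random(2024)            # arbitrary seed
theta, n, reps = 1.5, 5, 20000       # illustrative settings
mins, maxs, ranges = [], [], []
for _ in range(reps):
    # X_i = theta + W_i with W_i standard exponential, as in Example 9.2
    xs = [theta + rng.expovariate(1.0) for _ in range(n)]
    mins.append(min(xs))
    maxs.append(max(xs))
    ranges.append(max(xs) - min(xs))

corr_min_range = correlation(mins, ranges)   # expected to be near zero
corr_min_max = correlation(mins, maxs)       # expected to be clearly positive
print(corr_min_range, corr_min_max)
```

The contrast is the point of the sketch: the range is ancillary and shows no linear association with $Y_1$, whereas $Y_n$ depends on $\theta$ and moves with $Y_1$.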

Example 9.3. Let X1 , X2 denote a random sample of size n = 2 from a distribution with pdf f (x; θ)

= =

1 −x/θ e , 0 < x < ∞, 0 < θ < ∞, θ 0 elsewhere.

The pdf is of the form (1/θ)f (x/θ), where f (w) = e−w , 0 < w < ∞, zero elsewhere. We know that Y1 = X1 + X2 is a complete suﬃcient statistic for θ. Hence, Y1 is independent of every scale-invariant statistic u(X1 , X2 ) with the property u(cx1 , cx2 ) = u(x1 , x2 ). Illustrations of these are X1 /X2 and X1 /(X1 + X2 ), statistics that have F - and beta distributions, respectively. Example 9.4. Let X1 , X2 , . . . , Xn denote a random sample from a distribution that is N (θ1 , θ2 ), −∞ < θ1 < ∞, 0 < θ2 < ∞. In Example 7.2 it was proved

431

Suﬃciency that the mean X and the variance S 2 of the sample are joint complete suﬃcient statistics for θ1 and θ2 . Consider the statistic n−1

Z=

(Xi+1 − Xi )2

1

n

= u(X1 , X2 , . . . , Xn ), (Xi − X)

2

1

which satisfies the property that $u(cx_1 + d, \ldots, cx_n + d) = u(x_1, \ldots, x_n)$. That is, the ancillary statistic $Z$ is independent of both $\bar{X}$ and $S^2$.

In this section we have given several examples in which the complete sufficient statistics are independent of ancillary statistics. Thus, in those cases, the ancillary statistics provide no information about the parameters. However, if the sufficient statistics are not complete, the ancillary statistics could provide some information, as the following example demonstrates.

Example 9.5. We refer back to Examples 8.1 and 8.2. There the first and $n$th order statistics, $Y_1$ and $Y_n$, were minimal sufficient statistics for $\theta$, where the sample arose from an underlying distribution having pdf $(1/2) I_{(\theta-1,\theta+1)}(x)$. Often $T_1 = (Y_1 + Y_n)/2$ is used as an estimator of $\theta$, as it is a function of those sufficient statistics which is unbiased. Let us find a relationship between $T_1$ and the ancillary statistic $T_2 = Y_n - Y_1$. The joint pdf of $Y_1$ and $Y_n$ is

$$g(y_1, y_n; \theta) = n(n-1)(y_n - y_1)^{n-2}/2^n, \quad \theta - 1 < y_1 < y_n < \theta + 1,$$

zero elsewhere. Accordingly, the joint pdf of $T_1$ and $T_2$ is, since the absolute value of the Jacobian equals 1,

$$h(t_1, t_2; \theta) = n(n-1)t_2^{n-2}/2^n, \quad \theta - 1 + \frac{t_2}{2} < t_1 < \theta + 1 - \frac{t_2}{2}, \quad 0 < t_2 < 2,$$

zero elsewhere. Thus the pdf of $T_2$ is

$$h_2(t_2; \theta) = n(n-1)t_2^{n-2}(2 - t_2)/2^n, \quad 0 < t_2 < 2,$$

zero elsewhere, which, of course, is free of $\theta$ as $T_2$ is an ancillary statistic. Thus, the conditional pdf of $T_1$, given $T_2 = t_2$, is

$$h_{1|2}(t_1 | t_2; \theta) = \frac{1}{2 - t_2}, \quad \theta - 1 + \frac{t_2}{2} < t_1 < \theta + 1 - \frac{t_2}{2}, \quad 0 < t_2 < 2,$$

zero elsewhere. Note that this is uniform on the interval $(\theta - 1 + t_2/2, \theta + 1 - t_2/2)$; so the conditional mean and variance of $T_1$ are, respectively,

$$E(T_1 | t_2) = \theta \quad \text{and} \quad \mathrm{var}(T_1 | t_2) = \frac{(2 - t_2)^2}{12}.$$
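The conditional mean and variance just derived can be checked by simulation. The following is a minimal sketch, not part of the original text: the function name `conditional_moments`, the window `half_width`, the seed, and the inputs $\theta = 0$, $n = 5$, $t_2 = 1$ are all illustrative assumptions. It keeps only those simulated samples whose range $T_2$ falls near the chosen $t_2$ and compares the empirical variance of $T_1$ with $(2 - t_2)^2/12$.

```python
import random
from statistics import mean, pvariance

def conditional_moments(theta, n, t2, half_width=0.05, reps=100_000, seed=1):
    """Approximate E(T1 | T2 near t2) and var(T1 | T2 near t2) by simulation.

    Samples are drawn from the uniform(theta - 1, theta + 1) distribution.
    Conditioning on T2 = t2 exactly has probability zero, so samples with
    |T2 - t2| < half_width are kept instead (an approximation).
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(reps):
        xs = [rng.uniform(theta - 1.0, theta + 1.0) for _ in range(n)]
        y1, yn = min(xs), max(xs)
        if abs((yn - y1) - t2) < half_width:   # T2 = Yn - Y1
            kept.append((y1 + yn) / 2.0)       # T1 = (Y1 + Yn)/2
    return mean(kept), pvariance(kept), len(kept)

# Illustrative inputs (assumptions, not taken from the text)
m, v, count = conditional_moments(theta=0.0, n=5, t2=1.0)
theory = (2.0 - 1.0) ** 2 / 12.0
print(f"kept {count} samples: mean {m:.4f}, variance {v:.4f}, theory {theory:.4f}")
```

Rerunning with a different `n` but the same `t2` should give roughly the same conditional variance, which is the point made in the text: the conditional variance depends on $t_2$ and not on the sample size.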

Given $T_2 = t_2$, we know something about the conditional variance of $T_1$. In particular, if the observed value of $T_2$ is large (close to 2), then that variance is small and we can place more reliance on the estimator $T_1$. On the other hand, a small value of $t_2$ means that we have less confidence in $T_1$ as an estimator of $\theta$. It is extremely interesting to note that this conditional variance does not depend upon the sample size $n$ but only on the given value of $T_2 = t_2$. As the sample size increases, $T_2$ tends to become larger and, in those cases, $T_1$ has smaller conditional variance.

While Example 9.5 is a special one demonstrating mathematically that an ancillary statistic can provide some help in point estimation, this does actually happen in practice, too. For illustration, we know that if the sample size is large enough, then

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

has an approximate standard normal distribution. Of course, if the sample arises from a normal distribution, $\bar{X}$ and $S$ are independent and $T$ has a $t$-distribution with $n - 1$ degrees of freedom. Even if the sample arises from a symmetric distribution, $\bar{X}$ and $S$ are uncorrelated and $T$ has an approximate $t$-distribution and certainly an approximate standard normal distribution with sample sizes around 30 or 40. On the other hand, if the sample arises from a highly skewed distribution (say to the right), then $\bar{X}$ and $S$ are highly correlated and the probability $P(-1.96 < T < 1.96)$ is not necessarily close to 0.95 unless the sample size is extremely large (certainly much greater than 30). Intuitively, one can understand why this correlation exists if the underlying distribution is highly skewed to the right. While $S$ has a distribution free of $\mu$ (and hence is an ancillary), a large value of $S$ implies a large value of $\bar{X}$, since the underlying pdf is like the one depicted in Figure 9.1. Of course, a small value of $\bar{X}$ (say less than the mode) requires a relatively small value of $S$.
This means that unless $n$ is extremely large, it is risky to say that

$$\left( \bar{x} - \frac{1.96\,s}{\sqrt{n}},\ \bar{x} + \frac{1.96\,s}{\sqrt{n}} \right)$$

provides an approximate 95% confidence interval with data from a very skewed distribution. As a matter of fact, the authors have seen situations in which this confidence coefficient is closer to 80%, rather than 95%, with sample sizes of 30 to 40.

EXERCISES

9.1. Let $Y_1 < Y_2 < Y_3 < Y_4$ denote the order statistics of a random sample of size $n = 4$ from a distribution having pdf $f(x; \theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere, where $0 < \theta < \infty$. Argue that the complete sufficient statistic $Y_4$ for $\theta$ is independent of each of the statistics $Y_1/Y_4$ and $(Y_1 + Y_2)/(Y_3 + Y_4)$. Hint: Show that the pdf is of the form $(1/\theta)f(x/\theta)$, where $f(w) = 1$, $0 < w < 1$, zero elsewhere.

[Figure 9.1: Graph of a right skewed distribution (horizontal axis $x$, vertical axis $f(x)$); see also Exercise 9.14.]

9.2. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from a $N(\theta, \sigma^2)$, $-\infty < \theta < \infty$, distribution. Show that the distribution of $Z = Y_n - \bar{X}$ does not depend upon $\theta$. Thus $\bar{Y} = \sum_{1}^{n} Y_i/n$, a complete sufficient statistic for $\theta$, is independent of $Z$.

9.3. Let $X_1, X_2, \ldots, X_n$ be iid with the distribution $N(\theta, \sigma^2)$, $-\infty < \theta < \infty$. Prove that a necessary and sufficient condition that the statistics $Z = \sum_{1}^{n} a_i X_i$ and $Y = \sum_{1}^{n} X_i$, a complete sufficient statistic for $\theta$, are independent is that $\sum_{1}^{n} a_i = 0$.

9.4. Let $X$ and $Y$ be random variables such that $E(X^k)$ and $E(Y^k) \neq 0$ exist for $k = 1, 2, 3, \ldots$. If the ratio $X/Y$ and its denominator $Y$ are independent, prove that $E[(X/Y)^k] = E(X^k)/E(Y^k)$, $k = 1, 2, 3, \ldots$. Hint: Write $E(X^k) = E[Y^k (X/Y)^k]$.

9.5. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample of size $n$ from a distribution that has pdf $f(x; \theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, $0 < \theta < \infty$, zero elsewhere. Show that the ratio $R = nY_1 / \sum_{1}^{n} Y_i$ and its denominator (a complete sufficient statistic for $\theta$) are independent. Use the result of the preceding exercise to determine $E(R^k)$, $k = 1, 2, 3, \ldots$.

9.6. Let $X_1, X_2, \ldots, X_5$ be iid with pdf $f(x) = e^{-x}$, $0 < x < \infty$, zero elsewhere. Show that $(X_1 + X_2)/(X_1 + X_2 + \cdots + X_5)$ and its denominator are independent. Hint: The pdf $f(x)$ is a member of $\{f(x; \theta) : 0 < \theta < \infty\}$, where $f(x; \theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, zero elsewhere.

9.7. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from the normal distribution $N(\theta_1, \theta_2)$, $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$. Show that the joint complete sufficient statistics $\bar{X} = \bar{Y}$ and $S^2$ for $\theta_1$ and $\theta_2$ are independent of each of $(Y_n - \bar{Y})/S$ and $(Y_n - Y_1)/S$.

9.8. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from a distribution with the pdf

$$f(x; \theta_1, \theta_2) = \frac{1}{\theta_2} \exp\left( -\frac{x - \theta_1}{\theta_2} \right),$$

$\theta_1 < x < \infty$, zero elsewhere, where $-\infty < \theta_1 < \infty$, $0 < \theta_2 < \infty$. Show that the joint complete sufficient statistics $Y_1$ and $\bar{X} = \bar{Y}$ for the parameters $\theta_1$ and $\theta_2$ are independent of $(Y_2 - Y_1) / \sum_{1}^{n} (Y_i - Y_1)$.

9.9. Let $X_1, X_2, \ldots, X_5$ be a random sample of size $n = 5$ from the normal distribution $N(0, \theta)$.
(a) Argue that the ratio $R = (X_1^2 + X_2^2)/(X_1^2 + \cdots + X_5^2)$ and its denominator $(X_1^2 + \cdots + X_5^2)$ are independent.
(b) Does $5R/2$ have an $F$-distribution with 2 and 5 degrees of freedom? Explain your answer.
(c) Compute $E(R)$ using Exercise 9.4.

9.10. Referring to Example 9.5 of this section, determine $c$ so that $P(-c < T_1 - \theta < c \,|\, T_2 = t_2) = 0.95$. Use this result to find a 95% confidence interval for $\theta$, given $T_2 = t_2$; and note how its length is smaller when the range $t_2$ is larger.

9.11. Show that $Y = |X|$ is a complete sufficient statistic for $\theta > 0$, where $X$ has the pdf $f_X(x; \theta) = 1/(2\theta)$, for $-\theta < x < \theta$, zero elsewhere. Show that $Y = |X|$ and $Z = \mathrm{sgn}(X)$ are independent.

9.12. Let $Y_1 < Y_2 < \cdots < Y_n$ be the order statistics of a random sample from a $N(\theta, \sigma^2)$ distribution, where $\sigma^2$ is fixed but arbitrary. Then $\bar{Y} = \bar{X}$ is a complete sufficient statistic for $\theta$. Consider another estimator $T$ of $\theta$, such as $T = (Y_i + Y_{n+1-i})/2$, for $i = 1, 2, \ldots, [n/2]$, or $T$ could be any weighted average of these latter statistics.
(a) Argue that $T - \bar{X}$ and $\bar{X}$ are independent random variables.
(b) Show that $\mathrm{Var}(T) = \mathrm{Var}(\bar{X}) + \mathrm{Var}(T - \bar{X})$.
(c) Since we know $\mathrm{Var}(\bar{X}) = \sigma^2/n$, it might be more efficient to estimate $\mathrm{Var}(T)$ by estimating $\mathrm{Var}(T - \bar{X})$ by Monte Carlo methods rather than doing that with $\mathrm{Var}(T)$ directly, because $\mathrm{Var}(T) \geq \mathrm{Var}(T - \bar{X})$. This is often called the Monte Carlo Swindle.

9.13. Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a distribution with pdf $f(x; \theta) = (1/2)\theta^3 x^2 e^{-\theta x}$, $0 < x < \infty$, zero elsewhere, where $0 < \theta < \infty$:
(a) Find the mle, $\hat{\theta}$, of $\theta$. Is $\hat{\theta}$ unbiased? Hint: Find the pdf of $Y = \sum_{1}^{n} X_i$ and then compute $E(\hat{\theta})$.
(b) Argue that $Y$ is a complete sufficient statistic for $\theta$.
(c) Find the MVUE of $\theta$.
(d) Show that $X_1/Y$ and $Y$ are independent.

(e) What is the distribution of $X_1/Y$?

9.14. The pdf depicted in Figure 9.1 is given by

$$f_{m_2}(x) = e^{-x}\left(1 + m_2^{-1} e^{-x}\right)^{-(m_2+1)}, \quad -\infty < x < \infty, \qquad (9.2)$$

where $m_2 > 0$ (the pdf graphed is for $m_2 = 0.1$). This is a member of a large family of pdfs, the log $F$-family, which are useful in survival (lifetime) analysis; see Chapter 3 of Hettmansperger and McKean (2011).
(a) Let $W$ be a random variable with pdf (9.2). Show that $W = \log Y$, where $Y$ has an $F$-distribution with 2 and $2m_2$ degrees of freedom.
(b) Show that the pdf becomes the logistic (1.8) if $m_2 = 1$.
(c) Consider the location model where

$$X_i = \theta + W_i, \quad i = 1, \ldots, n,$$

where $W_1, \ldots, W_n$ are iid with pdf (9.2). Similar to the logistic location model, the order statistics are minimal sufficient for this model. Show that the mle of $\theta$ exists.

Answers to Selected Exercises 1.4

1 2 3, 3.

4.2 (a) X ; (b) X

1.5 δ1 (y).

4.3 Y /n.

1.6 b = 0, does not exist.

4.5 Y1 − n1 .

1.7 does not exist.

n 2.8 i=1 [Xi (1 − Xi )].

4.7 (a) Yes; (b) yes.

−r

− θ1

r

n!θ e [ i=1 yi +(n−r)yr ] . 2.9 (a) (n−r)! r (b) r−1 [ i=1 yi + (n − r)yr ].

3.2 60y32 (y5 − y3 )/θ5 ; 0 < y3 < y5 < θ; 6y5 /5; θ2 /7; θ2 /35. 3.3


1 −y1 /θ ,0 < θ2 e y1 /2; θ2 /2.

y2 < y1 < ∞;

4.8 (a) E(X) = 0. 4.9 (a) max{−Y1 , 0.5Yn }; (b) yes; (c) yes. n 5.1 Y1 = i=1 Xi ; Y1 /4n; yes. 5.4 x/α. 5.9 x. 5.11 (b) Y1 /n; (c) θ; (d) Y1 /n.

n n 3.5 n−1 i=1 Xi2 ; n−1 i=1 Xi ; (n + 1)Yn /n.

6.1 X − n1 .

3.6 6X.

6.2 Y 2 /(n2 + 2n).

2

Suﬃciency + n−1 Y * Y 1 + n−1 ; n + n−1 nX * nX 1 + n−1 ; (b) n θ

(c) N θ, n .

7.6 (b)

6.5 (a)

* 6.8 1 − e−2/X ; 1 − 1 −

2/X n

+n−1 .

Y1 +Yn (n+1)(Yn −Y1 ) ; . 2 2(n−1)

n 1 7.9 (a) n−1 h=1 (Xih − X i ) − X j ); × (X jh n (b) i=1 ai X i . * n 1 + n x , 7.10 i i=1 i=1 xi , . n

6.9 (b) X; (c) X; (d) 1/X.

8.3 Y1 , ;

7.3 Yes.

9.13 (a) Γ(3n, 1/θ), ; no, ; (c) (3n − 1)/Y ; (e) Beta(3, 3n − 3).

7.5

Γ[(n−1)/2] Γ[n/2]

,

n−1 2 S.

i=1 (Yi

− Y1 )/n, .


Optimal Tests of Hypotheses 1

Most Powerful Tests

In this chapter, we discuss certain best tests. For the reader's convenience, in the next several paragraphs we quickly review concepts of testing. We are interested in a random variable $X$ which has pdf or pmf $f(x; \theta)$, where $\theta \in \Omega$. We assume that $\theta \in \omega_0$ or $\theta \in \omega_1$, where $\omega_0$ and $\omega_1$ are disjoint subsets of $\Omega$ and $\omega_0 \cup \omega_1 = \Omega$. We label the hypotheses as

$$H_0: \theta \in \omega_0 \ \text{versus} \ H_1: \theta \in \omega_1. \qquad (1.1)$$

The hypothesis $H_0$ is referred to as the null hypothesis, while $H_1$ is referred to as the alternative hypothesis. The test of $H_0$ versus $H_1$ is based on a sample $X_1, \ldots, X_n$ from the distribution of $X$. In this chapter, we often use the vector $\mathbf{X} = (X_1, \ldots, X_n)$ to denote the random sample and $\mathbf{x} = (x_1, \ldots, x_n)$ to denote the values of the sample. Let $S$ denote the support of the random sample $\mathbf{X} = (X_1, \ldots, X_n)$.

A test of $H_0$ versus $H_1$ is based on a subset $C$ of $S$. This set $C$ is called the critical region and its corresponding decision rule is

Reject $H_0$ (Accept $H_1$) if $\mathbf{X} \in C$
Retain $H_0$ (Reject $H_1$) if $\mathbf{X} \in C^c$. (1.2)

Note that a test is defined by its critical region. Conversely, a critical region defines a test. Recall that the 2 × 2 decision table summarizes the results of the hypothesis test in terms of the true state of nature. Besides the correct decisions, two errors can occur. A Type I error occurs if $H_0$ is rejected when it is true, while a Type II error occurs if $H_0$ is accepted when $H_1$ is true. The size or significance

From Chapter 8 of Introduction to Mathematical Statistics, Seventh Edition. Robert V. Hogg, Joseph W. McKean, Allen T. Craig. Copyright © 2013 by Pearson Education, Inc. All rights reserved.

level of the test is the probability of a Type I error; i.e.,

$$\alpha = \max_{\theta \in \omega_0} P_\theta(\mathbf{X} \in C). \qquad (1.3)$$

Note that $P_\theta(\mathbf{X} \in C)$ should be read as the probability that $\mathbf{X} \in C$ when $\theta$ is the true parameter. Subject to tests having size $\alpha$, we select tests that minimize Type II error or equivalently maximize the probability of rejecting $H_0$ when $\theta \in \omega_1$. Recall that the power function of a test is given by

$$\gamma_C(\theta) = P_\theta(\mathbf{X} \in C); \quad \theta \in \omega_1. \qquad (1.4)$$

In this chapter, we want to construct best tests for certain situations. We begin with testing a simple hypothesis $H_0$ against a simple alternative $H_1$. Let $f(x; \theta)$ denote the pdf or pmf of a random variable $X$, where $\theta \in \Omega = \{\theta', \theta''\}$. Let $\omega_0 = \{\theta'\}$ and $\omega_1 = \{\theta''\}$. Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a random sample from the distribution of $X$. We now define a best critical region (and hence a best test) for testing the simple hypothesis $H_0$ against the alternative simple hypothesis $H_1$.

Definition 1.1. Let $C$ denote a subset of the sample space. Then we say that $C$ is a best critical region of size $\alpha$ for testing the simple hypothesis $H_0: \theta = \theta'$ against the alternative simple hypothesis $H_1: \theta = \theta''$ if
(a) $P_{\theta'}[\mathbf{X} \in C] = \alpha$.
(b) And for every subset $A$ of the sample space, $P_{\theta'}[\mathbf{X} \in A] = \alpha \Rightarrow P_{\theta''}[\mathbf{X} \in C] \geq P_{\theta''}[\mathbf{X} \in A]$.

This definition states, in effect, the following: In general, there is a multiplicity of subsets $A$ of the sample space such that $P_{\theta'}[\mathbf{X} \in A] = \alpha$. Suppose that there is one of these subsets, say $C$, such that when $H_1$ is true, the power of the test associated with $C$ is at least as great as the power of the test associated with every other $A$. Then $C$ is defined as a best critical region of size $\alpha$ for testing $H_0$ against $H_1$.

As Theorem 1.1 shows, there is a best test for this simple versus simple case. But first, we offer a simple example examining this definition in some detail.

Example 1.1. Consider the one random variable $X$ that has a binomial distribution with $n = 5$ and $p = \theta$. Let $f(x; \theta)$ denote the pmf of $X$ and let $H_0: \theta = 1/2$ and $H_1: \theta = 3/4$. The following tabulation gives, at points of positive probability density, the values of $f(x; 1/2)$, $f(x; 3/4)$, and the ratio $f(x; 1/2)/f(x; 3/4)$.

x                        0         1         2         3          4          5
f(x; 1/2)                1/32      5/32      10/32     10/32      5/32       1/32
f(x; 3/4)                1/1024    15/1024   90/1024   270/1024   405/1024   243/1024
f(x; 1/2)/f(x; 3/4)      32/1      32/3      32/9      32/27      32/81      32/243

We shall use one random value of $X$ to test the simple hypothesis $H_0: \theta = 1/2$ against the alternative simple hypothesis $H_1: \theta = 3/4$, and we shall first assign the significance level of the test to be $\alpha = 1/32$. We seek a best critical region of size $\alpha = 1/32$. If $A_1 = \{x : x = 0\}$ or $A_2 = \{x : x = 5\}$, then $P_{\{\theta=1/2\}}(X \in A_1) = P_{\{\theta=1/2\}}(X \in A_2) = 1/32$ and there is no other subset $A_3$ of the space $\{x : x = 0, 1, 2, 3, 4, 5\}$ such that $P_{\{\theta=1/2\}}(X \in A_3) = 1/32$. Then either $A_1$ or $A_2$ is the best critical region $C$ of size $\alpha = 1/32$ for testing $H_0$ against $H_1$. We note that $P_{\{\theta=1/2\}}(X \in A_1) = 1/32$ and $P_{\{\theta=3/4\}}(X \in A_1) = 1/1024$. Thus, if the set $A_1$ is used as a critical region of size $\alpha = 1/32$, we have the intolerable situation that the probability of rejecting $H_0$ when $H_1$ is true ($H_0$ is false) is much less than the probability of rejecting $H_0$ when $H_0$ is true.

On the other hand, if the set $A_2$ is used as a critical region, then $P_{\{\theta=1/2\}}(X \in A_2) = 1/32$ and $P_{\{\theta=3/4\}}(X \in A_2) = 243/1024$. That is, the probability of rejecting $H_0$ when $H_1$ is true is much greater than the probability of rejecting $H_0$ when $H_0$ is true. Certainly, this is a more desirable state of affairs, and actually $A_2$ is the best critical region of size $\alpha = 1/32$. The latter statement follows from the fact that when $H_0$ is true, there are but two subsets, $A_1$ and $A_2$, of the sample space, each of whose probability measure is $1/32$, and the fact that

$$\frac{243}{1024} = P_{\{\theta=3/4\}}(X \in A_2) > P_{\{\theta=3/4\}}(X \in A_1) = \frac{1}{1024}.$$

It should be noted in this problem that the best critical region $C = A_2$ of size $\alpha = 1/32$ is found by including in $C$ the point (or points) at which $f(x; 1/2)$ is small in comparison with $f(x; 3/4)$. This is seen to be true once it is observed that the ratio $f(x; 1/2)/f(x; 3/4)$ is a minimum at $x = 5$. Accordingly, the ratio $f(x; 1/2)/f(x; 3/4)$, which is given in the last line of the above tabulation, provides us with a precise tool by which to find a best critical region $C$ for certain given values of $\alpha$. To illustrate this, take $\alpha = 6/32$. When $H_0$ is true, each of the subsets $\{x : x = 0, 1\}$, $\{x : x = 0, 4\}$, $\{x : x = 1, 5\}$, $\{x : x = 4, 5\}$ has probability measure $6/32$. By direct computation it is found that the best critical region of this size is $\{x : x = 4, 5\}$. This reflects the fact that the ratio $f(x; 1/2)/f(x; 3/4)$ has its two smallest values for $x = 4$ and $x = 5$. The power of this test, which has $\alpha = 6/32$, is

$$P_{\{\theta=3/4\}}(X = 4, 5) = \frac{405}{1024} + \frac{243}{1024} = \frac{648}{1024}.$$
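The "direct computation" in Example 1.1 can be reproduced by brute force. The sketch below is an illustration added here, not the authors' method: it uses exact rational arithmetic, and the helper names `binom_pmf` and `best_region` are hypothetical. It lists every subset of $\{0, 1, \ldots, 5\}$ whose probability under $H_0$ equals the requested size and returns the one with the largest probability under $H_1$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def binom_pmf(x, n, p):
    """Exact binomial pmf; p is a Fraction so all results stay rational."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n = 5
p0, p1 = Fraction(1, 2), Fraction(3, 4)          # H0 and H1 values of theta
f0 = {x: binom_pmf(x, n, p0) for x in range(n + 1)}
f1 = {x: binom_pmf(x, n, p1) for x in range(n + 1)}
ratio = {x: f0[x] / f1[x] for x in range(n + 1)}  # last row of the tabulation

def best_region(alpha):
    """Return (best subset, all subsets) among regions of exact size alpha."""
    candidates = [
        A
        for r in range(1, n + 2)
        for A in combinations(range(n + 1), r)
        if sum(f0[x] for x in A) == alpha
    ]
    best = max(candidates, key=lambda A: sum(f1[x] for x in A))
    return best, candidates

best, candidates = best_region(Fraction(6, 32))
power = sum(f1[x] for x in best)
print(candidates)   # the subsets of size 6/32 under H0
print(best, power)  # the most powerful of them and its power under H1
```

Calling `best_region(Fraction(1, 32))` in the same way compares the two regions $A_1$ and $A_2$ discussed for $\alpha = 1/32$.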

The preceding example should make the following theorem, due to Neyman and Pearson, easier to understand. It is an important theorem because it provides a systematic method of determining a best critical region.

Theorem 1.1. Neyman–Pearson Theorem. Let $X_1, X_2, \ldots, X_n$, where $n$ is a fixed positive integer, denote a random sample from a distribution that has pdf or pmf $f(x; \theta)$. Then the likelihood of $X_1, X_2, \ldots, X_n$ is

$$L(\theta; \mathbf{x}) = \prod_{i=1}^{n} f(x_i; \theta), \quad \text{for } \mathbf{x} = (x_1, \ldots, x_n).$$

Let $\theta'$ and $\theta''$ be distinct fixed values of $\theta$ so that $\Omega = \{\theta : \theta = \theta', \theta''\}$, and let $k$ be a positive number. Let $C$ be a subset of the sample space such that

(a) $\dfrac{L(\theta'; \mathbf{x})}{L(\theta''; \mathbf{x})} \leq k$, for each point $\mathbf{x} \in C$.

(b) $\dfrac{L(\theta'; \mathbf{x})}{L(\theta''; \mathbf{x})} \geq k$, for each point $\mathbf{x} \in C^c$.

(c) $\alpha = P_{H_0}[\mathbf{X} \in C]$.

Then $C$ is a best critical region of size $\alpha$ for testing the simple hypothesis $H_0: \theta = \theta'$ against the alternative simple hypothesis $H_1: \theta = \theta''$.

Proof: We shall give the proof when the random variables are of the continuous type. If $C$ is the only critical region of size $\alpha$, the theorem is proved. If there is another critical region of size $\alpha$, denote it by $A$. For convenience, we shall let $\int \cdots \int_R L(\theta; x_1, \ldots, x_n)\, dx_1 \cdots dx_n$ be denoted by $\int_R L(\theta)$. In this notation we wish to show that

$$\int_C L(\theta'') - \int_A L(\theta'') \geq 0.$$

Since $C$ is the union of the disjoint sets $C \cap A$ and $C \cap A^c$ and $A$ is the union of the disjoint sets $A \cap C$ and $A \cap C^c$, we have

$$\int_C L(\theta'') - \int_A L(\theta'') = \int_{C \cap A} L(\theta'') + \int_{C \cap A^c} L(\theta'') - \int_{A \cap C} L(\theta'') - \int_{A \cap C^c} L(\theta'') = \int_{C \cap A^c} L(\theta'') - \int_{A \cap C^c} L(\theta''). \qquad (1.5)$$

However, by the hypothesis of the theorem, $L(\theta'') \geq (1/k)L(\theta')$ at each point of $C$, and hence at each point of $C \cap A^c$; thus,

$$\int_{C \cap A^c} L(\theta'') \geq \frac{1}{k} \int_{C \cap A^c} L(\theta').$$

But $L(\theta'') \leq (1/k)L(\theta')$ at each point of $C^c$, and hence at each point of $A \cap C^c$; accordingly,

$$\int_{A \cap C^c} L(\theta'') \leq \frac{1}{k} \int_{A \cap C^c} L(\theta').$$

These inequalities imply that

$$\int_{C \cap A^c} L(\theta'') - \int_{A \cap C^c} L(\theta'') \geq \frac{1}{k} \int_{C \cap A^c} L(\theta') - \frac{1}{k} \int_{A \cap C^c} L(\theta');$$

and, from Equation (1.5), we obtain

$$\int_C L(\theta'') - \int_A L(\theta'') \geq \frac{1}{k} \left[ \int_{C \cap A^c} L(\theta') - \int_{A \cap C^c} L(\theta') \right]. \qquad (1.6)$$

However,

$$\int_{C \cap A^c} L(\theta') - \int_{A \cap C^c} L(\theta') = \int_{C \cap A^c} L(\theta') + \int_{C \cap A} L(\theta') - \int_{A \cap C} L(\theta') - \int_{A \cap C^c} L(\theta') = \int_C L(\theta') - \int_A L(\theta') = \alpha - \alpha = 0.$$

If this result is substituted in inequality (1.6), we obtain the desired result,

$$\int_C L(\theta'') - \int_A L(\theta'') \geq 0.$$

If the random variables are of the discrete type, the proof is the same with integration replaced by summation.

Remark 1.1. As stated in the theorem, conditions (a), (b), and (c) are sufficient ones for region $C$ to be a best critical region of size $\alpha$. However, they are also necessary. We discuss this briefly. Suppose there is a region $A$ of size $\alpha$ that does not satisfy (a) and (b) and that is as powerful at $\theta = \theta''$ as $C$, which satisfies (a), (b), and (c). Then expression (1.5) would be zero, since the power at $\theta''$ using $A$ is equal to that using $C$. It can be proved that to have expression (1.5) equal zero, $A$ must be of the same form as $C$. As a matter of fact, in the continuous case, $A$ and $C$ would essentially be the same region; that is, they could differ only by a set having probability zero. However, in the discrete case, if $P_{H_0}[L(\theta') = kL(\theta'')]$ is positive, $A$ and $C$ could be different sets, but each would necessarily enjoy conditions (a), (b), and (c) to be a best critical region of size $\alpha$.

It would seem that a test should have the property that its power should never fall below its significance level; otherwise, the probability of falsely rejecting $H_0$ (level) is higher than the probability of correctly rejecting $H_0$ (power). We say a test having this property is unbiased, which we now formally define:

Definition 1.2. Let $X$ be a random variable which has pdf or pmf $f(x; \theta)$, where $\theta \in \Omega$. Consider the hypotheses given in expression (1.1). Let $\mathbf{X} = (X_1, \ldots, X_n)$ denote a random sample on $X$. Consider a test with critical region $C$ and level $\alpha$. We say that this test is unbiased if

$$P_\theta(\mathbf{X} \in C) \geq \alpha, \quad \text{for all } \theta \in \omega_1.$$

As the next corollary shows, the best test given in Theorem 1.1 is an unbiased test.

Corollary 1.1. As in Theorem 1.1, let $C$ be the critical region of the best test of $H_0: \theta = \theta'$ versus $H_1: \theta = \theta''$. Suppose the significance level of the test is $\alpha$. Let $\gamma_C(\theta'') = P_{\theta''}[\mathbf{X} \in C]$ denote the power of the test. Then $\alpha \leq \gamma_C(\theta'')$.

Proof: Consider the "unreasonable" test in which the data are ignored, but a Bernoulli trial is performed which has probability $\alpha$ of success. If the trial ends in success, we reject $H_0$. The level of this test is $\alpha$. Because the power of a test is the probability of rejecting $H_0$ when $H_1$ is true, the power of this unreasonable test is $\alpha$ also. But $C$ is the best critical region of size $\alpha$ and thus has power greater than or equal to the power of the unreasonable test. That is, $\gamma_C(\theta'') \geq \alpha$, which is the desired result.

Another aspect of Theorem 1.1 to be emphasized is that if we take $C$ to be the set of all points $\mathbf{x}$ which satisfy

$$\frac{L(\theta'; \mathbf{x})}{L(\theta''; \mathbf{x})} \leq k, \quad k > 0,$$

then, in accordance with the theorem, $C$ is a best critical region. This inequality can frequently be expressed in one of the forms (where $c_1$ and $c_2$ are constants)

$$u_1(\mathbf{x}; \theta', \theta'') \leq c_1 \quad \text{or} \quad u_2(\mathbf{x}; \theta', \theta'') \geq c_2.$$

Suppose that it is the first form, $u_1 \leq c_1$. Since $\theta'$ and $\theta''$ are given constants, $u_1(\mathbf{X}; \theta', \theta'')$ is a statistic; and if the pdf or pmf of this statistic can be found when $H_0$ is true, then the significance level of the test of $H_0$ against $H_1$ can be determined from this distribution. That is,

$$\alpha = P_{H_0}[u_1(\mathbf{X}; \theta', \theta'') \leq c_1].$$

Moreover, the test may be based on this statistic; for if the observed vector value of $\mathbf{X}$ is $\mathbf{x}$, we reject $H_0$ (accept $H_1$) if $u_1(\mathbf{x}) \leq c_1$.

A positive number $k$ determines a best critical region $C$ whose size is $\alpha = P_{H_0}[\mathbf{X} \in C]$ for that particular $k$. It may be that this value of $\alpha$ is unsuitable for the purpose at hand; that is, it is too large or too small. However, if there is a statistic $u_1(\mathbf{X})$ as in the preceding paragraph, whose pdf or pmf can be determined when $H_0$ is true, we need not experiment with various values of $k$ to obtain a desirable significance level. For if the distribution of the statistic is known, or can be found, we may determine $c_1$ such that $P_{H_0}[u_1(\mathbf{X}) \leq c_1]$ is a desirable significance level.

An illustrative example follows.

Example 1.2. Let $\mathbf{X} = (X_1, \ldots, X_n)$ denote a random sample from the distribution that has the pdf

$$f(x; \theta) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{(x - \theta)^2}{2} \right], \quad -\infty < x < \infty.$$

It is desired to test the simple hypothesis $H_0: \theta = \theta' = 0$ against the alternative simple hypothesis $H_1: \theta = \theta'' = 1$. Now

$$\frac{L(\theta'; \mathbf{x})}{L(\theta''; \mathbf{x})} = \frac{(1/\sqrt{2\pi})^n \exp\left[ -\sum_{1}^{n} x_i^2/2 \right]}{(1/\sqrt{2\pi})^n \exp\left[ -\sum_{1}^{n} (x_i - 1)^2/2 \right]} = \exp\left( -\sum_{1}^{n} x_i + \frac{n}{2} \right).$$

If $k > 0$, the set of all points $(x_1, x_2, \ldots, x_n)$ such that

$$\exp\left( -\sum_{1}^{n} x_i + \frac{n}{2} \right) \leq k$$

is a best critical region. This inequality holds if and only if

$$-\sum_{1}^{n} x_i + \frac{n}{2} \leq \log k$$

or, equivalently,

$$\sum_{1}^{n} x_i \geq \frac{n}{2} - \log k = c.$$

In this case, a best critical region is the set $C = \{(x_1, x_2, \ldots, x_n) : \sum_{1}^{n} x_i \geq c\}$, where $c$ is a constant that can be determined so that the size of the critical region is a desired number $\alpha$. The event $\sum_{1}^{n} X_i \geq c$ is equivalent to the event $\bar{X} \geq c/n = c_1$, for example, so the test may be based upon the statistic $\bar{X}$. If $H_0$ is true, that is, $\theta = \theta' = 0$, then $\bar{X}$ has a distribution that is $N(0, 1/n)$. For a given positive integer $n$, the size of the sample, and a given significance level $\alpha$, the number $c_1$ can be found from Table III in Appendix: Tables of Distributions, so that $P_{H_0}(\bar{X} \geq c_1) = \alpha$. Hence, if the experimental values of $X_1, X_2, \ldots, X_n$ were, respectively, $x_1, x_2, \ldots, x_n$, we would compute $\bar{x} = \sum_{1}^{n} x_i/n$. If $\bar{x} \geq c_1$, the simple hypothesis $H_0: \theta = \theta' = 0$ would be rejected at the significance level $\alpha$; if $\bar{x} < c_1$, the hypothesis $H_0$ would be accepted. The probability of rejecting $H_0$ when $H_0$ is true is $\alpha$; the probability of rejecting $H_0$, when $H_0$ is false, is the value of the power of the test at $\theta = \theta'' = 1$. That is,

$$P_{H_1}(\bar{X} \geq c_1) = \int_{c_1}^{\infty} \frac{1}{\sqrt{2\pi}\sqrt{1/n}} \exp\left[ -\frac{(\bar{x} - 1)^2}{2(1/n)} \right] d\bar{x}.$$
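The cutoff $c_1$ and the power integral can also be evaluated numerically instead of from a table. The following is a small sketch under the assumptions of Example 1.2 (unit variance, $\theta' = 0$, $\theta'' = 1$); the function name `np_mean_test` and the inputs $n = 25$, $\alpha = 0.05$ are illustrative choices, and `statistics.NormalDist` from the Python standard library stands in for Table III.

```python
from math import sqrt
from statistics import NormalDist

def np_mean_test(n, alpha, theta0=0.0, theta1=1.0, sigma=1.0):
    """Cutoff c1 and power of the test that rejects H0 when the sample mean >= c1.

    Under H0 the sample mean is N(theta0, sigma^2/n); under H1 it is
    N(theta1, sigma^2/n). NormalDist takes a standard deviation, not a variance.
    """
    se = sigma / sqrt(n)
    c1 = NormalDist(theta0, se).inv_cdf(1.0 - alpha)   # P_H0(mean >= c1) = alpha
    power = 1.0 - NormalDist(theta1, se).cdf(c1)       # P_H1(mean >= c1)
    return c1, power

c1, power = np_mean_test(n=25, alpha=0.05)
print(f"c1 = {c1:.3f}, power = {power:.4f}")
```

Varying `n` in this sketch shows how quickly the power at $\theta'' = 1$ approaches 1 as the sample size grows while the size stays fixed at $\alpha$.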

For example, if $n = 25$ and if $\alpha$ is selected to be 0.05, then from Table III we find $c_1 = 1.645/\sqrt{25} = 0.329$. Thus the power of this best test of $H_0$ against $H_1$ is 0.05 when $H_0$ is true, and is

$$\int_{0.329}^{\infty} \frac{1}{\sqrt{2\pi}\sqrt{1/25}} \exp\left[ -\frac{(\bar{x} - 1)^2}{2(1/25)} \right] d\bar{x} = \int_{-3.355}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-w^2/2}\, dw = 0.9996,$$

when $H_1$ is true.

There is another aspect of this theorem that warrants special mention. It has to do with the number of parameters that appear in the pdf. Our notation suggests that there is but one parameter. However, a careful review of the proof reveals that nowhere was this needed or assumed. The pdf or pmf may depend upon any finite number of parameters. What is essential is that the hypothesis $H_0$ and the alternative hypothesis $H_1$ be simple, namely, that they completely specify the distributions. With this in mind, we see that the simple hypotheses $H_0$ and $H_1$ do not need to be hypotheses about the parameters of a distribution, nor, as a matter of fact, do the random variables $X_1, X_2, \ldots, X_n$ need to be independent. That is, if $H_0$ is the simple hypothesis that the joint pdf or pmf is $g(x_1, x_2, \ldots, x_n)$, and if $H_1$ is the alternative simple hypothesis that the joint pdf or pmf is $h(x_1, x_2, \ldots, x_n)$, then $C$ is a best critical region of size $\alpha$ for testing $H_0$ against $H_1$ if, for $k > 0$,

1. $\dfrac{g(x_1, x_2, \ldots, x_n)}{h(x_1, x_2, \ldots, x_n)} \leq k$ for $(x_1, x_2, \ldots, x_n) \in C$.

2. $\dfrac{g(x_1, x_2, \ldots, x_n)}{h(x_1, x_2, \ldots, x_n)} \geq k$ for $(x_1, x_2, \ldots, x_n) \in C^c$.

3. $\alpha = P_{H_0}[(X_1, X_2, \ldots, X_n) \in C]$.

An illustrative example follows.

Example 1.3. Let $X_1, \ldots, X_n$ denote a random sample from a distribution which has a pmf $f(x)$ that is positive on and only on the nonnegative integers. It is desired to test the simple hypothesis

$$H_0: f(x) = \begin{cases} \dfrac{e^{-1}}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{elsewhere,} \end{cases}$$

against the alternative simple hypothesis

$$H_1: f(x) = \begin{cases} \left(\tfrac{1}{2}\right)^{x+1} & x = 0, 1, 2, \ldots \\ 0 & \text{elsewhere.} \end{cases}$$

Here

$$\frac{g(x_1, \ldots, x_n)}{h(x_1, \ldots, x_n)} = \frac{e^{-n}/(x_1!\, x_2! \cdots x_n!)}{(1/2)^n (1/2)^{x_1 + x_2 + \cdots + x_n}} = \frac{(2e^{-1})^n\, 2^{\sum x_i}}{\prod_{1}^{n} (x_i!)}.$$

If $k > 0$, the set of points $(x_1, x_2, \ldots, x_n)$ such that

$$\left( \sum_{1}^{n} x_i \right) \log 2 - \log\left[ \prod_{1}^{n} (x_i!) \right] \leq \log k - n \log(2e^{-1}) = c$$

is a best critical region $C$. Consider the case of $k = 1$ and $n = 1$. The preceding inequality may be written $2^{x_1}/x_1! \leq e/2$. This inequality is satisfied by all points in the set $C = \{x_1 : x_1 = 0, 3, 4, 5, \ldots\}$. Thus the power of the test when $H_0$ is true is

$$P_{H_0}(X_1 \in C) = 1 - P_{H_0}(X_1 = 1, 2) = 0.448,$$

approximately, in accordance with Table I of Appendix: Tables of Distributions; i.e., the significance level of this test is 0.448. The power of the test when $H_1$ is true is given by

$$P_{H_1}(X_1 \in C) = 1 - P_{H_1}(X_1 = 1, 2) = 1 - \left( \tfrac{1}{4} + \tfrac{1}{8} \right) = 0.625.$$

Note that these results are consistent with Corollary 1.1.

Remark 1.2. In the notation of this section, say $C$ is a critical region such that

$$\alpha = \int_C L(\theta') \quad \text{and} \quad \beta = \int_{C^c} L(\theta''),$$

where $\alpha$ and $\beta$ equal the respective probabilities of the Type I and Type II errors associated with $C$. Let $d_1$ and $d_2$ be two given positive constants. Consider a certain linear function of $\alpha$ and $\beta$, namely,

$$d_1 \int_C L(\theta') + d_2 \int_{C^c} L(\theta'') = d_1 \int_C L(\theta') + d_2 \left[ 1 - \int_C L(\theta'') \right] = d_2 + \int_C [d_1 L(\theta') - d_2 L(\theta'')].$$

If we wished to minimize this expression, we would select $C$ to be the set of all $(x_1, x_2, \ldots, x_n)$ such that

$$d_1 L(\theta') - d_2 L(\theta'') < 0$$

or, equivalently,

$$\frac{L(\theta')}{L(\theta'')} < \frac{d_2}{d_1}, \quad \text{for all } (x_1, x_2, \ldots, x_n) \in C,$$

which according to the Neyman–Pearson theorem provides a best critical region with $k = d_2/d_1$. That is, this critical region $C$ is one that minimizes $d_1\alpha + d_2\beta$. There could be others, including points on which $L(\theta')/L(\theta'') = d_2/d_1$, but these would still be best critical regions according to the Neyman–Pearson theorem.
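Remark 1.2 can be illustrated on the binomial setting of Example 1.1 ($n = 5$, $H_0: \theta = 1/2$, $H_1: \theta = 3/4$), where the sample space is small enough to search exhaustively. This is only a sketch: the function names `min_weighted_error` and `ratio_region` and the weights $(d_1, d_2)$ used below are assumptions chosen for illustration. It minimizes $d_1\alpha + d_2\beta$ over all 64 subsets and compares the minimizer with the set where the likelihood ratio is below $d_2/d_1$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

N = 5
P0, P1 = Fraction(1, 2), Fraction(3, 4)   # theta' and theta'' from Example 1.1

def pmf(x, p):
    """Exact binomial(N, p) pmf as a Fraction."""
    return comb(N, x) * p**x * (1 - p)**(N - x)

f0 = {x: pmf(x, P0) for x in range(N + 1)}
f1 = {x: pmf(x, P1) for x in range(N + 1)}

def min_weighted_error(d1, d2):
    """Search every subset C of {0,...,N} for the smallest d1*alpha + d2*beta."""
    best = None
    for r in range(N + 2):
        for C in combinations(range(N + 1), r):
            alpha = sum(f0[x] for x in C)        # Type I error probability
            beta = 1 - sum(f1[x] for x in C)     # Type II error probability
            value = d1 * alpha + d2 * beta
            if best is None or value < best[0]:
                best = (value, C)
    return best

def ratio_region(d1, d2):
    """Points where L(theta')/L(theta'') < d2/d1, as in Remark 1.2."""
    return tuple(x for x in range(N + 1) if f0[x] / f1[x] < Fraction(d2, d1))

for d1, d2 in [(1, 1), (1, 2)]:
    value, C = min_weighted_error(d1, d2)
    print(d1, d2, C, ratio_region(d1, d2), value)
```

For these weights no point has a likelihood ratio exactly equal to $d_2/d_1$, so the minimizer is unique; with ties, the search would return just one of several minimizing regions.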

EXERCISES

1.1. In Example 1.2 of this section, let the simple hypotheses read $H_0: \theta = \theta' = 0$ and $H_1: \theta = \theta'' = -1$. Show that the best test of $H_0$ against $H_1$ may be carried out by use of the statistic $\bar{X}$, and that if $n = 25$ and $\alpha = 0.05$, the power of the test is 0.9996 when $H_1$ is true.

1.2. Let the random variable $X$ have the pdf $f(x; \theta) = (1/\theta)e^{-x/\theta}$, $0 < x < \infty$, zero elsewhere. Consider the simple hypothesis $H_0: \theta = \theta' = 2$ and the alternative hypothesis $H_1: \theta = \theta'' = 4$. Let $X_1, X_2$ denote a random sample of size 2 from this distribution. Show that the best test of $H_0$ against $H_1$ may be carried out by use of the statistic $X_1 + X_2$.

1.3. Repeat Exercise 1.2 when $H_1: \theta = \theta'' = 6$. Generalize this for every $\theta'' > 2$.

1.4. Let $X_1, X_2, \ldots, X_{10}$ be a random sample of size 10 from a normal distribution $N(0, \sigma^2)$. Find a best critical region of size $\alpha = 0.05$ for testing $H_0: \sigma^2 = 1$ against $H_1: \sigma^2 = 2$. Is this a best critical region of size 0.05 for testing $H_0: \sigma^2 = 1$ against $H_1: \sigma^2 = 4$? Against $H_1: \sigma^2 = \sigma_1^2 > 1$?

1.5. If $X_1, X_2, \ldots, X_n$ is a random sample from a distribution having pdf of the form $f(x; \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, zero elsewhere, show that a best critical region for testing $H_0: \theta = 1$ against $H_1: \theta = 2$ is $C = \{(x_1, x_2, \ldots, x_n) : c \leq \prod_{i=1}^{n} x_i\}$.

1.6. Let $X_1, X_2, \ldots, X_{10}$ be a random sample from a distribution that is $N(\theta_1, \theta_2)$. Find a best test of the simple hypothesis $H_0: \theta_1 = \theta_1' = 0$, $\theta_2 = \theta_2' = 1$ against the alternative simple hypothesis $H_1: \theta_1 = \theta_1'' = 1$, $\theta_2 = \theta_2'' = 4$.

1.7. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a normal distribution $N(\theta, 100)$. Show that $C = \{(x_1, x_2, \ldots, x_n) : c \leq \bar{x} = \sum_{1}^{n} x_i/n\}$ is a best critical region for testing $H_0: \theta = 75$ against $H_1: \theta = 78$. Find $n$ and $c$ so that

$$P_{H_0}[(X_1, X_2, \ldots, X_n) \in C] = P_{H_0}(\bar{X} \geq c) = 0.05$$

and

$$P_{H_1}[(X_1, X_2, \ldots, X_n) \in C] = P_{H_1}(\bar{X} \geq c) = 0.90,$$

approximately.

1.8. If $X_1, X_2, \ldots, X_n$ is a random sample from a beta distribution with parameters $\alpha = \beta = \theta > 0$, find a best critical region for testing $H_0: \theta = 1$ against $H_1: \theta = 2$.

1.9. Let $X_1, X_2, \ldots, X_n$ be iid with pmf $f(x; p) = p^x (1 - p)^{1-x}$, $x = 0, 1$, zero elsewhere. Show that $C = \{(x_1, \ldots, x_n) : \sum_{1}^{n} x_i \leq c\}$ is a best critical region for testing $H_0: p = 1/2$ against $H_1: p = 1/3$. Use the Central Limit Theorem to find $n$ and $c$ so that approximately $P_{H_0}(\sum_{1}^{n} X_i \leq c) = 0.10$ and $P_{H_1}(\sum_{1}^{n} X_i \leq c) = 0.80$.

1.10. Let $X_1, X_2, \ldots, X_{10}$ denote a random sample of size 10 from a Poisson distribution with mean $\theta$. Show that the critical region $C$ defined by $\sum_{1}^{10} x_i \geq 3$ is a best critical region for testing $H_0: \theta = 0.1$ against $H_1: \theta = 0.5$. Determine, for this test, the significance level $\alpha$ and the power at $\theta = 0.5$.


2

Uniformly Most Powerful Tests

This section takes up the problem of a test of a simple hypothesis H0 against an alternative composite hypothesis H1 . We begin with an example. Example 2.1. Consider the pdf f (x; θ) =

1 −x/θ θe

0

0 2. The preceding example aﬀords an illustration of a test of a simple hypothesis H0 that is a best test of H0 against every simple hypothesis in the alternative composite hypothesis H1 . We now deﬁne a critical region, when it exists, which is a best critical region for testing a simple hypothesis H0 against an alternative composite hypothesis H1 . It seems desirable that this critical region should be a best critical region for testing H0 against each simple hypothesis in H1 . That is, the power function of the test that corresponds to this critical region should be at least as great as the power function of any other test with the same signiﬁcance level for every simple hypothesis in H1 . Definition 2.1. The critical region C is a uniformly most powerful (UMP) critical region of size α for testing the simple hypothesis H0 against an alternative composite hypothesis H1 if the set C is a best critical region of size α for testing H0 against each simple hypothesis in H1 . A test defined by this critical region C is called a uniformly most powerful (UMP) test, with significance level α, for testing the simple hypothesis H0 against the alternative composite hypothesis H1 . As will be seen presently, uniformly most powerful tests do not always exist. However, when they do exist, the Neyman–Pearson theorem provides a technique for ﬁnding them. Some illustrative examples are given here.


Example 2.2. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that is $N(0, \theta)$, where the variance $\theta$ is an unknown positive number. It will be shown that there exists a uniformly most powerful test with significance level $\alpha$ for testing the simple hypothesis $H_0: \theta = \theta'$, where $\theta'$ is a fixed positive number, against the alternative composite hypothesis $H_1: \theta > \theta'$. Thus $\Omega = \{\theta : \theta \ge \theta'\}$. The joint pdf of $X_1, X_2, \ldots, X_n$ is

$$L(\theta; x_1, x_2, \ldots, x_n) = \left(\frac{1}{2\pi\theta}\right)^{n/2} \exp\left\{-\frac{1}{2\theta}\sum_{i=1}^{n} x_i^2\right\}.$$

Let $\theta''$ represent a number greater than $\theta'$, and let $k$ denote a positive number. Let $C$ be the set of points where

$$\frac{L(\theta'; x_1, x_2, \ldots, x_n)}{L(\theta''; x_1, x_2, \ldots, x_n)} \le k,$$

that is, the set of points where

$$\left(\frac{\theta''}{\theta'}\right)^{n/2} \exp\left[-\left(\frac{\theta'' - \theta'}{2\theta'\theta''}\right)\sum_{1}^{n} x_i^2\right] \le k$$

or, equivalently,

$$\sum_{1}^{n} x_i^2 \ge \frac{2\theta'\theta''}{\theta'' - \theta'}\left[\frac{n}{2}\log\left(\frac{\theta''}{\theta'}\right) - \log k\right] = c.$$

The set $C = \{(x_1, x_2, \ldots, x_n) : \sum_{1}^{n} x_i^2 \ge c\}$ is then a best critical region for testing the simple hypothesis $H_0: \theta = \theta'$ against the simple hypothesis $\theta = \theta''$. It remains to determine $c$, so that this critical region has the desired size $\alpha$. If $H_0$ is true, the random variable $\sum_{1}^{n} X_i^2/\theta'$ has a chi-square distribution with $n$ degrees of freedom. Since $\alpha = P_{\theta'}\left(\sum_{1}^{n} X_i^2/\theta' \ge c/\theta'\right)$, $c/\theta'$ may be read from Table II in Appendix: Tables of Distributions and $c$ determined. Then $C = \{(x_1, x_2, \ldots, x_n) : \sum_{1}^{n} x_i^2 \ge c\}$ is a best critical region of size $\alpha$ for testing $H_0: \theta = \theta'$ against the hypothesis $\theta = \theta''$. Moreover, for each number $\theta''$ greater than $\theta'$, the foregoing argument holds. That is, $C = \{(x_1, \ldots, x_n) : \sum_{1}^{n} x_i^2 \ge c\}$ is a uniformly most powerful critical region of size $\alpha$ for testing $H_0: \theta = \theta'$ against $H_1: \theta > \theta'$. If $x_1, x_2, \ldots, x_n$ denote the experimental values of $X_1, X_2, \ldots, X_n$, then $H_0: \theta = \theta'$ is rejected at the significance level $\alpha$, and $H_1: \theta > \theta'$ is accepted if $\sum_{1}^{n} x_i^2 \ge c$; otherwise, $H_0: \theta = \theta'$ is accepted.

If, in the preceding discussion, we take $n = 15$, $\alpha = 0.05$, and $\theta' = 3$, then the two hypotheses are $H_0: \theta = 3$ and $H_1: \theta > 3$. From Table II, $c/3 = 25$ and hence $c = 75$.
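The numerical case $n = 15$, $\theta' = 3$, $c = 75$ can be checked without tables. The following is only a sketch under stated assumptions: it uses the Python standard library, and the helper names `chi2_pdf`, `chi2_upper_tail`, and `power` are illustrative (they do not come from the text). The chi-square tail probability is approximated by Simpson's rule, which is adequate here because the density is smooth on the integration interval when the degrees of freedom exceed 2.

```python
import math

def chi2_pdf(x, k):
    """Density of the chi-square distribution with k > 2 degrees of freedom."""
    if x <= 0.0:
        return 0.0
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def chi2_upper_tail(q, k, steps=4000):
    """Approximate P(chi-square(k) >= q) with composite Simpson's rule on [0, q].

    `steps` must be even; this numerical shortcut is an assumption of the sketch,
    not a method used in the text (which reads the value from Table II).
    """
    if q <= 0.0:
        return 1.0
    h = q / steps
    total = chi2_pdf(0.0, k) + chi2_pdf(q, k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi2_pdf(i * h, k)
    return 1.0 - total * h / 3

def power(theta, c=75.0, n=15):
    """gamma(theta) = P_theta(sum X_i^2 >= c) = P(chi-square(n) >= c / theta)."""
    return chi2_upper_tail(c / theta, n)

size = power(3.0)  # size of the test: power evaluated at theta' = 3
print(f"size at theta = 3: {size:.4f}")
for theta in (4.0, 6.0, 9.0):
    print(f"power at theta = {theta}: {power(theta):.4f}")
```

The size should come out very close to the nominal 0.05, and the power should increase with $\theta$, as expected for a uniformly most powerful test against $H_1: \theta > 3$.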

Example 2.3. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that is $N(\theta, 1)$, where $\theta$ is unknown. It will be shown that there is no uniformly most powerful test of the simple hypothesis $H_0: \theta = \theta'$, where $\theta'$ is a fixed number, against the alternative composite hypothesis $H_1: \theta \ne \theta'$. Thus $\Omega = \{\theta : -\infty < \theta < \infty\}$. Let $\theta''$ be a number not equal to $\theta'$. Let $k$ be a positive number and consider

$$\frac{(1/2\pi)^{n/2}\exp\left[-\sum_{1}^{n}(x_i - \theta')^2/2\right]}{(1/2\pi)^{n/2}\exp\left[-\sum_{1}^{n}(x_i - \theta'')^2/2\right]} \le k.$$

The preceding inequality may be written as

$$\exp\left\{-(\theta'' - \theta')\sum_{1}^{n} x_i + \frac{n}{2}\left[(\theta'')^2 - (\theta')^2\right]\right\} \le k$$

or

$$(\theta'' - \theta')\sum_{1}^{n} x_i \ge \frac{n}{2}\left[(\theta'')^2 - (\theta')^2\right] - \log k.$$

This last inequality is equivalent to

$$\sum_{1}^{n} x_i \ge \frac{n}{2}(\theta'' + \theta') - \frac{\log k}{\theta'' - \theta'},$$

provided that $\theta'' > \theta'$, and it is equivalent to

$$\sum_{1}^{n} x_i \le \frac{n}{2}(\theta'' + \theta') - \frac{\log k}{\theta'' - \theta'}$$

if $\theta'' < \theta'$. The first of these two expressions defines a best critical region for testing $H_0: \theta = \theta'$ against the hypothesis $\theta = \theta''$ provided that $\theta'' > \theta'$, while the second expression defines a best critical region for testing $H_0: \theta = \theta'$ against the hypothesis $\theta = \theta''$ provided that $\theta'' < \theta'$. That is, a best critical region for testing the simple hypothesis against an alternative simple hypothesis, say $\theta = \theta' + 1$, does not serve as a best critical region for testing $H_0: \theta = \theta'$ against the alternative simple hypothesis $\theta = \theta' - 1$. By definition, then, there is no uniformly most powerful test in the case under consideration.

It should be noted that had the alternative composite hypothesis been one-sided, either $H_1: \theta > \theta'$ or $H_1: \theta < \theta'$, a uniformly most powerful test would exist in each instance.

Example 2.4. In Exercise 1.10, the reader was asked to show that if a random sample of size $n = 10$ is taken from a Poisson distribution with mean $\theta$, the critical region defined by $\sum_{1}^{n} x_i \ge 3$ is a best critical region for testing $H_0: \theta = 0.1$ against $H_1: \theta = 0.5$. This critical region is also a uniformly most powerful one for testing $H_0: \theta = 0.1$ against $H_1: \theta > 0.1$ because, with $\theta'' > 0.1$,

$$\frac{(0.1)^{\sum x_i} e^{-10(0.1)}/(x_1!\,x_2! \cdots x_n!)}{(\theta'')^{\sum x_i} e^{-10\theta''}/(x_1!\,x_2! \cdots x_n!)} \le k$$

is equivalent to

$$\left(\frac{0.1}{\theta''}\right)^{\sum x_i} e^{-10(0.1 - \theta'')} \le k.$$

The preceding inequality may be written as

$$\left(\sum_{1}^{n} x_i\right)(\log 0.1 - \log\theta'') \le \log k + 10(0.1 - \theta'')$$

or, since $\theta'' > 0.1$ makes $\log 0.1 - \log\theta''$ negative, equivalently as

$$\sum_{1}^{n} x_i \ge \frac{\log k + 1 - 10\theta''}{\log 0.1 - \log\theta''}.$$

Of course, $\sum_{1}^{n} x_i \ge 3$ is of the latter form.
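To make the test of Example 2.4 concrete, its power function can be evaluated directly, because under mean $\theta$ the sum of the ten observations is Poisson with mean $10\theta$. The sketch below uses only the Python standard library; the function names `poisson_upper_tail` and `power` are illustrative choices, not notation from the text.

```python
import math

def poisson_upper_tail(c, mean):
    """P(Y >= c) for Y ~ Poisson(mean), with c a nonnegative integer."""
    below = sum(math.exp(-mean) * mean ** y / math.factorial(y) for y in range(c))
    return 1.0 - below

def power(theta, n=10, c=3):
    """Power of the test 'reject H0 if sum x_i >= c'; the sum is Poisson(n * theta)."""
    return poisson_upper_tail(c, n * theta)

alpha = power(0.1)          # significance level: power at theta = 0.1
power_at_half = power(0.5)  # power at the alternative theta = 0.5

print(f"alpha = {alpha:.4f}")                     # about 0.080
print(f"power at 0.5 = {power_at_half:.4f}")      # about 0.875
```

Evaluating `power` on a grid of values $\theta > 0.1$ shows it increasing in $\theta$, which is consistent with the same region being best against every simple alternative in $H_1: \theta > 0.1$.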

Let us make an important observation, although obvious when pointed out. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a distribution that has pdf $f(x; \theta)$, $\theta \in \Omega$. Suppose that $Y = u(X_1, X_2, \ldots, X_n)$ is a sufficient statistic for $\theta$. In accordance with the factorization theorem, the joint pdf of $X_1, X_2, \ldots, X_n$ may be written

$$L(\theta; x_1, x_2, \ldots, x_n) = k_1[u(x_1, x_2, \ldots, x_n); \theta]\,k_2(x_1, x_2, \ldots, x_n),$$

where $k_2(x_1, x_2, \ldots, x_n)$ does not depend upon $\theta$. Consequently, the ratio

$$\frac{L(\theta'; x_1, x_2, \ldots, x_n)}{L(\theta''; x_1, x_2, \ldots, x_n)} = \frac{k_1[u(x_1, x_2, \ldots, x_n); \theta']}{k_1[u(x_1, x_2, \ldots, x_n); \theta'']}$$

depends upon $x_1, x_2, \ldots, x_n$ only through $u(x_1, x_2, \ldots, x_n)$. Accordingly, if there is a sufficient statistic $Y = u(X_1, X_2, \ldots, X_n)$ for $\theta$ and if a best test or a uniformly most powerful test is desired, there is no need to consider tests which are based upon any statistic other than the sufficient statistic. This result supports the importance of sufficiency.

In the above examples, we have presented uniformly most powerful tests. For some families of pdfs and hypotheses, we can obtain general forms of such tests. We sketch these results for the general one-sided hypotheses of the form

$$H_0: \theta \le \theta' \text{ versus } H_1: \theta > \theta'. \tag{2.1}$$

The other one-sided case, with the null hypothesis $H_0: \theta \ge \theta'$, is completely analogous. Note that the null hypothesis of (2.1) is a composite hypothesis. Recall that the level of a test for the hypotheses (2.1) is defined by $\max_{\theta \le \theta'} \gamma(\theta)$, where $\gamma(\theta)$ is the power function of the test. That is, the significance level is the maximum probability of Type I error.

Let $\mathbf{X} = (X_1, \ldots, X_n)$ be a random sample with common pdf (or pmf) $f(x; \theta)$, $\theta \in \Omega$, and hence with the likelihood function

$$L(\theta, \mathbf{x}) = \prod_{i=1}^{n} f(x_i; \theta), \quad \mathbf{x} = (x_1, \ldots, x_n).$$

We consider the family of pdfs which has monotone likelihood ratio as defined next.

Definition 2.2. We say that the likelihood $L(\theta, \mathbf{x})$ has monotone likelihood ratio (mlr) in the statistic $y = u(\mathbf{x})$ if, for $\theta_1 < \theta_2$, the ratio

$$\frac{L(\theta_1, \mathbf{x})}{L(\theta_2, \mathbf{x})} \tag{2.2}$$

is a monotone function of $y = u(\mathbf{x})$.

Assume then that our likelihood function $L(\theta, \mathbf{x})$ has a monotone decreasing likelihood ratio in the statistic $y = u(\mathbf{x})$. Then the ratio in (2.2) is equal to $g(y)$, where $g$ is a decreasing function. The case where the likelihood function has a monotone increasing likelihood ratio (i.e., $g$ is an increasing function) follows similarly by changing the sense of the inequalities below. Let $\alpha$ denote the significance level. Then we claim that the following test is UMP level $\alpha$ for the hypotheses (2.1):

$$\text{Reject } H_0 \text{ if } Y \ge c_Y, \tag{2.3}$$

where $c_Y$ is determined by $\alpha = P_{\theta'}[Y \ge c_Y]$. To show this claim, first consider the simple null hypothesis $H_0: \theta = \theta'$. Let $\theta'' > \theta'$ be arbitrary but fixed. Let $C$ denote the most powerful critical region for $\theta'$ versus $\theta''$. By the Neyman–Pearson Theorem, $C$ is defined by

$$\frac{L(\theta', \mathbf{X})}{L(\theta'', \mathbf{X})} \le k \text{ if and only if } \mathbf{X} \in C,$$

where $k$ is determined by $\alpha = P_{\theta'}[\mathbf{X} \in C]$. But by Definition 2.2, because $\theta'' > \theta'$,

$$\frac{L(\theta', \mathbf{X})}{L(\theta'', \mathbf{X})} = g(Y) \le k \Leftrightarrow Y \ge g^{-1}(k),$$

where $g^{-1}(k)$ satisfies $\alpha = P_{\theta'}[Y \ge g^{-1}(k)]$; i.e., $c_Y = g^{-1}(k)$. Hence the Neyman–Pearson test is equivalent to the test defined by (2.3). Furthermore, the test is UMP for $\theta'$ versus $\theta'' > \theta'$ because the test depends on $\theta''$ only through the fact that $\theta'' > \theta'$, and $g^{-1}(k)$ is uniquely determined under $\theta'$.

Let $\gamma_Y(\theta)$ denote the power function of the test (2.3). To finish, we need to show that $\max_{\theta \le \theta'} \gamma_Y(\theta) = \alpha$. But this follows immediately if we can show that $\gamma_Y(\theta)$ is a nondecreasing function. To see this, let $\theta_1 < \theta_2$. Note that since $\theta_1 < \theta_2$, the test (2.3) is the most powerful test for testing $\theta_1$ versus $\theta_2$ with the level $\gamma_Y(\theta_1)$. By Corollary 1.1, the power of the test at $\theta_2$ must not be below the level; i.e., $\gamma_Y(\theta_2) \ge \gamma_Y(\theta_1)$. Hence $\gamma_Y(\theta)$ is a nondecreasing function. Since the power function is nondecreasing, it follows from Definition 1.2 that the mlr tests are unbiased tests for the hypotheses (2.1); see Exercise 2.14.

Example 2.5. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Bernoulli distribution with parameter $p = \theta$, where $0 < \theta < 1$. Let $\theta' < \theta''$. Consider the ratio of likelihoods

$$\frac{L(\theta'; x_1, x_2, \ldots, x_n)}{L(\theta''; x_1, x_2, \ldots, x_n)} = \frac{(\theta')^{\sum x_i}(1 - \theta')^{n - \sum x_i}}{(\theta'')^{\sum x_i}(1 - \theta'')^{n - \sum x_i}} = \left[\frac{\theta'(1 - \theta'')}{\theta''(1 - \theta')}\right]^{\sum x_i}\left(\frac{1 - \theta'}{1 - \theta''}\right)^{n}.$$

Since $\theta'/\theta'' < 1$ and $(1 - \theta'')/(1 - \theta') < 1$, so that $\theta'(1 - \theta'')/[\theta''(1 - \theta')] < 1$, the ratio is a decreasing function of $y = \sum x_i$. Thus we have a monotone likelihood ratio in the statistic $Y = \sum X_i$. Consider the hypotheses

$$H_0: \theta \le \theta' \text{ versus } H_1: \theta > \theta'. \tag{2.4}$$

By our discussion above, the UMP level $\alpha$ decision rule for testing $H_0$ versus $H_1$ is given by

$$\text{Reject } H_0 \text{ if } Y = \sum_{i=1}^{n} X_i \ge c,$$

where $c$ is such that $\alpha = P_{\theta'}[Y \ge c]$.

In the last example concerning a Bernoulli pmf, we obtained a UMP test by showing that its likelihood possesses mlr. The Bernoulli distribution is a regular case of the exponential family and our argument, under the one assumption below, can be generalized to the entire regular exponential family. To show this, suppose that the random sample $X_1, X_2, \ldots, X_n$ arises from a pdf or pmf representing a regular case of the exponential class, namely,

$$f(x; \theta) = \begin{cases} \exp[p(\theta)K(x) + H(x) + q(\theta)] & x \in S \\ 0 & \text{elsewhere,} \end{cases}$$

where the support of $X$, $S$, is free of $\theta$. Further assume that $p(\theta)$ is an increasing function of $\theta$. Then

$$\frac{L(\theta')}{L(\theta'')} = \frac{\exp\left[p(\theta')\sum_{1}^{n} K(x_i) + \sum_{1}^{n} H(x_i) + nq(\theta')\right]}{\exp\left[p(\theta'')\sum_{1}^{n} K(x_i) + \sum_{1}^{n} H(x_i) + nq(\theta'')\right]} = \exp\left\{[p(\theta') - p(\theta'')]\sum_{1}^{n} K(x_i) + n[q(\theta') - q(\theta'')]\right\}.$$

If $\theta' < \theta''$, then, because $p(\theta)$ is an increasing function, this ratio is a decreasing function of $y = \sum_{1}^{n} K(x_i)$. Thus, we have a monotone likelihood ratio in the statistic $Y = \sum_{1}^{n} K(X_i)$. Hence consider the hypotheses

$$H_0: \theta \le \theta' \text{ versus } H_1: \theta > \theta'. \tag{2.5}$$

By our discussion above concerning mlr, the UMP level $\alpha$ decision rule for testing $H_0$ versus $H_1$ is given by

$$\text{Reject } H_0 \text{ if } Y = \sum_{i=1}^{n} K(X_i) \ge c,$$

where $c$ is such that $\alpha = P_{\theta'}[Y \ge c]$. Furthermore, the power function of this test is an increasing function in $\theta$.
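For the Bernoulli case of Example 2.5, the rule "reject $H_0$ if $Y \ge c$" can be computed exactly from the binomial distribution of $Y$. The sketch below is illustrative only: the values $n = 20$, $\theta' = 0.5$, and $\alpha = 0.05$ are assumed for the demonstration and do not come from the text, and because $Y$ is discrete the code picks the smallest $c$ whose size does not exceed $\alpha$ rather than attaining $\alpha$ exactly.

```python
from math import comb

def binom_upper_tail(c, n, theta):
    """P(Y >= c) for Y ~ binomial(n, theta)."""
    return sum(comb(n, y) * theta ** y * (1 - theta) ** (n - y)
               for y in range(c, n + 1))

def critical_value(n, theta0, alpha):
    """Smallest integer c with P_theta0(Y >= c) <= alpha (a non-randomized choice)."""
    for c in range(n + 2):
        if binom_upper_tail(c, n, theta0) <= alpha:
            return c

# Assumed illustration values (hypothetical, not from the text).
n, theta0, alpha = 20, 0.5, 0.05
c = critical_value(n, theta0, alpha)
size = binom_upper_tail(c, n, theta0)
powers = [binom_upper_tail(c, n, th) for th in (0.3, 0.5, 0.7, 0.9)]

print(f"c = {c}, attained size = {size:.4f}")
print("power at theta = 0.3, 0.5, 0.7, 0.9:", [round(p, 4) for p in powers])
```

The attained size falls below the nominal level because of discreteness, and the computed powers increase with $\theta$, in line with the statement that the power function of the test is increasing. Attaining the nominal level exactly would require a randomized test of the kind mentioned in Exercise 2.12(d).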

For the record, consider the other one-sided alternative hypotheses,

$$H_0: \theta \ge \theta' \text{ versus } H_1: \theta < \theta'. \tag{2.6}$$

The UMP level $\alpha$ decision rule is, for $p(\theta)$ an increasing function,

$$\text{Reject } H_0 \text{ if } Y = \sum_{i=1}^{n} K(X_i) \le c,$$

where $c$ is such that $\alpha = P_{\theta'}[Y \le c]$.

If in the preceding situation with monotone likelihood ratio we test $H_0: \theta = \theta'$ against $H_1: \theta > \theta'$, then $\sum K(x_i) \ge c$ would be a uniformly most powerful critical region. From the likelihood ratios displayed in Examples 2.2–2.5, we see immediately that the respective critical regions

$$\sum_{i=1}^{n} x_i^2 \ge c, \quad \sum_{i=1}^{n} x_i \ge c, \quad \sum_{i=1}^{n} x_i \ge c, \quad \sum_{i=1}^{n} x_i \ge c$$

are uniformly most powerful for testing $H_0: \theta = \theta'$ against $H_1: \theta > \theta'$.

There is a final remark that should be made about uniformly most powerful tests. Of course, in Definition 2.1, the word uniformly is associated with $\theta$; that is, $C$ is a best critical region of size $\alpha$ for testing $H_0: \theta = \theta_0$ against all $\theta$ values given by the composite alternative $H_1$. However, suppose that the form of such a region is $u(x_1, x_2, \ldots, x_n) \le c$. Then this form provides uniformly most powerful critical regions for all attainable $\alpha$ values by, of course, appropriately changing the value of $c$. That is, there is a certain uniformity property, also associated with $\alpha$, that is not always noted in statistics texts.

EXERCISES

2.1. Let $X$ have the pmf $f(x; \theta) = \theta^x(1 - \theta)^{1-x}$, $x = 0, 1$, zero elsewhere. We test the simple hypothesis $H_0: \theta = \frac{1}{4}$ against the alternative composite hypothesis $H_1: \theta < \frac{1}{4}$ by taking a random sample of size 10 and rejecting $H_0: \theta = \frac{1}{4}$ if and only if the observed values $x_1, x_2, \ldots, x_{10}$ of the sample observations are such that $\sum_{1}^{10} x_i \le 1$. Find the power function $\gamma(\theta)$, $0 < \theta \le \frac{1}{4}$, of this test.

2.2. Let $X$ have a pdf of the form $f(x; \theta) = 1/\theta$, $0 < x < \theta$, zero elsewhere. Let $Y_1 < Y_2 < Y_3 < Y_4$ denote the order statistics of a random sample of size 4 from this distribution. Let the observed value of $Y_4$ be $y_4$. We reject $H_0: \theta = 1$ and accept $H_1: \theta \ne 1$ if either $y_4 \le \frac{1}{2}$ or $y_4 > 1$. Find the power function $\gamma(\theta)$, $0 < \theta$, of the test.

2.3. Consider a normal distribution of the form $N(\theta, 4)$. The simple hypothesis $H_0: \theta = 0$ is rejected, and the alternative composite hypothesis $H_1: \theta > 0$ is accepted if and only if the observed mean $\bar{x}$ of a random sample of size 25 is greater than or equal to $\frac{3}{5}$. Find the power function $\gamma(\theta)$, $0 \le \theta$, of this test.

2.4. Consider the distributions $N(\mu_1, 400)$ and $N(\mu_2, 225)$. Let $\theta = \mu_1 - \mu_2$. Let $\bar{x}$ and $\bar{y}$ denote the observed means of two independent random samples, each of size $n$, from these two distributions. We reject $H_0: \theta = 0$ and accept $H_1: \theta > 0$ if and only if $\bar{x} - \bar{y} \ge c$. If $\gamma(\theta)$ is the power function of this test, find $n$ and $c$ so that $\gamma(0) = 0.05$ and $\gamma(10) = 0.90$, approximately.

2.5. Consider Example 2.2. Show that $L(\theta)$ has a monotone likelihood ratio in the statistic $\sum_{i=1}^{n} X_i^2$. Use this to determine the UMP test for $H_0: \theta = \theta'$, where $\theta'$ is a fixed positive number, versus $H_1: \theta < \theta'$.

2.6. If, in Example 2.2 of this section, $H_0: \theta = \theta'$, where $\theta'$ is a fixed positive number, and $H_1: \theta \ne \theta'$, show that there is no uniformly most powerful test for testing $H_0$ against $H_1$.

2.7. Let $X_1, X_2, \ldots, X_{25}$ denote a random sample of size 25 from a normal distribution $N(\theta, 100)$. Find a uniformly most powerful critical region of size $\alpha = 0.10$ for testing $H_0: \theta = 75$ against $H_1: \theta > 75$.

2.8. Let $X_1, X_2, \ldots, X_n$ denote a random sample from a normal distribution $N(\theta, 16)$. Find the sample size $n$ and a uniformly most powerful test of $H_0: \theta = 25$ against $H_1: \theta < 25$ with power function $\gamma(\theta)$ so that approximately $\gamma(25) = 0.10$ and $\gamma(23) = 0.90$.

2.9. Consider a distribution having a pmf of the form $f(x; \theta) = \theta^x(1 - \theta)^{1-x}$, $x = 0, 1$, zero elsewhere. Let $H_0: \theta = \frac{1}{20}$ and $H_1: \theta > \frac{1}{20}$. Use the Central Limit Theorem to determine the sample size $n$ of a random sample so that a uniformly most powerful test of $H_0$ against $H_1$ has a power function $\gamma(\theta)$, with approximately $\gamma(\frac{1}{20}) = 0.05$ and $\gamma(\frac{1}{10}) = 0.90$.

2.10. Illustrative Example 2.1 of this section dealt with a random sample of size $n = 2$ from a gamma distribution with $\alpha = 1$, $\beta = \theta$. Thus the mgf of the distribution is $(1 - \theta t)^{-1}$, $t < 1/\theta$, $\theta \ge 2$. Let $Z = X_1 + X_2$. Show that $Z$ has a gamma distribution with $\alpha = 2$, $\beta = \theta$. Express the power function $\gamma(\theta)$ of Example 2.1 in terms of a single integral. Generalize this for a random sample of size $n$.

2.11. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with pdf $f(x; \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, zero elsewhere, where $\theta > 0$. Show the likelihood has mlr in the statistic $\prod_{i=1}^{n} X_i$. Use this to determine the UMP test for $H_0: \theta = \theta'$ against $H_1: \theta < \theta'$, for fixed $\theta' > 0$.

2.12. Let $X$ have the pmf $f(x; \theta) = \theta^x(1 - \theta)^{1-x}$, $x = 0, 1$, zero elsewhere. We test $H_0: \theta = \frac{1}{2}$ against $H_1: \theta < \frac{1}{2}$ by taking a random sample $X_1, X_2, \ldots, X_5$ of size $n = 5$ and rejecting $H_0$ if $Y = \sum_{1}^{n} X_i$ is observed to be less than or equal to a constant $c$.

(a) Show that this is a uniformly most powerful test.

(b) Find the significance level when $c = 1$.

(c) Find the significance level when $c = 0$.

(d) By using a randomized test, as discussed in Example 6.4, modify the tests given in parts (b) and (c) to find a test with significance level $\alpha = \frac{2}{32}$.

2.13. Let $X_1, \ldots, X_n$ denote a random sample from a gamma-type distribution with $\alpha = 2$ and $\beta = \theta$. Let $H_0: \theta = 1$ and $H_1: \theta > 1$.

(a) Show that there exists a uniformly most powerful test for $H_0$ against $H_1$, determine the statistic $Y$ upon which the test may be based, and indicate the nature of the best critical region.

(b) Find the pdf of the statistic $Y$ in part (a). If we want a significance level of 0.05, write an equation which can be used to determine the critical region. Let $\gamma(\theta)$, $\theta \ge 1$, be the power function of the test. Express the power function as an integral.

2.14. Show that the mlr test defined by expression (2.3) is an unbiased test for the hypotheses (2.1).

3

Likelihood Ratio Tests

In the first section of this chapter, we presented the most powerful tests for simple versus simple hypotheses. In the second section, we extended this theory to uniformly most powerful tests for essentially one-sided alternative hypotheses and families of distributions which have a monotone likelihood ratio. What about the general case? That is, suppose the random variable $X$ has pdf or pmf $f(x; \boldsymbol{\theta})$, where $\boldsymbol{\theta}$ is a vector of parameters in $\Omega$. Let $\omega \subset \Omega$ and consider the hypotheses

$$H_0: \boldsymbol{\theta} \in \omega \text{ versus } H_1: \boldsymbol{\theta} \in \Omega \cap \omega^c. \tag{3.1}$$

There are complications in extending the optimal theory to this general situation, which are addressed in more advanced books; see, in particular, Lehmann (1986). We illustrate some of these complications with an example. Suppose $X$ has a $N(\theta_1, \theta_2)$ distribution and that we want to test $\theta_1 = \theta_1'$, where $\theta_1'$ is specified. In the notation of (3.1), $\boldsymbol{\theta} = (\theta_1, \theta_2)$, $\Omega = \{\boldsymbol{\theta} : -\infty < \theta_1 < \infty, \theta_2 > 0\}$, and $\omega = \{\boldsymbol{\theta} : \theta_1 = \theta_1', \theta_2 > 0\}$. Notice that $H_0: \boldsymbol{\theta} \in \omega$ is a composite null hypothesis. Let $X_1, \ldots, X_n$ be a random sample on $X$.

Assume for the moment that $\theta_2$ is known. Then $H_0$ becomes the simple hypothesis $\theta_1 = \theta_1'$. This is essentially the situation discussed in Example 2.3. There it was shown that no UMP test exists for this situation. If we restrict attention to the class of unbiased tests (Definition 1.2), then a theory of best tests can be constructed; see Lehmann (1986). For our illustrative example, as Exercise 3.18 shows, the test based on the critical region

$$C_2 = \left\{|\bar{X} - \theta_1'| > \sqrt{\frac{\theta_2}{n}}\,z_{\alpha/2}\right\}$$

is unbiased. Then it follows from Lehmann that it is a UMP unbiased level $\alpha$ test.

In practice, though, the variance $\theta_2$ is unknown. In this case, theory for optimal tests can be constructed using the concept of what are called conditional tests. We do not pursue this any further in this text, but refer the interested reader to Lehmann (1986).

Recall that the likelihood ratio tests can be used to test general hypotheses such as (3.1). There is no guarantee that they are optimal. However, like tests based on the Neyman–Pearson Theorem, they are based on a ratio of likelihood functions. In many situations, the likelihood ratio test statistics are optimal. In the example above on testing for the mean of a normal distribution, with known variance, the likelihood ratio test is the same as the UMP unbiased test. When the variance is unknown, the likelihood ratio test results in the one-sample t-test. This is the same as the conditional test discussed in Lehmann (1986).

You should be familiar with the likelihood ratio tests for several situations; for example, the one-sample t-test for the mean of a normal distribution with unknown variance is derived as a likelihood ratio test. In the remainder of this section, we present likelihood ratio tests for other situations when sampling from normal distributions.

Example 3.1. Let the independent random variables $X$ and $Y$ have distributions that are $N(\theta_1, \theta_3)$ and $N(\theta_2, \theta_3)$, where the means $\theta_1$ and $\theta_2$ and common variance $\theta_3$ are unknown. Then $\Omega = \{(\theta_1, \theta_2, \theta_3) : -\infty < \theta_1 < \infty, -\infty < \theta_2 < \infty, 0 < \theta_3 < \infty\}$. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ denote independent random samples from these distributions. The hypothesis $H_0: \theta_1 = \theta_2$, unspecified, and $\theta_3$ unspecified, is to be tested against all alternatives. Then $\omega = \{(\theta_1, \theta_2, \theta_3) : -\infty < \theta_1 = \theta_2 < \infty, 0 < \theta_3 < \infty\}$. Here $X_1, X_2, \ldots, X_n, Y_1, Y_2, \ldots, Y_m$ are $n + m > 2$ mutually independent random variables having the likelihood functions

$$L(\omega) = \left(\frac{1}{2\pi\theta_3}\right)^{(n+m)/2} \exp\left\{-\frac{1}{2\theta_3}\left[\sum_{1}^{n}(x_i - \theta_1)^2 + \sum_{1}^{m}(y_i - \theta_1)^2\right]\right\}$$

and

$$L(\Omega) = \left(\frac{1}{2\pi\theta_3}\right)^{(n+m)/2} \exp\left\{-\frac{1}{2\theta_3}\left[\sum_{1}^{n}(x_i - \theta_1)^2 + \sum_{1}^{m}(y_i - \theta_2)^2\right]\right\}.$$

If $\partial \log L(\omega)/\partial\theta_1$ and $\partial \log L(\omega)/\partial\theta_3$ are equated to zero, then (Exercise 3.2)

$$\begin{aligned} \sum_{1}^{n}(x_i - \theta_1) + \sum_{1}^{m}(y_i - \theta_1) &= 0 \\ \frac{1}{\theta_3}\left[\sum_{1}^{n}(x_i - \theta_1)^2 + \sum_{1}^{m}(y_i - \theta_1)^2\right] &= n + m. \end{aligned} \tag{3.2}$$

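The solutions for $\theta_1$ and $\theta_3$ are, respectively,

$$\begin{aligned} u &= (n + m)^{-1}\left\{\sum_{1}^{n} x_i + \sum_{1}^{m} y_i\right\} \\ w &= (n + m)^{-1}\left\{\sum_{1}^{n}(x_i - u)^2 + \sum_{1}^{m}(y_i - u)^2\right\}. \end{aligned}$$

Further, $u$ and $w$ maximize $L(\omega)$. The maximum is

$$L(\hat{\omega}) = \left(\frac{e^{-1}}{2\pi w}\right)^{(n+m)/2}.$$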

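In a like manner, if

$$\frac{\partial \log L(\Omega)}{\partial\theta_1}, \quad \frac{\partial \log L(\Omega)}{\partial\theta_2}, \quad \frac{\partial \log L(\Omega)}{\partial\theta_3}$$

are equated to zero, then (Exercise 3.3)

$$\begin{aligned} \sum_{1}^{n}(x_i - \theta_1) &= 0 \\ \sum_{1}^{m}(y_i - \theta_2) &= 0 \\ -(n + m) + \frac{1}{\theta_3}\left[\sum_{1}^{n}(x_i - \theta_1)^2 + \sum_{1}^{m}(y_i - \theta_2)^2\right] &= 0. \end{aligned} \tag{3.3}$$

The solutions for $\theta_1$, $\theta_2$, and $\theta_3$ are, respectively,

$$\begin{aligned} u_1 &= n^{-1}\sum_{1}^{n} x_i \\ u_2 &= m^{-1}\sum_{1}^{m} y_i \\ w' &= (n + m)^{-1}\left[\sum_{1}^{n}(x_i - u_1)^2 + \sum_{1}^{m}(y_i - u_2)^2\right], \end{aligned}$$

and, further, $u_1$, $u_2$, and $w'$ maximize $L(\Omega)$. The maximum is

$$L(\hat{\Omega}) = \left(\frac{e^{-1}}{2\pi w'}\right)^{(n+m)/2},$$

so that

$$\Lambda(x_1, \ldots, x_n, y_1, \ldots, y_m) = \Lambda = \frac{L(\hat{\omega})}{L(\hat{\Omega})} = \left(\frac{w'}{w}\right)^{(n+m)/2}.$$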

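The random variable defined by $\Lambda^{2/(n+m)}$ is

$$\frac{\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2}{\sum_{1}^{n}\{X_i - [(n\bar{X} + m\bar{Y})/(n + m)]\}^2 + \sum_{1}^{m}\{Y_i - [(n\bar{X} + m\bar{Y})/(n + m)]\}^2}.$$

Now

$$\begin{aligned} \sum_{1}^{n}\left(X_i - \frac{n\bar{X} + m\bar{Y}}{n + m}\right)^2 &= \sum_{1}^{n}\left[(X_i - \bar{X}) + \left(\bar{X} - \frac{n\bar{X} + m\bar{Y}}{n + m}\right)\right]^2 \\ &= \sum_{1}^{n}(X_i - \bar{X})^2 + n\left(\bar{X} - \frac{n\bar{X} + m\bar{Y}}{n + m}\right)^2 \end{aligned}$$

and

$$\begin{aligned} \sum_{1}^{m}\left(Y_i - \frac{n\bar{X} + m\bar{Y}}{n + m}\right)^2 &= \sum_{1}^{m}\left[(Y_i - \bar{Y}) + \left(\bar{Y} - \frac{n\bar{X} + m\bar{Y}}{n + m}\right)\right]^2 \\ &= \sum_{1}^{m}(Y_i - \bar{Y})^2 + m\left(\bar{Y} - \frac{n\bar{X} + m\bar{Y}}{n + m}\right)^2. \end{aligned}$$

But

$$n\left(\bar{X} - \frac{n\bar{X} + m\bar{Y}}{n + m}\right)^2 = \frac{m^2 n}{(n + m)^2}(\bar{X} - \bar{Y})^2$$

and

$$m\left(\bar{Y} - \frac{n\bar{X} + m\bar{Y}}{n + m}\right)^2 = \frac{n^2 m}{(n + m)^2}(\bar{X} - \bar{Y})^2.$$

Hence the random variable defined by $\Lambda^{2/(n+m)}$ may be written

$$\frac{\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2}{\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2 + [nm/(n + m)](\bar{X} - \bar{Y})^2} = \frac{1}{1 + \dfrac{[nm/(n + m)](\bar{X} - \bar{Y})^2}{\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2}}.$$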

If the hypothesis $H_0: \theta_1 = \theta_2$ is true, the random variable

$$T = \sqrt{\frac{nm}{n + m}}\,(\bar{X} - \bar{Y})\left\{(n + m - 2)^{-1}\left[\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2\right]\right\}^{-1/2} \tag{3.4}$$

has a t-distribution with $n + m - 2$ degrees of freedom. Thus the random variable defined by $\Lambda^{2/(n+m)}$ is

$$\frac{n + m - 2}{(n + m - 2) + T^2}.$$

The test of $H_0$ against all alternatives may then be based on a t-distribution with $n + m - 2$ degrees of freedom. The likelihood ratio principle calls for the rejection of $H_0$ if and only if $\Lambda \le \lambda_0 < 1$. Thus the significance level of the test is

$$\alpha = P_{H_0}[\Lambda(X_1, \ldots, X_n, Y_1, \ldots, Y_m) \le \lambda_0].$$

However, $\Lambda(X_1, \ldots, X_n, Y_1, \ldots, Y_m) \le \lambda_0$ is equivalent to $|T| \ge c$, and so

$$\alpha = P(|T| \ge c; H_0).$$

For given values of $n$ and $m$, the number $c$ is determined from Table IV in Appendix: Tables of Distributions (with $n + m - 2$ degrees of freedom) to yield a desired $\alpha$. Then $H_0$ is rejected at a significance level $\alpha$ if and only if $|t| \ge c$, where $t$ is the observed value of $T$. If, for instance, $n = 10$, $m = 6$, and $\alpha = 0.05$, then $c = 2.145$.

For this last example, it was found that the likelihood ratio test could be based on a statistic which, when the hypothesis $H_0$ is true, has a t-distribution. To help us compute the powers of these tests at parameter points other than those described by the hypothesis $H_0$, we turn to the following definition.

Definition 3.1. Let the random variable $W$ be $N(\delta, 1)$; let the random variable $V$ be $\chi^2(r)$, and let $W$ and $V$ be independent. The quotient

$$T = \frac{W}{\sqrt{V/r}}$$

is said to have a noncentral t-distribution with $r$ degrees of freedom and noncentrality parameter $\delta$. If $\delta = 0$, we say that $T$ has a central t-distribution.

In the light of this definition, let us reexamine the one-sample t-statistic and the two-sample t-statistic of Example 3.1. For the one-sample statistic we had

$$t(X_1, \ldots, X_n) = \frac{\sqrt{n}\,\bar{X}}{\sqrt{\sum_{1}^{n}(X_i - \bar{X})^2/(n - 1)}} = \frac{\sqrt{n}\,\bar{X}/\sigma}{\sqrt{\sum_{1}^{n}(X_i - \bar{X})^2/[\sigma^2(n - 1)]}}.$$

Here, where $\theta_1$ is the mean of the normal distribution, $W_1 = \sqrt{n}\,\bar{X}/\sigma$ is $N(\sqrt{n}\,\theta_1/\sigma, 1)$, $V_1 = \sum_{1}^{n}(X_i - \bar{X})^2/\sigma^2$ is $\chi^2(n - 1)$, and $W_1$ and $V_1$ are independent. Thus, if $\theta_1 \ne 0$, we see, in accordance with the definition, that $t(X_1, \ldots, X_n)$ has a noncentral t-distribution with $n - 1$ degrees of freedom and noncentrality parameter $\delta_1 = \sqrt{n}\,\theta_1/\sigma$. In Example 3.1 we had

$$T = \frac{W_2}{\sqrt{V_2/(n + m - 2)}},$$

where

$$W_2 = \sqrt{\frac{nm}{n + m}}\,\frac{\bar{X} - \bar{Y}}{\sigma}$$

and

$$V_2 = \frac{\sum_{1}^{n}(X_i - \bar{X})^2 + \sum_{1}^{m}(Y_i - \bar{Y})^2}{\sigma^2}.$$

Here $W_2$ is $N[\sqrt{nm/(n + m)}\,(\theta_1 - \theta_2)/\sigma, 1]$, $V_2$ is $\chi^2(n + m - 2)$, and $W_2$ and $V_2$ are independent. Accordingly, if $\theta_1 \ne \theta_2$, $T$ has a noncentral t-distribution with $n + m - 2$ degrees of freedom and noncentrality parameter $\delta_2 = \sqrt{nm/(n + m)}\,(\theta_1 - \theta_2)/\sigma$. It is interesting to note that $\delta_1 = \sqrt{n}\,\theta_1/\sigma$ measures the deviation of $\theta_1$ from $\theta_1 = 0$ in units of the standard deviation $\sigma/\sqrt{n}$ of $\bar{X}$. The noncentrality parameter $\delta_2 = \sqrt{nm/(n + m)}\,(\theta_1 - \theta_2)/\sigma$ is equal to the deviation of $\theta_1 - \theta_2$ from $\theta_1 - \theta_2 = 0$ in units of the standard deviation $\sigma\sqrt{(n + m)/(mn)}$ of $\bar{X} - \bar{Y}$.

The package R contains functions which evaluate noncentral t-distributional quantities. For example, to obtain the value $P(T \le t)$ when $T$ has a t-distribution with a degrees of freedom and noncentrality parameter b, use the command pt(t, a, ncp=b). For the value of the associated pdf at t, use the command dt(t, a, ncp=b). There are also various tables of the noncentral t-distribution, but they are much too cumbersome to be included in this chapter.

Remark 3.1. The one- and two-sample tests for normal means are the tests presented in most elementary statistics books. They are based on the assumption of normality. What if the underlying distributions are not normal? In that case, with finite variances, the t-test statistics for these situations are asymptotically correct. For example, consider the one-sample t-test. Suppose $X_1, \ldots, X_n$ are iid with a common nonnormal pdf which has mean $\theta_1$ and finite variance $\sigma^2$. The hypotheses remain the same, i.e., $H_0: \theta_1 = \theta_1'$ versus $H_1: \theta_1 \ne \theta_1'$. The t-test statistic, $T_n$, is given by

$$T_n = \frac{\sqrt{n}(\bar{X} - \theta_1')}{S_n}, \tag{3.5}$$

where $S_n$ is the sample standard deviation. Our critical region is $C_1 = \{|T_n| \ge t_{\alpha/2, n-1}\}$. Recall that $S_n \to \sigma$ in probability. Hence, by the Central Limit Theorem, under $H_0$,

$$T_n = \frac{\sigma}{S_n}\,\frac{\sqrt{n}(\bar{X} - \theta_1')}{\sigma} \xrightarrow{D} Z, \tag{3.6}$$

where $Z$ has a standard normal distribution. Hence the asymptotic test would use the critical region $C_2 = \{|T_n| \ge z_{\alpha/2}\}$. By (3.6) the critical region $C_2$ would have approximate size $\alpha$. In practice, we would use $C_1$. Because t critical values are generally larger than z critical values, the use of $C_1$ would be conservative; i.e., the size of $C_1$ would be slightly smaller than that of $C_2$.

For nonnormal situations where the distribution is "close" to the normal distribution, the t-test is essentially valid; i.e., the true level of significance is close to the nominal $\alpha$. In terms of robustness, we would say that for these situations the t-test possesses robustness of validity. But the t-test may not possess robustness of power. For nonnormal situations, there are more powerful tests than the t-test.

For distributions which are decidedly not normal, very skewed for instance, the validity of the t-test may be questionable. Example 3.2 shows that the t-test may be quite liberal (empirical $\alpha$ levels much larger than the nominal $\alpha$ level) for such situations. As Exercise 3.4 shows, the two-sample t-test is also asymptotically correct, provided the underlying distributions have the same variance.

Example 3.2 (Skewed Contaminated Normal Distribution).
Consider the random variable $X$ given by

$$X = (1 - I_\epsilon)Z + I_\epsilon Y, \tag{3.7}$$

where $Z$ has a $N(0, 1)$ distribution, $Y$ has a $N(\mu_c, \sigma_c^2)$ distribution, $I_\epsilon$ has a bin$(1, \epsilon)$ distribution, and $Z$, $Y$, and $I_\epsilon$ are mutually independent. Assume that $\epsilon < 0.5$ and $\sigma_c > 1$, so that $Y$ is the contaminating random variable in the mixture. Note that if $\mu_c = 0$, then $X$ has a contaminated normal distribution, which is symmetrically distributed about 0. For $\mu_c \ne 0$, the distribution of $X$, (3.7), is skewed and we call it the skewed contaminated normal distribution, $SCN(\epsilon, \sigma_c, \mu_c)$. Note that $E(X) = \epsilon\mu_c$ and in Exercise 3.15 the cdf and pdf of $X$ are derived. In this example, we show the results of a small simulation study on the validity of the t-test for random samples from the distribution of $X$. Consider the one-sided hypotheses $H_0: \mu = \mu_X$ versus $H_1: \mu < \mu_X$.

Table 3.1: Empirical α Levels for the Nominal 0.05 t-Test of Example 3.2

μc             0        5        10       15       20
Empirical α    0.0458   0.0961   0.1238   0.1294   0.1301
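A simulation of the kind summarized in Table 3.1 can be sketched as follows, using the settings of this example ($n = 30$, $\epsilon = 0.20$, $\sigma_c = 25$, 10,000 replications). This is a minimal sketch, not the authors' code: the function names and the seed are illustrative, the lower-tail rule "reject if $T_n \le -t_{0.05, n-1}$" is used for the alternative $\mu < \mu_X$, and the critical value $t_{0.05, 29} \approx 1.699$ is hard-coded from a standard t-table. Individual runs will not reproduce the table entries digit for digit.

```python
import math
import random

def scn_sample(n, eps, sigma_c, mu_c, rng):
    """Draw n values of X = (1 - I)Z + I*Y with I ~ bin(1, eps),
    Z ~ N(0, 1), and Y ~ N(mu_c, sigma_c^2)."""
    return [rng.gauss(mu_c, sigma_c) if rng.random() < eps else rng.gauss(0.0, 1.0)
            for _ in range(n)]

def empirical_level(mu_c, n=30, eps=0.20, sigma_c=25.0,
                    nsims=10_000, t_crit=1.699, seed=1):
    """Fraction of simulated samples in which H0: mu = mu_X is rejected
    in favor of mu < mu_X by the rule T_n <= -t_crit.

    t_crit is the upper 0.05 critical value of t with n - 1 = 29 degrees
    of freedom (taken from a table); the seed is arbitrary.
    """
    rng = random.Random(seed)
    mu_x = eps * mu_c  # true mean of X, so H0 holds in every simulated sample
    rejections = 0
    for _ in range(nsims):
        x = scn_sample(n, eps, sigma_c, mu_c, rng)
        xbar = sum(x) / n
        s2 = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
        t_stat = (xbar - mu_x) / math.sqrt(s2 / n)
        if t_stat <= -t_crit:
            rejections += 1
    return rejections / nsims

if __name__ == "__main__":
    for mu_c in (0, 5, 10, 15, 20):
        print(mu_c, empirical_level(mu_c))
```

Because $H_0$ is true in every simulated sample, each returned fraction estimates the actual size of the nominal 0.05 test for that value of $\mu_c$.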

Let $X_1, X_2, \ldots, X_n$ be a random sample from the distribution of $X$. As a test statistic we consider the t-test given in expression (3.5); that is, the test statistic is $T_n = (\bar{X} - \mu_X)/(S_n/\sqrt{n})$, where $\bar{X}$ and $S_n$ are the sample mean and standard deviation of $X_1, X_2, \ldots, X_n$, respectively. We set the level of significance at $\alpha = 0.05$ and used the decision rule: Reject $H_0$ if $T_n \le -t_{0.05, n-1}$. For the study, we set $n = 30$, $\epsilon = 0.20$, and $\sigma_c = 25$. We chose the five values of 0, 5, 10, 15, and 20 for $\mu_c$, as shown in Table 3.1. For each of these five situations, we ran 10,000 simulations and recorded $\hat{\alpha}$, which is the number of rejections of $H_0$ divided by the number of simulations, i.e., the empirical $\alpha$ level.

For the test to be valid, $\hat{\alpha}$ should be close to the nominal value of 0.05. As Table 3.1 shows, though, for all cases other than $\mu_c = 0$, the t-test is quite liberal; that is, its empirical significance level far exceeds the nominal 0.05 level (as Exercise 3.16 shows, the sampling error in the table is about 0.004). Note that when $\mu_c = 0$ the distribution of $X$ is symmetric about 0 and in this case the empirical level is close to the nominal value of 0.05.

In Example 3.1, in testing the equality of the means of two normal distributions, it was assumed that the unknown variances of the distributions were equal. Let us now consider the problem of testing the equality of these two unknown variances.

Example 3.3. We are given the independent random samples $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ from the distributions, which are $N(\theta_1, \theta_3)$ and $N(\theta_2, \theta_4)$, respectively. We have

$$\Omega = \{(\theta_1, \theta_2, \theta_3, \theta_4) : -\infty < \theta_1, \theta_2 < \infty, 0 < \theta_3, \theta_4 < \infty\}.$$

The hypothesis $H_0: \theta_3 = \theta_4$, unspecified, with $\theta_1$ and $\theta_2$ also unspecified, is to be tested against all alternatives. Then

$$\omega = \{(\theta_1, \theta_2, \theta_3, \theta_4) : -\infty < \theta_1, \theta_2 < \infty, 0 < \theta_3 = \theta_4 < \infty\}.$$

It is easy to show (see Exercise 3.8) that the statistic defined by $\Lambda = L(\hat{\omega})/L(\hat{\Omega})$ is a function of the statistic

$$F = \frac{\sum_{1}^{n}(X_i - \bar{X})^2/(n - 1)}{\sum_{1}^{m}(Y_i - \bar{Y})^2/(m - 1)}. \tag{3.8}$$

If θ3 = θ4 , this statistic F has an F -distribution with n − 1 and m − 1 degrees of freedom. The hypothesis that (θ1 , θ2 , θ3 , θ4 ) ∈ ω is rejected if the computed F ≤ c1

464

Optimal Tests of Hypotheses or if the computed F ≥ c2 . The constants c1 and c2 are usually selected so that, if θ3 = θ4 , α1 , P (F ≤ c1 ) = P (F ≥ c2 ) = 2 where α1 is the desired signiﬁcance level of this test. Example 3.4. Let the independent random variables X and Y have distributions that are N (θ1 , θ3 ) and N (θ2 , θ4 ). In Example 3.1, we derived the likelihood ratio test statistic T of the hypothesis θ1 = θ2 when θ3 = θ4 , while in Example 3.3 we obtained the likelihood ratio test statistic F of the hypothesis θ3 = θ4 . The hypothesis that θ1 = θ2 is rejected if the computed |T | ≥ c, where the constant c is selected so that α2 = P (|T | ≥ c; θ1 = θ2 , θ3 = θ4 ) is the assigned signiﬁcance level of the test. We shall show that, if θ3 = θ4 , the likelihood ratio test statistics for equality of variances and equality of means, respectively F and T , are independent. Among other things, this means that if these two tests based on F and T , respectively, are performed sequentially with signiﬁcance levels α1 and α2 , the probability of accepting both these hypotheses, when they are true, is (1 − α1 )(1 − α2 ). Thus the signiﬁcance level of this joint test is α = 1 − (1 − α1 )(1 − α2 ). suﬃciency Independence of F and T , when θ3 = θ4 , can n be established using n 2 (X − X) + (Y − Y )2 are and completeness. The statistics X, Y , and i i 1 1 joint complete suﬃcient statistics for the three parameters θ1 , θ2 , and θ3 = θ4 . Obviously, the distribution of F does not depend upon θ1 , θ2 , or θ3 = θ4 , and hence F is independent of the three joint complete suﬃcient statistics. However, T is a function of these three joint complete suﬃcient statistics alone, and, accordingly, T is independent of F . It is important to note that these two statistics are independent whether θ1 = θ2 or θ1 = θ2 . This permits us to calculate probabilities other than the signiﬁcance level of the test. 
For example, if $\theta_3 = \theta_4$ and $\theta_1 \ne \theta_2$, then

$$P(c_1 < F < c_2,\ |T| \ge c) = P(c_1 < F < c_2)\, P(|T| \ge c).$$

The second factor in the right-hand member is evaluated by using the probabilities of a noncentral $t$-distribution. Of course, if $\theta_3 = \theta_4$ and the difference $\theta_1 - \theta_2$ is large, we would want the preceding probability to be close to 1 because the event $\{c_1 < F < c_2,\ |T| \ge c\}$ leads to a correct decision, namely, accept $\theta_3 = \theta_4$ and reject $\theta_1 = \theta_2$.

Remark 3.2. We caution the reader on this last test for the equality of two variances. In Remark 3.1, we discussed that the one- and two-sample $t$-tests for means are asymptotically correct. The two-sample variance test of the last example is not, however; see, for example, page 143 of Hettmansperger and McKean (2011). If the underlying distributions are not normal, then the $F$-critical values may be far from valid critical values (unlike the $t$-critical values for the means tests as discussed in Remark 3.1). In a large simulation study, Conover, Johnson, and Johnson (1981) showed that instead of having the nominal size of $\alpha = 0.05$, the $F$-test for variances using the $F$-critical values could have significance levels as high as 0.80 in certain nonnormal situations. Thus the two-sample $F$-test for variances does not possess robustness of validity. It should only be used in situations where the assumption of normality can be justified. See Exercise 3.14 for an illustrative data set.

In the above examples, we were able to determine the null distribution of the test statistic. This is often impossible in practice. However, minus twice the log of the likelihood ratio test statistic is asymptotically $\chi^2$ under $H_0$. Hence we can obtain an approximate test in most situations.

EXERCISES

3.1. In Example 3.1, suppose $n = m = 8$, $\bar{x} = 75.2$, $\bar{y} = 78.6$, $\sum_{1}^{8} (x_i - \bar{x})^2 = 71.2$, and $\sum_{1}^{8} (y_i - \bar{y})^2 = 54.8$. If we use the test derived in that example, do we accept or reject $H_0: \theta_1 = \theta_2$ at the 5% significance level? Obtain the p-value of the test.

3.2. Verify Equations (3.2) of Example 3.1 of this section.

3.3. Verify Equations (3.3) of Example 3.1 of this section.

3.4. Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ follow the location model

$$X_i = \theta_1 + Z_i, \quad i = 1, \ldots, n,$$
$$Y_i = \theta_2 + Z_{n+i}, \quad i = 1, \ldots, m,$$
where $Z_1, \ldots, Z_{n+m}$ are iid random variables with common pdf $f(z)$. Assume that $E(Z_i) = 0$ and $\mathrm{Var}(Z_i) = \theta_3 < \infty$.
(a) Show that $E(X_i) = \theta_1$, $E(Y_i) = \theta_2$, and $\mathrm{Var}(X_i) = \mathrm{Var}(Y_i) = \theta_3$.
(b) Consider the hypotheses of Example 3.1, i.e., $H_0: \theta_1 = \theta_2$ versus $H_1: \theta_1 \ne \theta_2$. Show that under $H_0$, the test statistic $T$ given in expression (3.4) has a limiting $N(0, 1)$ distribution.
(c) Using part (b), determine the corresponding large sample test (decision rule) of $H_0$ versus $H_1$. (This shows that the test in Example 3.1 is asymptotically correct.)

3.5. Show that the likelihood ratio principle leads to the same test when testing a simple hypothesis $H_0$ against an alternative simple hypothesis $H_1$, as that given by the Neyman–Pearson theorem. Note that there are only two points in $\Omega$.

3.6. Let $X_1, X_2, \ldots, X_n$ be a random sample from the normal distribution $N(\theta, 1)$. Show that the likelihood ratio principle for testing $H_0: \theta = \theta'$, where $\theta'$ is specified, against $H_1: \theta \ne \theta'$ leads to the inequality $|\bar{x} - \theta'| \ge c$.
(a) Is this a uniformly most powerful test of $H_0$ against $H_1$?
(b) Is this a uniformly most powerful unbiased test of $H_0$ against $H_1$?

3.7. Let $X_1, X_2, \ldots, X_n$ be iid $N(\theta_1, \theta_2)$. Show that the likelihood ratio principle for testing $H_0: \theta_2 = \theta_2'$ specified, and $\theta_1$ unspecified, against $H_1: \theta_2 \ne \theta_2'$, $\theta_1$ unspecified, leads to a test that rejects when $\sum_{1}^{n} (x_i - \bar{x})^2 \le c_1$ or $\sum_{1}^{n} (x_i - \bar{x})^2 \ge c_2$, where $c_1 < c_2$ are selected appropriately.

3.8. Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be independent random samples from the distributions $N(\theta_1, \theta_3)$ and $N(\theta_2, \theta_4)$, respectively.
(a) Show that the likelihood ratio for testing $H_0: \theta_1 = \theta_2,\ \theta_3 = \theta_4$ against all alternatives is given by

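$$\frac{\left[\sum_{1}^{n} (x_i - \bar{x})^2 / n\right]^{n/2} \left[\sum_{1}^{m} (y_i - \bar{y})^2 / m\right]^{m/2}}{\left\{\left[\sum_{1}^{n} (x_i - u)^2 + \sum_{1}^{m} (y_i - u)^2\right] / (m + n)\right\}^{(n+m)/2}},$$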

where $u = (n\bar{x} + m\bar{y})/(n + m)$.
(b) Show that the likelihood ratio for testing $H_0: \theta_3 = \theta_4$ with $\theta_1$ and $\theta_2$ unspecified can be based on the test statistic $F$ given in expression (3.8).

3.9. Let $Y_1 < Y_2 < \cdots < Y_5$ be the order statistics of a random sample of size $n = 5$ from a distribution with pdf $f(x; \theta) = \frac{1}{2} e^{-|x - \theta|}$, $-\infty < x < \infty$, for all real $\theta$. Find the likelihood ratio test $\Lambda$ for testing $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$.

3.10. A random sample $X_1, X_2, \ldots, X_n$ arises from a distribution given by

$$H_0: f(x; \theta) = \frac{1}{\theta}, \quad 0 < x < \theta, \quad \text{zero elsewhere},$$

or

$$H_1: f(x; \theta) = \frac{1}{\theta} e^{-x/\theta}, \quad 0 < x < \infty, \quad \text{zero elsewhere}.$$

Determine the likelihood ratio ($\Lambda$) test associated with the test of $H_0$ against $H_1$.

3.11. Consider a random sample $X_1, X_2, \ldots, X_n$ from a distribution with pdf $f(x; \theta) = \theta (1 - x)^{\theta - 1}$, $0 < x < 1$, zero elsewhere, where $\theta > 0$.
(a) Find the form of the uniformly most powerful test of $H_0: \theta = 1$ against $H_1: \theta > 1$.
(b) What is the likelihood ratio $\Lambda$ for testing $H_0: \theta = 1$ against $H_1: \theta \ne 1$?

3.12. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$ be independent random samples from two normal distributions $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, respectively, where $\sigma^2$ is the common but unknown variance.

(a) Find the likelihood ratio $\Lambda$ for testing $H_0: \mu_1 = \mu_2 = 0$ against all alternatives.
(b) Rewrite $\Lambda$ so that it is a function of a statistic $Z$ which has a well-known distribution.
(c) Give the distribution of $Z$ under both null and alternative hypotheses.

3.13. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be a random sample from a bivariate normal distribution with $\mu_1$, $\mu_2$, $\sigma_1^2 = \sigma_2^2 = \sigma^2$, $\rho = \frac{1}{2}$, where $\mu_1$, $\mu_2$, and $\sigma^2 > 0$ are unknown real numbers. Find the likelihood ratio $\Lambda$ for testing $H_0: \mu_1 = \mu_2 = 0$, $\sigma^2$ unknown, against all alternatives. The likelihood ratio $\Lambda$ is a function of what statistic that has a well-known distribution?

3.14. Let $X$ be a random variable with pdf $f_X(x) = (2 b_X)^{-1} \exp\{-|x| / b_X\}$, for $-\infty < x < \infty$ and $b_X > 0$. First, show that the variance of $X$ is $\sigma_X^2 = 2 b_X^2$. Next, let $Y$, independent of $X$, have pdf $f_Y(y) = (2 b_Y)^{-1} \exp\{-|y| / b_Y\}$, for $-\infty < y < \infty$ and $b_Y > 0$. Consider the hypotheses

$$H_0: \sigma_X^2 = \sigma_Y^2 \quad \text{versus} \quad H_1: \sigma_X^2 > \sigma_Y^2.$$

To illustrate Remark 3.2 for testing these hypotheses, consider the following data set. Sample 1 represents the values of a sample drawn on X with bX = 1, while Sample 2 represents the values of a sample drawn on Y with bY = 1. Hence, in this case H0 is true. Sample 1 Sample 1 Sample 2 Sample 2

−0.389 −0.110 0.763 0.403 −1.067 −0.634 −0.775 0.213

−2.177 −0.709 −0.570 0.778 −0.577 −0.996 −1.421 1.425

0.813 0.456 −2.565 −0.115 0.361 −0.181 −0.818 −0.165

−0.001 0.135 −1.733 −0.680 0.239 0.328

(a) Obtain comparison boxplots of these two samples. Comparison boxplots consist of boxplots of both samples drawn on the same scale. Based on these plots, in particular the interquartile ranges, what do you conclude about H0 ? (b) Obtain the F -test (for a one-sided hypothesis) as discussed in Remark 3.2 at level α = 0.10. What is your conclusion? (c) The test in part (b) is not exact. Why? 3.15. For the skewed contaminated normal random variable X of Example 3.2, derive the cdf, pdf, mean, and variance of X. 3.16. For Table 3.1 of Example 3.2, show that the half-width of the 95% conﬁdence interval for a binomial proportion is 0.004 at the nominal value of 0.05.

3.17. If computational facilities are available, perform a Monte Carlo study of the two-sided $t$-test for the skewed contaminated normal situation of Example 3.2. The R function rscn of Appendix: R Functions generates variates from the distribution of $X$.

3.18. Suppose $X_1, \ldots, X_n$ is a random sample on $X$ which has a $N(\mu, \sigma_0^2)$ distribution, where $\sigma_0^2$ is known. Consider the two-sided hypotheses $H_0: \mu = 0$ versus $H_1: \mu \ne 0$. Show that the test based on the critical region $C = \{|\bar{X}| > \sqrt{\sigma_0^2 / n}\; z_{\alpha/2}\}$ is an unbiased level $\alpha$ test.

3.19. Assume the same situation as in the last exercise but consider the test with critical region $C^* = \{\bar{X} > \sqrt{\sigma_0^2 / n}\; z_{\alpha}\}$. Show that the test based on $C^*$ has significance level $\alpha$ but that it is not an unbiased test.

4 The Sequential Probability Ratio Test

Theorem 1.1 provides us with a method for determining a best critical region for testing a simple hypothesis against an alternative simple hypothesis. Recall its statement: Let $X_1, X_2, \ldots, X_n$ be a random sample with fixed sample size $n$ from a distribution that has pdf or pmf $f(x; \theta)$, where $\Omega = \{\theta : \theta = \theta', \theta''\}$ and $\theta'$ and $\theta''$ are known numbers. For this section, we denote the likelihood of $X_1, X_2, \ldots, X_n$ by

$$L(\theta; n) = f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta),$$

a notation that reveals both the parameter $\theta$ and the sample size $n$. If we reject $H_0: \theta = \theta'$ and accept $H_1: \theta = \theta''$ when and only when

$$\frac{L(\theta'; n)}{L(\theta''; n)} \le k,$$

where $k > 0$, then by Theorem 1.1 this is a best test of $H_0$ against $H_1$.

Let us now suppose that the sample size $n$ is not fixed in advance. In fact, let the sample size be a random variable $N$ with sample space $\{1, 2, 3, \ldots\}$. An interesting procedure for testing the simple hypothesis $H_0: \theta = \theta'$ against the simple hypothesis $H_1: \theta = \theta''$ is the following: Let $k_0$ and $k_1$ be two positive constants with $k_0 < k_1$. Observe the independent outcomes $X_1, X_2, X_3, \ldots$ in a sequence, for example, $x_1, x_2, x_3, \ldots$, and compute

$$\frac{L(\theta'; 1)}{L(\theta''; 1)},\ \frac{L(\theta'; 2)}{L(\theta''; 2)},\ \frac{L(\theta'; 3)}{L(\theta''; 3)},\ \ldots.$$

The hypothesis $H_0: \theta = \theta'$ is rejected (and $H_1: \theta = \theta''$ is accepted) if and only if there exists a positive integer $n$ so that $\mathbf{x}_n = (x_1, x_2, \ldots, x_n)$ belongs to the set

$$C_n = \left\{ \mathbf{x}_n : k_0 < \frac{L(\theta', j)}{L(\theta'', j)} < k_1,\ j = 1, \ldots, n - 1,\ \text{and}\ \frac{L(\theta', n)}{L(\theta'', n)} \le k_0 \right\}. \tag{4.1}$$

On the other hand, the hypothesis $H_0: \theta = \theta'$ is accepted (and $H_1: \theta = \theta''$ is rejected) if and only if there exists a positive integer $n$ so that $(x_1, x_2, \ldots, x_n)$ belongs to the set

$$B_n = \left\{ \mathbf{x}_n : k_0 < \frac{L(\theta', j)}{L(\theta'', j)} < k_1,\ j = 1, \ldots, n - 1,\ \text{and}\ \frac{L(\theta', n)}{L(\theta'', n)} \ge k_1 \right\}. \tag{4.2}$$

That is, we continue to observe sample observations as long as $k_0 < L(\theta', n)/L(\theta'', n) < k_1$.

− log k1 = −h.

i=1
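As a rough illustration of the sequential rule defined by the sets $C_n$ and $B_n$ in (4.1) and (4.2), the following Python sketch accumulates the log of the likelihood ratio $L(\theta'; n)/L(\theta''; n)$ and stops as soon as it leaves the interval $(\log k_0, \log k_1)$. The function names `wald_bounds`, `sprt`, and `normal_log_ratio` are hypothetical. The choice $k_0 = \alpha_a/(1 - \beta_a)$ and $k_1 = (1 - \alpha_a)/\beta_a$ is Wald's usual approximation and is assumed here rather than derived. The example assumes $N(\theta, 1)$ observations with $\theta' = 0$ and $\theta'' = 1$, for which each observation $z$ contributes $0.5 - z$ to the log ratio.

```python
import math


def wald_bounds(alpha_a, beta_a):
    """Approximate SPRT constants (k0, k1); Wald's approximation, assumed here."""
    return alpha_a / (1.0 - beta_a), (1.0 - alpha_a) / beta_a


def sprt(observations, log_ratio, k0, k1):
    """Sequentially test H0: theta = theta' against H1: theta = theta''.

    log_ratio(x) must return log f(x; theta') - log f(x; theta'').
    Returns (decision, n), where n is the number of observations used.
    """
    log_k0, log_k1 = math.log(k0), math.log(k1)
    total, n = 0.0, 0
    for n, x in enumerate(observations, start=1):
        total += log_ratio(x)          # log of L(theta'; n) / L(theta''; n)
        if total <= log_k0:            # ratio <= k0: x_n falls in C_n
            return "reject H0", n
        if total >= log_k1:            # ratio >= k1: x_n falls in B_n
            return "accept H0", n
    return "no decision", n            # data ran out before a boundary was hit


def normal_log_ratio(z):
    """Log ratio for N(theta, 1) with theta' = 0 and theta'' = 1."""
    return 0.5 - z


# Illustrative (made-up) data: a constant stream above and below 0.5.
k0, k1 = wald_bounds(0.01, 0.01)
shifted = sprt([1.5] * 20, normal_log_ratio, k0, k1)
in_control = sprt([-0.5] * 20, normal_log_ratio, k0, k1)
```

With $\alpha_a = \beta_a$ the two log boundaries are symmetric about zero, so a single constant $h = -\log k_0$ describes both.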

It is true that $-\log k_0 = \log k_1$ when $\alpha_a = \beta_a$. Often, $h = -\log k_0$ is taken to be about 4 or 5, suggesting that $\alpha_a = \beta_a$ is small, like 0.01. As $\sum (z_i - 0.5)$ is cumulating the sum of $z_i - 0.5$, $i = 1, 2, 3, \ldots$, these procedures are often called CUSUMS. If the CUSUM $= \sum (z_i - 0.5)$ exceeds $h$, we would investigate the process, as it seems that the mean has shifted upward. If this shift is to $\theta = 1$, the theory associated with these procedures shows that we need only eight or nine samples on the average, rather than 43, to detect this shift. For more information about these methods, the reader is referred to one of the many books on quality improvement through statistical methods. What we would like to emphasize here is that through sequential methods (not only the sequential probability ratio test), we should take advantage of all past experience that we can gather in making inferences.

EXERCISES

4.1. Let $X$ be $N(0, \theta)$ and, in the notation of this section, let $\theta' = 4$, $\theta'' = 9$, $\alpha_a = 0.05$, and $\beta_a = 0.10$. Show that the sequential probability ratio test can be based upon the statistic $\sum_{1}^{n} X_i^2$. Determine $c_0(n)$ and $c_1(n)$.

4.2. Let $X$ have a Poisson distribution with mean $\theta$. Find the sequential probability ratio test for testing $H_0: \theta = 0.02$ against $H_1: \theta = 0.07$. Show that this test can be based upon the statistic $\sum_{1}^{n} X_i$. If $\alpha_a = 0.20$ and $\beta_a = 0.10$, find $c_0(n)$ and $c_1(n)$.

4.3. Let the independent random variables $Y$ and $Z$ be $N(\mu_1, 1)$ and $N(\mu_2, 1)$, respectively. Let $\theta = \mu_1 - \mu_2$. Let us observe independent observations from each distribution, say $Y_1, Y_2, \ldots$ and $Z_1, Z_2, \ldots$. To test sequentially the hypothesis $H_0: \theta = 0$ against $H_1: \theta = \frac{1}{2}$, use the sequence $X_i = Y_i - Z_i$, $i = 1, 2, \ldots$. If $\alpha_a = \beta_a = 0.05$, show that the test can be based upon $\bar{X} = \bar{Y} - \bar{Z}$. Find $c_0(n)$ and $c_1(n)$.

4.4. Suppose that a manufacturing process makes about 3% defective items, which is considered satisfactory for this particular product.
The managers would like to decrease this to about 1% and clearly want to guard against a substantial increase, say to 5%. To monitor the process, periodically $n = 100$ items are taken and the number $X$ of defectives counted. Assume that $X$ is $b(n = 100, p = \theta)$. Based on a sequence $X_1, X_2, \ldots, X_m, \ldots$, determine a sequential probability ratio test that tests $H_0: \theta = 0.01$ against $H_1: \theta = 0.05$. (Note that $\theta = 0.03$, the present level, is in between these two values.) Write this test in the form

$$h_0 > \sum_{i=1}^{m} (x_i - nd) > h_1$$

and determine $d$, $h_0$, and $h_1$ if $\alpha_a = \beta_a = 0.02$.

4.5. Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with pdf $f(x; \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, zero elsewhere.
(a) Find a complete sufficient statistic for $\theta$.
(b) If $\alpha_a = \beta_a = \frac{1}{10}$, find the sequential probability ratio test of $H_0: \theta = 2$ against $H_1: \theta = 3$.

5 Minimax and Classification Procedures

We have considered several procedures which may be used in problems of point estimation. Among these were decision function procedures (in particular, minimax decisions). In this section, we apply minimax procedures to the problem of testing a simple hypothesis H0 against an alternative simple hypothesis H1 . It is important to observe that these procedures yield, in accordance with the Neyman–Pearson theorem, a best test of H0 against H1 . We end this section with a discussion on an application of these procedures to a classiﬁcation problem.

5.1 Minimax Procedures

We first investigate the decision function approach to the problem of testing a simple null hypothesis against a simple alternative hypothesis. Let the joint pdf of the $n$ random variables $X_1, X_2, \ldots, X_n$ depend upon the parameter $\theta$. Here $n$ is a fixed positive integer. This pdf is denoted by $L(\theta; x_1, x_2, \ldots, x_n)$ or, for brevity, by $L(\theta)$. Let $\theta'$ and $\theta''$ be distinct and fixed values of $\theta$. We wish to test the simple hypothesis $H_0: \theta = \theta'$ against the simple hypothesis $H_1: \theta = \theta''$. Thus the parameter space is $\Omega = \{\theta : \theta = \theta', \theta''\}$. In accordance with the decision function procedure, we need a function $\delta$ of the observed values of $X_1, \ldots, X_n$ (or, of the observed value of a statistic $Y$) that decides which of the two values of $\theta$, $\theta'$ or $\theta''$, to accept. That is, the function $\delta$ selects either $H_0: \theta = \theta'$ or $H_1: \theta = \theta''$. We denote these decisions by $\delta = \theta'$ and $\delta = \theta''$, respectively. Let $\mathcal{L}(\theta, \delta)$ represent the loss function associated with this decision problem. Because the pairs $(\theta = \theta', \delta = \theta')$ and $(\theta = \theta'', \delta = \theta'')$ represent correct decisions, we shall always take $\mathcal{L}(\theta', \theta') = \mathcal{L}(\theta'', \theta'') = 0$. On the other hand, if either $\delta = \theta''$ when $\theta = \theta'$ or $\delta = \theta'$ when $\theta = \theta''$, then a positive value should be assigned to the loss function; that is, $\mathcal{L}(\theta', \theta'') > 0$ and $\mathcal{L}(\theta'', \theta') > 0$.

It has previously been emphasized that a test of $H_0: \theta = \theta'$ against $H_1: \theta = \theta''$ can be described in terms of a critical region in the sample space. We can do the same kind of thing with the decision function. That is, we can choose a subset $C$ of the sample space and if $(x_1, x_2, \ldots, x_n) \in C$, we can make the decision $\delta = \theta''$; whereas if $(x_1, x_2, \ldots, x_n) \in C^c$, the complement of $C$, we make the decision $\delta = \theta'$. Thus a given critical region $C$ determines the decision function. In this sense, we may denote the risk function by $R(\theta, C)$ instead of $R(\theta, \delta)$. That is,

$$R(\theta, C) = R(\theta, \delta) = \int_{C \cup C^c} \mathcal{L}(\theta, \delta) L(\theta).$$

Since $\delta = \theta''$ if $(x_1, \ldots, x_n) \in C$ and $\delta = \theta'$ if $(x_1, \ldots, x_n) \in C^c$, we have

$$R(\theta, C) = \int_{C} \mathcal{L}(\theta, \theta'') L(\theta) + \int_{C^c} \mathcal{L}(\theta, \theta') L(\theta). \tag{5.1}$$

If, in Equation (5.1), we take $\theta = \theta'$, then $\mathcal{L}(\theta', \theta') = 0$ and hence

$$R(\theta', C) = \int_{C} \mathcal{L}(\theta', \theta'') L(\theta') = \mathcal{L}(\theta', \theta'') \int_{C} L(\theta').$$

On the other hand, if in Equation (5.1) we let $\theta = \theta''$, then $\mathcal{L}(\theta'', \theta'') = 0$ and, accordingly,

$$R(\theta'', C) = \int_{C^c} \mathcal{L}(\theta'', \theta') L(\theta'') = \mathcal{L}(\theta'', \theta') \int_{C^c} L(\theta'').$$

It is enlightening to note that if $\gamma(\theta)$ is the power function of the test associated with the critical region $C$, then

$$R(\theta', C) = \mathcal{L}(\theta', \theta'') \gamma(\theta') = \mathcal{L}(\theta', \theta'') \alpha,$$

where $\alpha = \gamma(\theta')$ is the significance level; and

$$R(\theta'', C) = \mathcal{L}(\theta'', \theta') [1 - \gamma(\theta'')] = \mathcal{L}(\theta'', \theta') \beta,$$

where $\beta = 1 - \gamma(\theta'')$ is the probability of the type II error.

Let us now see if we can find a minimax solution to our problem. That is, we want to find a critical region $C$ so that

$$\max[R(\theta', C), R(\theta'', C)]$$

is minimized. We shall show that the solution is the region

$$C = \left\{ (x_1, \ldots, x_n) : \frac{L(\theta'; x_1, \ldots, x_n)}{L(\theta''; x_1, \ldots, x_n)} \le k \right\},$$

provided the positive constant $k$ is selected so that $R(\theta', C) = R(\theta'', C)$. That is, if $k$ is chosen so that

$$\mathcal{L}(\theta', \theta'') \int_{C} L(\theta') = \mathcal{L}(\theta'', \theta') \int_{C^c} L(\theta''),$$

then the critical region $C$ provides a minimax solution. In the case of random variables of the continuous type, $k$ can always be selected so that $R(\theta', C) = R(\theta'', C)$.

However, with random variables of the discrete type, we may need to consider an auxiliary random experiment when $L(\theta')/L(\theta'') = k$ in order to achieve the exact equality $R(\theta', C) = R(\theta'', C)$.

To see that $C$ is the minimax solution, consider every other region $A$ for which $R(\theta', C) \ge R(\theta', A)$. A region $A$ for which $R(\theta', C) < R(\theta', A)$ is not a candidate for a minimax solution, for then $R(\theta', C) = R(\theta'', C) < \max[R(\theta', A), R(\theta'', A)]$. Since $R(\theta', C) \ge R(\theta', A)$ means that

$$\mathcal{L}(\theta', \theta'') \int_{C} L(\theta') \ge \mathcal{L}(\theta', \theta'') \int_{A} L(\theta'),$$

we have

$$\alpha = \int_{C} L(\theta') \ge \int_{A} L(\theta');$$

that is, the significance level of the test associated with the critical region $A$ is less than or equal to $\alpha$. But $C$, in accordance with the Neyman–Pearson theorem, is a best critical region of size $\alpha$. Thus

$$\int_{C} L(\theta'') \ge \int_{A} L(\theta'')$$

and

$$\int_{C^c} L(\theta'') \le \int_{A^c} L(\theta'').$$

Accordingly,

$$\mathcal{L}(\theta'', \theta') \int_{C^c} L(\theta'') \le \mathcal{L}(\theta'', \theta') \int_{A^c} L(\theta''),$$

or, equivalently,

$$R(\theta'', C) \le R(\theta'', A).$$

That is,

$$R(\theta', C) = R(\theta'', C) \le R(\theta'', A).$$

This means that

$$\max[R(\theta', C), R(\theta'', C)] \le R(\theta'', A).$$

Then certainly,

$$\max[R(\theta', C), R(\theta'', C)] \le \max[R(\theta', A), R(\theta'', A)],$$

and the critical region $C$ provides a minimax solution, as we wanted to show.

Example 5.1. Let $X_1, X_2, \ldots, X_{100}$ denote a random sample of size 100 from a distribution that is $N(\theta, 100)$. We again consider the problem of testing $H_0: \theta = 75$ against $H_1: \theta = 78$. We seek a minimax solution with $\mathcal{L}(75, 78) = 3$ and $\mathcal{L}(78, 75) = 1$. Since $L(75)/L(78) \le k$ is equivalent to $\bar{x} \ge c$, we want to determine $c$, and thus $k$, so that

$$3 P(\bar{X} \ge c;\ \theta = 75) = P(\bar{X} < c;\ \theta = 78). \tag{5.2}$$

Because $\bar{X}$ is $N(\theta, 1)$, the preceding equation can be rewritten as

$$3[1 - \Phi(c - 75)] = \Phi(c - 78).$$

As requested in Exercise 5.4, the reader can show by using Newton's algorithm that the solution to one place is $c = 76.8$. The significance level of the test is $1 - \Phi(1.8) = 0.036$, approximately, and the power of the test when $H_1$ is true is $1 - \Phi(-1.2) = 0.885$, approximately.
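The computation in Example 5.1 can be sketched numerically. The Python fragment below is one possible way to apply Newton's algorithm to Equation (5.2), written as $g(c) = 3[1 - \Phi(c - 75)] - \Phi(c - 78) = 0$; it is a sketch, not the book's program. The helper names (`Phi`, `phi`, `g`, `g_prime`, `newton`), the starting value 76.5, and the tolerance are assumptions made for this illustration.

```python
import math


def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def g(c):
    """Left side minus right side of Equation (5.2), using Xbar ~ N(theta, 1)."""
    return 3.0 * (1.0 - Phi(c - 75.0)) - Phi(c - 78.0)


def g_prime(c):
    """Derivative of g with respect to c."""
    return -3.0 * phi(c - 75.0) - phi(c - 78.0)


def newton(c, tol=1e-10, max_iter=50):
    """Newton iterations c <- c - g(c)/g'(c), stopped when the step is tiny."""
    for _ in range(max_iter):
        step = g(c) / g_prime(c)
        c -= step
        if abs(step) < tol:
            break
    return c


c = newton(76.5)                 # starting value chosen by eye (assumption)
alpha = 1.0 - Phi(c - 75.0)      # significance level at the computed c
beta = Phi(c - 78.0)             # type II error probability at the computed c
risk_h0 = 3.0 * alpha            # risk when theta = 75 (loss 3)
risk_h1 = 1.0 * beta             # risk when theta = 78 (loss 1)
```

At the computed root the two risks agree, which is the equalizing condition that defines the minimax region; rounding `c` to one decimal place gives the value quoted in the example.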

5.2 Classification

The summary above has an interesting application to the problem of classification, which can be described as follows. An investigator makes a number of measurements on an item and wants to place it into one of several categories (or classify it). For convenience in our discussion, we assume that only two measurements, say $X$ and $Y$, are made on the item to be classified. Moreover, let $X$ and $Y$ have a joint pdf $f(x, y; \theta)$, where the parameter $\theta$ represents one or more parameters. In our simplification, suppose that there are only two possible joint distributions (categories) for $X$ and $Y$, which are indexed by the parameter values $\theta'$ and $\theta''$, respectively. In this case, the problem then reduces to one of observing $X = x$ and $Y = y$ and then testing the hypothesis $\theta = \theta'$ against the hypothesis $\theta = \theta''$, with the classification of $X$ and $Y$ being in accord with which hypothesis is accepted. From the Neyman–Pearson theorem, we know that a best decision of this sort is of the following form: If

$$\frac{f(x, y; \theta')}{f(x, y; \theta'')} \le k,$$

choose the distribution indexed by $\theta''$; that is, we classify $(x, y)$ as coming from the distribution indexed by $\theta''$. Otherwise, choose the distribution indexed by $\theta'$; that is, we classify $(x, y)$ as coming from the distribution indexed by $\theta'$. Some discussion on the choice of $k$ follows in the next remark.

Remark 5.1 (On the Choice of $k$). Consider the following probabilities:

$$\pi' = P[(X, Y) \text{ is drawn from the distribution with pdf } f(x, y; \theta')],$$
$$\pi'' = P[(X, Y) \text{ is drawn from the distribution with pdf } f(x, y; \theta'')].$$

Note that $\pi' + \pi'' = 1$. Then it can be shown that the optimal classification rule is determined by taking $k = \pi''/\pi'$; see, for instance, Seber (1984). Hence, if we have prior information on how likely the item is drawn from the distribution with parameter $\theta'$, then we can obtain the classification rule. In practice, it is common for each distribution to be equilikely, in which case $\pi' = \pi'' = 1/2$ and, hence, $k = 1$.

Example 5.2. Let $(x, y)$ be an observation of the random pair $(X, Y)$, which has a bivariate normal distribution with parameters $\mu_1$, $\mu_2$, $\sigma_1^2$, $\sigma_2^2$, and $\rho$. That joint pdf is given by

$$f(x, y; \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}}\, e^{-q(x, y; \mu_1, \mu_2)/2},$$

for $-\infty < x < \infty$ and $-\infty < y < \infty$, where $\sigma_1 > 0$, $\sigma_2 > 0$, $-1 < \rho < 1$, and

$$q(x, y; \mu_1, \mu_2) = \frac{1}{1 - \rho^2} \left[ \left( \frac{x - \mu_1}{\sigma_1} \right)^2 - 2\rho \left( \frac{x - \mu_1}{\sigma_1} \right) \left( \frac{y - \mu_2}{\sigma_2} \right) + \left( \frac{y - \mu_2}{\sigma_2} \right)^2 \right].$$

Assume that $\sigma_1^2$, $\sigma_2^2$, and $\rho$ are known but that we do not know whether the respective means of $(X, Y)$ are $(\mu_1', \mu_2')$ or $(\mu_1'', \mu_2'')$. The inequality

$$\frac{f(x, y; \mu_1', \mu_2', \sigma_1^2, \sigma_2^2, \rho)}{f(x, y; \mu_1'', \mu_2'', \sigma_1^2, \sigma_2^2, \rho)} \le k$$

is equivalent to

$$\tfrac{1}{2} [q(x, y; \mu_1'', \mu_2'') - q(x, y; \mu_1', \mu_2')] \le \log k.$$

Moreover, it is clear that the difference in the left-hand member of this inequality does not contain terms involving $x^2$, $xy$, and $y^2$. In particular, this inequality is the same as

$$\frac{1}{1 - \rho^2} \left\{ \left[ \frac{\mu_1' - \mu_1''}{\sigma_1^2} - \frac{\rho(\mu_2' - \mu_2'')}{\sigma_1 \sigma_2} \right] x + \left[ \frac{\mu_2' - \mu_2''}{\sigma_2^2} - \frac{\rho(\mu_1' - \mu_1'')}{\sigma_1 \sigma_2} \right] y \right\} \le \log k + \tfrac{1}{2} [q(0, 0; \mu_1', \mu_2') - q(0, 0; \mu_1'', \mu_2'')],$$

or, for brevity,

$$ax + by \le c. \tag{5.3}$$

That is, if this linear function of $x$ and $y$ in the left-hand member of inequality (5.3) is less than or equal to a constant, we classify $(x, y)$ as coming from the bivariate normal distribution with means $\mu_1''$ and $\mu_2''$. Otherwise, we classify $(x, y)$ as arising from the bivariate normal distribution with means $\mu_1'$ and $\mu_2'$. Of course, if the prior probabilities can be assigned as discussed in Remark 5.1, then $k$ and thus $c$ can be found easily; see Exercise 5.3.

Once the rule for classification is established, the statistician might be interested in the two probabilities of misclassification using that rule. The first of these two is associated with the classification of $(x, y)$ as arising from the distribution indexed by $\theta''$ if, in fact, it comes from that indexed by $\theta'$. The second misclassification is similar, but with the interchange of $\theta'$ and $\theta''$. In the preceding example, the probabilities of these respective misclassifications are

$$P(aX + bY \le c;\ \mu_1', \mu_2') \quad \text{and} \quad P(aX + bY > c;\ \mu_1'', \mu_2'').$$

The distribution of $Z = aX + bY$ was obtained previously. It follows that the distribution of $Z = aX + bY$ is given by

$$N(a\mu_1 + b\mu_2,\ a^2 \sigma_1^2 + 2ab\rho \sigma_1 \sigma_2 + b^2 \sigma_2^2).$$

With this information, it is easy to compute the probabilities of misclassification; see Exercise 5.3.
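The rule of Example 5.2 can be illustrated numerically. The Python sketch below computes the coefficients $a$, $b$, and $c$ of inequality (5.3) and the two misclassification probabilities from the normal distribution of $aX + bY$. The function names (`linear_rule`, `misclassification`) and the argument layout are hypothetical; the numerical values are those stated in Exercise 5.3 (means $(0, 0)$ versus $(1, 1)$, unit variances, $\rho = 1/2$, $k = 1$).

```python
import math


def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def q(x, y, m1, m2, s1, s2, rho):
    """Quadratic form in the exponent of the bivariate normal pdf."""
    u, v = (x - m1) / s1, (y - m2) / s2
    return (u * u - 2.0 * rho * u * v + v * v) / (1.0 - rho * rho)


def linear_rule(mu_p, mu_pp, s1, s2, rho, k=1.0):
    """Coefficients (a, b, c) of inequality (5.3): a*x + b*y <= c.

    mu_p holds the means (mu1', mu2') and mu_pp the means (mu1'', mu2'').
    """
    d1, d2 = mu_p[0] - mu_pp[0], mu_p[1] - mu_pp[1]
    a = (d1 / s1 ** 2 - rho * d2 / (s1 * s2)) / (1.0 - rho ** 2)
    b = (d2 / s2 ** 2 - rho * d1 / (s1 * s2)) / (1.0 - rho ** 2)
    c = math.log(k) + 0.5 * (q(0.0, 0.0, mu_p[0], mu_p[1], s1, s2, rho)
                             - q(0.0, 0.0, mu_pp[0], mu_pp[1], s1, s2, rho))
    return a, b, c


def misclassification(mu_p, mu_pp, s1, s2, rho, k=1.0):
    """Return P(aX+bY <= c; mu') and P(aX+bY > c; mu'')."""
    a, b, c = linear_rule(mu_p, mu_pp, s1, s2, rho, k)
    # Standard deviation of Z = aX + bY (the same under both sets of means).
    sd = math.sqrt(a * a * s1 * s1 + 2.0 * a * b * rho * s1 * s2 + b * b * s2 * s2)
    mean_p = a * mu_p[0] + b * mu_p[1]
    mean_pp = a * mu_pp[0] + b * mu_pp[1]
    return Phi((c - mean_p) / sd), 1.0 - Phi((c - mean_pp) / sd)


# Values taken from Exercise 5.3.
a, b, c = linear_rule((0.0, 0.0), (1.0, 1.0), 1.0, 1.0, 0.5)
p_first, p_second = misclassification((0.0, 0.0), (1.0, 1.0), 1.0, 1.0, 0.5)
```

With equal prior probabilities ($k = 1$) and a common covariance structure, the two misclassification probabilities coincide, which the sketch reproduces.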

One final remark must be made with respect to the use of the important classification rule established in Example 5.2. In most instances the parameter values $\mu_1', \mu_2'$ and $\mu_1'', \mu_2''$ as well as $\sigma_1^2$, $\sigma_2^2$, and $\rho$ are unknown. In such cases the statistician has usually observed a random sample (frequently called a training sample) from each of the two distributions. Let us say the samples have sizes $n'$ and $n''$, respectively, with sample characteristics

$$\bar{x}',\ \bar{y}',\ (s_x')^2,\ (s_y')^2,\ r' \quad \text{and} \quad \bar{x}'',\ \bar{y}'',\ (s_x'')^2,\ (s_y'')^2,\ r''.$$

The statistics $r'$ and $r''$ are the sample correlation coefficients. The sample correlation coefficient is the mle for the correlation parameter $\rho$ of a bivariate normal distribution. If in inequality (5.3) the parameters $\mu_1'$, $\mu_2'$, $\mu_1''$, $\mu_2''$, $\sigma_1^2$, $\sigma_2^2$, and $\rho\sigma_1\sigma_2$ are replaced by the unbiased estimates

$$\bar{x}',\ \bar{y}',\ \bar{x}'',\ \bar{y}'',\ \frac{(n' - 1)(s_x')^2 + (n'' - 1)(s_x'')^2}{n' + n'' - 2},\ \frac{(n' - 1)(s_y')^2 + (n'' - 1)(s_y'')^2}{n' + n'' - 2},\ \frac{(n' - 1) r' s_x' s_y' + (n'' - 1) r'' s_x'' s_y''}{n' + n'' - 2},$$

the resulting expression in the left-hand member is frequently called Fisher's linear discriminant function. Since those parameters have been estimated, the distribution theory associated with $aX + bY$ does provide an approximation.

Although we have considered only bivariate distributions in this section, the results can easily be extended to multivariate normal distributions.

EXERCISES

5.1. Let $X_1, X_2, \ldots, X_{20}$ be a random sample of size 20 from a distribution which is $N(\theta, 5)$. Let $L(\theta)$ represent the joint pdf of $X_1, X_2, \ldots, X_{20}$. The problem is to test $H_0: \theta = 1$ against $H_1: \theta = 0$. Thus $\Omega = \{\theta : \theta = 0, 1\}$.
(a) Show that $L(1)/L(0) \le k$ is equivalent to $\bar{x} \le c$.
(b) Find $c$ so that the significance level is $\alpha = 0.05$. Compute the power of this test if $H_1$ is true.
(c) If the loss function is such that $\mathcal{L}(1, 1) = \mathcal{L}(0, 0) = 0$ and $\mathcal{L}(1, 0) = \mathcal{L}(0, 1) > 0$, find the minimax test. Evaluate the power function of this test at the points $\theta = 1$ and $\theta = 0$.

5.2. Let $X_1, X_2, \ldots, X_{10}$ be a random sample of size 10 from a Poisson distribution with parameter $\theta$. Let $L(\theta)$ be the joint pdf of $X_1, X_2, \ldots, X_{10}$. The problem is to test $H_0: \theta = \frac{1}{2}$ against $H_1: \theta = 1$.
(a) Show that $L(\frac{1}{2})/L(1) \le k$ is equivalent to $y = \sum_{1}^{n} x_i \ge c$.
(b) In order to make $\alpha = 0.05$, show that $H_0$ is rejected if $y > 9$ and, if $y = 9$, reject $H_0$ with probability $\frac{1}{2}$ (using some auxiliary random experiment).
(c) If the loss function is such that $\mathcal{L}(\frac{1}{2}, \frac{1}{2}) = \mathcal{L}(1, 1) = 0$ and $\mathcal{L}(\frac{1}{2}, 1) = 1$ and $\mathcal{L}(1, \frac{1}{2}) = 2$, show that the minimax procedure is to reject $H_0$ if $y > 6$ and, if $y = 6$, reject $H_0$ with probability 0.08 (using some auxiliary random experiment).

5.3. In Example 5.2 let $\mu_1' = \mu_2' = 0$, $\mu_1'' = \mu_2'' = 1$, $\sigma_1^2 = 1$, $\sigma_2^2 = 1$, and $\rho = \frac{1}{2}$.
(a) Find the distribution of the linear function $aX + bY$.
(b) With $k = 1$, compute $P(aX + bY \le c;\ \mu_1' = \mu_2' = 0)$ and $P(aX + bY > c;\ \mu_1'' = \mu_2'' = 1)$.

5.4. Determine Newton's algorithm to find the solution of Equation (5.2). If software is available, write a program which performs your algorithm and then show that the solution is $c = 76.8$. If software is not available, solve (5.2) by "trial and error."

5.5. Let $X$ and $Y$ have the joint pdf

$$f(x, y; \theta_1, \theta_2) = \frac{1}{\theta_1 \theta_2} \exp\left( -\frac{x}{\theta_1} - \frac{y}{\theta_2} \right), \quad 0 < x < \infty,\ 0 < y < \infty,$$

zero elsewhere, where $0 < \theta_1$, $0 < \theta_2$. An observation $(x, y)$ arises from the joint distribution with parameters equal to either $(\theta_1' = 1, \theta_2' = 5)$ or $(\theta_1'' = 3, \theta_2'' = 2)$. Determine the form of the classification rule.

5.6. Let $X$ and $Y$ have a joint bivariate normal distribution. An observation $(x, y)$ arises from the joint distribution with parameters equal to either

$$\mu_1' = \mu_2' = 0,\ (\sigma_1^2)' = (\sigma_2^2)' = 1,\ \rho' = \tfrac{1}{2}$$

or

$$\mu_1'' = \mu_2'' = 1,\ (\sigma_1^2)'' = 4,\ (\sigma_2^2)'' = 9,\ \rho'' = \tfrac{1}{2}.$$

Show that the classification rule involves a second-degree polynomial in $x$ and $y$.

5.7. Let $\mathbf{W}' = (W_1, W_2)$ be an observation from one of two bivariate normal distributions, I and II, each with $\mu_1 = \mu_2 = 0$ but with the respective variance-covariance matrices

$$\mathbf{V}_1 = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix} \quad \text{and} \quad \mathbf{V}_2 = \begin{pmatrix} 3 & 0 \\ 0 & 12 \end{pmatrix}.$$

How would you classify $\mathbf{W}$ into I or II?

Answers to Selected Exercises

1.4 $\sum_{i=1}^{10} x_i^2 \ge 18.3$; yes; yes.
1.5 $\prod_{i=1}^{n} x_i \ge c$.
1.6 $3 \sum_{i=1}^{10} x_i^2 + 2 \sum_{i=1}^{10} x_i \ge c$.
1.7 About 96; 76.7.
1.8 $\prod_{i=1}^{n} [x_i (1 - x_i)] \ge c$.
1.9 About 39; 15.
1.10 0.08; 0.875.
2.1 $(1 - \theta)^9 (1 + 9\theta)$.
2.2 $1 - \frac{15}{16\theta^4}$, $1 < \theta$.
2.3 $1 - \Phi\left( \frac{3 - 5\theta}{2} \right)$.
2.4 About 54; 5.6.
2.7 Reject $H_0$ if $\bar{x} \ge 77.564$.
2.8 About 27; reject $H_0$ if $\bar{x} \le 24$.
2.10 $\Gamma(n, \theta)$; reject $H_0$ if $\sum_{i=1}^{n} x_i \ge c$.
2.12 (b) $\frac{6}{32}$; (c) $\frac{1}{32}$; (d) reject if $y = 0$; if $y = 1$, reject with probability $\frac{1}{5}$.
3.1 $|t| = 2.27 > 2.145$; reject $H_0$.
3.9 Reject $H_0$ if $|y_3 - \theta_0| \ge c$.
3.11 (a) $\prod_{i=1}^{n} (1 - x_i) \ge c$.
4.1 $5.84n - 32.42$; $5.84n + 41.62$.
4.2 $0.04n - 1.66$; $0.04n + 1.20$.
4.4 0.025, 29.7, −29.7.
5.5 $(9y - 20x)/30 \le c \Rightarrow (x, y) \in$ 2nd.
5.7 $2w_1^2 + 8w_2^2 \ge c \Rightarrow (w_1, w_2) \in$ II.

Inferences About Normal Models

1 Quadratic Forms

A homogeneous polynomial of degree 2 in $n$ variables is called a quadratic form in those variables. If both the variables and the coefficients are real, the form is called a real quadratic form. To illustrate, the form $X_1^2 + X_1 X_2 + X_2^2$ is a quadratic form in the two variables $X_1$ and $X_2$; the form $X_1^2 + X_2^2 + X_3^2 - 2 X_1 X_2$ is a quadratic form in the three variables $X_1$, $X_2$, and $X_3$; but the form $(X_1 - 1)^2 + (X_2 - 2)^2 = X_1^2 + X_2^2 - 2X_1 - 4X_2 + 5$ is not a quadratic form in $X_1$ and $X_2$, although it is a quadratic form in the variables $X_1 - 1$ and $X_2 - 2$.

Let $\bar{X}$ and $S^2$ denote, respectively, the mean and variance of a random sample $X_1, X_2, \ldots, X_n$ from an arbitrary distribution. Thus

$$(n - 1) S^2 = \sum_{1}^{n} (X_i - \bar{X})^2 = \sum_{1}^{n} \left( X_i - \frac{X_1 + X_2 + \cdots + X_n}{n} \right)^2$$
$$= \frac{n - 1}{n} (X_1^2 + X_2^2 + \cdots + X_n^2) - \frac{2}{n} (X_1 X_2 + \cdots + X_1 X_n + \cdots + X_{n-1} X_n)$$

is a quadratic form in the $n$ variables $X_1, X_2, \ldots, X_n$. If the sample arises from a distribution that is $N(\mu, \sigma^2)$, we know that the random variable $(n - 1) S^2 / \sigma^2$ is $\chi^2(n - 1)$ regardless of the value of $\mu$. This fact proved useful in our search for a confidence interval for $\sigma^2$ when $\mu$ is unknown.

It has been seen that tests of certain statistical hypotheses require a statistic that is a quadratic form. For instance, we can make use of the statistic $\sum_{1}^{n} X_i^2$, which is a quadratic form in the variables $X_1, X_2, \ldots, X_n$. Later in this chapter, tests of other statistical hypotheses are investigated, showing that statistics composed of functions of quadratic forms are necessary to conduct the tests in an expeditious manner. But first we shall make a study of the distribution of certain quadratic forms in normal and independent random variables.

From Chapter 9 of Introduction to Mathematical Statistics, Seventh Edition. Robert V. Hogg, Joseph W. McKean, Allen T. Craig. Copyright © 2013 by Pearson Education, Inc. All rights reserved.

The following theorem is proved in Section 9.

Theorem 1.1. Let $Q = Q_1 + Q_2 + \cdots + Q_{k-1} + Q_k$, where $Q, Q_1, \ldots, Q_k$ are $k + 1$ random variables that are real quadratic forms in $n$ independent random variables which are normally distributed with common mean and variance $\mu$ and $\sigma^2$, respectively. Let $Q/\sigma^2, Q_1/\sigma^2, \ldots, Q_{k-1}/\sigma^2$ have chi-square distributions with degrees of freedom $r, r_1, \ldots, r_{k-1}$, respectively. Let $Q_k$ be nonnegative. Then
(a) $Q_1, \ldots, Q_k$ are independent, and hence
(b) $Q_k/\sigma^2$ has a chi-square distribution with $r - (r_1 + \cdots + r_{k-1}) = r_k$ degrees of freedom.

Three examples illustrative of the theorem follow, each of which deals with a distribution problem that is based on the remarks made in the subsequent paragraph.

Let the random variable $X$ have a distribution that is $N(\mu, \sigma^2)$. Let $a$ and $b$ denote positive integers greater than 1 and let $n = ab$. Consider a random sample of size $n = ab$ from this normal distribution. The observations of the random sample are denoted by the symbols

$$\begin{array}{cccccc}
X_{11}, & X_{12}, & \ldots, & X_{1j}, & \ldots, & X_{1b} \\
X_{21}, & X_{22}, & \ldots, & X_{2j}, & \ldots, & X_{2b} \\
\vdots & & & & & \\
X_{i1}, & X_{i2}, & \ldots, & X_{ij}, & \ldots, & X_{ib} \\
\vdots & & & & & \\
X_{a1}, & X_{a2}, & \ldots, & X_{aj}, & \ldots, & X_{ab}.
\end{array}$$

By assumption, these $n = ab$ random variables are independent, and each has the same normal distribution with mean $\mu$ and variance $\sigma^2$. Thus, if we wish, we may consider each row as being a random sample of size $b$ from the given distribution; and we may consider each column as being a random sample of size $a$ from the given distribution. We now define $a + b + 1$ statistics. They are

$$\bar{X}_{..} = \frac{X_{11} + \cdots + X_{1b} + \cdots + X_{a1} + \cdots + X_{ab}}{ab} = \frac{\sum_{i=1}^{a} \sum_{j=1}^{b} X_{ij}}{ab},$$

$$\bar{X}_{i.} = \frac{X_{i1} + X_{i2} + \cdots + X_{ib}}{b} = \frac{\sum_{j=1}^{b} X_{ij}}{b}, \quad i = 1, 2, \ldots, a,$$

$$\bar{X}_{.j} = \frac{X_{1j} + X_{2j} + \cdots + X_{aj}}{a} = \frac{\sum_{i=1}^{a} X_{ij}}{a}, \quad j = 1, 2, \ldots, b.$$

Thus the statistic $\bar{X}_{..}$ is the mean of the random sample of size $n = ab$; the statistics $\bar{X}_{1.}, \bar{X}_{2.}, \ldots, \bar{X}_{a.}$ are, respectively, the means of the rows; and the statistics $\bar{X}_{.1}, \bar{X}_{.2}, \ldots, \bar{X}_{.b}$ are, respectively, the means of the columns. Examples illustrative of the theorem follow.

Example 1.1. Consider the variance $S^2$ of the random sample of size $n = ab$. We have the algebraic identity

$$(ab - 1) S^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{..})^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} [(X_{ij} - \bar{X}_{i.}) + (\bar{X}_{i.} - \bar{X}_{..})]^2$$
$$= \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})^2 + \sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{X}_{i.} - \bar{X}_{..})^2 + 2 \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})(\bar{X}_{i.} - \bar{X}_{..}).$$

The last term of the right-hand member of this identity may be written

$$2 \sum_{i=1}^{a} \left[ (\bar{X}_{i.} - \bar{X}_{..}) \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.}) \right] = 2 \sum_{i=1}^{a} [(\bar{X}_{i.} - \bar{X}_{..})(b \bar{X}_{i.} - b \bar{X}_{i.})] = 0,$$

and the term

$$\sum_{i=1}^{a} \sum_{j=1}^{b} (\bar{X}_{i.} - \bar{X}_{..})^2$$

may be written

$$b \sum_{i=1}^{a} (\bar{X}_{i.} - \bar{X}_{..})^2.$$

Thus,

$$(ab - 1) S^2 = \sum_{i=1}^{a} \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})^2 + b \sum_{i=1}^{a} (\bar{X}_{i.} - \bar{X}_{..})^2,$$

or, for brevity, $Q = Q_1 + Q_2$. We shall use Theorem 1.1 with $k = 2$ to show that $Q_1$ and $Q_2$ are independent. Since $S^2$ is the variance of a random sample of size $n = ab$ from the given normal distribution, then $(ab - 1) S^2 / \sigma^2$ has a chi-square distribution with $ab - 1$ degrees of freedom. Now

$$\frac{Q_1}{\sigma^2} = \sum_{i=1}^{a} \left[ \sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})^2 / \sigma^2 \right].$$

For each fixed value of $i$, $\sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})^2$ is the product of $(b - 1)$ and the variance of a random sample of size $b$ from the given normal distribution, and accordingly, $\sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.})^2/\sigma^2$ has a chi-square distribution with $b - 1$ degrees of freedom. Because the $X_{ij}$'s are independent, $Q_1/\sigma^2$ is the sum of $a$ independent random variables, each having a chi-square distribution with $b - 1$ degrees of freedom. Hence $Q_1/\sigma^2$ has a chi-square distribution with $a(b - 1)$ degrees of freedom. Now $Q_2 = b\sum_{i=1}^{a} (\bar{X}_{i.} - \bar{X}_{..})^2 \geq 0$. In accordance with the theorem, $Q_1$ and $Q_2$ are independent, and $Q_2/\sigma^2$ has a chi-square distribution with $ab - 1 - a(b - 1) = a - 1$ degrees of freedom.

Example 1.2. In $(ab - 1)S^2$, replace $X_{ij} - \bar{X}_{..}$ by $(X_{ij} - \bar{X}_{.j}) + (\bar{X}_{.j} - \bar{X}_{..})$ to obtain

$$(ab - 1)S^2 = \sum_{j=1}^{b}\sum_{i=1}^{a} [(X_{ij} - \bar{X}_{.j}) + (\bar{X}_{.j} - \bar{X}_{..})]^2,$$

or

$$(ab - 1)S^2 = \sum_{j=1}^{b}\sum_{i=1}^{a} (X_{ij} - \bar{X}_{.j})^2 + a\sum_{j=1}^{b} (\bar{X}_{.j} - \bar{X}_{..})^2,$$

or, for brevity, $Q = Q_3 + Q_4$. It is easy to show (Exercise 1.1) that $Q_3/\sigma^2$ has a chi-square distribution with $b(a - 1)$ degrees of freedom. Since $Q_4 = a\sum_{j=1}^{b} (\bar{X}_{.j} - \bar{X}_{..})^2 \geq 0$, the theorem enables us to assert that $Q_3$ and $Q_4$ are independent and that $Q_4/\sigma^2$ has a chi-square distribution with $ab - 1 - b(a - 1) = b - 1$ degrees of freedom.

Example 1.3. In $(ab - 1)S^2$, replace $X_{ij} - \bar{X}_{..}$ by $(\bar{X}_{i.} - \bar{X}_{..}) + (\bar{X}_{.j} - \bar{X}_{..}) + (X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X}_{..})$ to obtain (Exercise 1.2)

$$(ab - 1)S^2 = b\sum_{i=1}^{a} (\bar{X}_{i.} - \bar{X}_{..})^2 + a\sum_{j=1}^{b} (\bar{X}_{.j} - \bar{X}_{..})^2 + \sum_{j=1}^{b}\sum_{i=1}^{a} (X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X}_{..})^2,$$

or, for brevity, $Q = Q_2 + Q_4 + Q_5$, where $Q_2$ and $Q_4$ are defined in Examples 1.1 and 1.2. From Examples 1.1 and 1.2, $Q/\sigma^2$, $Q_2/\sigma^2$, and $Q_4/\sigma^2$ have chi-square distributions with $ab - 1$, $a - 1$, and $b - 1$ degrees of freedom, respectively. Since $Q_5 \geq 0$, the theorem asserts that $Q_2$, $Q_4$, and $Q_5$ are independent and that $Q_5/\sigma^2$ has a chi-square distribution with $ab - 1 - (a - 1) - (b - 1) = (a - 1)(b - 1)$ degrees of freedom.
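The three identities of Examples 1.1 to 1.3 are purely algebraic, so they can be checked numerically on any $a \times b$ array. The following is a minimal sketch in Python, using only the standard library; the function name `decompose` and the simulated data are illustrative assumptions, not part of the text.

```python
import random


def decompose(x):
    """Return (Q, Q1, Q2, Q3, Q4, Q5) for an a-by-b array given as a list of rows."""
    a, b = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (a * b)                       # overall mean
    row_mean = [sum(row) / b for row in x]                             # row means
    col_mean = [sum(x[i][j] for i in range(a)) / a for j in range(b)]  # column means
    cells = [(i, j) for i in range(a) for j in range(b)]
    q = sum((x[i][j] - grand) ** 2 for i, j in cells)                  # (ab - 1) S^2
    q1 = sum((x[i][j] - row_mean[i]) ** 2 for i, j in cells)           # within rows
    q2 = b * sum((m - grand) ** 2 for m in row_mean)                   # among rows
    q3 = sum((x[i][j] - col_mean[j]) ** 2 for i, j in cells)           # within columns
    q4 = a * sum((m - grand) ** 2 for m in col_mean)                   # among columns
    q5 = sum((x[i][j] - row_mean[i] - col_mean[j] + grand) ** 2 for i, j in cells)
    return q, q1, q2, q3, q4, q5


# Hypothetical data: a = 3 rows and b = 4 columns simulated from N(10, 2^2).
rng = random.Random(0)
sample = [[rng.gauss(10.0, 2.0) for _ in range(4)] for _ in range(3)]
q, q1, q2, q3, q4, q5 = decompose(sample)
for name, total in [("Q1 + Q2", q1 + q2), ("Q3 + Q4", q3 + q4), ("Q2 + Q4 + Q5", q2 + q4 + q5)]:
    print(f"Q - ({name}) = {q - total:.2e}")  # differences are rounding error only
```

The check confirms only the algebra; the independence and chi-square claims rest on Theorem 1.1 and the normality assumption.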

Once these quadratic form statistics have been shown to be independent, a multiplicity of $F$-statistics can be defined. For instance,

$$\frac{Q_4/[\sigma^2(b - 1)]}{Q_3/[\sigma^2 b(a - 1)]} = \frac{Q_4/(b - 1)}{Q_3/[b(a - 1)]}$$

has an $F$-distribution with $b - 1$ and $b(a - 1)$ degrees of freedom; and

$$\frac{Q_4/[\sigma^2(b - 1)]}{Q_5/[\sigma^2(a - 1)(b - 1)]} = \frac{Q_4/(b - 1)}{Q_5/[(a - 1)(b - 1)]}$$

has an $F$-distribution with $b - 1$ and $(a - 1)(b - 1)$ degrees of freedom. The subsequent sections show that likelihood ratio tests of certain statistical hypotheses can be based on these $F$-statistics.

EXERCISES

1.1. In Example 1.2, verify that $Q = Q_3 + Q_4$ and that $Q_3/\sigma^2$ has a chi-square distribution with $b(a - 1)$ degrees of freedom.

1.2. In Example 1.3, verify that $Q = Q_2 + Q_4 + Q_5$.

1.3. Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution $N(\mu, \sigma^2)$. Show that

$$\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=2}^{n} (X_i - \bar{X}')^2 + \frac{n - 1}{n}(X_1 - \bar{X}')^2,$$

where $\bar{X} = \sum_{i=1}^{n} X_i/n$ and $\bar{X}' = \sum_{i=2}^{n} X_i/(n - 1)$.
Hint: Replace $X_i - \bar{X}$ by $(X_i - \bar{X}') - (X_1 - \bar{X}')/n$. Show that $\sum_{i=2}^{n} (X_i - \bar{X}')^2/\sigma^2$ has a chi-square distribution with $n - 2$ degrees of freedom. Prove that the two terms in the right-hand member are independent. What then is the distribution of

$$\frac{[(n - 1)/n](X_1 - \bar{X}')^2}{\sigma^2}?$$

1.4. Let $X_{ijk}$, $i = 1, \ldots, a$; $j = 1, \ldots, b$; $k = 1, \ldots, c$, be a random sample of size $n = abc$ from a normal distribution $N(\mu, \sigma^2)$. Let $\bar{X}_{...} = \sum_{k=1}^{c}\sum_{j=1}^{b}\sum_{i=1}^{a} X_{ijk}/n$ and $\bar{X}_{i..} = \sum_{k=1}^{c}\sum_{j=1}^{b} X_{ijk}/bc$. Prove that

$$\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c} (X_{ijk} - \bar{X}_{...})^2 = \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c} (X_{ijk} - \bar{X}_{i..})^2 + bc\sum_{i=1}^{a} (\bar{X}_{i..} - \bar{X}_{...})^2.$$

Show that $\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c} (X_{ijk} - \bar{X}_{i..})^2/\sigma^2$ has a chi-square distribution with $a(bc - 1)$ degrees of freedom. Prove that the two terms in the right-hand member are independent. What, then, is the distribution of $bc\sum_{i=1}^{a} (\bar{X}_{i..} - \bar{X}_{...})^2/\sigma^2$? Furthermore, let $\bar{X}_{.j.} = \sum_{k=1}^{c}\sum_{i=1}^{a} X_{ijk}/ac$ and $\bar{X}_{ij.} = \sum_{k=1}^{c} X_{ijk}/c$. Show that

$$
\begin{aligned}
\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c} (X_{ijk} - \bar{X}_{...})^2 &= \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c} (X_{ijk} - \bar{X}_{ij.})^2 \\
&\quad + bc\sum_{i=1}^{a} (\bar{X}_{i..} - \bar{X}_{...})^2 + ac\sum_{j=1}^{b} (\bar{X}_{.j.} - \bar{X}_{...})^2 \\
&\quad + c\sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{X}_{ij.} - \bar{X}_{i..} - \bar{X}_{.j.} + \bar{X}_{...})^2.
\end{aligned}
$$

Prove that the four terms in the right-hand member, when divided by $\sigma^2$, are independent chi-square variables with $ab(c - 1)$, $a - 1$, $b - 1$, and $(a - 1)(b - 1)$ degrees of freedom, respectively.

1.5. Let $X_1, X_2, X_3, X_4$ be a random sample of size $n = 4$ from the normal distribution $N(0, 1)$. Show that $\sum_{i=1}^{4} (X_i - \bar{X})^2$ equals

$$\frac{(X_1 - X_2)^2}{2} + \frac{[X_3 - (X_1 + X_2)/2]^2}{3/2} + \frac{[X_4 - (X_1 + X_2 + X_3)/3]^2}{4/3}$$

and argue that these three terms are independent, each with a chi-square distribution with 1 degree of freedom.

2 One-Way ANOVA

Consider $b$ independent random variables that have normal distributions with unknown means $\mu_1, \mu_2, \ldots, \mu_b$, respectively, and unknown but common variance $\sigma^2$. For each $j = 1, 2, \ldots, b$, let $X_{1j}, X_{2j}, \ldots, X_{aj}$ represent a random sample of size $a$ from the normal distribution with mean $\mu_j$ and variance $\sigma^2$. The appropriate model for the observations is

$$X_{ij} = \mu_j + e_{ij}; \quad i = 1, \ldots, a, \; j = 1, \ldots, b, \tag{2.1}$$

where the $e_{ij}$ are iid $N(0, \sigma^2)$. Suppose that it is desired to test the composite hypothesis $H_0 : \mu_1 = \mu_2 = \cdots = \mu_b = \mu$, $\mu$ unspecified, against all possible alternative hypotheses $H_1$. A likelihood ratio test is used.

Such problems often arise in practice. For example, suppose for a certain type of disease there are $b$ drugs which can be used to treat it and we are interested in determining which drug is best in terms of a certain response. Let $X_j$ denote this response when drug $j$ is applied and let $\mu_j = E(X_j)$. If we assume that $X_j$ is $N(\mu_j, \sigma^2)$, then the above null hypothesis says that all the drugs are equally effective. We often summarize this problem by saying that we have one factor at $b$ levels. In this case the factor is the treatment of the disease and each level corresponds to one of the treatment drugs. Model (2.1) is called a one-way model. As shown, the likelihood ratio test can be thought of in terms of estimates of variance. Hence this is an example of an analysis of variance (ANOVA). In short, we say that this example is a one-way ANOVA problem.

Here the total parameter space is

$$\Omega = \{(\mu_1, \mu_2, \ldots, \mu_b, \sigma^2) : -\infty < \mu_j < \infty, \; 0 < \sigma^2 < \infty\}$$

and

$$\omega = \{(\mu_1, \mu_2, \ldots, \mu_b, \sigma^2) : -\infty < \mu_1 = \mu_2 = \cdots = \mu_b = \mu < \infty, \; 0 < \sigma^2 < \infty\}.$$

The likelihood functions, denoted by $L(\omega)$ and $L(\Omega)$, are, respectively,

$$L(\omega) = \left(\frac{1}{2\pi\sigma^2}\right)^{ab/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \mu)^2\right]$$

and

$$L(\Omega) = \left(\frac{1}{2\pi\sigma^2}\right)^{ab/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \mu_j)^2\right].$$

Now

$$\frac{\partial \log L(\omega)}{\partial \mu} = \sigma^{-2}\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \mu)$$

and

$$\frac{\partial \log L(\omega)}{\partial(\sigma^2)} = -\frac{ab}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \mu)^2.$$

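If we equate these partial derivatives to zero, the solutions for $\mu$ and $\sigma^2$ are, respectively, in $\omega$,

$$(ab)^{-1}\sum_{j=1}^{b}\sum_{i=1}^{a} x_{ij} = \bar{x}_{..}, \qquad (ab)^{-1}\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \bar{x}_{..})^2 = v, \tag{2.2}$$

and these values maximize $L(\omega)$. Furthermore,

$$\frac{\partial \log L(\Omega)}{\partial \mu_j} = \sigma^{-2}\sum_{i=1}^{a} (x_{ij} - \mu_j), \quad j = 1, 2, \ldots, b,$$

and

$$\frac{\partial \log L(\Omega)}{\partial(\sigma^2)} = -\frac{ab}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \mu_j)^2.$$

If we equate these partial derivatives to zero, the solutions for $\mu_1, \mu_2, \ldots, \mu_b$ and $\sigma^2$ are, respectively, in $\Omega$,

$$a^{-1}\sum_{i=1}^{a} x_{ij} = \bar{x}_{.j}, \quad j = 1, 2, \ldots, b, \qquad (ab)^{-1}\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \bar{x}_{.j})^2 = w, \tag{2.3}$$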

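and these values maximize $L(\Omega)$. These maxima are, respectively,

$$
\begin{aligned}
L(\hat{\omega}) &= \left[\frac{ab}{2\pi\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \bar{x}_{..})^2}\right]^{ab/2} \exp\left[-\frac{ab\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \bar{x}_{..})^2}{2\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \bar{x}_{..})^2}\right] \\
&= \left[\frac{ab}{2\pi\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \bar{x}_{..})^2}\right]^{ab/2} e^{-ab/2}
\end{aligned}
$$

and

$$L(\hat{\Omega}) = \left[\frac{ab}{2\pi\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \bar{x}_{.j})^2}\right]^{ab/2} e^{-ab/2}.$$

Finally,

$$\Lambda = \frac{L(\hat{\omega})}{L(\hat{\Omega})} = \left[\frac{\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \bar{x}_{.j})^2}{\sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \bar{x}_{..})^2}\right]^{ab/2}.$$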

In the notation of Section 1, the statistics defined by the functions $\bar{x}_{..}$ and $v$ given by the equations in expression (2.2) of this section are

$$\bar{X}_{..} = \frac{1}{ab}\sum_{j=1}^{b}\sum_{i=1}^{a} X_{ij} \quad \text{and} \quad V = \frac{1}{ab}\sum_{j=1}^{b}\sum_{i=1}^{a} (X_{ij} - \bar{X}_{..})^2 = \frac{Q}{ab}, \tag{2.4}$$

while the statistics defined by the functions $\bar{x}_{.1}, \bar{x}_{.2}, \ldots, \bar{x}_{.b}$ and $w$ given by Equations (2.3) in this section are, respectively, given by the formulas $\bar{X}_{.j} = \sum_{i=1}^{a} X_{ij}/a$, $j = 1, 2, \ldots, b$, and $Q_3/ab = \sum_{j=1}^{b}\sum_{i=1}^{a} (X_{ij} - \bar{X}_{.j})^2/ab$. Thus, in the notation of Section 1, $\Lambda^{2/ab}$ defines the statistic $Q_3/Q$.

We reject the hypothesis $H_0$ if $\Lambda \leq \lambda_0$. To find $\lambda_0$ so that we have a desired significance level $\alpha$, we must assume that the hypothesis $H_0$ is true. If the hypothesis $H_0$ is true, the random variables $X_{ij}$ constitute a random sample of size $n = ab$ from a distribution that is normal with mean $\mu$ and variance $\sigma^2$. Thus, by Example 1.2, we have that $Q = Q_3 + Q_4$, where $Q_4 = a\sum_{j=1}^{b} (\bar{X}_{.j} - \bar{X}_{..})^2$; that $Q_3$ and $Q_4$ are independent; and that $Q_3/\sigma^2$ and $Q_4/\sigma^2$ have chi-square distributions with $b(a - 1)$ and $b - 1$ degrees of freedom, respectively. Thus, the statistic defined by $\Lambda^{2/ab}$ may be written

$$\frac{Q_3}{Q_3 + Q_4} = \frac{1}{1 + Q_4/Q_3}.$$

The significance level of the test of $H_0$ is

$$\alpha = P_{H_0}\left[\frac{1}{1 + Q_4/Q_3} \leq \lambda_0^{2/ab}\right] = P_{H_0}\left[\frac{Q_4/(b - 1)}{Q_3/[b(a - 1)]} \geq c\right],$$

where

$$c = \frac{b(a - 1)}{b - 1}\left(\lambda_0^{-2/ab} - 1\right).$$

But

$$F = \frac{Q_4/[\sigma^2(b - 1)]}{Q_3/[\sigma^2 b(a - 1)]} = \frac{Q_4/(b - 1)}{Q_3/[b(a - 1)]}$$

has an $F$-distribution with $b - 1$ and $b(a - 1)$ degrees of freedom. Hence the composite hypothesis $H_0 : \mu_1 = \mu_2 = \cdots = \mu_b = \mu$, $\mu$ unspecified, may be tested against all possible alternatives with an $F$-statistic. Setting the constant $c$ to the upper $\alpha$ $F$-critical point with $b - 1$ and $b(a - 1)$ degrees of freedom, denoted by $F(\alpha, b - 1, b(a - 1))$, yields a test of level $\alpha$.

Remark 2.1. It should be pointed out that a test of the equality of the $b$ means $\mu_j$, $j = 1, 2, \ldots, b$, does not require that we take a random sample of size $a$ from each of the $b$ normal distributions. That is, the samples may be of different sizes, for instance, $a_1, a_2, \ldots, a_b$; see Exercise 2.1.
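The computation behind this test is short enough to sketch directly. The Python function below (standard library only) forms $Q_3$, $Q_4$, and $F$ from a list of samples and also evaluates $\Lambda = (Q_3/Q)^{n/2}$ with $n$ the total sample size; it allows unequal sample sizes in the sense of Remark 2.1. The function name `one_way_f` and the data are hypothetical, chosen only for illustration, and the critical point $F(\alpha, b - 1, n - b)$ must still be read from a table or computed with statistical software.

```python
def one_way_f(samples):
    """F-statistic for H0: all means equal; `samples` is a list of b lists."""
    b = len(samples)
    n = sum(len(s) for s in samples)                  # total sample size
    grand = sum(sum(s) for s in samples) / n          # overall mean
    means = [sum(s) / len(s) for s in samples]        # one mean per sample
    # Q3: sum of squares within samples; Q4: sum of squares among sample means.
    q3 = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    q4 = sum(len(s) * (m - grand) ** 2 for s, m in zip(samples, means))
    f = (q4 / (b - 1)) / (q3 / (n - b))
    return f, b - 1, n - b, q3, q4


# Hypothetical data: b = 3 samples of size a = 3 each.
data = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]
f, df1, df2, q3, q4 = one_way_f(data)
n = sum(len(s) for s in data)
likelihood_ratio = (q3 / (q3 + q4)) ** (n / 2)        # Lambda = (Q3 / Q)^(n/2)
print(f"F = {f:.3f} on ({df1}, {df2}) degrees of freedom")
print(f"Lambda = {likelihood_ratio:.6f}")
```

Large values of $F$ correspond to small values of $\Lambda$, which is why rejecting for $\Lambda \leq \lambda_0$ is the same as rejecting for $F \geq c$.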

Suppose now that we wish to compute the power of the test of $H_0$ against $H_1$ when $H_0$ is false, that is, when we do not have $\mu_1 = \mu_2 = \cdots = \mu_b = \mu$. In Section 3 we show that under $H_1$, $Q_4/\sigma^2$ no longer has a $\chi^2(b - 1)$ distribution. Thus we cannot use an $F$-statistic to compute the power of the test when $H_1$ is true. The problem is discussed in Section 3.

An observation should be made in connection with maximizing a likelihood function with respect to certain parameters. Sometimes it is easier to avoid the use of the calculus. For example, $L(\Omega)$ of this section can be maximized with respect to $\mu_j$, for every fixed positive $\sigma^2$, by minimizing

$$z = \sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \mu_j)^2$$

with respect to $\mu_j$, $j = 1, 2, \ldots, b$. Now $z$ can be written as

$$
\begin{aligned}
z &= \sum_{j=1}^{b}\sum_{i=1}^{a} [(x_{ij} - \bar{x}_{.j}) + (\bar{x}_{.j} - \mu_j)]^2 \\
&= \sum_{j=1}^{b}\sum_{i=1}^{a} (x_{ij} - \bar{x}_{.j})^2 + a\sum_{j=1}^{b} (\bar{x}_{.j} - \mu_j)^2.
\end{aligned}
$$

Since each term in the right-hand member of the preceding equation is nonnegative, clearly $z$ is a minimum, with respect to $\mu_j$, if we take $\mu_j = \bar{x}_{.j}$, $j = 1, 2, \ldots, b$.

EXERCISES

2.1. Let $X_{1j}, X_{2j}, \ldots, X_{a_j j}$ represent independent random samples of sizes $a_j$ from normal distributions with means $\mu_j$ and common variance $\sigma^2$, $j = 1, 2, \ldots, b$. Show that

$$\sum_{j=1}^{b}\sum_{i=1}^{a_j} (X_{ij} - \bar{X}_{..})^2 = \sum_{j=1}^{b}\sum_{i=1}^{a_j} (X_{ij} - \bar{X}_{.j})^2 + \sum_{j=1}^{b} a_j(\bar{X}_{.j} - \bar{X}_{..})^2,$$

or $Q = Q_3 + Q_4$. Here $\bar{X}_{..} = \sum_{j=1}^{b}\sum_{i=1}^{a_j} X_{ij} / \sum_{j=1}^{b} a_j$ and $\bar{X}_{.j} = \sum_{i=1}^{a_j} X_{ij}/a_j$. If $\mu_1 = \mu_2 = \cdots = \mu_b$, show that $Q/\sigma^2$ and $Q_3/\sigma^2$ have chi-square distributions. Prove that $Q_3$ and $Q_4$ are independent, and hence $Q_4/\sigma^2$ also has a chi-square distribution. If the likelihood ratio $\Lambda$ is used to test $H_0 : \mu_1 = \mu_2 = \cdots = \mu_b = \mu$, $\mu$ unspecified and $\sigma^2$ unknown, against all possible alternatives, show that $\Lambda \leq \lambda_0$ is equivalent to the computed $F \geq c$, where

$$F = \frac{\left(\sum_{j=1}^{b} a_j - b\right) Q_4}{(b - 1)Q_3}.$$

Determine the distribution of $F$ when $H_0$ is true and, hence, determine $c$ so that the test has level $\alpha$.

2.2. Consider the $T$-statistic that was derived through a likelihood ratio for testing the equality of the means of two normal distributions having common variance. Show that $T^2$ is exactly the $F$-statistic of Exercise 2.1 with $a_1 = n$, $a_2 = m$, and $b = 2$. Of course, $X_1, \ldots, X_n, \bar{X}$ are replaced with $X_{11}, \ldots, X_{1n}, \bar{X}_{1.}$ and $Y_1, \ldots, Y_m, \bar{Y}$ by $X_{21}, \ldots, X_{2m}, \bar{X}_{2.}$.

2.3. In Exercise 2.1, show that the linear functions $X_{ij} - \bar{X}_{.j}$ and $\bar{X}_{.j} - \bar{X}_{..}$ are uncorrelated.
Hint: Recall the definition of $\bar{X}_{.j}$ and $\bar{X}_{..}$ and, without loss of generality, we can let $E(X_{ij}) = 0$ for all $i, j$.

2.4. The following are observations associated with independent random samples from three normal distributions having equal variances and respective means $\mu_1, \mu_2, \mu_3$.

$$
\begin{array}{l|rrrrrr}
\text{I} & 0.5 & 1.3 & -1.0 & 1.8 & & \\
\text{II} & 2.1 & 3.3 & 0.0 & 2.3 & 2.5 & \\
\text{III} & 3.0 & 5.1 & 1.9 & 2.4 & 4.2 & 4.1
\end{array}
$$

Compute the $F$-statistic that is used to test $H_0 : \mu_1 = \mu_2 = \mu_3$.

2.5. Using the notation of this section, assume that the means satisfy the condition that $\mu = \mu_1 + (b - 1)d = \mu_2 - d = \mu_3 - d = \cdots = \mu_b - d$. That is, the last $b - 1$ means are equal but differ from the first mean $\mu_1$, provided that $d \neq 0$. Let independent random samples of size $a$ be taken from the $b$ normal distributions with common unknown variance $\sigma^2$.

(a) Show that the maximum likelihood estimators of $\mu$ and $d$ are $\hat{\mu} = \bar{X}_{..}$ and

$$\hat{d} = \frac{\sum_{j=2}^{b} \bar{X}_{.j}/(b - 1) - \bar{X}_{.1}}{b}.$$

(b) Using Exercise 1.3, find $Q_6$ and $Q_7 = c\hat{d}^2$ so that, when $d = 0$, $Q_7/\sigma^2$ is $\chi^2(1)$ and

$$\sum_{i=1}^{a}\sum_{j=1}^{b} (X_{ij} - \bar{X}_{..})^2 = Q_3 + Q_6 + Q_7.$$

(c) Argue that the three terms in the right-hand member of part (b), once divided by $\sigma^2$, are independent random variables with chi-square distributions, provided that $d = 0$.

(d) The ratio $Q_7/(Q_3 + Q_6)$ times what constant has an $F$-distribution, provided that $d = 0$? Note that this $F$ is really the square of the two-sample $T$ used to test the equality of the mean of the first distribution and the common mean of the other distributions, in which the last $b - 1$ samples are combined into one.

2.6. Let $\mu_1, \mu_2, \mu_3$ be, respectively, the means of three normal distributions with a common but unknown variance $\sigma^2$. In order to test, at the $\alpha = 5\%$ significance level, the hypothesis $H_0 : \mu_1 = \mu_2 = \mu_3$ against all possible alternative hypotheses, we take an independent random sample of size 4 from each of these distributions. Determine whether we accept or reject $H_0$ if the observed values from these three distributions are, respectively,

$$
\begin{array}{l|rrrr}
X_1: & 5 & 9 & 6 & 8 \\
X_2: & 11 & 13 & 10 & 12 \\
X_3: & 10 & 6 & 9 & 9
\end{array}
$$

2.7. The driver of a diesel-powered automobile decided to test the quality of three types of diesel fuel sold in the area based on mpg. Test the null hypothesis that the three means are equal using the following data. Make the usual assumptions and take $\alpha = 0.05$.

$$
\begin{array}{l|rrrrr}
\text{Brand A:} & 38.7 & 39.2 & 40.1 & 38.9 & \\
\text{Brand B:} & 41.9 & 42.3 & 41.3 & & \\
\text{Brand C:} & 40.8 & 41.2 & 39.5 & 38.9 & 40.3
\end{array}
$$

3 Noncentral $\chi^2$ and $F$-Distributions

Let $X_1, X_2, \ldots, X_n$ denote independent random variables that are $N(\mu_i, \sigma^2)$, $i = 1, 2, \ldots, n$, and consider the quadratic form $Y = \sum_{1}^{n} X_i^2/\sigma^2$. If each $\mu_i$ is zero, we know that $Y$ is $\chi^2(n)$. We shall now investigate the distribution of $Y$ when each $\mu_i$ is not zero. The mgf of $Y$ is given by

$$M(t) = E\left[\exp\left(t\sum_{i=1}^{n} \frac{X_i^2}{\sigma^2}\right)\right] = \prod_{i=1}^{n} E\left[\exp\left(t\frac{X_i^2}{\sigma^2}\right)\right].$$

Consider

$$E\left[\exp\left(\frac{tX_i^2}{\sigma^2}\right)\right].$$

The integral exists if t

2 and noncentrality parameter $\theta$.

3.4. Show that the square of a noncentral $T$ random variable is a noncentral $F$ random variable.

3.5. Let $X_1$ and $X_2$ be two independent random variables. Let $X_1$ and $Y = X_1 + X_2$ be $\chi^2(r_1, \theta_1)$ and $\chi^2(r, \theta)$, respectively. Here $r_1 < r$ and $\theta_1 \leq \theta$. Show that $X_2$ is $\chi^2(r - r_1, \theta - \theta_1)$.

3.6. In Exercise 2.1, if $\mu_1, \mu_2, \ldots, \mu_b$ are not equal, what are the distributions of $Q_3/\sigma^2$, $Q_4/\sigma^2$, and $F$?

4 Multiple Comparisons

Consider $b$ independent random variables that have normal distributions with unknown means $\mu_1, \mu_2, \ldots, \mu_b$, respectively, and with unknown but common variance $\sigma^2$. Let $k_1, \ldots, k_b$ represent $b$ known real constants that are not all zero. We want to find a confidence interval for $\sum_{j=1}^{b} k_j\mu_j$, a linear function of the means $\mu_1, \mu_2, \ldots, \mu_b$. To do this, we take a random sample $X_{1j}, X_{2j}, \ldots, X_{aj}$ of size $a$ from the distribution $N(\mu_j, \sigma^2)$, $j = 1, 2, \ldots, b$. If we denote $\sum_{i=1}^{a} X_{ij}/a$ by $\bar{X}_{.j}$, then we know that $\bar{X}_{.j}$ is $N(\mu_j, \sigma^2/a)$, that $\sum_{i=1}^{a} (X_{ij} - \bar{X}_{.j})^2/\sigma^2$ is $\chi^2(a - 1)$, and that the two random variables are independent. Since the independent random samples are taken from the $b$ distributions, the $2b$ random variables $\bar{X}_{.j}$, $\sum_{i=1}^{a} (X_{ij} - \bar{X}_{.j})^2/\sigma^2$, $j = 1, 2, \ldots, b$, are independent. Moreover, $\bar{X}_{.1}, \bar{X}_{.2}, \ldots, \bar{X}_{.b}$ and

$$\sum_{j=1}^{b}\sum_{i=1}^{a} \frac{(X_{ij} - \bar{X}_{.j})^2}{\sigma^2}$$

are independent and the latter is $\chi^2[b(a - 1)]$. Let $Z = \sum_{1}^{b} k_j\bar{X}_{.j}$. Then $Z$ is normal with mean $\sum_{1}^{b} k_j\mu_j$ and variance $\left(\sum_{1}^{b} k_j^2\right)\sigma^2/a$, and $Z$ is independent of

$$V = \frac{1}{b(a - 1)}\sum_{j=1}^{b}\sum_{i=1}^{a} (X_{ij} - \bar{X}_{.j})^2.$$

Hence the random variable

$$T = \frac{\left(\sum_{1}^{b} k_j\bar{X}_{.j} - \sum_{1}^{b} k_j\mu_j\right)\Big/\sqrt{(\sigma^2/a)\sum_{1}^{b} k_j^2}}{\sqrt{V/\sigma^2}} = \frac{\sum_{1}^{b} k_j\bar{X}_{.j} - \sum_{1}^{b} k_j\mu_j}{\sqrt{(V/a)\sum_{1}^{b} k_j^2}}$$

has a $t$-distribution with $b(a - 1)$ degrees of freedom. For $0 < \alpha < 1$, let $c = t_{\alpha/2, b(a-1)}$. It follows that the probability is $1 - \alpha$ that

$$\sum_{1}^{b} k_j\bar{X}_{.j} - c\sqrt{\left(\sum_{1}^{b} k_j^2\right)\frac{V}{a}} \leq \sum_{1}^{b} k_j\mu_j \leq \sum_{1}^{b} k_j\bar{X}_{.j} + c\sqrt{\left(\sum_{1}^{b} k_j^2\right)\frac{V}{a}}.$$

The observed values of $\bar{X}_{.j}$, $j = 1, 2, \ldots, b$, and $V$ provide a $100(1 - \alpha)\%$ confidence interval for $\sum_{1}^{b} k_j\mu_j$.

It should be observed that the confidence interval for $\sum_{1}^{b} k_j\mu_j$ depends upon the particular choice of $k_1, k_2, \ldots, k_b$. It is conceivable that we may be interested in more than one linear function of $\mu_1, \mu_2, \ldots, \mu_b$, such as $\mu_2 - \mu_1$, $\mu_3 - (\mu_1 + \mu_2)/2$, or $\mu_1 + \cdots + \mu_b$. We can, of course, find for each $\sum_{1}^{b} k_j\mu_j$ a random interval that has a preassigned probability of including that particular $\sum_{1}^{b} k_j\mu_j$. But how can we compute the probability that simultaneously these random intervals include their respective linear functions of $\mu_1, \mu_2, \ldots, \mu_b$? The following procedure of multiple comparisons, due to Scheffé, is one solution to this problem.

The random variable

$$\frac{\sum_{j=1}^{b} (\bar{X}_{.j} - \mu_j)^2}{\sigma^2/a}$$

is $\chi^2(b)$ and, because it is a function of $\bar{X}_{.1}, \ldots, \bar{X}_{.b}$ alone, it is independent of the random variable

$$V = \frac{1}{b(a - 1)}\sum_{j=1}^{b}\sum_{i=1}^{a} (X_{ij} - \bar{X}_{.j})^2.$$

Hence, the random variable

$$F = \frac{a\sum_{j=1}^{b} (\bar{X}_{.j} - \mu_j)^2/b}{V}$$

has an $F$-distribution with $b$ and $b(a - 1)$ degrees of freedom. For $0 < \alpha < 1$, let $d = F(\alpha, b, b(a - 1))$. Then $P(F \leq d) = 1 - \alpha$ or

$$P\left[\sum_{j=1}^{b} (\bar{X}_{.j} - \mu_j)^2 \leq bd\frac{V}{a}\right] = 1 - \alpha.$$

Note that $\sum_{j=1}^{b} (\bar{X}_{.j} - \mu_j)^2$ is the square of the distance, in $b$-dimensional space, from the point $(\mu_1, \mu_2, \ldots, \mu_b)$ to the random point $(\bar{X}_{.1}, \bar{X}_{.2}, \ldots, \bar{X}_{.b})$. Consider a space of dimension $b$ and let $(t_1, t_2, \ldots, t_b)$ denote the coordinates of a point in that space. An equation of a hyperplane that passes through the point $(\mu_1, \mu_2, \ldots, \mu_b)$ is given by

$$k_1(t_1 - \mu_1) + k_2(t_2 - \mu_2) + \cdots + k_b(t_b - \mu_b) = 0, \tag{4.1}$$

where not all the real numbers $k_j$, $j = 1, 2, \ldots, b$, are equal to zero. The square of the distance from this hyperplane to the point $(t_1 = \bar{X}_{.1}, t_2 = \bar{X}_{.2}, \ldots, t_b = \bar{X}_{.b})$ is

$$\frac{[k_1(\bar{X}_{.1} - \mu_1) + k_2(\bar{X}_{.2} - \mu_2) + \cdots + k_b(\bar{X}_{.b} - \mu_b)]^2}{k_1^2 + k_2^2 + \cdots + k_b^2}. \tag{4.2}$$

From the geometry of the situation it follows that $\sum_{1}^{b} (\bar{X}_{.j} - \mu_j)^2$ is equal to the maximum of expression (4.2) with respect to $k_1, k_2, \ldots, k_b$. Thus the inequality $\sum_{1}^{b} (\bar{X}_{.j} - \mu_j)^2 \leq (bd)(V/a)$ holds if and only if

$$\frac{\left[\sum_{1}^{b} k_j(\bar{X}_{.j} - \mu_j)\right]^2}{\sum_{j=1}^{b} k_j^2} \leq bd\frac{V}{a}, \tag{4.3}$$

for every real $k_1, k_2, \ldots, k_b$, not all zero. Accordingly, these two equivalent events have the same probability, $1 - \alpha$. However, inequality (4.3) may be written in the form

$$\left|\sum_{1}^{b} k_j\bar{X}_{.j} - \sum_{1}^{b} k_j\mu_j\right| \leq \sqrt{bd\left(\sum_{1}^{b} k_j^2\right)\frac{V}{a}}.$$

Thus the probability is $1 - \alpha$ that simultaneously, for all real $k_1, k_2, \ldots, k_b$, not all zero,

$$\sum_{1}^{b} k_j\bar{X}_{.j} - \sqrt{bd\left(\sum_{1}^{b} k_j^2\right)\frac{V}{a}} \leq \sum_{1}^{b} k_j\mu_j \leq \sum_{1}^{b} k_j\bar{X}_{.j} + \sqrt{bd\left(\sum_{1}^{b} k_j^2\right)\frac{V}{a}}. \tag{4.4}$$

Denote by $A$ the event where inequality (4.4) is true for all real $k_1, \ldots, k_b$, and denote by $B$ the event where that inequality is true for a finite number of $b$-tuples $(k_1, \ldots, k_b)$. If $A$ occurs, then $B$ occurs; hence, $P(A) \leq P(B)$. In the applications, one is often interested only in a finite number of linear functions $\sum_{1}^{b} k_j\mu_j$. Once the observed values are available, we obtain from (4.4) a confidence interval for each of these linear functions. Since $P(B) \geq P(A) = 1 - \alpha$, we have a confidence coefficient of at least $100(1 - \alpha)\%$ that the linear functions are in these respective confidence intervals.

Remark 4.1. If the sample sizes, say $a_1, a_2, \ldots, a_b$, are unequal, inequality (4.4) becomes

$$\sum_{1}^{b} k_j\bar{X}_{.j} - \sqrt{bd\sum_{1}^{b} \frac{k_j^2}{a_j}V} \leq \sum_{1}^{b} k_j\mu_j \leq \sum_{1}^{b} k_j\bar{X}_{.j} + \sqrt{bd\sum_{1}^{b} \frac{k_j^2}{a_j}V}, \tag{4.5}$$

where

$$\bar{X}_{.j} = \frac{\sum_{i=1}^{a_j} X_{ij}}{a_j}, \qquad V = \frac{\sum_{j=1}^{b}\sum_{i=1}^{a_j} (X_{ij} - \bar{X}_{.j})^2}{\sum_{1}^{b} (a_j - 1)},$$

and $d$ is selected from Table V with $b$ and $\sum_{1}^{b} (a_j - 1)$ degrees of freedom. Inequality (4.5) reduces to inequality (4.4) when $a_1 = a_2 = \cdots = a_b$.

Moreover, if we restrict our attention to linear functions of the form $\sum_{1}^{b} k_j\mu_j$ with $\sum_{1}^{b} k_j = 0$ (such linear functions are called contrasts), the radical in inequality (4.5) is replaced by

$$\sqrt{d(b - 1)\sum_{1}^{b} \frac{k_j^2}{a_j}V},$$

where $d$ is now found in Table V with $b - 1$ and $\sum_{1}^{b} (a_j - 1)$ degrees of freedom.

In multiple comparisons based on the Scheffé procedure, one often finds that the length of a confidence interval is much greater than the length of a $100(1 - \alpha)\%$ confidence interval for a particular linear function $\sum_{1}^{b} k_j\mu_j$. But this is to be expected because in one case the probability $1 - \alpha$ applies to just one event, and in the other it applies to the simultaneous occurrence of many events. One reasonable way to reduce the length of these intervals is to take a larger value of $\alpha$, say 0.25, instead of 0.05. After all, it is still a very strong statement to say that the probability is 0.75 that all these events occur.

There are, however, other multiple comparison procedures which are often used in practice. One of these is the Bonferroni procedure described in Exercise 4.2. This procedure can be used for a finite number of confidence intervals and, as Exercise 4.3 shows, the concept is easily extended to tests of hypotheses. In the case of the $\binom{b}{2}$ pairwise comparisons of means, i.e., comparisons of the form $\mu_i - \mu_j$, the procedure most often used is the Tukey–Kramer procedure; see Miller (1981) and Hsu (1996) for discussion.
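To see how much wider the simultaneous intervals are, the half-widths of the single-function $t$ interval and of the Scheffé interval (4.5) can be compared directly. The Python sketch below uses only the standard library and takes the critical values $c$ and $d$ as arguments, since they come from $t$ and $F$ tables. The function names are hypothetical, the data summary ($b = 3$, $a = 5$, $V = 2$) is made up for illustration, and the two critical values are the usual rounded table entries for 12 and for (3, 12) degrees of freedom at $\alpha = 0.05$.

```python
import math


def pooled_variance(samples):
    """V: within-sample sum of squares divided by the sum of (a_j - 1)."""
    ss = 0.0
    for s in samples:
        m = sum(s) / len(s)
        ss += sum((x - m) ** 2 for x in s)
    return ss / sum(len(s) - 1 for s in samples)


def t_halfwidth(k, sizes, v, c):
    """Half-width of the interval for one linear function sum k_j mu_j."""
    return c * math.sqrt(v * sum(kj ** 2 / aj for kj, aj in zip(k, sizes)))


def scheffe_halfwidth(k, sizes, v, d):
    """Half-width of the Scheffe interval (4.5); d is the upper F point."""
    b = len(k)
    return math.sqrt(b * d * v * sum(kj ** 2 / aj for kj, aj in zip(k, sizes)))


# Hypothetical setting: b = 3 samples of size a = 5, so b(a - 1) = 12.
k = [1.0, -1.0, 0.0]          # the linear function mu_1 - mu_2
sizes = [5, 5, 5]
v = 2.0                       # assumed observed value of V
c = 2.179                     # t table value, alpha/2 = 0.025, 12 df (rounded)
d = 3.49                      # F table value, alpha = 0.05, (3, 12) df (rounded)
print("single t half-width:", round(t_halfwidth(k, sizes, v, c), 3))
print("Scheffe half-width: ", round(scheffe_halfwidth(k, sizes, v, d), 3))
```

With equal sample sizes the quantity $\sum k_j^2/a_j$ reduces to $\sum k_j^2/a$, so the same functions cover inequality (4.4) as well.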

EXERCISES

4.1. If $A_1, A_2, \ldots, A_k$ are events, prove, by induction, Boole's inequality

$$P(A_1 \cup A_2 \cup \cdots \cup A_k) \leq \sum_{1}^{k} P(A_i).$$

Then show that

$$P(A_1^c \cap A_2^c \cap \cdots \cap A_k^c) \geq 1 - \sum_{1}^{k} P(A_i).$$

4.2 (Bonferroni Multiple Comparison Procedure). In the notation of this section, let $(k_{i1}, k_{i2}, \ldots, k_{ib})$, $i = 1, 2, \ldots, m$, represent a finite number of $b$-tuples. The problem is to find simultaneous confidence intervals for $\sum_{j=1}^{b} k_{ij}\mu_j$, $i = 1, 2, \ldots, m$, by a method different from that of Scheffé. Define the random variable $T_i$ by

$$T_i = \left(\sum_{j=1}^{b} k_{ij}\bar{X}_{.j} - \sum_{j=1}^{b} k_{ij}\mu_j\right)\Bigg/\sqrt{\left(\sum_{j=1}^{b} k_{ij}^2\right)V/a}, \quad i = 1, 2, \ldots, m.$$

(a) Let the event $A_i^c$ be given by $-c_i \leq T_i \leq c_i$, $i = 1, 2, \ldots, m$. Find the random variables $U_i$ and $W_i$ such that $U_i \leq \sum_{1}^{b} k_{ij}\mu_j \leq W_i$ is equivalent to $A_i^c$.

(b) Select $c_i = t_{\alpha/(2m), b(a-1)}$. Then $P(A_i^c) = 1 - \alpha/m$; i.e., $P(A_i) = \alpha/m$. Determine a lower bound on the probability that simultaneously the random intervals $(U_1, W_1), \ldots, (U_m, W_m)$ include $\sum_{j=1}^{b} k_{1j}\mu_j, \ldots, \sum_{j=1}^{b} k_{mj}\mu_j$, respectively. Hint: Use Exercise 4.1.

(c) Let $a = 3$, $b = 6$, and $\alpha = 0.05$. Consider the linear functions $\mu_1 - \mu_2$, $\mu_2 - \mu_3$, $\mu_3 - \mu_4$, $\mu_4 - (\mu_5 + \mu_6)/2$, and $(\mu_1 + \mu_2 + \cdots + \mu_6)/6$. Here $m = 5$. Show that the lengths of the confidence intervals given by the results of part (b) are shorter than the corresponding ones given by the method of Scheffé. If $m$ becomes sufficiently large, however, this is not the case.

4.3. Extend the Bonferroni procedure described in the last problem to simultaneous testing. That is, suppose we have $m$ hypotheses of interest: $H_{0i}$ versus $H_{1i}$, $i = 1, \ldots, m$. For testing $H_{0i}$ versus $H_{1i}$, let $C_{i,\alpha}$ be a critical region of size $\alpha$ and assume $H_{0i}$ is rejected if $X_i \in C_{i,\alpha}$, for a sample $X_i$. Determine a rule so that we can simultaneously test these $m$ hypotheses with a Type I error rate less than or equal to $\alpha$.

5 The Analysis of Variance

Recall the one-way analysis of variance (ANOVA) problem considered in Section 2 which was concerned with one factor at $b$ levels. In this section, we are concerned with the situation where we have two factors $A$ and $B$ with levels $a$ and $b$, respectively. This is called a two-way analysis of variance (ANOVA). Let $X_{ij}$, $i = 1, 2, \ldots, a$ and $j = 1, 2, \ldots, b$, denote the response for factor $A$ at level $i$ and factor $B$ at level $j$. Denote the total sample size by $n = ab$. We shall assume that the $X_{ij}$'s are independent normally distributed random variables with common variance $\sigma^2$. Denote the mean of $X_{ij}$ by $\mu_{ij}$. The mean $\mu_{ij}$ is often referred to as the mean of the $(i, j)$th cell. For our first model, we consider the additive model where

$$\mu_{ij} = \bar{\mu} + (\bar{\mu}_{i\cdot} - \bar{\mu}) + (\bar{\mu}_{\cdot j} - \bar{\mu}); \tag{5.1}$$

that is, the mean in the $(i, j)$th cell is due to additive effects of the levels, $i$ of factor $A$ and $j$ of factor $B$, over the average (constant) $\bar{\mu}$. Let $\alpha_i = \bar{\mu}_{i\cdot} - \bar{\mu}$, $i = 1, \ldots, a$; $\beta_j = \bar{\mu}_{\cdot j} - \bar{\mu}$, $j = 1, \ldots, b$; and $\mu = \bar{\mu}$. Then the model can be written more simply as

$$\mu_{ij} = \mu + \alpha_i + \beta_j, \tag{5.2}$$

where $\sum_{i=1}^{a} \alpha_i = 0$ and $\sum_{j=1}^{b} \beta_j = 0$. We refer to this model as being a two-way ANOVA model.

For example, take $a = 2$, $b = 3$, $\mu = 5$, $\alpha_1 = 1$, $\alpha_2 = -1$, $\beta_1 = 1$, $\beta_2 = 0$, and $\beta_3 = -1$. Then the cell means are

$$
\begin{array}{c|ccc}
 & \multicolumn{3}{c}{\text{Factor } B} \\
\text{Factor } A & 1 & 2 & 3 \\ \hline
1 & \mu_{11} = 7 & \mu_{12} = 6 & \mu_{13} = 5 \\
2 & \mu_{21} = 5 & \mu_{22} = 4 & \mu_{23} = 3
\end{array}
$$

Note that for each $i$, the plots of $\mu_{ij}$ versus $j$ are parallel. This is true for additive models in general; see Exercise 5.8. We call these plots mean profile plots. Had we taken $\beta_1 = \beta_2 = \beta_3 = 0$, then the cell means would be

$$
\begin{array}{c|ccc}
 & \multicolumn{3}{c}{\text{Factor } B} \\
\text{Factor } A & 1 & 2 & 3 \\ \hline
1 & \mu_{11} = 6 & \mu_{12} = 6 & \mu_{13} = 6 \\
2 & \mu_{21} = 4 & \mu_{22} = 4 & \mu_{23} = 4
\end{array}
$$
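The two tables of cell means can be reproduced from model (5.2), and the parallelism of the mean profile plots can be checked by confirming that each row differs from the first row by a constant shift. The following Python sketch (standard library only) does this; the function names `cell_means` and `profiles_parallel` are hypothetical and serve only as an illustration.

```python
def cell_means(mu, alpha, beta):
    """Cell means mu_ij = mu + alpha_i + beta_j of the additive model (5.2)."""
    return [[mu + a_i + b_j for b_j in beta] for a_i in alpha]


def profiles_parallel(means, tol=1e-12):
    """True when every row equals the first row plus a constant shift."""
    first = means[0]
    for row in means[1:]:
        shifts = [r - f for r, f in zip(row, first)]
        if max(shifts) - min(shifts) > tol:
            return False
    return True


# The two examples from the text: a = 2, b = 3, mu = 5, alpha = (1, -1).
first_table = cell_means(5, [1, -1], [1, 0, -1])
second_table = cell_means(5, [1, -1], [0, 0, 0])
print(first_table, profiles_parallel(first_table))
print(second_table, profiles_parallel(second_table))
```

Any table of cell means that cannot be written in the form (5.2) fails the check, which is the situation treated later in this section under the name interaction.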

The hypotheses of interest are

$$H_{0A} : \alpha_1 = \cdots = \alpha_a = 0 \text{ versus } H_{1A} : \alpha_i \neq 0, \text{ for some } i, \tag{5.3}$$

and

$$H_{0B} : \beta_1 = \cdots = \beta_b = 0 \text{ versus } H_{1B} : \beta_j \neq 0, \text{ for some } j. \tag{5.4}$$

If $H_{0A}$ is true, then by (5.2) the mean of the $(i, j)$th cell does not depend on the level of $A$. The second example above is under $H_{0B}$. The cell means remain the same from column to column for a specified row. We call these hypotheses main effect hypotheses.

Remark 5.1. The model just described, and others similar to it, are widely used in statistical applications. Consider a situation in which it is desirable to investigate the effects of two factors that influence an outcome. Thus the variety of a grain and the type of fertilizer used influence the yield; or the teacher and the size of the class may influence the score on a standardized test. Let $X_{ij}$ denote the yield from the use of variety $i$ of a grain and type $j$ of fertilizer. A test of the hypothesis that $\beta_1 = \beta_2 = \cdots = \beta_b = 0$ would then be a test of the hypothesis that the mean yield of each variety of grain is the same regardless of the type of fertilizer used.

To construct a test of the composite hypothesis $H_{0B}$ versus $H_{1B}$, we could obtain the corresponding likelihood ratio. However, to gain more insight into such a test, let us reconsider the likelihood ratio test of Section 2, namely, that of the equality of the means of $b$ distributions. There the important quadratic forms are $Q$, $Q_3$, and $Q_4$, which are related through the equation $Q = Q_4 + Q_3$. That is,

$$(ab - 1)S^2 = \sum_{j=1}^{b}\sum_{i=1}^{a} (\bar{X}_{.j} - \bar{X}_{..})^2 + \sum_{j=1}^{b}\sum_{i=1}^{a} (X_{ij} - \bar{X}_{.j})^2,$$

so we see that the total sum of squares, $(ab - 1)S^2$, is decomposed into a sum of squares, $Q_4$, among column means and a sum of squares, $Q_3$, within columns. The latter sum of squares, divided by $n = ab$, is the mle of $\sigma^2$, provided that the parameters are in $\Omega$; and we denote it by $\hat{\sigma}^2_{\Omega}$. Of course, $(ab - 1)S^2/ab$ is the mle of $\sigma^2$ under $\omega$, here denoted by $\hat{\sigma}^2_{\omega}$. So the likelihood ratio $\Lambda = (\hat{\sigma}^2_{\Omega}/\hat{\sigma}^2_{\omega})^{ab/2}$ is a monotone function of the statistic

$$F = \frac{Q_4/(b - 1)}{Q_3/[b(a - 1)]}$$

upon which the test of the equality of means is based.

To help find a test for $H_{0B}$ versus $H_{1B}$, (5.4), return to the decomposition of Example 1.3, Section 1, namely, $Q = Q_2 + Q_4 + Q_5$. That is,

$$(ab - 1)S^2 = \sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{X}_{i.} - \bar{X}_{..})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b} (\bar{X}_{.j} - \bar{X}_{..})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b} (X_{ij} - \bar{X}_{i.} - \bar{X}_{.j} + \bar{X}_{..})^2.$$

Thus the total sum of squares is decomposed into that among rows ($Q_2$), that among columns ($Q_4$), and that remaining ($Q_5$). It is interesting to observe that $\hat{\sigma}^2_{\Omega} = Q_5/ab$ is the mle of $\sigma^2$ under $\Omega$ and

$$\hat{\sigma}^2_{\omega} = \frac{Q_4 + Q_5}{ab} = \sum_{i=1}^{a}\sum_{j=1}^{b} \frac{(X_{ij} - \bar{X}_{i.})^2}{ab}$$

is that estimator under $\omega$. A useful monotone function of the likelihood ratio $\Lambda = (\hat{\sigma}^2_{\Omega}/\hat{\sigma}^2_{\omega})^{ab/2}$ is

$$F = \frac{Q_4/(b - 1)}{Q_5/[(a - 1)(b - 1)]},$$

which has, under $H_{0B}$, an $F$-distribution with $b - 1$ and $(a - 1)(b - 1)$ degrees of freedom. The hypothesis $H_{0B}$ is rejected if $F \geq F(\alpha, b - 1, (a - 1)(b - 1))$, at significance level $\alpha$. This is the likelihood ratio test for $H_{0B}$ versus $H_{1B}$.
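As a rough illustration of the computation, the Python sketch below (standard library only) forms $Q_2$, $Q_4$, and $Q_5$ for a two-way layout with one observation per cell and returns the column statistic $Q_4/(b-1)$ over $Q_5/[(a-1)(b-1)]$ together with the analogous ratio that uses the row sum of squares $Q_2$. The function name `two_way_f` and the small data array are hypothetical; the code assumes $Q_5 > 0$, and critical values must still come from an $F$ table.

```python
def two_way_f(x):
    """Sums of squares and F ratios for an a-by-b table, one observation per cell."""
    a, b = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (a * b)
    row_mean = [sum(row) / b for row in x]
    col_mean = [sum(x[i][j] for i in range(a)) / a for j in range(b)]
    q2 = b * sum((m - grand) ** 2 for m in row_mean)       # among rows
    q4 = a * sum((m - grand) ** 2 for m in col_mean)       # among columns
    q5 = sum((x[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
             for i in range(a) for j in range(b))          # remainder
    df_rem = (a - 1) * (b - 1)
    f_cols = (q4 / (b - 1)) / (q5 / df_rem)                # tests H0B
    f_rows = (q2 / (a - 1)) / (q5 / df_rem)                # row analogue
    return {"Q2": q2, "Q4": q4, "Q5": q5, "F_cols": f_cols, "F_rows": f_rows}


# Hypothetical 2-by-3 table of observations.
table = [[1.0, 2.0, 4.0], [3.0, 5.0, 9.0]]
for name, value in two_way_f(table).items():
    print(f"{name} = {value:.4f}")
```

Both ratios share the denominator $Q_5/[(a-1)(b-1)]$, which is the sense in which $Q_5$ plays the role of the error sum of squares in this model.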

If we are to compute the power function of the test, we need the distribution of $F$ when $H_{0B}$ is not true. From Section 3 we know, when $H_{1B}$ is true, that $Q_4/\sigma^2$ and $Q_5/\sigma^2$ are independent (central or noncentral) chi-square variables. We shall compute the noncentrality parameters of $Q_4/\sigma^2$ and $Q_5/\sigma^2$ when $H_{1B}$ is true. We have $E(X_{ij}) = \mu + \alpha_i + \beta_j$, $E(\bar{X}_{i.}) = \mu + \alpha_i$, $E(\bar{X}_{.j}) = \mu + \beta_j$, and $E(\bar{X}_{..}) = \mu$. Accordingly, the noncentrality parameter of $Q_4/\sigma^2$ is

$$\frac{a}{\sigma^2}\sum_{j=1}^{b} (\mu + \beta_j - \mu)^2 = \frac{a}{\sigma^2}\sum_{j=1}^{b} \beta_j^2$$

and that of $Q_5/\sigma^2$ is

$$\sigma^{-2}\sum_{j=1}^{b}\sum_{i=1}^{a} (\mu + \alpha_i + \beta_j - \mu - \alpha_i - \mu - \beta_j + \mu)^2 = 0.$$

Thus, if the hypothesis $H_{0B}$ is not true, $F$ has a noncentral $F$-distribution with $b - 1$ and $(a - 1)(b - 1)$ degrees of freedom and noncentrality parameter $a\sum_{j=1}^{b} \beta_j^2/\sigma^2$. The desired probabilities can then be found in tables of the noncentral $F$-distribution.

A similar argument can be used to construct the $F$ needed to test the equality of row means; that is, $H_{0A}$ versus $H_{1A}$, (5.3). The $F$ test statistic is essentially the ratio of the sum of squares among rows and $Q_5$. In particular, this $F$ is defined by

$$F = \frac{Q_2/(a - 1)}{Q_5/[(a - 1)(b - 1)]}$$

and under $H_{0A} : \alpha_1 = \alpha_2 = \cdots = \alpha_a = 0$ has an $F$-distribution with $a - 1$ and $(a - 1)(b - 1)$ degrees of freedom.

The analysis of variance problem that has just been discussed is usually referred to as a two-way classification with one observation per cell. Each combination of $i$ and $j$ determines a cell; thus, there is a total of $ab$ cells in this model. Let us now investigate another two-way classification problem, but in this case we take $c > 1$ independent observations per cell.

Let $X_{ijk}$, $i = 1, 2, \ldots, a$, $j = 1, 2, \ldots, b$, and $k = 1, 2, \ldots, c$, denote $n = abc$ random variables which are independent and which have normal distributions with common, but unknown, variance $\sigma^2$. Denote the mean of each $X_{ijk}$, $k = 1, 2, \ldots, c$, by $\mu_{ij}$. Under the additive model, (5.1), the mean of each cell depended on its row and column, but often the mean is cell-specific. To allow this, consider the parameters

$$
\begin{aligned}
\gamma_{ij} &= \mu_{ij} - \{\bar{\mu} + (\bar{\mu}_{i\cdot} - \bar{\mu}) + (\bar{\mu}_{\cdot j} - \bar{\mu})\} \\
&= \mu_{ij} - \bar{\mu}_{i\cdot} - \bar{\mu}_{\cdot j} + \bar{\mu},
\end{aligned}
$$

for $i = 1, \ldots, a$, $j = 1, \ldots, b$. Hence $\gamma_{ij}$ reflects the specific contribution to the cell mean over and above the additive model. These parameters are called interaction parameters. Using the second form (5.2), we can write the cell means as

$$\mu_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}, \tag{5.5}$$

505

Inferences About Normal Models a b a b where i=1 αi = 0, j=1 βj = 0, and i=1 γij = j=1 γij = 0. This model is called a two-way model with interaction. For example, take a = 2, b = 3, μ = 5, α1 = 1, α2 = −1, β1 = 1, β2 = 0, β3 = −1, γ11 = 1, γ12 = 1, γ13 = −2, γ21 = −1, γ22 = −1, and γ23 = 2. Then the cell means are

                       Factor B
Factor A        1            2            3
   1        μ11 = 8      μ12 = 7      μ13 = 3
   2        μ21 = 4      μ22 = 3      μ23 = 5

Note that, if each γij = 0, then the cell means are

                       Factor B
Factor A        1            2            3
   1        μ11 = 7      μ12 = 6      μ13 = 5
   2        μ21 = 5      μ22 = 4      μ23 = 3
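The two tables of cell means can be checked mechanically. The following is a minimal sketch in plain Python, not part of the original text: the function names `cell_means` and `decompose` are hypothetical, and `decompose` simply applies the definitions $\alpha_i = \bar\mu_{i\cdot} - \mu$, $\beta_j = \bar\mu_{\cdot j} - \mu$, and $\gamma_{ij} = \mu_{ij} - \bar\mu_{i\cdot} - \bar\mu_{\cdot j} + \mu$ to a table of cell means.

```python
# Parameter values of the example (a = 2, b = 3).
mu = 5
alpha = [1, -1]
beta = [1, 0, -1]
gamma = [[1, 1, -2], [-1, -1, 2]]


def cell_means(mu, alpha, beta, gamma):
    """Cell means mu_ij = mu + alpha_i + beta_j + gamma_ij, as in (5.5)."""
    return [[mu + alpha[i] + beta[j] + gamma[i][j] for j in range(len(beta))]
            for i in range(len(alpha))]


def decompose(m):
    """Recover (mu, alpha, beta, gamma) from a table m of cell means."""
    a, b = len(m), len(m[0])
    grand = sum(sum(row) for row in m) / (a * b)
    row_means = [sum(row) / b for row in m]
    col_means = [sum(m[i][j] for i in range(a)) / a for j in range(b)]
    alpha = [r - grand for r in row_means]
    beta = [c - grand for c in col_means]
    gamma = [[m[i][j] - row_means[i] - col_means[j] + grand for j in range(b)]
             for i in range(a)]
    return grand, alpha, beta, gamma


with_interaction = cell_means(mu, alpha, beta, gamma)        # [[8, 7, 3], [4, 3, 5]]
additive = cell_means(mu, alpha, beta, [[0] * 3, [0] * 3])   # [[7, 6, 5], [5, 4, 3]]
```

Applying `decompose` to `with_interaction` returns the parameter values the example started from, which illustrates that the side conditions on $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ make the representation (5.5) unique.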

Note that the mean profile plots for this second example are parallel, but those in the first example (where interaction is present) are not. The major hypotheses of interest for the interaction model are

$$H_{0AB}: \gamma_{ij} = 0 \text{ for all } i, j \quad\text{versus}\quad H_{1AB}: \gamma_{ij} \neq 0 \text{ for some } i, j. \tag{5.6}$$

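From Exercise 1.4 of Section 1, we have that

$$\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(X_{ijk} - \bar X_{\cdot\cdot\cdot})^2 = bc\sum_{i=1}^{a}(\bar X_{i\cdot\cdot} - \bar X_{\cdot\cdot\cdot})^2 + ac\sum_{j=1}^{b}(\bar X_{\cdot j\cdot} - \bar X_{\cdot\cdot\cdot})^2 + c\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar X_{ij\cdot} - \bar X_{i\cdot\cdot} - \bar X_{\cdot j\cdot} + \bar X_{\cdot\cdot\cdot})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(X_{ijk} - \bar X_{ij\cdot})^2;$$

that is, the total sum of squares is decomposed into that due to row differences, that due to column differences, that due to interaction, and that within cells. The test of $H_{0AB}$ versus $H_{1AB}$ is based upon an $F$ with $(a-1)(b-1)$ and $ab(c-1)$ degrees of freedom given by

$$F = \frac{\left[c\sum_{i=1}^{a}\sum_{j=1}^{b}(\bar X_{ij\cdot} - \bar X_{i\cdot\cdot} - \bar X_{\cdot j\cdot} + \bar X_{\cdot\cdot\cdot})^2\right]\Big/[(a-1)(b-1)]}{\left[\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(X_{ijk} - \bar X_{ij\cdot})^2\right]\Big/[ab(c-1)]}.$$

The reader should verify that the noncentrality parameter of this $F$-distribution is equal to $c\sum_{j=1}^{b}\sum_{i=1}^{a}\gamma_{ij}^2/\sigma^2$. Thus $F$ is central when $H_{0AB}: \gamma_{ij} = 0$, $i = 1, 2, \ldots, a$, $j = 1, 2, \ldots, b$, is true.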
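The decomposition of the total sum of squares and the resulting $F$ ratio can be sketched numerically. The code below is an illustration only, in plain Python: the function name `two_way_anova` and the simulated data are assumptions, not material from the text. Besides the interaction ratio it also forms the analogous ratios of the row and column mean squares to the within-cell mean square.

```python
import random


def two_way_anova(x):
    """Sums of squares and F ratios for x[i][j][k]: a rows, b columns, c > 1 per cell."""
    a, b, c = len(x), len(x[0]), len(x[0][0])
    cell = [[sum(x[i][j]) / c for j in range(b)] for i in range(a)]   # Xbar_ij.
    row = [sum(cell[i]) / b for i in range(a)]                        # Xbar_i..
    col = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]   # Xbar_.j.
    grand = sum(row) / a                                              # Xbar_...
    cells = [(i, j) for i in range(a) for j in range(b)]

    ss = {
        "total": sum((v - grand) ** 2 for i, j in cells for v in x[i][j]),
        "rows": b * c * sum((r - grand) ** 2 for r in row),
        "cols": a * c * sum((m - grand) ** 2 for m in col),
        "interaction": c * sum((cell[i][j] - row[i] - col[j] + grand) ** 2
                               for i, j in cells),
        "within": sum((v - cell[i][j]) ** 2 for i, j in cells for v in x[i][j]),
    }
    ms_within = ss["within"] / (a * b * (c - 1))
    f = {
        "interaction": ss["interaction"] / ((a - 1) * (b - 1)) / ms_within,
        "rows": ss["rows"] / (a - 1) / ms_within,
        "cols": ss["cols"] / (b - 1) / ms_within,
    }
    return ss, f


# Simulated (hypothetical) data: additive cell means 5 + i - j plus N(0, 1) noise.
random.seed(1)
a, b, c = 3, 4, 2
data = [[[random.gauss(5 + i - j, 1.0) for _ in range(c)] for j in range(b)]
        for i in range(a)]
ss, f = two_way_anova(data)
```

In this balanced layout the four component sums of squares returned in `ss` add up to `ss["total"]`, up to floating-point rounding.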


If $H_{0AB}: \gamma_{ij} = 0$ is accepted, then one usually continues to test $\alpha_i = 0$, $i = 1, 2, \ldots, a$, by using the test statistic

$$F = \frac{bc\sum_{i=1}^{a}(\bar X_{i\cdot\cdot} - \bar X_{\cdot\cdot\cdot})^2/(a-1)}{\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(X_{ijk} - \bar X_{ij\cdot})^2/[ab(c-1)]},$$

which has a null $F$-distribution with $a-1$ and $ab(c-1)$ degrees of freedom. Similarly, the test of $\beta_j = 0$, $j = 1, 2, \ldots, b$, proceeds by using the test statistic

$$F = \frac{ac\sum_{j=1}^{b}(\bar X_{\cdot j\cdot} - \bar X_{\cdot\cdot\cdot})^2/(b-1)}{\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(X_{ijk} - \bar X_{ij\cdot})^2/[ab(c-1)]},$$

which has a null $F$-distribution with $b-1$ and $ab(c-1)$ degrees of freedom.

EXERCISES

5.1. Show that

$$\sum_{j=1}^{b}\sum_{i=1}^{a}(X_{ij} - \bar X_{i\cdot})^2 = \sum_{j=1}^{b}\sum_{i=1}^{a}(X_{ij} - \bar X_{i\cdot} - \bar X_{\cdot j} + \bar X_{\cdot\cdot})^2 + a\sum_{j=1}^{b}(\bar X_{\cdot j} - \bar X_{\cdot\cdot})^2.$$

5.2. If at least one $\gamma_{ij} \neq 0$, show that the $F$, which is used to test that each interaction is equal to zero, has noncentrality parameter equal to $c\sum_{j=1}^{b}\sum_{i=1}^{a}\gamma_{ij}^2/\sigma^2$.

5.3. Using the background of the two-way classification with one observation per cell, show that the maximum likelihood estimators of $\alpha_i$, $\beta_j$, and $\mu$ are $\hat\alpha_i = \bar X_{i\cdot} - \bar X_{\cdot\cdot}$, $\hat\beta_j = \bar X_{\cdot j} - \bar X_{\cdot\cdot}$, and $\hat\mu = \bar X_{\cdot\cdot}$, respectively. Show that these are unbiased estimators of their respective parameters and compute $\mathrm{var}(\hat\alpha_i)$, $\mathrm{var}(\hat\beta_j)$, and $\mathrm{var}(\hat\mu)$.

5.4. Prove that the linear functions $X_{ij} - \bar X_{i\cdot} - \bar X_{\cdot j} + \bar X_{\cdot\cdot}$ and $\bar X_{\cdot j} - \bar X_{\cdot\cdot}$ are uncorrelated, under the assumptions of this section.

5.5. Given the following observations associated with a two-way classification with $a = 3$ and $b = 4$, compute the $F$-statistic used to test the equality of the column means ($\beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$) and the equality of the row means ($\alpha_1 = \alpha_2 = \alpha_3 = 0$), respectively.

Row/Column     1      2      3      4
    1         3.1    4.2    2.7    4.9
    2         2.7    2.9    1.8    3.0
    3         4.0    4.6    3.0    3.9


5.6. With the background of the two-way classification with $c > 1$ observations per cell, show that the maximum likelihood estimators of the parameters are

$$\hat\alpha_i = \bar X_{i\cdot\cdot} - \bar X_{\cdot\cdot\cdot}, \qquad \hat\beta_j = \bar X_{\cdot j\cdot} - \bar X_{\cdot\cdot\cdot}, \qquad \hat\gamma_{ij} = \bar X_{ij\cdot} - \bar X_{i\cdot\cdot} - \bar X_{\cdot j\cdot} + \bar X_{\cdot\cdot\cdot}, \qquad \hat\mu = \bar X_{\cdot\cdot\cdot}.$$

Show that these are unbiased estimators of the respective parameters. Compute the variance of each estimator.

5.7. Given the following observations in a two-way classification with $a = 3$, $b = 4$, and $c = 2$, compute the $F$-statistics used to test that all interactions are equal to zero ($\gamma_{ij} = 0$), all column means are equal ($\beta_j = 0$), and all row means are equal ($\alpha_i = 0$), respectively.

Row/Column       1           2           3           4
    1         3.1  2.9    4.2  4.9    2.7  3.2    4.9  4.5
    2         2.7  2.9    2.9  2.3    1.8  2.4    3.0  3.7
    3         4.0  4.4    4.6  5.0    3.0  2.5    3.9  4.2

5.8. For the additive model (5.1), show that the mean profile plots are parallel. The sample mean profile plots are given by plotting $\bar X_{ij\cdot}$ versus $j$, for each $i$. These offer a graphical diagnostic for interaction detection. Obtain these plots for the last exercise.

5.9. We wish to compare compressive strengths of concrete corresponding to $a = 3$ different drying methods (treatments). Concrete is mixed in batches that are just large enough to produce three cylinders. Although care is taken to achieve uniformity, we expect some variability among the $b = 5$ batches used to obtain the following compressive strengths. (There is little reason to suspect interaction, and hence only one observation is taken in each cell.)

                       Batch
Treatment    B1    B2    B3    B4    B5
   A1        52    47    44    51    42
   A2        60    55    49    52    43
   A3        56    48    45    44    38

(a) Use the 5% significance level and test $H_A: \alpha_1 = \alpha_2 = \alpha_3 = 0$ against all alternatives.

(b) Use the 5% significance level and test $H_B: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$ against all alternatives.


5.10. With $a = 3$ and $b = 4$, find $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$ if $\mu_{ij}$, for $i = 1, 2, 3$ and $j = 1, 2, 3, 4$, are given by

 6    7    7   12
10    3   11    8
 8    5    9   10

6

A Regression Problem

There is often interest in the relationship between two variables, for example, a student’s scholastic aptitude test score in mathematics and this same student’s grade in calculus. Frequently, one of these variables, say $x$, is known in advance of the other, and hence there is interest in predicting a future random variable $Y$. Since $Y$ is a random variable, we cannot predict its future observed value $Y = y$ with certainty. Thus let us first concentrate on the problem of estimating the mean of $Y$, that is, $E(Y)$. Now $E(Y)$ is usually a function of $x$; for example, in our illustration with the calculus grade, say $Y$, we would expect $E(Y)$ to increase with increasing mathematics aptitude score $x$. Sometimes $E(Y) = \mu(x)$ is assumed to be of a given form, such as a linear or quadratic or exponential function; that is, $\mu(x)$ could be assumed to be equal to $\alpha + \beta x$ or $\alpha + \beta x + \gamma x^2$ or $\alpha e^{\beta x}$. To estimate $E(Y) = \mu(x)$, or equivalently the parameters $\alpha$, $\beta$, and $\gamma$, we observe the random variable $Y$ for each of $n$ possible different values of $x$, say $x_1, x_2, \ldots, x_n$, which are not all equal. Once the $n$ independent experiments have been performed, we have $n$ pairs of known numbers $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. These pairs are then used to estimate the mean $E(Y)$. Problems like this are often classified under regression because $E(Y) = \mu(x)$ is frequently called a regression curve.

Remark 6.1. A model for the mean such as $\alpha + \beta x + \gamma x^2$ is called a linear model because it is linear in the parameters $\alpha$, $\beta$, and $\gamma$. Thus $\alpha e^{\beta x}$ is not a linear model because it is not linear in $\alpha$ and $\beta$. Note that, in Sections 1 to 4, all the means were linear in the parameters and hence are linear models.

Let us begin with the case in which $E(Y) = \mu(x)$ is a linear function. Denote by $Y_i$ the response at $x_i$ and consider the model

$$Y_i = \alpha + \beta(x_i - \bar x) + e_i, \quad i = 1, \ldots, n, \tag{6.1}$$

where $\bar x = n^{-1}\sum_{i=1}^{n}x_i$ and $e_1, \ldots, e_n$ are iid random variables with a common $N(0, \sigma^2)$ distribution. Hence $E(Y_i) = \alpha + \beta(x_i - \bar x)$, $\mathrm{Var}(Y_i) = \sigma^2$, and $Y_i$ has a $N(\alpha + \beta(x_i - \bar x), \sigma^2)$ distribution. The $n$ points are $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$; so the first problem is that of fitting a straight line to the set of points. Figure 6.1 shows a scatterplot of 60 observations $(x_1, y_1), \ldots, (x_{60}, y_{60})$ drawn from a linear model of the form (6.1). The joint pdf of $Y_1, \ldots, Y_n$ is the product of the individual probability density


functions; that is, the likelihood function equals

$$L(\alpha, \beta, \sigma^2) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{[y_i - \alpha - \beta(x_i - \bar x)]^2}{2\sigma^2}\right\} = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}[y_i - \alpha - \beta(x_i - \bar x)]^2\right\}.$$

Figure 6.1: The plot shows the least squares fitted line (solid line) to a set of data. The dashed-line segment from $(x_i, \hat y_i)$ to $(x_i, y_i)$ shows the deviation of $(x_i, y_i)$ from its fit.

To maximize $L(\alpha, \beta, \sigma^2)$, or, equivalently, to minimize

$$-\log L(\alpha, \beta, \sigma^2) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{\sum_{i=1}^{n}[y_i - \alpha - \beta(x_i - \bar x)]^2}{2\sigma^2},$$

we must select $\alpha$ and $\beta$ to minimize

$$H(\alpha, \beta) = \sum_{i=1}^{n}[y_i - \alpha - \beta(x_i - \bar x)]^2.$$

Since $|y_i - \alpha - \beta(x_i - \bar x)| = |y_i - \mu(x_i)|$ is the vertical distance from the point $(x_i, y_i)$ to the line $y = \mu(x)$ (see the dashed-line segment in Figure 6.1), we note that $H(\alpha, \beta)$ represents the sum of the squares of those distances. Thus, selecting $\alpha$ and $\beta$ so that the sum of the squares is minimized means that we are fitting the straight line to the data by the method of least squares (LS).
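To make the criterion concrete, the sketch below evaluates $H(\alpha, \beta)$ on a small hypothetical data set and locates its minimum by a crude grid search. Both the data and the grid search are assumptions chosen purely for illustration; the text itself minimizes $H$ analytically.

```python
# Hypothetical data, chosen only to illustrate the criterion.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
xbar = sum(x) / len(x)


def H(alpha, beta):
    """Sum of squared vertical distances from the points to y = alpha + beta*(x - xbar)."""
    return sum((yi - alpha - beta * (xi - xbar)) ** 2 for xi, yi in zip(x, y))


# Crude grid search over alpha in [5, 7] and beta in [1, 3] with step 0.01.
grid = [(5 + 0.01 * i, 1 + 0.01 * j) for i in range(201) for j in range(201)]
best_alpha, best_beta = min(grid, key=lambda point: H(*point))
```

For these data the grid minimum falls at approximately $\alpha = 6.00$ and $\beta = 1.97$; any other grid point gives a larger sum of squared distances.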


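To minimize $H(\alpha, \beta)$, we find the two first partial derivatives,

$$\frac{\partial H(\alpha, \beta)}{\partial\alpha} = 2\sum_{i=1}^{n}[y_i - \alpha - \beta(x_i - \bar x)](-1)$$

and

$$\frac{\partial H(\alpha, \beta)}{\partial\beta} = 2\sum_{i=1}^{n}[y_i - \alpha - \beta(x_i - \bar x)][-(x_i - \bar x)].$$

Setting $\partial H(\alpha, \beta)/\partial\alpha = 0$, we obtain

$$\sum_{i=1}^{n}y_i - n\alpha - \beta\sum_{i=1}^{n}(x_i - \bar x) = 0. \tag{6.2}$$

Since

$$\sum_{i=1}^{n}(x_i - \bar x) = 0,$$

we have that

$$\sum_{i=1}^{n}y_i - n\alpha = 0$$

and, thus, the mle of $\alpha$ is $\hat\alpha = \bar Y$. The equation $\partial H(\alpha, \beta)/\partial\beta = 0$ yields, with $\alpha$ replaced by $\bar y$,

$$\sum_{i=1}^{n}(y_i - \bar y)(x_i - \bar x) - \beta\sum_{i=1}^{n}(x_i - \bar x)^2 = 0 \tag{6.3}$$

and, hence, the mle of $\beta$ is

$$\hat\beta = \frac{\sum_{i=1}^{n}(Y_i - \bar Y)(x_i - \bar x)}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \frac{\sum_{i=1}^{n}Y_i(x_i - \bar x)}{\sum_{i=1}^{n}(x_i - \bar x)^2}.$$

Equations (6.2) and (6.3) are the estimating equations for the LS solutions for this simple linear model. The fitted value at the point $(x_i, y_i)$ is given by

$$\hat y_i = \hat\alpha + \hat\beta(x_i - \bar x), \tag{6.4}$$

which is shown on Figure 6.1. The fitted value $\hat y_i$ is also called the predicted value of $y_i$ at $x_i$. The residual at the point $(x_i, y_i)$ is given by

$$\hat e_i = y_i - \hat y_i, \tag{6.5}$$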


which is also shown on Figure 6.1. Residual means “what is left” and the residual in regression is exactly that, i.e., what is left over after the fit. The relationship between the fitted values and the residuals is explored in Exercise 6.11.

To find the maximum likelihood estimator of $\sigma^2$, consider the partial derivative

$$\frac{\partial[-\log L(\alpha, \beta, \sigma^2)]}{\partial(\sigma^2)} = \frac{n}{2\sigma^2} - \frac{\sum_{i=1}^{n}[y_i - \alpha - \beta(x_i - \bar x)]^2}{2(\sigma^2)^2}.$$

Setting this equal to zero and replacing $\alpha$ and $\beta$ by their solutions $\hat\alpha$ and $\hat\beta$, we obtain

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}[Y_i - \hat\alpha - \hat\beta(x_i - \bar x)]^2.$$

Of course, due to the invariance of mles, $\hat\sigma = \sqrt{\hat\sigma^2}$. Note that in terms of the residuals, $\hat\sigma^2 = n^{-1}\sum_{i=1}^{n}\hat e_i^2$. As shown in Exercise 6.11, the average of the residuals is 0.

Since $\hat\alpha$ is a linear function of independent and normally distributed random variables, $\hat\alpha$ has a normal distribution with mean

$$E(\hat\alpha) = E\left(\frac{1}{n}\sum_{i=1}^{n}Y_i\right) = \frac{1}{n}\sum_{i=1}^{n}E(Y_i) = \frac{1}{n}\sum_{i=1}^{n}[\alpha + \beta(x_i - \bar x)] = \alpha$$

and variance

$$\mathrm{var}(\hat\alpha) = \sum_{i=1}^{n}\left(\frac{1}{n}\right)^2\mathrm{var}(Y_i) = \frac{\sigma^2}{n}.$$

The estimator $\hat\beta$ is also a linear function of $Y_1, Y_2, \ldots, Y_n$ and hence has a normal distribution with mean

$$E(\hat\beta) = \frac{\sum_{i=1}^{n}(x_i - \bar x)[\alpha + \beta(x_i - \bar x)]}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \frac{\alpha\sum_{i=1}^{n}(x_i - \bar x) + \beta\sum_{i=1}^{n}(x_i - \bar x)^2}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \beta$$

and variance

$$\mathrm{var}(\hat\beta) = \sum_{i=1}^{n}\left[\frac{x_i - \bar x}{\sum_{j=1}^{n}(x_j - \bar x)^2}\right]^2\mathrm{var}(Y_i) = \frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{\left[\sum_{i=1}^{n}(x_i - \bar x)^2\right]^2}\,\sigma^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar x)^2}.$$

In summary, the estimators $\hat\alpha$ and $\hat\beta$ are linear functions of the independent normal random variables $Y_1, \ldots, Y_n$. In Exercise 6.10 it is further shown that the


covariance between $\hat\alpha$ and $\hat\beta$ is zero. It follows that $\hat\alpha$ and $\hat\beta$ are independent random variables with a bivariate normal distribution; that is,

$$\begin{pmatrix}\hat\alpha\\ \hat\beta\end{pmatrix} \text{ has a } N_2\left(\begin{pmatrix}\alpha\\ \beta\end{pmatrix},\ \sigma^2\begin{bmatrix}\frac{1}{n} & 0\\ 0 & \frac{1}{\sum_{i=1}^{n}(x_i - \bar x)^2}\end{bmatrix}\right) \text{ distribution.} \tag{6.6}$$

Next, we consider the estimator of $\sigma^2$. It can be shown (Exercise 6.6) that

$$\sum_{i=1}^{n}[Y_i - \alpha - \beta(x_i - \bar x)]^2 = \sum_{i=1}^{n}\{(\hat\alpha - \alpha) + (\hat\beta - \beta)(x_i - \bar x) + [Y_i - \hat\alpha - \hat\beta(x_i - \bar x)]\}^2 = n(\hat\alpha - \alpha)^2 + (\hat\beta - \beta)^2\sum_{i=1}^{n}(x_i - \bar x)^2 + n\hat\sigma^2,$$

or for brevity,

$$Q = Q_1 + Q_2 + Q_3.$$

Here $Q$, $Q_1$, $Q_2$, and $Q_3$ are real quadratic forms in the variables

$$Y_i - \alpha - \beta(x_i - \bar x), \quad i = 1, 2, \ldots, n.$$

In this equation, $Q$ represents the sum of the squares of $n$ independent random variables that have normal distributions with means zero and variances $\sigma^2$. Thus $Q/\sigma^2$ has a $\chi^2$ distribution with $n$ degrees of freedom. Each of the random variables $\sqrt{n}(\hat\alpha - \alpha)/\sigma$ and $\sqrt{\sum_{i=1}^{n}(x_i - \bar x)^2}\,(\hat\beta - \beta)/\sigma$ has a normal distribution with zero mean and unit variance; thus, each of $Q_1/\sigma^2$ and $Q_2/\sigma^2$ has a $\chi^2$ distribution with 1 degree of freedom. Since $Q_3$ is nonnegative, we have, in accordance with Theorem 1.1, that $Q_1$, $Q_2$, and $Q_3$ are independent, so that $Q_3/\sigma^2$ has a $\chi^2$ distribution with $n - 1 - 1 = n - 2$ degrees of freedom. That is, $n\hat\sigma^2/\sigma^2$ has a $\chi^2$ distribution with $n - 2$ degrees of freedom.

We now extend this discussion to obtain inference for the parameters $\alpha$ and $\beta$. It follows from the above derivations that both the random variable $T_1$

$$T_1 = \frac{[\sqrt{n}(\hat\alpha - \alpha)]/\sigma}{\sqrt{Q_3/[\sigma^2(n-2)]}} = \frac{\hat\alpha - \alpha}{\sqrt{\hat\sigma^2/(n-2)}}$$

and the random variable $T_2$

$$T_2 = \frac{\left[\sqrt{\sum_{i=1}^{n}(x_i - \bar x)^2}\,(\hat\beta - \beta)\right]/\sigma}{\sqrt{Q_3/[\sigma^2(n-2)]}} = \frac{\hat\beta - \beta}{\sqrt{n\hat\sigma^2/\left[(n-2)\sum_{i=1}^{n}(x_i - \bar x)^2\right]}}$$

have a $t$-distribution with $n - 2$ degrees of freedom. These facts enable us to obtain confidence intervals for $\alpha$ and $\beta$; see Exercise 6.3. The fact that $n\hat\sigma^2/\sigma^2$ has a $\chi^2$ distribution with $n - 2$ degrees of freedom provides a means of determining a confidence interval for $\sigma^2$. These are some of the statistical inferences about the parameters to which reference was made in the introductory remarks of this section.
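The estimators and the pivots $T_1$ and $T_2$ translate directly into a few lines of code. The following plain-Python sketch is illustrative only: the function names and the data are hypothetical, and the $t$ quantile is passed in by the caller (the value 3.182 used below is the usual table value of $t_{0.025,3}$) rather than computed.

```python
import math


def fit_centered_line(x, y):
    """MLEs for the model Y_i = alpha + beta*(x_i - xbar) + e_i."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    alpha_hat = sum(y) / n
    beta_hat = sum(yi * (xi - xbar) for xi, yi in zip(x, y)) / sxx
    residuals = [yi - alpha_hat - beta_hat * (xi - xbar) for xi, yi in zip(x, y)]
    sigma2_hat = sum(e * e for e in residuals) / n   # the mle divides by n
    return alpha_hat, beta_hat, sigma2_hat, sxx


def t_intervals(x, y, t_quantile):
    """Intervals obtained by inverting T1 and T2; t_quantile is t_{alpha/2, n-2}."""
    n = len(x)
    alpha_hat, beta_hat, sigma2_hat, sxx = fit_centered_line(x, y)
    se_alpha = math.sqrt(sigma2_hat / (n - 2))
    se_beta = math.sqrt(n * sigma2_hat / ((n - 2) * sxx))
    return ((alpha_hat - t_quantile * se_alpha, alpha_hat + t_quantile * se_alpha),
            (beta_hat - t_quantile * se_beta, beta_hat + t_quantile * se_beta))


# Hypothetical data with n = 5, so n - 2 = 3 degrees of freedom.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
alpha_hat, beta_hat, sigma2_hat, sxx = fit_centered_line(x, y)
ci_alpha, ci_beta = t_intervals(x, y, t_quantile=3.182)
```

Note that the standard errors use $n\hat\sigma^2/(n-2)$ rather than $\hat\sigma^2$ itself, which is exactly the adjustment built into $T_1$ and $T_2$.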


Remark 6.2. The more discerning reader should quite properly question our construction of $T_1$ and $T_2$ immediately above. We know that the squares of the linear forms are independent of $Q_3 = n\hat\sigma^2$, but we do not know, at this time, that the linear forms themselves enjoy this independence. A more general result is obtained in Theorem 9.1 of Section 9 and the present case is a special instance.

Example 6.1 (Geometry of the Least Squares Fit). In the modern literature, linear models are usually expressed in terms of matrices and vectors, which we briefly introduce in this example. Furthermore, this allows us to discuss the simple geometry behind the least squares fit. Consider then Model (6.1). Write the vectors $\mathbf{Y} = (Y_1, \ldots, Y_n)'$, $\mathbf{e} = (e_1, \ldots, e_n)'$, and $\mathbf{x}_c = (x_1 - \bar x, \ldots, x_n - \bar x)'$. Let $\mathbf{1}$ denote the $n \times 1$ vector whose components are all 1. Then Model (6.1) can be expressed equivalently as

$$\mathbf{Y} = \alpha\mathbf{1} + \beta\mathbf{x}_c + \mathbf{e} = [\mathbf{1}\ \ \mathbf{x}_c]\begin{pmatrix}\alpha\\ \beta\end{pmatrix} + \mathbf{e} = \mathbf{X}\boldsymbol\beta + \mathbf{e}, \tag{6.7}$$

where $\mathbf{X}$ is the $n \times 2$ matrix with columns $\mathbf{1}$ and $\mathbf{x}_c$ and $\boldsymbol\beta = (\alpha, \beta)'$. Next, let $\boldsymbol\theta = E(\mathbf{Y}) = \mathbf{X}\boldsymbol\beta$. Finally, let $V$ be the two-dimensional subspace of $R^n$ spanned by the columns of $\mathbf{X}$; i.e., $V$ is the range of the matrix $\mathbf{X}$. Hence we can also express the model succinctly as

$$\mathbf{Y} = \boldsymbol\theta + \mathbf{e}, \quad \boldsymbol\theta \in V. \tag{6.8}$$

Hence, except for the random error vector $\mathbf{e}$, $\mathbf{Y}$ would lie in $V$. It makes sense intuitively then, as suggested by Figure 6.2, to estimate $\boldsymbol\theta$ by the vector in $V$ which is “closest” (in Euclidean distance) to $\mathbf{Y}$, that is, by $\hat{\boldsymbol\theta}$, where

$$\hat{\boldsymbol\theta} = \operatorname*{Argmin}_{\boldsymbol\theta \in V}\|\mathbf{Y} - \boldsymbol\theta\|^2, \tag{6.9}$$

where the square of the Euclidean norm is given by $\|\mathbf{u}\|^2 = \sum_{i=1}^{n}u_i^2$, for $\mathbf{u} \in R^n$. As shown in Exercise 6.11 and depicted on the plot in Figure 6.2, $\hat{\boldsymbol\theta} = \hat\alpha\mathbf{1} + \hat\beta\mathbf{x}_c$, where $\hat\alpha$ and $\hat\beta$ are the least squares estimates given above. Also, the vector $\hat{\mathbf{e}} = \mathbf{Y} - \hat{\boldsymbol\theta}$ is the vector of residuals and $n\hat\sigma^2 = \|\hat{\mathbf{e}}\|^2$. Also, just as depicted in Figure 6.2, the angle between the vectors $\hat{\boldsymbol\theta}$ and $\hat{\mathbf{e}}$ is a right angle. In linear models, we say that $\hat{\boldsymbol\theta}$ is the projection of $\mathbf{Y}$ onto the subspace $V$.
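The geometry can be checked numerically on a small example. The sketch below uses hypothetical data and plain Python lists as vectors. It relies on the fact that $\mathbf{1}$ and $\mathbf{x}_c$ are orthogonal, so the projection onto $V$ is the sum of the projections onto each of the two columns.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Hypothetical responses and predictor values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.0]
n = len(x)
xbar = sum(x) / n

ones = [1.0] * n                    # the vector 1
xc = [xi - xbar for xi in x]        # the centered vector x_c, orthogonal to 1

alpha_hat = dot(ones, Y) / dot(ones, ones)   # coefficient of the projection onto 1
beta_hat = dot(xc, Y) / dot(xc, xc)          # coefficient of the projection onto x_c

theta_hat = [alpha_hat + beta_hat * c for c in xc]   # fitted vector, lies in V
e_hat = [yi - ti for yi, ti in zip(Y, theta_hat)]    # residual vector

right_angle = dot(theta_hat, e_hat)          # zero up to rounding error
n_sigma2_hat = dot(e_hat, e_hat)             # squared norm of the residual vector
```

Because the fit and the residual vector are orthogonal, the squared norm of $\mathbf{Y}$ splits as $\|\hat{\boldsymbol\theta}\|^2 + \|\hat{\mathbf{e}}\|^2$, which the same code can confirm.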

Figure 6.2: The sketch shows the geometry of least squares. The vector of responses is $\mathbf{Y}$, the fit is $\hat{\boldsymbol\theta}$, and the vector of residuals is $\hat{\mathbf{e}}$.

EXERCISES

6.1. Students’ scores on the mathematics portion of the ACT examination, $x$, and on the final examination in the first-semester calculus (200 points possible), $y$, are given.

(a) Calculate the least squares regression line for these data.

(b) Plot the points and the least squares regression line on the same graph.

(c) Find point estimates for $\alpha$, $\beta$, and $\sigma^2$.

(d) Find 95% confidence intervals for $\alpha$ and $\beta$ under the usual assumptions.

 x     y        x     y
25   138       20   100
20    84       25   143
26   104       26   141
26   112       28   161
28    88       25   124
28   132       31   118
29    90       30   168
32   183

6.2 (Telephone Data). Consider the data presented below. The responses ($y$) for this data set are the numbers of telephone calls (tens of millions) made in Belgium for the years 1950 through 1973. Time, the years, serves as the predictor variable ($x$). The data are discussed on page 172 of Hettmansperger and McKean (2011).

Year          50      51      52      53      54      55
No. Calls    0.44    0.47    0.47    0.59    0.66    0.73
Year          56      57      58      59      60      61
No. Calls    0.81    0.88    1.06    1.20    1.35    1.49
Year          62      63      64      65      66      67
No. Calls    1.61    2.12   11.90   12.40   14.20   15.90
Year          68      69      70      71      72      73
No. Calls   18.20   21.20    4.30    2.40    2.70    2.90

(a) Calculate the least squares regression line for these data.

(b) Plot the points and the least squares regression line on the same graph.

(c) What is the reason for the poor least squares fit?

6.3. Find $(1 - \alpha)100\%$ confidence intervals for the parameters $\alpha$ and $\beta$ in Model (6.1).

6.4. Consider Model (6.1). Let $\eta_0 = E(Y \mid x = x_0 - \bar x)$. The least squares estimator of $\eta_0$ is $\hat\eta_0 = \hat\alpha + \hat\beta(x_0 - \bar x)$.

(a) Using (6.6), determine the distribution of $\hat\eta_0$.

(b) Obtain a $(1 - \alpha)100\%$ confidence interval for $\eta_0$.

6.5. Assume that the sample $(x_1, Y_1), \ldots, (x_n, Y_n)$ follows the linear model (6.1). Suppose $Y_0$ is a future observation at $x = x_0 - \bar x$ and we want to determine a predictive interval for it. Assume that the model (6.1) holds for $Y_0$; i.e., $Y_0$ has a $N(\alpha + \beta(x_0 - \bar x), \sigma^2)$ distribution. We use $\hat\eta_0$ of Exercise 6.4 as our prediction of $Y_0$.

(a) Obtain the distribution of $Y_0 - \hat\eta_0$. Use the fact that the future observation $Y_0$ is independent of the sample $(x_1, Y_1), \ldots, (x_n, Y_n)$.

(b) Determine a $t$-statistic with numerator $Y_0 - \hat\eta_0$.

(c) Now beginning with $1 - \alpha = P[-t_{\alpha/2,n-2} < T < t_{\alpha/2,n-2}]$, where $0 < \alpha < 1$, determine a $(1 - \alpha)100\%$ predictive interval for $Y_0$.

(d) Compare this predictive interval with the confidence interval obtained in Exercise 6.4. Intuitively, why is the predictive interval larger?

6.6. Show that

$$\sum_{i=1}^{n}[Y_i - \alpha - \beta(x_i - \bar x)]^2 = n(\hat\alpha - \alpha)^2 + (\hat\beta - \beta)^2\sum_{i=1}^{n}(x_i - \bar x)^2 + \sum_{i=1}^{n}[Y_i - \hat\alpha - \hat\beta(x_i - \bar x)]^2.$$

6.7. Let the independent random variables $Y_1, Y_2, \ldots, Y_n$ have, respectively, the probability density functions $N(\beta x_i, \gamma^2 x_i^2)$, $i = 1, 2, \ldots, n$, where the given numbers $x_1, x_2, \ldots, x_n$ are not all equal and no one is zero. Find the maximum likelihood estimators of $\beta$ and $\gamma^2$.


6.8. Let the independent random variables $Y_1, \ldots, Y_n$ have the joint pdf

$$L(\alpha, \beta, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{1}^{n}[y_i - \alpha - \beta(x_i - \bar x)]^2\right\},$$

where the given numbers $x_1, x_2, \ldots, x_n$ are not all equal. Let $H_0: \beta = 0$ ($\alpha$ and $\sigma^2$ unspecified). It is desired to use a likelihood ratio test to test $H_0$ against all possible alternatives. Find $\Lambda$ and see whether the test can be based on a familiar statistic. Hint: In the notation of this section, show that

$$\sum_{1}^{n}(Y_i - \hat\alpha)^2 = Q_3 + \hat\beta^2\sum_{1}^{n}(x_i - \bar x)^2.$$

6.9. Using the notation of Section 2, assume that the means $\mu_j$ satisfy a linear function of $j$, namely, $\mu_j = c + d[j - (b+1)/2]$. Let independent random samples of size $a$ be taken from the $b$ normal distributions having means $\mu_1, \mu_2, \ldots, \mu_b$, respectively, and common unknown variance $\sigma^2$.

(a) Show that the maximum likelihood estimators of $c$ and $d$ are, respectively, $\hat c = \bar X_{\cdot\cdot}$ and

$$\hat d = \frac{\sum_{j=1}^{b}[j - (b-1)/2](\bar X_{\cdot j} - \bar X_{\cdot\cdot})}{\sum_{j=1}^{b}[j - (b+1)/2]^2}.$$

(b) Show that

$$\sum_{i=1}^{a}\sum_{j=1}^{b}(X_{ij} - \bar X_{\cdot\cdot})^2 = \sum_{i=1}^{a}\sum_{j=1}^{b}\left[X_{ij} - \bar X_{\cdot\cdot} - \hat d\left(j - \frac{b+1}{2}\right)\right]^2 + \hat d^2\sum_{j=1}^{b}a\left(j - \frac{b+1}{2}\right)^2.$$

(c) Argue that the two terms in the right-hand member of part (b), once divided by $\sigma^2$, are independent random variables with $\chi^2$ distributions provided that $d = 0$.

(d) What $F$-statistic would be used to test the equality of the means, that is, $H_0: d = 0$?

6.10. Show that the covariance between $\hat\alpha$ and $\hat\beta$ is zero.

6.11. Reconsider Example 6.1.

(a) Show that $\hat{\boldsymbol\theta} = \hat\alpha\mathbf{1} + \hat\beta\mathbf{x}_c$, where $\hat\alpha$ and $\hat\beta$ are the least squares estimators derived in this section.

(b) Show that the vector $\hat{\mathbf{e}} = \mathbf{Y} - \hat{\boldsymbol\theta}$ is the vector of residuals; i.e., its $i$th entry is $\hat e_i$, (6.5).


(c) As depicted in Figure 6.2, show that the angle between the vectors $\hat{\boldsymbol\theta}$ and $\hat{\mathbf{e}}$ is a right angle.

(d) Show that the residuals sum to zero; i.e., $\mathbf{1}'\hat{\mathbf{e}} = 0$.

6.12. Fit $y = a + x$ to the data

x    0    1    2
y    1    3    4

by the method of least squares.

6.13. Fit by the method of least squares the plane $z = a + bx + cy$ to the five points $(x, y, z)$: $(-1, -2, 5)$, $(0, -2, 4)$, $(0, 0, 4)$, $(1, 0, 2)$, $(2, 1, 0)$.

6.14. Let the $4 \times 1$ matrix $\mathbf{Y}$ be multivariate normal $N(\mathbf{X}\boldsymbol\beta, \sigma^2\mathbf{I})$, where the $4 \times 3$ matrix $\mathbf{X}$ equals

$$\mathbf{X} = \begin{bmatrix}1 & 1 & 2\\ 1 & -1 & 2\\ 1 & 0 & -3\\ 1 & 0 & -1\end{bmatrix}$$

and $\boldsymbol\beta$ is the $3 \times 1$ regression coefficient matrix.

(a) Find the mean matrix and the covariance matrix of $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$.

(b) If we observe $\mathbf{Y}'$ to be equal to $(6, 1, 11, 3)$, compute $\hat{\boldsymbol\beta}$.

6.15. Suppose $\mathbf{Y}$ is an $n \times 1$ random vector, $\mathbf{X}$ is an $n \times p$ matrix of known constants of rank $p$, and $\boldsymbol\beta$ is a $p \times 1$ vector of regression coefficients. Let $\mathbf{Y}$ have a $N(\mathbf{X}\boldsymbol\beta, \sigma^2\mathbf{I})$ distribution. Discuss the joint pdf of $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$ and $\mathbf{Y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{Y}/\sigma^2$.

6.16. Let the independent normal random variables $Y_1, Y_2, \ldots, Y_n$ have, respectively, the probability density functions $N(\mu, \gamma^2 x_i^2)$, $i = 1, 2, \ldots, n$, where the given $x_1, x_2, \ldots, x_n$ are not all equal and no one of which is zero. Discuss the test of the hypothesis $H_0: \gamma = 1$, $\mu$ unspecified, against all alternatives $H_1: \gamma \neq 1$, $\mu$ unspecified.

6.17. Let $Y_1, Y_2, \ldots, Y_n$ be $n$ independent normal variables with common unknown variance $\sigma^2$. Let $Y_i$ have mean $\beta x_i$, $i = 1, 2, \ldots, n$, where $x_1, x_2, \ldots, x_n$ are known but not all the same and $\beta$ is an unknown constant. Find the likelihood ratio test for $H_0: \beta = 0$ against all alternatives. Show that this likelihood ratio test can be based on a statistic that has a well-known distribution.

7

A Test of Independence

Let $X$ and $Y$ have a bivariate normal distribution with means $\mu_1$ and $\mu_2$, positive variances $\sigma_1^2$ and $\sigma_2^2$, and correlation coefficient $\rho$. We wish to test the hypothesis that $X$ and $Y$ are independent. Because two jointly normally distributed random variables are independent if and only if $\rho = 0$, we test the hypothesis $H_0: \rho = 0$ against the hypothesis $H_1: \rho \neq 0$. A likelihood ratio test is used. Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ denote a random sample of size $n > 2$ from the bivariate normal distribution; that is, the joint pdf of these $2n$ random variables is given by

$$f(x_1, y_1)f(x_2, y_2)\cdots f(x_n, y_n).$$

Although it is fairly difficult to show, the statistic that is defined by the likelihood ratio $\Lambda$ is a function of the statistic, which is the mle of $\rho$, namely,

$$R = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^{n}(X_i - \bar X)^2\sum_{i=1}^{n}(Y_i - \bar Y)^2}}. \tag{7.1}$$

This statistic $R$ is called the sample correlation coefficient of the random sample. The statistic $R$ is a consistent estimate of $\rho$; see Exercise 7.5. The likelihood ratio principle, which calls for the rejection of $H_0$ if $\Lambda \le \lambda_0$, is equivalent to the computed value of $|R| \ge c$. That is, if the absolute value of the correlation coefficient of the sample is too large, we reject the hypothesis that the correlation coefficient of the distribution is equal to zero. To determine a value of $c$ for a satisfactory significance level, it is necessary to obtain the distribution of $R$, or a function of $R$, when $H_0$ is true, as we outline next.

Let $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, $n > 2$, where $x_1, x_2, \ldots, x_n$ and $\bar x = \sum_{1}^{n}x_i/n$ are fixed numbers such that $\sum_{1}^{n}(x_i - \bar x)^2 > 0$. Consider the conditional pdf of $Y_1, Y_2, \ldots, Y_n$ given that $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$. Because $Y_1, Y_2, \ldots, Y_n$ are independent and, with $\rho = 0$, are also independent of $X_1, X_2, \ldots, X_n$, this conditional pdf is given by

$$\left(\frac{1}{\sqrt{2\pi}\,\sigma_2}\right)^{n}\exp\left\{-\frac{1}{2\sigma_2^2}\sum_{1}^{n}(y_i - \mu_2)^2\right\}.$$

Let $R_c$ be the correlation coefficient, given $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, so that

$$\frac{\sqrt{\sum_{i=1}^{n}(Y_i - \bar Y)^2}}{\sqrt{\sum_{i=1}^{n}(x_i - \bar x)^2}}\,R_c = \frac{\sum_{i=1}^{n}(x_i - \bar x)(Y_i - \bar Y)}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \frac{\sum_{i=1}^{n}(x_i - \bar x)Y_i}{\sum_{i=1}^{n}(x_i - \bar x)^2}$$

is like $\hat\beta$ of Section 6 and has mean zero when $\rho = 0$. Thus, referring to $T_2$ of Section 6, we see that

$$\frac{R_c\sqrt{\sum_{i}(Y_i - \bar Y)^2}\Big/\sqrt{\sum_{i}(x_i - \bar x)^2}}{\sqrt{\dfrac{\sum_{i=1}^{n}\left\{Y_i - \bar Y - \left[R_c\sqrt{\sum_{j}(Y_j - \bar Y)^2}\Big/\sqrt{\sum_{j}(x_j - \bar x)^2}\right](x_i - \bar x)\right\}^2}{(n-2)\sum_{j=1}^{n}(x_j - \bar x)^2}}} = \frac{R_c\sqrt{n-2}}{\sqrt{1 - R_c^2}} \tag{7.2}$$


has, given $X_1 = x_1, \ldots, X_n = x_n$, a conditional $t$-distribution with $n - 2$ degrees of freedom. Note that the pdf, say $g(t)$, of this $t$-distribution does not depend upon $x_1, x_2, \ldots, x_n$. Now the joint pdf of $X_1, X_2, \ldots, X_n$ and $R\sqrt{n-2}/\sqrt{1 - R^2}$, where

$$R = \frac{\sum_{1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{1}^{n}(X_i - \bar X)^2\sum_{1}^{n}(Y_i - \bar Y)^2}},$$

is the product of $g(t)$ and the joint pdf of $X_1, \ldots, X_n$. Integration on $x_1, x_2, \ldots, x_n$ yields the marginal pdf of $R\sqrt{n-2}/\sqrt{1 - R^2}$; because $g(t)$ does not depend upon $x_1, x_2, \ldots, x_n$, it is obvious that this marginal pdf is $g(t)$, the conditional pdf of $R\sqrt{n-2}/\sqrt{1 - R^2}$. The change-of-variable technique can now be used to find the pdf of $R$.

Remark 7.1. Since $R$ has, when $\rho = 0$, a conditional distribution that does not depend upon $x_1, x_2, \ldots, x_n$ (and hence that conditional distribution is, in fact, the marginal distribution of $R$), we have the remarkable fact that $R$ is independent of $X_1, X_2, \ldots, X_n$. It follows that $R$ is independent of every function of $X_1, X_2, \ldots, X_n$ alone, that is, a function that does not depend upon any $Y_i$. In like manner, $R$ is independent of every function of $Y_1, Y_2, \ldots, Y_n$ alone. Moreover, a careful review of the argument reveals that nowhere did we use the fact that $X$ has a normal marginal distribution. Thus, if $X$ and $Y$ are independent, and if $Y$ has a normal distribution, then $R$ has the same conditional distribution whatever is the distribution of $X$, subject to the condition $\sum_{1}^{n}(x_i - \bar x)^2 > 0$. Moreover, if $P[\sum_{1}^{n}(X_i - \bar X)^2 > 0] = 1$, then $R$ has the same marginal distribution whatever is the distribution of $X$.

If we write $T = R\sqrt{n-2}/\sqrt{1 - R^2}$, where $T$ has a $t$-distribution with $n - 2 > 0$ degrees of freedom, it is easy to show by the change-of-variable technique (Exercise 7.4) that the pdf of $R$ is given by

$$h(r) = \begin{cases}\dfrac{\Gamma[(n-1)/2]}{\Gamma(\frac{1}{2})\Gamma[(n-2)/2]}(1 - r^2)^{(n-4)/2} & -1 < r < 1\\[1ex] 0 & \text{elsewhere.}\end{cases} \tag{7.3}$$

We have now solved the problem of the distribution of $R$, when $\rho = 0$ and $n > 2$, or perhaps more conveniently, that of $R\sqrt{n-2}/\sqrt{1 - R^2}$. The likelihood ratio test of the hypothesis $H_0: \rho = 0$ against all alternatives $H_1: \rho \neq 0$ may be based either on the statistic $R$ or on the statistic $R\sqrt{n-2}/\sqrt{1 - R^2} = T$, although the latter is easier to use. Therefore, a level $\alpha$ test is to reject $H_0: \rho = 0$ if $|T| \ge t_{\alpha/2,n-2}$.

Remark 7.2. It is possible to obtain an approximate test of size $\alpha$ by using the fact that

$$W = \frac{1}{2}\log\left(\frac{1 + R}{1 - R}\right)$$


has an approximate normal distribution with mean $\frac{1}{2}\log[(1 + \rho)/(1 - \rho)]$ and with variance $1/(n - 3)$. We accept this statement without proof. Thus a test of $H_0: \rho = 0$ can be based on the statistic

$$Z = \frac{\frac{1}{2}\log[(1 + R)/(1 - R)] - \frac{1}{2}\log[(1 + \rho)/(1 - \rho)]}{\sqrt{1/(n - 3)}},$$

with $\rho = 0$ so that $\frac{1}{2}\log[(1 + \rho)/(1 - \rho)] = 0$. However, using $W$, we can also test a hypothesis like $H_0: \rho = \rho_0$ against $H_1: \rho \neq \rho_0$, where $\rho_0$ is not necessarily zero. In that case, the hypothesized mean of $W$ is

$$\frac{1}{2}\log\left(\frac{1 + \rho_0}{1 - \rho_0}\right).$$
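The statistics of this section are straightforward to compute. The sketch below, in plain Python, is an illustration under stated assumptions: the function names and the six data pairs are hypothetical, and the critical values ($t_{\alpha/2,n-2}$ for $T$, a standard normal quantile for $Z$) must still be taken from tables.

```python
import math


def sample_correlation(x, y):
    """Sample correlation coefficient R of (7.1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """T = R*sqrt(n - 2)/sqrt(1 - R^2); t with n - 2 degrees of freedom when rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def fisher_z(r, n, rho0=0.0):
    """Approximately N(0, 1) statistic based on W = (1/2) log[(1 + R)/(1 - R)]."""
    w = 0.5 * math.log((1 + r) / (1 - r))
    w0 = 0.5 * math.log((1 + rho0) / (1 - rho0))
    return (w - w0) / math.sqrt(1 / (n - 3))


# Hypothetical sample of size n = 6.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
r = sample_correlation(x, y)
t = t_statistic(r, len(x))
z = fisher_z(r, len(x))
```

Passing a nonzero `rho0` to `fisher_z` gives the statistic for testing $H_0: \rho = \rho_0$, which the exact $t$ result does not cover.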

EXERCISES

7.1. Show that

$$R = \frac{\sum_{1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{1}^{n}(X_i - \bar X)^2\sum_{1}^{n}(Y_i - \bar Y)^2}} = \frac{\sum_{1}^{n}X_iY_i - n\bar X\bar Y}{\sqrt{\left(\sum_{1}^{n}X_i^2 - n\bar X^2\right)\left(\sum_{1}^{n}Y_i^2 - n\bar Y^2\right)}}.$$

7.2. A random sample of size $n = 6$ from a bivariate normal distribution yields a value of the correlation coefficient of 0.89. Would we accept or reject, at the 5% significance level, the hypothesis that $\rho = 0$?

7.3. Verify Equation (7.2) of this section.

7.4. Verify the pdf (7.3) of this section.

7.5. Using the results of Hypothesis Testing, show that $R$, (7.1), is a consistent estimate of $\rho$.

7.6. Two experiments gave the following results:

  n     x̄     ȳ     sx    sy      r
 100    10    20     5     8    0.70
 200    12    22     6    10    0.80

Calculate $r$ for the combined sample.

8

The Distributions of Certain Quadratic Forms

Remark 8.1. It is essential that the reader have the background of the multivariate normal distribution to understand Sections 8 and 9.


Remark 8.2. We make use of the trace of a square matrix. If $\mathbf{A} = [a_{ij}]$ is an $n \times n$ matrix, then we define the trace of $\mathbf{A}$, ($\mathrm{tr}\,\mathbf{A}$), to be the sum of its diagonal entries; i.e.,

$$\mathrm{tr}\,\mathbf{A} = \sum_{i=1}^{n}a_{ii}. \tag{8.1}$$

The trace of a matrix has several interesting properties. One is that it is a linear operator; that is,

$$\mathrm{tr}\,(a\mathbf{A} + b\mathbf{B}) = a\,\mathrm{tr}\,\mathbf{A} + b\,\mathrm{tr}\,\mathbf{B}. \tag{8.2}$$

A second useful property is: If $\mathbf{A}$ is an $n \times m$ matrix, $\mathbf{B}$ is an $m \times k$ matrix, and $\mathbf{C}$ is a $k \times n$ matrix, then

$$\mathrm{tr}\,(\mathbf{ABC}) = \mathrm{tr}\,(\mathbf{BCA}) = \mathrm{tr}\,(\mathbf{CAB}). \tag{8.3}$$
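A quick numerical check of (8.2) and (8.3) may help fix the dimensions involved. The sketch below uses small integer matrices chosen arbitrarily, with hypothetical helper functions `matmul` and `trace` written in plain Python. It is a spot check on one example, not a proof.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def trace(M):
    """Sum of the diagonal entries of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))


A = [[1, 2, 3], [4, 5, 6]]      # n x m with n = 2, m = 3
B = [[1, 0], [2, 1], [0, 3]]    # m x k with k = 2
C = [[2, 1], [1, 3]]            # k x n

tr_abc = trace(matmul(matmul(A, B), C))   # ABC is 2 x 2
tr_bca = trace(matmul(matmul(B, C), A))   # BCA is 3 x 3
tr_cab = trace(matmul(matmul(C, A), B))   # CAB is 2 x 2

# Linearity (8.2) for the square matrices C and D with scalars 3 and -2.
D = [[0, 5], [7, 1]]
combo = [[3 * C[i][j] - 2 * D[i][j] for j in range(2)] for i in range(2)]
linear_ok = trace(combo) == 3 * trace(C) - 2 * trace(D)
```

The three products have different dimensions, yet their traces agree, which is the point of (8.3).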

The reader is asked to prove these facts in Exercise 8.7. Finally, a simple but useful property is that $\mathrm{tr}\,a = a$, for any scalar $a$.

We begin this section with a more formal but equivalent definition of a quadratic form. Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be an $n$-dimensional random vector and let $\mathbf{A}$ be a real $n \times n$ symmetric matrix. Then the random variable $Q = \mathbf{X}'\mathbf{A}\mathbf{X}$ is called a quadratic form in $\mathbf{X}$. Due to the symmetry of $\mathbf{A}$, there are several ways we can write $Q$:

$$Q = \mathbf{X}'\mathbf{A}\mathbf{X} = \sum_{i=1}^{n}\sum_{j=1}^{n}a_{ij}X_iX_j = \sum_{i=1}^{n}a_{ii}X_i^2 + \sum_{i \neq j}a_{ij}X_iX_j \tag{8.4}$$

$$= \sum_{i=1}^{n}a_{ii}X_i^2 + 2\sum_{i<j}a_{ij}X_iX_j.$$

... $a > 0$ and $-\infty < b < \infty$. Because the efficacy varies indirectly with scale, we have $c_{f_Z}^2 = a^{-2}c_{f_X}^2$. Furthermore, as Exercise 5.9 shows, the efficacy is invariant to location and, also, $I(f_Z) = a^{-2}I(f_X)$. Hence the quantity maximized above is invariant to changes in location and scale. In particular, in the derivation of optimal scores, only the form of the density is important.

Example 5.1 (Normal Scores). Suppose the error random variable $\varepsilon_i$ has a normal distribution. Based on the discussion in the last paragraph, we can take the pdf of a $N(0, 1)$ distribution as the form of the density. So consider $f_Z(z) = \phi(z) = (2\pi)^{-1/2}\exp\{-2^{-1}z^2\}$. Then $-\phi'(z) = z\phi(z)$. Let $\Phi(z)$ denote the cdf of $Z$. Hence the optimal score function is

$$\varphi_N(u) = -\kappa\,\frac{\phi'(\Phi^{-1}(u))}{\phi(\Phi^{-1}(u))} = \Phi^{-1}(u); \tag{5.29}$$

see Exercise 5.5, which shows that $\kappa = 1$ as well as that $\int\varphi_N(u)\,du = 0$. The corresponding scores, $a_N(i) = \Phi^{-1}(i/(n+1))$, are often called the normal scores. Denote the process by

$$W_N(\Delta) = \sum_{j=1}^{n_2}\Phi^{-1}[R(Y_j - \Delta)/(n+1)]. \tag{5.30}$$


The associated test statistic for the hypotheses (5.1) is the statistic $W_N = W_N(0)$. The estimator of $\Delta$ solves the estimating equations

$$W_N(\hat\Delta_N) \approx 0. \tag{5.31}$$

Although the estimate cannot be obtained in closed form, this equation is relatively easy to solve numerically. From the above discussion, $\mathrm{ARE}(\hat\Delta_N, \bar Y - \bar X) = 1$ at the normal distribution. Hence normal score procedures are fully efficient at the normal distribution. Actually, a much more powerful result can be obtained for symmetric distributions. It can be shown that $\mathrm{ARE}(\hat\Delta_N, \bar Y - \bar X) \ge 1$ at all symmetric distributions.

Example 5.2 (Wilcoxon Scores). Suppose the random errors, $\varepsilon_i$, $i = 1, 2, \ldots, n$, have a logistic distribution with pdf $f_Z(z) = \exp\{-z\}/(1 + \exp\{-z\})^2$. Then the corresponding cdf is $F_Z(z) = (1 + \exp\{-z\})^{-1}$. As Exercise 5.11 shows,

$$-\frac{f_Z'(z)}{f_Z(z)} = F_Z(z)(1 - \exp\{-z\}) \quad\text{and}\quad F_Z^{-1}(u) = \log\frac{u}{1 - u}. \tag{5.32}$$

Upon standardization, this leads to the optimal score function,

$$\varphi_W(u) = \sqrt{12}\,(u - (1/2)), \tag{5.33}$$

that is, the Wilcoxon scores. The properties of the inference based on Wilcoxon scores are discussed in Section 4. Let $\hat\Delta_W = \mathrm{med}\{Y_j - X_i\}$ denote the corresponding estimate. Recall that $\mathrm{ARE}(\hat\Delta_W, \bar Y - \bar X) = 0.955$ at the normal. Hodges and Lehmann (1956) showed that $\mathrm{ARE}(\hat\Delta_W, \bar Y - \bar X) \ge 0.864$ over all symmetric distributions.

Example 5.3. As a numerical illustration, we consider some generated normal observations. The first sample, labeled $X$, was generated from a $N(48, 10^2)$ distribution, while the second sample, $Y$, was generated from a $N(58, 10^2)$ distribution. There are 15 observations in each sample. The data are displayed in Table 5.1, and along with the data, the ranks and the normal scores are exhibited. We consider tests of the two-sided hypotheses $H_0: \Delta = 0$ versus $H_1: \Delta \neq 0$ for the Wilcoxon, normal scores, and Student $t$ procedures. As the following comparison dotplots show, the second sample observations appear to be larger than those from the first sample.

[Comparison dotplots of Sample 1 and Sample 2 on a common axis with ticks at 32.0, 40.0, 48.0, 56.0, 64.0, and 72.0]

Table 5.1: Data for Example 5.3

          Sample 1 (X)                      Sample 2 (Y)
  Data   Ranks   Normal Scores      Data   Ranks   Normal Scores
  51.9    15       −0.04044         59.2    24        0.75273
  56.9    23        0.64932         49.1    14       −0.12159
  45.2    11       −0.37229         54.4    19        0.28689
  52.3    16        0.04044         47.0    13       −0.20354
  59.5    26        0.98917         55.9    21        0.46049
  41.4     4       −1.13098         34.9     3       −1.30015
  46.4    12       −0.28689         62.2    28        1.30015
  45.1    10       −0.46049         41.6     6       −0.86489
  53.9    17        0.12159         59.3    25        0.86489
  42.9     7       −0.75273         32.7     1       −1.84860
  41.5     5       −0.98917         72.1    29        1.51793
  55.2    20        0.37229         43.8     8       −0.64932
  32.9     2       −1.51793         56.8    22        0.55244
  54.0    18        0.20354         76.7    30        1.84860
  45.0     9       −0.55244         60.3    27        1.13098

The test statistics along with their standardized versions, p-values, and corresponding estimates of the shift parameter Δ are

  Method          Test Statistic     Standardized   p-Value   Estimate of Δ
  Student t       Ȳ − X̄ = 5.46         1.47          0.16        5.46
  Wilcoxon        W = 270              1.53          0.12        5.20
  Normal scores   W_N = 3.73           1.48          0.14        5.15
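As a rough check on the table above, the following Python sketch recomputes the Wilcoxon statistic, its standardized value, the normal scores statistic, and the difference in sample means from the data of Table 5.1. It is only an illustration: the helper names `y_ranks` and `score_statistic` are hypothetical, the use of a continuity correction in the Wilcoxon z-value is an assumption (it happens to agree with the tabled 1.53), and this is not the R code cited in Remark 5.1.

```python
from math import sqrt
from statistics import NormalDist

# Data of Table 5.1 (Example 5.3).
x = [51.9, 56.9, 45.2, 52.3, 59.5, 41.4, 46.4, 45.1,
     53.9, 42.9, 41.5, 55.2, 32.9, 54.0, 45.0]
y = [59.2, 49.1, 54.4, 47.0, 55.9, 34.9, 62.2, 41.6,
     59.3, 32.7, 72.1, 43.8, 56.8, 76.7, 60.3]


def y_ranks(x, y):
    """Ranks of the y values in the combined sample (assumes no ties)."""
    combined = sorted(x + y)
    return [combined.index(v) + 1 for v in y]


def score_statistic(x, y, score):
    """W_phi = sum over j of score(R(Y_j) / (n + 1))."""
    n = len(x) + len(y)
    return sum(score(r / (n + 1)) for r in y_ranks(x, y))


n1, n2 = len(x), len(y)
n = n1 + n2

# Wilcoxon rank-sum statistic: the sum of the Y ranks.
w = sum(y_ranks(x, y))

# Normal approximation under H0, with a continuity correction (assumed).
mean_w = n2 * (n + 1) / 2
sd_w = sqrt(n1 * n2 * (n + 1) / 12)
z_w = (w - mean_w - 0.5) / sd_w

# Normal scores statistic, using phi(u) = Phi^{-1}(u).
w_n = score_statistic(x, y, NormalDist().inv_cdf)

# Difference in sample means, the estimate associated with Student t.
mean_diff = sum(y) / n2 - sum(x) / n1

print(w, round(z_w, 2), round(w_n, 2), round(mean_diff, 2))
```

Up to rounding, the printed values should agree with the entries W = 270, 1.53, W_N = 3.73, and 5.46 in the table.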

Notice that the standardized test statistics and their corresponding p-values are quite similar and all would result in the same decision regarding the hypotheses. As shown in the table, the corresponding point estimates of Δ are also alike. The estimates were obtained using the package cited ahead in Remark 5.1. We changed $x_5$ to be an outlier with value 95.5 and then reran the analyses. The t-analysis was the most affected, for on the changed data, t = 0.63 with a p-value of 0.53. In contrast, the Wilcoxon analysis was the least affected (z = 1.37 and p = 0.17). The normal scores analysis was more affected by the outlier than the Wilcoxon analysis with z = 1.14 and p = 0.25.

Example 5.4 (Sign Scores). For our final example, suppose that the random errors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ have a Laplace distribution. Consider the convenient form $f_Z(z) = 2^{-1}\exp\{-|z|\}$. Then $f_Z'(z) = -2^{-1}\,\mathrm{sgn}(z)\exp\{-|z|\}$ and, hence, $-f_Z'(F_Z^{-1}(u))/f_Z(F_Z^{-1}(u)) = \mathrm{sgn}(z)$, where $z = F_Z^{-1}(u)$. But $F_Z^{-1}(u) > 0$ if and only if $u > 1/2$. The optimal score function is

$$\varphi_S(u) = \mathrm{sgn}\left(u - \frac{1}{2}\right), \tag{5.34}$$

which is easily shown to be standardized. The corresponding process is

$$W_S(\Delta) = \sum_{j=1}^{n_2} \mathrm{sgn}\left[R(Y_j - \Delta) - \frac{n+1}{2}\right]. \tag{5.35}$$

Because of the signs, this test statistic can be written in a simpler form, which is often called Mood's test; see Exercise 5.13. We can also obtain the associated estimator in closed form. The estimator solves the equation

$$\sum_{j=1}^{n_2} \mathrm{sgn}\left[R(Y_j - \Delta) - \frac{n+1}{2}\right] = 0. \tag{5.36}$$

For this equation, we rank the variables $\{X_1, \ldots, X_{n_1}, Y_1 - \Delta, \ldots, Y_{n_2} - \Delta\}$. Because ranks, though, are invariant to a constant shift, we obtain the same ranks if we rank the variables $X_1 - \mathrm{med}\{X_i\}, \ldots, X_{n_1} - \mathrm{med}\{X_i\}, Y_1 - \Delta - \mathrm{med}\{X_i\}, \ldots, Y_{n_2} - \Delta - \mathrm{med}\{X_i\}$. Therefore, the solution to equation (5.36) is easily seen to be

$$\hat{\Delta}_S = \mathrm{med}\{Y_j\} - \mathrm{med}\{X_i\}. \tag{5.37}$$
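The sign-score process (5.35) and the closed-form estimate (5.37) can be sketched in a few lines of Python. The snippet below applies them to the data of Example 5.3 purely as an illustration; the function names `sign` and `sign_score_process` are hypothetical, and ties in the combined sample are assumed not to occur.

```python
from statistics import median

# Data of Table 5.1 (Example 5.3), reused here only for illustration.
x = [51.9, 56.9, 45.2, 52.3, 59.5, 41.4, 46.4, 45.1,
     53.9, 42.9, 41.5, 55.2, 32.9, 54.0, 45.0]
y = [59.2, 49.1, 54.4, 47.0, 55.9, 34.9, 62.2, 41.6,
     59.3, 32.7, 72.1, 43.8, 56.8, 76.7, 60.3]


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_score_process(x, y, delta):
    """W_S(delta) = sum_j sgn(R(Y_j - delta) - (n + 1)/2), as in (5.35).

    Ranks are taken in the combined sample of the x values and the
    shifted y values; the sketch assumes there are no ties.
    """
    shifted = [v - delta for v in y]
    combined = sorted(x + shifted)
    n = len(combined)
    return sum(sign(combined.index(v) + 1 - (n + 1) / 2) for v in shifted)


# Closed-form estimate (5.37): difference of the sample medians.
delta_s = median(y) - median(x)

# The process is positive at delta = 0 and negative for a large shift,
# so it changes sign somewhere in between.
w_s_at_0 = sign_score_process(x, y, 0.0)
w_s_at_20 = sign_score_process(x, y, 20.0)

print(delta_s, w_s_at_0, w_s_at_20)
```

Evaluating the process over a grid of Δ values is one simple way to see where it crosses zero and to compare that location with the difference of medians.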

Other examples are given in the exercises. Remark 5.1 (Computation). Computation of the analyses for general score functions can be performed by the R functions of Hettmansperger and McKean (2011). In particular, the normal scores analysis of Example 5.3 can be computed by using the command twosampr2(x,y,score=phinscr), where x and y are the vectors containing the X and Y observations, respectively.

EXERCISES

5.1. In this section, as discussed above expression (5.2), the scores $a_\varphi(i)$ are generated by the standardized score function $\varphi(u)$; that is, $\int_0^1 \varphi(u)\,du = 0$ and $\int_0^1 \varphi^2(u)\,du = 1$. Suppose that $\psi(u)$ is a square-integrable function defined on the interval (0, 1). Consider the score function defined by

$$\varphi(u) = \frac{\psi(u) - \bar{\psi}}{\sqrt{\int_0^1 [\psi(v) - \bar{\psi}]^2\,dv}},$$

where $\bar{\psi} = \int_0^1 \psi(v)\,dv$. Show that $\varphi(u)$ is a standardized score function.

5.2. Complete the derivation of the null variance of the test statistic $W_\varphi$ by showing the second term in expression (5.7) is true. Hint: Use the fact that under $H_0$, for $j \neq j'$, the pair $(a_\varphi(R(Y_j)), a_\varphi(R(Y_{j'})))$ is uniformly distributed on the pairs of integers $(i, i')$, $i, i' = 1, 2, \ldots, n$, $i \neq i'$.

5.3. For the Wilcoxon score function $\varphi(u) = \sqrt{12}[u - (1/2)]$, obtain the value of $s_a$. Then show that the $V_{H_0}(W_\varphi)$ given in expression (5.8) is the same (except for standardization) as the variance of the MWW statistic of Section 4.

5.4. Recall that the scores have been standardized so that $\int_{-\infty}^{\infty} \varphi^2(u)\,du = 1$. Use this and a Riemann sum to show that $n^{-1}s_a^2 \to 1$, where $s_a^2$ is defined in expression (5.6).

5.5. Show that the normal scores, (5.29), derived in Example 5.1 are standardized; that is, $\int_0^1 \varphi_N(u)\,du = 0$ and $\int_0^1 \varphi_N^2(u)\,du = 1$.

5.6. In Theorem 5.1, show that the minimum value of $W_\varphi(\Delta)$ is given by $\sum_{j=1}^{n_2} a_\varphi(j)$ and that it is nonpositive.

5.7. Show that $E_\Delta[W_\varphi(0)] = E_0[W_\varphi(-\Delta)]$.

5.8. Consider the hypotheses (4.2). Suppose we select the score function $\varphi(u)$ and the corresponding test based on $W_\varphi$. Suppose we want to determine the sample size $n = n_1 + n_2$ for this test of significance level α to detect the alternative $\Delta^*$ with approximate power $\gamma^*$. Assuming that the sample sizes $n_1$ and $n_2$ are the same, show that

$$n \approx \left(\frac{(z_\alpha - z_{\gamma^*})\,2\tau_\varphi}{\Delta^*}\right)^2. \tag{5.38}$$

5.9. In the context of this section, show the following invariances:
(a) Show that the parameter $\tau_\varphi$, (5.24), is a scale functional as defined in Exercise 1.4.
(b) Show that part (a) implies that the efficacy, (5.20), is invariant to the location and varies indirectly with scale.
(c) Suppose Z is a scale and location transformation of a random variable X; i.e., $Z = a(X - b)$, where $a > 0$ and $-\infty < b < \infty$. Show that $I(f_Z) = a^{-2}I(f_X)$.

5.10. Consider the scale parameter $\tau_\varphi$, (5.24), when normal scores are used; i.e., $\varphi(u) = \Phi^{-1}(u)$. Suppose we are sampling from a $N(\mu, \sigma^2)$ distribution. Show that $\tau_\varphi = \sigma$.

5.11. In the context of Example 5.2, obtain the results in expression (5.32).

5.12. Let the scores $a(i)$ be generated by $a_\varphi(i) = \varphi[i/(n + 1)]$, for $i = 1, \ldots, n$, where $\int_0^1 \varphi(u)\,du = 0$ and $\int_0^1 \varphi^2(u)\,du = 1$. Using Riemann sums, with subintervals of equal length, of the integrals $\int_0^1 \varphi(u)\,du$ and $\int_0^1 \varphi^2(u)\,du$, show that $\sum_{i=1}^n a(i) \approx 0$ and $\sum_{i=1}^n a^2(i) \approx n$.

5.13. Consider the sign scores test procedure discussed in Example 5.4.
(a) Show that $W_S = 2W_S^* - n_2$, where $W_S^* = \#_j\{R(Y_j) > \frac{n+1}{2}\}$. Hence $W_S^*$ is an equivalent test statistic. Find the null mean and variance of $W_S$.
(b) Show that $W_S^* = \#_j\{Y_j > \theta^*\}$, where $\theta^*$ is the combined sample median.
(c) Suppose n is even. Letting $W_{XS}^* = \#_i\{X_i > \theta^*\}$, show that we can table $W_S^*$ in the following 2 × 2 contingency table with all margins fixed:

                       Y               X
  No. items > θ*       W_S*            W_XS*            n/2
  No. items < θ*       n2 − W_S*       n1 − W_XS*       n/2
                       n2              n1               n

Show that the usual $\chi^2$ goodness-of-fit is the same as $Z_S^2$, where $Z_S$ is the standardized z-test based on $W_S$. This is often called Mood's median test; see Example 5.4.

5.14. Recall the data discussed in Example 5.3.
(a) Obtain the contingency table described in Exercise 5.13.
(b) Obtain the $\chi^2$ goodness-of-fit test statistic associated with the table and use it to test at level 0.05 the hypotheses $H_0: \Delta = 0$ versus $H_1: \Delta \neq 0$.
(c) Obtain the point estimate of Δ given in expression (5.37).

5.15. Optimal signed-rank based methods also exist for the one-sample problem. In this exercise, we briefly discuss these methods. Let $X_1, X_2, \ldots, X_n$ follow the location model

$$X_i = \theta + e_i, \tag{5.39}$$

where $e_1, e_2, \ldots, e_n$ are iid with pdf $f(x)$, which is symmetric about 0; i.e., $f(-x) = f(x)$.
(a) Show that under symmetry the optimal two-sample score function (5.26) satisfies

$$\varphi_f(1 - u) = -\varphi_f(u), \quad 0 < u < 1; \tag{5.40}$$

that is, $\varphi_f(u)$ is an odd function about $\frac{1}{2}$. Show that a function satisfying (5.40) is 0 at $u = \frac{1}{2}$.
(b) For a two-sample score function $\varphi(u)$ which is odd about $\frac{1}{2}$, define the function $\varphi^+(u) = \varphi[(u+1)/2]$, i.e., the top half of $\varphi(u)$. Note that the domain of $\varphi^+(u)$ is the interval (0, 1). Show that $\varphi^+(u) \geq 0$, provided $\varphi(u)$ is nondecreasing.

(c) Assume for the remainder of the problem that $\varphi^+(u)$ is nonnegative and nondecreasing on the interval (0, 1). Define the scores $a^+(i) = \varphi^+[i/(n + 1)]$, $i = 1, 2, \ldots, n$, and the corresponding statistic

$$W_{\varphi^+} = \sum_{i=1}^{n} \mathrm{sgn}(X_i)\,a^+(R|X_i|). \tag{5.41}$$

Show that $W_{\varphi^+}$ reduces to a linear function of the signed-rank test statistic (3.2) if $\varphi(u) = 2u - 1$.
(d) Show that $W_{\varphi^+}$ reduces to a linear function of the sign test statistic (2.3) if $\varphi(u) = \mathrm{sgn}(2u - 1)$.
Note: Suppose Model (5.39) is true and we take $\varphi(u) = \varphi_f(u)$, where $\varphi_f(u)$ is given by (5.26). If we choose $\varphi^+(u) = \varphi[(u + 1)/2]$ to generate the signed-rank scores, then it can be shown that the corresponding test statistic $W_{\varphi^+}$ is optimal, among all signed-rank tests.
(e) Consider the hypotheses $H_0: \theta = 0$ versus $H_1: \theta > 0$. Our decision rule for the statistic $W_{\varphi^+}$ is to reject $H_0$ in favor of $H_1$ if $W_{\varphi^+} \geq k$, for some k. Write $W_{\varphi^+}$ in terms of the anti-ranks, (3.5). Show that $W_{\varphi^+}$ is distribution-free under $H_0$.
(f) Determine the mean and variance of $W_{\varphi^+}$ under $H_0$.
(g) Assuming that, when properly standardized, the null distribution is asymptotically normal, determine the asymptotic test.

6 Adaptive Procedures

In the last section, we presented fully efficient rank-based procedures for testing and estimation. As with mle methods, though, the underlying form of the distribution must be known in order to select the optimal rank score function. In practice, the underlying distribution is often unknown. In this case, we could select a score function, such as the Wilcoxon, which is fairly efficient for moderate- to heavy-tailed error distributions. Or if the distribution of the errors is thought to be quite close to a normal distribution, then the normal scores would be a proper choice. Suppose instead that we use a technique which bases the score selection on the data. Such techniques are called adaptive procedures. An adaptive procedure could attempt to estimate the score function; see, for example, Naranjo and McKean (1997). However, large data sets are often needed for this. Other adaptive procedures attempt to select a score from a finite class of scores based on some criteria. In this section, we look at an adaptive procedure for testing that retains the distribution-free property.

Frequently, an investigator is tempted to evaluate several test statistics associated with a single hypothesis and then use the one statistic that best supports his or her position, usually rejection. Obviously, this type of procedure changes the actual significance level of the test from the nominal α that is used. However, there is a way in which the investigator can first look at the data and then select a test statistic without changing this significance level. For illustration, suppose there are three possible test statistics, $W_1$, $W_2$, and $W_3$, of the hypothesis $H_0$ with respective critical regions $C_1$, $C_2$, and $C_3$ such that $P(W_i \in C_i; H_0) = \alpha$, $i = 1, 2, 3$. Moreover, suppose that a statistic Q, based upon the same data, selects one and only one of the statistics $W_1$, $W_2$, $W_3$, and that the selected $W_i$ is then used to test $H_0$. For example, we choose to use the test statistic $W_i$ if $Q \in D_i$, $i = 1, 2, 3$, where the events defined by $D_1$, $D_2$, and $D_3$ are mutually exclusive and exhaustive. Now if Q and each $W_i$ are independent when $H_0$ is true, then the probability of rejection, using the entire procedure (selecting and testing), is, under $H_0$,

$$\begin{aligned}
&P_{H_0}(Q \in D_1, W_1 \in C_1) + P_{H_0}(Q \in D_2, W_2 \in C_2) + P_{H_0}(Q \in D_3, W_3 \in C_3) \\
&\quad = P_{H_0}(Q \in D_1)P_{H_0}(W_1 \in C_1) + P_{H_0}(Q \in D_2)P_{H_0}(W_2 \in C_2) + P_{H_0}(Q \in D_3)P_{H_0}(W_3 \in C_3) \\
&\quad = \alpha[P_{H_0}(Q \in D_1) + P_{H_0}(Q \in D_2) + P_{H_0}(Q \in D_3)] = \alpha.
\end{aligned}$$

That is, the procedure of selecting $W_i$ using an independent statistic Q and then constructing a test of significance level α with the statistic $W_i$ has overall significance level α. Of course, the important element in this procedure is the ability to find a selector Q that is independent of each test statistic $W_i$. This can frequently be done by using the fact that complete sufficient statistics for the parameters, given by $H_0$, are independent of every statistic whose distribution is free of those parameters.
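The level calculation above can be checked exactly in a very small two-sample case. The Python sketch below enumerates the null distribution of two rank statistics (Wilcoxon scores and sign scores) and then combines them with selection probabilities. Everything specific in it is an assumption chosen for illustration: the sample sizes n = 6 and n2 = 3, the critical values, the function name `null_level`, and the selection probabilities 2/5 and 3/5, which stand in for P(Q ∈ D_i) of some selector that is independent of the ranks.

```python
from fractions import Fraction
from itertools import combinations


def null_level(scores, n2, critical):
    """Exact P(W >= critical) under H0, where W sums the scores of the Y ranks.

    Under H0 each subset of n2 ranks out of n is equally likely to be the
    set of Y ranks, whatever the underlying continuous cdf is.
    """
    n = len(scores)
    subsets = list(combinations(range(n), n2))
    rejections = sum(1 for s in subsets
                     if sum(scores[i] for i in s) >= critical)
    return Fraction(rejections, len(subsets))


n, n2 = 6, 3

# Wilcoxon-type scores a(i) = i and sign scores sgn(i - (n + 1)/2).
wilcoxon_scores = list(range(1, n + 1))
sign_scores = [1 if 2 * i > n + 1 else -1 for i in range(1, n + 1)]

# Each test rejects only when the Y sample holds the three largest ranks.
alpha_1 = null_level(wilcoxon_scores, n2, critical=15)
alpha_2 = null_level(sign_scores, n2, critical=3)

# Hypothetical selection probabilities; any values summing to 1 would do.
p_select = [Fraction(2, 5), Fraction(3, 5)]
overall = p_select[0] * alpha_1 + p_select[1] * alpha_2

print(alpha_1, alpha_2, overall)
```

Because both tests have the same exact level here, the overall level is unchanged no matter how the selection probabilities are split, which is the point of the argument; the independence of the selector and the rank statistics is assumed in the sketch, not verified by it.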
For illustration, if independent random samples of sizes $n_1$ and $n_2$ arise from two normal distributions with respective means $\mu_1$ and $\mu_2$ and common variance $\sigma^2$, then the complete sufficient statistics $\bar{X}$, $\bar{Y}$, and

$$V = \sum_{1}^{n_1}(X_i - \bar{X})^2 + \sum_{1}^{n_2}(Y_i - \bar{Y})^2$$

for $\mu_1$, $\mu_2$, and $\sigma^2$ are independent of every statistic whose distribution is free of $\mu_1$, $\mu_2$, and $\sigma^2$, such as the statistics

$$\frac{\sum_{1}^{n_1}(X_i - \bar{X})^2}{\sum_{1}^{n_2}(Y_i - \bar{Y})^2}, \qquad \frac{\sum_{1}^{n_1}|X_i - \mathrm{median}(X_i)|}{\sum_{1}^{n_2}|Y_i - \mathrm{median}(Y_i)|}, \qquad \frac{\mathrm{range}(X_1, X_2, \ldots, X_{n_1})}{\mathrm{range}(Y_1, Y_2, \ldots, Y_{n_2})}.$$

Thus, in general, we would hope to be able to find a selector Q that is a function of the complete sufficient statistics for the parameters, under $H_0$, so that it is independent of the test statistic.

It is particularly interesting to note that it is relatively easy to use this technique in nonparametric methods by using the independence result based upon complete sufficient statistics for parameters. For the situations here, we must find complete sufficient statistics for a cdf, F, of the continuous type. The order statistics $Y_1 < Y_2 < \cdots < Y_n$ of a random sample of size n from a distribution of the continuous type with pdf $F'(x) = f(x)$ are sufficient statistics for the "parameter" f (or F). Moreover, if the family of distributions contains all probability density functions of the continuous type, the family of joint probability density functions of $Y_1, Y_2, \ldots, Y_n$ is also complete. That is, the order statistics $Y_1, Y_2, \ldots, Y_n$ are complete sufficient statistics for the parameter f (or F). Accordingly, our selector Q is based upon those complete sufficient statistics, the order statistics under $H_0$. This allows us to independently choose a distribution-free test appropriate for this type of underlying distribution, and thus increase the power of our test.

A statistical test that maintains the significance level close to a desired significance level α for a wide variety of underlying distributions with good (not necessarily the best for any one type of distribution) power for all these distributions is described as being robust. As an illustration, the pooled t-test (Student's t) used to test the equality of the means of two normal distributions is quite robust provided that the underlying distributions are rather close to normal ones with common variance. However, if the class of distributions includes those that are not too close to normal ones, such as contaminated normal distributions, the test based upon t is not robust; the significance level is not maintained and the power of the t-test can be quite low for heavy-tailed distributions.
As a matter of fact, the test based on the Mann–Whitney–Wilcoxon statistic (Section 4) is a much more robust test than that based upon t if the class of distributions includes those with heavy tails. In the following example, we illustrate a robust, adaptive, distribution-free procedure in the setting of the two-sample problem.

Example 6.1. Let $X_1, X_2, \ldots, X_{n_1}$ be a random sample from a continuous-type distribution with cdf $F(x)$ and let $Y_1, Y_2, \ldots, Y_{n_2}$ be a random sample from a distribution with cdf $F(x - \Delta)$. Let $n = n_1 + n_2$ denote the combined sample size. We test $H_0: \Delta = 0$ versus $H_1: \Delta > 0$, by using one of four distribution-free statistics, one being the Wilcoxon and the other three being modifications of the Wilcoxon. In particular, the test statistics are

$$W_i = \sum_{j=1}^{n_2} a_i[R(Y_j)], \quad i = 1, 2, 3, 4, \tag{6.1}$$

where ai (j) = ϕi [j/(n + 1)], and the four functions are displayed in Figure 6.1. The score function ϕ1 (u) is the Wilcoxon. The score function ϕ2 (u) is the sign score function. The score function ϕ3 (u) is good for short-tailed distributions, and ϕ4 (u) is good for long, right-skewed distributions with shift alternatives.


Figure 6.1: Plots of the score functions $\varphi_1(u)$, $\varphi_2(u)$, $\varphi_3(u)$, and $\varphi_4(u)$.

We combine the two samples into one, denoting the order statistics of the combined sample by $V_1 < V_2 < \cdots < V_n$. These are complete sufficient statistics for $F(x)$ under the null hypothesis. For $i = 1, \ldots, 4$, the test statistic $W_i$ is distribution free under $H_0$ and, in particular, the distribution of $W_i$ does not depend on $F(x)$. Therefore, each $W_i$ is independent of $V_1, V_2, \ldots, V_n$. We use a pair of selector statistics $(Q_1, Q_2)$, which are functions of $V_1, V_2, \ldots, V_n$, and hence are also independent of each $W_i$. The first is

$$Q_1 = \frac{\bar{U}_{.05} - \bar{M}_{.5}}{\bar{M}_{.5} - \bar{L}_{.05}}, \tag{6.2}$$

where $\bar{U}_{.05}$, $\bar{M}_{.5}$, and $\bar{L}_{.05}$ are the averages of the largest 5% of the Vs, the middle 50% of the Vs, and the smallest 5% of the Vs, respectively. If $Q_1$ is large (say 2 or more), then the right tail of the distribution seems longer than the left tail; that is, there is an indication that the distribution is skewed to the right. On the other hand, if $Q_1 < \frac{1}{2}$, the sample indicates that the distribution may be skewed to the left. The second selector statistic is

$$Q_2 = \frac{\bar{U}_{.05} - \bar{L}_{.05}}{\bar{U}_{.5} - \bar{L}_{.5}}. \tag{6.3}$$

Large values of Q2 indicate that the distribution is heavy-tailed, while small values indicate that the distribution is light-tailed. Rules are needed for score selection, and here we make use of the benchmarks proposed in an article by Hogg et al. (1975). These rules are tabulated below, along with their benchmarks:

  Benchmark               Distribution Indicated     Score Selected
  Q2 > 7                  Heavy-tailed symmetric     ϕ2
  Q1 > 2 and Q2 < 7       Right-skewed               ϕ4
  Q1 ≤ 2 and Q2 ≤ 2       Light-tailed symmetric     ϕ3
  Elsewhere               Moderate heavy-tailed      ϕ1
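The selector statistics (6.2) and (6.3) and the benchmark rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Hogg et al. (1975): the number of observations in each percentage slice is taken as the floor of the nominal count (at least one), which is one of several possible conventions; $\bar{U}_{.5}$ and $\bar{L}_{.5}$ are taken, by analogy with the 5% averages, to be the averages of the largest and smallest 50% of the pooled sample; and the names `slice_count`, `selectors`, and `select_score` are hypothetical.

```python
from statistics import mean


def slice_count(n, percent):
    """Observations in a 'percent' slice of n (assumed rule: floor, at least 1)."""
    return max(1, (n * percent) // 100)


def selectors(pooled):
    """Return (Q1, Q2) of (6.2) and (6.3) computed from the pooled sample."""
    v = sorted(pooled)
    n = len(v)
    k5 = slice_count(n, 5)
    k25 = slice_count(n, 25)
    k50 = slice_count(n, 50)
    u05, l05 = mean(v[-k5:]), mean(v[:k5])    # largest / smallest 5%
    m50 = mean(v[k25:n - k25])                # middle 50%
    u50, l50 = mean(v[-k50:]), mean(v[:k50])  # largest / smallest 50%
    q1 = (u05 - m50) / (m50 - l05)
    q2 = (u05 - l05) / (u50 - l50)
    return q1, q2


def select_score(q1, q2):
    """Apply the benchmark rules of the table, in the order listed."""
    if q2 > 7:
        return "phi2"   # heavy-tailed symmetric
    if q1 > 2 and q2 < 7:
        return "phi4"   # right-skewed
    if q1 <= 2 and q2 <= 2:
        return "phi3"   # light-tailed symmetric
    return "phi1"       # moderate heavy-tailed (Wilcoxon)


# An evenly spaced pooled sample, which behaves like a light-tailed case.
q1, q2 = selectors(list(range(1, 101)))
print(q1, q2, select_score(q1, q2))
```

For the evenly spaced sample the selectors come out at Q1 = 1 and Q2 = 1.9, so the rules pick ϕ3, in line with the light-tailed row of the table. Since the selectors depend only on the pooled order statistics, the same function would be applied to the combined X and Y observations before any ranks are examined.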

Hogg et al. (1975) performed a Monte Carlo power study of this adaptive procedure over a number of distributions with different kurtosis and skewness coefficients. In the study, both the adaptive procedure and the Wilcoxon test maintain their α level over the distributions, but the Student t does not. Moreover, the Wilcoxon test has better power than the t-test when the distribution deviates substantially from the normal (kurtosis = 3 and skewness = 0), but the adaptive procedure is much better than the Wilcoxon for the short-tailed distributions, the very heavy-tailed distributions, and the highly skewed distributions which were considered in the study.

The adaptive distribution-free procedure that we have discussed is for testing. Suppose we have a location model and are interested in estimating the shift in locations Δ. For example, if the true F is a normal cdf, then a good choice for the estimator of Δ would be the estimator based on the normal scores procedure discussed in Example 5.1. The estimators, though, are not distribution free and, hence, the above reasoning does not hold. Also, the combined sample observations $X_1, \ldots, X_{n_1}, Y_1, \ldots, Y_{n_2}$ are not identically distributed. There are adaptive procedures based on residuals $X_1, \ldots, X_{n_1}, Y_1 - \hat{\Delta}, \ldots, Y_{n_2} - \hat{\Delta}$, where $\hat{\Delta}$ is an initial estimator of Δ; see page 237 of Hettmansperger and McKean (2011) for discussion.

EXERCISES

6.1. In Exercises 6.2 and 6.3, the student is asked to apply the adaptive procedure described in Example 6.1 to real data sets. The hypotheses of interest are $H_0: \Delta = 0$ versus $H_1: \Delta > 0$, where $\Delta = \mu_Y - \mu_X$. The four distribution-free test statistics are

$$W_i = \sum_{j=1}^{n_2} a_i[R(Y_j)], \quad i = 1, 2, 3, 4, \tag{6.4}$$

where ai (j) = ϕi [j/(n + 1)], and the score functions are given by ϕ1 (u)

=

ϕ2 (u)

=

ϕ3 (u)

=

ϕ4 (u)

=

2u − 1,

0