Chapter 15: Maximum Likelihood Estimation, Likelihood Ratio Test, Bayes Estimation, and Decision Theory
Bei Ye, Yajing Zhao, Lin Qian, Lin Sun, Ralph Hurtado, Gao Chen, Yuanchi Xue, Tim Knapik, Yunan Min, Rui Li

Section 15.1 Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE)
1. Likelihood function
2. Calculation of MLE
3. Properties of MLE
4. Large sample inference and delta method

Likelihood Function
1.1 Parameter space Θ
• X1, ..., Xn: i.i.d. observations
• θ: an unknown parameter
• Θ: the set of all possible values of θ

1.2 Joint p.d.f. or p.m.f. of X1, ..., Xn:
$$f(x_1,\ldots,x_n \mid \theta) = f(x_1\mid\theta)\,f(x_2\mid\theta)\cdots f(x_n\mid\theta) = \prod_{i=1}^{n} f(x_i\mid\theta)$$

1.3 Likelihood function of θ. For observed x1, ..., xn:
$$L(\theta \mid x_1,\ldots,x_n) = \prod_{i=1}^{n} f(x_i\mid\theta)$$
The joint p.d.f. or p.m.f. is a function of x1, ..., xn for given θ; the likelihood function is a function of θ for given x1, ..., xn.

Example: Normal Distribution
Suppose x1, ..., xn is a random sample from a normal distribution with p.d.f.
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}$$
with vector parameter θ = (μ, σ²). The likelihood function is
$$L(\mu,\sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\} = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right\}$$

Calculation of MLE
2.1 The maximum likelihood estimate is the value θ̂ = θ̂(x1, ..., xn) that maximizes the likelihood function L(θ | x1, ..., xn).

Simple example:
• Two independent Bernoulli trials with success probability θ.
• θ is known to be either 1/4 or 1/3, so Θ = {1/4, 1/3}.
• The probabilities of observing x = 0, 1, 2 successes (x = the number of successes) can be calculated; see the following table.

Probability of observing x successes:

θ \ x | 0    | 1    | 2
1/4   | 9/16 | 6/16 | 1/16
1/3   | 4/9  | 4/9  | 1/9

The MLE is chosen to maximize L(θ | x) for the observed x:
• x = 0: θ̂ = 1/4
• x = 1 or 2: θ̂ = 1/3

2.2 Log-likelihood function:
$$\ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i;\theta)$$
Set the derivative of ln L(θ) equal to zero and solve for θ:
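For the normal example, maximizing the log-likelihood yields the familiar closed forms μ̂ = x̄ and σ̂² = (1/n)Σ(xi − x̄)². A minimal pure-Python sketch (the data and function name are hypothetical, not from the text):

```python
# MLE for a normal sample in closed form:
# mu_hat = sample mean, sigma2_hat = average squared deviation (divisor n, not n-1).
def normal_mle(xs):
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

data = [4.0, 6.0, 5.0, 7.0, 3.0]   # hypothetical sample
mu_hat, s2_hat = normal_mle(data)  # mu_hat = 5.0, s2_hat = 2.0
```

Note the divisor n: the MLE of σ² is biased downward relative to the usual sample variance with divisor n − 1.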
$$\frac{d\,[\ln L(\theta)]}{d\theta} = 0$$
Note: this method requires the likelihood function to be differentiable in θ.

Properties of MLE
The MLE has optimality properties in large samples, expressed through the concept of information due to Fisher.

3.1 Fisher information:
$$I(\theta) = \int \left[\frac{d\ln f(x\mid\theta)}{d\theta}\right]^2 f(x\mid\theta)\,dx = E\left[\left(\frac{d\ln f(X\mid\theta)}{d\theta}\right)^2\right]$$
Alternative expression:
$$I(\theta) = -\int \frac{d^2\ln f(x\mid\theta)}{d\theta^2}\, f(x\mid\theta)\,dx = -E\left[\frac{d^2\ln f(X\mid\theta)}{d\theta^2}\right]$$

Fisher information for an i.i.d. sample X1, X2, ..., Xn from p.d.f. f(x|θ):
$$I_n(\theta) = -E\left[\frac{d^2 \ln f(X_1,\ldots,X_n\mid\theta)}{d\theta^2}\right] = -E\left[\frac{d^2}{d\theta^2}\bigl(\ln f(X_1\mid\theta)+\cdots+\ln f(X_n\mid\theta)\bigr)\right] = I(\theta)+\cdots+I(\theta) = n\,I(\theta)$$

For a k-dimensional vector parameter θ = (θ1, ..., θk) with p.d.f. f(x|θ), the information matrix I(θ) has entries
$$I_{ij}(\theta) = E\left[\frac{\partial\ln f(X\mid\theta)}{\partial\theta_i}\,\frac{\partial\ln f(X\mid\theta)}{\partial\theta_j}\right] = -E\left[\frac{\partial^2\ln f(X\mid\theta)}{\partial\theta_i\,\partial\theta_j}\right]$$

3.2 Cramér–Rao Lower Bound
Let X1, X2, ..., Xn be a random sample from p.d.f. f(x|θ), and let θ̂ be any estimator of θ with E(θ̂) = θ + B(θ), where B(θ) is the bias of θ̂. If B(θ) is differentiable in θ and certain regularity conditions hold, then
$$\operatorname{Var}(\hat\theta) \ge \frac{[\,1 + B'(\theta)\,]^2}{n\,I(\theta)} \qquad \text{(Cramér–Rao inequality)}$$
The ratio of the lower bound to the variance of an estimator of θ is called the efficiency of the estimator; an estimator with efficiency 1 is called an efficient estimator.

Large Sample Inferences and Delta Method
4.1 Large sample inferences
To make large-sample inferences on a single unknown parameter θ, we estimate
$$\operatorname{Var}(\hat\theta) \simeq \frac{1}{n\,I(\hat\theta)}$$
where I(θ) is estimated by
$$\hat I(\hat\theta) = -\frac{1}{n}\sum_{i=1}^{n}\left.\frac{d^2\ln f(X_i\mid\theta)}{d\theta^2}\right|_{\theta=\hat\theta}$$
This estimate does not require evaluation of the expected value. An approximate large-sample CI on θ is
$$\hat\theta - z_{1-\alpha/2}\,\frac{1}{\sqrt{n\,I(\hat\theta)}} \;\le\; \theta \;\le\; \hat\theta + z_{1-\alpha/2}\,\frac{1}{\sqrt{n\,I(\hat\theta)}}$$

4.2 Delta method for approximating the variance of an estimator
Goal: estimate a nonlinear function h(θ).
Suppose that E(θ̂) ≈ θ and Var(θ̂) is a known function of θ. A first-order Taylor expansion of h around θ gives the delta method:
$$h(\hat\theta) \approx h(\theta) + (\hat\theta - \theta)\,h'(\theta)$$
$$\operatorname{Var}[h(\hat\theta)] \approx [h'(\theta)]^2\,\operatorname{Var}(\hat\theta)$$

Section 15.2 Likelihood Ratio Test

Likelihood Ratio (LR) Test
1. Background of LR test
2. Neyman–Pearson lemma and test
3. Examples
4. Generalized likelihood ratio test

Background of LR test
• Jerzy Neyman (1894–1981), Polish-American mathematician.
• Egon Sharpe Pearson (1895–1980), English mathematician.

Neyman–Pearson lemma
We want to find a rejection region R such that the type I and type II error probabilities are both as small as possible. Suppose x1, x2, ..., xn have joint p.d.f. f(x1, ..., xn; θ). Consider the ratio
$$\lambda(x;\theta_0,\theta_1) = \frac{f(x_1,\ldots,x_n;\theta_1)}{f(x_1,\ldots,x_n;\theta_0)} = \frac{L(\theta_1)}{L(\theta_0)}$$
Then a best critical region is
$$C = \{(x_1,\ldots,x_n) : \lambda(x;\theta_0,\theta_1) \ge k\}$$
where k is a constant such that P((x1, ..., xn) ∈ C; H0) = α.

What is the likelihood ratio (LR) test?
• A ratio λ = L1/L0 is computed, where the numerator is the probability of the observed result under the alternative hypothesis and the denominator is the probability under the null hypothesis.
• Test idea: if we observe x, then the condition L1(x) > L0(x) is evidence in favor of the alternative; the opposite inequality is evidence against the alternative.
• Hence the decision to reject the null hypothesis is based on the value of this ratio.

The test
• Let x1, x2, ..., xn be a random sample with p.d.f. f(x|θ).
• Hypotheses: H0: θ = θ0 vs. H1: θ = θ1.
• Test statistic:
$$\lambda = \frac{L(\theta_1 \mid x_1, x_2, \ldots, x_n)}{L(\theta_0 \mid x_1, x_2, \ldots, x_n)}$$
• Reject H0 when the likelihood ratio exceeds k.

Characteristics of the LR test
• Most powerful test of significance level α: it maximizes the power 1 − β = P(reject H0 | θ1).
• Very useful and widely applicable, especially in medicine to assist in interpreting diagnostic tests.
• The exact distribution of the likelihood ratio for specific hypotheses is often very difficult to determine.
• The computations are difficult to perform by hand.
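The Neyman–Pearson ratio above can be sketched numerically for two simple hypotheses about a normal mean with known σ; the data and function name here are illustrative, not from the text:

```python
import math

# Likelihood ratio lambda = L(mu1 | x) / L(mu0 | x) for i.i.d. N(mu, sigma^2) data.
# Computed via log-likelihoods for numerical stability.
def likelihood_ratio(xs, mu0, mu1, sigma):
    def loglik(mu):
        return sum(-0.5 * ((x - mu) / sigma) ** 2
                   - math.log(sigma * math.sqrt(2.0 * math.pi)) for x in xs)
    return math.exp(loglik(mu1) - loglik(mu0))

xs = [1.2, 0.8, 1.5, 1.1]                    # hypothetical sample, xbar = 1.15
lam = likelihood_ratio(xs, 0.0, 1.0, 1.0)    # lam > 1: data favor H1: mu = 1
```

The Neyman–Pearson test would reject H0 when `lam` exceeds a constant k chosen so the test has level α.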
Example 1: Test on a normal distribution mean
H0: μ = μ0 vs. H1: μ = μ1 > μ0, with σ known. The likelihoods under H0 and H1 are
$$L(\mu_0 \mid x_1,\ldots,x_n) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left(-\frac{\sum_{i=1}^n (x_i-\mu_0)^2}{2\sigma^2}\right)$$
$$L(\mu_1 \mid x_1,\ldots,x_n) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left(-\frac{\sum_{i=1}^n (x_i-\mu_1)^2}{2\sigma^2}\right)$$

Example 1 (continued): the likelihood ratio is
$$\frac{L(\mu_1\mid x_1,\ldots,x_n)}{L(\mu_0\mid x_1,\ldots,x_n)} = \exp\left(\frac{n}{2\sigma^2}\,\bigl(2\bar x - (\mu_0+\mu_1)\bigr)(\mu_1-\mu_0)\right)$$
Reject H0 when the ratio exceeds a constant k; since the ratio is increasing in x̄, this is equivalent to rejecting when x̄ ≥ c, where c is chosen for a specified significance level: P(x̄ ≥ c | μ0) = α. The test does not depend on μ1; it is the most powerful level-α test for every μ1 > μ0.

A numerical example
Suppose a random sample x of size n = 1 from f(x; θ) = θ x^{θ−1} I_(0,1)(x). Test H0: θ = 1 vs. H1: θ = 2 with α = 0.05.
$$\lambda(x;1,2) = \frac{f(x;2)}{f(x;1)} = \frac{2x\,I_{(0,1)}(x)}{1\cdot I_{(0,1)}(x)} = 2x$$
We need k such that P(2X ≥ k; H0) = 0.05; under H0, X is uniform on (0, 1), so k/2 = 0.95. So we reject H0 if x ≥ 0.95.

Generalized likelihood ratio test
• The Neyman–Pearson lemma shows that the most powerful test for a simple vs. simple hypothesis testing problem is a likelihood ratio test.
• The likelihood ratio method generalizes to the composite vs. composite hypothesis testing problem.

Hypotheses
• Suppose H0 specifies that θ is in Θ0 and H1 specifies that θ is in Θ0^c. Symbolically:
H0: θ ∈ Θ0 vs. H1: θ ∈ Θ0^c

Test statistic
$$\lambda(x_1,\ldots,x_n) = \frac{\max_{\theta\in\Theta_0} L(\theta\mid x_1,\ldots,x_n)}{\max_{\theta\in\Theta} L(\theta\mid x_1,\ldots,x_n)} = \frac{L(\hat\theta_0\mid x_1,\ldots,x_n)}{L(\hat\theta\mid x_1,\ldots,x_n)}$$
• Note that λ ≤ 1. Intuitively, the numerator is the maximum probability of the observed sample over parameters in the null hypothesis; the denominator is the maximum probability over all possible parameters; λ is the ratio of these two maxima.
• If H0 is true, λ should be close to 1; if H1 is true, λ should be smaller. A small λ means that the observed sample is much more likely for parameter points in the alternative hypothesis than for any parameter point in the null hypothesis.

Rejection region and critical constant
• Reject H0 if λ ≤ k, where k is the critical constant.
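For the one-sided normal-mean problem with known σ, the GLR statistic reduces to the closed form λ = 1 when x̄ ≤ μ0 and λ = exp{−n(x̄ − μ0)²/(2σ²)} when x̄ > μ0; a small illustrative sketch (function name hypothetical):

```python
import math

# GLR statistic for H0: mu <= mu0 vs H1: mu > mu0, data from N(mu, sigma^2), sigma known.
# lambda = 1 when xbar <= mu0 (restricted and unrestricted maxima coincide);
# otherwise lambda = exp(-n (xbar - mu0)^2 / (2 sigma^2)), so small lambda
# corresponds to a large one-sided z statistic.
def glr_lambda(xs, mu0, sigma):
    n = len(xs)
    xbar = sum(xs) / n
    if xbar <= mu0:
        return 1.0
    return math.exp(-n * (xbar - mu0) ** 2 / (2.0 * sigma ** 2))
```

Rejecting for λ ≤ k is then the same as rejecting for √n(x̄ − μ0)/σ ≥ c.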
• k < 1.
• k is chosen to make the level of the test equal to the specified α, that is, α = P_{Θ0}(λ ≤ k).

A simple example to illustrate the GLR test
Example 15.18 (GLR Test for Normal Mean: Known Variance). For a random sample x1, x2, ..., xn from an N(μ, σ²) distribution with known σ², derive the GLR test for the one-sided testing problem H0: μ ≤ μ0 vs. H1: μ > μ0, where μ0 is specified.

Solution: the likelihood function is
$$L(\mu \mid x_1,\ldots,x_n) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\mu)^2\right\} = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\bar x)^2 - \frac{n}{2\sigma^2}(\bar x-\mu)^2\right\}$$

If x̄ ≤ μ0, the restricted MLE of μ under H0 is simply x̄. If x̄ > μ0, the restricted MLE of μ under H0 is μ0, because in this case the maximum of the likelihood function under H0 is attained at the boundary μ = μ0.

Thus the numerator and denominator of the likelihood ratio are, respectively,
$$\max_{\mu\le\mu_0} L(\mu\mid x) = \begin{cases} \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left\{-\frac{1}{2\sigma^2}\sum_i (x_i-\bar x)^2\right\} & \text{if } \bar x \le \mu_0 \\[4pt] \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left\{-\frac{1}{2\sigma^2}\sum_i (x_i-\bar x)^2 - \frac{n}{2\sigma^2}(\bar x-\mu_0)^2\right\} & \text{if } \bar x > \mu_0 \end{cases}$$
$$\max_{\mu} L(\mu\mid x) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left\{-\frac{1}{2\sigma^2}\sum_i (x_i-\bar x)^2\right\}$$

Taking the ratio of the two and canceling the common terms, we get
$$\lambda = \begin{cases} 1 & \text{if } \bar x \le \mu_0 \\ \exp\left\{-\frac{n}{2\sigma^2}(\bar x-\mu_0)^2\right\} & \text{if } \bar x > \mu_0 \end{cases}$$

• Clearly, we do not reject H0 when λ = 1, i.e., when x̄ ≤ μ0.
• Therefore the condition λ < k is equivalent to √n(x̄ − μ0)/σ > c subject to x̄ > μ0.
• In other words, we reject H0 if √n(x̄ − μ0)/σ is large, which leads to the usual upper one-sided z-test.

Section 15.3 Bayesian Inference

Bayesian Inference
1. Background of Bayes
2. Bayesian inference defined
3. Bayesian estimation
4. Bayesian testing
Background of Thomas Bayes
• Thomas Bayes (1702–1761)
• British mathematician and Presbyterian minister
• Fellow of the Royal Society
• Studied logic and theology at the University of Edinburgh
• He was barred from studying at Oxford and Cambridge because of his religion

Background of Bayes
• Bayes' theorem: the famous probability theorem for finding "reverse probability":
$$P(B_j \mid A) = \frac{P(B_j)\,P(A\mid B_j)}{\sum_{i=1}^k P(B_i)\,P(A\mid B_i)}$$
• The theorem was published posthumously in a paper entitled "Essay Towards Solving a Problem in the Doctrine of Chances."

Bayesian Inference
• Application to statistics: qualitative overview
– Estimate an unknown parameter θ.
– Assumes the investigator has some prior knowledge of the unknown parameter θ.
– Assumes the prior knowledge can be summarized in the form of a probability distribution on θ, called the prior distribution; thus θ is a random variable.
– The data are used to update the prior distribution and obtain the posterior distribution.
– Inferences on θ are based on the posterior distribution.

• Criticisms by frequentists
– Prior knowledge is not accurate enough to form a meaningful prior distribution.
– Perceptions of prior knowledge differ from person to person, which may cause inferences on the same data to differ from person to person.

Some key terms in Bayesian inference
In the classical approach the parameter θ is thought to be an unknown but fixed quantity. In the Bayesian approach, θ is considered to be a quantity whose variation can be described by a probability distribution, called the prior distribution.
• prior distribution – a subjective distribution, based on the experimenter's belief, formulated before the data are seen.
• posterior distribution – computed from the prior and the likelihood function using Bayes' theorem.
• posterior mean – the mean of the posterior distribution.
• posterior variance – the variance of the posterior distribution.
• conjugate priors – a family of prior distributions whose key property is that the posterior distribution also belongs to the same family.

15.3.1 Bayesian Estimation
Now let's move on to how we can estimate parameters using the Bayesian approach (using the text's notation).

Let θ be an unknown parameter based on a random sample x1, x2, ..., xn from a distribution with p.d.f./p.m.f. f(x|θ). Let π(θ) be the prior distribution of θ and π*(θ | x1, x2, ..., xn) the posterior distribution. Applying Bayes' theorem (Eq. 15.1), the posterior distribution becomes
$$\pi^*(\theta \mid x_1,\ldots,x_n) = \frac{f(x_1,\ldots,x_n \mid \theta)\,\pi(\theta)}{\int_\Theta f(x_1,\ldots,x_n \mid \theta)\,\pi(\theta)\,d\theta}$$
Note that the denominator is the marginal p.d.f. of X1, X2, ..., Xn.

Bayesian estimation (continued)
As seen in Eq. 15.2, the posterior distribution represents what is known about θ after observing the data x1, ..., xn. Since f(x1, ..., xn | θ) is the likelihood, we note that
posterior distribution ∝ likelihood × prior distribution, i.e.
$$\pi^*(\theta\mid x) \propto f(x\mid\theta)\,\pi(\theta)$$
For a detailed practical example of deriving the posterior mean and using Bayesian estimation, visit: http://www.stat.berkeley.edu/users/rice/Stat135/Bayes.pdf

Example 15.25
Let x be an observation from an N(μ, σ²) distribution, where μ is unknown and σ² is known. Show that the normal distribution N(μ0, σ0²) is a conjugate prior on μ.
$$f(x\mid\mu)\,\pi(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\left\{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right\} = \frac{1}{2\pi\sigma\sigma_0}\exp\left\{-\frac{1}{2}\left[\frac{(x-\mu)^2}{\sigma^2}+\frac{(\mu-\mu_0)^2}{\sigma_0^2}\right]\right\}$$
We can ignore the factor 1/(2πσσ0) because it will cancel from both the numerator and the denominator of the expression for π*(μ|x). Similarly, any terms not involving μ can be canceled from the numerator and denominator.
Example 15.25 (continued)
Thus we see that π*(μ|x) is proportional to
$$\exp\left\{-\frac{(\mu-\mu^*)^2}{2\sigma^{*2}}\right\}$$
where
$$\mu^* = p_0 x + q_0\mu_0,\qquad p_0 = \frac{\sigma_0^2}{\sigma_0^2+\sigma^2},\qquad q_0 = 1-p_0 = \frac{\sigma^2}{\sigma_0^2+\sigma^2},\qquad \frac{1}{\sigma^{*2}} = \frac{1}{\sigma^2}+\frac{1}{\sigma_0^2}$$
It follows that π*(μ|x) has the form of a normal distribution: specifically, the N(μ*, σ*²) distribution (the normalizing constant (2πσ*²)^{-1/2} comes from the denominator).

Example 15.25 (continued): note the following remarks.
1. If the single observation x is replaced by the sample mean x̄ of n i.i.d. N(μ, σ²) observations, then the posterior distribution of μ is N(μ*, σ*²) with
$$\mu^* = p_0\bar x + q_0\mu_0,\qquad p_0 = \frac{\sigma_0^2}{\sigma_0^2+\sigma^2/n},\qquad q_0 = 1-p_0 = \frac{\sigma^2/n}{\sigma_0^2+\sigma^2/n},\qquad \frac{1}{\sigma^{*2}} = \frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}$$
2. The posterior mean (Bayes estimate) μ* is a weighted average of x̄ and μ0, with weights p0 and q0 inversely proportional to their respective variances. If the prior variance σ0² << σ²/n, the prior dominates the data and μ* is closer to μ0; conversely, if σ0² >> σ²/n, the data dominate the prior and μ* is closer to x̄.
3. The posterior variance σ*² is the harmonic sum of σ²/n and σ0².

15.3.2 Bayesian Testing
Setup: H0: θ = θ0 vs. H1: θ = θ1, with posterior probabilities
$$\pi_0^* = \pi^*(\theta_0) = P(\theta=\theta_0\mid x),\qquad \pi_1^* = \pi^*(\theta_1) = P(\theta=\theta_1\mid x),\qquad \pi_0^*+\pi_1^* = 1$$
If π1*/π0* > k, we reject H0 in favor of Ha, where k > 0 is a suitably chosen critical constant.

Example 15.27
Consider the problem of testing H0: θ = θ0 vs. Ha: θ = θ1, where θ1 > θ0, based on a random sample x1, ..., xn from an N(θ, σ²) distribution. Using the expression for the likelihood function from Example 15.16 and canceling the common terms, the ratio of the posteriors becomes
$$\frac{\pi_1^*}{\pi_0^*} = \frac{\pi_1\exp\left\{-\frac{n}{2\sigma^2}(\bar x-\theta_1)^2\right\}}{\pi_0\exp\left\{-\frac{n}{2\sigma^2}(\bar x-\theta_0)^2\right\}} = \frac{\pi_1}{\pi_0}\exp\left\{\frac{n}{2\sigma^2}\bigl(2\bar x-(\theta_0+\theta_1)\bigr)(\theta_1-\theta_0)\right\}$$
Therefore π1*/π0* > k holds if
$$\exp\left\{\frac{n}{2\sigma^2}\bigl(2\bar x-(\theta_0+\theta_1)\bigr)(\theta_1-\theta_0)\right\} > k\,\frac{\pi_0}{\pi_1}$$
We see that the Bayes test has the same form as the Neyman–Pearson test, but the critical constant is modified by the ratio of the prior probabilities.

Decision theory
1. Definition – loss, risk
2. Comparison of estimators – admissibility, minimax, Bayes risk
3. Example of hypothesis testing
Section 15.4 Decision Theory

Definition
• Decision theory aims to unite point estimation, confidence interval estimation, and hypothesis testing under a common framework.
• D: the set of all decisions.
• X: the set of all outcomes, typically a sample of independent and identically distributed random variables.
• δ: X → D, the function that chooses a decision given the sample.

Loss
• How do you evaluate the performance of a decision function δ?
• Consider point estimation: X = (X1, ..., Xn), where each Xi depends on an unknown parameter θ, and D = Θ, since we are attempting to estimate θ. The decision function should return an estimate of the true parameter; a "good" decision rule selects values close to the actual value of θ.

Loss and risk
• A loss function incurs a penalty when the estimate deviates from the true value of θ; for squared error loss,
$$L(\theta,\delta) = (\theta-\delta)^2$$
• The risk function is a measure of the decision rule's performance over the set of outcomes:
$$R(\theta,\delta) = E[L(\theta,\delta)]$$

Squared error loss decomposition:
$$R(\theta,\delta) = E[(\theta-\delta)^2] = E[\delta^2] - 2\theta E[\delta] + \theta^2 = \bigl(E[\delta^2]-E[\delta]^2\bigr) + \bigl(E[\delta]-\theta\bigr)^2 = \operatorname{Var}(\delta) + \operatorname{Bias}(\theta,\delta)^2$$

Comparing estimators
• Let X = (X1, ..., Xn), where Xi ~ Bernoulli(p).
• Unbiased estimator for p: δ1 = X̄, with
$$R(\delta_1,p) = \operatorname{Var}(\bar X) = \frac{p(1-p)}{n}$$
• Bayes estimator for p (assuming the prior P ~ Beta(a, b)):
$$\delta_2 = \frac{a+\sum_{i=1}^n X_i}{a+b+n},\qquad R(\delta_2,p) = \frac{\bigl((a+b)^2-n\bigr)p^2 + \bigl(n-2a(a+b)\bigr)p + a^2}{(a+b+n)^2}$$

If we let a = b = 1 and n = 4, the risk functions become
$$R(\delta_1,p) = \frac{p(1-p)}{4},\qquad R(\delta_2,p) = \frac{1}{36}$$
[Figure: comparison of risk for the point estimator and the Bayes estimator.] The estimators perform differently depending on the value of p. How do we select which estimator to use?
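The risk comparison above can be checked numerically; a small sketch (helper names are illustrative) of the two risk functions for a = b = 1, n = 4:

```python
# Risk of the unbiased estimator delta1 = Xbar for Bernoulli(p): p(1-p)/n.
def risk_unbiased(p, n=4):
    return p * (1 - p) / n

# Risk of the Bayes estimator delta2 = (a + sum X)/(a + b + n) under a Beta(a, b) prior,
# written as [n p(1-p) + (a - (a+b)p)^2] / (a+b+n)^2 (variance plus squared bias).
def risk_bayes(p, n=4, a=1, b=1):
    return (n * p * (1 - p) + (a - (a + b) * p) ** 2) / (a + b + n) ** 2

# With a = b = 1 and n = 4 the Bayes estimator's risk is the constant 1/36,
# while the unbiased estimator's risk p(1-p)/4 exceeds it near p = 1/2.
```

Neither rule dominates the other over all p, which is exactly why a selection principle (admissibility, minimax, or Bayes risk) is needed.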
Admissibility
• A decision rule δ1 is inadmissible if there exists another rule δ2 such that R(δ1, θ) ≥ R(δ2, θ) for all θ, with strict inequality for some θ.
[Figure: comparison of risk for two decision rules; decision rule 1 is inadmissible because its risk is higher than the other rule's risk for all values of p.]

Choosing estimators
• Restrict the set of estimators to those that are admissible.
• Minimax: choose the decision rule with the lowest maximum risk:
$$\min_\delta\ \max_\theta\ R(\theta,\delta)$$
• Bayes risk: assume θ comes from a distribution with p.d.f. π(θ); the Bayes risk is the average risk across values of θ:
$$B(\delta) = \int R(\theta,\delta)\,\pi(\theta)\,d\theta$$
Select the estimator with the smallest Bayes risk.

Comparing minimax and Bayes
A retailer must decide whether to purchase a large lot of items containing an unknown fraction p of defectives. The purchase price is $5 per item; a non-defective item sells for $6, for a profit of $1, but a defective item must be junked. Moreover, there is a penalty cost of $2 for not trading. The retailer inspects two items at random, to decide whether to purchase (d1) or not purchase (d2). Two decision rules:

No. defectives x | Rule δ1 | Rule δ2
0                | d1      | d1
1                | d2      | d1
2                | d2      | d2
Comparing minimax and Bayes
Loss functions:
$$L(d_1,p) = (2+5)p - 1\cdot(1-p) = 8p-1,\qquad L(d_2,p) = 2$$
Risk functions:
$$R(\delta_1,p) = L(d_1,p)\,P(X=0\mid p) + L(d_2,p)\,P(X=1\text{ or }2\mid p) = (8p-1)(1-p)^2 + 2\bigl[1-(1-p)^2\bigr] = 8p^3 - 19p^2 + 14p - 1$$
$$R(\delta_2,p) = L(d_1,p)\,P(X=0\text{ or }1\mid p) + L(d_2,p)\,P(X=2\mid p) = (8p-1)(1-p^2) + 2p^2 = -8p^3 + 3p^2 + 8p - 1$$

Comparing minimax and Bayes:
max_p R(δ1, p) = 2.289 and max_p R(δ2, p) = 3.329, so δ1 is the minimax rule.
[Figure: risks of δ1 and δ2 plotted against p.]
For the prior π(p) = 2(1−p), the Bayes risks are
$$B(\delta_1) = \int_0^1 R(\delta_1,p)\,\pi(p)\,dp = \frac{39}{30},\qquad B(\delta_2) = \int_0^1 R(\delta_2,p)\,\pi(p)\,dp = \frac{41}{30}$$
so δ1 is also the Bayes rule.

Bayes decision rule
• Is δ1 the best rule among all decision rules?
$$B(\delta) = \int_\Theta R(\theta,\delta)\,\pi(\theta)\,d\theta = \int_\Theta\left[\int_X L(\delta(x),\theta)\,f(x\mid\theta)\,dx\right]\pi(\theta)\,d\theta = \int_X\left[\int_\Theta L(\delta(x),\theta)\,f(x\mid\theta)\,\pi(\theta)\,d\theta\right]dx = \int_X A(\delta(x),x)\,dx$$
where
$$A(d,x) = \int_\Theta L(d,\theta)\,f(x\mid\theta)\,\pi(\theta)\,d\theta$$

Since X is discrete for this problem,
$$B(\delta) = \sum_{x\in X} A(\delta(x),x)$$
so B(δ) is minimized by choosing, for each x ∈ X, the decision d(x) that minimizes A(d, x). For the loss functions L(d1, p) = 8p − 1 and L(d2, p) = 2:
$$A(d_1,x) = \int_0^1 (8p-1)\binom{2}{x}p^x(1-p)^{2-x}\cdot 2(1-p)\,dp = 2\binom{2}{x}\left[\,8\int_0^1 p^{x+1}(1-p)^{3-x}\,dp - \int_0^1 p^x(1-p)^{3-x}\,dp\right] = \frac{(8x+3)(3-x)}{30}$$
Note:
$$\int_0^1 p^a(1-p)^b\,dp = \frac{a!\,b!}{(a+b+1)!}$$
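The posterior expected losses A(d1, x) and A(d2, x) can be verified numerically using the beta integral; a sketch with illustrative helper names:

```python
from math import comb, factorial

# Beta integral: ∫_0^1 p^a (1-p)^b dp = a! b! / (a+b+1)!
def beta_int(a, b):
    return factorial(a) * factorial(b) / factorial(a + b + 1)

# Posterior expected losses for the retailer example:
# prior pi(p) = 2(1-p), two inspected items, L(d1,p) = 8p - 1, L(d2,p) = 2.
def A_d1(x):
    c = 2 * comb(2, x)  # prior constant 2 times the binomial coefficient
    return c * (8 * beta_int(x + 1, 3 - x) - beta_int(x, 3 - x))

def A_d2(x):
    return 2 * 2 * comb(2, x) * beta_int(x, 3 - x)

# The Bayes rule picks the decision with smaller A(d, x) for each x:
rule = ['d1' if A_d1(x) < A_d2(x) else 'd2' for x in (0, 1, 2)]
# rule matches delta_1: purchase only when no defectives are observed
```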
Bayes decision rule (continued)
$$A(d_1,x) = \int_0^1 L(d_1,p)\,f(x\mid p)\,\pi(p)\,dp = \frac{(8x+3)(3-x)}{30},\qquad A(d_2,x) = \int_0^1 L(d_2,p)\,f(x\mid p)\,\pi(p)\,dp = \frac{3-x}{3}$$
• Now we can check:
x = 0: A(d1, x) = 9/30 < A(d2, x) = 1
x = 1: A(d1, x) = 22/30 > A(d2, x) = 2/3
x = 2: A(d1, x) = 19/30 > A(d2, x) = 1/3
• Therefore δ* = (d1, d2, d2) = δ1 is the best rule among all decision rules with respect to the given prior.

Loss functions revisited (hypothesis testing)
• Consider the hypothesis testing problem H0: θ = θ0 vs. H1: θ = θ1, where θ is the unknown parameter, with prior probabilities π0 = π(θ0) of H0 and π1 = π(θ1) of H1, where π0 + π1 = 1.
• What would be a good loss function? Take
$$L(d_0,\theta_0) = 0,\quad L(d_1,\theta_0) = c_0,\quad L(d_0,\theta_1) = c_1,\quad L(d_1,\theta_1) = 0$$
where d0: do not reject H0; d1: reject H0; c0: the cost of a type I error; c1: the cost of a type II error.
• We have
$$A(d_0;x_1,\ldots,x_n) = c_1\,\pi_1\,f(x_1,\ldots,x_n\mid\theta_1),\qquad A(d_1;x_1,\ldots,x_n) = c_0\,\pi_0\,f(x_1,\ldots,x_n\mid\theta_0)$$
The Bayes test rejects H0 (makes decision d1) if A(d1; x1, ..., xn) < A(d0; x1, ..., xn), which holds if
$$\frac{f(x_1,\ldots,x_n\mid\theta_1)}{f(x_1,\ldots,x_n\mid\theta_0)} > \frac{\pi_0}{\pi_1}\cdot\frac{c_0}{c_1}$$
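The Bayes test's rejection rule above reduces to a one-line comparison of posterior expected losses; the function name and arguments are illustrative:

```python
# Bayes test for two simple hypotheses: reject H0 when the expected loss of
# rejecting is smaller, i.e. when pi1 * c1 * f(x|theta1) > pi0 * c0 * f(x|theta0),
# equivalently when the likelihood ratio lik1/lik0 exceeds (pi0 * c0)/(pi1 * c1).
def bayes_reject_h0(lik0, lik1, pi0, pi1, c0, c1):
    return pi1 * c1 * lik1 > pi0 * c0 * lik0
```

With equal priors and equal error costs, this reduces to the plain likelihood-ratio rule with threshold 1; unequal priors or costs simply shift the threshold, as noted in Example 15.27.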