An Introduction to Bayesian Analysis: Theory and Methods

frequentist interval θ̂ ± z_{0.025} (θ̂(1 − θ̂)/n)^{1/2}, where θ̂ = Σᵢ Xᵢ/n.

12. Derive (5.32) from an appropriate probability matching equation.

For Bayesians, model selection and model criticism are extremely important inference problems. Sometimes these tend to become much more complicated than estimation problems. In this chapter, some of these issues will be discussed in detail. However, all models and hypotheses considered here are low-dimensional because high-dimensional models need a different approach. The Bayesian solutions will be compared and contrasted with the corresponding procedures of classical statistics whenever appropriate. Some of the discussion in this chapter is technical and will not be used in the rest of the book. Those sections that are very technical (or otherwise can be omitted at first reading) are indicated appropriately. These include Sections 6.3.4, 6.4, 6.5, and 6.7. In Sections 6.2 and 6.3, we compare frequentist and Bayesian approaches to hypothesis testing. We do the same in an asymptotic framework in Section 6.4. Recently developed methodologies such as the Bayesian P-value and some non-subjective Bayes factors are discussed in Sections 6.5 and 6.7.

6.1 Preliminaries

First, let us recall some notation from Chapter 2 and also introduce some specific notation for the discussion that follows. Suppose X having density f(x|θ) is observed, with θ being an unknown element of the parameter space Θ. Suppose that we are interested in comparing two models M0 and M1, which are given by

M0 : θ ∈ Θ0 versus M1 : θ ∈ Θ1.   (6.1)

For i = 0, 1, let gi(θ) be the prior density of θ, conditional on Mi being the true model. Then, to compare models M0 and M1 on the basis of a random sample x = (x1, ..., xn), one would use the Bayes factor

B01 = m0(x)/m1(x),   (6.2)

where

mi(x) = ∫_{Θi} f(x|θ) gi(θ) dθ, i = 0, 1.   (6.3)

We also use the notation BF01 for the Bayes factor. Recall from Chapter 2 that the Bayes factor is the ratio of the posterior odds of the hypotheses to the corresponding prior odds. Therefore, if the prior probabilities of the hypotheses, π0 = P(M0) = P(Θ0) and π1 = P(M1) = P(Θ1) = 1 − π0, are specified, then as in (2.17),

P(M0|x)/P(M1|x) = (π0/(1 − π0)) B01.

Thus, if the conditional prior densities g0 and g1 can be specified, one should simply use the Bayes factor B01 for model selection. If, further, π0 is also specified, the posterior odds of M0 to M1 can also be utilized. However, these computations may not always be easy to perform, even when the required prior ingredients are fully specified. A possible solution is the use of BIC as an approximation to a Bayes factor. We study this in Subsection 6.1.1. The situation can get much worse when the task of specifying these prior inputs itself becomes a difficult problem, as in the following example.
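As a concrete sketch of the computation just described, take an illustrative point-null normal model with an assumed N(0, 1) prior under M1 (both choices are ours, not the text's). The marginal m1(x) is obtained by simple quadrature and combined with π0 to give the posterior probability of M0:

```python
import math

def like(theta, xbar, n, sigma=1.0):
    # likelihood of the sufficient statistic: xbar ~ N(theta, sigma^2/n)
    s = sigma / math.sqrt(n)
    return math.exp(-0.5 * ((xbar - theta) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def marginal(xbar, n, prior_pdf, lo=-10.0, hi=10.0, m=20001):
    # m_i(x) = integral of f(x|theta) g_i(theta) dtheta, by the trapezoidal rule
    h = (hi - lo) / (m - 1)
    total = 0.0
    for j in range(m):
        t = lo + j * h
        w = 0.5 if j in (0, m - 1) else 1.0
        total += w * like(t, xbar, n) * prior_pdf(t)
    return total * h

# M0: theta = 0 (point null); M1: theta ~ N(0, 1), an illustrative prior
g1 = lambda t: math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)
xbar, n = 0.4, 25
B01 = like(0.0, xbar, n) / marginal(xbar, n, g1)
pi0 = 0.5
post_H0 = 1.0 / (1.0 + (1.0 - pi0) / pi0 / B01)
```

In this conjugate case m1(x) is actually available in closed form, so the quadrature is purely illustrative; for a non-conjugate g1 the same loop (or Monte Carlo) is the practical route.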

Example 6.1. Consider the problem that is usually called nonparametric regression. Independent responses Yi are observed along with covariates xi, i = 1, ..., n. The model of interest is

Yi = g(xi) + εi, i = 1, ..., n,

where εi are i.i.d. N(0, σ²) errors with unknown error variance σ². The function g is called the regression function. In linear regression, g is a priori assumed to be linear in a finite set of regression coefficients. In general, g can also be assumed to be fully unknown. Now, if model selection involves choosing g from two different fully nonparametric classes of regression functions, this becomes a very difficult problem. Computation of the Bayes factor or posterior odds is then a formidable task. Various simplifications, including reducing g to be semi-parametric, have been studied. In such cases, some of these problems can be handled. Consider a different model checking problem now, that of testing for normality. This is a very common problem encountered in frequentist inference, because much of the inferential methodology is based on the normality assumption. Simple or multiple linear regression, ANOVA, and many other techniques routinely use this assumption. In its simplest form, the problem can

be stated as checking whether a given random sample X1, X2, ..., Xn arose from a population having the normal distribution. In the setup given above in (6.1), we may write it as

M0 : X has a normal distribution versus
M1 : X does not have the normal distribution.

However, this looks quite different from (6.1) above, because M1 does not constitute a parametric alternative. Hence it is not clear how to use Bayes factors or posterior odds here for model checking. The difficulty with this model checking problem is clear: one is only interested in M0 and not in M1. This problem is addressed in Section 6.3 of Gelman et al. (1995). See also Section 9.9. We use the posterior predictive distribution of replicated future data to assess whether the predictions show systematic differences. In practice, replicated data will not be available, so cross-validation of some form has to be used, as discussed in Section 9.9. Gelman et al. (1995) have not used cross-validation, and their P-values have come in for some criticism (see Section 6.5). The object of model checking is not to decide whether the model is true or false but to check whether the model provides a plausible approximation to the data. It is clear that we have to use posterior predictive values and Bayesian P-values of some sort, but consensus on details does not seem to have emerged yet. It remains an important problem.
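As a rough illustration of the posterior predictive idea, the following sketch checks normality using a plug-in approximation to the posterior predictive (rather than a full posterior) and sample skewness as the discrepancy measure; all of these choices are ours, not the text's:

```python
import random
import statistics

def skewness(xs):
    # sample skewness: third central moment over (population sd)^3
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def posterior_predictive_pvalue(data, n_rep=2000, seed=1):
    # Plug-in approximation: simulate replicated samples from N(mean, sd)
    # fitted to the observed data, and ask how often the replicated
    # skewness is at least as extreme as the observed one.
    rng = random.Random(seed)
    m, s, n = statistics.fmean(data), statistics.stdev(data), len(data)
    t_obs = abs(skewness(data))
    count = 0
    for _ in range(n_rep):
        rep = [rng.gauss(m, s) for _ in range(n)]
        if abs(skewness(rep)) >= t_obs:
            count += 1
    return count / n_rep

rng = random.Random(0)
normal_data = [rng.gauss(0, 1) for _ in range(100)]
skewed_data = [rng.expovariate(1.0) for _ in range(100)]
p_norm = posterior_predictive_pvalue(normal_data)
p_skew = posterior_predictive_pvalue(skewed_data)
```

A small predictive P-value (as for the exponential sample) flags a systematic departure from the assumed model; this is only a caricature of the full posterior predictive check discussed above.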

6.1.1 BIC Revisited

Under appropriate regularity conditions on f, g0, and g1, the Bayes factor given in (6.2) can be approximated using the Laplace approximation or the saddle point approximation. Let us change notation and express (6.3) as follows:

mi(x) = ∫ f(x|θi) gi(θi) dθi, i = 0, 1,   (6.7)

where θi is the pi-dimensional vector of parameters under Mi, assumed to be independent of n (the dimension of the observation vector x). Let θ̂i be the posterior mode of θi, i = 0, 1. Assume θ̂i is an interior point of Θi. Then, expanding the logarithm of the integrand in (6.7) around θ̂i using a second-order Taylor series approximation, we obtain

log(f(x|θi) gi(θi)) ≈ log(f(x|θ̂i) gi(θ̂i)) − (1/2)(θi − θ̂i)′ H_{θ̂i} (θi − θ̂i),

where H_{θ̂i} is the corresponding negative Hessian. Applying this approximation to (6.7) yields

mi(x) ≈ f(x|θ̂i) gi(θ̂i) (2π)^{pi/2} |H_{θ̂i}|^{−1/2}.
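In the normal-normal conjugate case the log-integrand is exactly quadratic, so the Laplace formula above reproduces the marginal exactly; this makes a convenient sanity check for an implementation (the numbers below are illustrative, not from the text):

```python
import math

SQRT2PI = math.sqrt(2 * math.pi)

def npdf(x, mu, var):
    # univariate normal density
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Model: xbar | theta ~ N(theta, s2); prior: theta ~ N(0, tau2)
xbar, s2, tau2 = 1.3, 0.04, 1.0

# Exact marginal: xbar ~ N(0, s2 + tau2)
m_exact = npdf(xbar, 0.0, s2 + tau2)

# Laplace approximation around the posterior mode theta_hat.
# For this model the "approximation" is exact, since the integrand is Gaussian.
theta_hat = xbar * tau2 / (s2 + tau2)   # posterior mode
H = 1 / s2 + 1 / tau2                   # negative Hessian of the log-integrand
m_laplace = npdf(xbar, theta_hat, s2) * npdf(theta_hat, 0.0, tau2) * SQRT2PI * H ** -0.5
```

For non-Gaussian integrands the two quantities differ by an O(1/n) relative error, which is what the BIC derivation below exploits.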

2 log B01 is a commonly used evidential measure to compare the support provided by the data x for M0 relative to M1. Under the above approximation we have

2 log B01 ≈ 2 log( f(x|θ̂0) g0(θ̂0) / (f(x|θ̂1) g1(θ̂1)) ) + (p0 − p1) log(2π) + log( |H_{θ̂1}| / |H_{θ̂0}| ).   (6.8)

A variation of the approximation is also commonly used, where instead of the posterior mode θ̂i, the maximum likelihood estimate θ̂i is employed. Then, instead of (6.8), one obtains

2 log B01 ≈ 2 log( f(x|θ̂0) g0(θ̂0) / (f(x|θ̂1) g1(θ̂1)) ) + (p0 − p1) log(2π) + log( |H_{θ̂1}| / |H_{θ̂0}| ),   (6.9)

with θ̂i now the maximum likelihood estimate.

Here H_{θ̂} is the observed Fisher information matrix evaluated at the maximum likelihood estimate. If the observations are i.i.d., we have H_{θ̂} = n H_{1,θ̂}, where H_{1,θ̂} is the observed Fisher information matrix obtained from a single observation. In this case,

2 log B01 ≈ 2 log( f(x|θ̂0) g0(θ̂0) / (f(x|θ̂1) g1(θ̂1)) ) − (p0 − p1) log(n/(2π)) + log( |H_{1,θ̂0}^{−1}| / |H_{1,θ̂1}^{−1}| ).

Retaining only the terms that grow with n,

2 log B01 ≈ 2 log( f(x|θ̂0) / f(x|θ̂1) ) − (p0 − p1) log n.

This is the approximate Bayes factor based on the Bayesian information criterion (BIC) due to Schwarz (1978). The term −(p0 − p1) log n = (p1 − p0) log n can be considered a penalty for using the more complex model.

This penalty may be contrasted with that of the Akaike information criterion,

AIC = 2 log f(x|θ̂) − 2p

for a model f(x|θ) with p parameters. The penalty for using a complex model is not as drastic as that in BIC. A Bayesian interpretation of AIC for high-dimensional prediction problems is presented in Chapter 9. Problem 16 of Chapter 9 invites you to explore whether AIC is suitable for low-dimensional testing problems.
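A minimal sketch comparing the BIC approximation of 2 log B01 with the exact value, in a point-null normal problem with an assumed N(0, τ²) prior under M1 (our illustrative setup, not the text's):

```python
import math

def npdf(x, mu, var):
    # univariate normal density
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def two_log_B01_exact(xbar, n, sigma2=1.0, tau2=1.0):
    # M0: theta = 0; M1: theta ~ N(0, tau2).
    # Both marginals of xbar are normal, so B01 is exact.
    m0 = npdf(xbar, 0.0, sigma2 / n)
    m1 = npdf(xbar, 0.0, sigma2 / n + tau2)
    return 2 * math.log(m0 / m1)

def two_log_B01_bic(xbar, n, sigma2=1.0):
    # BIC approximation: 2 log(f(x|theta0)/f(x|theta_hat)) - (p0 - p1) log n.
    # Here theta_hat = xbar, p0 = 0, p1 = 1, so the penalty term is + log n.
    t2 = n * xbar ** 2 / sigma2          # likelihood-ratio statistic
    return -t2 + math.log(n)

e = two_log_B01_exact(0.25, 100)
b = two_log_B01_bic(0.25, 100)
```

Note that the BIC version drops the prior entirely; the error of the approximation stays bounded as n grows, which is why BIC is used for model selection rather than for fine evidential calibration.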

6.2 P-value and Posterior Probability of H0 as Measures of Evidence Against the Null

One particular tool from classical statistics that is very widely used in applied sciences for model checking or hypothesis testing is the P-value. It also happens to be one of the concepts that is highly misunderstood and misused. The basic idea behind R.A. Fisher's (see Fisher (1973)) original (1925) definition of the P-value given below did have a great deal of appeal: it is the probability, under a (simple) null hypothesis, of obtaining a value of a test statistic that is at least as extreme as that observed in the sample data. Suppose that it is desired to test

H0 : θ = θ0 versus H1 : θ ≠ θ0,   (6.13)

and that a classical significance test is available, based on a test statistic T(X), large values of which are deemed to provide evidence against the null hypothesis. If data X = x is observed, with corresponding t = T(x), the P-value then is

α = P_{θ0}(T(X) ≥ T(x)).

Example 6.2. Suppose we observe X1, ..., Xn i.i.d. from N(θ, σ²), where σ² is known. Then X̄ is sufficient for θ and it has the N(θ, σ²/n) distribution. Noting that T = T(X) = |√n(X̄ − θ0)/σ| is a natural test statistic to test (6.13), one obtains the usual P-value as α = 2[1 − Φ(t)], where Φ denotes the standard normal c.d.f.

justification. From a Bayesian point of view, various objections have been raised by Edwards et al. (1963), Berger and Sellke (1987), and Berger and Delampady (1987) against the use of P-values as measures of evidence against H0. A recent review is Ghosh et al. (2005). To a Bayesian, the posterior probability of H0 summarizes the evidence against H0. In many of the common cases of testing, the P-value is smaller than the posterior probability by an order of magnitude. The reason for this is that the P-value ignores the likelihood of the data under the alternative and takes into account not only the observed deviation of the data from the null hypothesis, as measured by the test statistic, but also more extreme deviations. In view of these facts, one may wish to see if P-values can be calibrated in terms of bounds for posterior probabilities over natural classes of priors. It appears that calibration takes the form of a search for an alternative measure of evidence, based on the posterior, that may be acceptable to a non-Bayesian. In this connection, note that there is an interesting discussion of the admissibility of the P-value as a measure of evidence in Hwang et al. (1992).

6.3 Bounds on Bayes Factors and Posterior Probabilities

6.3.1 Introduction

We begin with an example where P-values and the posterior probabilities are very different.

Example 6.3. We observe X̄ ~ N(θ, σ²/n), with known σ². Upon using T = |√n(X̄ − θ0)/σ| as the test statistic to test (6.13), recall that the P-value comes out to be α = 2[1 − Φ(t)], where t is the observed value of T. Under H1, suppose the prior on θ is N(μ, τ²). Now, if we choose μ = θ0 and τ = σ, the Bayes factor B01 and the posterior probability P(H0|x) can be computed in closed form.

For various values of t and n, the different measures of evidence, α = P-value, B = Bayes factor, and P = P(H0|x), are displayed in Table 6.1 as shown in Berger and Delampady (1987). It may be noted that the posterior probability of H0 varies between 4 and 50 times the corresponding P-value, which is an indication of how different these two measures of evidence can be.

Table 6.1. Normal Example: Measures of Evidence
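Entries of this kind are easy to reproduce. With the choice μ = θ0, τ = σ of Example 6.3, a standard normal-normal computation gives B01 = √(n + 1) exp(−t²n/(2(n + 1))); treat the sketch below as illustrative, not as the book's exact table:

```python
import math

def pvalue(t):
    # two-sided P-value: 2[1 - Phi(t)]
    return 2 * (1 - 0.5 * (1 + math.erf(t / math.sqrt(2))))

def bayes_factor(t, n):
    # B01 for X_i ~ N(theta, sigma^2), H0: theta = theta0,
    # with prior theta ~ N(theta0, sigma^2) under H1
    return math.sqrt(1 + n) * math.exp(-t * t * n / (2 * (n + 1)))

def post_prob_H0(t, n, pi0=0.5):
    B = bayes_factor(t, n)
    return 1 / (1 + (1 - pi0) / pi0 / B)

rows = [(n, pvalue(1.96), bayes_factor(1.96, n), post_prob_H0(1.96, n))
        for n in (10, 50, 100)]
```

For t = 1.96 the P-value is about 0.05 for every n, yet P(H0|x) exceeds 0.5 already at moderate n, previewing the discrepancy discussed below.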

6.3.2 Choice of Classes of Priors

Clearly, there are irreconcilable differences between the classical P-value and the corresponding Bayesian measures of evidence in the above example. However, one may argue that the differences are perhaps due to the choice of π0 or g1, which cannot claim to be really 'objective.' The choice of π0 = 1/2 may not be crucial, because the Bayes factor B, which does not need it, seems to be providing the same conclusion; but the choice of g1 does have a substantial effect. To counter this argument, let us consider lower bounds on B and P over wide classes of prior densities. What is surprising is that even these lower bounds, which are based on priors 'least favorable' to H0, are typically an order of magnitude larger than the corresponding P-values for precise null hypotheses. The other motivation for looking at bounds over classes of priors is that they correspond with robust Bayesian answers, which are more compelling when an objective choice of a single prior does not exist. Thus, in the case of precise null hypotheses, if G is the class of all plausible conditional prior densities g1 under H1, we are then led to the consideration of the following bounds.

B(G, x) = inf_{g∈G} B01 = f(x|θ0) / sup_{g∈G} m_g(x),

P(H0|G, x) = inf_{g∈G} P(H0|x) = [1 + ((1 − π0)/π0) (1/B(G, x))]^{−1}.

This brings us back to the question of the choice of the class G, as in Chapter 3, where the robust Bayesian approach has been discussed. As explained there, robustness considerations force us to consider classes that are neither too large nor too small. Choosing the class G_A = {all densities} certainly is very extreme, because it allows densities that are severely biased towards H1. Quite often, the class G_NC = {all natural conjugate densities with mean θ0} is an interesting class to consider. However, this turns out to be inadequate for robustness considerations. The following class

G_US = {all unimodal densities symmetric about θ0},

which strikes a balance between these two extremes, seems to be a good choice. Because we are comparing various measures of evidence, it is informative to examine the lower bounds for each of these classes. In particular, we can gather the magnitudes of the differences between these measures across the classes. To simplify proofs of some of the results given below, we restate a result indicated in Section 3.8.1.

Lemma 6.4. Suppose Γ is a set of prior probability measures on R^p given by Γ = {ν_t : t ∈ T}, T ⊂ R^d, and let C be the convex hull of Γ. Then

sup_{π∈C} ∫ f(x|θ) dπ(θ) = sup_{t∈T} ∫ f(x|θ) dν_t(θ).   (6.17)

Proof. Because C ⊇ Γ, LHS ≥ RHS in (6.17). However, as any π ∈ C is a mixture of elements of Γ,

∫ f(x|θ) dπ(θ) ≤ sup_{t∈T} ∫ f(x|θ) dν_t(θ),

yielding the other inequality also. □

The following results are from Berger and Sellke (1987) and Edwards et al. (1963).

Theorem 6.5. Let θ̂(x) be the maximum likelihood estimate of θ for the observed value of x. Then

B(G_A, x) = f(x|θ0)/f(x|θ̂(x)),
P(H0|G_A, x) = [1 + ((1 − π0)/π0) f(x|θ̂(x))/f(x|θ0)]^{−1}.

In view of Lemma 6.4, the proof of this result is quite elementary, once it is noted that the extreme points of G_A are point masses.

Theorem 6.6. Let U_S be the class of all uniform distributions symmetric about θ0. Then

B(G_US, x) = B(U_S, x), P(H0|G_US, x) = P(H0|U_S, x).

Proof. Simply note that any unimodal symmetric distribution is a mixture of symmetric uniforms, and apply Lemma 6.4 again. □

Because B(U_S, x) = f(x|θ0) / sup_{g∈U_S} m_g(x), computation of

sup_{g∈U_S} m_g(x) = sup_{r>0} (2r)^{−1} ∫_{θ0−r}^{θ0+r} f(x|θ) dθ

is required to employ Theorem 6.6. Also, it may be noted that, as far as robustness is concerned, using the class G_US of all symmetric unimodal priors is the same as using the class U_S of all symmetric uniform priors. It is perhaps reasonable to assume that many of these uniform priors are somewhat biased against H0, and hence we should consider unimodal symmetric prior distributions that are smoother. One possibility is scale mixtures of normal distributions having mean θ0. This class is substantially larger than just the class of normals centered at θ0; it includes Cauchy, all Student's t, and so on. To obtain the lower bounds, however, it is enough to consider

G_Nor = {all normal distributions with mean θ0}, in view of Lemma 6.4.
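Theorem 6.6 reduces the computation of B(G_US, x) to a one-dimensional maximization over the radius of a symmetric uniform. For the normal example, the inner integral is a difference of normal c.d.f.s, so a grid search suffices (the discretization choices below are ours):

```python
import math

def Phi(z):
    # standard normal cdf via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def lower_bound_B_US(t, n, sigma=1.0, r_grid=4000, r_max=10.0):
    # B(G_US, x) = f(xbar|theta0) / sup_{r>0} (2r)^{-1} Int f(xbar|theta) dtheta,
    # the integral running over (theta0 - r, theta0 + r),
    # for xbar ~ N(theta, sigma^2/n) with theta0 placed at 0.
    s = sigma / math.sqrt(n)
    xbar = t * s
    f0 = math.exp(-0.5 * t * t) / (s * math.sqrt(2 * math.pi))
    best = f0   # the limit r -> 0 recovers the point mass at theta0
    for i in range(1, r_grid + 1):
        r = r_max * i / r_grid
        m = (Phi((xbar + r) / s) - Phi((xbar - r) / s)) / (2 * r)
        best = max(best, m)
    return f0 / best

B_196 = lower_bound_B_US(1.96, 1)
```

For t ≤ 1 the degenerate prior wins and the bound is 1, in agreement with Example 6.7(iii) below; for t = 1.96 the grid search reproduces the familiar value near 0.41.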

Example 6.7. Let us continue with Example 6.3. We have the following results from Berger and Sellke (1987) and Edwards et al. (1963).
(i) B(G_A, x) = exp(−t²/2), because the MLE of θ is x̄.
(ii) P(H0|G_A, x) = [1 + ((1 − π0)/π0) exp(t²/2)]^{−1}.
(iii) If t ≤ 1, B(G_US, x) = 1 and P(H0|G_US, x) = π0. This is because in this case, the unimodal symmetric distribution that maximizes m_g(x) is the degenerate distribution that puts all its mass at θ0.
(iv) If t > 1, the g ∈ G_US that maximizes m_g(x) is non-degenerate, and from Theorem 6.6 and Example 3.4 the bounds are obtained by maximizing m_g(x) over the symmetric uniform densities.
(v) If t ≤ 1, B(G_Nor, x) = 1 and P(H0|G_Nor, x) = π0. If t > 1,

B(G_Nor, x) = t exp(−(t² − 1)/2),
P(H0|G_Nor, x) = [1 + ((1 − π0)/π0) t^{−1} exp((t² − 1)/2)]^{−1}.

For various values of t, the different measures of evidence, α = P-value, B = lower bound on the Bayes factor, and P = lower bound on P(H0|x), are displayed in Table 6.2; π0 has been chosen to be 0.5. What we note is that the differences between P-values and the corresponding Bayesian measures of evidence remain irreconcilable even when the lower bounds on such measures are considered. In other words, even the least possible Bayes factor and posterior probability of H0 are substantially larger than the corresponding P-value. This is so even for the choice G_A, which is rather astonishing (see Edwards et al. (1963)).

6 Hypothesis Testing and Model Selection

Table 6.2. Normal Example: Lower Bounds on Measures of Evidence

6.3.3 Multiparameter Problems

It is not the case that the discrepancies between P-values and lower bounds on the Bayes factor or posterior probability of H0 are present only for tests of precise null hypotheses in single-parameter problems. The phenomenon is much more prevalent. We shall present below some simple multiparameter problems where similar discrepancies have been discovered. The following result on testing a p-variate normal mean vector is from Delampady (1986).

Example 6.8. Suppose X ~ Np(θ, I), where X = (X1, X2, ..., Xp)′ and θ = (θ1, θ2, ..., θp)′. It is desired to test

H0 : θ = θ0 versus H1 : θ ≠ θ0,

where θ0 = (θ0₁, θ0₂, ..., θ0_p)′ is a specified vector. The classical test statistic is

T(X) = (X − θ0)′(X − θ0),

which has a χ²_p distribution under H0. Thus the P-value of the data x is

α = P(χ²_p ≥ T(x)).
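This P-value is straightforward to compute; for even p the chi-squared survival function even has a closed form, which the small self-contained sketch below uses (our implementation choices):

```python
import math

def chi2_sf_even(t, p):
    # P(chi2_p >= t) for even degrees of freedom p:
    # closed form sum_{k=0}^{p/2 - 1} (t/2)^k e^{-t/2} / k!
    assert p % 2 == 0
    term, total = math.exp(-t / 2), 0.0
    for k in range(p // 2):
        total += term
        term *= (t / 2) / (k + 1)
    return total

def chi2_pvalue(x, theta0):
    # T(X) = (X - theta0)'(X - theta0) ~ chi2_p under H0 when X ~ N_p(theta, I)
    T = sum((xi - ti) ** 2 for xi, ti in zip(x, theta0))
    return chi2_sf_even(T, len(x))

alpha = chi2_pvalue([2.0, 1.4], [0.0, 0.0])
```

For odd p one would fall back on the regularized incomplete gamma function; the closed form above is only a convenience for even dimensions.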

Consider the class G_USP of unimodal spherically symmetric (about θ0) prior distributions for θ, the natural generalization of G_US. This will consist of densities g(θ) of the form g(θ) = h((θ − θ0)′(θ − θ0)), where h is non-increasing. Noting that any unimodal spherically symmetric distribution is a mixture of uniforms on symmetric spheres, and applying Lemma 6.4, we obtain

sup_{g∈G_USP} m_g(x) = sup_{k>0} (1/V(k)) ∫_{{θ: ||θ−θ0|| ≤ k}} f(x|θ) dθ,

where V(k) is the volume of a sphere of radius k, and f(x|θ) is the Np(θ, I) density. Therefore, we have that

B(G_USP, x) = f(x|θ0) / sup_{g∈G_USP} m_g(x).

Using this result, numerical values were computed for different dimensions p and different P-values α. In Table 6.3 we present these values, where

Table 6.3. Multivariate Normal Example: Lower Bounds on Measures of Evidence

p: 1, 2, 3, 4, 5, 10, 15, 20, 30, 40, ∞
lower bound: .018, .014, .012, .011, .010, .009, .009, .009, .009, .009, .009

B denotes B(G_USP, x) and P denotes P(H0|G_USP, x) for π0 = 0.5. As can be readily seen, the lower bounds remain substantially larger than the corresponding P-values in all dimensions. Note that spherical symmetry is not the only generalization of symmetry from one dimension to higher dimensions. Very different answers can be obtained if, for example, elliptical symmetry is used instead. Suppose we consider densities of the form g(θ) = |Q|^{1/2} h((θ − θ0)′Q(θ − θ0)), where Q is an arbitrary positive definite matrix and h is non-increasing. Then the following result, which is informally stated in Delampady and Berger (1990), obtains. For the sake of simplicity, let us take θ0 = 0.

Theorem 6.9. Let f(x|θ) be a multivariate, multiparameter density. Consider the class of elliptically symmetric unimodal prior densities

G_UES = {g : g(θ) = |Q|^{1/2} h(θ′Qθ), h non-increasing, Q positive definite}.   (6.22)

Then

sup_{g∈G_UES} m_g(x) = sup_{Q>0} sup_{k>0} (|Q|^{1/2}/V(k)) ∫_{{θ: θ′Qθ ≤ k²}} f(x|θ) dθ,   (6.23)

where V(k) is the volume of a sphere of radius k, and Q > 0 denotes that Q is positive definite.

Proof. Note that

sup_{g∈G_UES} m_g(x) = sup_{Q>0} sup_{h non-increasing} ∫ f(x|θ) |Q|^{1/2} h(θ′Qθ) dθ,   (6.24)

because the maximization of the inside integral over non-increasing h in (6.24) is the same as maximization of that integral over the class of unimodal spherically symmetric densities, and hence Lemma 6.4 applies. □

Consider the above result in the context of Example 6.8. The lower bounds on the Bayes factor as well as the posterior probability of the null hypothesis will be substantially lower if we use the class G_UES rather than G_USP. This is immediate from (6.23), because the lower bounds over G_USP correspond with the maximum in (6.23) with Q = I. The result also questions the suitability of G_UES for these lower bounds, in view of the fact that the lower bounds will correspond with prior densities that are extremely biased towards H1. Many other esoteric classes of prior densities have also been considered for deriving lower bounds. In particular, generalizations of symmetry from the single-parameter case to the multiparameter case have been examined. DasGupta and Delampady (1990) consider several subclasses of the symmetric star-unimodal densities. Some of these are mixtures of uniform distributions on ℓp balls (for p = 1, 2, ∞), the class of distributions with components that are independent symmetric unimodal distributions, and a certain subclass of one-symmetric distributions. Note that mixtures of uniform distributions on ℓ2 balls are simply unimodal spherically symmetric distributions, whereas mixtures of uniform distributions on ℓ1 balls contain distributions whose components are i.i.d. exponential distributions. Uniform distributions on hypercubes form a subclass of mixtures of uniforms on ℓ∞ balls. Also considered there is the larger subclass consisting of distributions whose components are identical symmetric unimodal distributions. Another class of one-symmetric distributions considered there is of interest because it contains distributions whose components are i.i.d. Cauchy.
Even though studies such as these are important from robustness considerations, we feel that they do not necessarily add to our understanding of possible interpretations of P-values from a robust Bayesian point of view. For material related to the classes mentioned above, interested readers will find that Dharmadhikari and Joag-Dev (1988) is a good source on multivariate unimodality, and Fang et al. (1990) a good reference on multivariate symmetry. We have noted earlier that computation of the Bayes factor and posterior probability is difficult when parametric alternatives are not available. Many frequentist statisticians claim that P-values are valuable when there are no alternatives explicitly specified, as is common with tests of fit. We consider this issue here for a particularly common test of fit, the chi-squared test of goodness of fit. It will be observed that alternatives do exist implicitly, and hence Bayes factors and posterior probabilities can indeed be computed. The

following results from Delampady and Berger (1990) once again point out the discrepancies between P-values and Bayesian measures of evidence.

Example 6.10. Let n = (n1, n2, ..., nk) be a sample of fixed size N = Σᵢ₌₁ᵏ nᵢ from a k-cell multinomial distribution with unknown cell probabilities p = (p1, p2, ..., pk) and density (mass) function

f(n|p) = (N! / (n1! ··· nk!)) ∏ᵢ₌₁ᵏ pᵢ^{nᵢ}.

It is desired to test H0 : p = p0 versus H1 : p ≠ p0, where p0 = (p0₁, p0₂, ..., p0_k) is a specified interior point of the k-dimensional simplex. Instead of focusing on the exact multinomial setup, the most popular approach is to use the chi-squared approximation. Here the test statistic of interest is

χ² = Σᵢ₌₁ᵏ (nᵢ − N p0ᵢ)² / (N p0ᵢ),

which has the asymptotic distribution (as N → ∞) of χ²_{k−1} under H0. To compare P-values so obtained with the corresponding robust Bayesian measures of evidence, the following are two natural classes of prior distributions to consider.
(i) The conjugate class G_C of Dirichlet priors with density

g(p) ∝ ∏ᵢ₌₁ᵏ pᵢ^{αᵢ−1}, αᵢ > 0.

(ii) Consider the following transform of (p1, p2, ..., p_{k−1})′:

u = u(p) = ( (p1 − p0₁)/√p1, (p2 − p0₂)/√p2, ..., (p_{k−1} − p0_{k−1})/√p_{k−1} )′
      + ( (p0_k − p_k)/(√p_k + √p0_k) ) ( √p1, √p2, ..., √p_{k−1} )′.

The justification (see Delampady and Berger (1990)) for using such a transform is that its range is R^{k−1}, unlike that of p, and its likelihood function is more symmetric and closer to a multivariate normal. Now let G_TUS denote the class of unimodal spherically symmetric prior densities for u, and consider the class of prior densities g obtained by transforming back to the original parameter p.

Delampady and Berger (1990) show that as N → ∞, the lower bounds on Bayes factors over G_C and G_TUS converge to those corresponding with the multivariate normal testing problem (chi-squared test) in Example 6.8, thus proving that the irreconcilability of P-values and Bayesian measures of evidence is present in goodness of fit problems as well. Additional discussion of the multinomial testing problem with mixtures of conjugate priors can be found in Good (1965, 1967, 1975). Edwards et al. (1963) discuss the possibility of finding lower bounds on Bayes factors over the conjugate class of priors for the binomial problem. Extensive discussion of the binomial problem and further references can be found in Berger and Delampady (1987).
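For reference, the chi-squared goodness-of-fit statistic of Example 6.10 is straightforward to compute (the counts below are hypothetical):

```python
def chisq_gof(counts, p0):
    # chi-squared goodness-of-fit statistic: sum_i (n_i - N p0_i)^2 / (N p0_i)
    N = sum(counts)
    return sum((n - N * q) ** 2 / (N * q) for n, q in zip(counts, p0))

counts = [18, 25, 32, 25]         # hypothetical multinomial sample, N = 100
p0 = [0.25, 0.25, 0.25, 0.25]     # H0: uniform cell probabilities
T = chisq_gof(counts, p0)
# T is referred to a chi2_{k-1} = chi2_3 distribution to get the P-value
```

The point of the discussion above is that this familiar frequentist statistic has an implicit alternative, so the same data admit a Bayes factor computation as well.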

6.3.4 Invariant Tests¹

A natural generalization of the symmetry assumption (on the prior distribution) is invariance under a group of transformations. Such a generalization and many examples can be found in Delampady (1989a). A couple of those examples will be discussed below to show the flavor of the results. The general results, which utilize sophisticated mathematical arguments, will be skipped; interested readers are instead referred to the source indicated above. For a good discussion of invariance of statistical decision rules, see Berger (1985a). Recall that the random observable X takes values in a space X and has density (mass) function f(x|θ). The unknown parameter is θ ∈ Θ ⊆ R^n, for some positive integer n. It is desired to test H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1. We assume the following in addition.
(i) There is a group G (of transformations) acting on X that induces a group Ḡ (of transformations acting) on Θ. These two groups are isomorphic (see Section 5.1.7); elements of G will be denoted by g, those of Ḡ by ḡ.
(ii) f(gx|ḡθ) = f(x|θ)k(g) for a suitable continuous map k (from G to (0, ∞)).
(iii) ḡΘ0 = Θ0, ḡΘ1 = Θ1, ḡΘ = Θ.
In this context, the following concept of a maximal invariant is needed.

Definition. When a group G of transformations acts on a space X, a function T(x) on X is said to be invariant (with respect to G) if T(g(x)) = T(x) for all x ∈ X and g ∈ G. A function T(x) is maximal invariant (with respect to G) if it is invariant and further T(x1) = T(x2) implies x1 = g(x2) for some g ∈ G.

¹ Section 6.3.4 may be omitted at first reading.

This means that G divides X into orbits on which invariant functions are constant. A maximal invariant assigns different values to different orbits. Now from (i), we have that the actions of G and Ḡ induce maximal invariants t(X) on X and η(θ) on Θ, respectively.

Remark 6.11. The family of densities f(x|θ) is said to be invariant under G if (ii) is satisfied. The testing problem H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1 is said to be invariant under G if in addition (iii) is also satisfied.

Example 6.12. Consider Example 6.8 again and suppose X ~ Np(θ, I). It is desired to test H0 : θ = 0 versus H1 : θ ≠ 0. This testing problem is invariant under the group G_O of all orthogonal transformations; i.e., if H is an orthogonal matrix of order p, then g_H X = HX ~ Np(Hθ, I), so that ḡ_H θ = Hθ. Further,

f(x|θ) = (2π)^{−p/2} exp(−(x − θ)′(x − θ)/2) = f(Hx|Hθ),

so that (ii) is satisfied. Also, ḡ_H 0 = 0, and (iii) too is satisfied.

Example 6.13. Let X1, X2, ..., Xn be a random sample from N(θ, σ²) with both θ and σ unknown. The problem is to test the hypothesis H0 : θ = 0 against H1 : θ ≠ 0. A sufficient statistic for (θ, σ) is X = (X̄, S), with X̄ = Σᵢ₌₁ⁿ Xᵢ/n and S = (Σᵢ₌₁ⁿ (Xᵢ − X̄)²)^{1/2}. The joint density of (X̄, S) is

f(x|θ, σ) = K σ^{−n} s^{n−2} exp(−(n(x̄ − θ)² + s²)/(2σ²)),

where K is a constant. Also,

Θ = {(θ, σ) : θ ∈ R, σ > 0}.

The problem is invariant under the group G = {g_c : c > 0}, where the action of g_c is given by g_c(x) = g_c(x̄, s) = (cx̄, cs). Note that f(g_c x|cθ, cσ) = c^{−2} f(x|θ, σ). A number of technical conditions, in addition to the assumptions (i)-(iii), yield a very useful representation for the density of the maximal invariant statistic t(X). Note that this density, q(t(x)|η(θ)), depends on the parameter θ only through the maximal invariant in the parameter space, η(θ).

The technique involved in the derivation of these results uses an averaging over a relevant group. The general method of this kind of averaging is due to Stein (1956), but because there are a number of mathematical problems to overcome, various different approaches were discovered, as can be seen in Wijsman (1967, 1985, 1986), Andersson (1982), Andersson et al. (1983), and Farrell (1985). For further details, see Eaton (1989), Kariya and Sinha (1989), and Wijsman (1990). The specific conditions and proofs of these results can be found in the above references. In particular, the groups considered here are amenable groups, as presented in detail in Bondar and Milnes (1981). See also Delampady (1989a). The orthogonal group and the group of location-scale transformations are amenable; the multiplicative group of non-singular p × p linear transformations is not. Let us return to the issue of comparison of P-values and lower bounds on Bayes factors and posterior probabilities (of hypotheses) in this setup. We note that it is necessary to reduce the problem by using invariance for any meaningful comparison, because the classical test statistic, and hence the computation of the P-value, is already based on this reduction. Therefore, the natural class of priors to be used for this comparison is the class G_I of Ḡ-invariant priors; i.e., those priors π that satisfy
(iv) π(A) = π(ḡA) for all ḡ ∈ Ḡ.

Theorem 6.14. If Ḡ is a group of transformations satisfying certain regularity conditions (see Delampady (1989a)), then the lower bound B(G_I, x) depends on the data only through the maximal invariant t(x), and is expressed in terms of its density q(t(x)|η), η ∈ Θ/Ḡ, where Θ/Ḡ denotes the space of maximal invariants on the parameter space.

Corollary 6.15. If Θ0/Ḡ = {0}, then under the same conditions as in Theorem 6.14,

B(G_I, x) = q(t(x)|0) / sup_{η∈Θ/Ḡ} q(t(x)|η).

Example 6.16. (Example 6.12, continued.) Consider the class of all priors that are invariant under orthogonal transformations, and note that this class is simply the class of all spherically symmetric distributions. Now, application of Corollary 6.15 yields

B(G_I, x) = q(t(x)|0) / q(t(x)|η̂),

where q(t|η) is the density of a noncentral χ² random variable with p degrees of freedom and non-centrality parameter η, and η̂ is the maximum likelihood estimate of η from data t(x). For selected values of p, the lower bounds B and P (for π0 = 0.5) are tabulated against their P-values in Table 6.4.
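This bound can be evaluated by writing the noncentral chi-squared density as a Poisson mixture of central chi-squared densities and maximizing over the noncentrality by grid search (the grid and truncation choices below are ours):

```python
import math

def chi2_pdf(t, p):
    # central chi-squared density with p degrees of freedom
    return t ** (p / 2 - 1) * math.exp(-t / 2) / (2 ** (p / 2) * math.gamma(p / 2))

def nc_chi2_pdf(t, p, lam, kmax=150):
    # noncentral chi-squared density: Poisson(lam/2) mixture of chi2_{p+2k}
    w = math.exp(-lam / 2)
    total = 0.0
    for k in range(kmax):
        total += w * chi2_pdf(t, p + 2 * k)
        w *= (lam / 2) / (k + 1)
    return total

def invariant_lower_bound(t, p):
    # B(G_I, x) = q(t|0) / q(t|eta_hat), with eta_hat found by a crude grid search
    q0 = nc_chi2_pdf(t, p, 0.0)
    qmax = max(nc_chi2_pdf(t, p, 0.05 * j) for j in range(1, 301))
    return q0 / max(qmax, q0)

B = invariant_lower_bound(5.99, 2)
```

The noncentral Student's t case of Example 6.17 below has the same structure, with the mixture representation replaced by the corresponding t density.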

Table 6.4. Invariant Test for Normal Means

Notice that the lower bounds on the posterior probabilities of the null hypothesis are anywhere from 4 to 7 times as large as the corresponding P-values, indicating that there is a vast discrepancy between P-values and posterior probabilities. This is the same phenomenon as was seen in Table 6.3. What is, however, interesting is that the class of priors considered here is larger and contains the one considered there, but the magnitude of the discrepancy is about the same.

Example 6.17. (Example 6.13, continued.) In the normal example with unknown variance, we have the maximal invariants t(x) = x̄/s and η(θ, σ) = θ/σ. If we define

G_I = {π : dπ(θ, σ) = h1(η) dη dσ/σ, h1 is any density for η},

we obtain

B(G_I, x) = q(t(x)|0) / q(t(x)|η̂),

where q(t|η) is the density of a noncentral Student's t random variable with n − 1 degrees of freedom and non-centrality parameter η, and η̂ is the maximum likelihood estimate of η. The fact that all the necessary conditions (which are needed to apply the relevant results) are satisfied is shown in Andersson (1982) and Wijsman (1967). For selected values of n, the lower bounds are tabulated along with the P-values in Table 6.5. For small values of n, the lower bounds in Table 6.5 are comparable with the corresponding P-values, whereas as n gets large the differences between these lower bounds and the P-values get larger. See also in this connection Section 6.4. There is a substantial literature on Bayesian testing of a point null. Among these are Jeffreys (1957, 1961), Good (1950, 1958, 1965, 1967, 1983, 1985, 1986), Lindley (1957, 1961, 1965, 1977), Raiffa and Schlaiffer (1961), Edwards et al. (1963), Hildreth (1963), Smith (1965), Zellner (1971, 1984), Dickey (1971, 1973, 1974, 1980), Lempers (1971), Rubin (1971), Leamer (1978), Smith and Spiegelhalter (1980), Zellner and Siow (1980), and Diamond and Forrester (1983). Related work can also be found in Pratt (1965), DeGroot (1973), Dempster (1973), Dickey (1977), Bernardo (1980), Hill (1982), Shafer (1982), and Berger (1986).

Table 6.5. Test for Normal Mean, Variance Unknown

Invariance and Minimaxity

Our focus has been on deriving bounds on Bayes factors for invariant testing problems. There is, however, a large literature on other aspects of invariant tests. For example, if the group under consideration satisfies the technical condition of amenability, and hence the Hunt-Stein theorem is valid, then the minimax invariant test is minimax among all tests. We do not discuss these results here. For details on this and other related results, we refer the interested reader to Berger (1985a), Kiefer (1957, 1966), and Lehmann (1986).

6.3.5 Interval Null Hypotheses and One-sided Tests

Closely related to a sharp null hypothesis H0 : θ = θ0 is an interval null hypothesis H0 : |θ − θ0| ≤ ε. Dickey (1976) and Berger and Delampady (1987) show that the conflict between P-values and posterior probabilities remains if ε is sufficiently small. The precise order of magnitude of small ε depends on the sample size n. One may also ask similar questions of possible conflict between P-values and posterior probabilities for a one-sided null, say, H0 : θ ≤ θ0 versus H1 : θ > θ0. In the case of θ = mean of a normal, and the usual uniform prior, direct calculation shows the P-value equals the posterior probability. On the other hand, Casella and Berger (1987) show that in general the two are not the same, and the P-value may be smaller or greater depending on the family of densities in the model. Incidentally, the ambiguity of an improper prior discussed in Section 6.7 does not apply to one-sided nulls. In this case the Bayes factor remains invariant if the improper prior is multiplied by an arbitrary constant.

6.4 Role of the Choice of an Asymptotic Framework²

² Section 6.4 may be omitted at first reading.

This section is based on Ghosh et al. (2005). Suppose X₁, ..., Xₙ are i.i.d. N(θ, σ²), σ² known, and consider the problem of testing H₀ : θ = θ₀ versus

H₁ : θ ≠ θ₀. If instead of taking a lower bound as in the previous sections, we take a fixed prior density g₁(θ) under H₁ but let n go to ∞, then the conflict between P-values and posterior probabilities is further enhanced. Historically this phenomenon was noted earlier than the conflict with the lower bound, vide Jeffreys (1961) and Lindley (1957). Let g₁ be a uniform prior density over some interval (θ₀ − a, θ₀ + a) containing θ₀. The posterior probability of H₀ given X = (X₁, ..., Xₙ) is

P(H₀|X) = π₀ exp[−n(X̄ − θ₀)²/(2σ²)]/K,

where π₀ is the specified prior probability of H₀ and

K = π₀ exp[−n(X̄ − θ₀)²/(2σ²)] + (1 − π₀) (1/(2a)) ∫_{θ₀−a}^{θ₀+a} exp[−n(X̄ − θ)²/(2σ²)] dθ.

Suppose X̄ is such that X̄ = θ₀ + z_α σ/√n, where z_α is the 100(1 − α)% quantile of N(0, 1). Then X̄ is significant at level α. Also, for sufficiently large n, X̄ is well within (θ₀ − a, θ₀ + a) because X̄ − θ₀ tends to zero as n increases. This leads to

P(H₀|X̄) ≈ π₀ exp(−z_α²/2) / [π₀ exp(−z_α²/2) + (1 − π₀) (σ/(2a)) √(2π/n)].
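This limiting behavior can be checked numerically. The sketch below (plain Python; the settings π₀ = 1/2, a = 1, σ = 1, and z_α = 1.96 are illustrative assumptions) evaluates the exact posterior probability P(H₀|X̄) at X̄ = θ₀ + z_α σ/√n for increasing n, using the error function for the integral over (θ₀ − a, θ₀ + a):

```python
import math

def posterior_prob_H0(n, z_alpha=1.96, pi0=0.5, a=1.0, sigma=1.0):
    """Exact P(H0 | xbar) when xbar = theta0 + z_alpha*sigma/sqrt(n),
    with theta0 = 0 and a uniform prior over (-a, a) under H1."""
    xbar = z_alpha * sigma / math.sqrt(n)
    # standard normal c.d.f. via the error function
    Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    # integral of exp(-n*(xbar - theta)^2/(2 sigma^2)) over (-a, a)
    root_n = math.sqrt(n) / sigma
    integral = sigma * math.sqrt(2 * math.pi / n) * (
        Phi(root_n * (a - xbar)) - Phi(root_n * (-a - xbar)))
    null_term = pi0 * math.exp(-n * xbar ** 2 / (2 * sigma ** 2))
    alt_term = (1 - pi0) * integral / (2 * a)
    return null_term / (null_term + alt_term)

for n in (10, 100, 10_000, 1_000_000):
    print(n, round(posterior_prob_H0(n), 4))
```

Every X̄ in the loop sits exactly at the 5% significance boundary, yet the posterior probability of H₀ climbs steadily toward 1.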

Thus P(H₀|X̄) → 1 as n → ∞, whereas the P-value is equal to α for all n. This is known as the Jeffreys-Lindley paradox. It may be noted that the same phenomenon would arise with any flat enough prior in place of the uniform. Indeed, P-values cannot be compared across sample sizes or across experiments; see Lindley (1957) and Ghosh et al. (2005). Even a frequentist tends to agree that the conventional values of the significance level α, like α = 0.05 or 0.01, are too large for large sample sizes. This point is further discussed below.
The Jeffreys-Lindley paradox shows that for inference about θ, P-values and Bayes factors may provide contradictory evidence and hence can lead to opposite decisions. Once again, as mentioned in Section 6.3, the evidence against H₀ contained in P-values seems unrealistically high. We argue in this section that part of this conflict arises from the fact that different types of asymptotics are being used for the Bayes factors and the P-values. We begin with a quick review of the two relevant asymptotic frameworks in classical statistics for testing a sharp null hypothesis.
The standard asymptotics of classical statistics is based on what are called Pitman alternatives, namely, θₙ = θ₀ + d/√n, at a distance of O(1/√n) from the null. The Pitman alternatives are also called contiguous in the very general asymptotic theory developed by Le Cam (vide Roussas (1972), Le Cam and

Yang (2000), Hajek and Sidak (1967)). The log-likelihood ratio of a contiguous alternative with respect to the null is stochastically bounded as n → ∞. On the other hand, for a fixed alternative, the log-likelihood ratio tends to −∞ (under the null) or ∞ (under the fixed alternative). If the probability of Type 1 error is 0 < α < 1, then the behavior of the likelihood ratio has the following implication. The probability of Type 2 error will converge to some 0 < β < 1 under a contiguous alternative θₙ, and to zero if θ is a fixed alternative. This means the fixed alternatives are relatively easy to detect. So in this framework it is assumed that the alternatives of importance are the contiguous alternatives. Let us call this theory Pitman type asymptotics.
There are several other frameworks in classical statistics, of which Bahadur's (Bahadur, 1971; Serfling, 1980, pp. 332-341) has been studied most. We focus on Bahadur's approach. In Bahadur's theory, the alternatives of importance are fixed and do not depend on n. Given a test statistic, Bahadur evaluates its performance at a fixed alternative by the limit (in probability or a.s.) of (1/n) log(P-value) when the alternative is true. Which of these two asymptotics is appropriate in a given situation should depend on which alternatives are important, fixed alternatives or Pitman alternatives θ₀ + d/√n that approach the null hypothesis at a certain rate. This in turn depends on how the sample size n is chosen. If n is chosen to ensure a Type 2 error bounded away from 0 and 1 (like α), then Pitman alternatives seem appropriate. If n is chosen to be quite large, depending on available resources but not on alternatives, then Bahadur's approach would be reasonable.

6.4.1 Comparison of Decisions via P-values and Bayes Factors in Bahadur's Asymptotics

In this subsection, we essentially follow Bahadur's approach for both P-values and Bayes factors. A Pitman type asymptotics is used for both in the next subsection. We first show that if the P-value is sufficiently small, as small as it typically is in Bahadur's theory, B₀₁ will tend to zero, calling for rejection of H₀; i.e., the evidence in the P-value points in the same direction as that in the Bayes factor or posterior probability, removing the sense of paradox in the result of Jeffreys and Lindley. One could, therefore, argue that the P-values or the significance level α assumed in the Jeffreys-Lindley example are not small enough. The asymptotic framework chosen is not appropriate when contiguous alternatives are not singled out as alternatives of importance.
We now verify the claim about the limit of B₀₁. Without loss of generality, take θ₀ = 0, σ² = 1. First note that

log B₀₁ = −(n/2)X̄² + (1/2) log n + Rₙ,   (6.26)

where Rₙ is bounded in probability,

provided the prior g₁(θ) is a continuous function of θ and is positive at all θ. If we omit Rₙ from the right-hand side of (6.26), we have Schwarz's (1978) approximation to the Bayes factor via BIC (Section 4.3). The logarithm of the P-value (p) corresponding to the observed X̄ is

log p = log 2[1 − Φ(√n |X̄|)] = −(n/2) X̄² (1 + o(1))

by the standard approximation to a normal tail (vide Feller (1973, p. 175) or Bahadur (1971, p. 1)). Thus (1/n) log p → −θ²/2 and, by (6.26), log B₀₁ → −∞. This result is true as long as |X̄| > c(log n/n)^{1/2}, c > √2. Such deviations are called moderate deviations, vide Rubin and Sethuraman (1965). Of course, even for such P-values, p ∼ B₀₁/√n, so that P-values are smaller by an order of magnitude. The conflict in measuring evidence remains, but the decisions are the same. Ghosh et al. (2005) also pursue the comparison of the three measures of evidence based on the likelihood ratio, the P-value based on the likelihood ratio test, and the Bayes factor B₀₁ under general regularity conditions.

6.4.2 Pitman Alternative and Rescaled Priors

We consider once again the problem of testing H₀ : θ = 0 versus H₁ : θ ≠ 0 on the basis of a random sample from N(θ, 1). Suppose that the Pitman alternatives are the most important ones and the prior g₁(θ) under H₁ puts most of the mass on Pitman alternatives. One such prior is N(0, δ/n). Then

B₀₁ = √(δ + 1) exp[−(1/2)(δ/(δ + 1)) n X̄²].

If the P-value is close to zero, √n|X̄| is large and therefore B₀₁ is also close to zero, i.e., for these priors there is no paradox. The two measures are of the same order, but the result of Berger and Sellke (1987) for symmetric unimodal priors still implies that the P-value is smaller than the Bayes factor.
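The absence of a paradox under the rescaled prior can be checked directly: with X̄ = z_α/√n the quantity nX̄² = z_α² does not grow with n, so B₀₁ stabilizes instead of exploding. A small sketch (the values δ = 1 and z_α = 1.96 are illustrative):

```python
import math

def bayes_factor_01(n, xbar, delta):
    """B01 for H0: theta = 0 against the N(0, delta/n) prior under H1,
    based on a sample of size n from N(theta, 1) with mean xbar."""
    return math.sqrt(delta + 1) * math.exp(
        -0.5 * (delta / (delta + 1)) * n * xbar ** 2)

z = 1.96   # xbar sits exactly at the 5% significance boundary for every n
for n in (10, 100, 10_000):
    xbar = z / math.sqrt(n)
    print(n, round(bayes_factor_01(n, xbar, delta=1.0), 4))
```

At a fixed significance level the Bayes factor is the same for every n here, so the two measures of evidence stay of the same order, in contrast with the Jeffreys-Lindley behavior for a fixed prior.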

6.5 Bayesian P-value³

³ Section 6.5 may be omitted at first reading.

Even though valid Bayesian quantities such as the Bayes factor and posterior probability of hypotheses are in principle the correct tools to measure the evidence for or against hypotheses, they are quite often, and especially in many practical situations, very difficult to compute. This is because either the alternatives are only very vaguely specified, vide (6.6), or very complicated. Also, in some cases one may not wish to compare two or more models but to check how a model fits the data. Bayesian P-values have been proposed to deal with such problems.

Let M₀ be a target model, and suppose departure from this model is of interest. If, under this model, X has density f(x|η), η ∈ E, then for a Bayesian with prior π on η, the prior predictive distribution m^π(x) = ∫_E f(x|η)π(η) dη is the actual predictive distribution of X. Therefore, if a model departure statistic T(X) is available, then one can define the prior predictive P-value (or tail area under the predictive distribution) as

p = P^{m^π}(T(X) ≥ T(x_obs) | M₀),

where x_obs is the observed value of X (see Box (1980)). Although it is true that this is a valid Bayesian quantity for model checking and it is useful in situations such as the ones described in Exercise 13 or Exercise 14, it does face the criticism that it may be influenced to a large extent by the prior π, as can be seen in the following example.

Example 6.18. Let X₁, X₂, ..., Xₙ be a random sample from N(θ, σ²) with both θ and σ² unknown. It is of interest to detect discrepancy in the mean of the model, the target model being M₀ : θ = 0. Note that T = √n X̄ (actually its absolute value) is the natural model departure statistic for checking this.
(a) Case 1. It is felt a priori that σ² is known, or equivalently, we choose the prior on σ² that puts all its mass at some known constant σ₀². Then under M₀ there are no unknown parameters, and hence the prior predictive P-value is simply 2(1 − Φ(√n|x̄_obs|/σ₀)).
(b) Case 2. Consider the noninformative prior π(σ²) ∝ 1/σ². Then

m^π(x) ∝ (Σᵢ xᵢ²)^{−n/2},

which is an improper density, thus completely disallowing computation of the prior predictive P-value.
(c) Case 3. Consider an inverse Gamma prior IG(ν, β) with the following density for σ²: π(σ²|ν, β) ∝ (σ²)^{−(ν+1)} exp(−β/σ²) for σ² > 0, where ν and β are specified positive constants. Because T|σ² ∼ N(0, σ²), under this prior the predictive density of T is then

q(t) ∝ (1 + t²/(2β))^{−(2ν+1)/2}.


Table 6.6. Normal Example: Prior Predictive P-values

ν  .5   .5   .5   1    1    1    2    2    2    5     5    5
β  .5   1    2    .5   1    2    .5   1    2    .5    1    2
p  .300 .398 .506 .109 .189 .300 .017 .050 .122 .0001 .001 .011

If 2ν is an integer, then under this predictive distribution T/√(β/ν) has a Student's t distribution with 2ν degrees of freedom. Thus we obtain

p = P^{m^π}(|X̄| ≥ |x̄_obs| | M₀)
  = P^{m^π}(|T|/√(β/ν) ≥ √n|x̄_obs|/√(β/ν) | M₀)
  = 2(1 − F_{2ν}(√n|x̄_obs|/√(β/ν))),

where F_{2ν} is the c.d.f. of t_{2ν}. For √n x̄_obs = 1.96 and various values of ν and β, the corresponding values of the prior predictive P-value are displayed in Table 6.6. Further, note that p → 1 as β → ∞ for any fixed ν > 0. Thus it is clear that the prior predictive P-value, in this example, depends crucially on the values of ν and β. What can be readily seen in this example is that if the prior π used is a poor choice, even an excellent model can come under suspicion upon employing the prior predictive P-value. Further, as indicated above, noninformative priors that are improper (thus making m^π improper too) will not allow computing such a tail area, a further undesirable feature. To rectify these problems, Guttman (1967), Rubin (1984), Meng (1994), and Gelman et al. (1996) suggest modifications, replacing π in m^π by π(η|x_obs):

m*(x|x_obs) = ∫ f(x|η) π(η|x_obs) dη,  and

p* = P^{m*(·|x_obs)}(T(X) ≥ T(x_obs)).

This is called the posterior predictive P-value. This removes some of the difficulties cited above. However, this version of the Bayesian P-value has also come under severe criticism. Bayarri and Berger (1998a) note that these modified quantities are not really Bayesian. To see this, they observe that there is "double use" of the data in the above modifications: first to convert (a possibly improper) π(η) into a proper π(η|x_obs), and then again in computing the tail area of T(X) corresponding to T(x_obs). Furthermore, for large sample sizes, the posterior distribution of η will essentially concentrate at η̂, the MLE of η, so that m*(x|x_obs) will essentially equal f(x|η̂), a non-Bayesian object. In other words, the criticism is that for large sample sizes the posterior predictive P-value will not achieve anything more than rediscovering the classical approach. Let us consider Example 6.18 again.

Example 6.19. (Example 6.18, continued.) Let us consider the noninformative prior π(σ²) ∝ 1/σ² again. Then, as before, T|σ² ∼ N(0, σ²), and

π(σ²|x_obs) ∝ (σ²)^{−(n/2+1)} exp(−n(x̄²_obs + s²_obs)/(2σ²)),

where s²_obs = (1/n) Σᵢ (x_{i,obs} − x̄_obs)². Therefore, we see that, under the posterior predictive distribution,

T/√(x̄²_obs + s²_obs) ∼ t_n.

Thus we obtain the posterior predictive P-value to be

p = 2(1 − F_n(√n|x̄_obs|/√(x̄²_obs + s²_obs))),

where F_n is the distribution function of t_n. This definition of a Bayesian P-value doesn't seem satisfactory. Let |x̄_obs| → ∞. Note that then p → 2(1 − F_n(√n)). Actually, p decreases to this lower bound as |x̄_obs| → ∞.

Table 6.7. Values of pₙ = 2(1 − Fₙ(√n))

n   1    2    3    4    5    6    7    8    9    10
pₙ .500 .293 .182 .116 .076 .050 .033 .022 .015 .010
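The first two entries of the table can be verified from the closed-form Student's t c.d.f.s for 1 and 2 degrees of freedom (a sketch):

```python
import math

def lower_bound_p(n):
    """p_n = 2*(1 - F_n(sqrt(n))), coded for n = 1 (Cauchy) and n = 2,
    the degrees of freedom with closed-form t c.d.f.s."""
    t = math.sqrt(n)
    if n == 1:
        return 1.0 - 2.0 * math.atan(t) / math.pi
    if n == 2:
        return 1.0 - t / math.sqrt(2.0 + t * t)
    raise ValueError("closed form coded only for n = 1, 2")

print(round(lower_bound_p(1), 3))   # the .500 entry of the table
print(round(lower_bound_p(2), 3))   # the .293 entry of the table
```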

Values of this lower bound for different n are shown in Table 6.7. Note that these values have no serious relationship with the observations and hence cannot really be used for model checking. Bayarri and Berger (1998a) attribute this behavior to the 'double' use of the data, namely, the use of x in computing both the posterior distribution and the tail area probability of the posterior predictive distribution.
In an effort to combine the desirable features of the prior predictive P-value and the posterior predictive P-value and eliminate the undesirable features, Bayarri and Berger (see Bayarri and Berger (1998a)) introduce the conditional predictive P-value. This quantity is based on the prior predictive distribution m^π but is more heavily influenced by the model than by the prior. Further, noninformative priors can be used, and there is no double use of the data. The steps are as follows: an appropriate statistic U(X), not involving the model departure statistic T(X), is identified, the conditional predictive distribution m(t|u) is derived, and the conditional predictive P-value is defined as

p_c = P^{m(·|u_obs)}(T(X) ≥ T(x_obs)),

where u_obs = U(x_obs). The following example is from Bayarri and Berger (1998a).

Example 6.20. (Example 6.18, continued.) T = √n X̄ is the model departure statistic for checking discrepancy of the mean in the normal model. Let U(X) = S² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄)². Note that nU|σ² ∼ σ² χ²_{n−1}. Consider π(σ²) ∝ 1/σ² again. Then π(σ²|U = s²) ∝ (σ²)^{−((n−1)/2+1)} exp(−ns²/(2σ²)) is the density of an inverse Gamma, and hence the conditional predictive density of T given s²_obs is

m(t|s²_obs) ∝ (1 + t²/(n s²_obs))^{−n/2}.

Thus, under the conditional predictive distribution,

√(n−1) T/(√n s_obs) ∼ t_{n−1},

and hence we obtain the conditional predictive P-value to be

p_c = P^{m(·|s²_obs)}(|X̄| ≥ |x̄_obs| | M₀)
    = P^{m(·|s²_obs)}(√(n−1)|X̄|/s_obs ≥ √(n−1)|x̄_obs|/s_obs | M₀)
    = 2(1 − F_{n−1}(√(n−1)|x̄_obs|/s_obs)).

We have thus found a Bayesian interpretation for the classical P-value from the usual t-test. It is worth noting that s²_obs was used to produce the posterior distribution that eliminates σ², and that x̄_obs was then used to compute the tail area probability. It is also to be noted that in this example it was easy to find U(X), which eliminates σ² upon conditioning, and that the conditional predictive distribution is a standard one. In general, however, even though this procedure seems satisfactory from a Bayesian point of view, there are problems related to identifying a suitable U(X) and also computing tail areas from (quite often intractable) m(t|u_obs).
Another possibility is the partial posterior predictive P-value (see Bayarri and Berger (1998a) again), defined as follows:

p* = P^{m*(·)}(T(X) ≥ T(x_obs)),

where the predictive density m* is obtained using a partial posterior density π* that does not use the information contained in t_obs = T(x_obs) and is given by

m*(t) = ∫ f_T(t|η) π*(η) dη,

with the partial posterior π* defined as

π*(η) ∝ f_{X|T}(x_obs|t_obs, η) π(η) ∝ [f_X(x_obs|η)/f_T(t_obs|η)] π(η).

Consider Example 6.18 again with π(σ²) ∝ 1/σ². Note that because the Xᵢ are i.i.d. N(0, σ²) under M₀,

f_X(x_obs|σ²) ∝ (σ²)^{−n/2} exp(−(n/(2σ²))(x̄²_obs + s²_obs)),

so that

f_{X|X̄}(x_obs|x̄_obs, σ²) ∝ (σ²)^{−(n−1)/2} exp(−(n/(2σ²)) s²_obs).

Therefore,

π*(σ²) ∝ (σ²)^{−((n−1)/2+1)} exp(−(n/(2σ²)) s²_obs),

which is the same as π(σ²|s²_obs), the posterior used to obtain the conditional predictive density earlier. Thus, in this example, the partial predictive P-value is the same as the conditional predictive P-value. Because this alternative version p* does not require the choice of the statistic U, it appears this method may be used with any suitable goodness-of-fit test statistic T. However, we have not seen such work.

6.6 Robust Bayesian Outlier Detection

Because a Bayes factor is a weighted likelihood ratio, it can also be used for checking whether an observation should be considered an outlier with respect to a certain target model relative to an alternative model. One such approach is as follows. Recall the model selection set-up as given in (6.1). X having density f(x|θ) is observed, and it is of interest to compare two models M₀ and M₁ given by

M₀ : X has density f(x|θ), where θ ∈ Θ₀;
M₁ : X has density f(x|θ), where θ ∈ Θ₁.

For i = 0, 1, gᵢ(θ) is the prior density of θ, conditional on Mᵢ being the true model. To compare M₀ and M₁ on the basis of a random sample x = (x₁, ..., xₙ), the Bayes factor is given by

B₀₁(x) = m₀(x)/m₁(x),

where mᵢ(x) = ∫_{Θᵢ} f(x|θ)gᵢ(θ) dθ for i = 0, 1. To measure the effect on the Bayes factor of an observation x_d, one could use the quantity

k_d = log(B(x)/B(x_{−d})),   (6.27)

where B(x_{−d}) is the Bayes factor excluding observation x_d. If k_d < 0, then when observation x_d is deleted there is an increase of evidence for M₀. Consequently, observation x_d itself favors model M₁. The extent to which x_d favors M₁ determines whether it can be considered an outlier under model M₀. Similarly, a positive value of k_d implies that x_d favors M₀. Pettit and Young (1990) discuss how k_d can be effectively used to detect outliers when the prior is non-informative. The same analysis can be done with informative priors also. This assumes that the conditional prior densities g₀ and g₁ can be fully


specified. However, we take the robust Bayesian point of view that only certain broad features of these densities, such as symmetry and unimodality, can be specified, and hence we can only state that g₀ or g₁ belongs to a certain class of densities as determined by the features specified. Because k_d, derived from the Bayes factor, is the Bayesian quantity of inferential interest here, upper and lower bounds on k_d over classes of prior densities are required. We shall illustrate this approach with a precise null hypothesis. Then we have the problem of comparing

M₀ : θ = θ₀ versus M₁ : θ ≠ θ₀

using a random sample from a population with density f(x|θ). Under M₁, suppose θ has the prior density g, g ∈ Γ. The Bayes factors with all the observations and without the dth observation, respectively, are

B_g(x) = f(x|θ₀) / ∫_{θ≠θ₀} f(x|θ)g(θ) dθ,

B_g(x_{−d}) = f(x_{−d}|θ₀) / ∫_{θ≠θ₀} f(x_{−d}|θ)g(θ) dθ.

Because f(x|θ) = f(x_d|θ)f(x_{−d}|θ), we get

k_{d,g} = log [ f(x|θ₀) ∫_{θ≠θ₀} f(x_{−d}|θ)g(θ) dθ / ( f(x_{−d}|θ₀) ∫_{θ≠θ₀} f(x|θ)g(θ) dθ ) ]
        = log f(x_d|θ₀) − log [ ∫_{θ≠θ₀} f(x|θ)g(θ) dθ / ∫_{θ≠θ₀} f(x_{−d}|θ)g(θ) dθ ].   (6.28)

Now note that to find the extreme values of k_{d,g}, it is enough to find the extreme values of

h_{d,g} = ∫_{θ≠θ₀} f(x|θ)g(θ) dθ / ∫_{θ≠θ₀} f(x_{−d}|θ)g(θ) dθ   (6.29)

over the set Γ. Further, this optimization problem can be rewritten as follows:

sup_{g∈G} h_{d,g} = sup_{g∈G} [ ∫_{θ≠θ₀} f(x_d|θ)f(x_{−d}|θ)g(θ) dθ / ∫_{θ≠θ₀} f(x_{−d}|θ)g(θ) dθ ]
                 = sup_{g*∈G*} ∫_{θ≠θ₀} f(x_d|θ)g*(θ) dθ,   (6.30)

inf_{g∈G} h_{d,g} = inf_{g∈G} [ ∫_{θ≠θ₀} f(x_d|θ)f(x_{−d}|θ)g(θ) dθ / ∫_{θ≠θ₀} f(x_{−d}|θ)g(θ) dθ ]
                 = inf_{g*∈G*} ∫_{θ≠θ₀} f(x_d|θ)g*(θ) dθ,   (6.31)

where

G* = { g* : g*(θ) = g(θ)f(x_{−d}|θ) / ∫_{u≠θ₀} g(u)f(x_{−d}|u) du, g ∈ G }.

Equations (6.30) and (6.31) indicate how optimization of a ratio of integrals can be transformed to optimization of the integrals themselves. Consider the case where Γ is the class A of all prior densities on the set {θ : θ ≠ θ₀}. Then we have the following result.

Theorem 6.21. If f(x_{−d}|θ) > 0 for each θ ≠ θ₀, then

sup_{g∈A} h_{d,g} = sup_{θ≠θ₀} f(x_d|θ),   (6.32)

inf_{g∈A} h_{d,g} = inf_{θ≠θ₀} f(x_d|θ).   (6.33)

Proof. From (6.30) and (6.31) above,

sup_{g∈A} h_{d,g} = sup_{g*∈A*} ∫_{θ≠θ₀} f(x_d|θ)g*(θ) dθ,

where

A* = { g* : g*(θ) = g(θ)f(x_{−d}|θ) / ∫_{u≠θ₀} g(u)f(x_{−d}|u) du, g ∈ A }.

Now note that the extreme points of A* are point masses. The proof for the infimum is similar.

The corresponding extreme values of k_d are

sup_{g∈A} k_{d,g} = log f(x_d|θ₀) − log inf_{θ≠θ₀} f(x_d|θ),   (6.34)

inf_{g∈A} k_{d,g} = log f(x_d|θ₀) − log sup_{θ≠θ₀} f(x_d|θ).   (6.35)

Example 6.22. Suppose we have a sample of size n from N(θ, σ²) with known σ². Then, from (6.34) and (6.35),

sup_{g∈A} k_{d,g} = (1/(2σ²)) sup_{θ≠θ₀} [(x_d − θ)² − (x_d − θ₀)²] = ∞,

and

inf_{g∈A} k_{d,g} = (1/(2σ²)) inf_{θ≠θ₀} [(x_d − θ)² − (x_d − θ₀)²] = −(x_d − θ₀)²/(2σ²).

It can be readily seen from the above bounds on k_d that no observation, however large in magnitude, will be considered an outlier here. This is because A is too large a class of prior densities.


Instead, consider the class G of all N(θ₀, τ²) priors with τ² ≥ τ₀² > 0. Note that τ² close to 0 will make M₁ indistinguishable from M₀, and hence it is necessary to consider τ² bounded away from 0. Then for g ∈ G, g*(θ) ∝ g(θ)f(x_{−d}|θ), so g* is the density of N(m, δ²) with

m = [((n−1)/σ²) x̄_{−d} + (1/τ²) θ₀] / [(n−1)/σ² + 1/τ²],  δ² = [(n−1)/σ² + 1/τ²]^{−1},

where x̄_{−d} is the mean of the observations excluding x_d. Note, therefore, that h_{d,g} = h_{d,g}(x_d) is just the density of N(m, σ² + δ²) evaluated at x_d. Thus

h_{d,g}(x_d) = (2π(σ² + δ²))^{−1/2} exp(−(x_d − m)²/(2(σ² + δ²))).

For each x_d, one just needs to graphically examine the extremes of the expression above as a function of τ² to determine whether that particular observation should be considered an outlier. Delampady (1999) discusses these results and also results for some larger nonparametric classes of prior densities.

6.7 Nonsubjective Bayes Factors⁴

⁴ Section 6.7 may be omitted at first reading.

Consider two models M₀ and M₁ for data X with density fᵢ(x|θᵢ) under model Mᵢ, θᵢ being an unknown parameter of dimension pᵢ, i = 0, 1. Given prior specifications gᵢ(θᵢ) for parameter θᵢ, the Bayes factor of M₁ to M₀ is obtained as

B₁₀ = m₁(x)/m₀(x) = ∫ f₁(x|θ₁)g₁(θ₁) dθ₁ / ∫ f₀(x|θ₀)g₀(θ₀) dθ₀.   (6.36)

Here mᵢ(x) is the marginal density of X under Mᵢ, i = 0, 1. When subjective specification of prior distributions is not possible, which is frequently the case, one would look for automatic methods that use standard noninformative priors.

There are, however, difficulties with (6.36) for noninformative priors, which are typically improper. If the gᵢ are improper, they are defined only up to arbitrary multiplicative constants cᵢ; cᵢgᵢ has as much validity as gᵢ. This implies that (c₁/c₀)B₁₀ has as much validity as B₁₀. Thus the Bayes factor is determined only up to an arbitrary multiplicative constant. This indeterminacy, noted by Jeffreys (1961), has been the main motivation for new objective methods. We shall confine ourselves mainly to the nested case, where f₀ and f₁ are of the same functional form and f₀(x|θ₀) is the same as f₁(x|θ₁) with some of the coordinates of θ₁ specified. However, the methods described below can also be used for non-nested models. It may be mentioned here that use of a diffuse (flat) proper prior does not provide a good solution to the problem. Also, truncation of noninformative priors leads to a large penalty for the more complex model. An example follows.

Example 6.23. (Testing normal mean with known variance.) Suppose we observe X = (X₁, ..., Xₙ). Under M₀, the Xᵢ are i.i.d. N(0, 1) and under M₁, the Xᵢ are i.i.d. N(θ, 1), θ ∈ R being the unknown mean. With the uniform noninformative prior g₁^U(θ) = c for θ under M₁, the Bayes factor of M₁ to M₀ is given by

B₁₀^U = √(2π) c n^{−1/2} exp[nX̄²/2].

If one uses a uniform prior over −K ≤ θ ≤ K, then for large K, the new Bayes factor B₁₀^K is approximately 1/(2Kc) times B₁₀^U. Thus for large K, one is heavily biased against M₁. This is reminiscent of the phenomenon observed by Lindley (1957). A similar conclusion is obtained if one uses a diffuse proper prior such as a normal prior N(0, τ²) with variance τ² large. The corresponding Bayes factor is

B₁₀^norm = (nτ² + 1)^{−1/2} exp[n²τ²X̄²/(2(nτ² + 1))],

which is approximately (nτ²)^{−1/2} exp[nX̄²/2] for large values of nτ² and hence can be made arbitrarily small by choosing a large value of τ². Also, B₁₀^norm is highly non-robust with respect to the choice of τ², and this non-robustness plays the same role as the indeterminacy. The expressions for B₁₀^U and B₁₀^norm clearly indicate the similar behavior of these Bayes factors and the similar roles of √(2π)c and (τ² + 1/n)^{−1/2}.
A solution to the above problem with improper priors is to use part of the data as a training sample. The data are divided into two parts, X = (X₁, X₂). The first part X₁ is used as a training sample to obtain proper posterior distributions for the parameters (given X₁), starting from the noninformative priors.


These proper posteriors are then used as priors to compute the Bayes factor with the remainder of the data (X₂). This conditional Bayes factor, conditioned on X₁, can be expressed as

B₁₀(X₁) = B₁₀ × m₀(X₁)/m₁(X₁),   (6.37)

where mᵢ(X₁) is the marginal density of X₁ under Mᵢ, i = 0, 1. Note that if the priors cᵢgᵢ, i = 0, 1, are used to compute B₁₀(X₁), the arbitrary constant multiplier c₁/c₀ of B₁₀ is cancelled by the factor c₀/c₁ in m₀(X₁)/m₁(X₁), so that the indeterminacy of the Bayes factor is removed in (6.37). A part of the data, X₁, may be used as a training sample as described above if the corresponding posteriors gᵢ(θᵢ|X₁), i = 0, 1, are proper or, equivalently, the marginal densities mᵢ(X₁) of X₁ under Mᵢ, i = 0, 1, are finite. One would naturally use a minimal amount of data as such a training sample, leaving most of the data for model comparison. As in Berger and Pericchi (1996a), a training sample X₁ may be called proper if 0 < mᵢ(X₁) < ∞, i = 0, 1, and minimal if it is proper and no subset of it is proper.

Example 6.24. (Testing normal mean with known variance.) Consider the setup of Example 6.23 and the uniform noninformative prior g₁(θ) = 1 for θ under M₁. The minimal training samples are subsamples of size 1, with m₀(Xᵢ) = (1/√(2π)) e^{−Xᵢ²/2} and m₁(Xᵢ) = 1.

Example 6.25. (Testing normal mean with variance unknown.) Let X = (X₁, ..., Xₙ).

M₀ : X₁, ..., Xₙ are i.i.d. N(0, σ₀²),
M₁ : X₁, ..., Xₙ are i.i.d. N(μ, σ₁²).

Consider the noninformative priors g₀(σ₀) = 1/σ₀ under M₀ and g₁(μ, σ₁) = 1/σ₁ under M₁. Here m₁(Xᵢ) = ∞ for a single observation Xᵢ; a minimal training sample consists of two distinct observations Xᵢ, Xⱼ, and for such a training sample (Xᵢ, Xⱼ),

m₀(Xᵢ, Xⱼ) = 1/(2π(Xᵢ² + Xⱼ²)),  m₁(Xᵢ, Xⱼ) = 1/(2|Xᵢ − Xⱼ|).   (6.38)
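The two marginals for a size-two training sample under the priors of Example 6.25, m₀(xᵢ, xⱼ) = 1/(2π(xᵢ² + xⱼ²)) and m₁(xᵢ, xⱼ) = 1/(2|xᵢ − xⱼ|), can be double-checked by brute-force quadrature of the defining integrals (a sketch: the truncation points and step counts are ad hoc, and the μ-integral under M₁ is carried out first as a standard Gaussian integral):

```python
import math

def m0_quad(xi, xj, smax=80.0, steps=80_000):
    """M0 marginal: midpoint-rule integral over sigma of
    (2*pi*s^2)^{-1} * exp(-(xi^2 + xj^2)/(2 s^2)) * (1/s)."""
    h = smax / steps
    a = xi * xi + xj * xj
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * h
        total += math.exp(-a / (2 * s * s)) / (2 * math.pi * s ** 3)
    return total * h

def m1_quad(xi, xj, smax=4000.0, steps=400_000):
    """M1 marginal: the mu-integral equals sqrt(pi)*s*exp(-(xi-xj)^2/(4 s^2))
    (a Gaussian integral); the remaining sigma-integral of
    exp(-(xi-xj)^2/(4 s^2)) / (2*sqrt(pi)*s^2) is done by the midpoint rule."""
    h = smax / steps
    b = (xi - xj) ** 2 / 4.0
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * h
        total += math.exp(-b / (s * s)) / (2 * math.sqrt(math.pi) * s * s)
    return total * h

xi, xj = 1.0, 2.0
print(m0_quad(xi, xj), 1 / (2 * math.pi * (xi**2 + xj**2)))   # should agree
print(m1_quad(xi, xj), 1 / (2 * abs(xi - xj)))                # should agree
```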

6.7.1 The Intrinsic Bayes Factor

As described above, a solution to the problem with improper priors is obtained using a conditional Bayes factor B₁₀(X₁), conditioned on a training sample X₁. However, this conditional Bayes factor depends on the choice of

the training sample X₁. Let X(l), l = 1, 2, ..., L, be the list of all possible minimal training samples. Berger and Pericchi (1996a) suggest considering all these minimal training samples and taking an average of the corresponding L conditional Bayes factors B₁₀(X(l)) to obtain what is called the intrinsic Bayes factor (IBF). For example, taking an arithmetic average leads to the arithmetic intrinsic Bayes factor (AIBF),

AIBF₁₀ = B₁₀ × (1/L) Σ_{l=1}^{L} m₀(X(l))/m₁(X(l)),   (6.39)

and the geometric average gives the geometric intrinsic Bayes factor (GIBF),

GIBF₁₀ = B₁₀ × ( ∏_{l=1}^{L} m₀(X(l))/m₁(X(l)) )^{1/L},   (6.40)

the sum and product in (6.39) and (6.40) being taken over the L possible training samples X(l), l = 1, ..., L. Berger and Pericchi (1996a) also suggest using trimmed averages or the median (complete trimming) of the conditional Bayes factors when taking an average of all of them does not seem reasonable (e.g., when the conditional Bayes factors vary much). The AIBF and GIBF have good properties but are affected by outliers. If the sample size is very small, using a part of the sample as a training sample may be impractical, and Berger and Pericchi (1996a) recommend using expected intrinsic Bayes factors that replace the averages in (6.39) and (6.40) by their expectations, evaluated at the MLE under the more complex model M₁. For more details, see Berger and Pericchi (1996a). Situations in which the IBF reduces simply to the Bayes factor B₁₀ with respect to the noninformative priors are given in Berger et al. (1998).
The AIBF is justified by the possibility of its correspondence with actual Bayes factors with respect to "reasonable" priors, at least asymptotically. Berger and Pericchi (1996a, 2001) have argued that these priors, known as "intrinsic" priors, may be considered to be natural "default" priors for the testing problems. The intrinsic priors are discussed in Subsection 6.7.3.

6.7.2 The Fractional Bayes Factor

O'Hagan (1994, 1995) proposes a solution using a fractional part of the full likelihood in place of using parts of the sample as training samples and averaging over them. The resulting "partial" Bayes factor, called the fractional Bayes factor (FBF), is given by

FBF₁₀ = m₁(X, b)/m₀(X, b),

where b is a fraction and

mᵢ(X, b) = ∫ fᵢ(X|θᵢ) gᵢ(θᵢ) dθᵢ / ∫ fᵢᵇ(X|θᵢ) gᵢ(θᵢ) dθᵢ,  i = 0, 1.

Note that FBF₁₀ can also be written as

FBF₁₀ = B₁₀ × m₀ᵇ(X)/m₁ᵇ(X),

where

mᵢᵇ(X) = ∫ fᵢᵇ(X|θᵢ) gᵢ(θᵢ) dθᵢ,  i = 0, 1.   (6.41)

To make the FBF comparable with the IBF, one may take b = m/n, where m is the size of a minimal training sample as defined above and n is the total sample size. O'Hagan also recommends other choices of b, e.g., b = √n/n or (log n)/n.

We now illustrate through a number of examples.

Example 6.26. (Testing normal mean with known variance.) Consider the setup of Example 6.23. The Bayes factor with the noninformative prior g₁(θ) = 1 was obtained as

B₁₀ = √(2π) n^{−1/2} exp[nX̄²/2] = √(2π) n^{−1/2} Λ₁₀,

where Λ₁₀ is the likelihood ratio statistic. The Bayes factor conditioned on Xᵢ is

B₁₀(Xᵢ) = B₁₀ × m₀(Xᵢ)/m₁(Xᵢ) = n^{−1/2} exp[nX̄²/2 − Xᵢ²/2].

Thus

AIBF₁₀ = n^{−1} Σ_{i=1}^{n} B₁₀(Xᵢ) = n^{−3/2} exp(nX̄²/2) Σ_{i=1}^{n} exp(−Xᵢ²/2),

GIBF₁₀ = n^{−1/2} exp[nX̄²/2 − (1/(2n)) Σ_{i=1}^{n} Xᵢ²].

The median IBF (MIBF) is obtained as the median of the set of values B₁₀(Xᵢ), i = 1, 2, ..., n. The FBF with a fraction 0 < b < 1 is

FBF₁₀ = b^{1/2} exp[n(1 − b)X̄²/2] = n^{−1/2} exp[(n − 1)X̄²/2],  if b = 1/n.

Example 6.27. (Testing normal mean with variance unknown.) Consider the setup of Example 6.25. For the standard noninformative priors considered in this example, we have

B₁₀ = [Γ((n−1)/2)/Γ(n/2)] √(π/n) (Σᵢ Xᵢ²)^{n/2} / [Σᵢ (Xᵢ − X̄)²]^{(n−1)/2},

AIBF₁₀ = B₁₀ × (2/(n(n−1)π)) Σ_{1≤i<j≤n} |Xᵢ − Xⱼ|/(Xᵢ² + Xⱼ²),

FBF₁₀ = [Γ((n−1)/2)/(√π Γ(n/2))] [(X̄² + s²)/s²]^{(n−2)/2},  with b = 2/n,

where s² = (1/n) Σᵢ (Xᵢ − X̄)².

Example 6.28. (Normal linear model.) This example is from Berger and Pericchi (1996b, 2001). Berger and Pericchi determined the IBF for linear models in both the nested and the non-nested case. We consider here only the nested case. Suppose for the data Y (n × 1) we consider the linear models

Mᵢ : Y = Xᵢβᵢ + εᵢ,  εᵢ ∼ N(0, σᵢ² Iₙ),  i = 0, 1,

where βᵢ = (βᵢ₁, βᵢ₂, ..., βᵢ_{pᵢ})′ and σᵢ² are unknown parameters, and Xᵢ is an n × pᵢ known design matrix of rank pᵢ < n. Consider priors of the form

gᵢ(βᵢ, σᵢ) ∝ σᵢ^{−(1+qᵢ)}.

Here qᵢ = 0 gives the reference prior of Berger and Bernardo (1992a), and qᵢ = pᵢ corresponds to the Jeffreys prior. For the nested case, when M₀ is nested in M₁, Berger and Pericchi (1996b) consider a modified Jeffreys prior for which q₀ = 0 and q₁ = p₁ − p₀. The integrated likelihoods mᵢ(Y) with these priors can be obtained as

mᵢ(Y) = C (2π)^{−(n−pᵢ)/2} |Xᵢ′Xᵢ|^{−1/2} Rᵢ^{−(n−p₀)/2},

where C is a constant not depending on i, and Rᵢ is the residual sum of squares under Mᵢ, i = 0, 1. The Bayes factor B₁₀ with the modified Jeffreys prior is then given by

B₁₀ = (2π)^{(p₁−p₀)/2} (|X₀′X₀|/|X₁′X₁|)^{1/2} (R₀/R₁)^{(n−p₀)/2}.   (6.42)

Also, one can see that a minimal training sample Y(l) in this case is a sample of size m = p₁ + 1 such that for the corresponding design matrices Xᵢ(l) (under Mᵢ), the matrices Xᵢ(l)′Xᵢ(l), i = 0, 1, are nonsingular. The ratio m₀(Y(l))/m₁(Y(l)) can be obtained from the expression for B₁₀ by inverting it and replacing n, X₀, X₁, R₀, and R₁ by m, X₀(l), X₁(l), R₀(l), and R₁(l), respectively, where Rᵢ(l) is the residual sum of squares corresponding to Y(l) under Mᵢ, i = 0, 1. Thus the conditional Bayes factor B₁₀(Y(l)), conditioned on Y(l), is given by

B₁₀(Y(l)) = (|X₀′X₀|/|X₁′X₁|)^{1/2} (R₀/R₁)^{(n−p₀)/2} (|X₁(l)′X₁(l)|/|X₀(l)′X₀(l)|)^{1/2} (R₁(l)/R₀(l))^{(p₁−p₀+1)/2}.


6 Hypothesis Testing and Model Selection

One may now find an average of these conditional Bayes factors to obtain an IBF. For example, the arithmetic mean of the B10(Y(l)) over all possible minimal training samples Y(l) gives the AIBF, and their median gives the median IBF. In the case of the fractional Bayes factor, one obtains (see, for example, Berger and Pericchi, 2001, page 152), with m_i^b(X) as defined in (6.41),

m_0^b(X)/m_1^b(X) = (b/(2π))^{(p1−p0)/2} (|X1′X1| / |X0′X0|)^{1/2} (R1/R0)^{(m−p0)/2},

with b = m/n, and hence FBF10 = b^{(p1−p0)/2} (R0/R1)^{(n−m)/2}.

See also O'Hagan (1995) in this context. For more examples, see Berger and Pericchi (1996a, 1996b, 2001) and O'Hagan (1995). Several other methods have been proposed as solutions to the problem with noninformative priors. Smith and Spiegelhalter (1980) and Spiegelhalter and Smith (1982) propose the imaginary minimal sample device; see also Ghosh and Samanta (2002b) for a generalization. Berger and Pericchi (2001) present a comparison of four methods, including the IBF and FBF, illustrated through a number of examples. Ghosh and Samanta (2002b) discuss a unified derivation of some of these methods that shows that, in a qualitative and conceptual sense, the methods are close to one another.
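The default Bayes factors for the nested linear model can be sketched numerically. Below is a minimal illustration (my own, not from the book) for the simplest nested pair — M0: intercept only, nested in M1: intercept plus one covariate — where (6.42) and the FBF formula above reduce to closed forms computable without matrix libraries:

```python
import math

def default_bayes_factors_nested_lm(t, y):
    """Sketch of (6.42) and FBF10 = b^{(p1-p0)/2}(R0/R1)^{(n-m)/2} for
    M0: y_i = b0 + e_i  nested in  M1: y_i = b0 + b1*t_i + e_i,
    under the modified Jeffreys priors (q0 = 0, q1 = p1 - p0)."""
    n = len(y); p0, p1 = 1, 2
    ybar = sum(y)/n; tbar = sum(t)/n
    sxx = sum((ti - tbar)**2 for ti in t)
    sxy = sum((ti - tbar)*(yi - ybar) for ti, yi in zip(t, y))
    r0 = sum((yi - ybar)**2 for yi in y)     # residual SS under M0
    r1 = r0 - sxy**2 / sxx                   # residual SS under M1
    det0, det1 = n, n * sxx                  # |X0'X0| and |X1'X1|
    b10 = (2*math.pi)**((p1-p0)/2) * math.sqrt(det0/det1) * (r0/r1)**((n-p0)/2)
    m = p1 + 1                               # minimal training sample size
    b = m / n
    fbf10 = b**((p1-p0)/2) * (r0/r1)**((n-m)/2)
    return b10, fbf10
```

For data with a clear linear trend, both default Bayes factors strongly favor M1, as expected.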

6.7.3 Intrinsic Priors

Given a default Bayes factor such as the IBF or FBF, a natural question is whether it corresponds, at least approximately, with an actual Bayes factor based on some priors. If such priors exist, they are called intrinsic priors. A default Bayes factor such as the IBF can then be calculated as an actual Bayes factor using the intrinsic prior, and one need not consider all possible training samples and average over them. A "reasonable" intrinsic prior that corresponds to a naturally developed good default Bayes factor may be considered a natural default prior for the given testing or model selection problem. On the other hand, a particular default Bayesian method may be evaluated on the basis of the corresponding intrinsic prior, depending on how "reasonable" that intrinsic prior is. Berger and Pericchi (1996a) describe how one can obtain intrinsic priors using an asymptotic argument. We begin with an example.

Example 6.29. (Example 6.26, continued.) Suppose that for some proper prior π(θ) under model M1,

BF^π_{10} ≈ AIBF10,  (6.43)


where BF^π_{10} denotes the Bayes factor based on the prior π(θ) under M1. Using the Laplace approximation (Section 4.3) for the integrated likelihood under M1, we have

BF^π_{10} ≈ (f1(X|θ̂) / f0(X|θ = 0)) n^{−1/2} π(θ̂) √(2π) (det Î)^{−1/2},

where θ̂ is the MLE of θ under M1, and Î is the observed Fisher information number. Thus, using the expression for the AIBF in this example and noting that Î = 1, (6.43) can be expressed as

π(θ̂) ≈ (1/√(2π)) (1/n) Σ_{i=1}^n exp(−X_i²/2).

As the RHS converges to (1/√(2π)) E_θ(e^{−X₁²/2}) = (1/√(2π))(1/√2) e^{−θ²/4} with probability one under any θ, this suggests the intrinsic prior

π(θ) = (1/√(4π)) e^{−θ²/4},

which is a N(0, 2) density. One can easily verify that

BF^π_{10}/AIBF10 → 1 with probability one under any θ, i.e., the AIBF is approximately the same as the Bayes factor with a N(0, 2) prior for θ under M1. If one considers the FBF, one can show directly that the FBF with fraction b is exactly equal to the Bayes factor with a N(0, (b^{−1} − 1)/n) prior.

Let us now consider the general case. Let B10 be the Bayes factor of M1 to M0 with noninformative priors g_i(θ_i) for θ_i under M_i, i = 0, 1. We illustrate below with the AIBF; the treatment for the other IBFs and the FBF is similar. Recall that

AIBF10 = B10 · B̄01,  where  B̄01 = (1/L) Σ_{l=1}^L m0(X(l))/m1(X(l)).
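The convergence claimed in Example 6.29 can be checked numerically. A minimal sketch (my own; it assumes the setup of Example 6.26 to be X_i i.i.d. N(θ, 1), M0: θ = 0, with the uniform noninformative prior under M1, and uses the fact that the common factor B10 cancels in the ratio of the two Bayes factors):

```python
import math, random

def bf_ratio_intrinsic_vs_aibf(theta=1.0, n=2000, seed=7):
    """Ratio BF^pi_10 / AIBF_10 for the N(0,2) intrinsic prior.
    After cancellation of the common factor B10, the ratio reduces to
    phi_{2+1/n}(xbar) / ((1/n) * sum_i phi(x_i)), where phi_v denotes the
    N(0, v) density; it should be close to 1 for large n."""
    rng = random.Random(seed)
    x = [rng.gauss(theta, 1.0) for _ in range(n)]
    xbar = sum(x) / n
    v = 2.0 + 1.0 / n
    num = math.exp(-xbar**2 / (2*v)) / math.sqrt(2*math.pi*v)        # phi_{2+1/n}(xbar)
    den = sum(math.exp(-xi**2 / 2) for xi in x) / (n * math.sqrt(2*math.pi))
    return num / den
```

With a few thousand observations the ratio is very close to 1, in accordance with the almost-sure convergence stated above.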

Suppose for some priors π_i under M_i, i = 0, 1, AIBF10 is approximately equal to the Bayes factor based on π_0 and π_1, denoted BF10(π_0, π_1). Using the Laplace approximation (Section 4.3) for both the numerator and denominator of B10 (see (6.36)), AIBF10 can be shown to be approximately equal to (6.44), where n denotes the sample size, p_i is the dimension of θ_i, θ̂_i is the MLE of θ_i, and Î_i is the observed Fisher information matrix under M_i, i = 0, 1. The same approximation applied to BF10(π_0, π_1) yields the approximation


f1(X|θ̂1) π1(θ̂1) (2π/n)^{p1/2} |Î1|^{−1/2} / [ f0(X|θ̂0) π0(θ̂0) (2π/n)^{p0/2} |Î0|^{−1/2} ]  (6.45)

for BF10(π_0, π_1). We assume that the conditions for the Laplace approximation hold for the given models. To find the intrinsic priors, we equate (6.44) with (6.45), and this yields (6.46). Berger and Pericchi (1996a) obtain the intrinsic prior determining equations by taking limits on both sides of (6.46) under M0 and M1. Assume that, as n → ∞: under M1, θ̂1 → θ1, θ̂0 → a(θ1), and B̄01 → B̄1*(θ1); under M0, θ̂0 → θ0, θ̂1 → b(θ0), and B̄01 → B̄0*(θ0). The equations obtained by Berger and Pericchi (1996a) are

When M0 is nested in M1, Berger and Pericchi suggested the solution (6.48). However, this may not be the unique solution to (6.47). See also Dmochowski (1994) in this context.

Example 6.30. (Example 6.27, continued.) A solution to the intrinsic prior determining equations suggested by Berger and Pericchi (see (6.48)) is (6.49), where Z1 = (X1 − X2)²/(2σ1²) and Z2 = (X1 + X2)²/(2σ1²) ~ noncentral χ² with d.f. = 1 and noncentrality parameter λ = 2μ²/σ1². Also, Z1 and Z2 are independent. Using the representation of a noncentral χ² density as an (infinite) weighted sum of central χ² densities, we have

E[ Z1/(Z1 + Z2) ] = Σ_{j=0}^∞ ( e^{−λ/2} (λ/2)^j / j! ) E[ Z1/(Z1 + W_j) ],  (6.50)

where W_j ~ χ²_{1+2j} and is independent of Z1.

We then have

and the intrinsic priors are given by

with

It is to be noted that ∫_{−∞}^{∞} π1(μ|σ1) dμ = 1.

Example 6.31. (Testing normal mean with variance unknown.) This is from Berger and Pericchi (1996a). Consider the setup of Example 6.25 with the same prior g0 under M0, but in place of the standard noninformative prior g1(μ, σ1) = 1/σ1 use the Jeffreys prior g1(μ, σ1) = 1/σ1². In this case, a minimal training sample consists of two distinct observations Xi, Xj for which

Proceeding as in the previous example, noting that m0(X1, X2)/m1(X1, X2) can be expressed in terms of Z1 and Z2 as above, and using (6.50), the intrinsic priors are obtained as

Here π1(μ|σ1) is a proper prior, very close to the Cauchy(0, σ1) prior for μ that was suggested by Jeffreys (1961) as a default proper prior for μ (given σ1); see Subsection 2.7.2.

Example 6.32. Consider a negative binomial experiment: Bernoulli trials, each having probability θ of success, are independently performed until a total of n successes is accumulated. On the basis of the outcome of this experiment we want to test the null hypothesis H0: θ = 1/2 against the alternative H1: θ ≠ 1/2. We consider this problem as choosing between the two models M0: θ = 1/2 and M1: θ ∈ (0, 1).


The data may be looked upon as n observations X1, ..., Xn, where X1 denotes the number of failures before the first success and, for i = 2, ..., n, Xi denotes the number of failures between the (i − 1)th and the ith success. The random variables X1, ..., Xn are i.i.d. with common geometric probability mass function

P(Xi = x) = θ^x (1 − θ),  x = 0, 1, 2, ....

The likelihood function is

f(x1, ..., xn | θ) = θ^{Σ_{i=1}^n x_i} (1 − θ)^n.

We consider the Jeffreys prior

g1(θ) ∝ θ^{−1/2} (1 − θ)^{−1},

which is improper. The Bayes factor with this prior is

B10 = 2^{Σx_i + n} Γ(Σx_i + 1/2) Γ(n) / Γ(Σx_i + n + 1/2).

Minimal training samples are of size 1, and the AIBF is given by

AIBF10 = B10 × (1/n) Σ_{i=1}^n (2X_i + 1)/2^{X_i + 2}.

Let

B*(θ) = E_θ[ (2X_1 + 1)/2^{X_1 + 2} ] = Σ_{x=0}^∞ ((2x + 1)/2^{x+2}) θ^x (1 − θ).

Then the intrinsic prior is

π(θ) = θ^{−1/2} (1 − θ)^{−1} B*(θ) = (1/4) Σ_{x=0}^∞ (2x + 1) 2^{−x} θ^{x − 1/2}.

Simplification yields

π(θ) = (θ^{−1/2} + θ^{1/2}/2)(2 − θ)^{−2}.
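The simplification of the intrinsic prior above can be checked numerically by comparing the truncated series with the closed form (a minimal sketch; function names are mine):

```python
def intrinsic_prior_series(theta, terms=400):
    """pi(theta) = (1/4) * sum_{x>=0} (2x+1) 2^{-x} theta^{x-1/2}, truncated."""
    return 0.25 * sum((2*x + 1) * 2.0**(-x) * theta**(x - 0.5)
                      for x in range(terms))

def intrinsic_prior_closed(theta):
    """Closed form: pi(theta) = (theta^{-1/2} + theta^{1/2}/2) (2 - theta)^{-2}."""
    return (theta**-0.5 + theta**0.5 / 2) * (2 - theta)**-2
```

The series converges geometrically (ratio θ/2 < 1), so a few hundred terms suffice for any θ in (0, 1).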

We now consider a simple example from Lindley and Phillips (1976), also presented in Carlin and Louis (1996, Chapter 1). In 12 independent tosses of a coin, one observes 9 heads and 3 tails, the last toss yielding a tail. It turns out that one gets different results according as a binomial or a negative binomial likelihood is assumed. Let us consider the problem of testing the null hypothesis H0: θ = 1/2 against the alternative H1: θ ≠ 1/2, where θ denotes the probability of a head in a single toss. If a binomial model is assumed, the random observable X is the number of heads observed in a fixed number of 12 tosses. One rejects H0 for large values of the statistic |X − 6|, and the corresponding P-value is 0.150. On the other hand, if a negative binomial model is assumed, the random observable X is the number of heads before the third tail appears. Note that the expected value of X under H0 is 3. Suppose one rejects H0 for large values of |X − 3|. Then the corresponding P-value is 0.0325. Thus, at the usual 5% Type 1 error level, the two model assumptions lead to different decisions. Let us now use a Bayes test for this problem. For the binomial model, the Jeffreys prior is proportional to θ^{−1/2}(1 − θ)^{−1/2}, which can be normalized to a proper prior. For the negative binomial model, the data can be treated as three i.i.d. geometrically distributed random variables, as described above. The Bayes factor under the binomial model (with the Jeffreys prior) and the Bayes factor under the negative binomial model (with the intrinsic prior) are, respectively, 1.079 and 1.424. They differ, as did the classical P-values, but unlike the P-values, the Bayes factors are quite close.
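The qualitative contrast between the two classical P-values — one above and one below 5% — is easy to reproduce (a minimal sketch of the two tail probabilities; my own code, not the book's):

```python
from math import comb

def binomial_p_value():
    """Two-sided tail probability P(|X - 6| >= 3) for X ~ Binomial(12, 1/2)."""
    return sum(comb(12, k) for k in range(13) if abs(k - 6) >= 3) / 2**12

def neg_binomial_p_value():
    """P(X >= 9) where X = number of heads before the 3rd tail, theta = 1/2;
    P(X = x) = C(x+2, 2) (1/2)^{x+3}."""
    return 1 - sum(comb(x + 2, 2) * 0.5**(x + 3) for x in range(9))
```

The binomial tail probability exceeds 0.05 while the negative binomial one does not, which is the point of the example: the same data lead to opposite 5%-level decisions under the two stopping rules.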

6.8 Exercises

1. Assume a sharp null and continuity of the null distribution of the test statistic.
(a) Calculate E_{H0}(P-value) and E_{H0}(P-value | P-value < α), where 0 < α < 1 is the Type 1 error probability.
(b) In view of your answer to (a), do you think 2(P-value) is a better measure of evidence against H0 than the P-value?
2. Suppose X ~ N(θ, 1) and consider the two hypothesis testing problems:
H0: θ = −1 versus H1: θ = 1;  H0′: θ = 1 versus H1′: θ = −1.
Find the Bayes factor of H0 relative to H1 and that of H0′ relative to H1′ if (a) x = 0 is observed, and (b) x = 1 is observed. Compute the classical P-value in both cases.
3. Refer to Example 6.3. Take τ = 2σ, but keep the other parameter values unchanged. Compute B01 for the same values of t and n as used in Table 6.1.
4. Suppose X ~ N(θ, 1) and consider testing H0: θ = 0 versus H1: θ ≠ 0. For three different values of x, x = 0, 1, 2, compute the upper and lower bounds on Bayes factors when the prior on θ under the alternative hypothesis lies in
(a) Γ_A = {all prior distributions on R},
(b) Γ_N = {N(0, τ²), τ² > 0},
(c) Γ_S = {all symmetric (about 0) prior distributions on R},
(d) Γ_SU = {all unimodal priors on R, symmetric about 0}.
Compute the classical P-value for each x value. What is the implication of Γ_N ⊂ Γ_SU ⊂ Γ_S ⊂ Γ_A?
5. Let X ~ B(m, θ), and let it be of interest to test H0: θ = 1/2 versus H1: θ ≠ 1/2. If m = 10 and the observed data is x = 8, compute the upper and lower bounds on Bayes factors when the prior on θ under the alternative hypothesis lies in
(a) Γ_A = {all prior distributions on (0, 1)},
(b) Γ_B = {Beta(a, a), a > 0},
(c) Γ_S = {all symmetric (about 1/2) priors on (0, 1)},
(d) Γ_SU = {all unimodal priors on (0, 1), symmetric about 1/2}.
Compute the classical P-value also.
6. Refer to Example 6.7.
(a) Show that B(G_A, x) = exp(−t²/2) and P(H0|G_A, x) = [1 + ((1 − π0)/π0) exp(t²/2)]^{−1}.
(b) Show that, if t ≤ 1, B(G_US, x) = 1 and P(H0|G_US, x) = π0.
(c) Show that, if t ≤ 1, B(G_Nor, x) = 1 and P(H0|G_Nor, x) = π0. If t > 1, B(G_Nor, x) = t exp(−(t² − 1)/2).
7. Suppose X|θ has the t_p(3, θ, I_p) distribution with density

f(x|θ) ∝ ( 1 + (x − θ)′(x − θ)/3 )^{−(3+p)/2},

and it is of interest to test H0: θ = 0 versus H1: θ ≠ 0. Show that this testing problem is invariant under the group of all orthogonal transformations.
8. Refer to Example 6.13. Show that the testing problem mentioned there is invariant under the group of scale transformations.
9. In Example 6.16, find the maximal invariants in the sample space and the parameter space.
10. In Example 6.17, find the maximal invariants in the sample space and the parameter space.
11. Let X|θ ~ N(θ, 1) and consider testing

H0: |θ − θ0| ≤ 0.1 versus H1: |θ − θ0| > 0.1.
Suppose x = θ0 + 1.97 is observed.
(a) Compute the P-value.
(b) Compute B01 and P(H0|x) under the two priors N(θ0, τ²), with τ² = (0.148)², and U(θ0 − 1, θ0 + 1).
12. Let X|p ~ Binomial(10, p). Consider the two models M0: p = 1/2 versus M1: p ≠ 1/2. Under M1, consider the following three priors for p: (i) U(0, 1), (ii) Beta(10, 10), and (iii) Beta(100, 10). If the observations x = 0, 3, 5, 7, and 10 are available, compute k_d given in Equation (6.27) for each observation and for each of the priors, and check which of the observations may be considered outliers under M0.
13. (Box (1980)) Let X1, X2, ..., Xn be a random sample from N(θ, σ²) with both θ and σ² unknown. It is of interest to detect discrepancy in the variance of the model, with the target model being

M0: σ² = σ0², and θ ~ N(μ, τ²),

where μ and τ² are specified.
(a) Show that the predictive distribution of (X1, X2, ..., Xn) under M0 is multivariate normal with covariance matrix σ0² I_n + τ² 11′ and E(Xi) = μ, for i = 1, 2, ..., n.
(b) Show that under this predictive distribution,
(c) Derive and justify the prior predictive P-value based on the model departure statistic T(X). Apply this to the data x = (8, 5, 4, 7), with σ0² = 1, μ = 0, τ² = 2.
(d) What is the classical P-value for testing H0: σ² = σ0² in this problem?
14. (Box (1980)) Suppose that under the target model, for i = 1, 2, ..., n,

Y_i | β0, θ, σ² = β0 + x_i′θ + ε_i,  ε_i ~ N(0, σ²) i.i.d.,
β0 | σ² ~ N(μ0, cσ²),  θ | σ² ~ N_p(θ0, σ²Γ),  σ² ~ inverse Gamma(α, γ),

where c, μ0, θ0, Γ, α, and γ are specified. Assume the standard linear regression model notation y = β0 1 + Xθ + ε, and suppose that X′1 = 0. Further assume that, given σ², β0, θ, and ε are conditionally independent. Also, let β̂0 and θ̂ be the least squares estimates of β0 and θ, respectively, and RSS = (y − β̂0 1 − Xθ̂)′(y − β̂0 1 − Xθ̂).
(a) Show that under the target model, conditionally on σ², the predictive density of y is proportional to

(σ²)^{−n/2} exp( −(1/(2σ²)) [ (β̂0 − μ0)²/(c + 1/n) + RSS + (θ̂ − θ0)′((X′X)^{−1} + Γ^{−1})^{−1}(θ̂ − θ0) ] ).

(b) Prove that the predictive distribution of y under the target model is a multivariate t.
(c) Show that the joint predictive density of (RSS, θ̂) is proportional to

{ 2γ + RSS + (θ̂ − θ0)′((X′X)^{−1} + Γ^{−1})^{−1}(θ̂ − θ0) }^{−(n+α−1)/2}.

(d) Derive the prior predictive distribution of

T(y) = (θ̂ − θ0)′((X′X)^{−1} + Γ^{−1})^{−1}(θ̂ − θ0) / (2γ + RSS).

(e) Using an appropriately scaled T(y) as the model departure statistic, derive the prior predictive P-value.

(e) Using an appropriately scaled T(y) as the model departure statistic derive the prior predictive P-value. 15. Consider the same linear regression set-up as in Exercise 14, but let the target model now be

Mo: 8

=

O,,Bo[a 2

,...,

N(f.1o,ca 2 ),a 2

,...,

inverse Gamma(a:,--y).

Assuming/' to be close to 0, use ~I

T( ) y

=

~

8 X'X8 RSS

as the model departure statistic to derive the prior predictive P-value. Compare it with the classical P-value for testing H0: θ = 0.
16. Consider the same problem as in Exercise 15, but let the target model be

M0: θ = 0,  β0 | σ² ~ N(μ0, cσ²),  π(σ²) ∝ 1/σ².

Using T(y) = θ̂′X′Xθ̂ as the model departure statistic and RSS as the conditioning statistic, derive the conditional predictive P-value. Compute the partial predictive P-value using the same model departure statistic. Compare these with the classical P-value for testing H0: θ = 0.
17. Let X1, X2, ..., Xn be i.i.d. with density

f(x|λ, θ) = λ exp(−λ(x − θ)),  x > θ,

where λ > 0 and −∞ < θ < ∞ are both unknown. Let the target model be

M0: θ = 0,  π(λ) ∝ 1/λ.

Suppose the smallest order statistic, T = X_(1), is considered a suitable model departure statistic for this problem.
(a) Show that T|λ ~ exponential(nλ) under M0.
(b) Show that λ|x_obs ~ Gamma(n, n x̄_obs) under M0.
(c) Show that

m(t|x_obs) = n x̄_obs^n / (t + x̄_obs)^{n+1}.

(d) Compute the posterior predictive P-value.
(e) Show that as t_obs → ∞, the posterior predictive P-value does not necessarily approach 0. (Note that t_obs ≤ x̄_obs → ∞ also.)


18. (Contingency table) Casella and Berger (1990) present the following two-way table, which is the outcome of a famous medical experiment conducted by Joseph Lister. Lister performed 75 amputations, with and without using carbolic acid.

                      Carbolic Acid Used?
  Patient Lived?        Yes      No
  Yes                    34      19
  No                     16       6

Test for association of patient mortality with the use of carbolic acid on the basis of the above data using (a) BIC and (b) the classical likelihood ratio test. Discuss the different probabilistic interpretations underlying the two tests.
19. On the basis of the data on food poisoning presented in Table 2.1, you have to test whether potato salad was the cause. (Do this separately for Crab-meat and No Crab-meat.)
(a) Formulate this as a problem of testing a sharp null against the alternative that the null is false.
(b) Test the sharp null using BIC.
(c) Test the same null using the classical likelihood ratio test.
(d) Discuss whether the notions of classical Type 1 and Type 2 error probabilities make sense here.
20. Using the BIC, analyze the data of Problem 19 to explore whether crab-meat also contributed to the food poisoning.
21. (Goodness of fit test). Feller (1973) presents the following data on the bombing of London during World War II. The entire area of South London is divided into 576 small regions of equal area, and the number (n_k) of regions with exactly k bomb hits is recorded.

Test the null hypothesis that bombing was at random, rather than the general belief that special targets were being bombed. (Hint: Under H0 use the Poisson model; under the alternative use the full multinomial model with 5 parameters, and use BIC.)
22. (Hald's regression data). We present below a small set of data on heat evolved during the hardening of Portland cement and four variables that may be related to it (Woods et al. (1932), pp. 635-649). The sample size (n) is 13. The regressor variables (in percent of the weight) are x1 = calcium aluminate (3CaO.Al2O3), x2 = tricalcium silicate (3CaO.SiO2), x3 = tetracalcium alumino ferrite (4CaO.Al2O3.Fe2O3), and x4 = dicalcium


Table 6.8. Cement Hardening Data

  x1   x2   x3   x4      y
   7   26    6   60   78.5
   1   29   15   52   74.3
  11   56    8   20  104.3
  11   31    8   47   87.6
   7   52    6   33   95.9
  11   55    9   22  109.2
   3   71   17    6  102.7
   1   31   22   44   72.5
   2   54   18   22   93.1
  21   47    4   26  115.9
   1   40   23   34   83.8
  11   66    9   12  113.3
  10   68    8   12  109.4

silicate (2CaO.SiO2). The response variable is y = total calories given off during hardening per gram of cement after 180 days. Usually such a data set is analyzed using a normal linear regression model of the form

y_i = β0 + Σ_{j=1}^p β_j x_{ij} + ε_i,  i = 1, ..., n,

where p is the number of regressor variables in the model, β0, β1, ..., βp are unknown parameters, and the ε_i's are independent errors having a N(0, σ²) distribution. There are a number of possible models depending on which regressor variables are kept in the model. Analyze the data and choose one from this set of possible models using (a) BIC, (b) AIBF of the full model relative to all possible models.
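As a small illustration of the BIC comparison asked for in Exercise 18, the following sketch (my own; it assumes the standard Schwarz approximation 2 log B01 ≈ (p1 − p0) log n − G², with p1 − p0 = 1 for independence in a 2×2 table) computes the likelihood ratio statistic G² and the BIC comparison for Lister's data:

```python
import math

def lr_statistic_2x2(table):
    """Likelihood ratio statistic G^2 for independence in a 2x2 table, and
    the BIC comparison 2*log B01 ~ log(n) - G^2 (one extra parameter)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = [a + b, c + d]
    col = [a + c, b + d]
    obs = [a, b, c, d]
    exp = [row[0]*col[0]/n, row[0]*col[1]/n, row[1]*col[0]/n, row[1]*col[1]/n]
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)
    two_log_b01 = math.log(n) - g2   # positive values favor independence (M0)
    return g2, two_log_b01
```

For Lister's table the statistic G² is small relative to log 75, so the BIC comparison favors the independence (no association) model.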

7 Bayesian Computations

Bayesian analysis requires computation of expectations and quantiles of probability distributions that arise as posterior distributions. Modes of the densities of such distributions are also sometimes used. The standard Bayes estimate is the posterior mean, which is also the Bayes rule under the squared error loss. Its accuracy is assessed using the posterior variance, which is again an expected value. Posterior median is sometimes utilized, and to provide Bayesian credible regions, quantiles of posterior distributions are needed. If conjugate priors are not used, as is mostly the case these days, posterior distributions will not be standard distributions and hence the required Bayesian quantities (i.e., posterior quantities of inferential interest) cannot be computed in closed form. Thus special techniques are needed for Bayesian computations.

Example 7.1. Suppose X is N(θ, σ²) with known σ², and a Cauchy(μ, τ) prior on θ is considered appropriate from robustness considerations (see Chapter 3, Example 3.20). Then

π(θ|x) ∝ exp( −(θ − x)²/(2σ²) ) ( τ² + (θ − μ)² )^{−1},

and hence the posterior mean and variance are ratios of integrals involving this unnormalized density. Note that these integrals cannot be computed in closed form, but various numerical integration techniques, such as IMSL routines or Gaussian quadrature, can be efficiently used to obtain very good approximations. On the other hand, the following example provides a more difficult problem.


Example 7.2. Suppose X1, X2, ..., Xk are independent Poisson counts, with Xi ~ Poisson(θi). The θi are a priori considered related, and a joint multivariate normal prior distribution on their logarithms is assumed. Specifically, let νi = log(θi) be the ith element of ν, and suppose

ν ~ N_k( μ1, τ²((1 − ρ)I_k + ρ 11′) ),

where 1 is the k-vector with all elements equal to 1, and μ, τ², and ρ are known constants. Then, because

f(x|ν) ∝ exp( −Σ_{i=1}^k { e^{νi} − νi xi } )  and  π(ν) ∝ exp( −(1/(2τ²)) (ν − μ1)′((1 − ρ)I_k + ρ11′)^{−1}(ν − μ1) ),

we have that

π(ν|x) ∝ exp{ −Σ_{i=1}^k { e^{νi} − νi xi } − (1/(2τ²)) (ν − μ1)′((1 − ρ)I_k + ρ11′)^{−1}(ν − μ1) }.

Therefore, if the posterior mean of θj is of interest, we need to compute

E^π(θj|x) = E^π(exp(νj)|x) = ∫_{R^k} exp(νj) g(ν|x) dν / ∫_{R^k} g(ν|x) dν,

where

g(ν|x) = exp{ −Σ_{i=1}^k { e^{νi} − νi xi } − (1/(2τ²)) (ν − μ1)′((1 − ρ)I_k + ρ11′)^{−1}(ν − μ1) }.

This is a ratio of two k-dimensional integrals, and as k grows, the integrals become less and less easy to work with. Numerical integration fails to be efficient in this case. This problem, known as the curse of dimensionality, is due to the fact that the size of the part of the space that is not relevant for the computation of the integral grows very fast with the dimension. Consequently, the error of approximation of these numerical methods increases as a power of the dimension k, making the techniques inefficient. In fact, numerical integration techniques are presently not preferred except for one- and two-dimensional integrals. The recent popularity of the Bayesian approach to statistical applications is mainly due to advances in statistical computing. These include the E-M algorithm discussed in Section 7.2 and the Markov chain Monte Carlo (MCMC) sampling techniques discussed in Section 7.4. As we will see later, Bayesian analysis of real-life problems invariably involves difficult computations, while MCMC techniques such as Gibbs sampling (Section 7.4.4) and the Metropolis-Hastings algorithm (M-H) (Section 7.4.3) have rendered some of these very difficult computational tasks quite feasible.


7.1 Analytic Approximation

This is exactly what we saw in Section 4.3.2, where we derived analytic large sample approximations for certain integrals using the Laplace approximation. Specifically, suppose

E^π(g(θ)|x) = ∫_{R^k} g(θ) f(x|θ) π(θ) dθ / ∫_{R^k} f(x|θ) π(θ) dθ  (7.1)

is the Bayesian quantity of interest, where g, f, and π are smooth functions of θ. First, consider any integral of the form

I = ∫_{R^k} q(θ) exp(−n h(θ)) dθ,

where h is a smooth function with −h having its unique maximum at θ̂. Then, as indicated in Section 4.3.1 for the univariate case, the Laplace method involves expanding q and h about θ̂ in a Taylor series. Let h′ and q′ denote the vectors of partial derivatives of h and q, respectively, and Δ_h and Δ_q denote the Hessians of h and q. Then, writing

h(θ) = h(θ̂) + (θ − θ̂)′h′(θ̂) + (1/2)(θ − θ̂)′Δ_h(θ̂)(θ − θ̂) + ···
     = h(θ̂) + (1/2)(θ − θ̂)′Δ_h(θ̂)(θ − θ̂) + ···  (because h′(θ̂) = 0), and

q(θ) = q(θ̂) + (θ − θ̂)′q′(θ̂) + (1/2)(θ − θ̂)′Δ_q(θ̂)(θ − θ̂) + ···,

we obtain

I = ∫_{R^k} { q(θ̂) + (θ − θ̂)′q′(θ̂) + (1/2)(θ − θ̂)′Δ_q(θ̂)(θ − θ̂) + ··· }
      × e^{−n h(θ̂)} exp( −(n/2)(θ − θ̂)′Δ_h(θ̂)(θ − θ̂) + ··· ) dθ

  = e^{−n h(θ̂)} (2π)^{k/2} n^{−k/2} |Δ_h(θ̂)|^{−1/2} { q(θ̂) + O(n^{−1}) },  (7.2)

which is exactly (4.16). Upon applying this to both the numerator and denominator of (7.1) separately (with q equal to g and to 1), a first-order approximation easily emerges. It also indicates that a second-order approximation may be available if further terms in the Taylor series expansion are retained. Suppose that g in (7.1) is positive, and let −n h(θ) = log f(x|θ) + log π(θ) and −n h*(θ) = −n h(θ) + log g(θ). Now apply (7.2) to both the numerator and


denominator of (7.1) with q equal to 1. Then, letting θ* denote the maximum of −h*, Σ = Δ_h^{−1}(θ̂), and Σ* = Δ_{h*}^{−1}(θ*), as mentioned in Section 4.3.2, Tierney and Kadane (1986) obtain the fantastic approximation

E^π(g(θ)|x) ≈ (|Σ*|/|Σ|)^{1/2} exp{ −n (h*(θ*) − h(θ̂)) },  (7.3)

which they call fully exponential. This technique can be used in Example 7.2. Note that to derive the approximation in (7.3), it is enough to have the probability distribution of g(θ) concentrate away from the origin on the positive side. Therefore, often when g is not positive, (7.3) can be applied after adding a large positive constant to g, and this constant is then subtracted after obtaining the approximation. Some other analytic approximations are also available. Angers and Delampady (1997) use an exponential approximation for a probability distribution that concentrates near the origin. We will not be emphasizing any of these techniques here, including the many numerical integration methods mentioned previously, because the availability of powerful, all-purpose simulation methods has rendered them less attractive.
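The fully exponential approximation is easy to try on a conjugate example where the exact answer is known. A minimal sketch (my own, not from the book), assuming X ~ Binomial(n, θ) with a uniform prior and g(θ) = θ, for which the exact posterior mean is (x+1)/(n+2):

```python
import math

def tierney_kadane_mean(x, n):
    """Tierney-Kadane 'fully exponential' approximation (7.3) to the posterior
    mean of theta for X ~ Binomial(n, theta) with a uniform prior."""
    # -n h(theta) = log-likelihood + log-prior; -n h*(theta) adds log g = log theta
    L  = lambda t: x * math.log(t) + (n - x) * math.log(1 - t)
    Ls = lambda t: (x + 1) * math.log(t) + (n - x) * math.log(1 - t)
    d2L  = lambda t: -x / t**2 - (n - x) / (1 - t)**2
    d2Ls = lambda t: -(x + 1) / t**2 - (n - x) / (1 - t)**2
    t_hat, t_star = x / n, (x + 1) / (n + 1)      # maximizers of L and L*
    ratio = math.sqrt(d2L(t_hat) / d2Ls(t_star))  # (|Sigma*|/|Sigma|)^{1/2}
    return ratio * math.exp(Ls(t_star) - L(t_hat))
```

Even for moderate n the approximation agrees with the exact posterior mean to several decimal places, reflecting the O(n^{−2}) relative error of the fully exponential form.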

7.2 The E-M Algorithm

We shall use a slightly different notation here. Suppose Y|θ has density f(y|θ), and suppose the prior on θ is π(θ), resulting in the posterior density π(θ|y). When π(θ|y) is computationally difficult to handle, as is usually the case, there are some 'data augmentation' methods that can help. The idea is to augment the observed data y with missing or latent data z to obtain the 'complete' data x = (y, z), so that the augmented posterior density π(θ|x) = π(θ|y, z) is computationally easy to handle. The E-M algorithm (see Dempster et al. (1977), Tanner (1991), or McLachlan and Krishnan (1997)) is the simplest among such data augmentation methods. In our context, the E-M algorithm is meant for computing the posterior mode. However, if data augmentation yields a computationally simple posterior distribution, there are more powerful computational tools available that can provide a lot more information on the posterior distribution, as will be seen later in this chapter. The basic steps in the iterations of the E-M algorithm are the following. Let p(z|y, θ̂) be the predictive density of Z given y and an estimate θ̂ of θ. Find ẑ^{(i)} = E(Z|y, θ̂^{(i)}), where θ̂^{(i)} is the estimate of θ used at the ith step of the iteration. Note the similarity with estimating missing values. Use ẑ^{(i)} to augment y and maximize π(θ|y, ẑ^{(i)}) to obtain θ̂^{(i+1)}. Then find ẑ^{(i+1)} using θ̂^{(i+1)}, and continue the iteration. This combination of an expectation followed by a maximization in each iteration gives the E-M algorithm its name.


Implementation of the E-M Algorithm

Note that because π(θ|y) = π(θ, z|y)/p(z|y, θ), we have that

log π(θ|y) = log π(θ, z|y) − log p(z|y, θ).

Taking expectations on both sides with respect to p(z|y, θ̂^{(i)}), we get

log π(θ|y) = ∫ log π(θ, z|y) p(z|y, θ̂^{(i)}) dz − ∫ log p(z|y, θ) p(z|y, θ̂^{(i)}) dz
           = Q(θ, θ̂^{(i)}) − H(θ, θ̂^{(i)})  (7.4)

(where Q and H are according to the notation of Dempster et al. (1977)). Then, the general E-M algorithm involves the following two steps in the ith iteration:

E-Step: Calculate Q(θ, θ̂^{(i)});
M-Step: Maximize Q(θ, θ̂^{(i)}) with respect to θ and obtain θ̂^{(i+1)} such that

max_θ Q(θ, θ̂^{(i)}) = Q(θ̂^{(i+1)}, θ̂^{(i)}).

Note that

log π(θ̂^{(i+1)}|y) − log π(θ̂^{(i)}|y) = { Q(θ̂^{(i+1)}, θ̂^{(i)}) − Q(θ̂^{(i)}, θ̂^{(i)}) } − { H(θ̂^{(i+1)}, θ̂^{(i)}) − H(θ̂^{(i)}, θ̂^{(i)}) }.

From the E-M algorithm, we have that Q(θ̂^{(i+1)}, θ̂^{(i)}) ≥ Q(θ, θ̂^{(i)}) for any θ. Further,

H(θ, θ̂^{(i)}) − H(θ̂^{(i)}, θ̂^{(i)}) = ∫ log p(z|y, θ) p(z|y, θ̂^{(i)}) dz − ∫ log p(z|y, θ̂^{(i)}) p(z|y, θ̂^{(i)}) dz
 = ∫ log [ p(z|y, θ) / p(z|y, θ̂^{(i)}) ] p(z|y, θ̂^{(i)}) dz
 = − ∫ log [ p(z|y, θ̂^{(i)}) / p(z|y, θ) ] p(z|y, θ̂^{(i)}) dz
 ≤ 0,

because, for any two densities p1 and p2, ∫ log(p1(x)/p2(x)) p1(x) dx is the Kullback-Leibler distance between p1 and p2, which is at least 0. Therefore,

H(θ̂^{(i+1)}, θ̂^{(i)}) − H(θ̂^{(i)}, θ̂^{(i)}) ≤ 0,

and hence π(θ̂^{(i+1)}|y) ≥ π(θ̂^{(i)}|y) for any iteration i. Therefore, starting from any point, the E-M algorithm can usually be expected to converge to a local maximum.


Table 7.1. Genetic Linkage Data

  Cell   Count       Probability
   1     y1 = 125    (2 + θ)/4
   2     y2 = 18     (1 − θ)/4
   3     y3 = 20     (1 − θ)/4
   4     y4 = 34     θ/4

Example 7.3. (Genetic linkage model.) Consider the data from Rao (1973) on a certain recombination rate in genetics (see Sorensen and Gianola (2002) for details). Here 197 counts are classified into 4 categories as shown in Table 7.1, along with the corresponding theoretical cell probabilities. The multinomial mass function in this example is given by f(y|θ) ∝ (2 + θ)^{y1} (1 − θ)^{y2+y3} θ^{y4}, so that under the uniform(0, 1) prior on θ, the observed posterior density is given by

π(θ|y) ∝ (2 + θ)^{y1} (1 − θ)^{y2+y3} θ^{y4}.

This is not a standard density due to the presence of 2 + θ. If we split the first cell into two cells with probabilities 1/2 and θ/4, respectively, the complete data will be given by x = (x1, x2, x3, x4, x5), where x1 + x2 = y1, x3 = y2, x4 = y3, and x5 = y4. The augmented posterior density will then be given by

π(θ|x) ∝ θ^{x2+x5} (1 − θ)^{x3+x4},

which corresponds to a Beta density. The E-step of the E-M algorithm consists of obtaining

Q(θ, θ̂^{(i)}) = E[ (X2 + X5) log θ + (X3 + X4) log(1 − θ) | y, θ̂^{(i)} ]
            = { E[X2 | y, θ̂^{(i)}] + y4 } log θ + (y2 + y3) log(1 − θ).  (7.5)

The M-step involves finding θ̂^{(i+1)} to maximize (7.5). We can do this by solving (∂/∂θ) Q(θ, θ̂^{(i)}) = 0, so that

θ̂^{(i+1)} = ( E[X2 | y, θ̂^{(i)}] + y4 ) / ( E[X2 | y, θ̂^{(i)}] + y4 + y2 + y3 ).

Now note that E[X2 | y, θ̂^{(i)}] = E[X2 | X1 + X2, θ̂^{(i)}], and that X2 | X1 + X2, θ̂^{(i)} ~ binomial(X1 + X2, (θ̂^{(i)}/4)/(1/2 + θ̂^{(i)}/4)). Therefore,

E[X2 | y, θ̂^{(i)}] = y1 θ̂^{(i)}/(2 + θ̂^{(i)}),  (7.6)

and hence


Table 7.2. E-M Iterations for Genetic Linkage Data Example

  Iteration i   θ̂^{(i)}
  1             .60825
  2             .62432
  3             .62648
  4             .62678
  5             .62682
  6             .62682

θ̂^{(i+1)} = ( y1 θ̂^{(i)}/(2 + θ̂^{(i)}) + y4 ) / ( y1 θ̂^{(i)}/(2 + θ̂^{(i)}) + y4 + y2 + y3 ).  (7.7)

In our example, (7.7) converges to θ̂ = .62682 in 5 iterations starting from θ̂^{(0)} = .5, as shown in Table 7.2.
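The iteration (7.7) is easy to code. A minimal sketch (function name and stopping rule are mine):

```python
def em_genetic_linkage(y, theta0=0.5, tol=1e-6, max_iter=100):
    """E-M iteration (7.7) for the genetic linkage data of Example 7.3.
    y = (y1, y2, y3, y4); returns the converged posterior mode of theta."""
    y1, y2, y3, y4 = y
    theta = theta0
    for _ in range(max_iter):
        e_x2 = y1 * theta / (2 + theta)            # E-step, Equation (7.6)
        new = (e_x2 + y4) / (e_x2 + y4 + y2 + y3)  # M-step
        if abs(new - theta) < tol:
            return new
        theta = new
    return theta
```

Started from θ̂^{(0)} = .5 with the data of Table 7.1, this reproduces the iterates of Table 7.2 and converges to .62682.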

7.3 Monte Carlo Sampling Consider an expectation that is not available in closed form. An alternative to numerical integration or analytic approximation to compute this is statistical sampling. This probabilistic technique is a familiar tool in statistical inference. To estimate a population mean or a population proportion, a natural approach is to gather a large sample from this population and to consider the corresponding sample mean or the sample proportion. The law of large numbers guarantees that the estimates so obtained will be good provided the sample is large enough. Specifically, let f be a probability density function (or a mass function) and suppose the quantity of interest is a finite expectation of the form

Ejh(X) =

i

(7.8)

h(x)f(x) dx

(or the corresponding sum in the discrete case). If i.i.d. observations xl' x2, ... can be generated from the density j, then

(7.9) converges in probability (or even almost surely) to Ejh(X). This justifies using hm as an approximation for Ejh(X) for large m. To provide a measure of accuracy or the extent of error in the approximation, we can again use a statistical technique and compute the standard error. If VarJh(X) is finite, then VarJ(hm) = VarJh(X)/m. Further, Var 1 h(X) can be estimated by

$$s_m^2 = \frac{1}{m}\sum_{i=1}^{m} h^2(X_i) - \bar h_m^2,$$

the sample analogue of $\operatorname{Var}_f h(X) = E_f h^2(X) - \left(E_f h(X)\right)^2$, and hence the standard error of h̄_m can be estimated by $s_m/\sqrt{m}$.

If one wishes, confidence intervals for E_f h(X) can also be provided using the central limit theorem. Because

$$\frac{\sqrt{m}\left(\bar h_m - E_f h(X)\right)}{s_m} \longrightarrow N(0, 1)$$

in distribution as m → ∞, the interval $(\bar h_m - z_{\alpha/2}\, s_m/\sqrt{m},\; \bar h_m + z_{\alpha/2}\, s_m/\sqrt{m})$ can be used as an approximate 100(1 − α)% confidence interval for E_f h(X), with z_{α/2} denoting the 100(1 − α/2)% quantile of the standard normal distribution.

The above discussion suggests that if we want to approximate the posterior mean, we could try to generate i.i.d. observations from the posterior distribution and consider the mean of this sample. This is rarely feasible, because most often the posterior will be a non-standard distribution that does not easily allow sampling from it. Note, however, that there are other possibilities, as seen below.
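As a small numerical illustration of (7.9) and the standard error computation above, the following sketch estimates E[X²] = 1 for X ~ N(0, 1); the sample size m and the seed are arbitrary choices:

```python
import math, random

random.seed(1)

# Plain Monte Carlo (7.9) for E_f[h(X)] with f = N(0,1) and h(x) = x^2,
# so the true value is E[X^2] = 1.
m = 100_000
h_vals = [random.gauss(0.0, 1.0) ** 2 for _ in range(m)]

h_bar = sum(h_vals) / m                          # the estimate (7.9)
s2 = sum(v * v for v in h_vals) / m - h_bar**2   # sample analogue of Var_f h(X)
se = math.sqrt(s2 / m)                           # estimated standard error

z = 1.96                                         # z_{alpha/2} for alpha = 0.05
ci = (h_bar - z * se, h_bar + z * se)
print(h_bar, se, ci)
```

The reported standard error shrinks at the usual m^{-1/2} rate, so one extra decimal digit of accuracy costs a hundredfold more samples.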

Example 7.4. (Example 7.1 continued.) Recall that

$$E^\pi(\theta \mid x) = \frac{\int_{-\infty}^{\infty} \theta \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta}{\int_{-\infty}^{\infty} \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta} = \frac{\int_{-\infty}^{\infty} \theta \left\{\frac{1}{\sigma}\phi\left(\frac{\theta - x}{\sigma}\right)\right\}\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta}{\int_{-\infty}^{\infty} \left\{\frac{1}{\sigma}\phi\left(\frac{\theta - x}{\sigma}\right)\right\}\left(\tau^2 + (\theta - \mu)^2\right)^{-1} d\theta},$$

where φ denotes the density of the standard normal. Thus E^π(θ|x) is the ratio of the expectation of h(θ) = θ/(τ² + (θ − μ)²) to that of h(θ) = 1/(τ² + (θ − μ)²), both expectations being with respect to the N(x, σ²) distribution. Therefore, we simply sample θ₁, θ₂, ... from N(x, σ²) and use

$$\hat E^\pi(\theta \mid x) = \frac{\sum_{i=1}^{m} \theta_i\left(\tau^2 + (\theta_i - \mu)^2\right)^{-1}}{\sum_{i=1}^{m} \left(\tau^2 + (\theta_i - \mu)^2\right)^{-1}}$$

as our Monte Carlo estimate of E^π(θ|x). Note that (7.8) and (7.9) are applied separately to the numerator and the denominator, but using the same sample of θ's.

It is unwise to assume that the problem has been completely solved. The sample of θ's generated from N(x, σ²) will tend to concentrate around x,


whereas to satisfactorily account for the contribution of the Cauchy prior to the posterior mean, a significant portion of the θ's should come from the tails of the posterior distribution. It may therefore appear better to express the posterior mean in the form

$$E^\pi(\theta \mid x) = \frac{\int_{-\infty}^{\infty} \theta \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\pi(\theta)\, d\theta}{\int_{-\infty}^{\infty} \exp\left(-\frac{(\theta - x)^2}{2\sigma^2}\right)\pi(\theta)\, d\theta},$$

then sample θ's from Cauchy(μ, τ) and use the approximation

$$\hat E^\pi(\theta \mid x) = \frac{\sum_{i=1}^{m} \theta_i \exp\left(-\frac{(\theta_i - x)^2}{2\sigma^2}\right)}{\sum_{i=1}^{m} \exp\left(-\frac{(\theta_i - x)^2}{2\sigma^2}\right)}.$$
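The two sampling schemes just described can be sketched as follows, with hypothetical values x = 2, μ = 0, and σ = τ = 1 chosen purely for illustration:

```python
import math, random

random.seed(7)

# Hypothetical values for illustration: x = 2, mu = 0, sigma = tau = 1.
x, mu, sigma, tau = 2.0, 0.0, 1.0, 1.0
m = 200_000

# Scheme 1: sample theta ~ N(x, sigma^2); weight by the Cauchy kernel.
th = [random.gauss(x, sigma) for _ in range(m)]
w = [1.0 / (tau**2 + (t - mu) ** 2) for t in th]
est1 = sum(t * wi for t, wi in zip(th, w)) / sum(w)

# Scheme 2: sample theta ~ Cauchy(mu, tau) by inverse CDF; weight by the
# normal kernel.
th = [mu + tau * math.tan(math.pi * (random.random() - 0.5)) for _ in range(m)]
w = [math.exp(-((t - x) ** 2) / (2 * sigma**2)) for t in th]
est2 = sum(t * wi for t, wi in zip(th, w)) / sum(w)

print(est1, est2)
```

Both ratios converge to the same posterior mean; the schemes differ only in which factor of the posterior is sampled and which is carried as a weight.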

However, this is also not totally satisfactory, because the tails of the posterior distribution are not as heavy as those of the Cauchy prior, and hence there will be excess sampling from the tails relative to the center. The implication is that the convergence of the approximation is slower, and hence there is a larger error in the approximation (for a fixed m). Ideally, therefore, sampling should be from the posterior distribution itself for a satisfactory approximation. With this view in mind, a variation of the above theme has been developed, called Monte Carlo importance sampling. Consider (7.8) again. Suppose that it is difficult or expensive to sample directly from f, but there exists a probability density u that is very close to f and from which it is easy to sample. Then we can rewrite (7.8) as

$$E_f h(X) = \int_{\mathcal{X}} h(x)\, f(x)\, dx = \int_{\mathcal{X}} h(x)\,\frac{f(x)}{u(x)}\, u(x)\, dx = \int_{\mathcal{X}} \{h(x)w(x)\}\, u(x)\, dx = E_u\{h(X)w(X)\},$$

where w(x) = f(x)/u(x). Now apply (7.9) with f replaced by u and h replaced by hw. In other words, generate i.i.d. observations X₁, X₂, ... from the density u and compute

$$\bar h_m = \frac{1}{m}\sum_{i=1}^{m} h(X_i)\, w(X_i).$$

The sampling density u is called the importance function. We illustrate importance sampling with the following example.


Example 7.5. Suppose X₁, X₂, ..., Xₙ are i.i.d. N(θ, σ²), where both θ and σ² are unknown. Independent priors are assumed for θ and σ²: θ has a double exponential distribution with density exp(−|θ|)/2, and σ² has the prior density (1 + σ²)⁻². Neither of these is a standard prior, but each is a robust choice of proper prior all the same. If the posterior mean of θ is of interest, then it is necessary to compute

$$E^\pi(\theta \mid x) = \int_0^\infty \int_{-\infty}^{\infty} \theta\, \pi(\theta, \sigma^2 \mid x)\, d\theta\, d\sigma^2.$$

Because π(θ, σ²|x) is not a standard density, let us look for a standard density close to it. Letting x̄ denote the mean of the sample x₁, x₂, ..., xₙ and $s_n^2 = \sum_{i=1}^{n}(x_i - \bar x)^2/n$, note that

$$\pi(\theta, \sigma^2 \mid x) \propto (\sigma^2)^{-n/2} \exp\left(-\frac{n}{2\sigma^2}\left\{(\theta - \bar x)^2 + s_n^2\right\}\right)\exp(-|\theta|)(1 + \sigma^2)^{-2}$$
$$= \left[(\sigma^2)^{-(n/2+2)} \exp\left(-\frac{n}{2\sigma^2}\left\{(\theta - \bar x)^2 + s_n^2\right\}\right)\left\{s_n^2 + (\theta - \bar x)^2\right\}^{n/2+1}\right]$$
$$\quad \times \left\{s_n^2 + (\theta - \bar x)^2\right\}^{-(n/2+1)} \exp(-|\theta|)\left(\frac{\sigma^2}{1 + \sigma^2}\right)^2$$
$$\propto u_1(\sigma^2 \mid \theta)\, u_2(\theta)\, \exp(-|\theta|)\left(\frac{\sigma^2}{1 + \sigma^2}\right)^2,$$

where u₁(σ²|θ) is the density of the inverse Gamma distribution with shape parameter n/2 + 1 and scale parameter (n/2){(θ − x̄)² + s_n²}, and u₂ is the Student's t density with n + 1 degrees of freedom, location x̄, and scale a multiple of s_n. It may be noted that the tails of exp(−|θ|)(σ²/(1 + σ²))² do not have much influence in the presence of u₁(σ²|θ)u₂(θ). Therefore, u(θ, σ²) = u₁(σ²|θ)u₂(θ) may be chosen as a suitable importance function. This involves sampling θ first from the density u₂(θ), and, given this θ, sampling σ² from u₁(σ²|θ). This is repeated to generate further values of (θ, σ²). Finally, after generating m such pairs (θᵢ, σᵢ²), the required posterior mean of θ is approximated by

$$\hat E^\pi(\theta \mid x) = \frac{\sum_{i=1}^{m} \theta_i\, w(\theta_i, \sigma_i^2)}{\sum_{i=1}^{m} w(\theta_i, \sigma_i^2)}, \quad\text{where } w(\theta, \sigma^2) = \exp(-|\theta|)\left(\frac{\sigma^2}{1 + \sigma^2}\right)^2.$$
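A sketch of this importance sampling scheme follows. The data are synthetic stand-ins, and the particular t scale s_n/√(n + 1) is one concrete choice of "a multiple of s_n" that matches the kernel {s_n² + (θ − x̄)²}^{−(n/2+1)}; both are assumptions for illustration:

```python
import math, random

random.seed(11)

# Synthetic data (an assumption for illustration): n = 30 draws from N(2, 1).
data = [random.gauss(2.0, 1.0) for _ in range(30)]
n = len(data)
xbar = sum(data) / n
s2n = sum((xi - xbar) ** 2 for xi in data) / n   # s_n^2

def draw_t(df, loc, scale):
    # Student's t via N(0,1) / sqrt(chi2_df / df).
    z = random.gauss(0.0, 1.0)
    v = random.gammavariate(df / 2.0, 2.0)        # chi-square with df d.f.
    return loc + scale * z / math.sqrt(v / df)

m = 50_000
num = den = 0.0
for _ in range(m):
    # u2: t with n+1 d.f., location xbar, scale s_n/sqrt(n+1).
    theta = draw_t(n + 1, xbar, math.sqrt(s2n / (n + 1)))
    # u1: inverse Gamma(shape n/2 + 1, scale (n/2){(theta - xbar)^2 + s_n^2}),
    # sampled as scale / Gamma(shape, 1).
    shape = n / 2.0 + 1.0
    scale = (n / 2.0) * ((theta - xbar) ** 2 + s2n)
    sig2 = scale / random.gammavariate(shape, 1.0)
    # The leftover factor of the posterior is the importance weight.
    wt = math.exp(-abs(theta)) * (sig2 / (1.0 + sig2)) ** 2
    num += theta * wt
    den += wt

print(num / den)   # approximate posterior mean of theta
```

Because u₁u₂ absorbs the dominant factors of the posterior, the weights stay bounded and the ratio estimate is stable.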

In some high-dimensional problems, a combination of numerical integration, the Laplace approximation, and Monte Carlo sampling seems to give appealing results. Delampady et al. (1993) use a Laplace-type approximation to obtain a suitable importance function in a high-dimensional problem. One area that we have not touched upon is how to generate random deviates from a given probability distribution. Clearly, this is a very important subject, being the basis of any Monte Carlo sampling technique. Instead of providing a sketchy discussion of this vast area, we refer the reader to


the excellent book by Robert and Casella (1999). We would, however, like to mention one recent and very important development in this area: the discovery of a very efficient algorithm that generates a sequence of uniform random deviates with the very large period of $2^{19937} - 1$. This algorithm, known as the Mersenne twister (MT), has many other desirable features as well, details of which may be found in Matsumoto and Nishimura (1998). The property of having a very large period is especially important because Monte Carlo simulation methods, and MCMC in particular, require very long sequences of random deviates for proper implementation.
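Python's standard random module, for instance, uses MT19937 as its core generator; seeding, or saving the internal state, makes the deviate stream fully reproducible:

```python
import random

# Python's `random` module uses the Mersenne twister (MT19937) as its core
# generator, with the period 2^19937 - 1 mentioned above.
gen = random.Random(2024)
state = gen.getstate()        # snapshot of the MT internal state

first = [gen.random() for _ in range(5)]
gen.setstate(state)           # rewind: the stream is fully reproducible
again = [gen.random() for _ in range(5)]

print(first == again)
```

Reproducibility of the stream is what makes long MCMC runs auditable and debuggable.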

7.4 Markov Chain Monte Carlo Methods

7.4.1 Introduction

A severe drawback of standard Monte Carlo sampling or Monte Carlo importance sampling is that complete determination of the functional form of the posterior density is needed for their implementation. Situations where posterior distributions are incompletely specified or are specified indirectly cannot be handled. One such instance is where the joint posterior distribution of the vector of parameters is specified in terms of several conditional and marginal distributions, but not directly. This covers a very large part of Bayesian analysis, because a lot of Bayesian modeling is hierarchical, so that the joint posterior is difficult to calculate, but the conditional posteriors given parameters at different levels of the hierarchy are easier to write down (and hence to sample from). For instance, consider the normal-Cauchy problem of Example 7.1. As shown later in Section 7.4.6, this problem can be given a hierarchical structure wherein we have the normal model, the conjugate normal prior in the first stage with a hyperparameter for its variance, and this hyperparameter again has a conjugate prior. Similarly, consider Example 7.2, where we have independent observations Xᵢ ~ Poisson(θᵢ). Now suppose the prior on the θᵢ's is a conjugate mixture. We again see (Problem 14) that a hierarchical prior structure can lead to analytically tractable conditional posteriors. It turns out that it is indeed possible in such cases to adopt an iterative Monte Carlo sampling scheme which, at the point of convergence, will guarantee a random draw from the target joint posterior distribution. These iterative Monte Carlo procedures typically generate a random sequence with the Markov property, such that this Markov chain is ergodic with the limiting distribution being the target posterior distribution. There is actually a whole class of such iterative procedures, collectively called Markov chain Monte Carlo (MCMC) procedures.
Different procedures from this class are suitable for different situations. As mentioned above, convergence of a random sequence with the Markov property is being utilized in these procedures, and hence some basic understanding of Markov chains is required. This material is presented below. This


discussion as well as the following sections are mainly based on Athreya et al. (2003).

7.4.2 Markov Chains in MCMC

A sequence of random variables {Xₙ}_{n≥0} is a Markov chain if for any n, given the current value Xₙ, the past {Xⱼ : j ≤ n − 1} and the future {Xⱼ : j ≥ n + 1} are independent. In other words,

$$P(A \cap B \mid X_n) = P(A \mid X_n)\, P(B \mid X_n), \qquad (7.10)$$

where A and B are events defined in terms of the past and the future, respectively. Among Markov chains there is a subclass with wide applicability: Markov chains with time-homogeneous or stationary transition probabilities, meaning that the probability distribution of X_{n+1} given Xₙ = x and the past Xⱼ : j ≤ n − 1 depends only on x, and does not depend on the values of Xⱼ : j ≤ n − 1 or on n. If the set S of values {Xₙ} can take, known as the state space, is countable, this reduces to specifying the transition probability matrix P = ((p_{ij})), where for any two values i, j in S, p_{ij} is the probability that X_{n+1} = j given Xₙ = i, i.e., of moving from state i to state j in one time unit. For a state space S that is not countable, one has to specify a transition kernel or transition function P(x, ·), where P(x, A) is the probability of moving from x into A in one step, i.e., P(X_{n+1} ∈ A | Xₙ = x). Given the transition probability and the probability distribution of the initial value X₀, one can construct the joint probability distribution of {Xⱼ : 0 ≤ j ≤ n} for any finite n. For example, in the countable state space case,

$$P(X_0 = i_0, X_1 = i_1, \ldots, X_{n-1} = i_{n-1}, X_n = i_n)$$
$$= P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1})\; P(X_0 = i_0, X_1 = i_1, \ldots, X_{n-1} = i_{n-1})$$
$$= p_{i_{n-1} i_n}\, P(X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) = P(X_0 = i_0)\, p_{i_0 i_1}\, p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.$$

A probability distribution π is called stationary or invariant for a transition probability P, or for the associated Markov chain {Xₙ}, if whenever the probability distribution of X₀ is π, the same is true for Xₙ for all n ≥ 1. Thus, in the countable state space case, a probability distribution π = {πᵢ : i ∈ S} is stationary for a transition probability matrix P if for each j in S,

$$P(X_1 = j) = \sum_i P(X_1 = j \mid X_0 = i)\, P(X_0 = i) = \sum_i \pi_i\, p_{ij} = \pi_j. \qquad (7.11)$$


In vector notation, with π = (π₁, π₂, ...), this says that π is a left eigenvector of the matrix P with eigenvalue 1:

$$\pi = \pi P. \qquad (7.12)$$

Similarly, if S is a continuum, a probability distribution π with density p(x) is stationary for the transition kernel P(·, ·) if

$$\pi(A) = \int_S P(x, A)\, p(x)\, dx$$

for all A ⊂ S. A Markov chain {Xₙ} with a countable state space S and transition probability matrix P = ((p_{ij})) is said to be irreducible if for any two states i and j, the probability of the Markov chain visiting j starting from i is positive, i.e., for some n ≥ 1, $p_{ij}^{(n)} \equiv P(X_n = j \mid X_0 = i) > 0$. A similar notion of irreducibility, known as Harris or Doeblin irreducibility, exists for the general state space case also. For details on this somewhat advanced notion, as well as other results that we state here without proof, see Robert and Casella (1999) or Meyn and Tweedie (1993). In addition, Tierney (1994) and Athreya et al. (1996) may be used as more advanced references on irreducibility and MCMC. In particular, the last reference uses the fact that there is a stationary distribution of the Markov chain, namely the joint posterior, and thus provides better and very explicit conditions for the MCMC to converge.

Theorem 7.6. (Law of large numbers for Markov chains) Let {Xₙ}_{n≥0} be a Markov chain with a countable state space S and a transition probability matrix P. Further, suppose it is irreducible and has a stationary probability distribution π = (πᵢ : i ∈ S) as defined in (7.11). Then, for any bounded function h : S → R and for any initial distribution of X₀,

$$\frac{1}{n}\sum_{i=0}^{n-1} h(X_i) \longrightarrow \sum_j h(j)\,\pi_j \qquad (7.13)$$

in probability as n → ∞.

A similar law of large numbers (LLN) holds when the state space S is not countable. The limit value in (7.13) will then be the integral of h with respect to the stationary distribution π. A sufficient condition for the validity of this LLN is that the Markov chain {Xₙ} be Harris irreducible and have a stationary distribution π. To see how this is useful to us, consider the following. Given a probability distribution π on a set S and a function h on S, suppose it is desired to compute the "integral of h with respect to π," which reduces to Σⱼ h(j)πⱼ in the countable case. Look for an irreducible Markov chain {Xₙ} with state space S and stationary distribution π. Then, starting from some initial value


X₀, run the Markov chain {Xⱼ} for a period of time, say 0, 1, 2, ..., n − 1, and consider the estimate

$$\hat\mu_n = \frac{1}{n}\sum_{j=0}^{n-1} h(X_j). \qquad (7.14)$$

By the LLN (7.13), this estimate μ̂ₙ will be close to Σⱼ h(j)πⱼ for large n. This technique is called Markov chain Monte Carlo (MCMC). For example, if one is interested in π(A) = Σ_{j∈A} πⱼ for some A ⊂ S, then by the LLN (7.13) this reduces to

$$\hat\pi_n(A) = \frac{1}{n}\sum_{j=0}^{n-1} I_A(X_j) \longrightarrow \pi(A)$$

in probability as n → ∞, where I_A(Xⱼ) = 1 if Xⱼ ∈ A and 0 otherwise.

An irreducible Markov chain {Xₙ} with a countable state space S is called aperiodic if for some i ∈ S the greatest common divisor satisfies g.c.d.{n : p_{ii}^{(n)} > 0} = 1. Then, in addition to the LLN (7.13), the following result on the convergence of P(Xₙ = j) holds:

$$\sum_j \left|P(X_n = j) - \pi_j\right| \longrightarrow 0 \qquad (7.15)$$

as n → ∞, for any initial distribution of X₀. In other words, for large n the probability distribution of Xₙ will be close to π. There exists a result similar to (7.15) for the general state space case asserting that, under suitable conditions, the probability distribution of Xₙ will be close to π as n → ∞. This suggests that instead of doing one run of length n, one could do N independent runs each of length m, so that n = Nm, and then from the ith run use only the mth observation, say X_{m,i}, and consider the estimate

$$M_{N,m} = \frac{1}{N}\sum_{i=1}^{N} h(X_{m,i}). \qquad (7.16)$$

Other variations exist as well. Some of the special Markov chains used in MCMC are discussed in the next two sections.
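As a toy illustration of the LLN (7.13), the occupation frequencies of a two-state chain converge to its stationary distribution; the transition matrix below is an arbitrary choice:

```python
import random

random.seed(3)

# A two-state Markov chain with transition matrix
#   P = [[0.9, 0.1],
#        [0.4, 0.6]]
# has stationary distribution pi = (0.8, 0.2), since pi P = pi.
P = [[0.9, 0.1], [0.4, 0.6]]

n = 200_000
x = 0
visits = [0, 0]
for _ in range(n):
    visits[x] += 1
    # Move to state 0 with probability P[x][0], else to state 1.
    x = 0 if random.random() < P[x][0] else 1

pi_hat = [v / n for v in visits]
print(pi_hat)   # close to [0.8, 0.2] by the LLN (7.13)
```

Here h = I_{\{0\}} and the estimate is exactly π̂ₙ(A) of the display above, with A = {0}.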

7.4.3 Metropolis-Hastings Algorithm

In this section, we discuss a very general MCMC method with wide applications. It will soon become clear why this important discovery has led to very considerable progress in simulation-based inference, particularly in Bayesian analysis. The idea here is not to simulate directly from the given target density (which may be computationally very difficult) at all, but to simulate an easy Markov chain that has this target density as the density of its stationary distribution. We begin with a somewhat abstract setting but will very soon get to practical implementation.


Let S be a finite or countable set. Let π be a probability distribution on S. We shall call π the target distribution. (There is room for slight confusion here, because in our applications the target distribution will always be the posterior distribution; let us note that π here does not denote the prior distribution, but is just standard notation for a generic target.) Let Q = ((q_{ij})) be a transition probability matrix such that for each i, it is computationally easy to generate a sample from the distribution {q_{ij} : j ∈ S}. We generate a Markov chain {Xₙ} as follows. If Xₙ = i, first sample from the distribution {q_{ij} : j ∈ S} and denote that observation Yₙ. Then, choose X_{n+1} from the two values Xₙ and Yₙ according to

$$P(X_{n+1} = Y_n \mid X_n, Y_n) = \rho(X_n, Y_n),$$
$$P(X_{n+1} = X_n \mid X_n, Y_n) = 1 - \rho(X_n, Y_n), \qquad (7.17)$$

where the "acceptance probability" ρ(·, ·) is given by

$$\rho(i, j) = \min\left\{\frac{\pi_j\, q_{ji}}{\pi_i\, q_{ij}},\; 1\right\} \qquad (7.18)$$

for all (i, j) such that πᵢ q_{ij} > 0. Note that {Xₙ} is a Markov chain with transition probability matrix P = ((p_{ij})) given by

$$p_{ij} = \begin{cases} q_{ij}\,\rho(i, j), & j \neq i,\\[2pt] 1 - \sum_{k \neq i} p_{ik}, & j = i. \end{cases} \qquad (7.19)$$

Q is called the "proposal transition probability" and ρ the "acceptance probability." A significant feature of this transition mechanism is that P and π satisfy

$$\pi_i\, p_{ij} = \pi_j\, p_{ji} \quad\text{for all } i, j. \qquad (7.20)$$

This implies that for any j,

$$\sum_i \pi_i\, p_{ij} = \sum_i \pi_j\, p_{ji} = \pi_j, \qquad (7.21)$$

i.e., π is a stationary probability distribution for P. Now assume that S is irreducible with respect to Q and that πᵢ > 0 for all i in S. It can then be shown that P is irreducible, and because it has a stationary distribution π, the LLN (7.13) is available. This algorithm is thus a very flexible and useful one. The choice of Q is subject only to the condition that S is irreducible with respect to Q. Clearly, it is no loss of generality to assume that πᵢ > 0 for all i in S. A sufficient condition for the aperiodicity of P is that p_{ii} > 0 for some i, or equivalently,

$$\sum_{j \neq i} q_{ij}\,\rho(i, j) < 1 \quad\text{for some } i.$$


A sufficient condition for this is that there exists a pair (i, j) such that πᵢ q_{ij} > 0 and πⱼ q_{ji} < πᵢ q_{ij}. Recall that if P is aperiodic, then both the LLN (7.13) and (7.15) hold. If S is not finite or countable but is a continuum, and the target distribution π(·) has a density p(·), then one proceeds as follows. Let Q be a transition function such that for each x, Q(x, ·) has a density q(x, y). Then proceed as in the discrete case, but set the "acceptance probability" ρ(x, y) to be

$$\rho(x, y) = \min\left\{\frac{p(y)\, q(y, x)}{p(x)\, q(x, y)},\; 1\right\}$$

for all (x, y) such that p(x)q(x, y) > 0. A particularly useful feature of the above algorithm is that it is enough to know p(·) up to a multiplicative constant, because in the definition of the "acceptance probability" ρ(·, ·), only the ratios p(y)/p(x) need to be calculated. (In the discrete case, it is enough to know {πᵢ} up to a multiplicative constant, because the "acceptance probability" ρ(·, ·) needs only the ratios πᵢ/πⱼ.) This assures us that in Bayesian applications it is not necessary to have the normalizing constant of the posterior density available for computation of the posterior quantities of interest.
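A minimal random-walk Metropolis-Hastings sketch, targeting an unnormalized density p(x) proportional to exp(−x²/2), i.e., a standard normal known only up to a constant; the proposal scale, run length, and seed are illustrative choices:

```python
import math, random

random.seed(5)

# Random-walk Metropolis-Hastings for p(x) proportional to exp(-x^2/2).
# Only the ratio p(y)/p(x) enters the acceptance probability.
def log_p(x):
    return -0.5 * x * x

n = 100_000
x = 0.0
samples = []
accepted = 0
for _ in range(n):
    y = x + random.gauss(0.0, 1.0)   # symmetric proposal: q(x, y) = q(y, x)
    dlp = log_p(y) - log_p(x)
    if dlp >= 0 or random.random() < math.exp(dlp):
        x = y                         # accept with prob min{p(y)/p(x), 1}
        accepted += 1
    samples.append(x)

mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
print(mean, var, accepted / n)
```

With a symmetric proposal the q-ratio cancels, so the acceptance probability reduces to min{p(y)/p(x), 1}, computed here on the log scale for numerical stability.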

7.4.4 Gibbs Sampling

As was pointed out in Chapter 2, most of the new problems that Bayesians are asked to solve are high-dimensional. Applications to areas such as micro-arrays and image processing are some examples. Bayesian analysis of such problems invariably involves target (posterior) distributions that are high-dimensional multivariate distributions. In image processing, for example, one typically has an N × N square grid of pixels with N = 256, and each pixel has k ≥ 2 possible values. Thus each configuration has (256)² components, and the state space S has $k^{(256)^2}$ configurations. To simulate a random configuration from a target distribution over such a large S is not an easy task. The Gibbs sampler is a technique especially suitable for generating an irreducible aperiodic Markov chain that has as its stationary distribution a target distribution in a high-dimensional space having some special structure. The most interesting aspect of this technique is that to run this Markov chain, it suffices to generate observations from univariate distributions.

The Gibbs sampler in the context of a bivariate probability distribution can be described as follows. Let π be the target probability distribution of a bivariate random vector (X, Y). For each x, let P(x, ·) be the conditional probability distribution of Y given X = x. Similarly, let Q(y, ·) be the conditional probability distribution of X given Y = y. Note that for each x, P(x, ·) is a univariate distribution, and for each y, Q(y, ·) is also a univariate distribution. Now generate a bivariate Markov chain Zₙ = (Xₙ, Yₙ) as follows. Start with some X₀ = x₀. Generate an observation Y₀ from the distribution P(x₀, ·). Then generate an observation X₁ from Q(Y₀, ·). Next generate an


observation Y₁ from P(X₁, ·), and so on. At stage n, if Zₙ = (Xₙ, Yₙ) is known, then generate X_{n+1} from Q(Yₙ, ·) and Y_{n+1} from P(X_{n+1}, ·). If π is a discrete distribution concentrated on {(xᵢ, yⱼ) : 1 ≤ i ≤ K, 1 ≤ j ≤ L} and if π_{ij} = π(xᵢ, yⱼ), then

$$P(x_i, y_j) = \frac{\pi_{ij}}{\pi_{i\cdot}} \quad\text{and}\quad Q(y_j, x_i) = \frac{\pi_{ij}}{\pi_{\cdot j}},$$

where π_{i·} = Σⱼ π_{ij} and π_{·j} = Σᵢ π_{ij}. Thus the transition probability matrix R = ((r_{(ij),(kℓ)})) for the {Zₙ} chain is given by

$$r_{(ij),(k\ell)} = Q(y_j, x_k)\, P(x_k, y_\ell) = \frac{\pi_{kj}}{\pi_{\cdot j}} \cdot \frac{\pi_{k\ell}}{\pi_{k\cdot}}.$$

It can be verified that this chain is irreducible, aperiodic, and has π as its stationary distribution. Thus the LLN (7.13) and (7.15) hold in this case. For large n, Zₙ can therefore be viewed as a sample from a distribution that is close to π, and one can approximate Σ_{i,j} h(i, j)π_{ij} by $\sum_{i=1}^{n} h(X_i, Y_i)/n$. As an illustration, consider sampling from

$$\begin{pmatrix} X \\ Y \end{pmatrix} \sim N_2\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix},\; \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right).$$

Note that the conditional distribution of X given Y = y and that of Y given X = x are

$$X \mid Y = y \sim N(\rho y, 1 - \rho^2) \quad\text{and}\quad Y \mid X = x \sim N(\rho x, 1 - \rho^2). \qquad (7.22)$$

Using this property, Gibbs sampling proceeds as described below to generate (Xₙ, Yₙ), n = 0, 1, 2, ..., by starting from an arbitrary value x₀ for X₀ and repeating the following steps for i = 0, 1, ..., n.

1. Given xᵢ for X, draw a random deviate from N(ρxᵢ, 1 − ρ²) and denote it by yᵢ.
2. Given yᵢ for Y, draw a random deviate from N(ρyᵢ, 1 − ρ²) and denote it by x_{i+1}.

The theory of Gibbs sampling tells us that if n is large, then (xₙ, yₙ) is a random draw from a distribution that is close to the target bivariate normal distribution above.
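Steps 1 and 2 can be sketched as follows; ρ, the run length, and the seed are illustrative choices:

```python
import math, random

random.seed(42)

# Gibbs sampler for the bivariate normal with means 0, variances 1, and
# correlation rho, alternating the two full conditionals (7.22).
rho = 0.6
sd = math.sqrt(1.0 - rho * rho)

n = 100_000
x, xs, ys = 0.0, [], []
for _ in range(n):
    y = random.gauss(rho * x, sd)   # step 1: Y | X = x
    x = random.gauss(rho * y, sd)   # step 2: X | Y = y
    xs.append(x)
    ys.append(y)

mx = sum(xs) / n
my = sum(ys) / n
cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n
print(mx, my, cov)   # approximately 0, 0, rho
```

Only univariate normal draws are needed, yet the chain's long-run output recovers the bivariate target's means and correlation.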

To see why the Gibbs sampler works here, recall that a sufficient condition for the LLN (7.13) and the limit result (7.15) is that an appropriate irreducibility condition holds and a stationary distribution exists. From steps 1 and 2 above and using (7.22), one has

$$Y_i = \rho X_i + \sqrt{1 - \rho^2}\,\eta_i \quad\text{and}\quad X_{i+1} = \rho Y_i + \sqrt{1 - \rho^2}\,\xi_i,$$

where ηᵢ and ξᵢ are independent standard normal random variables, independent of Xᵢ. Thus the sequence {Xᵢ} satisfies the stochastic difference equation

$$X_{i+1} = \rho^2 X_i + U_{i+1}, \quad\text{where } U_{i+1} = \rho\sqrt{1 - \rho^2}\,\eta_i + \sqrt{1 - \rho^2}\,\xi_i.$$

Because ηᵢ and ξᵢ are independent N(0, 1) random variables, U_{i+1} is also a normally distributed random variable, with mean 0 and variance ρ²(1 − ρ²) + (1 − ρ²) = 1 − ρ⁴. Also, the {Uᵢ}_{i≥1} being i.i.d. makes {Xᵢ}_{i≥0} a Markov chain. It turns out that the irreducibility condition holds here. Turning to stationarity, note that if X₀ is an N(0, 1) random variable, then X₁ = ρ²X₀ + U₁ is also an N(0, 1) random variable, because the variance of X₁ is ρ⁴ + 1 − ρ⁴ = 1 and the mean of X₁ is 0. This makes the standard N(0, 1) distribution a stationary distribution for {Xₙ}.

The multivariate extension of the above bivariate case is very straightforward. Suppose π is a probability distribution of a k-dimensional random vector (X₁, X₂, ..., X_k). If u = (u₁, u₂, ..., u_k) is any k-vector, let u_{−i} = (u₁, u₂, ..., u_{i−1}, u_{i+1}, ..., u_k) be the (k − 1)-dimensional vector resulting from dropping the ith component uᵢ. Let πᵢ(·|x_{−i}) denote the univariate conditional distribution of Xᵢ given that X_{−i} = (X₁, ..., X_{i−1}, X_{i+1}, ..., X_k) = x_{−i}. Now, starting with some initial value X₀ = (x₀₁, x₀₂, ..., x₀ₖ), generate X₁ = (X₁₁, X₁₂, ..., X₁ₖ) sequentially by generating X₁₁ according to the univariate distribution π₁(·|x₀,₋₁), then generating X₁₂ according to π₂(·|(X₁₁, x₀₃, x₀₄, ..., x₀ₖ)), and so on. The most important feature to recognize here is that all the univariate conditional distributions Xᵢ|X₋ᵢ = x₋ᵢ, known as full conditionals, should easily allow sampling from them. This turns out to be the case in most hierarchical Bayes problems. Thus, the Gibbs sampler is particularly well adapted for Bayesian computations with hierarchical priors. This was the motivation for some vigorous initial development of Gibbs sampling, as can be seen in Gelfand and Smith (1990). The Gibbs sampler can be justified without showing that it is a special case of the Metropolis-Hastings algorithm. Even if it is considered a special case, it still has special features that need recognition.
One such feature is that the full conditionals have sufficient information to uniquely determine a multivariate joint distribution. This is the famous Hammersley-Clifford theorem. The following condition, introduced by Besag (1974), is needed to state this result.

Definition 7.7. Let p(y₁, ..., y_k) be the joint density of a random vector Y = (Y₁, ..., Y_k), and let p^{(i)}(yᵢ) denote the marginal density of Yᵢ, i = 1, ..., k. If p^{(i)}(yᵢ) > 0 for every i = 1, ..., k implies that p(y₁, ..., y_k) > 0, then the joint density p is said to satisfy the positivity condition. Let us use the notation pᵢ(yᵢ|y₁, ..., y_{i−1}, y_{i+1}, ..., y_k) for the conditional density of Yᵢ|Y₋ᵢ = y₋ᵢ.

Theorem 7.8. (Hammersley-Clifford) Under the positivity condition, the joint density p satisfies

$$p(y_1, \ldots, y_k) = \prod_{i=1}^{k} \frac{p_i(y_i \mid y_1, \ldots, y_{i-1}, y'_{i+1}, \ldots, y'_k)}{p_i(y'_i \mid y_1, \ldots, y_{i-1}, y'_{i+1}, \ldots, y'_k)}\; p(y'_1, \ldots, y'_k)$$

for every y and y' in the support of p.

Proof. For y and y' in the support of p,

$$p(y_1, \ldots, y_k) = p_k(y_k \mid y_1, \ldots, y_{k-1})\, p(y_1, \ldots, y_{k-1})$$
$$= \frac{p_k(y_k \mid y_1, \ldots, y_{k-1})}{p_k(y'_k \mid y_1, \ldots, y_{k-1})}\, p(y_1, \ldots, y_{k-1}, y'_k)$$
$$= \frac{p_k(y_k \mid y_1, \ldots, y_{k-1})}{p_k(y'_k \mid y_1, \ldots, y_{k-1})} \cdot \frac{p_{k-1}(y_{k-1} \mid y_1, \ldots, y_{k-2}, y'_k)}{p_{k-1}(y'_{k-1} \mid y_1, \ldots, y_{k-2}, y'_k)}\, p(y_1, \ldots, y_{k-2}, y'_{k-1}, y'_k)$$
$$= \cdots = \prod_{i=1}^{k} \frac{p_i(y_i \mid y_1, \ldots, y_{i-1}, y'_{i+1}, \ldots, y'_k)}{p_i(y'_i \mid y_1, \ldots, y_{i-1}, y'_{i+1}, \ldots, y'_k)}\; p(y'_1, \ldots, y'_k). \qquad \square$$

It can be shown also that under the positivity condition, the Gibbs sampler generates an irreducible Markov chain, thus providing the necessary convergence properties without recourse to the M-H algorithm. Additional conditions are, however, required to extend the above theorem to the non-positive case, details of which may be found in Robert and Casella (1999).
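The Hammersley-Clifford identity can be checked numerically on a small discrete case; the 2 × 2 joint distribution below is an arbitrary choice satisfying the positivity condition:

```python
# Check the Hammersley-Clifford identity (k = 2) on a small 2x2 joint
# distribution p(y1, y2) with strictly positive probabilities.
p = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def cond1(y1, y2):
    # p_1(y1 | y2)
    return p[(y1, y2)] / (p[(0, y2)] + p[(1, y2)])

def cond2(y2, y1):
    # p_2(y2 | y1)
    return p[(y1, y2)] / (p[(y1, 0)] + p[(y1, 1)])

y, yp = (1, 0), (0, 1)   # arbitrary points y and y' in the support
# k = 2 version of the identity:
# p(y1, y2) = p(y1', y2') * [p1(y1|y2')/p1(y1'|y2')] * [p2(y2|y1)/p2(y2'|y1)]
rhs = (p[yp]
       * cond1(y[0], yp[1]) / cond1(yp[0], yp[1])
       * cond2(y[1], y[0]) / cond2(yp[1], y[0]))
print(abs(p[y] - rhs) < 1e-12)
```

The joint probability is recovered exactly from the two full conditionals and a single reference point, which is the content of the theorem.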

7.4.5 Rao-Blackwellization

The variance reduction idea of the famous Rao-Blackwell theorem in the presence of auxiliary information can be used to provide improved estimators when MCMC procedures are adopted. Let us first recall this theorem.

Theorem 7.9. (Rao-Blackwell theorem) Let δ(X₁, X₂, ..., Xₙ) be an estimator of θ with finite variance. Suppose that T is sufficient for θ, and let δ*(T), defined by δ*(t) = E(δ(X₁, X₂, ..., Xₙ)|T = t), be the conditional expectation of δ(X₁, X₂, ..., Xₙ) given T = t. Then

$$E\left(\delta^*(T) - \theta\right)^2 \le E\left(\delta(X_1, X_2, \ldots, X_n) - \theta\right)^2.$$

The inequality is strict unless δ = δ*, or equivalently, unless δ is already a function of T.

Proof. By the property of iterated conditional expectation,

$$E\left(\delta^*(T)\right) = E\left\{E(\delta \mid T)\right\} = E(\delta),$$

so δ and δ* have the same expectation, and hence the same bias. Therefore, to compare the mean squared errors (MSE) of the two estimators, we need to compare their variances only. Now,


$$\operatorname{Var}(\delta(X_1, X_2, \ldots, X_n)) = \operatorname{Var}\left[E(\delta \mid T)\right] + E\left[\operatorname{Var}(\delta \mid T)\right] = \operatorname{Var}(\delta^*) + E\left[\operatorname{Var}(\delta \mid T)\right] > \operatorname{Var}(\delta^*),$$

unless Var(δ|T) = 0, which is the case only if δ itself is a function of T. □

The Rao-Blackwell theorem involves two key steps: variance reduction by conditioning, and conditioning on a sufficient statistic. The first step is based on the analysis of variance formula: for any two random variables S and T, because Var(S) = Var(E(S|T)) + E(Var(S|T)), one can reduce the variance of a random variable S by taking its conditional expectation given some auxiliary information T. This can be exploited in MCMC. Let (Xⱼ, Yⱼ), j = 1, 2, ..., N, be the data generated by a single run of the Gibbs sampler algorithm with the target distribution of a bivariate random vector (X, Y). Let h(X) be a function of the X component of (X, Y) and let its mean value be μ. Suppose the goal is to estimate μ. A first estimate is the sample mean of the h(Xⱼ), j = 1, 2, ..., N. From the MCMC theory, it can be shown that as N → ∞, this estimate will converge to μ in probability. The computation of the variance of this estimator is not easy, due to the (Markovian) dependence of the sequence {Xⱼ, j = 1, 2, ..., N}. Now suppose we make n independent runs of the Gibbs sampler and generate (X_{ij}, Y_{ij}), j = 1, 2, ..., N; i = 1, 2, ..., n. Suppose also that N is sufficiently large that (X_{iN}, Y_{iN}) can be regarded as a sample from the limiting target distribution of the Gibbs sampling scheme. Thus (X_{iN}, Y_{iN}), i = 1, 2, ..., n, are i.i.d. and hence form a random sample from the target distribution. Then one can offer a second estimate of μ: the sample mean of h(X_{iN}), i = 1, 2, ..., n. This estimator ignores a good part of the MCMC data, but has the advantage that the variables h(X_{iN}), i = 1, 2, ..., n, are independent, and hence the variance of their mean is of order n⁻¹. Now, applying the variance reduction idea of the Rao-Blackwell theorem by using the auxiliary information Y_{iN}, i = 1, 2, ..., n, one can improve this estimator as follows. Let k(y) = E(h(X)|Y = y). Then for each i, k(Y_{iN}) has a smaller variance than h(X_{iN}), and hence the following third estimator,

$$\frac{1}{n}\sum_{i=1}^{n} k(Y_{iN}),$$

has a smaller variance than the second one. A crucial fact to keep in mind here is that the exact functional form of k(y) must be available for implementing this improvement.
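A sketch of this comparison in the bivariate normal Gibbs setting of the previous section, where k(y) = E(X|Y = y) = ρy is available in closed form; ρ, N, and the number of runs are illustrative choices:

```python
import math, random

random.seed(9)

# Rao-Blackwellization with bivariate normal Gibbs runs: estimate
# mu = E(X) = 0 from n independent runs, using either h(X_N) = X_N or
# k(Y_N) = E(X | Y = Y_N) = rho * Y_N.  Var k(Y) = rho^2 < 1 = Var h(X).
rho = 0.9
sd = math.sqrt(1.0 - rho * rho)

def gibbs_run(length=50):
    x = y = 0.0
    for _ in range(length):
        y = random.gauss(rho * x, sd)   # Y | X = x
        x = random.gauss(rho * y, sd)   # X | Y = y
    return x, y

reps = 2_000
runs = [gibbs_run() for _ in range(reps)]
xs = [r[0] for r in runs]               # h(X_N) = X_N
ks = [rho * r[1] for r in runs]         # k(Y_N) = rho * Y_N

var_h = sum(v * v for v in xs) / reps   # both estimators target mu = 0
var_k = sum(v * v for v in ks) / reps
print(var_h, var_k)
```

Both columns estimate μ = 0 unbiasedly, but the Rao-Blackwellized version has variance ρ² instead of 1, for free, since k(y) = ρy is known exactly.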


7.4.6 Examples

Example 7.10. (Example 7.1 continued.) Recall that X|θ ~ N(θ, σ²) with known σ² and θ ~ Cauchy(μ, τ). The task is to simulate θ from the posterior distribution, but we have already noted that sampling directly from the posterior distribution is difficult. What facilitates Gibbs sampling here is the result that the Student's t density, of which the Cauchy is a special case, is a scale mixture of normal densities, with the scale parameter having a Gamma distribution (see Section 2.7.2, Jeffreys test). Specifically,

$$\pi(\theta) = \int_0^\infty N\!\left(\theta \mid \mu, \tau^2/\lambda\right)\operatorname{Gamma}\!\left(\lambda \mid 1/2, 1/2\right) d\lambda,$$

so that π(θ) may be considered the marginal prior density from the joint prior density of (θ, λ), where θ|λ ~ N(μ, τ²/λ) and λ ~ Gamma(1/2, 1/2).

It can be noted that this leads to an implicit hierarchical prior structure with λ being the hyperparameter. Consequently, π(θ|x) may be treated as the marginal density from π(θ, λ|x). Now note that the full conditionals of π(θ, λ|x) are standard distributions from which sampling is easy. In particular,

$$\theta \mid \lambda, x \sim N\left(\frac{\tau^2 x + \lambda\sigma^2\mu}{\tau^2 + \lambda\sigma^2},\; \frac{\sigma^2\tau^2}{\tau^2 + \lambda\sigma^2}\right), \qquad (7.23)$$

$$\lambda \mid \theta, x \sim \operatorname{Gamma}\left(1,\; \frac{1}{2}\left\{1 + \frac{(\theta - \mu)^2}{\tau^2}\right\}\right). \qquad (7.24)$$

Thus, the Gibbs sampler will use (7.23) and (7.24) to generate (θ, λ) from π(θ, λ|x).
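A sketch of this Gibbs sampler for the normal-Cauchy model, alternating the two full conditionals; the values x = 2, μ = 0, σ = τ = 1 and the run length are illustrative assumptions, and Gamma(1, rate) draws are exponential:

```python
import math, random

random.seed(13)

# Gibbs sampler for X | theta ~ N(theta, sigma^2), theta ~ Cauchy(mu, tau),
# via the scale-mixture representation with hyperparameter lambda.
x, mu, sigma, tau = 2.0, 0.0, 1.0, 1.0   # illustrative values

n = 50_000
theta, thetas = x, []
for _ in range(n):
    # lambda | theta, x ~ Gamma(1, rate), i.e., Exponential(rate).
    rate = 0.5 * (1.0 + (theta - mu) ** 2 / tau**2)
    lam = random.expovariate(rate)
    # theta | lambda, x ~ N(precision-weighted mean, combined variance).
    prec = 1.0 / sigma**2 + lam / tau**2
    mean = (x / sigma**2 + lam * mu / tau**2) / prec
    theta = random.gauss(mean, math.sqrt(1.0 / prec))
    thetas.append(theta)

print(sum(thetas) / n)   # approximate posterior mean E(theta | x)
```

The chain never touches the awkward normal-times-Cauchy density directly; each sweep needs only one exponential and one normal draw.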

Example 7.11. Consider the following example due to Casella and George, given in Arnold (1993). Suppose we are studying the distribution of the number of defectives X in the daily production of a product. Consider the model (X | Y, θ) ~ binomial(Y, θ), where Y, a day's production, is a random variable with a Poisson distribution with known mean λ, and θ is the probability that any product is defective. The difficulty, however, is that Y is not observable, and inference has to be made on the basis of X only. The prior distribution is such that (θ | Y = y) ~ Beta(α, γ), with known α and γ, independent of Y. Bayesian analysis here is not a particularly difficult problem, because the posterior distribution of θ|X = x can be obtained as follows. First, note that X|θ ~ Poisson(λθ). Next, θ ~ Beta(α, γ). Therefore,

$$\pi(\theta \mid x) \propto \exp(-\lambda\theta)\,\theta^{x+\alpha-1}(1 - \theta)^{\gamma-1}, \quad 0 \le \theta \le 1. \qquad (7.25)$$


The only difficulty is that this is not a standard distribution, and hence posterior quantities cannot be obtained in closed form. Numerical integration is quite simple to perform with this density. However, Gibbs sampling provides an excellent alternative. Instead of focusing on θ|X directly, view it as a marginal component of (Y, θ | X). It can be immediately checked that the full conditionals of this are given by Y|X = x, θ ~ x + Poisson(λ(1 − θ)) and θ|X = x, Y = y ~ Beta(α + x, γ + y − x), both of which are standard distributions.

Example 7.12. (Example 7.11 continued.) It is actually possible here to sample from the posterior distribution using what is known as the accept-reject Monte Carlo method. This widely applicable method operates as follows. Let g(x)/K be the target density, where K is the possibly unknown normalizing constant of the unnormalized density g. Suppose h(x) is a density that can be simulated by a known method and is close to g, and suppose there exists a known constant c > 0 such that g(x) < ch(x) for all x. Then, to simulate from the target density, the following two steps suffice. (See Robert and Casella (1999) for details.)

Step 1. Generate Y ~ h and U ~ U(0, 1);
Step 2. Accept X = Y if U ≤ g(Y)/{ch(Y)}; return to Step 1 otherwise.

The optimal choice for c is sup{g(x)/h(x)}, but even this choice may result in an undesirably large number of rejections. In our example, from (7.25),

g(\theta) = \exp(-\lambda\theta)\,\theta^{x+\alpha-1}(1-\theta)^{\gamma-1}\, I\{0 \le \theta \le 1\},

so that h(\theta) may be chosen to be the density of \text{Beta}(x + \alpha, \gamma). Then, with the above-mentioned choice for c, if \theta \sim \text{Beta}(x + \alpha, \gamma) is generated in Step 1, its 'acceptance probability' in Step 2 is simply \exp(-\lambda\theta). Even though this method can be employed here, we, however, would like to use this technique to illustrate the Metropolis-Hastings algorithm. The required Markov chain is generated by taking the transition density q(z, y) = q(y \mid z) = h(y), independently of z. Then the acceptance probability is

\rho(z, y) = \min\left\{\frac{g(y)\,h(z)}{g(z)\,h(y)},\ 1\right\} = \min\{\exp(-\lambda(y - z)),\ 1\}.

Thus the steps involved in this "independent" M-H algorithm are as follows. Start at t = 0 with a value x_0 in the support of the target distribution; in this case, 0 < x_0 < 1. Given x_t, generate the next value in the chain as given below.
(a) Draw y_t from \text{Beta}(x + \alpha, \gamma).
(b) Let x_{t+1} = y_t with probability \rho_t, and x_{t+1} = x_t otherwise,

7.4 Markov Chain Monte Carlo Methods


where \rho_t = \min\{\exp(-\lambda(y_t - x_t)),\ 1\}.
(c) Set t = t + 1 and go to step (a).

Run this chain until t = n, a suitably chosen large integer. Details on its convergence, as well as on why the independent M-H chain is more efficient than accept-reject Monte Carlo, can be found in Robert and Casella (1999). In our example, for x = 1, \alpha = 1, \gamma = 49, and \lambda = 100, we simulated such a Markov chain. The resulting frequency histogram is shown in Figure 7.1, with the true posterior density superimposed on it.

Example 7.13. In this example, we discuss the hierarchical Bayesian analysis of the usual one-way ANOVA. Consider the model

Y_{ij} = \theta_i + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma_i^2), \quad j = 1, \ldots, n_i; \ i = 1, \ldots, k,   (7.26)

and let the first stage prior on \theta_i and \sigma_i^2 be such that they are independent, with

\theta_i \ \text{i.i.d.} \ N(\mu_\pi, \sigma_\pi^2), \quad i = 1, \ldots, k; \qquad \sigma_i^2 \ \text{i.i.d.} \ \text{inverse Gamma}(a_1, b_1), \quad i = 1, \ldots, k.

The second stage prior on \mu_\pi and \sigma_\pi^2 is

\mu_\pi \sim N(\mu_0, \sigma_0^2), \qquad \sigma_\pi^2 \sim \text{inverse Gamma}(a_2, b_2),

independent of each other.

Fig. 7.1. M-H frequency histogram and true posterior density.
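The simulation behind Figure 7.1 is easy to reproduce. The sketch below is illustrative code (not the authors'), implementing the independent M-H chain of Example 7.12 with the values x = 1, \alpha = 1, \gamma = 49, \lambda = 100 used for the figure; the function name and defaults are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def independent_mh(x, alpha, gamma, lam, n_iter=20000, theta0=0.01):
    """Independent M-H for pi(theta | x) ∝ exp(-lam*theta) theta^(x+alpha-1) (1-theta)^(gamma-1),
    with Beta(x+alpha, gamma) proposal; acceptance probability is min{1, exp(-lam*(y - theta_t))}."""
    chain = np.empty(n_iter)
    theta = theta0
    for t in range(n_iter):
        y = rng.beta(x + alpha, gamma)            # proposal draw, independent of current state
        if rng.uniform() <= np.exp(-lam * (y - theta)):
            theta = y                             # accept the move
        chain[t] = theta
    return chain

# values used for Figure 7.1
chain = independent_mh(x=1, alpha=1, gamma=49, lam=100)
```

A histogram of `chain` (after discarding an initial burn-in) approximates the posterior density in (7.25).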


Here a_1, a_2, b_1, b_2, \mu_0, and \sigma_0^2 are all specified constants. Let us concentrate on computing the posterior distribution of \theta. Sufficiency reduces this to considering only

\bar{Y}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} Y_{ij}, \quad i = 1, \ldots, k, \qquad \text{and} \qquad S_i^2 = \sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_i)^2, \quad i = 1, \ldots, k.

From normal theory,

\bar{Y}_i \mid \theta, \sigma^2 \sim N(\theta_i, \sigma_i^2/n_i), \quad i = 1, \ldots, k,

which are independent and are also independent of

S_i^2 \mid \theta, \sigma^2 \sim \sigma_i^2\,\chi^2_{n_i - 1}, \quad i = 1, \ldots, k,

which again are independent. To utilize the Gibbs sampler, we need the full conditionals of \pi(\theta, \sigma^2, \mu_\pi, \sigma_\pi^2 \mid y). It can be noted that it is sufficient, and in fact advantageous, to consider the conditionals (i) \pi(\theta \mid \sigma^2, \mu_\pi, \sigma_\pi^2, y), (ii) \pi(\sigma^2 \mid \theta, \mu_\pi, \sigma_\pi^2, y), (iii) \pi(\mu_\pi \mid \sigma_\pi^2, \theta, \sigma^2, y), and (iv) \pi(\sigma_\pi^2 \mid \mu_\pi, \theta, \sigma^2, y), rather than considering the set of all univariate full conditionals, because of the special structure in this problem. First note that the \theta_i are conditionally independent given (\sigma^2, \mu_\pi, \sigma_\pi^2, y), and hence

\theta_i \mid \sigma^2, \mu_\pi, \sigma_\pi^2, y \sim N\left(\frac{n_i\bar{y}_i/\sigma_i^2 + \mu_\pi/\sigma_\pi^2}{n_i/\sigma_i^2 + 1/\sigma_\pi^2},\ \left(\frac{n_i}{\sigma_i^2} + \frac{1}{\sigma_\pi^2}\right)^{-1}\right), \quad i = 1, \ldots, k,   (7.27)

which determines (i). Next we note that, given \theta, from (7.26), S_i'^2 = \sum_{j=1}^{n_i}(Y_{ij} - \theta_i)^2 is sufficient for \sigma_i^2, and they are independently distributed. Thus we have S_i'^2 \mid \theta \sim \sigma_i^2\,\chi^2_{n_i}, independently, and the \sigma_i^2 are i.i.d. inverse Gamma(a_1, b_1), so that

\sigma_i^2 \mid \theta, \mu_\pi, \sigma_\pi^2, y \sim \text{inverse Gamma}\left(a_1 + \frac{n_i}{2},\ b_1 + \frac{S_i'^2}{2}\right),   (7.28)

and they are independent for i = 1, \ldots, k, which specifies (ii). Turning to the full conditional of \mu_\pi, we note from the hierarchical structure that the conditional distribution of \mu_\pi \mid \sigma_\pi^2, \theta, \sigma^2, y is the same as the conditional distribution of \mu_\pi \mid \sigma_\pi^2, \theta. To determine this distribution, note that

\theta_i \mid \mu_\pi, \sigma_\pi^2 \sim N(\mu_\pi, \sigma_\pi^2), \quad i = 1, \ldots, k, \ \text{i.i.d.}, \qquad \text{and} \qquad \mu_\pi \sim N(\mu_0, \sigma_0^2).

Therefore, treating \theta as a random sample from N(\mu_\pi, \sigma_\pi^2), so that \bar{\theta} = \sum_{i=1}^k \theta_i/k is sufficient for \mu_\pi, we obtain

\mu_\pi \mid \sigma_\pi^2, \theta \sim N\left(\frac{k\bar{\theta}/\sigma_\pi^2 + \mu_0/\sigma_0^2}{k/\sigma_\pi^2 + 1/\sigma_0^2},\ \left(\frac{k}{\sigma_\pi^2} + \frac{1}{\sigma_0^2}\right)^{-1}\right),

which provides (iii). Just as in the previous case, the conditional distribution of \sigma_\pi^2 \mid \mu_\pi, \theta, \sigma^2, y turns out to be the same as the conditional distribution of \sigma_\pi^2 \mid \mu_\pi, \theta. To obtain this, note again that \theta_i \mid \mu_\pi, \sigma_\pi^2 \sim N(\mu_\pi, \sigma_\pi^2) for i = 1, \ldots, k, i.i.d., so that this time \sum_{i=1}^k (\theta_i - \mu_\pi)^2 is sufficient for \sigma_\pi^2. Further,

\sum_{i=1}^k (\theta_i - \mu_\pi)^2 \mid \sigma_\pi^2 \sim \sigma_\pi^2\,\chi^2_k \qquad \text{and} \qquad \sigma_\pi^2 \sim \text{inverse Gamma}(a_2, b_2),

so that

\sigma_\pi^2 \mid \mu_\pi, \theta \sim \text{inverse Gamma}\left(a_2 + \frac{k}{2},\ b_2 + \frac{1}{2}\sum_{i=1}^k (\theta_i - \mu_\pi)^2\right).

This gives us (iv), thus completing the specification of all the required full conditionals. It may be noted that the Gibbs sampler in this problem requires simulations from only the (standard) normal and the inverse Gamma distributions.
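The four blocks (i)-(iv) translate directly into one Gibbs sweep. The following is a minimal illustrative sketch (not the authors' code), assuming the data arrive as a list of per-group arrays and the inverse Gamma(a, b) convention in which X ~ inverse Gamma(a, b) iff 1/X ~ Gamma(shape a, rate b); all names and defaults are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_oneway_anova(y, a1, b1, a2, b2, mu0, sig0sq, n_iter=5000):
    """Gibbs sampler for the hierarchical one-way ANOVA of Example 7.13.
    y: list of 1-D arrays, one per group. Returns the draws of theta."""
    k = len(y)
    n = np.array([len(yi) for yi in y])
    ybar = np.array([yi.mean() for yi in y])
    theta = ybar.copy()
    sigsq = np.array([yi.var() + 1e-6 for yi in y])
    mu_pi, sigsq_pi = ybar.mean(), ybar.var() + 1e-6
    draws = []
    for _ in range(n_iter):
        # (i) theta_i | rest: precision-weighted normal, as in (7.27)
        prec = n / sigsq + 1.0 / sigsq_pi
        mean = (n * ybar / sigsq + mu_pi / sigsq_pi) / prec
        theta = rng.normal(mean, np.sqrt(1.0 / prec))
        # (ii) sigma_i^2 | theta: inverse Gamma(a1 + n_i/2, b1 + S_i'^2/2), as in (7.28)
        ssq = np.array([((yi - t) ** 2).sum() for yi, t in zip(y, theta)])
        sigsq = 1.0 / rng.gamma(a1 + n / 2.0, 1.0 / (b1 + ssq / 2.0))
        # (iii) mu_pi | theta, sigsq_pi: normal with precision k/sigsq_pi + 1/sig0sq
        prec_mu = k / sigsq_pi + 1.0 / sig0sq
        mean_mu = (k * theta.mean() / sigsq_pi + mu0 / sig0sq) / prec_mu
        mu_pi = rng.normal(mean_mu, np.sqrt(1.0 / prec_mu))
        # (iv) sigsq_pi | mu_pi, theta: inverse Gamma(a2 + k/2, b2 + sum/2)
        s_pi = ((theta - mu_pi) ** 2).sum()
        sigsq_pi = 1.0 / rng.gamma(a2 + k / 2.0, 1.0 / (b2 + s_pi / 2.0))
        draws.append(theta.copy())
    return np.array(draws)
```

With moderately large groups, the posterior means of the \theta_i sit close to the group sample means, with a small amount of shrinkage toward \mu_\pi.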

Reversible Jump MCMC

There are situations, especially in model selection problems, where the MCMC procedure should be capable of moving between parameter spaces of different dimensions. The standard M-H algorithm described earlier is incapable


of such movements, whereas the reversible jump algorithm of Green (1995) is an extension of the standard M-H algorithm that allows exactly this possibility. The basic idea behind this technique as applied to model selection is as follows. Given two models M_1 and M_2 with parameter sets \Theta_1 and \Theta_2, which are possibly of different dimensions, fill the difference in the dimensions by supplementing the parameter sets of these models. In other words, find auxiliary variables \eta_{12} and \eta_{21} such that (\theta_1, \eta_{12}) and (\theta_2, \eta_{21}) can be mapped with a bijection. Now use the standard M-H algorithm to move between the two models; for moves of the M-H chain within a model, the auxiliary variables are not needed. We sketch this procedure below, but for further details refer to Robert and Casella (1999), Green (1995), Sorensen and Gianola (2002), Waagepetersen and Sorensen (2001), and Brooks et al. (2003). Consider models M_1, M_2, \ldots, where model M_i has a continuous parameter space \Theta_i. The parameter space for the model selection problem as a whole may be taken to be \Theta = \bigcup_i \{i\} \times \Theta_i.

Let f(x \mid M_i, \theta_i) be the model density under model M_i, and let the prior density be \pi_i\,\pi(\theta_i \mid M_i) on \{i\} \times \Theta_i, where \pi_i is the prior probability of model M_i and \pi(\theta_i \mid M_i) is the prior density conditional on M_i being true. The posterior probability of any set B in this overall parameter space can then be written in terms of the posterior densities restricted to the individual models. To compute the Bayes factor of M_k relative to M_l, we will need

\frac{P(M_k \mid x)}{P(M_l \mid x)} \cdot \frac{\pi_l}{\pi_k},

where

P(M_i \mid x) = \frac{\pi_i \int_{\Theta_i} \pi(\theta_i \mid M_i)\, f(x \mid M_i, \theta_i)\, d\theta_i}{\sum_j \pi_j \int_{\Theta_j} \pi(\theta_j \mid M_j)\, f(x \mid M_j, \theta_j)\, d\theta_j}

is the posterior probability of M_i. Therefore, for the target density \pi(\theta \mid x), we need a version of the M-H algorithm that will facilitate the above-shown computations. Suppose \theta_i is a vector of length n_i. It suffices to focus on moves between \theta_i in model M_i and \theta_j in model M_j with n_i < n_j. The scheme provided by Green (1995) is as follows. If the current state of the chain is (M_i, \theta_i), a new value (M_j, \theta_j) is proposed for the chain from a proposal


(transition) distribution Q(\theta_i, d\theta_j), which is then accepted with a certain acceptance probability. To move from model M_i to M_j, generate a random vector V of length n_j - n_i from a proposal density

\psi_{ij}(v) = \prod_{m=1}^{n_j - n_i} \psi(v_m).

Identify an appropriate bijection map

h_{ij} : (\theta_i, v) \mapsto \theta_j,

and propose the move from \theta_i to \theta_j using \theta_j = h_{ij}(\theta_i, V). The acceptance probability is then \min\{\alpha_{ij}(\theta_i, \theta_j),\ 1\}, where

\alpha_{ij}(\theta_i, \theta_j) = \frac{\pi(\theta_j \mid M_j, x)\, p_{ji}(\theta_j)}{\pi(\theta_i \mid M_i, x)\, p_{ij}(\theta_i)\, \psi_{ij}(v)} \left|\frac{\partial h_{ij}(\theta_i, v)}{\partial(\theta_i, v)}\right|,

with p_{ij}(\theta_i) denoting the (user-specified) probability that a proposed jump to model M_j is attempted at any step starting from \theta_i \in \Theta_i. Note that \sum_j p_{ij} = 1.

Example 7.14. For illustration purposes, consider the simple problem of comparing two normal means as in Sorensen and Gianola (2002). The two models to be compared are

M_1: Y_{ij} \sim N(\nu, \sigma^2) \qquad \text{versus} \qquad M_2: Y_{ij} \sim N(\nu_i, \sigma^2), \ i = 1, 2.

To implement the reversible jump M-H algorithm we need the map h_{12} taking (\nu, \sigma^2, V) to (\nu_1, \nu_2, \sigma^2). A reasonable choice for this is a linear map.
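The mechanics of dimension matching are perhaps clearest in an even smaller toy problem, not from the text: M_1 : \theta = 0 versus M_2 : \theta \sim N(0, 1), with x_1, \ldots, x_n i.i.d. N(\theta, 1). Taking equal prior model probabilities, equal jump-attempt probabilities, and the auxiliary proposal \psi equal to the M_2 prior, the map h_{12}(v) = v has Jacobian 1, so the between-model acceptance ratio collapses to a likelihood ratio. Everything below is an illustrative sketch under these stated assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def loglik_norm(theta, x):
    # log-likelihood of N(theta, 1) data, up to an additive constant
    return -0.5 * ((x - theta) ** 2).sum()

def rjmcmc(x, n_iter=30000, step=0.3):
    """Reversible jump between M1: theta = 0 and M2: theta ~ N(0, 1).
    With psi equal to the M2 prior, jumps are accepted with the likelihood ratio."""
    model, theta = 1, 0.0
    in_m2 = np.empty(n_iter, dtype=bool)
    for t in range(n_iter):
        if rng.uniform() < 0.5:                      # attempt a between-model jump
            if model == 1:
                v = rng.normal()                     # dimension-matching draw, v ~ psi = N(0,1)
                if np.log(rng.uniform()) <= loglik_norm(v, x) - loglik_norm(0.0, x):
                    model, theta = 2, v
            else:
                if np.log(rng.uniform()) <= loglik_norm(0.0, x) - loglik_norm(theta, x):
                    model, theta = 1, 0.0
        elif model == 2:                             # within-model random-walk M-H step
            prop = theta + step * rng.normal()
            logr = (loglik_norm(prop, x) - 0.5 * prop ** 2) \
                 - (loglik_norm(theta, x) - 0.5 * theta ** 2)
            if np.log(rng.uniform()) <= logr:
                theta = prop
        in_m2[t] = (model == 2)
    return in_m2
```

Because this toy pair is conjugate, P(M_2 \mid x) is available in closed form, and the chain's occupancy fraction of M_2 can be checked against it.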

7.4.7 Convergence Issues

As we have already seen, Monte Carlo sampling based approaches to inference make use of limit theorems such as the law of large numbers and the central limit theorem to justify their validity. When we add a further dimension to this sampling and adopt MCMC schemes, stronger limit theorems are needed. Ergodic theorems for Markov chains, such as those given in equations (7.13) and (7.15), are the useful results here. It may appear at first that this procedure


necessarily depends on waiting until the Markov chain converges to the target invariant distribution, and then sampling from that distribution. In other words, one would need to start a large number of chains from different starting points and pick the draws after letting these chains run sufficiently long. This is certainly an option, but the law of large numbers for dependent chains, (7.13), says that this is unnecessary: one could just use a single long chain. It may, however, be a good idea to use several different chains to check that convergence indeed takes place. For details, see Robert and Casella (1999). There is one important situation, however, where MCMC sampling can lead to absurd inferences. This is when one resorts to MCMC sampling without realizing that the target posterior distribution is not a probability distribution but an improper one. The following example is similar to the normal problem (see Exercise 13) with lack of identifiability of parameters shown in Carlin and Louis (1996).

Example 7.15. (Example 7.11 continued.) Recall that, in this problem, (X \mid Y, \theta) \sim \text{binomial}(Y, \theta), where Y \mid \lambda \sim \text{Poisson}(\lambda). Earlier, we worked with a known mean \lambda, but let us now see if it is possible to handle this problem with unknown \lambda. Because Y is unobservable and only X is observable, there is an 'identifiability' problem here, as can be seen by noting that X \mid \theta, \lambda \sim \text{Poisson}(\lambda\theta). We already have the \text{Beta}(\alpha, \gamma) prior on \theta. Suppose we consider an independent prior on \lambda according to which \pi(\lambda) \propto I(\lambda > 0). Then,

\pi(\lambda, \theta \mid x) \propto \exp(-\lambda\theta)\,\lambda^x\,\theta^{x+\alpha-1}(1-\theta)^{\gamma-1}, \quad 0 < \theta < 1, \ \lambda > 0.   (7.31)

This joint density is improper because

\int_0^1 \int_0^\infty \exp(-\lambda\theta)\,\lambda^x\,\theta^{x+\alpha-1}(1-\theta)^{\gamma-1}\, d\lambda\, d\theta
= \int_0^1 \left(\int_0^\infty \exp(-\lambda\theta)\,\lambda^x\, d\lambda\right) \theta^{x+\alpha-1}(1-\theta)^{\gamma-1}\, d\theta
= \int_0^1 \frac{\Gamma(x+1)}{\theta^{x+1}}\,\theta^{x+\alpha-1}(1-\theta)^{\gamma-1}\, d\theta
= \Gamma(x+1)\int_0^1 \theta^{\alpha-2}(1-\theta)^{\gamma-1}\, d\theta = \infty.

In fact, the marginal distributions are also improper. However, the joint posterior has full conditional distributions that are proper:

\lambda \mid \theta, x \sim \text{Gamma}(x + 1, \theta) \qquad \text{and} \qquad \pi(\theta \mid \lambda, x) \propto \exp(-\lambda\theta)\,\theta^{x+\alpha-1}(1-\theta)^{\gamma-1}.

Thus, for example, the Gibbs sampler can be successfully employed with these proper full conditionals. To generate from \pi(\theta \mid \lambda, x), one may use the independent M-H algorithm described in Example 7.12. Any inference on the


marginal posterior distributions derived from this sample, however, will be totally erroneous, whereas inferences can indeed be made on \lambda\theta. In fact, the non-convergence of the chain encountered in the above example is far from uncommon. Often, when we have a hierarchical prior, the prior at the final stage of the hierarchy is an improper objective prior, and it is then not easy to check that the joint posterior is proper. None of the theorems on convergence of the chains may apply, and yet the chain may seem to converge. In such cases, inference based on MCMC may be misleading in the sense seen in the example above.
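The pathology is easy to exhibit numerically. In the illustrative sketch below (not from the text), the \lambda-step uses the proper conditional Gamma(x+1, \theta) and the \theta-step is the independent M-H of Example 7.12. Since \lambda\theta \mid \theta, x \sim \text{Gamma}(x+1, 1) exactly, the recorded \lambda\theta draws are stable, while the \lambda draws themselves drift without converging; names and settings are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_improper(x=1, alpha=1.0, gamma=49.0, n_iter=20000):
    """Gibbs for the improper posterior (7.31); records lambda and mu = lambda*theta."""
    theta = 0.02
    lam_draws, mu = np.empty(n_iter), np.empty(n_iter)
    for t in range(n_iter):
        lam = rng.gamma(x + 1, 1.0 / theta)    # lambda | theta, x ~ Gamma(x+1, rate theta)
        lam_draws[t], mu[t] = lam, lam * theta # lam*theta is exactly Gamma(x+1, 1)
        # independent M-H step for theta | lambda, x with Beta(x+alpha, gamma) proposal
        y = rng.beta(x + alpha, gamma)
        if np.log(rng.uniform()) <= -lam * (y - theta):
            theta = y
    return lam_draws, mu
```

Summaries of `mu` (inference on \lambda\theta) are meaningful; summaries of `lam_draws` alone are not.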

7.5 Exercises

1. (Flury and Zoppe (2000)) A total of m + n lightbulbs are tested in two independent experiments. In the first experiment, involving n lightbulbs, the exact lifetimes Y_1, \ldots, Y_n of all the bulbs are recorded. In the second, involving m lightbulbs, the only information available is whether these lightbulbs were still burning at some fixed time t > 0. This is known as right-censoring. Assume that the distribution of lifetime is exponential with mean 1/\theta, and use \pi(\theta) \propto \theta^{-1}. Find the posterior mode using the E-M algorithm.
2. (Flury and Zoppe (2000)) In Problem 1, use uniform(0, \theta) instead of exponential for the lifetime distribution, and \pi(\theta) = I_{(0,\infty)}(\theta). Show that the E-M algorithm fails here if used to find the posterior mode.
3. (Inverse c.d.f. method) Show that, if the c.d.f. F(x) of a random variable X is continuous and strictly increasing, then U = F(X) \sim U[0, 1], and if V \sim U[0, 1], then Y = F^{-1}(V) has c.d.f. F. Using this, show that if U \sim U[0, 1], then -\ln U/\beta is an exponential random variable with mean \beta^{-1}.
4. (Box-Muller transformation method) Let U_1 and U_2 be a pair of independent Uniform(0, 1) random variables. Consider first the transformation R = \sqrt{-2\ln U_1}, V = 2\pi U_2, and then let X = R\cos V; Y = R\sin V. Show that X and Y are independent standard normal random variables.
5. Prove that the accept-reject Monte Carlo method given in Example 7.12 indeed generates samples from the target density. Further show that the expected number of draws required from the 'proposal density' per observation is c^{-1}.
6. Using the methods indicated in Exercises 1, 2, and 3 above, or combinations thereof, prove that the standard continuous probability distributions can be simulated.


7. Consider a discrete probability distribution that puts mass p_i on point x_i, i = 0, 1, \ldots. Let U \sim U(0, 1), and define a new random variable Y as follows:

Y = x_0 \ \text{if} \ U \le p_0; \qquad Y = x_i \ \text{if} \ \sum_{j=0}^{i-1} p_j < U \le \sum_{j=0}^{i} p_j, \ i \ge 1.

What is the probability distribution of Y?
8. Show that the random sequence generated by the independent M-H algorithm is a Markov chain.
9. (Robert and Casella (1999)) Show that the Gamma distribution with a non-integer shape parameter can be simulated using the accept-reject method or the independent M-H algorithm.
10. Gibbs Sampling for Multinomial. Consider the ABO Blood Group problem from Rao (1973). The observed counts in the four blood groups, O, A, B, and AB, are as given in Table 7.3. Assuming that the inheritance of these blood groups is controlled by three alleles, A, B, and O, of which O is recessive to A and B, there are six genotypes OO, AO, AA, BO, BB, and AB, but only four phenotypes. If r, p, and q are the gene frequencies of O, A, and B, respectively (with p + q + r = 1), then the probabilities of the four phenotypes assuming Hardy-Weinberg equilibrium are also as shown in Table 7.3. Thus we have here a 4-cell multinomial probability vector that is a function of three parameters p, q, r with p + q + r = 1. One may wish to formulate a Dirichlet prior for p, q, r. But it will not be conjugate to the 4-cell multinomial likelihood function in terms of p, q, r from the data, and this makes it difficult to work out the posterior distribution of p, q, r. Although no data are missing in the real sense of the term, it is profitable to split each of the n_A and n_B cells into two: n_A into n_{AA}, n_{AO} with corresponding probabilities p^2, 2pr, and n_B into n_{BB}, n_{BO} with corresponding probabilities q^2, 2qr, and consider the 6-cell multinomial problem as a complete problem with n_{AA}, n_{BB} as 'missing' data.

Table 7.3. ABO Blood Group Data

Cell         Count   Probability
n_O  = 176           r^2
n_A  = 182           p^2 + 2pr
n_B  =  60           q^2 + 2qr
n_AB =  17           2pq

Let N = n_O + n_A + n_B + n_{AB}, and denote the observed data by n = (n_O, n_A, n_B, n_{AB}). Consider estimation of p, q, r using a Dirichlet prior with parameters \alpha, \beta, \gamma with the 'incomplete' observed data n. The likelihood, up to a multiplicative constant, is

L(p, q, r) = r^{2n_O}(p^2 + 2pr)^{n_A}(q^2 + 2qr)^{n_B}(pq)^{n_{AB}}.


The posterior density of (p, q, r) given n is proportional to

r^{2n_O + \gamma - 1}(p^2 + 2pr)^{n_A}(q^2 + 2qr)^{n_B}\, p^{n_{AB} + \alpha - 1}\, q^{n_{AB} + \beta - 1}.

Let n_A = n_{AA} + n_{AO}, n_B = n_{BB} + n_{BO}, and write n_{OO} for n_O. Verify that if we have the 'complete' data \tilde{n} = (n_{OO}, n_{AA}, n_{AO}, n_{BB}, n_{BO}, n_{AB}), then the likelihood is, up to a multiplicative constant,

L(p, q, r \mid \tilde{n}) \propto p^{2n_1}\, q^{2n_2}\, r^{2n_3},

where

n_1 = n_{AA} + \tfrac{1}{2}n_{AB} + \tfrac{1}{2}n_{AO}, \quad n_2 = \tfrac{1}{2}n_{AB} + n_{BB} + \tfrac{1}{2}n_{BO}, \quad n_3 = \tfrac{1}{2}n_{AO} + \tfrac{1}{2}n_{BO} + n_{OO}.

Show that the posterior distribution of (p, q, r) given \tilde{n} is Dirichlet with parameters 2n_1 + \alpha, 2n_2 + \beta, 2n_3 + \gamma when the prior is Dirichlet with parameters (\alpha, \beta, \gamma). Show that the conditional distribution of (n_{AA}, n_{BB}) given n and (p, q, r) is that of two independent binomials:

n_{AA} \mid n, p, q, r \sim \text{binomial}\left(n_A,\ \frac{p^2}{p^2 + 2pr}\right), \qquad n_{BB} \mid n, p, q, r \sim \text{binomial}\left(n_B,\ \frac{q^2}{q^2 + 2qr}\right),

and

(p, q, r) \mid n, n_{AA}, n_{BB} \sim \text{Dirichlet}(2n_1 + \alpha,\ 2n_2 + \beta,\ 2n_3 + \gamma).

Show that the Rao-Blackwellized estimate of (p, q, r) from a Gibbs sample of size m is

\frac{1}{m}\sum_{i=1}^m \frac{\left(2n_1^{(i)} + \alpha,\ 2n_2^{(i)} + \beta,\ 2n_3^{(i)} + \gamma\right)}{2N + \alpha + \beta + \gamma},

where the superscript i denotes the ith draw.
11. (M-H for the Weibull Model: Robert (2001)) The following twelve observations are from a simulated reliability study: 0.56, 2.26, 1.90, 0.94, 1.40, 1.39, 1.00, 1.45, 2.32, 2.08, 0.89, 1.68. A Weibull model with the following density form is considered appropriate:

f(x \mid \alpha, \eta) \propto \alpha\eta\, x^{\alpha-1} e^{-\eta x^\alpha}, \quad 0 < x < \infty,

with parameters (\alpha, \eta). Consider the prior distribution

\pi(\alpha, \eta) \propto e^{-\alpha}\,\eta^{\beta-1} e^{-\xi\eta}.

The posterior distribution of (\alpha, \eta) given the data (x_1, x_2, \ldots, x_n) has density

\pi(\alpha, \eta \mid x_1, \ldots, x_n) \propto \alpha^n\, \eta^{n+\beta-1} \left(\prod_{i=1}^n x_i\right)^{\alpha-1} \exp\left\{-\alpha - \eta\left(\xi + \sum_{i=1}^n x_i^\alpha\right)\right\}.

To get a sample from the posterior density, one may use the M-H algorithm with proposal density

q(\alpha', \eta' \mid \alpha, \eta) = \frac{1}{\alpha\eta} \exp\left\{-\frac{\alpha'}{\alpha} - \frac{\eta'}{\eta}\right\},

which is a product of two independent exponential distributions with means \alpha, \eta. Compute the acceptance probability \rho((\alpha', \eta'), (\alpha^{(t)}, \eta^{(t)})) at the tth step of the M-H chain, and explain how the chain is to be generated.
12. Complete the construction of the reversible jump M-H algorithm in Example 7.14. In particular, choose an appropriate prior distribution and proposal distribution, and compute the acceptance probabilities.
13. (Carlin and Louis (1996)) Suppose y_1, y_2, \ldots, y_n is an i.i.d. sample with

Yil81, 82 '""N(81 + 82, a 2), where a 2 is assumed to be known. Independent improper uniform prior distributions are assumed for 81 and 82. (a) Show that the posterior density of (81,82IY) is

7r(8l' 82IY) ex exp( -n(8l + 82 - y) 2I (2a 2) )I( (81' 82) E R 2), which is improper, integrating to oo (over R 2 )). (b) Show that the marginal posterior distributions are also improper. (c) Show that the full conditional distributions of this posterior distribution are proper. (d) Explain why a sample generated using the Gibbs sampler based on these proper full conditionals will be totally useless for any inference on the marginal posterior distributions, whereas inferences can indeed be made on 81 +82. 14. Suppose X 1 , X 2 , ... , Xk are independent Poisson counts with Xi having mean 8i. 8i are a priori considered related, but exchangeable, and the prior k

\pi(\theta_1, \ldots, \theta_k) \propto \left(1 + \sum_{i=1}^k \theta_i\right)^{-(k+1)}

is assumed. (a) Show that the prior is a conjugate mixture. (b) Show how the Gibbs sampler can be employed for inference.


15. Suppose X_1, X_2, \ldots, X_n are i.i.d. random variables with

and independent scale-invariant non-informative priors on \lambda_1 and \lambda_2 are used, i.e., \pi(\lambda_1, \lambda_2) \propto (\lambda_1\lambda_2)^{-1} I(\lambda_1 > 0, \lambda_2 > 0).
(a) Show that the marginals of the posterior \pi(\lambda_1, \lambda_2 \mid x) are improper, but the full conditionals are standard distributions.
(b) What posterior inferences are possible based on a sample generated from the Gibbs sampler using these full conditionals?

8 Some Common Problems in Inference

We have already discussed some basic inference problems in the previous chapters, including those involving the normal mean and the binomial proportion. Some other commonly encountered problems are discussed in what follows.

8.1 Comparing Two Normal Means

Investigating the difference between two mean values or two proportions is a frequently encountered problem. Examples include agricultural experiments where two different varieties of seeds or fertilizers are employed, or clinical trials involving two different treatments. Comparison of two binomial proportions was considered in Example 4.6 and Problem 8 in Chapter 4. Comparison of two normal means is discussed below.

Suppose the model for the available data is as follows. Y_{11}, \ldots, Y_{1n_1} is a random sample of size n_1 from a normal population, N(\theta_1, \sigma_1^2), whereas Y_{21}, \ldots, Y_{2n_2} is an independent random sample of size n_2 from another normal population, N(\theta_2, \sigma_2^2). All four parameters \theta_1, \theta_2, \sigma_1^2, and \sigma_2^2 are unknown, but the quantity of inferential interest is \eta = \theta_1 - \theta_2. It is convenient to consider the case \sigma_1^2 = \sigma_2^2 = \sigma^2 separately. In this case, (\bar{Y}_1, \bar{Y}_2, s^2) is jointly sufficient for (\theta_1, \theta_2, \sigma^2), where s^2 = \left(\sum_{i=1}^{n_1}(Y_{1i} - \bar{Y}_1)^2 + \sum_{j=1}^{n_2}(Y_{2j} - \bar{Y}_2)^2\right)/(n_1 + n_2 - 2). Further, given (\theta_1, \theta_2, \sigma^2),

\bar{Y}_i \sim N(\theta_i, \sigma^2/n_i), \ i = 1, 2, \qquad \text{and} \qquad (n_1 + n_2 - 2)s^2/\sigma^2 \sim \chi^2_{n_1+n_2-2},

and they are independently distributed. Upon utilizing the objective prior \pi(\theta_1, \theta_2, \sigma^2) \propto \sigma^{-2}, one obtains \theta_i \mid \sigma^2, y \sim N(\bar{y}_i, \sigma^2/n_i), i = 1, 2, independently, and hence


\eta \mid \sigma^2, y \sim N\left(\bar{y}_1 - \bar{y}_2,\ \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right).   (8.1)

Now, note that \sigma^2 \mid y is distributed as (n_1 + n_2 - 2)s^2 times an inverse \chi^2_{n_1+n_2-2}. Consequently, integrating out \sigma^2 from (8.1) yields

\frac{\eta - (\bar{y}_1 - \bar{y}_2)}{s\sqrt{1/n_1 + 1/n_2}}\ \Big|\ y \sim t_{n_1+n_2-2},   (8.2)

or, equivalently,

\eta \mid y \sim \bar{y}_1 - \bar{y}_2 + s\sqrt{1/n_1 + 1/n_2}\; t_{n_1+n_2-2}.

In many situations, the assumption that \sigma_1^2 = \sigma_2^2 is not tenable. For example, in a clinical trial the populations corresponding to two different treatments may have very different spreads. This problem of comparing means when we have unequal and unknown variances is known as the Behrens-Fisher problem, and a frequentist approach to this problem has already been discussed in Problem 17, Chapter 2. We discuss the Bayesian approach now. We have that (\bar{Y}_1, s_1^2) is sufficient for (\theta_1, \sigma_1^2) and (\bar{Y}_2, s_2^2) is sufficient for (\theta_2, \sigma_2^2), where s_i^2 = \sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_i)^2/(n_i - 1), i = 1, 2. Also, given (\theta_1, \theta_2, \sigma_1^2, \sigma_2^2),

\bar{Y}_i \sim N(\theta_i, \sigma_i^2/n_i) \qquad \text{and} \qquad (n_i - 1)s_i^2/\sigma_i^2 \sim \chi^2_{n_i-1}, \quad i = 1, 2,

and further, they are all independently distributed. Now employ the objective prior and proceed exactly as in the previous case. It then follows that under the posterior distribution \theta_1 and \theta_2 are independent, and that

\frac{\theta_i - \bar{y}_i}{s_i/\sqrt{n_i}}\ \Big|\ y \sim t_{n_i - 1}, \quad i = 1, 2.   (8.3)

It may be immediately noted that the posterior distribution of \eta = \theta_1 - \theta_2, however, is not a standard distribution. Posterior computations are still quite easy to perform because Monte Carlo sampling is totally straightforward. Simply generate independent deviates \theta_1 and \theta_2 repeatedly from (8.3) and utilize the corresponding \eta = \theta_1 - \theta_2 values to investigate its posterior distribution. Problem 4 is expected to apply these results. Extension to the k-mean problem or one-way ANOVA is straightforward. A hierarchical Bayes approach to this problem and implementation using MCMC have already been discussed in Example 7.13 in Chapter 7.
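The Monte Carlo recipe just described takes only a few lines. The sketch below is illustrative (function and variable names are assumptions): it draws each \theta_i from its scaled-t posterior in (8.3) and forms \eta = \theta_1 - \theta_2.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_eta_draws(y1, y2, n_draws=100000):
    """Monte Carlo draws of eta = theta1 - theta2 for the Behrens-Fisher problem:
    theta_i | y ~ ybar_i + (s_i / sqrt(n_i)) * t_{n_i - 1}, independently, as in (8.3)."""
    thetas = []
    for y in (y1, y2):
        n = len(y)
        ybar, s = np.mean(y), np.std(y, ddof=1)
        thetas.append(ybar + (s / np.sqrt(n)) * rng.standard_t(n - 1, size=n_draws))
    return thetas[0] - thetas[1]
```

Quantiles of the resulting draws, e.g. `np.quantile(eta, [0.025, 0.975])`, give equal-tailed posterior credible intervals for \eta.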


8.2 Linear Regression

We encountered normal linear regression in Section 5.4, where we discussed prior elicitation issues in the context of the problem of inference on a response variable Y conditional on some predictor variable X. Normal linear models in general, and regression models in particular, are very widely used. We have already seen an illustration of this in Example 7.13 in Section 7.4.6. We intend to cover some of the important inference problems related to regression in this section.

Extending the simple linear regression model, where E(Y \mid \beta_0, \beta_1, X = x) = \beta_0 + \beta_1 x, to the multiple linear regression case, E(Y \mid \beta, X = x) = \beta'x, yields the linear model

y = X\beta + \epsilon,   (8.4)

where y is the n-vector of observations, X the n \times p matrix having the appropriate readings from the predictors, \beta the p-vector of unknown regression coefficients, and \epsilon the n-vector of random errors with mean 0 and constant variance \sigma^2. The parameter vector then is (\beta, \sigma^2), and most often the statistical inference problem involves estimation of \beta and also testing hypotheses involving the same parameter vector. For convenience, we assume that X has full column rank p < n. We also assume that the first column of X is the vector of 1's, so that the first element of \beta, namely \beta_1, is the intercept. If we assume that the random errors are independent normals, we obtain the likelihood function for (\beta, \sigma^2) as

f(y \mid \beta, \sigma^2) = \left(\sqrt{2\pi}\,\sigma\right)^{-n} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right\}
= \left(\sqrt{2\pi}\,\sigma\right)^{-n} \exp\left\{-\frac{1}{2\sigma^2}\left[(y - \hat{y})'(y - \hat{y}) + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta})\right]\right\},   (8.5)

where \hat{\beta} = (X'X)^{-1}X'y and \hat{y} = X\hat{\beta}. It then follows that \hat{\beta} is sufficient for \beta if \sigma^2 is known, and (\hat{\beta}, (y - \hat{y})'(y - \hat{y})) is jointly sufficient for (\beta, \sigma^2). Further,

\hat{\beta} \mid \beta, \sigma^2 \sim N_p\left(\beta, \sigma^2(X'X)^{-1}\right),

and is independent of

(y - \hat{y})'(y - \hat{y})/\sigma^2 \sim \chi^2_{n-p}.

We take the prior

\pi(\beta, \sigma^2) \propto \sigma^{-2}.   (8.6)

This leads to the posterior,

\pi(\beta, \sigma^2 \mid y) \propto \sigma^{-(n+2)} \exp\left\{-\frac{1}{2\sigma^2}\left[(y - \hat{y})'(y - \hat{y}) + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta})\right]\right\}.   (8.7)

It can be seen that

\beta \mid \sigma^2, y \sim N_p\left(\hat{\beta}, \sigma^2(X'X)^{-1}\right),

and that the posterior distribution of \sigma^2 is proportional to an inverse \chi^2_{n-p}. Integrating out \sigma^2 from this joint posterior density yields the multivariate t marginal posterior density for \beta, i.e.,

\pi(\beta \mid y) = \frac{\Gamma(n/2)\,|X'X|^{1/2}\,s^{-p}}{(\Gamma(1/2))^p\,\Gamma((n-p)/2)\,(\sqrt{n-p})^p}\left[1 + \frac{(\beta - \hat{\beta})'X'X(\beta - \hat{\beta})}{(n-p)s^2}\right]^{-n/2},   (8.8)

where s^2 = (y - \hat{y})'(y - \hat{y})/(n - p). From this, it can be deduced that the posterior mean of \beta is \hat{\beta} if n \ge p + 2, and the 100(1-\alpha)\% HPD credible region for \beta is given by the ellipsoid

\left\{\beta : (\beta - \hat{\beta})'X'X(\beta - \hat{\beta}) \le p\,s^2\,F_{p,n-p}(\alpha)\right\},   (8.9)

where F_{p,n-p}(\alpha) is the (1-\alpha) quantile of the F_{p,n-p} distribution. Further, if one is interested in a particular \beta_j, the fact that the marginal posterior distribution of \beta_j is given by

\frac{\beta_j - \hat{\beta}_j}{s\sqrt{d_{jj}}}\ \Big|\ y \sim t_{n-p},   (8.10)

where d_{jj} is the jth diagonal entry of (X'X)^{-1}, can be used.

Conjugate priors for the normal regression model are of interest, especially if hierarchical prior modeling is desired. This discussion, however, will be deferred to the following chapters, where hierarchical Bayesian analysis is discussed.

Example 8.1. Table 8.1 shows the maximum January temperatures (in degrees Fahrenheit), from 1931 to 1960, for 62 cities in the U.S., along with their latitude (degrees), longitude (degrees), and altitude (feet). (See Mosteller and Tukey, 1977.) It is of interest to relate the information supplied by the geographical coordinates to the maximum January temperatures. The following summary measures are obtained.

X'X = \begin{pmatrix} 62.0 & 2365.0 & 5674.0 & 56012.0 \\ 2365.0 & 92955.0 & 217285.0 & 2244586.0 \\ 5674.0 & 217285.0 & 538752.0 & 5685654.0 \\ 56012.0 & 2244586.0 & 5685654.0 & 1.7720873 \times 10^8 \end{pmatrix},


Table 8.1. Maximum January Temperatures for U.S. Cities, with Latitude, Longitude, and Altitude City Latitude Longitude Altitude Max. Jan. Temp Mobile, Ala. 30 88 5 61 32 86 160 59 Montgomery, Ala. 58 134 50 30 Juneau, Alaska 33 112 1090 64 Phoenix, Ariz. 286 Little Rock, Ark. 34 51 92 340 Los Angeles, Calif. 34 118 65 San Francisco, Calif. 65 55 37 122 5280 Denver, Col. 42 39 104 40 New Haven, Conn. 41 72 37 Wilmington, Del. 41 135 39 75 44 Washington, D.C. 25 38 77 20 Jacksonville, Fla. 81 67 38 Key West, Fla. 74 24 5 81 Miami, Fla. 25 10 76 80 52 Atlanta, Ga. 84 1050 33 Honolulu, Hawaii 21 21 157 79 43 2704 Boise, Idaho 116 36 Chicago, Ill. 41 33 595 87 Indianapolis, Ind. 37 39 710 86 29 41 Des Moines, Iowa 805 93 27 42 620 Dubuque, Iowa 90 42 1290 Wichita, Kansas 97 37 Louisville, Ky. 44 450 38 85 64 New Orleans, La. 29 5 90 32 43 Portland, Maine 25 70 44 20 Baltimore, Md. 39 76 42 Boston, Mass. 21 37 71 42 Detroit, Mich. 585 83 33 84 23 Sault Sainte Marie, Mich. 46 650 Minneapolis -St. Paul, Minn. 815 22 44 93 40 St. Louis, Missouri 455 38 90 Helena, Montana 29 4155 112 46 Omaha, Neb. 32 1040 41 95 32 Concord, N.H. 290 43 71 Atlantic City, N.J. 43 74 10 39 Albuquerque, N.M. 46 106 4945 35 continues

(X'X)^{-1} = 10^{-5} \times \begin{pmatrix} 94883.1914 & -1342.5011 & -485.0209 & 2.5756 \\ -1342.5011 & 37.8582 & -0.8276 & -0.0286 \\ -485.0209 & -0.8276 & 5.8951 & -0.0254 \\ 2.5756 & -0.0286 & -0.0254 & 0.0009 \end{pmatrix},

Table 8.1 continued

City Latitude Longitude Altitude Max. Jan. Temp 42 31 Albany, N.Y. 73 20 New York, N.Y. 40 40 73 55 Charlotte, N.C. 35 80 720 51 Raleigh, N.C. 35 78 365 52 Bismark, N.D. 46 1674 100 20 Cincinnati, Ohio 39 84 41 550 Cleveland, Ohio 41 81 35 660 Oklahoma City, Okla. 35 97 1195 46 45 Portland, Ore. 122 77 44 Harrisburg, Pa. 40 76 365 39 Philadelphia, Pa. 39 40 75 100 Charlestown, S.C. 32 61 79 9 Rapid City, S.D. 44 34 103 3230 Nashville, Tenn. 36 450 49 86 Amarillo, Tx. 35 101 3685 50 Galveston, Tx. 29 94 61 5 Houston, Tx. 29 64 95 40 Salt Lake City, Utah 40 111 4390 37 Burlington, Vt. 44 110 73 25 Norfolk, Va. 36 76 50 10 Seattle-Tacoma, Wash. 47 122 44 10 Spokane, Wash. 47 117 1890 31 Madison, Wise. 43 89 26 860 Milwaukee, Wise. 43 87 28 635 Cheyenne, Wyoming 41 104 6100 37 San Juan, Puerto Rico 18 66 35 81

X'y = (2739.0,\ 99168.0,\ 252007.0,\ 2158463.0)', \qquad \hat{\beta} = (100.8260,\ -1.9315,\ 0.2033,\ -0.0017)', \qquad \text{and} \ s = 6.05185.

On the basis of the analysis explained above, \hat{\beta} may be taken as the estimate of \beta, and the HPD credible region for it can be derived using (8.9). Suppose instead one is interested in the impact of latitude on maximum January temperatures. Then the 95% HPD region for the corresponding regression coefficient \beta_2 can be obtained using (8.10). This yields the t-interval \hat{\beta}_2 \pm s\sqrt{d_{22}}\,t_{58}(.975), or (-2.1623, -1.7007), indicating an expected general drop in maximum temperatures as one moves away from the Equator. If the joint impact of latitude and altitude is also of interest, then one would look at the HPD credible set for (\beta_2, \beta_4). This is given by

\left\{(\beta_2, \beta_4) : (\beta_2 + 1.9315,\ \beta_4 + 0.0017)\,C^{-1}\,(\beta_2 + 1.9315,\ \beta_4 + 0.0017)' \le 2s^2 F_{2,58}(\alpha)\right\},


where C is the appropriate 2 \times 2 block from (X'X)^{-1},

C = 10^{-4} \times \begin{pmatrix} 3.7858 & -2.8636 \times 10^{-3} \\ -2.8636 \times 10^{-3} & 9.2635 \times 10^{-5} \end{pmatrix}.

Fig. 8.1. Plot of 95% and 99% HPD credible regions for (\beta_2, \beta_4).

Plotted in Figure 8.1 are the 95% and 99% HPD credible regions for (\beta_2, \beta_4). The impact of altitude on maximum temperatures seems to be very limited for the case under consideration. The literature on the Bayesian approach to linear regression is very large; some of the material relevant to the discussion given above may be found in Box and Tiao (1973), Leamer (1978), and Gelman et al. (1995).
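Posterior draws of \beta under (8.6) can also be generated by composition — first \sigma^2 from its scaled inverse \chi^2, then \beta from the conditional normal — which reproduces the marginal t density (8.8). The sketch below is illustrative and not tied to Example 8.1's data; all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_beta_draws(X, y, n_draws=20000):
    """Draws from pi(beta | y) under pi(beta, sigma^2) ∝ sigma^{-2}:
    sigma^2 | y ~ (y - yhat)'(y - yhat) / chi^2_{n-p}, then
    beta | sigma^2, y ~ N_p(betahat, sigma^2 (X'X)^{-1})."""
    n, p = X.shape
    XtX = X.T @ X
    betahat = np.linalg.solve(XtX, X.T @ y)
    resid = y - X @ betahat
    sse = float(resid @ resid)
    cov_chol = np.linalg.cholesky(np.linalg.inv(XtX))
    sig2 = sse / rng.chisquare(n - p, size=n_draws)          # scaled inverse chi-square
    z = rng.standard_normal((n_draws, p))
    return betahat + np.sqrt(sig2)[:, None] * (z @ cov_chol.T)
```

Marginal quantiles of each column then give the t-intervals of (8.10) without any special-function computation.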

8.3 Logit Model, Probit Model, and Logistic Regression

We consider a problem here that is related to linear regression but actually belongs to the broad class of generalized linear models. This model is useful for problems involving toxicity tests and bioassay experiments. In such experiments, various dose levels of drugs are usually administered to batches of animals. Most often the responses are dichotomous, because what is observed is whether the subject is dead or whether a tumor has appeared. This leads to a setup that can be easily understood in the context of the following example.


Example 8.2. Suppose that k independent random variables Y_1, Y_2, \ldots, Y_k are observed, where Y_i has the B(n_i, p_i) probability distribution, 1 \le i \le k. Y_i may be the number of laboratory animals cured of an ailment in an experiment involving n_i such animals. It is certainly possible to make inference on each p_i separately based on the observed Y_i (as discussed previously). This, however, is not really useful if we want to predict the results of a similar experiment in the future. Suppose that the p_i are related to a covariate or an explanatory variable, such as dosage level in a clinical experiment. Then the natural approach is regression, as described in the previous problem, because this allows us to explore and present the relationship between design (explanatory) variables and response variables, and (if needed) predictions of response at desired levels of the explanatory variables. Let t_i be the value of the covariate that corresponds with p_i, i = 1, 2, \ldots, k. Because the p_i are probabilities, linking them to the corresponding t_i through a linear map, as was done earlier, does not seem appropriate now. Instead, the link can be made through a link function H such that p_i = H(\beta_0 + \beta_1 t_i). H, here, is a known cumulative distribution function (c.d.f.), and \beta_0 and \beta_1 are two unknown parameters. (If H is an invertible function, this is precisely H^{-1}(p_i) = \beta_0 + \beta_1 t_i.) If the standard normal c.d.f. is used for H, the model is called the probit model, and if the logistic form H(z) = e^{-z}/(1 + e^{-z}) is used, it is called the logit model. The likelihood function for the unknown parameters, \beta_0 and \beta_1, is then given by

L(\beta_0, \beta_1) = \prod_{i=1}^k \binom{n_i}{y_i} H(\beta_0 + \beta_1 t_i)^{y_i}\left(1 - H(\beta_0 + \beta_1 t_i)\right)^{n_i - y_i}.

Suppose \pi(\beta_0, \beta_1) is the prior density on (\beta_0, \beta_1), so that the posterior density is

\pi(\beta_0, \beta_1 \mid \text{data}) \propto \pi(\beta_0, \beta_1) \prod_{i=1}^k H(\beta_0 + \beta_1 t_i)^{y_i}\left(1 - H(\beta_0 + \beta_1 t_i)\right)^{n_i - y_i}.

It may be noted that the sample sizes n_i and the covariates t_i (dose levels) are treated here as fixed, or equivalently, the model conditional on those variables is analyzed (as in the linear regression problem). More generally, instead of a single covariate t, if we have a set of s covariates represented by x and the corresponding vector of coefficients \beta, the posterior density of \beta will be

\pi(\beta \mid \text{data}) \propto \pi(\beta) \prod_{i=1}^k H(\beta'x_i)^{y_i}\left(1 - H(\beta'x_i)\right)^{n_i - y_i}.   (8.11)

8.3.1 The Logit Model

If we use the logit model, whereby p_i = exp{−β'x_i}/(1 + exp{−β'x_i}) and hence −log(p_i/(1 − p_i)) = β'x_i, we obtain the likelihood function

L(β) ∝ exp{−β' Σ_{i=1}^k y_i x_i} ∏_{i=1}^k (1 + exp{−β'x_i})^{−n_i},    (8.12)


which is largely intractable (but see Problem 7). Usually a flat prior such as π(β) ∝ 1 is employed, but because β can be considered to be regression coefficients under reparameterization, an approximate conjugate normal prior can also be used. In such a case, a hierarchical prior structure is also meaningful. To motivate the approximate conjugate hierarchical prior structure, consider the following large sample approximation. For simplicity, let there be only one covariate t. Assume that the n_i are large enough for a satisfactory normal approximation of the binomial model. If we let p̂_i = y_i/n_i, then (approximately) these p̂_i are independent N(p_i, p_i(1 − p_i)/n_i) random variates. Now let θ_i = −log(p_i/(1 − p_i)) and θ̂_i = −log(p̂_i/(1 − p̂_i)). It can be shown that, approximately, (θ̂_i − θ_i)√(n_i p_i(1 − p_i)) are independent N(0, 1) random variates. Then (again approximately), the likelihood function for (β₀, β₁) is

L(β₀, β₁) ∝ exp{−(1/2) Σ_{i=1}^k w_i (θ̂_i − β₀ − β₁t_i)²},    (8.13)

where w_i = n_i p̂_i(1 − p̂_i) are known weights. This suggests a bivariate normal prior on (β₀, β₁) as the first level in the hierarchical structure. Now the problem is very similar to that discussed in Section 5.4. Further, there is also another important use for the approximate likelihood in (8.13). Its product with the conjugate normal prior discussed above can be used as a natural proposal density for the M-H algorithm in the computations required for inferential purposes (see Problem 9). If instead a flat prior on β is to be employed, then (8.13) itself (up to a constant) can be used as the proposal density.
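To make this scheme concrete, here is a minimal sketch (not the book's code) of an independence M-H sampler for the logit model with a flat prior, using the rotenone data of Table 8.2: the proposal is the bivariate normal obtained from the weighted least squares fit implied by (8.13), and the target is the exact likelihood (8.12).

```python
import numpy as np

rng = np.random.default_rng(0)

# Rotenone data of Table 8.2: x = log10(concentration), n = batch size, y = deaths
x = np.log10(np.array([2.6, 3.8, 5.1, 7.7, 10.2]))
n = np.array([50, 48, 46, 49, 50])
y = np.array([6, 16, 24, 42, 44])

def loglik(b0, b1):
    """Exact logit log-likelihood with p_i = e^{-(b0+b1 x_i)} / (1 + e^{-(b0+b1 x_i)})."""
    eta = b0 + b1 * x
    return np.sum(-y * eta - n * np.log1p(np.exp(-eta)))

# Approximate normal likelihood (8.13): weighted least squares of
# theta_hat_i = -log(phat_i/(1-phat_i)) on x_i, weights w_i = n_i phat_i (1-phat_i)
phat = y / n
theta_hat = -np.log(phat / (1 - phat))
w = n * phat * (1 - phat)
D = np.column_stack([np.ones_like(x), x])
V = np.linalg.inv(D.T @ (w[:, None] * D))  # proposal covariance
m = V @ D.T @ (w * theta_hat)              # proposal mean
L = np.linalg.cholesky(V)

def logprop(b):  # log proposal density, up to a constant
    d = b - m
    return -0.5 * d @ np.linalg.solve(V, d)

# Independence M-H: propose from N(m, V), target = flat prior x exact likelihood
b = m.copy()
samples = []
for _ in range(5000):
    cand = m + L @ rng.standard_normal(2)
    log_ratio = loglik(*cand) - loglik(*b) - (logprop(cand) - logprop(b))
    if np.log(rng.uniform()) < log_ratio:
        b = cand
    samples.append(b)
samples = np.array(samples)
print(samples.mean(axis=0))  # posterior means of (beta0, beta1)
```

Because the proposal closely approximates the target, the acceptance rate is high and the chain mixes quickly; with a proper normal prior one would simply add its log density to the target.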

Example 8.3. (Example 8.2 continued). The data given in Table 8.2 are from Finney (1971) (originally appearing in Martin (1942)), where results of spraying rotenone at different concentrations on chrysanthemum aphids in batches of about fifty are presented. The concentration is in milligrams per liter of the solution, and the dosage x is measured on the logarithmic scale. The median lethal dose LD50, the dose at which 50% of the subjects will expire, is one of the quantities of inferential interest. The plot of p̂ = y/n against x shown in Figure 8.2 is S-shaped, so a linear fit for p̂ based on x is unsatisfactory. Instead, as suggested by Figure 8.3, logistic regression is more appropriate here. Suppose that a flat prior on (β₀, β₁) is to be used. Then the implementation of the M-H algorithm as explained above, using the bivariate normal proposal density, is straightforward. A scatter plot of 1000 values of (β₀, β₁) so obtained is shown in Figure 8.4.

248

8 Some Common Problems in Inference

Fig. 8.2. Plot of proportion of deaths against dosage level.

Fig. 8.3. Plot of logistic response function: e^{−5+7x}/(1 + e^{−5+7x}).


Table 8.2. Toxicity of Rotenone

Concentration (c_i)   Dose (x_i = log c_i)   Batch Size (n_i)   Deaths (y_i)
 2.6                  0.4150                 50                  6
 3.8                  0.5797                 48                 16
 5.1                  0.7076                 46                 24
 7.7                  0.8865                 49                 42
10.2                  1.0086                 50                 44

"7 I 0

"""" \D I

-rl

I

fll

+' (!)

.Q

<:0 I

~

-rl I

3

4

5 beta 0

5

7

Fig. 8.4. Scatter plot of 1000 (!30, !31) values sampled using M-H algorithm.

A comparison of the estimates of β₀ and β₁ obtained using the MLE and the posterior means is shown in Table 8.3. Histograms of the M-H samples of β₀ and β₁ are shown in Figure 8.5 and Figure 8.6, respectively. They seem to be skewed, and hence the posterior estimates seem more appropriate. Let us consider the estimation of LD50 next. Note that for the logit model, LD50 is that dosage level t₀ for which E(y_i/n_i | t_i = t₀) = exp(−(β₀ + β₁t₀))/(1 + exp(−(β₀ + β₁t₀))) = 0.5. This means that LD50 = −β₀/β₁.

Table 8.3. Estimates of β₀ and β₁ from Different Methods

Method                          β̂₀       s.e.(β̂₀)   β̂₁       s.e.(β̂₁)
MLE from logistic regression    4.826                −7.065
Posterior estimates             4.9727    0.6312     −7.266    0.8859

We


Fig. 8.5. Histogram of 1000 β₀ values sampled using M-H algorithm.

Fig. 8.6. Histogram of 1000 β₁ values sampled using M-H algorithm.


can easily estimate this using the M-H sample, and we obtain an estimate of 0.6843 (on the logarithmic scale) with a standard error of 0.022.
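Computing LD50 from posterior draws is a one-line transformation. The sketch below uses hypothetical synthetic draws of (β₀, β₁) as a stand-in for real M-H output, with moments chosen to roughly match Table 8.3; in practice `samples` would be the array produced by the sampler.

```python
import numpy as np

# Hypothetical posterior draws of (beta0, beta1): a synthetic stand-in for
# M-H output, with mean/covariance chosen to roughly match Table 8.3
rng = np.random.default_rng(1)
samples = rng.multivariate_normal([4.97, -7.27],
                                  [[0.40, -0.50], [-0.50, 0.78]], size=1000)

# LD50 = -beta0/beta1 for each draw; posterior mean and s.d. summarize it
ld50 = -samples[:, 0] / samples[:, 1]
print(ld50.mean(), ld50.std())
```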

8.3.2 The Probit Model

As mentioned previously, if the standard normal c.d.f. Φ is used for the link function H above, we obtain the probit model. Then, assuming that π(β) is the prior density on β, the posterior density is obtained as

π(β | data) ∝ π(β) ∏_{i=1}^k Φ(β'x_i)^{y_i} (1 − Φ(β'x_i))^{n_i − y_i}.    (8.14)

Analytically, this is even less tractable than the posterior density for the logit model. However, the following computational scheme, developed by Albert and Chib (1993) and based on the Gibbs sampler, provides a convenient strategy. First consider the simpler case involving Bernoulli y_i's, i.e., n_i = 1. Then,

π(β | data) ∝ π(β) ∏_{i=1}^k Φ(β'x_i)^{y_i} (1 − Φ(β'x_i))^{1 − y_i}.

The computational scheme then proceeds by introducing k independent latent variables Z₁, Z₂, ..., Z_k, where Z_i ~ N(β'x_i, 1). If we let Y_i = I(Z_i > 0), then Y₁, ..., Y_k are independent Bernoulli with p_i = P(Y_i = 1) = Φ(β'x_i). Now note that the posterior density of β and Z = (Z₁, ..., Z_k) given y = (y₁, ..., y_k) is

π(β, Z | y) ∝ π(β) ∏_{i=1}^k {I(Z_i > 0)I(y_i = 1) + I(Z_i ≤ 0)I(y_i = 0)} φ(Z_i − β'x_i),    (8.15)

where φ is the standard normal p.d.f. Even though (8.15) is not a joint density that allows direct sampling from it, the Gibbs sampler can handle it, since π(β | Z, y) and π(Z | β, y) do allow sampling from them. It is clear that

π(β | Z, y) ∝ π(β) ∏_{i=1}^k φ(Z_i − β'x_i),    (8.16)

whereas

Z_i | β, y ~ N(β'x_i, 1) truncated at the left by 0 if y_i = 1;
Z_i | β, y ~ N(β'x_i, 1) truncated at the right by 0 if y_i = 0.    (8.17)

Sampling Z from (8.17) is straightforward. On the other hand, (8.16) is simply the posterior density for the regression parameters in the normal linear model with error variance 1. Therefore, if a flat noninformative prior on β is used, then

Table 8.4. Lethality of Chloracetic Acid

Dose    Fatalities        Dose    Fatalities
.0794   1                 .1778   4
.1000   2                 .1995   6
.1259   1                 .2239   4
.1413   0                 .2512   5
.1500   1                 .2818   5
.1558   2                 .3162   8

β | Z, y ~ N_s(β̂_Z, (X'X)⁻¹), where X = (x₁, ..., x_k)' and β̂_Z = (X'X)⁻¹X'Z. If a proper normal prior is used, a different normal posterior will emerge. In either case, it is easy to sample β from this posterior conditional on Z. Extension of this scheme to binomial counts Y₁, Y₂, ..., Y_k is straightforward. We let Y_i = Σ_{j=1}^{n_i} Y_ij, where Y_ij = I(Z_ij > 0), with Z_ij ~ N(β'x_i, 1) independent, j = 1, 2, ..., n_i, i = 1, 2, ..., k. We then proceed exactly as above, but at each Gibbs step we will need to generate Σ_{i=1}^k n_i many (truncated) normals Z_ij. This procedure is employed in the following example.

Example 8.4. (Example 8.2 continued). Consider the data given in Table 8.4, taken from Woodward et al. (1941), where several data sets on toxicity of certain acids were reported. This particular data set examines the relationship between exposure to chloracetic acid and the death of mice. At each of the dose levels (measured in grams per kilogram of body weight), ten mice were exposed. The median lethal dose LD50 is again one of the quantities of inferential interest. The Gibbs sampler as explained above is employed to generate a sample from the posterior distribution of (β₀, β₁) given the data. A scatter plot of 1000 points so generated is shown in Figure 8.7. From this sample we have the estimate of (−1.4426, 5.9224) for (β₀, β₁), along with standard errors of 0.4451 and 2.0997, respectively. To estimate the LD50, note that for the probit model LD50 is the dosage level t₀ for which E(y_i/n_i | t_i = t₀) = Φ(β₀ + β₁t₀) = 0.5. As before, this implies that LD50 = −β₀/β₁. Using the sample provided by the Gibbs sampler, we estimate this to be 0.248.
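A minimal sketch (not the book's code) of this Albert-Chib Gibbs sampler for the chloracetic acid data, assuming a flat prior on β, can be written with one latent Z per mouse:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(0)

# Table 8.4: chloracetic acid dose (g/kg) and fatalities out of 10 mice per level
dose = np.array([.0794, .1000, .1259, .1413, .1500, .1558,
                 .1778, .1995, .2239, .2512, .2818, .3162])
deaths = np.array([1, 2, 1, 0, 1, 2, 4, 6, 4, 5, 5, 8])
m = 10

# One row per mouse: y_ij = 1 for a death, 0 otherwise
X = np.column_stack([np.ones(len(dose) * m), np.repeat(dose, m)])
y = np.concatenate([[1] * d + [0] * (m - d) for d in deaths]).astype(float)

XtX_inv = np.linalg.inv(X.T @ X)
beta = np.zeros(2)
draws = []
for it in range(2000):
    # (8.17): latent Z from normals truncated at 0, given beta and y
    mu = X @ beta
    a = np.where(y == 1, -mu, -np.inf)  # standardized lower bounds
    b = np.where(y == 1, np.inf, -mu)   # standardized upper bounds
    Z = truncnorm.rvs(a, b, loc=mu, scale=1.0, random_state=rng)
    # (8.16) with a flat prior: beta | Z, y ~ N(beta_hat_Z, (X'X)^{-1})
    beta_hat = XtX_inv @ X.T @ Z
    beta = rng.multivariate_normal(beta_hat, XtX_inv)
    if it >= 1000:
        draws.append(beta)
draws = np.array(draws)
print(draws.mean(axis=0), np.median(-draws[:, 0] / draws[:, 1]))
```

With a burn-in of 1000 iterations, the retained draws give posterior summaries for (β₀, β₁) and for LD50 = −β₀/β₁ comparable to those reported in the example.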

8.4 Exercises

1. Show how a random deviate from the Student's t is to be generated.
2. Construct the 95% HPD credible set for θ₁ − θ₂ for the two-sample problem in Section 8.1, assuming σ₁² = σ₂².


Fig. 8.7. Scatter plot of 1000 (β₀, β₁) values sampled using Gibbs sampler.

3. Show that Student's t can be expressed as a scale mixture of normals. Using this fact, explain how the 95% HPD credible set for θ₁ − θ₂ can be constructed for the case given in (8.3).
4. Consider the data in Table 8.5 from a clinical trial conducted by Mr. S. Sahu, a medical student at Bangalore Medical College, Bangalore, India (personal communication). The objective of the study was to compare two treatments, surgical and non-surgical medical, for short-term management of benign prostatic hyperplasia (enlargement of prostate). The random observable of interest is the 'improvement score' recorded for each of the patients by the physician, which we assume to be normally distributed. There were 15 patients in the non-surgical group and 14 in the surgical group.

Table 8.5. Improvement Scores

Apply the results from Problems 2 and 3 above to make inferences about the difference in mean improvement in this problem.
5. Consider the linear regression model (8.4).
(a) Show that (β̂, (y − ŷ)'(y − ŷ)) is jointly sufficient for (β, σ²).


(b) Show that β̂ | σ² ~ N_p(β, σ²(X'X)⁻¹) and (y − ŷ)'(y − ŷ) | σ² ~ σ²χ²_{n−p}, and that they are independently distributed.
(c) Under the default prior (8.6), show that β | y has the multivariate t distribution having density (8.8).
6. Construct the 95% HPD credible set for (β₂, β₃) in Example 8.1.
7. Consider the model given in (8.12).
(a) What is a sufficient statistic for β?
(b) Show that the likelihood equations for deriving the MLE of β must satisfy Σ_{i=1}^k n_i p_i(β) x_i = Σ_{i=1}^k y_i x_i, where p_i(β) = exp{−β'x_i}/(1 + exp{−β'x_i}).
8. Justify the approximate likelihood given in (8.13).
9. Consider a multivariate normal prior on β for Problem 7.
(a) Explain how the M-H algorithm can be used for computing the posterior quantities.
(b) Compare the above scheme with an importance sampling strategy where the importance function is proportional to the product of the normal prior and the approximate normal likelihood given in (8.13).
10. Apply the results from Problem 9 to Example 8.3 with some appropriate choice for the hyperparameters.
11. Justify (8.16) and (8.17), and explain how the Gibbs sampler handles (8.15).
12. Analyze the problem in Example 8.4 with an additional data point having 9 fatalities at the dosage level of 0.3400.
13. Analyze the problem in Example 8.4 using logistic regression. Compare the results with those obtained in Section 8.3.2 using the probit model.

9 High-dimensional Problems

Rather than begin by defining what is meant by high-dimensional, we begin with a couple of examples.

Example 9.1. (Stein's example) Let N(μ_{p×1}, Σ_{p×p} = σ²I_{p×p}) be a p-variate normal population and X_i = (X_{i1}, ..., X_{ip}), i = 1, ..., n, be n i.i.d. p-variate samples. Because Σ = σ²I, we may alternatively think of the data as p independent samples of size n from p univariate normal populations N(μ_j, σ²), j = 1, ..., p. The parameters of interest are the μ_j's. For convenience, we initially assume σ² is known. Usually, the number of parameters, p, is large and the sample size n is small compared with p. These have been called problems with large p, small n. Note that n in Stein's example is the sample size, if we think of the data as a p-variate sample of size n. However, we could also think of the data as univariate samples of size n from each of p univariate populations. Then the total sample size would be np. The second interpretation leads to a class of similar examples. Note that the observations are not exchangeable except in subgroups; in this sense one may call them partially exchangeable.

Example 9.2. Let f(x|μ_j), j = 1, ..., p, denote the densities for p populations, and X_ij, i = 1, ..., n, j = 1, ..., p, denote p samples of size n from these p populations. As in Example 9.1, f(x|μ_j) may contain additional common parameters. The object is to make inference about the μ_j's.

In several path-breaking papers, Stein (1955), James and Stein (1960), Stein (1981), Robbins (1955, 1964), and Efron and Morris (1971, 1972, 1973a, 1975) have shown that classical objective Bayes or classical frequentist methods, e.g., maximum likelihood estimates, will usually be inappropriate here. See also Kiefer and Wolfowitz (1956) for applications to examples like those of Neyman and Scott (1948). These approaches are discussed in Sections 9.1 through 9.4, with stress on the parametric empirical Bayes (PEB) approach of Efron and Morris, as extended in Morris (1983).


It turns out that exchangeability of μ₁, ..., μ_p plays a fundamental role in all these approaches. Under this assumption, there is a simple and natural Bayesian solution of the problem based on a hierarchical prior and MCMC. Much of the popularity of Bayesian methods is due to the fact that many new examples of this kind could be treated in a unified way. Because of the fundamental role of exchangeability of the μ_j's and the simplicity, at least in principle, of the Bayesian approach, we begin with these two topics in Section 9.1. This leads in a natural way to the PEB approach in Sections 9.2 and 9.3 and the frequentist approach in Section 9.4. All the above sections deal with point or interval estimation. In Section 9.6 we deal with testing and multiple testing in high-dimensional problems, with an application to microarrays. High-dimensional inference is closely related to model selection in high-dimensional problems. A brief overview is presented in Sections 9.7 and 9.8. We discuss several general issues in Sections 9.5 and 9.9.

9.1 Exchangeability, Hierarchical Priors, Approximation to Posterior for Large p, and MCMC

We follow the notations of Example 9.1 and Example 9.2, i.e., we consider p similar but not identical populations with densities f(x|μ₁), ..., f(x|μ_p). There is a sample of size n, X_{1j}, ..., X_{nj}, from the jth population. These p populations may correspond with p adjacent small areas with unknown per capita income μ₁, ..., μ_p, as in small area estimation, Ghosh and Meeden (1997, Chapters 4, 5). They could also correspond with p clinical trials in a particular hospital, where μ_j, j = 1, ..., p, are the mean effects of the drug being tested. In all these examples, the different studied populations are related to each other. In Morris (1983), which we closely follow in Section 9.2, the p populations correspond to p baseball players and the μ_j's are average scores. Other such studies are listed in Morris and Christiansen (1996). In order to assign a prior distribution for the μ_j's, we model them as exchangeable rather than i.i.d. or just independent. An exchangeable, dependent structure is consistent with the assumption that the studies are similar in a broad sense, so they share many common elements. On the other hand, independence may be unnecessarily restrictive and somewhat unintuitive in the sense that one would expect to have a separate analysis for each sample if the μ_j's were independent and hence unrelated. However, to justify exchangeability one would need a particular kind of dependence. For example, Morris (1983) points out that the baseball players in his study were all hitters; his analysis would have been hard to justify if he had considered both hitters and pitchers. Using a standard way of generating exchangeable random variables, we introduce a vector of hyperparameters η and assume the μ_j's are i.i.d. π(μ|η) given η. Typically, if f(x|μ) belongs to an exponential family, it is convenient to


take π(μ|η) to be a conjugate prior. It can be shown that for even moderately large p (in the baseball example of Morris (1983), p = 18) there is a lot of information in the data on η. Hence the choice of a prior for η does not have much influence on inference about the μ_j's. It is customary to choose a uniform or one of the other objective priors (vide Chapter 5) for η. We illustrate these ideas by exploring in detail Example 9.1.

Example 9.3. (Example 9.1, continued) Let f(x|μ_j) be the density of N(μ_j, σ²), where we initially assume σ² is known. We relax this assumption in Subsection 9.1.1. The prior for μ_j is taken to be N(η₁, η₂), where η₁ is the prior guess about the μ_j's and η₂ is a measure of the prior uncertainty about this guess, vide Example 2.1, Chapter 2. The prior for (η₁, η₂) is π(η₁, η₂), which we specify a bit later. Because (X̄_j = Σ_{i=1}^n X_ij/n, j = 1, ..., p) form a sufficient statistic and the X̄_j's are independent, the Bayes estimate for μ_j under squared error loss is

E(μ_j | X) = ∫ E(μ_j | X, η) π(η | X) dη,

where X = (X_ij, i = 1, ..., n, j = 1, ..., p) and X̄ = (X̄₁, ..., X̄_p). Also (vide Example 2.1),

E(μ_j | X, η) = (1 − B)X̄_j + Bη₁,    (9.1)

with B = (σ²/n)/(η₂ + σ²/n), which depends on the data only through X̄_j. The Bayes estimate of μ_j, on the other hand, depends on X̄_j as above and also on (X̄₁, ..., X̄_p), because the posterior distribution of η depends on all the X̄_j's. Thus the Bayes estimate learns from the full sufficient statistic, justifying simultaneous estimation of all the μ_j's. This learning process is sometimes referred to as borrowing strength. If the modeling of the μ_j's is realistic, we would expect the Bayes estimates to perform better than the X̄_j's. This is what is strikingly new in the case of large p, small n, and it follows from the exchangeability of the μ_j's. The posterior density π(η|X) is also easy to calculate in principle. For known σ², one can get it explicitly. Integrating out the μ_j's and holding η fixed, we get that the X̄_j's are independent and

X̄_j | η ~ N(η₁, η₂ + σ²/n).    (9.2)

Let the prior density of (η₁, η₂) be constant on ℝ × ℝ⁺. (See Problem 1 and Gelman et al. (1995) to find out why some other choices like π(η₁, η₂) = 1/η₂ are not suitable here.) Given (9.2) and π(η₁, η₂) as above,


π(η₁, η₂ | X̄) ∝ (η₂ + σ²/n)^{−p/2} exp{−[S + p(X̄ − η₁)²] / (2(η₂ + σ²/n))},    (9.3)

where X̄ = (1/p) Σ_{j=1}^p X̄_j and S = Σ_{j=1}^p (X̄_j − X̄)². In a similar way,

π(μ | X) = ∏_{j=1}^p ∫ π(μ_j | X̄_j, η) π(η | X̄) dη,    (9.4)

where the (μ_j | X̄_j, η) are independent normal with mean as in (9.1) and variance

η₂(σ²/n) / (η₂ + σ²/n),    (9.5)

and

π(η₁, η₂ | X̄) = π(η₁ | η₂, X̄) π(η₂ | X̄)    (9.6)

is the product of a normal and an inverse-Gamma (as given in (9.3)). Construction of a credible interval for μ_j is, in principle, simple. Consider π(μ_j | X), fix 0 < α < 1, and find limits μ̲_j(X) and μ̄_j(X) such that

P{μ̲_j(X) ≤ μ_j ≤ μ̄_j(X) | X} = 1 − α.

In general, to calculate μ̲_j and μ̄_j one would have to resort to MCMC, which is explained in Subsection 9.1.1. For large p, good approximations are available. To do this we anticipate to some extent Section 9.2. Because p is large, we can invoke the theorem on posterior normality (Chapter 4). Thus the posterior for η is nearly normal with mean η̂ and variances of order O(1/p), η̂ being the MLE of η based on the "likelihood"

∏_{j=1}^p f(x̄_j | η).

Hence, π(η|X̄) is approximately (in the sense of convergence in distribution) degenerate at η̂. This implies

π(μ_j | X) = ∫ π(μ_j | X̄_j, η) π(η | X̄) dη = π(μ_j | X̄_j, η̂) (approximately).    (9.7)

This in turn implies the Bayes estimate of μ_j is

μ̂_j = (1 − B̂)X̄_j + B̂η̂₁,    (9.8)

which, by (9.1), is a shrinkage estimate that shrinks X̄_j towards η̂₁ and which depends on all the sample means. The approximation (9.8) is correct up to O(1/p). A similar argument provides an approximation to the posterior s.d., but the accuracy is only O(1/√p). Simulations indicate the approximation for the Bayes estimate is quite good, but that for the posterior s.d. is much less accurate. It is known, vide Morris (1983), that the approximation is also inadequate for credible intervals. As a final application of this approximation, we indicate that it is possible to check whether the prior π(μ_j|η) is consistent with the data. More precisely, we check the consistency of f(x̄_j|η) with the data, but a check for f(x̄_j|η) is indirectly a check for π(μ_j|η). In the light of the data, η = η̂ is the most likely value of the hyperparameter η. Under η̂, the X̄_j's are i.i.d. normal with mean and variance given by (9.2) with η replaced by η̂. We can now check the fit of this model to the empirical distribution via the Q-Q plot. For each 0 < p < 1, we plot the 100pth quantiles of the theoretical and empirical distributions as (x(p), y(p)). If the fit is good, the resulting curve {(x(p), y(p)), 0 < p < 1} should scatter around an equiangular line passing through the origin.
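As a numerical illustration of (9.8) on simulated data (a sketch, not from the book): with σ² known, the MLE under the marginal model (9.2) has the closed form η̂₁ = X̄ and η̂₂ = max(0, S/p − σ²/n), and the shrinkage estimates follow directly.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, sigma2 = 50, 5, 1.0

# Simulate mu_j ~ N(eta1, eta2), then n observations from each population
eta1_true, eta2_true = 2.0, 0.5
mu = rng.normal(eta1_true, np.sqrt(eta2_true), size=p)
X = rng.normal(mu, np.sqrt(sigma2), size=(n, p))
Xbar = X.mean(axis=0)  # sufficient statistics

# MLE of (eta1, eta2) under the marginal model (9.2), truncated at zero
eta1_hat = Xbar.mean()
S = np.sum((Xbar - eta1_hat) ** 2)
eta2_hat = max(0.0, S / p - sigma2 / n)

# Shrinkage estimates (9.8)
B = (sigma2 / n) / (eta2_hat + sigma2 / n)
mu_hat = (1 - B) * Xbar + B * eta1_hat

# The shrinkage estimates typically have smaller total squared error than Xbar
print(np.sum((Xbar - mu) ** 2), np.sum((mu_hat - mu) ** 2))
```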

9.1.1 MCMC and E-M Algorithm

We begin with p exponential densities of the same form, namely,

f(x_j | θ_j) ∝ exp{n c(θ_j) + Σ_{i=1}^k θ_ji t_ji(x_j)},   j = 1, ..., p,    (9.9)

where the t_ji are the sufficient statistics of the sample x_j of size n from the jth population. The conjugate prior density for the jth model is proportional to

exp{η₀ c(θ_j) + Σ_{i=1}^k η_i θ_ji},   j = 1, ..., p.    (9.10)

Note that the hyperparameters (η₀, η₁, ..., η_k) are the same for all j. Finally, consider a prior π(η) for the hyperparameters. Let X = (X₁, ..., X_p) and θ = (θ₁, ..., θ_p). The conditional density of θ given X, η is

π(θ | X, η) ∝ ∏_{j=1}^p exp{(η₀ + n)c(θ_j) + Σ_{i=1}^k (t_ji(x_j) + η_i)θ_ji},    (9.11)


which shows that conditionally the θ_j's remain independent. Also,

π(η | X, θ) ∝ exp{p d(η) + (η₀ + n) Σ_{j=1}^p c(θ_j) + Σ_{j=1}^p Σ_{i=1}^k (η_i + t_ji(x_j))θ_ji} π(η),    (9.12)

where exp(d(η)) is the normalizing constant of the expression in (9.10). By (9.12), the Bayes formula, and cancellation of some common terms in the numerator and denominator of the Bayes formula,

π(η | X, θ) ∝ exp{p d(η) + η₀ Σ_{j=1}^p c(θ_j) + Σ_{j=1}^p Σ_{i=1}^k η_i θ_ji} π(η).

If d(η) has an explicit form, as is often the case, one can apply Gibbs sampling to draw samples from the joint posterior of θ and η using the conditionals π(θ|X, η) and π(η|X, θ). Otherwise one can apply Metropolis-Hastings. In general, the approximations based on η̂ are still valid, but computation of η̂ is non-trivial. It turns out that the E-M algorithm can be applied, vide Gelman et al. (1995, Chapter 9). We illustrate the algorithms for MCMC and E-M in the case of N(μ_j, σ²), j = 1, ..., p, with (μ₁, ..., μ_p) and σ² unknown. MCMC is quite straightforward here. Recall Example 7.13 from Chapter 7, where the hierarchical Bayesian analysis of the usual one-way ANOVA was discussed. With minimal modification, the same algorithm fits here to allow Gibbs sampling. Application of the E-M algorithm is also easy, as discussed in Gelman et al. (1995). We assume as before that the μ_j are i.i.d. N(η₁, η₂), and further take π(η₁, σ², η₂) = 1/σ². Then, recall from Section 7.2 that we need to apply the E and M steps to

−(1/(2σ²)) Σ_{j=1}^p Σ_{i=1}^n (X_ij − μ_j)² + constants.    (9.13)

The E-step requires the conditional distribution of (μ, σ²) given X and the current value (η₁^{(c)}, η₂^{(c)}) of (η₁, η₂). This is just the normal, inverse Gamma model. In the M-step we need to maximize E^{(c)}(log π(μ, η₁, σ², η₂ | X)) as a function of (η₁, η₂), which is straightforward.

9.2 Parametric Empirical Bayes

To explain the basic ideas, we consider once more the special case of N(μ_j, σ²), σ² known. Explicit formulas are available in this special case for comparison with the estimates of Stein. Another interesting special case is discussed in Carlin and Louis (1996, Chapter 3). The PEB approach was introduced by Efron and Morris in a series of papers including Efron and Morris (1971, 1972, 1973a, 1973b, 1975, 1976). In this section we generally follow Morris (1983). Efron and Morris tried to take an intermediate position between a fully Bayes and a fully frequentist approach by treating the likelihood as given by f(x̄_j|η), obtained by integrating out the μ_j's as in (9.2). The η's are treated as unknown parameters as in frequentist analysis, whereas the μ_j's are treated as random variables as in Bayesian analysis. This leads to a reduction of a high-dimensional frequentist problem about the μ_j's to a low-dimensional semifrequentist problem about η, about which there is a lot of information in the data. The fully Bayesian and the PEB approach differ in that no prior is assigned to η, and η is estimated by the MLE or by a suitable unbiased estimate. So one may, if one wishes, think of the PEB approach as an approximation to the hierarchical Bayes approach of Section 9.1. A disadvantage of PEB is that accounting for the uncertainty about η is more difficult than in hierarchical Bayes, a point that will be discussed again in Subsection 9.2.1. An advantage is that one gets an explicit estimate for μ_j, namely, (9.1) with η replaced by an estimate of η. Note that under the likelihood (9.2), the complete sufficient statistic is the pair

(X̄ = (1/p) Σ_{j=1}^p X̄_j,   S = Σ_{j=1}^p (X̄_j − X̄)²).    (9.14)

Also, X̄ and

B̂ = (p − 3)(σ²/n)/S    (9.15)

are unbiased estimates of η₁ and

B = (σ²/n) / (σ²/n + η₂).    (9.16)

Then the best unbiased predictor of the RHS of (9.1) is

μ̂_j = (1 − B̂)X̄_j + B̂X̄,    (9.17)

which is the famous James-Stein-Lindley estimate of μ_j. It shrinks X̄_j towards the overall mean X̄. The amount of shrinkage is determined by B̂, which is close to 1 if S/(p − 3) is close to σ²/n, and close to zero if S/(p − 3) is much larger than σ²/n. Note that if S/(p − 3) is small, as in the first case, then the X̄_j's are close to X̄, indicating the μ_j's are close to each other. This justifies a fairly strong shrinkage towards the grand mean. On the other hand, a large S/(p − 3) indicates heterogeneity among the μ_j's, suggesting relatively large weight for X̄_j. Because


E(S/(p − 1)) = σ²/n + η₂,    (9.18)

an unbiased estimate of η₂ is η̃₂ = S/(p − 1) − σ²/n. Because η₂ ≥ 0, a more plausible estimate is η̃₂⁺ = max(0, η̃₂), the positive part of the unbiased estimate. This suggests changing the estimate of B to

B̂⁺ = min(B̂, 1),    (9.19)

which is the James-Stein-Lindley positive part estimate. If we take η₁ = 0, i.e., the μ_j's are i.i.d. N(0, η₂), then the two estimates are of the form

μ̂_j = (1 − B̂)X̄_j   and   μ̂_j = (1 − B̂)⁺X̄_j.    (9.20)

These are the James-Stein and James-Stein positive part estimates. They shrink the estimate towards an arbitrary point zero and so do not seem attractive in the exchangeable case. But they have turned out to be quite useful in estimating coefficients in an orthogonal expansion of an unknown function with white noise as error, vide Cai et al. (2000). We study frequentist properties of these two estimates in Section 9.4.

9.2.1 PEB and HB Interval Estimates

Morris defines a random confidence interval (μ̲_j(X), μ̄_j(X)) for μ_j to have PEB confidence coefficient (1 − α) if

P_η{μ̲_j ≤ μ_j ≤ μ̄_j} ≥ 1 − α.    (9.21)

Let s_j² = [1 − ((p − 1)/p)B̂]σ²/n + (2/(p − 3))B̂²(X̄_j − X̄)². Morris has conjectured that

μ̂_j ± z_{α/2} s_j, with μ̂_j as in (9.17),    (9.22)

is a PEB confidence interval with confidence coefficient 1 − α. It is shown in Basu et al. (2003) that the conjecture is not true, but the violations are so rare and so small in magnitude that it hardly matters. Basu et al. (2003) suggest an adjustment that would make (9.22) true up to o(p⁻²). It is also shown there that asymptotically the adjusted interval is equivalent to a PEB interval proposed by Carlin and Louis (1996, Chapter 3). A trouble with Morris's interval is that it is somewhat ad hoc. We are not told how exactly it is derived. It seems he puts a noninformative prior on (η₁, η₂) and adjusts somewhat the HB credible interval to get a conservative frequentist coverage probability. There is a natural alternative that does not require additional adjustment and ensures (9.21) with the inequality replaced by an equality up to O(p⁻²). To do this, one has to choose a prior for η that is probability matching in the sense of


P_η{μ̲_j(X) ≤ μ_j ≤ μ̄_j(X)} = 1 − α + O(p⁻²),    (9.23)

where

P{μ_j > μ̄_j | X} = α/2,   P{μ_j < μ̲_j | X} = α/2,    (9.24)

and the probabilities in (9.24) are the posterior probabilities under the prior for η. This leads to probability matching differential equations. A solution is given in (9.25), vide Datta, Ghosh, and Mukerjee (2000).

9.3 Linear Models for High-dimensional Parameters

We can extend the HB and PEB approach to a more general setup using covariates and linear models. The parameters are no longer exchangeable but are affected by a common set of low-dimensional hyperparameters assuming the role of η. The model in Sections 9.1 and 9.2 is a special case of the linear model below. Following Morris (1983), we change our notation slightly:

Y_j | θ_j ~ N(θ_j, V),   j = 1, ..., p,   independent,    (9.26)

and given β, A,

θ_j = Z_j'β + ε_j,   j = 1, ..., p,    (9.27)

where the ε_j's are i.i.d. N(0, A). Here p is at least moderately large, and r, the dimension of β, is relatively small. Morris allows the variance of ε_j to depend on j, which is often a more realistic assumption. Keeping the same variance A for all j simplifies the algebra considerably. In the HB analysis we need to put a further prior on β. A conjugate prior for β given A is multivariate normal (9.28). Finally, A is given an inverse Gamma or a uniform or the standard prior 1/A for scale parameters. Assuming V is known, our specification of priors is complete. To do MCMC we partition the parameters and (random) hyperparameters into three sets (θ, β, A). The conditionals are as follows: (1) given β, A (and Y), θ is multivariate normal; (2) given θ, A (and Y), β is multivariate normal; (3) given θ, β (and Y), A has an inverse Gamma distribution. You are asked to find the parameters of these conditionals in Problem 6.


In the PEB approach, one first notes

θ_i | Y, β, A ~ N(θ_i*, (1 − B)V),    (9.29)

where

θ_i* = (1 − B)Y_i + B Z_i'β,    (9.30)

with B = V/(V + A). Here Z_i is the ith column vector of Z. This is the shrinkage estimate corresponding with (9.1) of Section 9.1. In the PEB approach one has to estimate β and B either by maximizing the likelihood of the independent Y_i's with

Y_i ~ N(Z_i'β, V + A),    (9.31)

or by finding a suitable unbiased estimate as in (9.18). Let β̂ = (Z'Z)⁻¹(Z'Y). The statistics β̂ and S = (Y − Zβ̂)'(Y − Zβ̂) are independent, complete sufficient statistics for (β, A). Hence the best unbiased estimates for β and B are β̂ and

B̂ = (p − r − 2)V/S

(vide Problem 10). Substituting β̂ for β and B̂ for B in the shrinkage estimate θ_i*, one gets

θ̂_i = (1 − B̂)Y_i + B̂Z_i'β̂.

This is the analogue of the James-Stein-Lindley estimate under the regression model. In Problem 8, you are asked to show that the PEB risk of θ̂_i, namely E_{β,A}(θ̂_i − θ_i)², is smaller than the PEB risk of Y_i, namely E_{β,A}(Y_i − θ_i)². The relative strength of the PEB estimate comes through the use of β̂, which is based on the full data set Y. In Section 8.3, linear regression is discussed as a common statistical problem where an objective Bayesian analysis is done. You may wish to explore how similar ideas are used in this section to model a high-dimensional problem.
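A simulation sketch of these PEB formulas (assumed setup, not from the book: known V, a p × r design matrix Z with an intercept and one simulated covariate):

```python
import numpy as np

rng = np.random.default_rng(0)
p, r, V, A = 200, 2, 1.0, 0.5

# Second stage: theta_j = Z_j' beta + eps_j, eps_j ~ N(0, A); Z is p x r here
Z = np.column_stack([np.ones(p), rng.uniform(-1, 1, p)])
beta = np.array([1.0, 2.0])
theta = Z @ beta + rng.normal(0, np.sqrt(A), p)
Y = rng.normal(theta, np.sqrt(V))  # first stage: Y_j | theta_j ~ N(theta_j, V)

# Complete sufficient statistics and the unbiased estimates of beta and B
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
S = np.sum((Y - Z @ beta_hat) ** 2)
B_hat = (p - r - 2) * V / S  # estimates B = V/(V + A)

# PEB estimate: shrink Y_j towards the fitted regression surface
theta_hat = (1 - B_hat) * Y + B_hat * (Z @ beta_hat)
print(np.mean((Y - theta) ** 2), np.mean((theta_hat - theta) ** 2))
```

With p = 200 the shrinkage estimate has markedly smaller average squared error than the raw Y, in line with the PEB risk comparison of Problem 8.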

9.4 Stein's Frequentist Approach to a High-dimensional Problem

Once again we study Example 9.1. Let the X̄_j's be independent, X̄_j ~ N(μ_j, σ²/n). Classical criteria like maximum likelihood, minimaxity, or minimizing variance


among unbiased estimates all lead to (X̄₁, ..., X̄_p) as the estimate of (μ₁, ..., μ_p). Let p ≥ 3. Stein, in a series of papers, Stein (1956), James and Stein (1960), Stein (1981), showed that if we define a loss function

L(X̄, μ) = Σ_{j=1}^p (X̄_j − μ_j)²    (9.32)

and generally

L(T, μ) = Σ_{j=1}^p (T_j(X̄) − μ_j)²    (9.33)

for a general estimate T, it is possible to choose a T that is better than X̄ in the sense

E_μ(L(T, μ)) < E_μ(L(X̄, μ)) for all μ.    (9.34)

Stein (1956) provides a heuristic motivation that suggests X̄ is too large in a certain sense explained below. To see this, compare the expectation of the squared norm of X̄ with the squared norm of μ:

E_μ‖X̄‖² = ‖μ‖² + pσ²/n.    (9.35)

The larger the p, the bigger the deviation between the LHS and the RHS. So one would expect, at least for sufficiently large p, that one can get a better estimate by shrinking each coordinate of X̄ suitably towards zero. We present below two of the most well-known shrinkage estimates, namely, the James-Stein and the positive part James-Stein estimates. Both have already appeared in Section 9.2 as PEB estimates. It seems to us that the PEB approach provides the most insight about Stein's estimates, even though the PEB interpretation came much later. Morris points out that there is no exchangeability or prior in Stein's approach, but summing the individual losses produces a similar effect. Moreover, pooling the individual losses would be a natural thing to do only when the different μ_j's are related in some way. If they are totally unrelated, Stein's result would be merely a curious fact with no practical significance, not a profound new data analytic tool. It is in the case of exchangeable high-dimensional problems that it provides substantial improvement. We present two approaches to proving that the James-Stein estimate is superior to the classical estimate. One is based on Stein (1981), with details as in Ibragimov and Has'minskii (1981). The other is an interesting variation on this due to Schervish (1995).

Stein's Identity. Let X ∼ N(μ, σ²) and let φ(x) be a real-valued function differentiable on R with ∫_a^x φ'(u) du = φ(x) − φ(a). Then

σ² E(φ'(X)) = E((X − μ)φ(X)).
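Before the proof, a quick Monte Carlo sanity check of the identity; the choice φ(x) = sin x and the values of μ, σ are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma = 1.3, 0.8
X = rng.normal(mu, sigma, size=2_000_000)

phi, phi_prime = np.sin, np.cos          # phi(x) = sin x, phi'(x) = cos x

lhs = sigma ** 2 * phi_prime(X).mean()   # sigma^2 E(phi'(X))
rhs = ((X - mu) * phi(X)).mean()         # E((X - mu) phi(X))
```

The two sides agree up to simulation error.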


Proof. Integrating by parts or changing the order of integration,

E(φ'(X)) = (1/(√(2π)σ)) ∫_{−∞}^{∞} φ'(x) exp{−(x − μ)²/(2σ²)} dx
         = −(1/(√(2π)σ)) ∫_{−∞}^{∞} φ(x) (d/dx) exp{−(x − μ)²/(2σ²)} dx
         = σ⁻² E(φ(X)(X − μ)). □  (9.36)

For more details see the proof in Stein (1981). As a corollary we have the following result.

Corollary. Let (X_1, X_2, ..., X_p) be a random vector ∼ N_p(μ, σ²I). Let φ = (φ_1, φ_2, ..., φ_p): R^p → R^p be differentiable with E|∂φ_j/∂X_j| < ∞ for each j, and assume that

φ_j(x_1, ..., x_{j−1}, x, x_{j+1}, ..., x_p) exp{−(x − μ_j)²/(2σ²)} → 0 as |x| → ∞.

Then

E((X − μ)'φ(X)) = σ² Σ_{j=1}^p E(∂φ_j/∂X_j).  (9.37)

We now return to Example 9.1. The classical estimate for μ is X̄ =

(X̄_1, X̄_2, ..., X̄_p). Consider an alternative estimate of the form

μ̂ = X̄ + n⁻¹σ² g(X̄),  (9.38)

where g(x) = (g_1, g_2, ..., g_p): R^p → R^p, with g satisfying the conditions in the corollary. Then, by the corollary,

E_μ||X̄ − μ||² − E_μ||X̄ + n⁻¹σ²g(X̄) − μ||²
  = −2n⁻¹σ² E_μ{(X̄ − μ)'g(X̄)} − n⁻²σ⁴ E_μ||g(X̄)||²
  = −2n⁻²σ⁴ E_μ{Σ_{j=1}^p ∂g_j/∂X̄_j} − n⁻²σ⁴ E_μ||g(X̄)||².  (9.39)

Now suppose g(x) = grad(log φ(x)), where φ is a twice continuously differentiable function from R^p into R. Then

Σ_{j=1}^p ∂g_j/∂x_j = Δφ/φ − ||g||²,  (9.40)

where Δ = Σ_{j=1}^p ∂²/∂x_j². Hence

E_μ||X̄ − μ||² − E_μ||μ̂ − μ||² = n⁻²σ⁴ E_μ||g||² − 2n⁻²σ⁴ E_μ{Δφ(X̄)/φ(X̄)},  (9.41)


which is positive if φ(x) is a positive non-constant superharmonic function, i.e.,

Δφ ≤ 0.  (9.42)

Thus μ̂ is superior to X̄ if (9.42) holds. It is known that such functions exist if and only if p ≥ 3. To show the superiority of the James-Stein positive part estimate for p ≥ 3, take

φ(x) = ||x||^{−(p−2)}  if ||x|| ≥ √(p−2);
φ(x) = (p−2)^{−(p−2)/2} exp{½((p−2) − ||x||²)}  otherwise.  (9.43)

Then grad(log φ) is easily verified to give the James-Stein positive part estimate. To show the superiority of the James-Stein estimate, take

φ(x) = ||x||^{−(p−2)}.  (9.44)

We observed earlier that shrinking towards zero is natural if one models the μ_j's as exchangeable with common mean equal to zero. We expect substantial improvement if μ = 0. Calculation shows

E_μ||μ̂ − μ||² = 2  (9.45)

if μ = 0, σ² = 1, n = 1. It appears that borrowing strength in the frequentist formulation is possible because Stein's loss adds up the losses of the component decision problems. Such addition would make sense only when the different problems are connected in a natural way, in which case the exchangeability assumption and the PEB or hierarchical models are also likely to hold. It is natural to ask how good the James-Stein estimates are in the frequentist sense. They are certainly minimax since they dominate minimax estimates. Are they admissible? Are they Bayes (not just PEB)? For the James-Stein positive part estimate the answer to both questions is no; see Berger (1985a, pp. 542, 543). On the other hand, Strawderman (1971) constructs a proper Bayes minimax estimate for p ≥ 5. Berger (1985a, pp. 364, 365) discusses the question of which among the various minimax estimates to choose. Note that the PEB approach leads in a natural way to the James-Stein positive part estimate, suggesting that it cannot be substantially improved even though it is not Bayes. See in this connection Robert (1994, p. 66). There is a huge literature on Stein estimates as well as questions of admissibility in multidimensional problems. Berger (1985a) and Robert (1994) provide excellent reviews of the literature. There are intriguing connections between admissibility and recurrence of suitably constructed Markov processes; see Brown (1971), Srinivasan (1981), and Eaton (1992, 1997, 2004).
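The dominance result can be checked by simulation. The sketch below compares the Monte Carlo risk of the James-Stein positive part estimate (the estimate obtained from grad(log φ) with φ as in (9.43)) with that of X̄, taking σ² = 1, n = 1; the dimension and the choice μ = 0 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative dimensions; sigma^2 = 1, n = 1 as in the text, mu = 0
# (the most favorable case for shrinkage).
p, n_rep = 10, 20000
mu = np.zeros(p)

X = mu + rng.normal(size=(n_rep, p))
norm2 = (X ** 2).sum(axis=1)
shrink = np.maximum(1 - (p - 2) / norm2, 0.0)[:, None]
js_plus = shrink * X                                 # James-Stein positive part

risk_mle = ((X - mu) ** 2).sum(axis=1).mean()        # about p
risk_js = ((js_plus - mu) ** 2).sum(axis=1).mean()   # much smaller at mu = 0
```

Replacing mu by a vector far from zero shows the improvement becoming negligible, consistent with the remarks on extreme μ's.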


When extreme μ's may occur, the Stein estimates do not offer much improvement. Stein (1981) and Berger and Dey (1983) suggest how this problem can be solved by suitably truncating the sample means. For Stein-type results for general ridge regression estimates, see Strawderman (1978), where several other references are given. Of course, instead of zero we could shrink towards an arbitrary μ_0. Then a substantial improvement will occur near μ_0. Exactly similar results hold for the James-Stein-Lindley estimate and its positive part estimate if p ≥ 4. For the James-Stein estimate, Schervish (1995, pp. 163-165) uses Stein's identity as well as (9.40) but then shows directly (with σ² = 1, n = 1)

||g||² + 2 Σ_{j=1}^p ∂g_j/∂x_j = −(p − 2)²/Σ_{j=1}^p x_j² < 0.

Clearly for μ̂ = the James-Stein estimate,

E_μ||μ̂ − μ||² = p − E_μ{(p − 2)²/Σ_j X̄_j²},

which shows how the risk can be evaluated by simulating a noncentral χ² distribution.
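A sketch of this risk evaluation: since Σ_j X̄_j² has a noncentral χ²_p distribution with noncentrality ||μ||², the risk formula above can be computed by simulating that distribution, and checked against a direct simulation of the James-Stein estimate. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

p = 10
mu = np.full(p, 0.5)
lam = float(mu @ mu)   # noncentrality parameter ||mu||^2

# Risk via the identity: E||mu_hat - mu||^2 = p - (p-2)^2 E[1 / chi2_p(lam)]
chi2 = rng.noncentral_chisquare(p, lam, size=1_000_000)
risk_identity = p - (p - 2) ** 2 * np.mean(1.0 / chi2)

# Direct Monte Carlo of the James-Stein estimate, for comparison
X = mu + rng.normal(size=(200_000, p))
norm2 = (X ** 2).sum(axis=1)
js = (1 - (p - 2) / norm2)[:, None] * X
risk_direct = ((js - mu) ** 2).sum(axis=1).mean()
```

The two evaluations agree up to simulation error, and both fall well below the constant risk p of X̄.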

9.5 Comparison of High-dimensional and Low-dimensional Problems

In the low-dimensional case, where n is large or moderate and p small, the prior is washed away by the data: the likelihood influences the posterior more than the prior. This is not so when p is much larger than n, the so-called high-dimensional case. The prior is important, so elicitation, if possible, is important. Checking the prior against data is possible and should be explored. We discuss this below. In the high-dimensional cases examined in Sections 9.2 and 9.3, some aspects of the prior, namely π(μ_j|η), can be checked against the empirical distribution. We have discussed this earlier mathematically, but one can approach this from a more intuitive point of view. Because we have many μ_j's as a sample from π(μ_j|η), and the X̄_j's provide approximate estimates of the μ_j's, the empirical distribution of the X̄_j's should provide a check on the appropriateness of π(μ_j|η). Thus there is a curious dichotomy. In the low-dimensional case, the data provide a lot of information about the parameters but not much information about their distribution, i.e., the prior. The opposite is true in high-dimensional problems. The data don't tell us much about the parameters, but there is information about the prior.


This general fact suggests that the smoothed empirical distribution of the estimates could be used to generate a tentative prior if the likelihood is not exponential and so conjugate priors cannot be used. Adding a location-scale hyperparameter η could provide a family of priors as a starting point of objective high-dimensional Bayesian analysis. Bernardo (1979) has shown that, at least for Example 9.1, a sensible Bayesian analysis can be based on a reference prior with a suitable reparameterization. It does seem very likely that this example is not an exception, but a general theory of the right reparameterization needs to be developed.

9.6 High-dimensional Multiple Testing (PEB)

Multiple tests have become very popular because of applications in many areas, including microarrays, where one searches for genes that have been expressed. We provide a minimal amount of modeling that covers a variety of such applications arising in bioinformatics, statistical genetics, biology, etc. Microarrays are discussed in Appendix D. Whereas PEB or HB high-dimensional estimation has been around for some time, PEB or HB high-dimensional multiple testing is of fairly recent origin, e.g., Efron et al. (2001a), Newton et al. (2003), etc. We have p samples, each of size n, from p normal populations. In the simplest case we assume the populations are homoscedastic. Let σ² be the common unknown variance, and μ_1, ..., μ_p the means. For μ_j, consider the hypotheses

H_0j: μ_j = 0,  H_1j: μ_j ∼ N(η_1, η_2),  j = 1, ..., p.

The data are X_ij, i = 1, ..., n, j = 1, ..., p. In the gene expression problem, X_ij, i = 1, ..., n are n i.i.d. observations on the expression of the jth gene. The value of |X_ij| may be taken as a measure of observed intensity of expression. If one accepts H_0j, it amounts to saying the jth gene is not expressed in this experiment. On the other hand, accepting H_1j is to assert that the jth gene has been expressed. Roughly speaking, a gene is said to be expressed when the gene has some function in the cell or cells being studied, which could be a malignant tumor. For more details, see the appendix. In addition to H_0j and H_1j, the model involves π_0 = probability that H_0j is true and π_1 = 1 − π_0 = probability that H_1j is true. If

I_j = 1 if H_1j is true;  I_j = 0 if H_0j is true,

then we assume I_1, ..., I_p are i.i.d. ∼ B(1, π_1). The interpretation of π_1 has a subjective and a frequentist aspect. It represents our uncertainty about the expression of each particular gene as well as the approximate proportion of expressed genes among the p genes. If σ², π_1, η_1, η_2 are all known, X̄_j is sufficient for μ_j and a Bayes test is available for each j. Calculate the posterior probability of H_1j:


π_1j = π_1 f_1(X̄_j) / (π_1 f_1(X̄_j) + π_0 f_0(X̄_j)),

which is a function of X̄_j only. Here f_0 and f_1 are the densities of X̄_j under H_0j and H_1j. If π_1j > 1/2, accept H_1j; if π_1j < 1/2, accept H_0j.

This test is based only on the data for the jth gene. In practice, we do not know π_1, η_1, η_2. In PEB testing, we have to estimate all three. In HB testing, we have to put a prior on (π_1, η_1, η_2). To us a natural prior would be a uniform for π_1 on some range (0, δ), δ being an upper bound for π_1, a uniform prior for η_1 on R, and a uniform or some other objective prior for η_2. In the PEB approach, we have to estimate π_1, η_1, η_2. If σ² is also unknown, we have to put a prior on σ² also or estimate it from data. An estimate of σ² is

σ̂² = Σ_i Σ_j (X_ij − X̄_j)² / {p(n − 1)}.

For fixed π_1, we can estimate η_1 and η_2 by the method of moments using the equations

(1/p) Σ_j X̄_j = π_1 η_1,  (9.46)

(1/p) Σ_j (X̄_j − X̄̄)² = σ̂²/n + π_1 η_2 + π_1(1 − π_1) η_1²,  (9.47)

where X̄̄ = (1/p) Σ_j X̄_j is the grand mean, from which it follows that

η̂_1 = X̄̄ / π_1,  (9.48)

η̂_2 = (1/π_1) {(1/p) Σ_j (X̄_j − X̄̄)² − σ̂²/n − ((1 − π_1)/π_1) X̄̄²}_+.  (9.49)

Alternatively, if it is felt that η_1 = 0, then the estimate for η_2 is given by

η̂_2 = (1/π_1) {(1/p) Σ_j X̄_j² − σ̂²/n}_+.  (9.50)

Now we may maximize the joint likelihood of the X̄_j's with respect to π_1. Using these estimates, we can carry out the Bayes test for each j, provided we know π_1 or put a prior on π_1. We do not know of good PEB estimates of π_1.

Scott and Berger (2005) provide a very illuminating fully Bayesian analysis for microarrays.
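A sketch of the PEB test above on simulated data; all parameter values are hypothetical. σ² is treated as known and π_1 is held fixed, while η_1 and η_2 are estimated by the method of moments as in (9.48) and (9.49).

```python
import numpy as np

rng = np.random.default_rng(5)

def npdf(x, mean, var):
    """Normal density, written out to keep the sketch dependency-free."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical setup: p genes, n replicates each, a fraction pi1 expressed.
p, n, sigma2 = 2000, 5, 1.0
pi1, eta1, eta2 = 0.1, 2.0, 0.5
expressed = rng.random(p) < pi1
mu = np.where(expressed, rng.normal(eta1, np.sqrt(eta2), size=p), 0.0)
Xbar = mu + rng.normal(scale=np.sqrt(sigma2 / n), size=p)

# Method-of-moments estimates of eta1, eta2 for fixed pi1, sigma2 known;
# the {...}_+ (positive part) in (9.49) is implemented with max(..., eps).
m1 = Xbar.mean()
eta1_hat = m1 / pi1
eta2_hat = max((np.mean((Xbar - m1) ** 2) - sigma2 / n
                - (1 - pi1) / pi1 * m1 ** 2) / pi1, 1e-6)

# Bayes test for each gene: posterior probability of H_1j; accept H_1j if > 1/2.
f0 = npdf(Xbar, 0.0, sigma2 / n)
f1 = npdf(Xbar, eta1_hat, eta2_hat + sigma2 / n)
post = pi1 * f1 / (pi1 * f1 + (1 - pi1) * f0)
accept_H1 = post > 0.5
```

Note that f_1 is the marginal density of X̄_j under H_1j, i.e., N(η_1, η_2 + σ²/n), since μ_j itself is random under the alternative.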


9.6.1 Nonparametric Empirical Bayes Multiple Testing

Nonparametric empirical Bayes (NPEB) solutions were introduced by Robbins (1951, 1955, 1964). An NPEB solution is a Bayes solution based on a nonparametric estimate of the prior. Robbins applied these ideas in an ingenious way in several problems. It was regarded as a breakthrough, but the method never became popular because the nonparametric methods did not perform well even in moderately large samples and were somewhat unstable. Recently Efron et al. (2001a, b) have made a successful application to a microarray with p equal to several thousands. The data are massive enough for NPEB to be stable and perform well. After some reductions the testing problem takes the following form. For j = 1, 2, ..., p, we have random variables Z_j, with Z_j ∼ f_0(z) under H_0j and Z_j ∼ f_1(z) under H_1j, where f_0 is completely specified but f_1(z) ≠ f_0(z) is completely unknown. This is what makes the problem nonparametric. Finally, as in the case of parametric empirical Bayes, the indicator I_j of H_1j equals 1 with probability π_1 and 0 with probability π_0 = 1 − π_1. If π_1 and f_1 were known, we could use the Bayes test of H_0j based on the posterior probability of H_1j

P(H_1j | z_j) = π_1 f_1(z_j) / (π_1 f_1(z_j) + (1 − π_1) f_0(z_j)).

Let f(z) = π_1 f_1(z) + (1 − π_1) f_0(z). We know f_0(z). Also, we can estimate f(z) using any standard method (kernel, spline, nonparametric Bayes, vide Ghosh and Ramamoorthi (2003)) from the empirical distribution of the Z_j's. But since π_1 and f_1 are both unknown, there is an identifiability problem, and hence estimation of π_1, f_1 is difficult. The two papers, Efron et al. (2001a, b), provide several methods for bounding π_1. One bound follows from

π_0 ≤ min_z [f(z)/f_0(z)],  i.e.,  π_1 ≥ 1 − min_z [f(z)/f_0(z)].

So the posterior probability of H_1j is

P{H_1j | z_j} = 1 − π_0 f_0(z_j)/f(z_j) ≥ 1 − {min_z f(z)/f_0(z)} f_0(z_j)/f(z_j),

which is estimated by 1 − {min_z f̂(z)/f_0(z)} f_0(z_j)/f̂(z_j), where f̂ is an estimate of f as mentioned above. The minimization will usually be made over the observed values of z. Another bound is given by

π_0 ≤ ∫_A f(z) dz / ∫_A f_0(z) dz.


Now minimize the RHS over different choices of A. Intuition suggests a good choice would be an interval centered at the mode of f_0(z), which will usually be at zero. A fully Bayesian nonparametric approach is yet to be worked out. Other related papers are Efron (2003, 2004). For an interesting discussion of microarrays and the application of nonparametric empirical Bayes methodology, see Young and Smith (2005).
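A sketch of the first NPEB bound with a simple fixed-bandwidth kernel estimate f̂. Here f_0 = N(0, 1); the alternative density (N(3, 1)) and all constants are hypothetical choices used only to generate data, since f_1 would be unknown in practice.

```python
import numpy as np

rng = np.random.default_rng(6)

def npdf(x, mean=0.0, var=1.0):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical z-scores: f0 = N(0,1) under the null.
p, pi1 = 1500, 0.15
null = rng.random(p) > pi1
z = np.where(null, rng.normal(size=p), rng.normal(3.0, 1.0, size=p))

# Fixed-bandwidth Gaussian kernel estimate of the mixture density f
h = 1.06 * z.std() * p ** (-1 / 5)
f_hat = npdf(z[:, None], z[None, :], h ** 2).mean(axis=1)

f0 = npdf(z)
pi0_bound = np.min(f_hat / f0)            # estimate of min_z f(z)/f0(z) >= pi0
post_lower = 1 - pi0_bound * f0 / f_hat   # lower bound on P(H_1j | z_j)
discoveries = post_lower > 0.9
```

The minimization is taken over the observed z values, as suggested in the text; large |z| get posterior lower bounds near 1.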

9.6.2 False Discovery Rate (FDR)

The false discovery rate (FDR) was introduced by Benjamini and Hochberg (1995). Controlling it has become an important frequentist concept and method in multiple testing, especially in high-dimensional problems. We provide a brief review, because it has interesting similarities with NPEB, as noted, e.g., in Efron et al. (2001a, b). We consider the multiple testing scenario introduced earlier in this section. Consider a fixed test. The (random) FDR for the test is defined as (U/V) I{V > 0}, where U = total number of false discoveries, i.e., the number of true H_0j's that are rejected by the test for a given z, and V = total number of discoveries, i.e., the number of H_0j's that are rejected by the test. The (expected) FDR is

FDR = E_μ{(U/V) I{V > 0}}.

To fix ideas, suppose all H_0j's are true, i.e., all μ_j's are zero. Then U = V, and so

(U/V) I{V > 0} = I{V > 0}

and

FDR = P_{μ=0}(at least one H_0j is rejected) = Type 1 error probability under the full null.

This is usually called the familywise error rate (FWER). The Benjamini-Hochberg (BH) algorithm (see Benjamini and Hochberg (1995)) for controlling FDR is to define

j_0 = max{j : P_(j) ≤ (j/p)α},

where P_j = the P-value corresponding with the test for the jth null, and P_(j) = the jth order statistic among the P-values, with P_(1) the smallest, etc. The algorithm requires rejecting all H_0j for which P_j ≤ P_(j_0). Benjamini and Hochberg (1995) showed this ensures

FDR ≤ (p_0/p)α,

where p_0 is the number of true H_0j's. It is a remarkable result because it is valid for all μ. This exact result has been generalized by Sarkar (2003).
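The BH algorithm is easy to implement; the P-values below are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean array: True for hypotheses rejected by BH at level alpha."""
    p = len(pvals)
    order = np.argsort(pvals)
    sorted_p = np.asarray(pvals)[order]
    below = sorted_p <= (np.arange(1, p + 1) / p) * alpha
    reject = np.zeros(p, dtype=bool)
    if below.any():
        j0 = below.nonzero()[0].max()    # largest j with P_(j) <= (j/p) * alpha
        reject[order[: j0 + 1]] = True   # reject all H_0j with P_j <= P_(j0)
    return reject

# Small illustration with hypothetical P-values
pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.5])
rej = benjamini_hochberg(pvals, alpha=0.05)
```

Here only the two smallest P-values fall below their thresholds jα/p (0.005 and 0.01), so exactly those two hypotheses are rejected.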


Benjamini and Liu (1999) have provided another algorithm. See also Benjamini and Yekutieli (2001). Genovese and Wasserman (2001) provide a test based on an asymptotic evaluation of j_0 and a less conservative rejection rule. An asymptotic evaluation is also available in Genovese and Wasserman (2002). See also Storey (2002) and Donoho and Jin (2004). Scott and Berger (2005) discuss FDR from a Bayesian point of view. Controlling FDR leads to better performance under alternatives than controlling FWER. Many successful practical applications of FDR control are known. On the other hand, from a decision theoretic point of view it seems more reasonable to control the sum of false discoveries and false negatives rather than the FDR and the proportion of false negatives.

9.7 Testing of a High-dimensional Null as a Model Selection Problem¹

Selecting from among nested models is one way of handling testing problems, as we have seen in Chapter 6. Parsimony is taken care of to some extent by the prior on the additional parameters of the more complex model. As in estimation or multiple testing, consider samples of size r from p normal populations N(μ_i, σ²). For simplicity σ² is assumed known. Usually σ² will be unknown. Because S² = Σ_i Σ_j (X_ij − X̄_i)²/{p(r − 1)} is an unbiased estimate of σ² with lots of degrees of freedom, it does not matter much whether we put one of the usual objective priors on σ² or pretend that σ² is known to be S². We wish to test H_0: μ_i = 0 ∀i versus H_1: at least one μ_i ≠ 0. This is sometimes called Stone's problem; see Berger et al. (2003), Stone (1979). We may treat this as a model selection problem with M_0 = H_0: μ_i = 0 ∀i and M_1 = H_0 ∪ H_1, i.e., M_1: μ ∈ R^p. In this formulation, M_0 ⊂ M_1, whereas H_0 and H_1 are disjoint. On grounds of parsimony, H_0 is favored if both M_0 and M_1 are equally plausible. To test the null or select a model, we have to define a prior π(μ) under M_1 and calculate the Bayes factor

BF_01 = f(x | μ = 0) / ∫_{R^p} f(x | μ) π(μ) dμ.

There is no well-developed theory of objective priors, especially for testing problems. However, as in estimation, it appears natural to treat the μ_j's as exchangeable rather than independent. A popular prior in this context is the Zellner and Siow (1980) multivariate Cauchy prior

π_1(μ) = Γ((p+1)/2) / (π^{(p+1)/2} σ^p) · (1 + μ'μ/σ²)^{−(p+1)/2}.  (9.51)

¹ Section 9.7 may be omitted at first reading.

Another plausible prior is the smooth Cauchy prior, whose density is expressed in terms of M(1/2, p/2, ·), the confluent hypergeometric function ₁F₁ of Abramowitz and Stegun (1970). It is tempting to use the difference (between the two models) of BIC as an approximation to the logarithm of the Bayes factor (BF), even though BIC was developed by Schwarz for low-dimensional problems. Stone was the first to point out that the use of BIC is problematic in high-dimensional problems. Berger et al. (2003) have developed a generalization of BIC called GBIC, which provides a good approximation to the integrated likelihood for priors like the above Cauchy priors, which are obtained by integrating the scale parameter for N(μ_i, σ²). In Stone's problem one has the normal linear model setup

X_ij = μ_i + ε_ij,  ε_ij i.i.d. N(0, σ²),  i = 1, ..., p,  j = 1, ..., r.  (9.52)

It is assumed that as n → ∞, p → ∞ and r is fixed. Under these assumptions, Berger et al. (2003) provide a Laplace approximation and a GBIC. The GBIC also approximates the BIC for low-dimensional problems. The formula for ΔGBIC (the difference of GBIC for the comparison of M_1 and M_0) is given by

ΔGBIC = (r/2) X̄'X̄ − (p/2) log(r c_p) − p/2 + (log p)/2,  (9.53)

where

c_p = (1/p) Σ_{i=1}^p X̄_i².

Table 9.1, taken from Berger et al. (2003), provides some idea of the accuracy of BIC, GBIC, and the Laplace approximation. One has p = 50 and r = 2 for these calculations, and the multivariate Cauchy prior was used. Substantial new results appear in Liang et al. (2005). They propose a mixture of Zellner's (Zellner (1986)) popular g-prior. In Zellner's form, the prior looks like μ|M_1 ∼ N(0, gσ²(Z'Z)⁻¹), where Z is the design matrix (in our problem composed only of 0's and 1's). This g is usually elicited through an empirical Bayes method. The above authors consider a family of mixtures of g-priors (of which the Zellner-Siow Cauchy prior is a special case) and use those for model selection. They propose Laplace approximations to the


Table 9.1. Comparison of the Performance of GBIC and Laplace Approximation with BIC

c_p    True Log Bayes Factor   ΔBIC      ΔGBIC     ΔLaplace Approx
0.1    -8.5348                 -110.129  -1.956    -8.5776
0.5    -3.8251                 -90.129   -1.956    -3.9083
1.0    6.0388                  -65.129   5.715     5.9236
1.5    20.8203                 -40.129   20.579    20.7564
2.0    38.4408                 -15.129   38.387    38.4814
10.0   397.369                 384.871   398.151   397.369
marginal likelihood under these general priors and show that the models thus selected are generally correct asymptotically if the complex model is true. Under the null model, this type of consistency still holds under the ZellnerSiow prior. Further generalizations to non-normal problems appear in Berger (2005) and Chakrabarti and Ghosh (2005a). Both papers provide generalizations of BIC when the observations come from an exponential family of distributions in high-dimensional problems. In Table 9.2, using simulation results reported in Chakrabarti and Ghosh (2005a), the performance of GBIC and the Laplace approximation (logm 2 ) with BIC are compared in approximating the integrated likelihood under the more complex model (denoted by m 2 ) when the more complex model is actually true and observations come from Bernoulli, exponential, and Poisson distributions. In this case one hasp groups of observations, each group having a (potentially) different parameter value and each group has r observations. Under the simpler model, these different groups are assumed to have the same (specified) parameter value, while for the more complex model the parameter vector is assumed to belong to RP. See the paper for details on the priors used. In principle, the same methods apply to any two nested models Mo: J-li = 0,1:::; i:::; Pl,Pl < p versus M1 : f.L E RP. Table 9.2. Approximation to Integrated Likelihood in the Exponential Family Distribution Bernoulli Bernoulli Exponential Exponential Poisson Poisson

p

r

50 50 50 50 50 50

10 200 10 200 10 200

logm2 -327.45 -4018.026 -662.526 -22186.199 -671.504 -15704.585

logm2 -327.684 -4018.072 -661.979 -22186.100 -670.775 -15704.618

BIG -349.577 -4052.757 -640.320 -22178.759 -683.383 -15713.139

GBIC -327.863 -4018.587 -660.384 -22186.117 -671.374 -15705.010


9.8 High-dimensional Estimation and Prediction Based on Model Selection or Model Averaging²

² Section 9.8 may be omitted at first reading.

Given a set of data from an experiment or observational study done on a given population, a statistician is asked the following three questions quite frequently. First, which among a given set of possible statistical models seems to be the correct model describing the underlying mechanism producing the data? Second, what will be the predicted value of a future observation if the experimental conditions are kept at predetermined levels? Third, what is the estimate of a single parameter or a vector (possibly infinite dimensional) of parameters? We will focus in this section on some Bayesian approaches to answering the last two types of questions. But before going into the details, we will explain briefly in the next paragraph how one would pose the above three questions from a decision theoretic point of view, and what the basic difference is in the Bayesian approaches to tackling such questions.

Bayesian approaches to such questions are basically dictated by the goal of obtaining decision theoretic optimality, and hence the solutions are also heavily dependent upon the type of loss function being used. The loss function, on the other hand, is mostly determined by the goal of the statistician or practitioner. The goal of the statistician in the first problem above is to select the correct model (which is assumed to be one in the list of models considered). The loss function often used in this problem is the 0-1 loss function. In the Bayesian approach to model selection, the statistician puts prior probabilities on the set of candidate models, and a simple argument shows that for this loss, the optimum Bayesian model is the posterior mode, i.e., the model that has the maximum posterior probability.

As explained in the earlier section, BIC and GBIC can be used to select a model using the Bayesian paradigm with 0-1 loss if the sample size is large, in appropriate situations, as they approximate the integrated likelihood and hence can be used to find the model with highest posterior probability. On the other hand, if one is interested in answering the second or third question above (i.e., if one is interested in prediction or estimation of a parameter), the problem can be approached in two different ways. First, one might be interested in finding a particular model that does the best job of prediction (in some appropriate sense). Second, one might only want a predicted value, not a particular model for repeated future use in prediction. In either case, the most popular loss function is the squared prediction error loss, i.e., the square of the difference between the predicted/estimated value and the value being predicted/estimated. The best predictor/estimator turns out to be the Bayesian model averaging estimate (to be explained later), and the best predictive model is the one that minimizes the expected posterior predictive loss.

We now consider the problem of optimal prediction from a Bayesian approach. We use the ideas, notation, and results of Barbieri and Berger (2004)


for this part. Consider the canonical model

y = Xβ + ε,  (9.54)

where y is an n × 1 vector of observations, X is the n × k full rank design matrix, β is the unknown k × 1 vector of regression coefficients, and ε is the n × 1 vector of random errors, which are i.i.d. N(0, σ²), σ² being known or unknown. Our goal is to predict a future observation y*, given by

y* = x*β + ε,  (9.55)

where x* = (x*_1, ..., x*_k) is the value of the covariate vector for which the prediction is to be made. We consider the loss in predicting y* by ŷ* as

L(ŷ*, y*) = (ŷ* − y*)²,  (9.56)

i.e., the squared error prediction loss. Assume that we have submodels

M_l: y = X_l β_l + ε,  (9.57)

where l = (l_1, ..., l_k) with l_i = 1 or 0 according as the ith covariate is in the model M_l or not, X_l is a matrix containing the columns of X corresponding with the nonzero coordinates of l, and β_l is the corresponding vector of regression coefficients. Let k_l denote the number of covariates included in the model; then X_l is of dimension (n × k_l) and β_l is a (k_l × 1) vector. We put prior probabilities P(M_l) on each model M_l included in the model space, such that Σ_l P(M_l) = 1, and given model M_l, a prior π_l(β_l, σ) is assumed on the parameters (β_l, σ) included in model M_l. Using standard posterior calculations, one obtains the quantities (a) p_l = P(M_l|y), the posterior probability of model M_l, and (b) π_l(β_l, σ|y), the posterior distribution of the unknown parameters in M_l. With this setup in mind, we shall now discuss two optimal prediction strategies, as described below. First note that the best predictor of y* for a given value of x* comes out as ŷ* = E(y*|y), where the expectation is taken with respect to the posterior/predictive distribution of y* given y. This follows by noting that

E(ŷ* − y*)² = E{E((ŷ* − y*)² | y)},  (9.58)

where the inner expectation is taken with respect to the posterior distribution of y* given y and is minimized by ŷ* = E(y*|y). But note that

ỹ* = E(y*|y) = Σ_l p_l E(y*|y, M_l) = x* Σ_l p_l H_l β̃_l,  (9.59)

where H_l is a (k × k_l) matrix such that x* H_l is the subvector of x* corresponding to the nonzero coordinates of l, and β̃_l is the posterior mean of β_l with respect to π_l(β_l, σ|y). Noting that if we knew that M_l were the true model, then the optimal predictor of y* for x fixed at x* would be given by


ŷ*_l = x* H_l β̃_l,

we have

ỹ* = E(y*|y) = x* Σ_l p_l H_l β̃_l  (9.60)
   = Σ_l p_l ŷ*_l.  (9.61)

ỹ* is called the Bayesian model averaging estimate, in that it is a weighted average of the optimal Bayesian predictors under each individual model, the weights being the posterior probabilities of the models. Many authors have argued for the use of the model averaging estimate as an appropriate predictive estimate. They justify this by saying that using model selection to choose the best model, and then making inferences based on the assumption that the selected model is true, does not take into account the fact that there is uncertainty about the model itself. As a result, one might underestimate the uncertainty about the quantity of interest. See, for example, Madigan and Raftery (1994), Raftery, Madigan, and Hoeting (1997), Hoeting, Madigan, Raftery, and Volinsky (1999), and Clyde (1999), just to name a few, for detailed discussion of this point of view. However, if the number of models in the model space is very large (e.g., if all subsets of parameters are allowed in the model space, as will happen in high or even moderately high dimensions), the task of computing the Bayesian model averaging estimate exactly might be virtually impossible. Moreover, it is not prudent to keep in the model average those models that have small posterior probability, indicating relative incompatibility with the observed data. There are some proposals to get around this difficulty, as discussed in the literature cited above. Two of them are based on the 'Occam's window' method of Madigan and Raftery (1994) and the Markov chain Monte Carlo approach of Madigan and York (1995). In the first approach, the averaging is done over a small set of appropriately selected models, which are parsimonious and supported by the data. In the second approach, one constructs a Markov chain with state space the same as the model space and equilibrium distribution {P(M_l|y)}, where M_l varies over the model space.
Upon simulation from this chain, the Bayesian model averaging estimator is approximated by taking the average value of the posterior expectations under each model visited in the chain. But it must be noted that Bayesian model averaging (BMA) has its limitations in high-dimensional problems. Each approach addresses both issues, but it is unclear how well. Although BMA is the optimal predictive estimation procedure, often a single model is desired for prediction. For example, choice of a single model will require observing only the covariates included in the model. Also, as noted earlier, in high dimensions BMA has its problems. We will assume now that the future predictions will be made for covariates x* such that

Q = E(x*'x*)

exists and is positive definite. A frequent choice of Q is Q = X'X, i.e., the future covariates will be like the ones observed in the past. In general, the best


single model will depend on x*, but we present here some general characterizations which give the optimal predictive model without this dependence. In general, the optimal predictive model is not the one with the highest posterior probability. However, there are interesting exceptions. If there are only two models, it is easy to show the posterior mode with shrinkage estimate is optimal for prediction (Berger ( 1997) and Mukhopadhyay ( 2000)). This also holds sometimes in the context of variable selection for linear models with orthogonal design matrix, as in Clyde and Parmigiani (1996). As Berger (1997) notes, it is easy to see that if one is considering only two models, say M 1 and M 2 with prior probabilities ~ each and proper priors are assigned to the unknown parameters under each model, the best predictive model turns out to be M 1 or M2 according as the Bayes factor of M 1 to M 2 is greater than one or not, and hence the best predictive model is the one with the highest posterior probability. The characterizations we will describe here are in terms of what is called the 'median probability model.' If it exists, the median probability model M1* is defined to be the model consisting of those variables only whose posterior inclusion probabilities are at least ~. The posterior inclusion probability for variable i is

p_i = Σ_{l : l_i = 1} P(M_l | y).     (9.62)

So l* is defined coordinatewise by l*_i = 1 if p_i ≥ 1/2 and l*_i = 0 otherwise. It is possible that the median probability model does not exist, in that the variables included according to the definition of l* do not correspond to any model under consideration. But in the variable selection problem, if we are allowed to include or exclude any variable in the possible models, i.e., if all possible values of l are allowed, then the median probability model will obviously exist. Another important class of models is the class of models with 'graphical model structure,' for which the median probability model will always exist (this fact follows directly from the definition below).
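As an illustration (not from the text), the posterior inclusion probabilities and l* can be computed directly from a posterior distribution over models. The toy posterior below is entirely hypothetical; it also shows that the median probability model need not coincide with the highest posterior probability model:

```python
from itertools import product

def inclusion_probs(post, k):
    """p_i = sum of posterior probabilities of the models containing variable i."""
    return [sum(pr for m, pr in post.items() if m[i] == 1) for i in range(k)]

def median_probability_model(post, k):
    """Include variable i iff its posterior inclusion probability is >= 1/2."""
    return tuple(1 if p >= 0.5 else 0 for p in inclusion_probs(post, k))

# Hypothetical posterior over the three models {x1}, {x2}, {x1, x2}.
post = {(1, 0): 0.4, (0, 1): 0.3, (1, 1): 0.3}

p = inclusion_probs(post, 2)             # approximately [0.7, 0.6]
mpm = median_probability_model(post, 2)  # (1, 1): both variables included
hpm = max(post, key=post.get)            # (1, 0): the highest posterior model
```

Here the median probability model {x1, x2} has posterior probability 0.3, while the highest posterior probability model is {x1} with probability 0.4, so the two notions genuinely differ.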

Definition 9.4. Suppose that for each variable index i, there is a corresponding index set I(i) of other variables. A subclass of linear models is said to have 'graphical model structure' if it consists of all models satisfying the condition: for each i, if variable x_i is in the model, then the variables x_j with j ∈ I(i) are also in the model.

The class of models with 'graphical model structure' includes the class of models with all possible subsets of variables, and sequences of nested models M_{l(j)}, j = 0, 1, ..., k, where l(j) = (1, ..., 1, 0, ..., 0) with j ones and k − j zeros. For the all-subsets scenario, I(i) is the null set, while in the nested case I(i) = {j : 1 ≤ j < i} for i ≥ 2 and I(i) is the null set for i = 0 or 1. The latter models are natural in many examples, including polynomial regression, where j refers to the degree of the polynomial used. Another example of nested models is provided by nonparametric regression (vide Chapter 10,


9 High-dimensional Problems

Sections 10.2, 10.3). The unknown function is approximated by partial sums of its Fourier expansion, with all coefficients after stage j assumed to be zero. Note that in this situation the median probability model has a simple description: one calculates the cumulative sum of posterior model probabilities beginning from the smallest model, and the median probability model is the first model for which this sum equals or exceeds 1/2. Mathematically, the median probability model is M_{l(j*)}, where

j* = min{ j : Σ_{i=0}^{j} P(M_{l(i)} | y) ≥ 1/2 }.     (9.63)

We present some results on the optimality of the posterior median model in prediction. The best predictive model is found as follows. Once a model is selected, the best Bayesian predictor assuming that model is true is obtained. In the next stage, one finds the model for which the expected prediction loss (this expectation does not assume any particular model is true, but is an overall expectation) using this Bayesian predictor is minimized. The minimizer is the best predictive model. There are some situations where the median probability model and the highest posterior probability model are the same. Obviously, if there is one model with posterior probability greater than 1/2, this is trivially true. Barbieri and Berger (2004) observe that when the highest posterior probability model has substantially larger probability than the other models, it will typically also be the median probability model. We describe another such situation later in the corollary to Theorem 9.8. We state and prove two simple lemmas.

Lemma 9.5. (Barbieri and Berger, 2004) Assume Q exists and is positive definite. The optimal model for predicting y* under squared error loss is the unique model minimizing

R(M_l) = (H_l β̄_l − β̄)' Q (H_l β̄_l − β̄),     (9.64)

where β̄ is defined in (9.61).

Proof. As noted earlier, ŷ*_l is the optimal Bayesian predictor assuming M_l is the true model. The optimal predictive model is found by minimizing, with respect to l in the space of models under consideration, the quantity E(y* − ŷ*_l)². Minimizing this is equivalent to minimizing, for each y, the quantity E[(y* − ŷ*_l)² | y]. It is easy to see that for a fixed x*,

E[(y* − ŷ*_l)² | y] = C + (ŷ* − ŷ*_l)²,     (9.65)

where the symbols have been defined earlier and C is a quantity independent of l. The expectation above is taken with respect to the predictive distribution of y* given y and x*. So the optimal predictive model will be found by finding


the minimizer of the expression obtained by taking a further expectation over x* of the second quantity on the right-hand side of (9.65). By plugging in the values of ŷ*_l and ŷ*, we immediately get

(ŷ* − ŷ*_l)² = (H_l β̄_l − β̄)' x*'x* (H_l β̄_l − β̄).     (9.66)

The lemma follows. The uniqueness follows from the fact that Q is positive definite. □

Lemma 9.6. (Barbieri and Berger, 2004) If Q is diagonal with diagonal elements q_i > 0, and the posterior means β̄_l satisfy β̄_l = H_l'β̃ (where β̃ is the posterior mean under the full model as in (9.54)), then

R(M_l) = Σ_{i=1}^{k} q_i β̃_i² (l_i − p_i)².     (9.67)

Proof. From the fact β̄_l = H_l'β̃, it follows that

β̄ = D(p) β̃,     (9.68)

where D(p) is the diagonal matrix with diagonal elements p_i, by noting that H_l(i, j) = 1 if l_i = 1 and j = Σ_{r=1}^{i} l_r, and H_l(i, j) = 0 otherwise. Similarly,

R(M_l) = (H_l H_l' β̃ − D(p) β̃)' Q (H_l H_l' β̃ − D(p) β̃)
       = β̃' (D(l) − D(p)) Q (D(l) − D(p)) β̃,     (9.69)

from which the result follows. □

Remark 9.7. The condition β̄_l = H_l'β̃ simply means that the posterior mean of β_l is found by taking the relevant coordinates of the posterior mean in the full model as in (9.54). As Barbieri and Berger (2004) comment, this will happen in two important cases. Assume X'X is diagonal. In the first case, if one uses the reference prior π_l(β_l, σ) = 1/σ, or a constant prior if σ is known, the least squares estimates coincide with the posterior means, and the diagonality of X'X implies that the above condition will hold. Secondly, suppose in the full model π(β | σ) = N_k(μ, σ²Δ), where Δ is a known diagonal matrix, and for the submodels one takes the natural corresponding prior N_{k_l}(H_l'μ, σ²H_l'ΔH_l). Then it is easy to see that for any prior on σ², or if σ² is known, the above condition will hold. We now state the first theorem.

Theorem 9.8. (Barbieri and Berger, 2004) If Q is diagonal with q_i > 0, β̄_l = H_l'β̃, and the models have graphical model structure, then the median probability model is the best predictive model.


Proof. Because q_i > 0 and β̃_i² ≥ 0 for each i, and p_i (defined in (9.62)) does not depend on l, to minimize R(M_l) among all possible models it suffices to minimize (l_i − p_i)² for each individual i, and that is achieved by choosing l_i = 1 if p_i ≥ 1/2 and l_i = 0 if p_i < 1/2, whence l as defined is the median probability model. The graphical model structure ensures that this model is among the class of models under consideration. □

Remark 9.9. The above theorem obviously holds if we consider all submodels, this class having graphical model structure, provided the conditions of the theorem hold. By the same token, the result will hold when the models under consideration are nested.
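The separability argument in the proof can be checked numerically by brute force. In the sketch below, all values of q_i, β̃_i, and p_i are made up; the model minimizing R(M_l) of (9.67) over all subsets is exactly the coordinatewise median model:

```python
from itertools import product

def predictive_risk(l, q, beta, p):
    """R(M_l) = sum_i q_i * beta_i^2 * (l_i - p_i)^2, as in Lemma 9.6."""
    return sum(qi * bi ** 2 * (li - pi) ** 2
               for qi, bi, li, pi in zip(q, beta, l, p))

# Hypothetical q_i, full-model posterior means beta_i, inclusion probs p_i.
q = [1.0, 2.0, 0.5]
beta = [1.5, -0.8, 2.0]
p = [0.9, 0.3, 0.6]

best = min(product([0, 1], repeat=3),
           key=lambda l: predictive_risk(l, q, beta, p))
median_model = tuple(1 if pi >= 0.5 else 0 for pi in p)
assert best == median_model == (1, 0, 1)
```

Since each term q_i β̃_i² (l_i − p_i)² is nonnegative and involves only l_i, the minimization indeed decouples coordinate by coordinate, as the proof asserts.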

Corollary 9.10. (Barbieri and Berger, 2004) Suppose the conditions of the above theorem hold, all submodels of the full model are allowed, σ² is known, X'X is diagonal, the β_i's have independent N(μ_i, λ_i σ²) prior distributions, and

P(M_l) = Π_{i=1}^{k} (p_i^0)^{l_i} (1 − p_i^0)^{1−l_i},     (9.70)

where p_i^0 is the prior probability that variable x_i is in the model. Then the optimal predictive model is the model with the highest posterior probability, which is also the median probability model.

Proof. Let β̂_i be the least squares estimate of β_i under the full model. Because X'X is diagonal, the β̂_i's are independent and the likelihood under M_l factors as

L(M_l) ∝ Π_{i=1}^{k} (λ_i^1)^{l_i} (λ_i^0)^{1−l_i},

where λ_i^1 and λ_i^0 depend only on β̂_i, and the constants of proportionality here and below depend on y and the β̂_i's. Also, the conditional prior distribution of the β_i's given M_l has the factorization

π(β | M_l) = Π_{i=1}^{k} [N(μ_i, λ_i σ²)]^{l_i} [δ_{0}]^{1−l_i},

where δ_{0} is the degenerate distribution with all its mass at zero. It follows from (9.70) and the above two factorizations that the posterior probability of M_l also factors, which in turn implies that the marginal posterior probability of including or not including the ith variable is proportional to the corresponding terms in the ith factor. This completes the proof, vide Problem 21. (The integral can be evaluated as in Chapter 2.) □


We have noted before that if the conditions in Theorem 9.8 are satisfied and the models are nested, then the best predictive model is the median probability model. Interestingly, even if Q is not necessarily diagonal, the best predictive model turns out to be the median probability model under some mild assumptions in the nested model scenario. Consider

Assumption 1: Q = γX'X for some γ > 0, i.e., the prediction will be made at covariates that are similar to the ones already observed in the past.

Assumption 2: β̄_l = b β̂_l, where b > 0, i.e., the posterior means are proportional to the least squares estimates.

Remark 9.11. Barbieri and Berger (2004) list two situations in which the second assumption is satisfied. First, if one uses the reference prior π_l(β_l, σ) = 1/σ, whereby the posterior means are the least squares estimates. It is also satisfied, with b = c/(1 + c), if one uses g-type normal priors of Zellner (1986), where π_l(β_l | σ) ~ N_{k_l}(0, cσ²(X_l'X_l)^{-1}) and the prior on σ is arbitrary.

Theorem 9.12. For a sequence of nested models for which the above two assumptions hold, the best predictive model is the median probability model.

Proof. See Barbieri and Berger (2004). □

Barbieri and Berger (2004, Section 5) present a geometric formulation for identifying the optimal predictive model. They also establish conditions under which the median probability model and the maximum posterior probability model coincide, and show that it is typically not enough to know only the posterior probabilities of each model to determine the optimal predictive model.

So far we have concentrated on some Bayesian approaches to the prediction problem. It turns out that model selection based on the classical Akaike information criterion (AIC) also plays an important role in Bayesian prediction and estimation for linear models and function estimation. Optimality results for AIC in classical statistics are due to Shibata (1981, 1983), Li (1987), and Shao (1997). The first Bayesian result about AIC is taken from Mukhopadhyay (2000). Here one has observations {Y_ij : i = 1, ..., p, j = 1, ..., r, n = pr} given by

Y_ij = μ_i + ε_ij,     (9.71)

where the ε_ij are i.i.d. N(0, σ²) with σ² known. The models are M_1 : μ_i = 0 for all i, and M_2 : η² = lim_{p→∞} (1/p) Σ_{i=1}^{p} μ_i² > 0. Under M_2, we assume a N(0, τ²I_p) prior on μ, where τ² is to be estimated from the data using an empirical Bayes method. It is further assumed that p → ∞ as n → ∞. The goal is to predict a future set of observations {Z_ij} independent of {Y_ij} under the usual prediction error loss, with the 'constraint' that once a model is selected, least squares estimates have to be used to make the predictions. Theorem 9.13 shows that the constrained empirical Bayes rule is asymptotically equivalent to AIC. A weaker result is given as Problem 17.


Theorem 9.13. (Mukhopadhyay, 2000) Suppose M_2 is true; then asymptotically the constrained empirical Bayes rule and AIC select the same model. Under M_1, AIC and the constrained empirical Bayes rule choose M_1 with probability tending to 1. Also, under M_1, the constrained empirical Bayes rule chooses M_1 whenever AIC does so.
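In this setup, with σ² known, the AIC comparison between M_1 and M_2 is a simple computation: the difference of −2 log-likelihoods plus the penalty 2p shows that AIC selects M_2 exactly when r Σ_i Ȳ_i² / σ² > 2p. A sketch with made-up data (not from the text):

```python
def aic_selects_m2(y, sigma2):
    """AIC comparison of M_1: all mu_i = 0 versus M_2: mu_i unrestricted,
    for y a p x r array of observations with sigma2 known.  Comparing
    AIC values reduces to: M_2 wins iff r * sum(ybar_i^2) / sigma2 > 2p."""
    p, r = len(y), len(y[0])
    ybar = [sum(row) / r for row in y]
    return r * sum(b * b for b in ybar) / sigma2 > 2 * p

# Toy data: strong signal in both means, so AIC picks the bigger model;
# weak signal, so AIC keeps the null model.
y_strong = [[2.0, 2.2, 1.8], [-2.1, -1.9, -2.0]]
y_weak = [[0.1, -0.1, 0.0], [0.05, 0.0, -0.05]]
assert aic_selects_m2(y_strong, sigma2=1.0)
assert not aic_selects_m2(y_weak, sigma2=1.0)
```

The reduction follows because −2 log L under M_2 uses the MLEs Ȳ_i, so the fit improves by r Σ Ȳ_i²/σ² at the cost of p extra parameters, penalized by 2p.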

The result is extended to general nested problems in Mukhopadhyay and Ghosh (2004a). It is, however, also shown in the above reference that if one uses Bayes estimates instead of least squares estimates, then the unconstrained Bayes rule does better than AIC asymptotically. The performance of AIC in the PEB setup of George and Foster (2000) is also studied in Mukhopadhyay and Ghosh (2004a). As one would expect from this, AIC also performs well in nonparametric regression, which can be formulated as an infinite-dimensional linear problem. It is shown in Chakrabarti and Ghosh (2005b) that AIC attains the optimal rate of convergence in an asymptotically equivalent problem and is also adaptive in the sense that it makes no assumption about the degree of smoothness. Because this result is somewhat technical, we only present some numerical results for the problem of nonparametric regression. In the nonparametric regression problem

y_i = f(i/n) + ε_i,     i = 1, ..., n,     (9.72)

one has to estimate the unknown smooth function f. In Table 9.3, we consider n = 100 and f(x) = (sin(2πx))³, (cos(πx))⁴, 7 + cos(2πx), and e^{sin(2πx)}, the loss function L(f, f̂) = ∫_0^1 (f(x) − f̂(x))² dx, and report the average loss of the modified James-Stein estimator of Cai et al. (2000), AIC, and the kernel method with the Epanechnikov kernel in 50 simulations. To use the first two methods, we express f by its (partial sum) Fourier expansion with respect to the usual sine-cosine Fourier basis of [0, 1] and then estimate the Fourier coefficients by the regression coefficients. Some simple but basic insight about AIC may be obtained from Problems 15-17. It is also worth remembering that AIC was expected by Akaike to perform well in high-dimensional estimation or prediction problems where the true model is too complex to be in the model space.
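A Fourier-plus-AIC fit of the kind used for these simulations can be sketched as follows. The noise level, seed, and the known-σ² form of AIC are our own choices, so the numbers below are only indicative, not a reproduction of Table 9.3:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100, 0.3
x = np.arange(1, n + 1) / n
f_true = np.sin(2 * np.pi * x) ** 3
y = f_true + sigma * rng.standard_normal(n)

def design(x, m):
    """First 2m + 1 elements of the sine-cosine Fourier basis on [0, 1]."""
    cols = [np.ones_like(x)]
    for j in range(1, m + 1):
        cols += [np.sqrt(2) * np.cos(2 * np.pi * j * x),
                 np.sqrt(2) * np.sin(2 * np.pi * j * x)]
    return np.column_stack(cols)

def fit_aic(x, y, sigma, max_m=15):
    """Choose the truncation point m of the partial sum by AIC (sigma known)."""
    best = None
    for m in range(max_m + 1):
        X = design(x, m)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ coef) ** 2)
        aic = rss / sigma ** 2 + 2 * X.shape[1]   # fit + penalty
        if best is None or aic < best[0]:
            best = (aic, m, coef)
    return best

aic, m, coef = fit_aic(x, y, sigma)
fhat = design(x, m) @ coef
loss = np.mean((fhat - f_true) ** 2)   # discretized integrated squared error
```

Since (sin(2πx))³ = (3 sin(2πx) − sin(6πx))/4 involves only frequencies 1 and 3, a truncation point around m = 3 already captures the function exactly, and the resulting loss is small.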

9.9 Discussion

Bayesian model selection is passing through a stage of rapid growth, especially in the context of bioinformatics and variable selection. The two previous sections provide an overview of some of the literature. See also the review by Ghosh and Samanta (2001). For a very clear and systematic approach to different aspects of model selection, see Bernardo and Smith (1994). Model selection based on AIC is used in many real-life problems by Burnham and Anderson (2002). However, its use for testing problems with 0-1


Table 9.3. Comparison of Simulation Performance of Various Estimation Methods in Nonparametric Regression

Function       | Modified James-Stein | AIC    | Kernel Method
[sin(2πx)]³    | 0.0793               | 0.2165 | 0.0691
[cos(πx)]⁴     | 0.2235               | 0.078  | 0.091
7 + cos(2πx)   | 0.2576               | 0.0529 | 0.5380
e^{sin(2πx)}   | 0.2618               | 0.0850 | 0.082

loss is questionable, vide Problem 16. A very promising new model selection criterion due to Spiegelhalter et al. (2002) may also be interpreted as a generalization of AIC; see, e.g., Chakrabarti and Ghosh (2005a). In the latter paper, GBIC is also interpreted from the information-theoretic point of view of Rissanen (1987). We believe the Bayesian approach provides a unified approach to model selection and helps us see classical rules like BIC and AIC as still important, but by no means the last word in any sense. We end this section with two final comments.

One important application of model selection is to examine model fit. Gelfand and Ghosh (1998) (see also Gelfand and Dey (1994)) use leave-k-out cross-validation to compare each collection of k data points with its predictive distribution based on the remaining observations. Based on the predictive distributions, one may calculate predicted values and some measure of deviation from the k observations that are left out. An average of the deviation over all sets of k left-out observations provides some idea of goodness of fit. Gelfand and Ghosh (1998) use these for model selection. Presumably, the average distance for a model can be used for model checking also. An interesting work of this kind is Bhattacharya (2005).

Another important problem is the computation of the Bayes factor. Gelfand and Dey (1994) and Chib (1995) show how one can use MCMC calculations by relating the marginal likelihood of the data to the posterior via P(y) = L(θ|y)P(θ)/P(θ|y). Other relevant papers are Carlin and Chib (1995), Chib and Greenberg (1998), and Basu and Chib (2003). There are interesting suggestions also in Gelman et al. (1995).
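The identity behind the Chib (1995) approach holds at every parameter value, and in a conjugate example it can be checked exactly because the posterior is available in closed form. A sketch with made-up data and prior:

```python
import math

def norm_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Conjugate normal example: y_i | theta ~ N(theta, 1), theta ~ N(0, 1),
# so theta | y ~ N(n * ybar / (n + 1), 1 / (n + 1)).
y = [0.3, -0.1, 0.8, 0.4]
n, ybar = len(y), sum(y) / len(y)
post_mean, post_var = n * ybar / (n + 1), 1.0 / (n + 1)

def log_marginal_at(theta):
    """Chib's identity: log m(y) = log f(y|theta) + log pi(theta) - log pi(theta|y)."""
    loglik = sum(math.log(norm_pdf(yi, theta, 1.0)) for yi in y)
    return (loglik + math.log(norm_pdf(theta, 0.0, 1.0))
            - math.log(norm_pdf(theta, post_mean, post_var)))

# The identity holds at every theta, so the evaluations agree.
assert abs(log_marginal_at(0.0) - log_marginal_at(1.0)) < 1e-10
```

In MCMC applications the posterior ordinate is not available in closed form and must itself be estimated from the simulation output, which is where the real work of Chib's method lies.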

9.10 Exercises

1. Show that π(η_2 | x) is an improper density if we take π(η_1, η_2) = 1/η_2 in Example 9.3.
2. Justify (9.2) and (9.3).
3. Complete the details needed to implement Gibbs sampling and the E-M algorithm in Example 9.3 when μ and σ² are unknown. Take π(η_1, σ², η_2) = 1/σ².
4. Let the X_i's be independent with density f(x|θ_i), i = 1, 2, ..., p, θ_i ∈ R. Consider the problem of estimating θ = (θ_1, ..., θ_p)' with loss function

L(θ, a) = Σ_{i=1}^{p} L(θ_i, a_i) = Σ_{i=1}^{p} (θ_i − a_i)²,     θ, a ∈ R^p,

i.e., the total loss is the sum of the losses in estimating θ_i by a_i. An estimator for θ is the vector (T_1(X), T_2(X), ..., T_p(X)). We call this a compound decision problem with p components.
(a) Suppose sup_δ f(x|δ) = f(x|T(x)), i.e., T(x) is the MLE (of θ_j in f(x|θ_j)). Show that (T(X_1), T(X_2), ..., T(X_p)) is the MLE of θ.
(b) Suppose T(X) (not necessarily the T(X) of (a)) satisfies the sufficient condition for a minimax estimate given at the end of Section 1.5. Is (T(X_1), T(X_2), ..., T(X_p)) minimax for θ in the compound decision problem?
(c) Suppose T(X) is the Bayes estimate with respect to squared error loss for estimating θ of f(x|θ). Is (T(X_1), ..., T(X_p)) a Bayes estimate for θ?
(d) Suppose T = (T_1(X_1), ..., T_p(X_p)) and T_j(X_j) is admissible in the jth component decision problem. Is T admissible?
5. Verify the claim of the best unbiased predictor (9.17).
6. Given the hierarchical prior of Section 9.3 for Morris's regression setup, calculate the posterior and the Bayes estimate as explicitly as possible. Find the full conditionals of the posterior distribution in order to implement MCMC.
7. Prove the claims of superiority made in Section 9.4 for the James-Stein-Lindley estimate and the James-Stein positive part estimate using Stein's identity.
8. Under the setup of Section 9.3, show that the PEB risk of the empirical Bayes estimate of θ_i is smaller than the PEB risk of Y_i.
9. Refer to Sections 9.3 and 9.4. Compare the PEB risk of the empirical Bayes estimate of θ_i and Stein's frequentist risk, and show that the two risks are of the same form, but one has E(B̂) and the other E_θ(B̂). (Hint: See equations (1.17) and (1.18) of Morris (1983).)
10. Consider the setup of Section 9.3. Show that B̂ is the best unbiased estimate of B.
11. (Disease mapping) (See Section 10.1 for more details on the setup.) Suppose that the area to be mapped is divided into N regions. Let O_i and E_i be respectively the observed and expected number of cases of a disease in the ith region, i = 1, 2, ..., N. The unknown parameters of interest are θ_i, the relative risk in the ith region, i = 1, 2, ..., N. The traditional model for O_i is the Poisson model, which states that given (θ_1, ..., θ_N), the O_i's are independent and O_i | θ_i ~ Poisson(E_iθ_i). Let θ_1, θ_2, ..., θ_N be i.i.d. ~ Gamma(a, b). Find the PEB estimates of θ_1, θ_2, ..., θ_N. In Section 10.1, we will consider hierarchical Bayes analysis for this problem.


12. Let the Y_i be independent N(θ_i, V), i = 1, 2, ..., p. Stein's heuristics (Section 9.4) shows that ||Y||² is too large in a frequentist sense. Verify by a similar argument that if the θ_i are i.i.d. uniform on R, then ||Y||² is too small in an improper Bayesian sense, i.e., there is extreme divergence between frequentist probability and naive objective Bayes probability in a high-dimensional case.
13. (Berger (1985a, p. 542)) Consider a multiparameter exponential family f(x|θ) = c(θ) exp(θ'T(x)) h(x), where x and θ are vectors of the same dimension. Assuming Stein's loss, show that (under suitable conditions) the Bayes estimate can be written as gradient(log m(x)) − gradient(log h(x)), where m(x) is the marginal density of x obtained by integrating out θ.
14. Simulate data according to the model in Example 9.3, Section 9.1.
(a) Examine how well the model can be checked from the data X_ij, i = 1, 2, ..., n, j = 1, 2, ..., p.
(b) Suppose one uses the empirical distribution of the X̄_j's as a surrogate prior for the μ_j's. Compare critically the Bayes estimate of μ for this prior with the PEB estimate.
15. (Stone's problem) Let Y_ij = α + μ_i + ε_ij, ε_ij ~ N(0, σ²), i = 1, 2, ..., p, j = 1, 2, ..., r, n = pr, with σ² assumed known or estimated by S² = Σ_{i=1}^{p} Σ_{j=1}^{r} (Y_ij − Ȳ_i)² / (p(r − 1)). The two models are

M_1 : μ_i = 0 for all i,     and M_2 : μ ∈ R^p.

Suppose n → ∞, p log n / n → ∞, and Σ_{i=1}^{p} (μ_i − μ̄)² / (p − 1) → τ² > 0.
(a) Show that even though M_2 is true, BIC will select M_1 with probability tending to 1. Also show that AIC will choose the right model M_2 with probability tending to one.
(b) As a Bayesian, how important do you think is this notion of consistency?
(c) Explore the relation between AIC and selection of the model based on estimation of the residual sum of squares by leave-one-out cross-validation.
16. Consider an extremely simple testing problem: X ~ N(μ, 1), and you have to test H_0 : μ = 0 versus H_1 : μ ≠ 0. Is AIC appropriate for this? Compare AIC, BIC, and the usual likelihood ratio test, keeping in mind the conflict between P-values and the posterior probability of the sharp null hypothesis.
17. Consider two nested models and an empirical Bayes model selection rule with the evaluation based on the more complex model. Though you know the more complex model is true, you may be better off predicting with the simpler model. Let Y_ij = μ_i + ε_ij, ε_ij i.i.d. N(0, σ²), i = 1, 2, ..., p, j = 1, 2, ..., r, with known σ². The models are

M_1 : μ = 0,     and M_2 : μ ∈ R^p, μ ~ N_p(0, τ²I_p), τ² > 0.

(a) Assume that in the PEB evaluation under M_2 you estimate τ² by its moment estimate.


Show that, with PEB evaluation of risk under M_2 and M_1, Ȳ is preferred if and only if AIC selects M_2.
(b) Why is it desirable to have large p in this problem?
(c) How will you try to justify in an intuitive way the occasional choice of the simple but false model?
(d) Use (a) to motivate how the penalty coefficient 2 arises in AIC.
(This problem is based on a result in Mukhopadhyay (2001).)
18. Burnham and Anderson (2002) generated data to mimic a real-life experiment of Stromberg et al. (1998). Select a suitable model from among the 9 models considered by Ghosh and Samanta (2001). The main issue is computation of the integrated likelihood under each model. You can try the Laplace approximation, the method based on MCMC suggested at the end of Section 9.9, and importance sampling. All methods are difficult, but they give very close answers in this problem. The data and the models can be obtained from the Web page http://www.isical.ac.in/~tapas/book
19. Let X_i ~ N(μ, 1), i = 1, ..., n, and μ ~ N(η_1, η_2). Find the PEB estimates of η_1 and η_2 and examine the implications for the inadequacy of the PEB approach in low-dimensional problems.
20. Consider NPEB multiple testing (Section 9.6.1) with known π_1 and an estimate f̂ of (1 − π_1)f_0 + π_1f_1. Suppose for each i you reject H_{0i} : μ_i = 0 if f_0(x_i) ≤ f̂(x_i)α, where 0 < α < 1. Examine whether this test provides any control on the (frequentist) FDR. Define a Bayesian FDR and examine if, for small π_1, this is also controlled by the test. Suggest a test that would make the Bayesian FDR approximately equal to α. (The idea of controlling a Bayesian FDR is due to Storey (2003). The simple rules in this problem are due to Bogdan, Ghosh, and Tokdar (personal communication).)
21. For all-subsets variable selection models, show that the posterior median model and the posterior mode model are the same if

P(M_l | X) = Π_{i=1}^{p} p_i^{l_i} (1 − p_i)^{1−l_i},

where l_i = 1 if the ith variable is included in M_l and l_i = 0 otherwise.

10 Some Applications

The popularity of Bayesian methods in recent times is mainly due to their successful applications to complex high-dimensional real-life problems in diverse areas such as epidemiology, microarrays, pattern recognition, signal processing, and survival analysis. This chapter presents a few such applications together with the required methodology. We describe the method without going into the details of the critical issues involved, for which references are given. This is followed by an application involving real or simulated data. We begin with a hierarchical Bayesian modeling of spatial data in Section 10.1. This is in the context of disease mapping, an area of epidemiological interest. The next two sections, 10.2 and 10.3, present nonparametric estimation of regression function using wavelets and Dirichlet multinomial allocation. They may also be treated as applications involving Bayesian data smoothing. For several recent advances in Bayesian nonparametrics, see Dey et al. (1998) and Ghosh and Ramamoorthi (2003).

10.1 Disease Mapping

Our first application is from the area of epidemiology and involves hierarchical Bayesian spatial modeling. Disease mapping provides a geographical distribution of a disease, displaying some index such as the relative risk of the disease in each subregion of the area to be mapped. Suppose that the area to be mapped is divided into N regions. Let O_i and E_i be respectively the observed and expected number of cases of a disease in the ith region, i = 1, 2, ..., N. The unknown parameters of interest are θ_i, the relative risk in the ith region, i = 1, 2, ..., N. Here E_i is a simple-minded expectation assuming all regions have the same disease rate (at least after adjustment for age), vide Banerjee et al. (2004, p. 158). The relative risk θ_i is the regional effect in a multiplicative model for the expected number of cases: E(O_i) = E_iθ_i. If θ_i = 1, we have E(O_i) = E_i. The objective is to make inferences about the θ_i's across regions. Among other things, this helps epidemiologists and public health professionals


to identify regions or clusters of regions having high relative risks, and hence needing attention, and also to identify covariates causing high relative risk. The traditional model for O_i is the Poisson model, which states that given (θ_1, ..., θ_N), the O_i's are independent and

O_i | θ_i ~ Poisson(E_iθ_i).     (10.1)

Under this model the E_i's are assumed fixed. The classical maximum likelihood estimate of θ_i is θ̂_i = O_i/E_i, known as the standardized mortality ratio (SMR) for region i, and Var(θ̂_i) = θ_i/E_i, which may be estimated as θ̂_i/E_i. However, it was noted in Chapter 9 that the classical estimates may not be appropriate here for simultaneous estimation of the parameters θ_1, θ_2, ..., θ_N. As mentioned in Chapter 9, because of the assumption of exchangeability of θ_1, ..., θ_N, there is a natural Bayesian solution to the problem. Bayesian modeling involves specifying a prior distribution for (θ_1, ..., θ_N). Clayton and Kaldor (1987) followed the empirical Bayes approach using a model that assumes

θ_1, ..., θ_N are i.i.d. ~ Gamma(a, b),     (10.2)

estimating the hyperparameters a and b from the marginal density of {O_i} given a, b (see Section 9.2). Here we present a full Bayesian approach, adopting a prior model that allows for spatial correlation among the θ_i's. A natural extension of (10.2) could be a multivariate Gamma distribution for (θ_1, ..., θ_N). We, however, assume a multivariate normal distribution for the log-relative risks log θ_i, i = 1, ..., N. The model may also be extended to allow for explanatory covariates x_i that may affect the relative risk. Thus we consider the following hierarchical Bayesian model:

O_i | θ_i are independent ~ Poisson(E_iθ_i), where log θ_i = x_i'β + φ_i,     i = 1, ..., N.     (10.3)
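Before turning to the spatial prior, the conjugate Gamma model (10.2) already illustrates the smoothing: the posterior of θ_i is Gamma(a + O_i, b + E_i), so the posterior mean (a + O_i)/(b + E_i) is a convex combination of the SMR O_i/E_i and the prior mean a/b. A sketch with hypothetical hyperparameters and a few (O_i, E_i) pairs in the spirit of Table 10.1:

```python
# Conjugate Gamma-Poisson smoothing as in (10.2): theta_i ~ Gamma(a, b) and
# O_i | theta_i ~ Poisson(E_i * theta_i) give posterior Gamma(a + O_i, b + E_i).
# The hyperparameters a, b here are made up (prior mean a/b = 1).
a, b = 2.0, 2.0
O = [9, 39, 11, 0]
E = [1.4, 8.7, 3.0, 4.2]

smr = [o / e for o, e in zip(O, E)]                  # raw SMRs O_i / E_i
post_mean = [(a + o) / (b + e) for o, e in zip(O, E)]  # shrunk toward a/b = 1
```

Each posterior mean lies between the raw SMR and the prior mean, with regions having small E_i (little information) shrunk the most; this is the adaptive pooling referred to above.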

The usual prior for φ = (φ_1, ..., φ_N) is given by the conditionally autoregressive (CAR) model (Besag, 1974), which is briefly described below. For details see, e.g., Besag (1974) and Banerjee et al. (2004, pp. 79-83, 163, 164). Suppose the full conditionals are specified as

φ_i | φ_j, j ≠ i ~ N(Σ_{j≠i} a_ij φ_j, σ_i²),     i = 1, 2, ..., N.     (10.4)

These will lead to a joint distribution having density proportional to

exp{ −(1/2) φ' D^{-1}(I − A) φ },     (10.5)

where D = Diag(σ_1², ..., σ_N²) and A = (a_ij)_{N×N}. We look for a model that allows for spatial correlation, and so consider a model where correlation depends


on geographical proximity. A proximity matrix W = (w_ij) is an N × N matrix whose entries w_ij spatially connect regions i and j in some manner. We consider here binary choices: we set w_ii = 0 for all i, and for i ≠ j, w_ij = 1 if i is a neighbor of j, i.e., if i and j share a common boundary, and w_ij = 0 otherwise. The w_ij's in each row may also be standardized as w̃_ij = w_ij/w_i0, where w_i0 = Σ_j w_ij is the number of neighbors of region i. Returning to our model (10.5), we now set a_ij = α w_ij/w_i0 and σ_i² = λ/w_i0. Then (10.5) becomes

exp{ −(1/(2λ)) φ'(D_w − αW)φ },

where D_w = Diag(w_10, w_20, ..., w_N0). This also ensures that λ^{-1}(D_w − αW) is symmetric. Thus the prior for φ is multivariate normal:

φ ~ N(0, Σ) with Σ = λ(D_w − αW)^{-1}.     (10.6)

We take 0 < α < 1, which ensures propriety of the prior and positive spatial correlation; only values of α close to 1 give enough spatial similarity. For α = 1 we have the standard improper CAR model. One may use the improper CAR prior because it is known that the posterior will typically still emerge as proper. For this and other related issues, see Banerjee et al. (2004). Having specified priors for all the unknown parameters, including the spatial variance parameter λ and the propriety parameter α (0 < α < 1), one can now do a Bayesian analysis using MCMC techniques. We illustrate this through an example.
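The role of α in (10.6) can be checked numerically. For a hypothetical 4-region map (regions arranged on a line, an assumption made only for illustration), the precision matrix D_w − αW is positive definite for 0 < α < 1 but singular at α = 1:

```python
import numpy as np

# Hypothetical adjacency for 4 regions on a line: 1-2-3-4.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D_w = np.diag(W.sum(axis=1))        # w_{i0} = number of neighbours of region i

def car_precision(alpha, lam=1.0):
    """CAR prior precision matrix (D_w - alpha * W) / lambda, as in (10.6)."""
    return (D_w - alpha * W) / lam

# 0 < alpha < 1: proper prior, precision is positive definite ...
assert np.all(np.linalg.eigvalsh(car_precision(0.9)) > 0)
# ... while alpha = 1 gives the improper CAR model: (D_w - W) @ 1 = 0.
assert abs(np.linalg.eigvalsh(car_precision(1.0))[0]) < 1e-10
```

The singularity at α = 1 comes from the constant vector: each row of D_w − W sums to zero, so the improper CAR prior is flat along the overall level of φ.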

Example 10.1. Table 10.1 presents data from Clayton and Kaldor (1987) on observed (O_i) and expected (E_i) cases of lip cancer during the period 1975-1980 for the N = 56 counties of Scotland. Also available are x_i, the values of a covariate: the percentage of the population engaged in agriculture, fishing, and forestry (AFF) for each of the 56 counties. The log-relative risk is modeled as

log θ_i = β_0 + β_1 x_i + φ_i,     (10.7)

where the prior for (φ_1, ..., φ_N) is as specified in (10.6). We use vague priors for β_0 and β_1 and a prior having high concentration near 1 for the parameter α. The data may be analyzed using WinBUGS. A WinBUGS code for this example is put in the Web page of Samanta. A part of the results, the Bayes estimates θ̂_i of the relative risks for the 56 counties, is presented in Table 10.1. The θ̂_i's are smoothed by pooling the neighboring values in an automatic adaptive way, as suggested in Chapter 9. The estimates of β_0 and β_1 are β̂_0 = -0.2923 and β̂_1 = 0.3748, with estimates of posterior s.d. equal to 0.3426 and 0.1325, respectively.

292

10 Some Applications

Table 10.1. Lip Cancer Incidence in Scotland by County: Observed Numbers (O_i), Expected Numbers (E_i), Values of the Covariate AFF (x_i), and Bayes Estimates of the Relative Risk (θ̂_i).

County  O_i   E_i   x_i   θ̂_i
 1       9    1.4   16   4.705
 2      39    8.7   16   4.347
 3      11    3.0   10   3.287
 4       9    2.5   24   2.981
 5      15    4.3   10   3.145
 6       8    2.4   24   3.775
 7      26    8.1   10   2.917
 8       7    2.3    7   2.793
 9       6    2.0    7   2.143
10      20    6.6   16   2.902
11      13    4.4    7   2.779
12       5    1.8   16   3.265
13       3    1.1   10   2.563
14       8    3.3   24   2.049
15      17    7.8    7   1.809
16       9    4.6   16   2.070
17       2    1.1   10   1.997
18       7    4.2    7   1.178
19       9    5.5    7   1.912
20       7    4.4   10   1.395
21      16   10.5    7   1.377
22      31   22.7   16   1.442
23      11    8.8   10   1.185
24       7    5.6    7   0.837
25      19   15.5    1   1.188
26      15   12.5    1   1.007
27       7    6.0    7   0.946
28      10    9.0    7   1.047
29      16   14.4   10   1.222
30      11   10.2   10   0.895
31       5    4.8    7   0.860
32       3    2.9   24   1.476
33       7    7.0   10   0.966
34       8    8.5    7   0.770
35      11   12.3    7   0.852
36       9   10.1    0   0.762
37      11   12.7   10   0.886
38       8    9.4    1   0.601
39       6    7.2   16   1.008
40       4    5.3    0   0.569
41      10   18.8    1   0.532
42       8   15.8   16   0.747
43       2    4.3   16   0.928
44       6   14.6    0   0.467
45      19   50.7    1   0.431
46       3    8.2    7   0.587
47       2    5.6    1   0.470
48       3    9.3    1   0.433
49      28   88.7    0   0.357
50       6   19.6    1   0.507
51       1    3.4    1   0.481
52       1    3.6    0   0.447
53       1    5.7    1   0.399
54       1    7.0    1   0.406
55       0    4.2   16   0.865
56       0    1.8   10   0.773

10.2 Bayesian Nonparametric Regression Using Wavelets

Let us recall the nonparametric regression problem that was stated in Example 6.1. In this problem, it is of interest to fit a general regression function to a set of observations. It is assumed that the observations arise from a real-valued regression function defined on an interval of the real line. Specifically, we have

y_i = g(x_i) + ε_i,     i = 1, ..., n, and x_i ∈ I,     (10.8)

where the ε_i are i.i.d. N(0, σ²) errors with unknown error variance σ², and g is a function defined on some interval I ⊂ R¹.


It can be immediately noted that a Bayesian solution to this problem involves specifying a prior distribution on a large class of regression functions. In general, this is a rather difficult task. A simple approach that has been successful is to decompose the regression function g into a linear combination of a set of basis functions and to specify a prior distribution on the regression coefficients. In our discussion here, we use the (orthonormal) wavelet basis. We provide a very brief non-technical overview of wavelets including multiresolution analysis (MRA) here, but for a complete and thorough discussion refer to Ogden (1997), Daubechies (1992), Hernandez and Weiss (1996), Muller and Vidakovic (1999), and Vidakovic (1999).

10.2.1 A Brief Overview of Wavelets

Consider the function

ψ(x) = {  1,  0 ≤ x < 1/2;
         -1,  1/2 ≤ x ≤ 1;          (10.9)
          0,  otherwise,

which is known as the Haar wavelet, the simplest of the wavelets. Note that its dyadic dilations along with integer translations, namely,

ψ_{j,k}(x) = 2^{j/2} ψ(2^j x - k),  j, k ∈ Z,   (10.10)

provide a complete orthonormal system for L²(R). This says that any f ∈ L²(R) can be approximated arbitrarily well using step functions that are simply linear combinations of the wavelets ψ_{j,k}(x). What is more interesting and important is how a finer approximation for f can be written as an orthogonal sum of a coarser approximation and a detail function. In other words, for j ∈ Z, let

V_j = { f ∈ L²(R) : f is piecewise constant on intervals [k2^{-j}, (k + 1)2^{-j}), k ∈ Z }.   (10.11)
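The orthonormality of the system (10.10) can be checked numerically; the following Python sketch (our illustration, not from the text) implements the Haar wavelet (10.9) and approximates a few L² inner products by Riemann sums on a fine grid.

```python
import numpy as np

def psi(x):
    """Haar mother wavelet as in (10.9)."""
    return np.where((0 <= x) & (x < 0.5), 1.0,
                    np.where((0.5 <= x) & (x < 1.0), -1.0, 0.0))

def psi_jk(x, j, k):
    """Dyadic dilation/translation 2^{j/2} psi(2^j x - k), as in (10.10)."""
    return 2 ** (j / 2) * psi(2 ** j * x - k)

# crude L^2 inner product on a fine grid over [0, 1)
x = np.linspace(0, 1, 200_000, endpoint=False)
dx = x[1] - x[0]
ip = lambda f, g: float(np.sum(f * g) * dx)

print(ip(psi_jk(x, 1, 0), psi_jk(x, 1, 0)))  # ~ 1: unit norm
print(ip(psi_jk(x, 1, 0), psi_jk(x, 1, 1)))  # ~ 0: orthogonal translates
print(ip(psi_jk(x, 1, 0), psi_jk(x, 2, 0)))  # ~ 0: orthogonal across levels
```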

Now suppose P_j f is the projection of f ∈ L²(R) onto V_j. Then note that

P_j f = P_{j-1} f + g_{j-1}
      = P_{j-1} f + Σ_{k∈Z} ⟨f, ψ_{j-1,k}⟩ ψ_{j-1,k},   (10.12)

with g_{j-1} being the detail function as shown, so that

V_j = V_{j-1} ⊕ W_{j-1},   (10.13)

where W_j = span{ψ_{j,k}, k ∈ Z}. Also, corresponding with the 'mother' wavelet ψ (the Haar wavelet in this case), there is a father wavelet or scaling function


φ = I_[0,1] such that V_j = span{φ_{j,k}, k ∈ Z}, where φ_{j,k} is the dilation and translation of φ similar to the definition (10.10), i.e.,

φ_{j,k}(x) = 2^{j/2} φ(2^j x - k).   (10.14)

In fact, the sequence of subspaces {V_j} has the following properties:
1. ··· ⊂ V_{-2} ⊂ V_{-1} ⊂ V_0 ⊂ V_1 ⊂ V_2 ⊂ ···.
2. ∩_{j∈Z} V_j = {0}, and the closure of ∪_{j∈Z} V_j is L²(R).
3. f ∈ V_j iff f(2·) ∈ V_{j+1}.
4. f ∈ V_0 implies f(· - k) ∈ V_0 for all k ∈ Z.
5. There exists φ ∈ V_0 such that span{φ_{0,k} = φ(· - k), k ∈ Z} = V_0.
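The orthogonal split in (10.12)-(10.13) has a simple discrete analogue that may make the 'coarse plus detail' idea concrete (our illustration): one step of the Haar transform replaces a signal by pairwise averages (the coarser approximation) and pairwise differences (the detail), from which the finer level is recovered exactly.

```python
import numpy as np

# One step of the Haar multiresolution decomposition on a discrete signal:
# the finer approximation equals a coarser approximation (pairwise averages)
# plus a detail part (pairwise differences), mirroring (10.12)-(10.13).
f = np.array([4.0, 2.0, 5.0, 9.0, 7.0, 7.0, 0.0, 2.0])

coarse = (f[0::2] + f[1::2]) / 2   # projection onto the coarser space
detail = (f[0::2] - f[1::2]) / 2   # detail component

# reconstruct the finer level from the orthogonal sum
recon = np.empty_like(f)
recon[0::2] = coarse + detail
recon[1::2] = coarse - detail

print(coarse)   # [3. 7. 7. 1.]
print(detail)   # [ 1. -2.  0. -1.]
assert np.allclose(recon, f)
```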

Given this φ, the corresponding ψ can be easily derived (see Ogden (1997) or Vidakovic (1999)). What is interesting and useful to us is that there exist scaling functions φ with desirable features other than the Haar function. Especially important are the Daubechies wavelets, which are compactly supported, each with a different degree of smoothness.

Definition: Closed subspaces {V_j}_{j∈Z} satisfying properties 1-5 are said to form a multiresolution analysis (MRA) of L²(R). If V_j = span{φ_{j,k}, k ∈ Z} form an MRA of L²(R), then the corresponding φ is also said to generate this MRA.
In statistical inference, we deal with finite data sets, so wavelets with compact support are desirable. Further, the regression functions (or density functions) that we need to estimate are expected to have a certain degree of smoothness, so the wavelets used should have some smoothness as well. The Haar wavelet has compact support but is not very smooth. In the application discussed later, we use wavelets from the family of compactly supported smooth wavelets introduced by Daubechies (1992). These, however, cannot be expressed in closed form. A sketch of their construction is as follows. Because, from property 5 of the MRA above, φ ∈ V_0 ⊂ V_1, we have

φ(x) = Σ_{k∈Z} h_k φ_{1,k}(x),   (10.15)

where the 'filter' coefficients h_k are given by

h_k = ⟨φ, φ_{1,k}⟩ = √2 ∫ φ(x) φ(2x - k) dx.   (10.16)

For compactly supported wavelets φ, only finitely many h_k's will be non-zero. Define the 2π-periodic trigonometric polynomial

m_0(ω) = (1/√2) Σ_{k∈Z} h_k e^{-ikω}   (10.17)

associated with {h_k}. The Fourier transforms of φ and ψ can be shown to be of the form

φ̂(ω) = (2π)^{-1/2} ∏_{j=1}^{∞} m_0(2^{-j}ω),   (10.18)
ψ̂(ω) = e^{iω/2} m̄_0(ω/2 + π) φ̂(ω/2),   (10.19)

where the bar denotes complex conjugation. Depending on the number of non-zero elements in the filter {h_k}, wavelets of different degrees of smoothness emerge.
It is natural to wonder what is special about MRA. Smoothing techniques such as linear regression, splines, and Fourier series all try to represent a signal in terms of component functions. Wavelet-based MRA, on the other hand, studies the detail signals, the differences between the approximations made at adjacent resolution levels. This way, local changes can be picked up much more easily than with other smoothing techniques.
With this short introduction to wavelets, we return to the nonparametric regression problem in (10.8). Much of the following discussion closely follows Angers and Delampady (2001). We begin with a compactly supported wavelet function ψ ∈ C^s, the set of real-valued functions with continuous derivatives up to order s. We note that then g has the wavelet decomposition

g(x) = Σ_{|k|≤K_0} α_k φ_k(x) + Σ_{j=0}^{∞} Σ_{|k|≤K_j} β_{jk} ψ_{j,k}(x),   (10.20)

with

φ_k(x) = φ(x - k), and
ψ_{j,k}(x) = 2^{j/2} ψ(2^j x - k),

where K_j is such that φ_k(x) and ψ_{j,k}(x) vanish on I whenever |k| > K_j, and φ is the scaling function ('father wavelet') corresponding with the 'mother wavelet' ψ. Such K_j's exist (and are finite) because the wavelet function that we have chosen has compact support. For any specified resolution level J, we have

g(x) = Σ_{|k|≤K_0} α_k φ_k(x) + Σ_{j=0}^{J} Σ_{|k|≤K_j} β_{jk} ψ_{j,k}(x) + Σ_{j=J+1}^{∞} Σ_{|k|≤K_j} β_{jk} ψ_{j,k}(x)
     = g_J(x) + R_J(x),   (10.21)

where

g_J(x) = Σ_{|k|≤K_0} α_k φ_k(x) + Σ_{j=0}^{J} Σ_{|k|≤K_j} β_{jk} ψ_{j,k}(x), and

R_J(x) = Σ_{j=J+1}^{∞} Σ_{|k|≤K_j} β_{jk} ψ_{j,k}(x).   (10.22)
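The role of the truncation level can be seen numerically in the Haar case, where keeping the scaling term and detail levels 0, ..., J amounts to projecting g onto functions that are piecewise constant on dyadic intervals of length 2^{-(J+1)}. The sketch below (our illustration, with an arbitrary smooth g, not the estimator of the text) shows the remainder R_J shrinking as J grows.

```python
import numpy as np

# Projection of g onto piecewise constants on 2^(J+1) dyadic cells,
# i.e., a Haar approximation truncated at detail level J; the maximum
# approximation error decreases as J increases.
g = lambda x: np.sin(2 * x)
x = np.linspace(0, 1, 2**12, endpoint=False)

for J in (0, 2, 4):
    m = 2 ** (J + 1)                                  # number of dyadic cells
    cell_means = g(x).reshape(m, -1).mean(axis=1)     # averages per cell
    proj = np.repeat(cell_means, 2**12 // m)          # piecewise-constant fit
    print(J, float(np.max(np.abs(proj - g(x)))))      # error shrinks with J
```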


In the representation (10.22), we note that the model (10.8) can be rewritten as

Y_i = g_J(x_i) + η_i + ε_i, i = 1, ..., n,   (10.23)

where η_i = R_J(x_i). Because the amount of information available in the likelihood function to estimate the infinitely many parameters β_{jk}, j > J, |k| ≤ K_j (arising from the higher levels of resolution and appearing in η_i) is very limited, it is best to treat the η_i as nuisance parameters and eliminate them by integrating them out with respect to the prior given in (10.24) while estimating g_J. Otherwise, one would need to elicit a very informative prior on these parameters, thus attracting prior robustness issues as well. One other important issue is how large J should be. Note that the number of unknown parameters in the model grows exponentially with J, so it cannot be large for practical reasons. Also, there is no need for a large J because its purpose is to check for local details only.

10.2.2 Hierarchical Prior Structure and Posterior Computations

In the first-stage prior specification, the α_k and β_{jk} are all assumed to be independent normal random variables with mean 0. A common prior variance τ² is assigned to the α_k, whereas, to accommodate the decreasing effect of the 'detail' coefficients β_{jk}, their variance is assumed to be 2^{-2js}τ². A joint prior distribution on σ² and τ² then completes the prior specification. Even though, conditionally on τ², the α_k and β_{jk} are normally distributed, unconditionally they have heavy-tailed prior distributions possessing robustness properties.
Let us now introduce some notation to facilitate the derivation of posterior quantities. Let γ = (α', β')', where α = (α_k)_{|k|≤K_0} and β = (β_{jk})_{0≤j≤J, |k|≤K_j}. Then the first-stage prior specified above is

γ | τ² ∼ N_{2K_0+1+M_β}(0, τ² Γ),  where  Γ = ( I_{2K_0+1}      0
                                                     0       Δ_{M_β} ),

with M_β = Σ_{j=0}^{J}(2K_j + 1) and the diagonal matrix Δ being the variance-covariance matrix of β. Also,

η | τ² ∼ N_n(0, τ² Q_n),   (10.24)

where, to keep the covariance structure of the η_i simple, we choose a simple form for Q_n depending on some moderate value of a constant c. Further, let X be the n × (2K_0 + 1 + M_β) matrix of the basis functions φ_k and ψ_{j,k} evaluated at x_1, ..., x_n. Then

Y = Xγ + u,   (10.25)

where u = η + ε ∼ N_n(0, Σ) with Σ = σ² I_n + τ² Q_n. This follows from the fact that

Y | γ, η, σ², τ² ∼ N_n(Xγ + η, σ² I_n),
η | τ² ∼ N_n(0, τ² Q_n).   (10.26)

From (10.25), using standard hierarchical Bayes techniques (cf. Lindley and Smith (1972)) and matrix identities (cf. Searle (1982)), it follows that

Y | σ², τ² ∼ N_n(0, σ² I_n + τ²(XΓX' + Q_n)),   (10.27)
γ | Y, σ², τ² ∼ N(AY, B),   (10.28)

where

A = τ² ΓX'(σ² I_n + τ²(XΓX' + Q_n))^{-1},
B = τ² Γ - τ⁴ ΓX'(σ² I_n + τ²(XΓX' + Q_n))^{-1} XΓ.
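Formulas (10.27)-(10.28) translate directly into code. The NumPy sketch below (with toy, made-up choices of X, Γ, Q_n, σ², and τ²) computes the posterior mean AY directly and checks that the spectral form ΓX'H(νI_n + D)^{-1}t used later gives the same vector.

```python
import numpy as np

# Numerical sketch of the posterior mean in (10.28); all matrices and
# variances here are invented for illustration.
rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.standard_normal((n, p))
Gamma = np.eye(p)                 # prior covariance scale of gamma
Q = np.eye(n)                     # covariance scale of eta
sigma2, tau2 = 1.0, 2.0
Y = rng.standard_normal(n)

# direct computation of A in (10.28)
M = sigma2 * np.eye(n) + tau2 * (X @ Gamma @ X.T + Q)
A = tau2 * Gamma @ X.T @ np.linalg.inv(M)
gamma_hat = A @ Y

# spectral form: X Gamma X' + Q = H D H', nu = sigma2 / tau2
d, H = np.linalg.eigh(X @ Gamma @ X.T + Q)
nu = sigma2 / tau2
t = H.T @ Y
gamma_hat_spec = Gamma @ X.T @ H @ ((1.0 / (nu + d)) * t)

assert np.allclose(gamma_hat, gamma_hat_spec)
```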

To proceed to the second-stage calculations, some algebraic simplifications are needed (see Angers and Delampady (1992)). Spectral decomposition yields XΓX' + Q_n = HDH', where D = diag(d_1, d_2, ..., d_n) is the matrix of eigenvalues and H is the orthogonal matrix of eigenvectors. Thus,

σ² I_n + τ²(XΓX' + Q_n) = H(σ² I_n + τ² D)H'
                        = τ² H(ν I_n + D)H',   (10.29)

where ν = σ²/τ². Using this spectral decomposition, the marginal density of Y given τ² and ν can be written as

m(Y | τ², ν) = (2πτ²)^{-n/2} ∏_{i=1}^{n} (ν + d_i)^{-1/2} exp( -(1/(2τ²)) Σ_{i=1}^{n} t_i²/(ν + d_i) ),   (10.30)

where t = (t_1, ..., t_n)' = H'Y.


To derive the wavelet smoother, all that we need to do now is to eliminate the hyper- and nuisance parameters from the first-stage posterior distribution by integrating them out with respect to the second-stage prior. This is what we do now. Alternatively, one could employ an empirical Bayes approach: estimate σ² and τ² from equation (10.27), and replace σ² and τ² by their estimates in equation (10.28) to approximate γ̂. However, this will underestimate the variance of the wavelet estimator Ŷ = Xγ̂. Suppose, then, that π₂(τ², ν) is the second-stage prior. It is well known in the context of hierarchical Bayesian analysis (see Chapter 9, especially equation (9.7), and Berger (1985a)) that the sensitivity of the final Bayes estimator to the second- and higher-stage hyperpriors is somewhat limited. Therefore, for computational ease, we choose π₂(τ², ν) = π₂₂(ν)(τ²)^{-a} for some suitable choice of a > 0, where π₂₂ is the prior specified for ν. Once a and π₂₂ are specified, using equation (10.28) along with (10.29) and taking the expectation with respect to τ², we have that

γ̂ = E(γ | Y) = ΓX'H E[(νI_n + D)^{-1} | Y] t,   (10.31)

where the expectation is taken with respect to π₂₂(ν | Y). Again using equations (10.28) and (10.29), the posterior covariance matrix of γ can be written as

Var(γ | Y) = (1/(n + 2a)) E[ Σ_{i=1}^{n} t_i²/(ν + d_i) | Y ] Γ
           - (1/(n + 2a)) ΓX'H E[ (Σ_{i=1}^{n} t_i²/(ν + d_i)) (νI_n + D)^{-1} | Y ] H'XΓ
           + E[γ̂(ν)γ̂(ν)' | Y] - γ̂γ̂',   (10.32)

where γ̂(ν) = ΓX'H(νI_n + D)^{-1} t. To compute these expectations, one can use several techniques. Because they involve only one-dimensional integrals, standard numerical integration methods will work quite well. Several versions of the standard Monte Carlo approach can also be employed quite satisfactorily and efficiently. An example illustrating the methodology follows.

Example 10.2. This is based on data provided by Prof. Abraham Verghese (F.R.E.S.) of the Indian Institute of Horticultural Research, Bangalore, India (personal communication), which have already been analyzed in Angers and Delampady (2001). The variable of interest y that we have chosen from the data set is the weekly average humidity level. The observations were made from June 1, 1995, to December 13, 1998. (For some reason, the observations were not recorded on the same day of the week every time.) We have chosen time (day of recording the observation) as the covariate x. (Any other available covariate can be used also, because wavelet-based smoothing with respect to any arbitrary covariate (measured in some general way) can be handled with our methodology.)

Fig. 10.1. Wavelet smoother and its error bands for the Humidity data. [Plot of humidity against Days (200 to 1400), showing the data points, the smoother, and dotted error bands.]

For illustration purposes, we have chosen the model with J = 6; the hyperparameter a is 0.5, and the prior π₂₂ corresponds with an F distribution with degrees of freedom 24 and 4. We have used compactly supported Daubechies wavelets for this analysis. As explained earlier, these cannot be expressed in closed form, but computations with these wavelets are possible using any of several statistical and mathematical software packages. In Figure 10.1, we have plotted ĝ_J (solid line) along with its error bands (dotted lines), ±2√(Var(y | Y)), where

Var(y I Y) = Var(gJ(x)

+ TJ + s I Y).

More details on this example as well as other studies can be found in Angers and Delampady (2001).

10.3 Estimation of Regression Function Using Dirichlet Multinomial Allocation

In Section 10.2, wavelets are used to represent the nonparametric regression function in (10.8) and a prior is put on the wavelet coefficients. Here we present an alternative approach based on the observation that the unknown regression function is locally linear and hence one may use a high-dimensional


parametric family for modeling locally linear regression. Suppose we have a regression problem with a response variable Y and a regressor variable X. Let (X_1, Y_1), ..., (X_n, Y_n) be independent paired observations on (X, Y). Consider first the usual normal linear regression model where, given values x_i of the regressor variable, the Y_i's are independently normally distributed with common variance σ_ε² and mean E(Y_i | x_i) = β_1 + β_2 x_i, a linear function of x_i. Let the Z_i = (X_i, Y_i) be independent, Z_i having the density

f(z_i | φ_i) = f_X(x_i | μ_i, σ_i²) f_Y(y_i | x_i, β_{1i}, β_{2i}, σ_ε²),

where f_X(x | μ_i, σ_i²) and f_Y(y | x, β_{1i}, β_{2i}, σ_ε²) denote respectively the N(μ_i, σ_i²) density for x_i and the N(β_{1i} + β_{2i}x, σ_ε²) density for y_i given x, and φ_i = (μ_i, σ_i², β_{1i}, β_{2i}), i = 1, ..., n. For simplicity we assume σ_ε² is known, say, equal to 1. For the remaining parameters φ_i, i = 1, ..., n, we have the Dirichlet multinomial allocation (DMA) prior, defined as follows.
(1) Let k ∼ p(k), a distribution on {1, 2, ..., n}.
(2) Given k, the φ_i, i = 1, ..., n, have at most k distinct values θ_1, ..., θ_k, where the θ_i's are i.i.d. ∼ G_0 and G_0 is a distribution on the space of (μ, σ², β_1, β_2) (our choice of G_0 is mentioned below).
(3) Given k, the vector of weights (w_1, ..., w_k) ∼ Dirichlet(δ_1, ..., δ_k).
(4) Allocation variables a_1, ..., a_n are independent with

P(a_i = j) = w_j, j = 1, ..., k.

(5) Finally, φ_i = θ_{a_i}, i = 1, ..., n.
For simplicity, we illustrate with a known k (which will be taken appropriately large). We refer to Richardson and Green (1997) for the treatment of the case with unknown k; see also the discussion of that paper by Gruet and Robert, and Green and Richardson (2001). Under this prior, the φ_i = (μ_i, σ_i², β_{1i}, β_{2i}), i = 1, ..., n, are exchangeable. This allows borrowing of strength, as in Chapter 9, from clusters of (x_i, y_i)'s with similar values. To see how this works, one has to calculate the Bayes estimate through MCMC. We take G_0 to be the product of a normal distribution for μ, an inverse Gamma distribution for σ², and normal distributions for β_1 and β_2. The full conditionals needed for sampling from the posterior using the Gibbs sampler can be easily obtained; see Robert and Casella (1999) in this context. For example, the conditional posterior distribution of a_1, ..., a_n given the other parameters is as follows:

a_i = j with probability w_j f(Z_i | θ_j) / Σ_{r=1}^{k} w_r f(Z_i | θ_r), j = 1, ..., k,

for i = 1, ..., n, and a_1, ..., a_n are independent.
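The allocation step just displayed is straightforward to code. The sketch below is a toy, one-dimensional version (our illustration: f(z | θ_j) is taken to be a N(θ_j, 1) density and all numbers are invented) of a single Gibbs draw of a_1, ..., a_n.

```python
import numpy as np

# One Gibbs update of the allocation variables: P(a_i = j) is proportional
# to w_j * f(z_i | theta_j), and the a_i are drawn independently.
rng = np.random.default_rng(1)
theta = np.array([-2.0, 0.0, 3.0])       # current component parameters
w = np.array([0.3, 0.5, 0.2])            # current mixture weights
z = np.array([-1.8, 0.1, 2.7, 3.2])      # observations

def normal_pdf(z, mu):
    return np.exp(-0.5 * (z - mu) ** 2) / np.sqrt(2 * np.pi)

# n x k matrix of unnormalized allocation probabilities w_j f(z_i | theta_j)
probs = w * normal_pdf(z[:, None], theta[None, :])
probs /= probs.sum(axis=1, keepdims=True)

# sample the allocations, one per observation
a = np.array([rng.choice(len(theta), p=row) for row in probs])
print(probs.round(3))
print(a)
```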


Due to conjugacy, the other full conditional distributions can be easily obtained. You are invited to calculate the conditional posteriors in Problem 4. Note that given k, θ_1, ..., θ_k and w_1, ..., w_k, we have a mixture with k components. Each component models a locally linear regression. Because the θ_i and w_i are random, we have a rich family of locally linear regression models from which the posterior chooses different members and assigns to each member model a weight equal to its posterior probability density. The weight is a measure of how close this member model is to the data. The Bayes estimate of the regression function is a weighted average of the (conditional) expectations of locally linear regressions.
We illustrate the use of this method with a set of data simulated from a model for which E(Y | x) = sin(2x). We generate 100 pairs of observations (X_i, Y_i) with normal errors ε_i. A scatter plot of the data points and a plot of the estimated regression at each x_i (using the Bayes estimates of β_{1i}, β_{2i}), together with the graph of sin(2x), are presented in Figure 10.2.

Fig. 10.2. Scatter plot, estimated regression, and true regression function.

In our calculation, we have chosen the hyperparameters of the priors suitably so as to have priors with little information. Seo (2004) discusses the choice of hyperpriors and hyperparameters in examples of this kind. Following Muller et al. (1996), Seo (2004) also uses a Dirichlet process prior instead of the DMA. The Dirichlet process prior is beyond the scope of our book; see Ghosh and Ramamoorthi (2003, Chapter 3) for details. It is worth noting that the method developed here works equally well if X is non-stochastic (as in Section 10.2) or has a known distribution. The trick is to ignore these facts and pretend that X is random as above. See Muller et al. (1996) for further discussion of this point.

10.4 Exercises

1. Verify that Haar wavelets generate an MRA of L²(R).
2. Indicate how Bayes factors can be used to obtain the optimal resolution level J in (10.21).
3. Derive an appropriate wavelet smoother for the data given in Table 5.1 and compare the results with those obtained using linear regression in Section 5.4.
4. For the problem in Section 10.3, explain how MCMC can be implemented, deriving explicitly all the full conditionals needed.
5. Choose any of the high-dimensional problems in Chapters 9 or 10 and suggest how hyperparameters may be chosen there. Discuss whether your findings will apply to all the higher levels of hierarchy.

A Common Statistical Densities

For quick reference, listed below are some common statistical densities that are used in examples and exercise problems in the book. Only a brief description is supplied, including the name of the density, the notation (abbreviation) used in the book, the density itself, the range of the variable argument, the parameter values, and some useful moments.

A.1 Continuous Models

1. Univariate normal (N(μ, σ²)):
   f(x|μ, σ²) = (2πσ²)^{-1/2} exp(-(x - μ)²/(2σ²)), -∞ < x < ∞, -∞ < μ < ∞, σ² > 0.
   Mean = μ, variance = σ².
   Special case: N(0, 1) is known as standard normal.
2. Multivariate normal (N_p(μ, Σ)):
   f(x|μ, Σ) = (2π)^{-p/2} |Σ|^{-1/2} exp(-(x - μ)'Σ^{-1}(x - μ)/2),
   x ∈ R^p, μ ∈ R^p, Σ_{p×p} positive definite.
   Mean vector = μ, covariance or dispersion matrix = Σ.
3. Exponential (Exp(λ)):
   f(x|λ) = λ exp(-λx), x > 0, λ > 0.
   Mean = 1/λ, variance = 1/λ².
4. Double exponential or Laplace (DExp(μ, σ)):
   f(x|μ, σ) = (1/(2σ)) exp(-|x - μ|/σ), -∞ < x < ∞, -∞ < μ < ∞, σ > 0.
   Mean = μ, variance = 2σ².


5. Gamma (Gamma(α, λ)):
   f(x|α, λ) = (λ^α/Γ(α)) x^{α-1} exp(-λx), x > 0, α > 0, λ > 0.
   Mean = α/λ, variance = α/λ².
   Special cases: (i) Exp(λ) is Gamma(1, λ). (ii) Chi-square with n degrees of freedom (χ²_n) is Gamma(n/2, 1/2).
6. Uniform (U(a, b)):
   f(x|a, b) = (b - a)^{-1} I_{(a,b)}(x), -∞ < a < b < ∞.
   Mean = (a + b)/2, variance = (b - a)²/12.
7. Beta (Beta(α, β)):
   f(x|α, β) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α-1}(1 - x)^{β-1} I_{(0,1)}(x), α > 0, β > 0.
   Mean = α/(α + β), variance = αβ/{(α + β)²(α + β + 1)}.
   Special case: U(0, 1) is Beta(1, 1).
8. Cauchy (Cauchy(μ, σ²)):
   f(x|μ, σ²) = (1/(πσ)) (1 + (x - μ)²/σ²)^{-1}, -∞ < x < ∞, -∞ < μ < ∞, σ² > 0.
   Mean and variance do not exist.
9. t distribution (t(α, μ, σ²)):
   f(x|α, μ, σ²) = (Γ((α + 1)/2)/(σ√(απ) Γ(α/2))) (1 + (x - μ)²/(ασ²))^{-(α+1)/2},
   -∞ < x < ∞, α > 0, -∞ < μ < ∞, σ² > 0.
   Mean = μ if α > 1, variance = ασ²/(α - 2) if α > 2.
   Special cases: (i) Cauchy(μ, σ²) is t(1, μ, σ²). (ii) t(k, 0, 1) = t_k is known as Student's t with k degrees of freedom.
10. Multivariate t (t_p(α, μ, Σ)):
    f(x|α, μ, Σ) = (Γ((α + p)/2)/((απ)^{p/2} Γ(α/2))) |Σ|^{-1/2} (1 + (1/α)(x - μ)'Σ^{-1}(x - μ))^{-(α+p)/2},
    x ∈ R^p, α > 0, μ ∈ R^p, Σ_{p×p} positive definite.
    Mean vector = μ if α > 1, covariance or dispersion matrix = αΣ/(α - 2) if α > 2.


11. F distribution with degrees of freedom α and β (F(α, β)):
    f(x|α, β) = (Γ((α + β)/2)/(Γ(α/2)Γ(β/2))) (α/β)^{α/2} x^{α/2-1} (1 + (α/β)x)^{-(α+β)/2}, x > 0, α > 0, β > 0.
    Mean = β/(β - 2) if β > 2, variance = 2β²(α + β - 2)/{α(β - 4)(β - 2)²} if β > 4.
    Special cases: (i) If X ∼ t(α, μ, σ²), then (X - μ)²/σ² ∼ F(1, α). (ii) If X ∼ t_p(α, μ, Σ), then (1/p)(X - μ)'Σ^{-1}(X - μ) ∼ F(p, α).
12. Inverse Gamma (inverse Gamma(α, λ)):
    f(x|α, λ) = (λ^α/Γ(α)) x^{-(α+1)} exp(-λ/x), x > 0, α > 0, λ > 0.
    Mean = λ/(α - 1) if α > 1, variance = λ²/{(α - 1)²(α - 2)} if α > 2.
    If X ∼ inverse Gamma(α, λ), then 1/X ∼ Gamma(α, λ).
13. Dirichlet (finite dimensional) (D(α)):
    f(x|α) = (Γ(Σ_{i=1}^{k} α_i)/∏_{i=1}^{k} Γ(α_i)) ∏_{i=1}^{k} x_i^{α_i-1},
    x = (x_1, ..., x_k)' with 0 ≤ x_i ≤ 1 for 1 ≤ i ≤ k and Σ_{i=1}^{k} x_i = 1, and α = (α_1, ..., α_k)' with α_i > 0 for 1 ≤ i ≤ k.
    Mean vector = α/(Σ_{i=1}^{k} α_i), covariance or dispersion matrix = C_{k×k}, where
    C_ij = α_i(Σ_{l=1}^{k} α_l - α_i)/{(Σ_{l=1}^{k} α_l)²(Σ_{l=1}^{k} α_l + 1)} if i = j, and
    C_ij = -α_iα_j/{(Σ_{l=1}^{k} α_l)²(Σ_{l=1}^{k} α_l + 1)} if i ≠ j.
14. Wishart (W_p(n, Σ)):
    f(A|Σ) = (1/(2^{np/2} Γ_p(n/2))) |Σ|^{-n/2} exp(-trace{Σ^{-1}A}/2) |A|^{(n-p-1)/2},
    A_{p×p} positive definite, Σ_{p×p} positive definite, n ≥ p, p a positive integer, where
    Γ_p(a) = ∫_{A positive definite} exp(-trace{A}) |A|^{a-(p+1)/2} dA, for a > (p - 1)/2.
    Mean = nΣ. For other moments, see Muirhead (1982).
    Special case: χ²_n is W_1(n, 1). If W^{-1} ∼ W_p(n, Σ), then W is said to follow the inverse-Wishart distribution.


15. Logistic (Logistic(μ, σ)):
    f(x|μ, σ) = (1/σ) exp(-(x - μ)/σ) / (1 + exp(-(x - μ)/σ))², -∞ < x < ∞, -∞ < μ < ∞, σ > 0.
    Mean = μ, variance = π²σ²/3.

A.2 Discrete Models

1. Binomial (B(n, p)):
   f(x|n, p) = C(n, x) p^x (1 - p)^{n-x},
   x = 0, 1, ..., n, 0 ≤ p ≤ 1, n ≥ 1 integer.
   Mean = np, variance = np(1 - p).
   Special case: Bernoulli(p) is B(1, p).
2. Poisson (P(λ)):
   f(x|λ) = exp(-λ)λ^x/x!, x = 0, 1, ..., λ > 0.
   Mean = λ, variance = λ.
3. Geometric (Geometric(p)):
   f(x|p) = (1 - p)^x p, x = 0, 1, ..., 0 < p ≤ 1.
   Mean = (1 - p)/p, variance = (1 - p)/p².
4. Negative binomial (Negative binomial(k, p)):
   f(x|k, p) = C(x + k - 1, x) (1 - p)^x p^k, x = 0, 1, ..., 0 < p ≤ 1, k ≥ 1 integer.
   Mean = k(1 - p)/p, variance = k(1 - p)/p².
   Special case: Geometric(p) is Negative binomial(1, p).
5. Multinomial (Multinomial(n, p)):
   f(x|n, p) = (n!/∏_{i=1}^{k} x_i!) ∏_{i=1}^{k} p_i^{x_i},
   x = (x_1, ..., x_k)' with x_i an integer between 0 and n for 1 ≤ i ≤ k, Σ_{i=1}^{k} x_i = n, and p = (p_1, ..., p_k)' with 0 ≤ p_i ≤ 1 for 1 ≤ i ≤ k and Σ_{i=1}^{k} p_i = 1.
   Mean vector = np, covariance or dispersion matrix = C_{k×k}, where
   C_ij = np_i(1 - p_i) if i = j, and C_ij = -np_ip_j if i ≠ j.
   (Here C(n, x) denotes the binomial coefficient n!/(x!(n - x)!).)
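The moment formulas listed above are easy to confirm numerically by summing over the support; the sketch below (ours) checks the binomial and geometric entries.

```python
from math import comb

# Check binomial moments by direct summation over the support.
n, p = 10, 0.3
pmf = {x: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}
mean = sum(x * f for x, f in pmf.items())
var = sum((x - mean) ** 2 * f for x, f in pmf.items())
print(mean, var)      # ~ np = 3.0 and np(1 - p) = 2.1

# Check the geometric mean by a (truncated, rapidly converging) sum.
q = 0.4               # success probability
geom_mean = sum(x * (1 - q) ** x * q for x in range(1000))
print(geom_mean)      # ~ (1 - q)/q = 1.5
```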

B Birnbaum's Theorem on Likelihood Principle

The object of this appendix is to rewrite the usual proof of Birnbaum's theorem (e.g., as given in Basu (1988)) using only mathematical statements and carefully defining all symbols and the domain of discourse.
Let θ ∈ Θ be the parameter of interest. A statistical experiment E is performed to generate a sample x. An experiment E is given by the triplet (X, A, p), where X is the sample space, A is the class of all subsets of X, and p = {p(·|θ), θ ∈ Θ} is a family of probability functions on (X, A), indexed by the parameter space Θ. Below we consider experiments with a fixed parameter space Θ.
A (finite) mixture of experiments E_1, ..., E_k with mixture probabilities π_1, ..., π_k (non-negative numbers free of θ, summing to unity), which may be written as Σ_{i=1}^{k} π_i E_i, is defined as a two-stage experiment where one first selects E_i with probability π_i and then observes x_i ∈ X_i by performing the experiment E_i. Consider now a class of experiments closed under the formation of (finite) mixtures.
Let E = (X, A, p) and E' = (X', A', p') be two experiments and x ∈ X, x' ∈ X'. By equivalence of the two points (E, x) and (E', x'), we mean that one makes the same inference on θ whether one performs E and observes x or performs E' and observes x', and we denote this as

(E, x) ∼ (E', x').

We now consider the following principles.

The likelihood principle (LP): We say that the equivalence relation "∼" obeys the likelihood principle if (E, x) ∼ (E', x') whenever

p(x|θ) = c p'(x'|θ) for all θ ∈ Θ,   (B.1)

for some constant c > 0.


The weak conditionality principle (WCP): An equivalence relation "∼" satisfies WCP if for a mixture of experiments E = Σ_{i=1}^{k} π_i E_i,

(E, (i, x_i)) ∼ (E_i, x_i)

for any i ∈ {1, ..., k} and x_i ∈ X_i.

The sufficiency principle (SP): An equivalence relation "∼" satisfies SP if (E, x) ∼ (E, x') whenever S(x) = S(x') for some sufficient statistic S for θ (or equivalently, S(x) = S(x') for a minimal sufficient statistic S). It is shown in Basu and Ghosh (1967) (see also Basu (1969)) that for discrete models a minimal sufficient statistic exists and is given by the likelihood partition, i.e., the partition induced by the equivalence relation (B.1) for two points x, x' from the same experiment. The difference between the likelihood principle and the sufficiency principle is that in the former x, x' may belong to possibly different experiments, whereas in the sufficiency principle they belong to the same experiment.
The weak sufficiency principle (WSP): An equivalence relation "∼" satisfies WSP if (E, x) ∼ (E, x') whenever p(x|θ) = p(x'|θ) for all θ. It follows that SP implies WSP, which can be seen by noting that

S(x) = ( p(x|θ) / Σ_{θ'∈Θ} p(x|θ') )_{θ∈Θ}

is a (minimal) sufficient statistic. We assume without loss of generality that Σ_{θ∈Θ} p(x|θ) > 0 for all x ∈ X.

We now state and prove Birnbaum's theorem on the likelihood principle (Birnbaum (1962)).

Theorem B.1. WCP and WSP together imply LP; i.e., if an equivalence relation satisfies WCP and WSP, then it also satisfies LP.

Proof. Suppose an equivalence relation "∼" satisfies WCP and WSP. Consider two experiments E_1 = (X_1, A_1, p_1) and E_2 = (X_2, A_2, p_2) with the same Θ, and samples x_i ∈ X_i, i = 1, 2, such that

p_1(x_1|θ) = c p_2(x_2|θ) for all θ ∈ Θ,   (B.2)

for some c > 0. We are to show that (E_1, x_1) ∼ (E_2, x_2). Consider the mixture experiment E of E_1 and E_2 with mixture probabilities 1/(1 + c) and c/(1 + c), respectively, i.e.,

E = (1/(1 + c)) E_1 + (c/(1 + c)) E_2.


The points (1, x_1) and (2, x_2) in the sample space of E have probabilities p_1(x_1|θ)/(1 + c) and p_2(x_2|θ)c/(1 + c), respectively, which are the same by (B.2). WSP then implies that

(E, (1, x_1)) ∼ (E, (2, x_2)).   (B.3)

Also, by WCP,

(E, (1, x_1)) ∼ (E_1, x_1) and (E, (2, x_2)) ∼ (E_2, x_2).   (B.4)

From (B.3) and (B.4), we have (E_1, x_1) ∼ (E_2, x_2). □
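A classic concrete instance of the theorem's conclusion (our example, not from this appendix): 9 successes in 12 Bernoulli trials arise under binomial sampling (n = 12 fixed) with probability C(12, 9)θ⁹(1 - θ)³, and under negative binomial sampling (observe until the 3rd failure) with probability C(11, 9)θ⁹(1 - θ)³. The two likelihoods satisfy (B.1) with c = C(12, 9)/C(11, 9) = 4, so LP demands identical inferences about θ.

```python
from math import comb

# Likelihoods of "9 successes, 3 failures" under two sampling schemes:
# binomial (n = 12 fixed) and negative binomial (stop at the 3rd failure).
binom = lambda theta: comb(12, 9) * theta**9 * (1 - theta) ** 3
negbin = lambda theta: comb(11, 9) * theta**9 * (1 - theta) ** 3

ratios = [binom(t) / negbin(t) for t in (0.1, 0.3, 0.5, 0.7, 0.9)]
print(ratios)  # constant in theta: comb(12, 9) / comb(11, 9) = 4.0
```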

C Coherence

Coherence was originally introduced by de Finetti to show that any quantification of uncertainty that does not satisfy the axioms of a (finitely additive) probability distribution would lead to sure loss in a suitably chosen gamble. This is formally stated in Theorem C.1 below. This section is based on Schervish (1995, pp. 654, 655), except that we use finite additivity instead of countable additivity.

Definition 1. For a bounded random variable X, the fair price or prevision P(X) is a number p such that a gambler is willing to accept all gambles of the form c(X - p), for all c in some sufficiently small symmetric interval around 0. Here c(X - p) represents the gain to the gambler. That the values of c are sufficiently small ensures that all losses are within the means of the gambler to pay, at least for bounded X.

Definition 2. Let {X_a, a ∈ A} be a collection of bounded random variables. Suppose that for each X_a, P(X_a) is the prevision of a gambler who is willing to accept all gambles of the form c(X_a - P(X_a)) for -d_a ≤ c ≤ d_a. These previsions are defined to be coherent if there do not exist a finite set A_0 ⊂ A and {c_a : -d ≤ c_a ≤ d, a ∈ A_0}, d ≤ min{d_a, a ∈ A_0}, such that Σ_{a∈A_0} c_a(X_a - P(X_a)) < 0 for all values of the random variables. It is assumed that a gambler willing to accept each of a finite number of gambles c_a(X_a - P(X_a)), a ∈ A_0, is also willing to take the combined gamble Σ_{a∈A_0} c_a(X_a - P(X_a)), for c_a sufficiently small and A_0 finite. If each X_a takes only a finite number of distinct values (as in Theorem C.1 below), then Σ_{a∈A_0} c_a(X_a - P(X_a)) < 0 for all values of the X_a's if and only if Σ_{a∈A_0} c_a(X_a - P(X_a)) < -ε for all values of the X_a's, for some ε > 0. The second condition is what de Finetti requires. If the previsions of a gambler are not coherent (i.e., are incoherent), then he can be forced to lose money always in a suitably chosen gamble.

Theorem C.1. Let (S, A) be a "measurable" space. Suppose that for each A ∈ A, the prevision is P(I_A), where I_A denotes the indicator of A. Then the


previsions are coherent if and only if the set function μ, defined as μ(A) = P(I_A), is a finitely additive probability on A.

Proof. Suppose μ is a finitely additive probability on (S, A). Let {A_1, ..., A_m} be any finite collection of elements of A, and suppose the gambler is ready to accept gambles of the form c_i(I_{A_i} - P(I_{A_i})). Then

Z = Σ_{i=1}^{m} c_i(I_{A_i} - P(I_{A_i}))

has μ-expectation equal to 0, and therefore it is not possible that Z is always less than 0. This implies incoherence cannot happen.
Conversely, assume coherence. We show that μ is a finitely additive probability by showing that any violation of the probability axioms leads to a non-zero non-random gamble that can be made negative.
(i) μ(∅) = 0: Because I_∅ = 0, the gamble c(I_∅ - μ(∅)) = -cμ(∅) must not be always negative, for both positive and negative values of c, implying μ(∅) = 0. Similarly, μ(S) = 1.
(ii) μ(A) ≥ 0 for all A ∈ A: If μ(A) < 0, then for any c < 0, c(I_A - μ(A)) ≤ -cμ(A) < 0. This means there is incoherence.
(iii) μ is finitely additive: Let A_1, ..., A_m be disjoint sets in A with ∪_{i=1}^{m} A_i = A. Let

Z = Σ_{i=1}^{m} c(I_{A_i} - μ(A_i)) - c(I_A - μ(A)) = c( μ(A) - Σ_{i=1}^{m} μ(A_i) ).

If μ(A) < Σ_{i=1}^{m} μ(A_i), then Z is always negative for any c > 0, whereas μ(A) > Σ_{i=1}^{m} μ(A_i) implies Z is always negative for any c < 0. Thus μ(A) ≠ Σ_{i=1}^{m} μ(A_i) leads to incoherence. □
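The sure-loss phenomenon behind Theorem C.1 can be made concrete (a constructed example, not from the text): previsions P(I_A) = P(I_{A^c}) = 0.7 violate finite additivity, since μ(A) + μ(A^c) should equal 1, and taking c = 1 in both gambles produces a guaranteed negative gain.

```python
# A minimal Dutch-book illustration: a gambler quotes P(I_A) = 0.7 and
# P(I_{A^c}) = 0.7. With c = 1 in both gambles c(I_A - P(I_A)) and
# c(I_{A^c} - P(I_{A^c})), the total gain is negative no matter what happens.
for indicator_A in (0, 1):
    indicator_Ac = 1 - indicator_A
    gain = (indicator_A - 0.7) + (indicator_Ac - 0.7)
    print(round(gain, 10))   # -0.4 in both cases: a sure loss
```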

D Microarray

Proteins are essential for sustaining life of a living organism. Every cell in an individual has the same information for production of a large number of proteins. This information is encoded in the DNA. The information is transcribed and translated by the cell machinery to produce proteins. Different proteins are produced by different segments of the DNA that are called genes. Although every cell has the same information for production of the same set of proteins, all cells do not produce all proteins. Within an individual, cells are organized into groups that are specialized to perform specific tasks. Such groups of specialized cells are called tissues. A number of tissues makes up an organ, such as pancreas. Two tissues may produce completely disjoint sets of proteins; or, may produce the same protein in different quantities. The molecule that transfers information from the genes for the production of proteins is called the messenger RNA (mRNA). Genes that produce a lot of mRNA are said to be upregulated and genes that produce little or no mRNA are said to be downregulated. For example, in certain cells of the pancreas, the gene that produces insulin will be upregulated (that is, large amounts of insulin mRNA will be produced), whereas it will be downregulated in the liver (because insulin is produced only by certain cells of the pancreas and by no other organ in the human body). In certain disease states, such as diabetes, there will be alteration in the amount of insulin mRNA. A microarray is a tool for measuring the amount of mRNA that is circulating in a cell. Microarrays simultaneously measure the amount of circulating mRNA corresponding with thousands of different genes. Among various applications, such data are helpful in understanding the nature and extent of involvement of different genes in various diseases, such as cancer, diabetes, etc. 
A generic microarray consists of multiple spots of DNA and is used to determine the quantities of mRNA in a collection of cells. The DNA in each spot is from a gene of interest and serves as a probe for the mRNA encoded by that gene. In general, one can think of a microarray as a grid (or a matrix) of several thousand DNA spots in a very small area (glass or polymer surface).


Each spot has a unique DNA sequence, different from the DNA sequence of the other spots around it. mRNA from a clump of cells (that is, all the mRNAs produced by the different genes that are expressed in these cells) is extracted and experimentally converted (using a biochemical molecule called reverse transcriptase) to its complementary DNA strands (cDNA). A molecular tag that glows is attached to each piece of cDNA. This mixture is then "poured" over a microarray. Each DNA spot in the microarray will hybridize (that is, attach itself) only to its complementary DNA strand. The amount of fluorescence (usually measured using a laser beam) at a particular spot on the microarray gives an indication as to how much mRNA of a particular type was present in the original sample. There are many sources of variability in the observations from a microarray experiment. Aside from the intrinsic biological variability across individuals or across tissues within the same individual, among the more important sources of variability are (a) the method of mRNA extraction from the cells; (b) the nature of the fluorescent tags used; (c) the temperature and time under which the experiment (that is, hybridization) was performed; (d) the sensitivity of the laser detector in relation to the chemistry of the fluorescent tags used; and (e) the sensitivity and robustness of the image analysis system that is used to identify and quantify the fluorescence at each spot in the microarray. All of these experimental factors get in the way of comparing results across microarray experiments. Even within one experiment, the brightness of two spots can vary even when the same number of complementary DNA strands have hybridized to the spots, necessitating normalization of each image using statistical methods. The amount of mRNA is quantified by a fluorescent signal. Some spots on a microarray, after the chemical reaction, show high levels of fluorescence and some show low or no fluorescence.
The genes that show a high level of fluorescence are likely to be expressed, whereas the genes corresponding to a low level of fluorescence are likely to be under-expressed or not expressed. Even genes that are not expressed may show low levels of fluorescence, which is treated as noise. The software package of the experimenter identifies background noise and calculates its mean, which is subtracted from all the measurements of fluorescence. This is the final observation X_i that we model, with μ = 0 indicating no expression, μ > 0 indicating expression, and μ < 0 indicating negative expression, i.e., under-expression. If the genes turn out in further studies to regulate growth of tumors, the expressed genes might help growth while the under-expressed genes could inhibit it.
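The background-correction step just described can be sketched numerically. The following is a minimal, hypothetical illustration (the simulated intensity distributions and their parameters are assumptions for the sketch, not values from the text): the mean of the background readings is subtracted from every spot measurement to give the observations X_i, whose sign is then read as tentative evidence of expression or under-expression.

```python
import random
import statistics

# Hypothetical raw data: fluorescence for 1000 gene spots and 200 readings
# of background noise (distributions and parameters are assumptions).
random.seed(0)
spot_intensity = [random.gammavariate(2.0, 50.0) for _ in range(1000)]
background = [random.gauss(20.0, 2.0) for _ in range(200)]

# The software estimates the mean background fluorescence ...
noise_level = statistics.mean(background)

# ... and subtracts it from every spot measurement, giving the final
# observations X_i that the text models through the mean mu.
x = [s - noise_level for s in spot_intensity]

# Crude read-off of the sign convention: X_i > 0 suggests expression,
# X_i < 0 suggests under-expression (formal inference on mu comes later).
expressed = sum(1 for xi in x if xi > 0)
under_expressed = sum(1 for xi in x if xi < 0)
```

In practice this step is preceded by the image normalization mentioned above, so the corrected values are comparable across spots before any modeling of μ is attempted.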

E Bayes Sufficiency

If T is a sufficient statistic, then, at least in the discrete case and the continuous case with a p.d.f., an application of the factorization theorem implies that the posterior distribution of θ given X is the same as the posterior given T. Thus in a Bayesian sense also, all information about θ contained in X is carried by T. In many cases, e.g., for the multivariate normal, the calculation of the posterior can be simplified by an application of this fact. More importantly, these considerations suggest an alternative definition of sufficiency appropriate in Bayesian analysis.

Definition. A statistic T is sufficient in a Bayesian sense if, for all priors π(θ), the posterior π(θ|X) = π(θ|T(X)). Classical sufficiency always implies sufficiency in the Bayesian sense. It can be shown that if the family of probability measures in the model is dominated, i.e., the probability measures possess densities with respect to a σ-finite measure, then the factorization theorem holds, vide Lehmann (1986). In this case, it can be shown that the converse is also true, i.e., T is sufficient in the classical sense if it is sufficient in the Bayesian sense. A famous counterexample due to Blackwell and Ramamoorthi (1982) shows this is not true in the undominated case even under nice set-theoretic conditions.
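For intuition, the definition can be checked numerically in a simple dominated model. The sketch below is an assumed illustrative example, not from the text: for Bernoulli(θ) observations with a Beta(a, b) prior, the factorization f(x|θ) = θ^t (1−θ)^(n−t) with t = T(x) = Σ x_i implies that the posterior computed from the full sample x and the posterior computed from T(x) alone are the same Beta(a + t, b + n − t) distribution.

```python
# Numerical check of Bayes sufficiency in a dominated model (illustrative
# example): Bernoulli(theta) observations with a Beta(a, b) prior.

def posterior_from_data(theta, x, a, b):
    """Unnormalized posterior pi(theta | x) built from the full sample."""
    like = 1.0
    for xi in x:
        like *= theta**xi * (1.0 - theta)**(1 - xi)
    return like * theta**(a - 1.0) * (1.0 - theta)**(b - 1.0)

def posterior_from_T(theta, t, n, a, b):
    """Unnormalized posterior pi(theta | T = t): Beta(a + t, b + n - t) kernel."""
    return theta**(a + t - 1.0) * (1.0 - theta)**(b + n - t - 1.0)

x = [1, 0, 1, 1, 0, 1, 0, 1]   # a small Bernoulli sample
n, t = len(x), sum(x)          # T(x) = sum of the observations
a, b = 2.0, 3.0                # Beta prior hyperparameters

# Normalize both versions on a grid and compare pointwise.
grid = [i / 100.0 for i in range(1, 100)]
p_data = [posterior_from_data(th, x, a, b) for th in grid]
p_T = [posterior_from_T(th, t, n, a, b) for th in grid]
s_data, s_T = sum(p_data), sum(p_T)
max_diff = max(abs(u / s_data - v / s_T) for u, v in zip(p_data, p_T))
# max_diff is zero up to floating-point rounding: the posterior given X
# coincides with the posterior given T(X).
```

The same computation carried out in an undominated family need not collapse this way, which is the content of the Blackwell-Ramamoorthi counterexample mentioned above.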

References

Abramowitz, M. and Stegun, I. (1970). Handbook of Mathematical Functions, 55. National Bureau of Standards Applied Mathematics.
Agresti, A. and Caffo, B. (2000). Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. Amer. Statist. 54, 280-288.
Akaike, H. (1983). Information measure and model selection. Bull. Int. Statist. Inst. 50, 277-290.
Albert, J.H. (1990). A Bayesian test for a two-way contingency table using independence priors. Canad. J. Statist. 14, 1583-1590.
Albert, J.H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc. 88, 669-679.
Albert, J.H., Delampady, M. and Polasek, W. (1991). A class of distributions for robustness studies. J. Statist. Plann. Inference 28, 291-304.
Andersson, S. (1982). Distributions of maximal invariants using quotient measures. Ann. Statist. 10, 955-961.
Andersson, S., Brons, H. and Jensen, S. (1983). Distribution of eigenvalues in multivariate statistical analysis. Ann. Statist. 11, 392-415.
Angers, J-F. (2000). P-credence and outliers. Metron 58, 81-108.
Angers, J-F. and Berger, J.O. (1991). Robust hierarchical Bayes estimation of exchangeable means. Canad. J. Statist. 19, 39-56.
Angers, J-F. and Delampady, M. (1992). Hierarchical Bayesian estimation and curve fitting. Canad. J. Statist. 20, 35-49.
Angers, J-F. and Delampady, M. (1997). Hierarchical Bayesian curve fitting and model choice for spatial data. Sankhya (Ser. B) 59, 28-43.
Angers, J-F. and Delampady, M. (2001). Bayesian nonparametric regression using wavelets. Sankhya (Ser. A) 63, 287-308.
Arnold, S.F. (1993). Gibbs sampling. In: Rao, C.R. (ed) Handbook of Statistics 9, 599-625. Elsevier Science.
Athreya, K.B., Doss, H. and Sethuraman, J. (1996). On the convergence of the Markov chain simulation method. Ann. Statist. 24, 69-100.
Athreya, K.B., Delampady, M. and Krishnan, T. (2003). Markov chain Monte Carlo methods. Resonance 8, Part I, No. 4, 17-26; Part II, No. 7, 63-75; Part III, No. 10, 8-19; Part IV, No. 12, 18-32.


Bahadur, R.R. (1971). Some Limit Theorems in Statistics. CBMS Regional Conference Series in Applied Mathematics, 4. SIAM, Philadelphia, PA.
Banerjee, S., Carlin, B.P. and Gelfand, A.E. (2004). Hierarchical Modeling and Analysis for Spatial Data. Chapman & Hall, London.
Barbieri, M.M. and Berger, J.O. (2004). Optimal predictive model selection. Ann. Statist. 32, 870-897.
Basu, D. (1969). Role of the sufficiency and likelihood principles in sample survey theory. Sankhya (Ser. A) 31, 441-454.
Basu, D. (1988). Statistical Information and Likelihood: A Collection of Critical Essays by Dr. D. Basu (Ghosh, J.K. ed). Lecture Notes in Statistics, Springer-Verlag, New York.
Basu, D. and Ghosh, J.K. (1967). Sufficient statistics in sampling from a finite universe. Bull. Int. Statist. Inst. 42, 850-858.
Basu, S. (1994). Variations of posterior expectations for symmetric unimodal priors in a distribution band. Sankhya (Ser. A) 56, 320-334.
Basu, S. (2000). Bayesian robustness and Bayesian nonparametrics. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 223-240. Springer-Verlag, New York.
Basu, S. and Chib, S. (2003). Marginal likelihood and Bayes factors from Dirichlet process mixture models. J. Amer. Statist. Assoc. 98, 224-235.
Basu, R., Ghosh, J.K. and Mukerjee, R. (2003). Empirical Bayes prediction intervals in a normal regression model: higher order asymptotics. Statist. Probab. Lett. 63, 197-203.
Bayarri, M.J. and Berger, J.O. (1998a). Quantifying surprise in the data and model verification. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 6, 53-82. Oxford Univ. Press, Oxford.
Bayarri, M.J. and Berger, J.O. (1998b). Robust Bayesian analysis of selection models. Ann. Statist. 26, 645-659.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. (Ser. B) 57, 289-300.
Benjamini, Y. and Liu, W. (1999). A step-down multiple hypotheses testing procedure that controls the false discovery rate under independence. Multiple comparisons (Tel Aviv, 1996). J. Statist. Plann. Inference 82, 163-170.
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 29, 1165-1188.
Berger, J. (1982). Bayesian robustness and the Stein effect. J. Amer. Statist. Assoc. 77, 358-368.
Berger, J.O. (1984). The robust Bayesian viewpoint (with discussion). In: Robustness of Bayesian Analyses. Studies in Bayesian Econometrics, 4, 63-144. North-Holland Publishing Co., Amsterdam.
Berger, J.O. (1985a). Statistical Decision Theory and Bayesian Analysis, 2nd Ed. Springer-Verlag, New York.
Berger, J.O. (1985b). The frequentist viewpoint and conditioning. In: Le Cam, L. and Olshen, R.A. (eds) Proc. Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, I, 15-44. Wadsworth, Inc., Monterey, California.
Berger, J. (1986). Are P-values reasonable measures of accuracy? In: Francis, I.S. et al. (eds) Pacific Statistical Congress. North-Holland, Amsterdam.


Berger, J.O. (1990). Robust Bayesian analysis: sensitivity to the prior. J. Statist. Plann. Inference 25, 303-328.
Berger, J.O. (1994). An overview of robust Bayesian analysis (with discussion). Test 3, 5-124.
Berger, J.O. (1997). Bayes factors. In: Kotz, S. et al. (eds) Encyclopedia of Statistical Sciences (Update) 3, 20-29. Wiley, New York.
Berger, J.O. (2005). Generalization of BIC. Unpublished manuscript.
Berger, J. and Berliner, M.L. (1986). Robust Bayes and empirical Bayes analysis with ε-contaminated priors. Ann. Statist. 14, 461-486.
Berger, J.O. and Bernardo, J.M. (1989). Estimating a product of means: Bayesian analysis with reference priors. J. Amer. Statist. Assoc. 84, 200-207.
Berger, J.O. and Bernardo, J.M. (1992a). On the development of the reference priors. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 4, 35-60. Oxford Univ. Press, Oxford.
Berger, J. and Bernardo, J.M. (1992b). Ordered group reference priors with application to the multinomial problem. Biometrika 79, 25-37.
Berger, J., Bernardo, J.M. and Mendoza, M. (1989). On priors that maximize expected information. In: Klein, J. and Lee, J.C. (eds) Recent Developments in Statistics and Their Applications, 1-20. Freedom Academy Publishing, Seoul.
Berger, J., Bernardo, J.M. and Sun, D. (2006). A monograph on Reference Analysis (under preparation).
Berger, J.O., Betro, B., Moreno, E., Pericchi, L.R., Ruggeri, F., Salinetti, G. and Wasserman, L. (eds) Bayesian Robustness. IMS, Hayward.
Berger, J.O. and Delampady, M. (1987). Testing precise hypotheses (with discussion). Statist. Sci. 2, 317-352.
Berger, J. and Dey, D.K. (1983). Combining coordinates in simultaneous estimation of normal means. J. Statist. Plann. Inference 8, 143-160.
Berger, J.O., Ghosh, J.K. and Mukhopadhyay, N. (2003). Approximations and consistency of Bayes factors as model dimension grows. J. Statist. Plann. Inference 112, 241-258.
Berger, J.O., Liseo, B. and Wolpert, R.L. (1999). Integrated likelihood methods for eliminating nuisance parameters. Statist. Sci. 14, 1-28.
Berger, J.O. and Moreno, E. (1994). Bayesian robustness in bidimensional models: prior independence (with discussion). J. Statist. Plann. Inference 40, 161-176.
Berger, J.O. and Pericchi, L.R. (1996a). The intrinsic Bayes factor for model selection and prediction. J. Amer. Statist. Assoc. 91, 109-122.
Berger, J.O. and Pericchi, L.R. (1996b). The intrinsic Bayes factor for linear models (with discussion). In: Bernardo, J.M. et al. (eds) Bayesian Statistics 5, 25-44. Oxford Univ. Press, London.
Berger, J.O., Pericchi, L.R. and Varshavsky, J.A. (1998). Bayes factors and marginal distributions in invariant situations. Sankhya (Ser. A) 60, 307-321.
Berger, J.O. and Robert, C.P. (1990). Subjective hierarchical Bayes estimation of a multivariate normal mean: on the frequentist interface. Ann. Statist. 18, 617-651.
Berger, J.O., Rios Insua, D. and Ruggeri, F. (2000). Bayesian robustness. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 1-32. Springer-Verlag, New York.
Berger, J.O. and Sellke, T. (1987). Testing a point null hypothesis: the irreconcilability of p-values and evidence. J. Amer. Statist. Assoc. 82, 112-122.


Berger, J.O. and Wolpert, R. (1988). The Likelihood Principle, 2nd Ed. IMS Lecture Notes - Monograph Ser. 9. Hayward, California.
Bernardo, J.M. (1979). Reference posterior distribution for Bayesian inference (with discussion). J. Roy. Statist. Soc. (Ser. B) 41, 113-147.
Bernardo, J.M. (1980). A Bayesian analysis of classical hypothesis testing. In: Bernardo, J.M. et al. (eds) Bayesian Statistics, 605-618. University Press, Valencia.
Bernardo, J.M. and Smith, A.F.M. (1994). Bayesian Theory. Wiley, Chichester, England.
Bernstein, S. (1917). Theory of Probability (in Russian).
Berti, P., Regazzini, E. and Rigo, P. (1991). Coherent statistical inference and Bayes theorem. Ann. Statist. 19, 366-381.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Statist. Soc. (Ser. B) 36, 192-326.
Besag, J. (1986). On the statistical analysis of dirty pictures. J. Roy. Statist. Soc. (Ser. B) 48, 259-279.
Betro, B. and Guglielmi, A. (2000). Methods for global prior robustness under generalized moment conditions. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 273-293. Springer-Verlag, New York.
Bhattacharya, S. (2005). Model assessment using inverse reference distribution approach. Tech. Report.
Bickel, P.J. (1981). Minimax estimation of the mean of a normal distribution when the parameter space is restricted. Ann. Statist. 9, 1301-1309.
Bickel, P.J. and Doksum, K.A. (2001). Mathematical Statistics: Basic Ideas and Selected Topics. Prentice Hall, Upper Saddle River, N.J.
Bickel, P.J. and Ghosh, J.K. (1990). A decomposition for the likelihood ratio statistic and the Bartlett correction - a Bayesian argument. Ann. Statist. 18, 1070-1090.
Bickel, P.J. and Yahav, J. (1969). Some contributions to the asymptotic theory of Bayes solutions. Z. Wahrsch. Verw. Gebiete 11, 257-275.
Birnbaum, A. (1962). On the foundations of statistical inference (with discussion). J. Amer. Statist. Assoc. 57, 269-326.
Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. The MIT Press, Cambridge.
Blackwell, D. and Ramamoorthi, R.V. (1982). A Bayes but not classically sufficient statistic. Ann. Statist. 10, 1025-1026.
Bondar, J.V. and Milnes, P. (1981). Amenability: A survey for statistical applications of Hunt-Stein and related conditions on groups. Z. Wahrsch. Verw. Gebiete 57, 103-128.
Borwanker, J.D., Kallianpur, G. and Prakasa Rao, B.L.S. (1971). The Bernstein-von Mises theorem for stochastic processes. Ann. Math. Statist. 42, 1241-1253.
Bose, S. (1994a). Bayesian robustness with more than one class of contaminations (with discussion). J. Statist. Plann. Inference 40, 177-187.
Bose, S. (1994b). Bayesian robustness with mixture classes of priors. Ann. Statist. 22, 652-667.
Box, G.E.P. (1980). Sampling and Bayes inference in scientific modeling and robustness. J. Roy. Statist. Soc. (Ser. A) 143, 383-430.
Box, G.E.P. and Tiao, G. (1962). A further look at robustness via Bayes theorem. Biometrika 62, 169-173.


Box, G.E.P. and Tiao, G. (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading.
Brooks, S.P., Giudici, P. and Roberts, G.O. (2003). Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions. J. Roy. Statist. Soc. (Ser. B) 65, 3-55.
Brown, L.D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary-value problems. Ann. Math. Statist. 42, 855-903.
Brown, L.D. (1986). Foundations of Exponential Families. IMS, Hayward.
Brown, L.D., Cai, T.T. and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statist. Sci. 16, 101-133.
Brown, L.D., Cai, T.T. and DasGupta, A. (2002). Confidence intervals for a binomial proportion and asymptotic expansions. Ann. Statist. 30, 160-201.
Brown, L.D., Cai, T.T. and DasGupta, A. (2003). Interval estimation in exponential families. Statist. Sinica 13, 19-49.
Burnham, K.P. and Anderson, D.R. (2002). Model Selection and Multimodel Inference: A Practical Information Theoretic Approach. Springer, New York.
Cai, T.T., Low, M. and Zhao, L. (2000). Sharp adaptive estimation by a blockwise method. Tech. Report, Dept. of Statistics, Univ. of Pennsylvania.
Carlin, B.P. and Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods. J. Roy. Statist. Soc. (Ser. B) 57, 473-484.
Carlin, B.P. and Perez, M.E. (2000). Robust Bayesian analysis in medical and epidemiological settings. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 351-372. Springer-Verlag, New York.
Carlin, B.P. and Louis, T.A. (1996). Bayes and Empirical Bayes Methods for Data Analysis. Chapman & Hall, London.
Casella, G. and Berger, R. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem (with discussion). J. Amer. Statist. Assoc. 82, 106-111.
Casella, G. and Berger, R. (1990). Statistical Inference. Wadsworth, Belmont, California.
Cencov, N.N. (1982). Statistical Decision Rules and Optimal Inference. AMS, Providence, R.I. Translation from Russian edited by Lev L. Leifman.
Chakrabarti, A. (2004). Model selection for high dimensional problems with application to function estimation. Ph.D. thesis, Purdue Univ.
Chakrabarti, A. and Ghosh, J.K. (2005a). A generalization of BIC for the general exponential family. (In press)
Chakrabarti, A. and Ghosh, J.K. (2005b). Optimality of AIC in inference about Brownian Motion. (In press)
Chao, M.T. (1970). The asymptotic behaviour of Bayes estimators. Ann. Math. Statist. 41, 601-609.
Chatterjee, S.K. and Chattopadhyay, G. (1994). On the nonexistence of certain optimal confidence sets for the rectangular problem. Statist. Probab. Lett. 21, 263-269.
Chen, C.F. (1985). On asymptotic normality of limiting density functions with Bayesian implications. J. Roy. Statist. Soc. (Ser. B) 47, 540-546.
Chib, S. (1995). Marginal likelihood from the Gibbs output. J. Amer. Statist. Assoc. 90, 1313-1321.
Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit models. Biometrika 85, 347-361.


Clarke, B. and Barron, A. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 36, 453-471.
Clayton, D.G. and Kaldor, J.M. (1987). Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics 43, 671-681.
Clyde, M. (1999). Bayesian model averaging and model search strategies. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 6, 157-185. Oxford Univ. Press, Oxford.
Clyde, M. and Parmigiani, G. (1996). Orthogonalizations and prior distributions for orthogonalized model mixing. In: Lee, J.C. et al. (eds) Modelling and Prediction, 206-227. Springer, New York.
Congdon, P. (2001). Bayesian Statistical Modelling. Wiley, Chichester, England.
Congdon, P. (2003). Applied Bayesian Modelling. Wiley, Chichester, England.
Cox, D.R. (1958). Some problems connected with statistical inference. Ann. Math. Statist. 29, 357-372.
Cox, D.R. and Reid, N. (1987). Orthogonal parameters and approximate conditional inference (with discussion). J. Roy. Statist. Soc. (Ser. B) 49, 1-18.
Csiszar, I. (1978). Information measures: a critical survey. In: Trans. 7th Prague Conf. on Information Theory, Statistical Decision Functions and the Eighth European Meeting of Statisticians (Tech. Univ. Prague, Prague, 1974) B, 73-86. Academia, Prague.
Cuevas, A. and Sanz, P. (1988). On differentiability properties of Bayes operators. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 3, 569-577. Oxford Univ. Press, Oxford.
Dalal, S.R. and Hall, G.J. (1980). On approximating parametric Bayes models by nonparametric Bayes models. Ann. Statist. 8, 664-672.
DasGupta, A. and Delampady, M. (1990). Bayesian hypothesis testing with symmetric and unimodal priors. Tech. Report 90-43, Dept. Statistics, Purdue Univ.
DasGupta, A., Casella, G., Delampady, M., Genest, C., Rubin, H. and Strawderman, W.E. (2000). Correlation in a formal Bayes framework. Can. J. Statist. 28, 675-687.
Datta, G.S. (1996). On priors providing frequentist validity for Bayesian inference for multiple parametric functions. Biometrika 83, 287-298.
Datta, G.S. and Ghosh, M. (1996). On the invariance of noninformative priors. Ann. Statist. 24, 141-159.
Datta, G.S., Ghosh, M. and Mukerjee, R. (2000). Some new results on probability matching priors. Calcutta Statist. Assoc. Bull. 50, 179-192.
Datta, G.S. and Mukerjee, R. (2004). Probability Matching Priors: Higher Order Asymptotics. Lecture Notes in Statistics. Springer, New York.
Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM, Philadelphia.
Dawid, A.P. (1973). Posterior expectations for large observations. Biometrika 60, 664-667.
Dawid, A.P., Stone, M. and Zidek, J.V. (1973). Marginalization paradoxes in Bayesian and structural inference (with discussion). J. Roy. Statist. Soc. (Ser. B) 35, 189-233.
de Finetti, B. (1972). Probability, Induction, and Statistics. Wiley, New York.
de Finetti, B. (1974, 1975). Theory of Probability, Vols. 1, 2. Wiley, New York.
DeGroot, M.H. (1970). Optimal Statistical Decisions. McGraw-Hill, New York.


DeGroot, M.H. (1973). Doing what comes naturally: Interpreting a tail area as a posterior probability or as a likelihood ratio. J. Amer. Statist. Assoc. 68, 966-969.
Delampady, M. (1986). Testing a precise hypothesis: Interpreting P-values from a robust Bayesian viewpoint. Ph.D. thesis, Purdue Univ.
Delampady, M. (1989a). Lower bounds on Bayes factors for invariant testing situations. J. Multivariate Anal. 28, 227-246.
Delampady, M. (1989b). Lower bounds on Bayes factors for interval null hypotheses. J. Amer. Statist. Assoc. 84, 120-124.
Delampady, M. (1992). Bayesian robustness for elliptical distributions. Rebrape: Brazilian J. Probab. Statist. 6, 97-119.
Delampady, M. (1999). Robust Bayesian outlier detection. Brazilian J. Probab. Statist. 13, 149-179.
Delampady, M. and Berger, J.O. (1990). Lower bounds on Bayes factors for multinomial distributions, with application to chi-squared tests of fit. Ann. Statist. 18, 1295-1316.
Delampady, M. and Dey, D.K. (1994). Bayesian robustness for multiparameter problems. J. Statist. Plann. Inference 40, 375-382.
Delampady, M., Yee, I. and Zidek, J.V. (1993). Hierarchical Bayesian analysis of a discrete time series of Poisson counts. Statist. Comput. 3, 7-15.
Delampady, M., DasGupta, A., Casella, G., Rubin, H. and Strawderman, W.E. (2001). A new approach to default priors and robust Bayes methodology. Can. J. Statist. 29, 437-450.
Dempster, A.P. (1967). Upper and lower probabilities induced from a multivalued mapping. Ann. Math. Statist. 38, 325-339.
Dempster, A.P. (1968). A generalization of Bayesian inference (with discussion). J. Roy. Statist. Soc. (Ser. B) 30, 205-247.
Dempster, A.P. (1973). The direct use of likelihood for significance testing. In: Godambe, V.P. and Sprott, D.A. (eds) Proceedings of the Conference on Foundational Questions in Statistical Inference. Holt, Rinehart, and Winston, Toronto.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. (Ser. B) 39, 1-38.
DeRobertis, L. and Hartigan, J.A. (1981). Bayesian inference using intervals of measures. Ann. Statist. 9, 235-244.
Dey, D.K. and Birmiwal, L. (1994). Robust Bayesian analysis using divergence measures. Statist. Probab. Lett. 20, 287-294.
Dey, D.K. and Berger, J.O. (1983). On truncation of shrinkage estimators in simultaneous estimation of normal means. J. Amer. Statist. Assoc. 78, 865-869.
Dey, D.K., Ghosh, S.K. and Lou, K. (1996). On local sensitivity measures in Bayesian analysis. In: Berger, J.O. et al. (eds) Bayesian Robustness, IMS Lecture Notes 29, 21-39.
Dey, D.K., Lou, K. and Bose, S. (1998). A Bayesian approach to loss robustness. Statist. Decisions 16, 65-87.
Dey, D.K. and Micheas, A.C. (2000). Ranges of posterior expected losses and ε-robust actions. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 145-159. Springer-Verlag, New York.
Dey, D.K. and Peng, F. (1996). Bayesian analysis of outlier problems using divergence measures. Canad. J. Statist. 23, 194-213.


Dharmadhikari, S. and Joag-Dev, K. (1988). Unimodality, Convexity, and Applications. Academic Press, San Diego.
Diaconis, P. and Freedman, J. (1986). On the consistency of Bayes estimates. Ann. Statist. 14, 1-26.
Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. Ann. Statist. 7, 269-281.
Diamond, G.A. and Forrester, J.S. (1983). Clinical trials and statistical verdicts: Probable grounds for appeal. Ann. Intern. Med. 98, 385-394.
Dickey, J.M. (1971). The weighted likelihood ratio, linear hypotheses on normal location parameters. Ann. Math. Statist. 42, 204-223.
Dickey, J.M. (1973). Scientific reporting. J. Roy. Statist. Soc. (Ser. B) 35, 285-305.
Dickey, J.M. (1974). Bayesian alternatives to the F-test and least squares estimate in the linear model. In: Fienberg, S.E. and Zellner, A. (eds) Studies in Bayesian Econometrics and Statistics, 515-554. North-Holland, Amsterdam.
Dickey, J.M. (1976). Approximate posterior distributions. J. Amer. Statist. Assoc. 71, 680-689.
Dickey, J.M. (1977). Is the tail area useful as an approximate Bayes factor? J. Amer. Statist. Assoc. 72, 138-142.
Dickey, J.M. (1980). Approximate coherence for regression model inference - with a new analysis of Fisher's Broadback Wheatfield example. In: Zellner, A. (ed) Bayesian Analysis in Econometrics and Statistics: Essays in Honour of Harold Jeffreys, 333-354. North-Holland, Amsterdam.
Dmochowski, J. (1994). Intrinsic priors via Kullback-Leibler geometry. Tech. Report 94-15, Dept. Statistics, Purdue Univ.
Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32, 962-994.
Eaton, M.L. (1983). Multivariate Statistics - A Vector Space Approach. Wiley, New York.
Eaton, M.L. (1989). Group Invariance Applications in Statistics. Regional Conference Series in Probability and Statistics, 1. IMS, Hayward, California.
Eaton, M.L. (1992). A statistical diptych: admissible inferences - recurrence of symmetric Markov chains. Ann. Statist. 20, 1147-1179.
Eaton, M.L. (1997). Admissibility in quadratically regular problems and recurrence of symmetric Markov chains: Why the connection? J. Statist. Plann. Inference 64, 231-247.
Eaton, M.L. (2004). Evaluating improper priors and the recurrence of symmetric Markov chains: an overview. A Festschrift for Herman Rubin, 5-20. IMS Lecture Notes Monogr. Ser. 45. IMS, Beachwood, OH.
Eaton, M.L. and Sudderth, W.D. (1998). A new predictive distribution for normal multivariate linear models. Sankhya (Ser. A) 60, 363-382.
Eaton, M.L. and Sudderth, W.D. (2004). Properties of right Haar predictive inference. Sankhya 66, 487-512.
Edwards, W., Lindman, H. and Savage, L.J. (1963). Bayesian statistical inference for psychological research. Psychol. Rev. 70, 193-242.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. SIAM, Philadelphia.
Efron, B. (2003). Robbins, empirical Bayes and microarrays. Dedicated to the memory of Herbert E. Robbins. Ann. Statist. 31, 366-378.


Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99, 96-104.
Efron, B. and Morris, C. (1971). Limiting the risk of Bayes and empirical Bayes estimators - Part I: The Bayes case. J. Amer. Statist. Assoc. 66, 807-815.
Efron, B. and Morris, C. (1972). Limiting the risk of Bayes and empirical Bayes estimators - Part II: The empirical Bayes case. J. Amer. Statist. Assoc. 67, 130-139.
Efron, B. and Morris, C. (1973). Stein's estimation rule and its competitors - an empirical Bayes approach. J. Amer. Statist. Assoc. 68, 117-130.
Efron, B. and Morris, C. (1973). Combining possibly related estimation problems (with discussion). J. Roy. Statist. Soc. (Ser. B) 35, 379-421.
Efron, B. and Morris, C. (1975). Data analysis using Stein's estimator and its generalizations. J. Amer. Statist. Assoc. 70, 311-319.
Efron, B. and Morris, C. (1976). Multivariate empirical Bayes and estimation of covariance matrices. Ann. Statist. 4, 22-32.
Efron, B., Tibshirani, R., Storey, J.D. and Tusher, V. (2001a). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96, 1151-1160.
Efron, B., Storey, J. and Tibshirani, R. (2001b). Microarrays, empirical Bayes methods, and false discovery rates. Tech. Report, Stanford Univ.
Ezekiel, M. and Fox, F.A. (1959). Methods of Correlation and Regression Analysis. Wiley, New York.
Fan, T-H. and Berger, J.O. (1992). Behavior of the posterior distribution and inferences for a normal mean with t prior convolutions. Statist. Decisions 10, 99-120.
Fang, K.T., Kotz, S. and Ng, K.W. (1990). Symmetric Multivariate and Related Distributions. Chapman & Hall, London.
Farrell, R.H. (1985). Multivariate Calculation - Use of the Continuous Groups. Springer-Verlag, New York.
Feller, W. (1973). Introduction to Probability Theory and Its Applications, Vol. 1, 3rd Ed. Wiley, New York.
Ferguson, T.S. (1967). Mathematical Statistics: A Decision-Theoretic Approach. Academic Press, New York.
Fernandez, C., Osiewalski, J. and Steel, M.F.J. (2001). Robust Bayesian inference on scale parameters. J. Multivariate Anal. 77, 54-72.
Finney, D.J. (1971). Probit Analysis, 3rd Ed. Cambridge Univ. Press, Cambridge, U.K.
Fishburn, P.C. (1981). Subjective expected utility: a review of normative theory. Theory and Decision 13, 139-199.
Fisher, R.A. (1973). Statistical Methods for Research Workers, 14th Ed. Hafner, New York. Reprinted by Oxford Univ. Press, Oxford, 1990.
Flury, B. and Zoppe, A. (2000). Exercises in EM. Amer. Statist. 54, 207-209.
Fortini, S. and Ruggeri, F. (1994). Concentration functions and Bayesian robustness (with discussion). J. Statist. Plann. Inference 40, 205-220.
Fortini, S. and Ruggeri, F. (2000). On the use of the concentration function in Bayesian robustness. In: Rios Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 109-126. Springer-Verlag, New York.
Fortini, S., Ladelli, L. and Regazzini, E. (2000). Exchangeability, predictive distributions and parametric models. Sankhya (Ser. A) 62, 86-109.


Fraser, D.A.S., Monette, G. and Ng, K.W. (1995). Marginalization, likelihood and structured models. In: Krishnaiah, P. (ed) Multivariate Analysis, 6, 209-217. North-Holland, Amsterdam. Freedman, D.A. (1963). On the asymptotic behavior of Bayes estimates in the discrete case. Ann. Math. Statist. 34, 1386-1403. Freedman, D.A. (1965). On the asymptotic behavior of Bayes estimates in the discrete case. II. Ann. Math. Statist. 36, 454-456. Freedman, D.A. and Purves, R.A. (1969). Bayes methods for bookies. Ann. Math. Statist. 40, 1177-1186. French, S. (1986). Decision Theory. Ellis Horwood Limited, Chichester, England. French, S. and Rfos Insua, D. (2000). Statistical Decision Theory. Oxford Univ. Press, New York. Gardner, M. (1997). Relativity Simply Explained. Dover, Mineola, New York. Garthwaite, P.H. and Dickey, J.M. (1988). Quantifying expert opinion in linear regression problems. J. Roy. Statist. Soc. (Ser. B) 50, 462-474. Garthwaite, P.H. and Dickey, J.M. (1992). Elicitation of prior distributions for variable-selection problems in regression. Ann. Statist. 20, 1697-1719. Garthwaite, P.H., Kadane, J.B., and O'Hagan, A. (2005). Statistical methods for eliciting probability distributions. J. Amer. Statist. Assoc. 100, 680-701. Gelfand, A.E. and Dey, D.K. (1991). On Bayesian robustness of contaminated classes of priors. Statist. Decisions 9, 63-80. Gelfand, A.E. and Dey, D.K. (1994). Bayesian model choice: Asymptotics and exact calculations. J. Roy. Statist. Soc. (Ser. B) 56, 501-514. Gelfand, A.E. and Ghosh, S.K. (1998). Model choice: a minimum posterior predictive loss approach. Biometrika 85, 1-11. Gelfand, A.E. and Smith, A.F.M. (1990). Sampling based approaches to calculating marginal densities. J. Amer. Statist. Assoc. 85, 398-409. Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995). Bayesian Data Analysis. Chapman & Hall, London. Gelman, A., Meng, X., and Stern, H. (1996). 
Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statist. Sinica 6, 733-807. Genovese, C. and Wasserman, L. (2001). Operating characteristics and extensions of the FDR procedure. Tech. Report, Carnegie Mellon Univ. Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. J. Roy. Statist. Soc. (Ser. B) 64, 499-517. George, E.I. and Foster, D.P. (2000). Calibration and empirical Bayes variable selection. Biometrika 87, 731-747. Geweke, J. (1999). Simulation methods for model criticism and robustness analysis. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 6, 275-299. Oxford Univ. Press, New York. Ghosal, S. (1997). Normal approximation to the posterior distribution for generalized linear models with many covariates. Math. Methods Statist. 6, 332-348. Ghosal, S. (1999). Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli 5, 315-331. Ghosal, S. (2000). Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. J. Multivariate Anal. 74, 49-68.


Ghosal, S., Ghosh, J.K. and Ramamoorthi, R.V. (1997). Noninformative priors via sieves and packing numbers. In: Panchapakesan, S. and Balakrishnan, N. (eds) Advances in Statistical Decision Theory and Applications, 119-132. Birkhäuser, Boston. Ghosal, S., Ghosh, J.K. and Samanta, T. (1995). On convergence of posterior distributions. Ann. Statist. 23, 2145-2152. Ghosh, J.K. (1983). Review of "Approximation Theorems of Mathematical Statistics" by R.J. Serfling. J. Amer. Statist. Assoc. 78. Ghosh, J.K. (1994). Higher Order Asymptotics. NSF-CBMS Regional Conference Series in Probability and Statistics. IMS, Hayward. Ghosh, J.K. (1997). Discussion of "Noninformative priors do not exist: a dialogue with J.M. Bernardo". J. Statist. Plann. Inference 65, 159-189. Ghosh, J.K. (2002). Review of "Statistical Inference in Science" by D.A. Sprott. Sankhyā (Ser. B) 64, 234-235. Ghosh, J.K., Bhanja, J., Purkayastha, S., Samanta, T. and Sengupta, S. (2002). A statistical approach to geological mapping. Mathematical Geology 34, 505-528. Ghosh, J.K. and Mukerjee, R. (1992). Non-informative priors (with discussion). In: Bernardo, J.M. et al. (eds) Bayesian Statistics 4, 195-210. Oxford Univ. Press, London. Ghosh, J.K. and Mukerjee, R. (1993). On priors that match posterior and frequentist distribution functions. Can. J. Statist. 21, 89-96. Ghosh, J.K. and Ramamoorthi, R.V. (2003). Bayesian Nonparametrics. Springer, New York. Ghosh, J.K. and Samanta, T. (2001). Model selection - an overview. Current Science 80, 1135-1144. Ghosh, J.K. and Samanta, T. (2002a). Nonsubjective Bayes testing - an overview. J. Statist. Plann. Inference 103, 205-223. Ghosh, J.K. and Samanta, T. (2002b). Towards a nonsubjective Bayesian paradigm. In: Misra, J.C. (ed) Uncertainty and Optimality, 1-69. World Scientific, Singapore. Ghosh, J.K., Ghosal, S. and Samanta, T. (1994). Stability and convergence of posterior in non-regular problems. In: Gupta, S.S. and Berger, J.O. (eds).
Statistical Decision Theory and Related Topics, V, 183-199. Ghosh, J.K., Purkayastha, S. and Samanta, T. (2005). Role of P-values and other measures of evidence in Bayesian analysis. In: Dey, D.K. and Rao, C.R. (eds) Handbook of Statistics 25, Bayesian Thinking: Modeling and Computation, 151-170. Ghosh, J.K., Sinha, B.K. and Joshi, S.N. (1982). Expansion for posterior probability and integrated Bayes risk. In: Gupta, S.S. and Berger, J.O. (eds) Statistical Decision Theory and Related Topics, III, 1, 403-456. Ghosh, M. and Meeden, G. (1997). Bayesian Methods for Finite Population Sampling. Chapman & Hall, London. Goel, P. (1983). Information measures and Bayesian hierarchical models. J. Amer. Statist. Assoc. 78, 408-410. Goel, P. (1986). Comparison of experiments and information in censored data. In: Gupta, S.S. and Berger, J.O. (eds) Statistical Decision Theory and Related Topics, IV, 2, 335-349. Good, I.J. (1950). Probability and the Weighing of Evidence. Charles Griffin, London.


Good, I.J. (1958). Significance tests in parallel and in series. J. Amer. Statist. Assoc. 53, 799-813. Good, I.J. (1965). The Estimation of Probabilities: An Essay on Modern Bayesian Methods. M.I.T. Press, Cambridge, Massachusetts. Good, I.J. (1967). A Bayesian significance test for the multinomial distribution. J. Roy. Statist. Soc. (Ser. B) 29, 399-431. Good, I.J. (1975). The Bayes factor against equiprobability of a multinomial population assuming a symmetric Dirichlet prior. Ann. Statist. 3, 246-250. Good, I.J. (1983). Good Thinking: The Foundations of Probability and its Applications. Univ. Minnesota Press, Minneapolis. Good, I.J. (1985). Weight of evidence: A brief survey. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 2, 249-270. North-Holland, Amsterdam. Good, I.J. (1986). A flexible Bayesian model for comparing two treatments. J. Statist. Comput. Simulation 26, 301-305. Good, I.J. and Crook, J.F. (1974). The Bayes/non-Bayes compromise and the multinomial distribution. J. Amer. Statist. Assoc. 69, 711-720. Green, P.J. (1995). Reversible jump MCMC computation and Bayesian model determination. Biometrika 82, 711-732. Green, P.J. and Richardson, S. (2001). Modelling heterogeneity with and without the Dirichlet process. Scand. J. Statist. 28, 355-375. Gustafson, P. (2000). Local robustness in Bayesian analysis. In: Ríos Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 71-88, Springer-Verlag, New York. Gustafson, P. and Wasserman, L. (1995). Local sensitivity diagnostics for Bayesian inference. Ann. Statist. 23, 2153-2167. Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. J. Roy. Statist. Soc. (Ser. B) 29, 104-109. Hajek, J. and Sidak, Z.V. (1967). Theory of Rank Tests. Academic Press, New York. Halmos, P.R. (1950). Measure Theory. van Nostrand, New York. Halmos, P.R. (1974). Measure Theory, 2nd Ed. Springer-Verlag, New York. Hartigan, J.A. (1983). Bayes Theory. Springer-Verlag, New York. Heath, D.
and Sudderth, W. (1978). On finitely additive priors, coherence, and extended admissibility. Ann. Statist. 6, 335-345. Heath, D. and Sudderth, W. (1989). Coherent inference from improper priors and from finitely additive priors. Ann. Statist. 17, 907-919. Hernandez, E. and Weiss, G. (1996). A First Course on Wavelets. CRC Press Inc., Boca Raton. Hewitt, E. and Savage, L.J. (1955). Symmetric measures on Cartesian products. Trans. Amer. Math. Soc. 80, 907-919. Hildreth, C. (1963). Bayesian statisticians and remote clients. Econometrica 31, 422-438. Hill, B. (1982). Comment on "Lindley's paradox," by G. Shafer. J. Amer. Statist. Assoc. 77, 344-347. Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. (1999). Bayesian model averaging: a tutorial (with discussion). Statist. Sci. 14, 382-417. Huber, P.J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35, 73-101. Huber, P.J. (1974). Fisher information and spline interpolation. Ann. Statist. 2, 1029-1034. Huber, P.J. (1981). Robust Statistics. John Wiley, New York.


Hwang, J.T., Casella, G., Robert, C., Wells, M.T. and Farrell, R.H. (1992). Estimation of accuracy in testing. Ann. Statist. 20, 490-509. Ibragimov, I.A. and Has'minskii, R.Z. (1981). Statistical Estimation - Asymptotic Theory. Springer-Verlag, New York. Ickstadt, K. (1992). Gamma-minimax estimators with respect to unimodal priors. In: Gritzmann, P. et al. (eds) Operations Research '91. Physica-Verlag, Heidelberg. James, W. and Stein, C. (1960). Estimation with quadratic loss. Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1, 361-380. Univ. California Press, Berkeley. Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. London (Ser. A) 186, 453-461. Jeffreys, H. (1957). Scientific Inference. Cambridge Univ. Press, Cambridge. Jeffreys, H. (1961). Theory of Probability, 3rd Ed. Oxford Univ. Press, New York. Johnson, R.A. (1970). Asymptotic expansions associated with posterior distribution. Ann. Math. Statist. 42, 1899-1906. Kadane, J.B. (ed) (1984). Robustness of Bayesian Analyses. Studies in Bayesian Econometrics, 4. North-Holland Publishing Co., Amsterdam. Kadane, J.B., Dickey, J.M., Winkler, R.L., Smith, W.S. and Peters, S.C. (1980). Interactive elicitation of opinion for a normal linear model. J. Amer. Statist. Assoc. 75, 845-854. Kadane, J.B., Salinetti, G. and Srinivasan, C. (2000). Stability of Bayes decisions and applications. In: Ríos Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 187-196, Springer-Verlag, New York. Kadane, J.B., Schervish, M.J. and Seidenfeld, T. (1999). Cambridge Studies in Probability, Induction, and Decision Theory. Cambridge Univ. Press, Cambridge. Kagan, A.M., Linnik, Y.V., Rao, C.R. (1973). Characterization Problems in Mathematical Statistics. Wiley, New York. Kahneman, D., Slovic, P. and Tversky, A. (1982). Judgement Under Uncertainty: Heuristics and Biases. Cambridge Univ. Press, New York. Kariya, T. and Sinha, B.K. (1989). Robustness of Statistical Tests.
Statistical Modeling and Decision Science. Academic Press, Boston, MA. Kass, R. and Raftery, A. (1995). Bayes factors. J. Amer. Statist. Assoc. 90, 773-795. Kass, R. and Wasserman, L. (1996). The selection of prior distributions by formal rules (review paper). J. Amer. Statist. Assoc. 91, 1343-1370. Kass, R.E., Tierney, L. and Kadane, J.B. (1988). Asymptotics in Bayesian computations. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 3, 261-278. Oxford Univ. Press, Oxford. Kiefer, J. (1957). Invariance, minimax sequential estimation, and continuous time processes. Ann. Math. Statist. 28, 573-601. Kiefer, J. (1966). Multivariate optimality results. In: Krishnaiah, P.R. (ed) Multivariate Analysis. Academic Press, New York. Kiefer, J. (1977). Conditional confidence statements and confidence estimators (with discussion). J. Amer. Statist. Assoc. 72, 789-827. Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist. 27, 887-906. Laplace, P.S. (1986). Memoir on the probability of the causes of events (English translation of the 1774 French original by S.M. Stigler). Statist. Sci. 1, 364-378. Lavine, M. (1991). Sensitivity in Bayesian statistics: the prior and the likelihood. J. Amer. Statist. Assoc. 86, 396-399.


Lavine, M., Pacifico, M.P., Salinetti, G. and Tardella, L. (2000). Linearization techniques in Bayesian robustness. In: Ríos Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 261-272, Springer-Verlag, New York. Leamer, E.E. (1978). Specification Searches. Wiley, New York. Leamer, E.E. (1982). Sets of posterior means with bounded variance prior. Econometrica 50, 725-736. Le Cam, L. (1953). On Some Asymptotic Properties of Maximum Likelihood Estimates and Related Bayes Estimates. Univ. California Publications in Statistics, 1, 277-330. Le Cam, L. (1958). Les propriétés asymptotiques des solutions de Bayes. Publ. Inst. Statist. Univ. Paris, 7, 17-35. Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York. Le Cam, L. and Yang, G.L. (2000). Asymptotics in Statistics: Some Basic Concepts, 2nd Ed. Springer-Verlag, New York. Lee, P.M. (1989). Bayesian Statistics: An Introduction. Oxford Univ. Press, New York. Lehmann, E.L. (1986). Testing Statistical Hypotheses, 2nd Ed. Wiley, New York. Lehmann, E.L. and Casella, G. (1998). Theory of Point Estimation, 2nd Ed. Springer-Verlag, New York. Lempers, F.B. (1971). Posterior Probabilities of Alternative Models. Univ. of Rotterdam Press, Rotterdam. Leonard, T. and Hsu, J.S.J. (1999). Bayesian Methods. Cambridge Univ. Press, Cambridge. Li, K.C. (1987). Asymptotic optimality for Cp, CL, cross validation and generalized cross validation: discrete index set. Ann. Statist. 15, 958-975. Liang, F., Paulo, R., Molina, G., Clyde, M.A. and Berger, J.O. (2005). Mixtures of g-priors for Bayesian variable selection. Unpublished manuscript. Lindley, D.V. (1956). On a measure of the information provided by an experiment. Ann. Math. Statist. 27, 986-1005. Lindley, D.V. (1957). A statistical paradox. Biometrika 44, 187-192. Lindley, D.V. (1961). The use of prior probability distributions in statistical inference and decisions. Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1, 453-468. Univ.
California Press, Berkeley. Lindley, D.V. (1965). An Introduction to Probability and Statistics from a Bayesian Viewpoint, 1, 2. Cambridge Univ. Press, Cambridge. Lindley, D.V. (1977). A problem in forensic science. Biometrika 64, 207-213. Lindley, D.V. and Phillips, L.D. (1976). Inference for a Bernoulli process (a Bayesian view). Amer. Statist. 30, 112-119. Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linear model. J. Roy. Statist. Soc. (Ser. B) 34, 1-41. Liseo, B. (2000). Robustness issues in Bayesian model selection. In: Ríos Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 197-222, Springer-Verlag, New York. Liseo, B., Petrella, L. and Salinetti, G. (1996). Robust Bayesian analysis: an interactive approach. In: Bernardo, J.M. et al. (eds) Bayesian Statistics, 5, 661-666. Oxford Univ. Press, London. Liu, R.C. and Brown, L.D. (1992). Nonexistence of informative unbiased estimators in singular problems. Ann. Statist. 21, 1-13.


Lu, K. and Berger, J.O. (1989a). Estimated confidence procedures for multivariate normal means. J. Statist. Plann. Inference 23, 1-19. Lu, K. and Berger, J.O. (1989b). Estimation of normal means: frequentist estimators of loss. Ann. Statist. 17, 890-907. Madigan, D. and Raftery, A.E. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. J. Amer. Statist. Assoc. 89, 1535-1546. Madigan, D. and York, J. (1995). Bayesian graphical models for discrete data. Int. Statist. Rev. 63, 215-232. Martin, J.T. (1942). The problem of the evaluation of rotenone-containing plants. VI. The toxicity of l-elliptone and of poisons applied jointly, with further observations on the rotenone equivalent method of assessing the toxicity of derris root. Ann. Appl. Biol. 29, 69-81. Martín, J. and Arias, J.P. (2000). Computing efficient sets in Bayesian decision problems. In: Ríos Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 161-186. Springer-Verlag, New York. Martín, J., Ríos Insua, D. and Ruggeri, F. (1998). Issues in Bayesian loss robustness. Sankhyā (Ser. A) 60, 405-417. Matsumoto, M. and Nishimura, T. (1998). Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. on Modeling and Computer Simulation 8, 3-30. McLachlan, G.J. and Krishnan, T. (1997). The EM Algorithm and Extensions. Wiley, New York. Meng, X.L. (1994). Posterior predictive p-values. Ann. Statist. 22, 1142-1160. Meyn, S.P. and Tweedie, R.L. (1993). Markov Chains and Stochastic Stability. Springer-Verlag, New York. Moreno, E. (2000). Global Bayesian robustness for some classes of prior distributions. In: Ríos Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 45-70, Springer-Verlag, New York. Moreno, E. and Cano, J.A. (1995). Classes of bidimensional priors specified on a collection of sets: Bayesian robustness. J. Statist. Plann. Inference 46, 325-334. Moreno, E. and Pericchi, L.R. (1993).
Bayesian robustness for hierarchical ε-contamination models. J. Statist. Plann. Inference 37, 159-167. Morris, C.N. (1983). Parametric empirical Bayes inference: Theory and applications (with discussion). J. Amer. Statist. Assoc. 78, 47-65. Morris, C.N. and Christiansen, C.L. (1996). Hierarchical models for ranking and for identifying extremes, with application. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 5, 277-296. Oxford Univ. Press, Oxford. Mosteller, F. and Tukey, J.W. (1977). Data Analysis and Regression. Addison-Wesley, Reading. Muirhead, R.J. (1982). Aspects of Multivariate Statistical Theory. Wiley, New York. Mukhopadhyay, N. (2000). Bayesian model selection for high dimensional models with prediction error loss and 0-1 loss. Ph.D. thesis, Purdue Univ. Mukhopadhyay, N. and Ghosh, J.K. (2004a). Parametric empirical Bayes model selection - some theory, methods and simulation. In: Bhattacharya, R.N. et al. (eds) Probability, Statistics and Their Applications: Papers in Honor of Rabi Bhattacharya. IMS Lecture Notes-Monograph Ser. 41, 229-245. Mukhopadhyay, N. and Ghosh, J.K. (2004b). Bayes rule for prediction and AIC, an asymptotic evaluation. Unpublished manuscript.


Muller, P. and Vidakovic, B. (eds) (1999). Bayesian Inference in Wavelet-Based Models. Lecture Notes in Statistics, 141. Springer, New York. Nachbin, L. (1965). The Haar Integral. van Nostrand, New York. Neyman, J. and Scott, E.L. (1948). Consistent estimates based on partially consistent observations. Econometrica 16, 1-32. Newton, M.A., Yang, H., Gorman, P.A., Tomlinson, I. and Roylance, R.R. (2003). A statistical approach to modeling genomic aberrations in cancer cells (with discussion). In: Bernardo, J.M. et al. (eds) Bayesian Statistics 7, 293-305. Oxford Univ. Press, New York. Ogden, R.T. (1997). Essential Wavelets for Statistical Applications and Data Analysis. Birkhäuser, Boston. O'Hagan, A. (1988). Modelling with heavy tails. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 3, 569-577. Oxford Univ. Press, Oxford. O'Hagan, A. (1990). Outliers and credence for location parameter inference. J. Amer. Statist. Assoc. 85, 172-176. O'Hagan, A. (1994). Bayesian Inference. Kendall's Advanced Theory of Statistics, Vol. 2B. Halsted Press, New York. O'Hagan, A. (1995). Fractional Bayes factors for model comparisons. J. Roy. Statist. Soc. (Ser. B) 57, 99-138. Pearson, Karl (1892). The Grammar of Science. Walter Scott, London. Latest Edition: (2004). Dover Publications. Pericchi, L.R. and Perez, M.E. (1994). Posterior robustness with more than one sampling model (with discussion). J. Statist. Plann. Inference 40, 279-294. Perone P.M., Salinetti, G. and Tardella, L. (1998). A note on the geometry of Bayesian global and local robustness. J. Statist. Plann. Inference 69, 51-64. Pettit, L.I. and Young, K.D.S. (1990). Measuring the effect of observations on Bayes factors. Biometrika 77, 455-466. Pitman, E.J.G. (1979). Some Basic Theory for Statistical Inference. Chapman & Hall, London; A Halsted Press Book, Wiley, New York. Polasek, W. (1985). Sensitivity analysis for general and hierarchical linear regression models. In: Goel, P.K. and Zellner, A. (eds).
Bayesian Inference and Decision Techniques with Applications. North-Holland, Amsterdam. Pratt, J.W. (1961). Review of "Testing Statistical Hypotheses" by E.L. Lehmann. J. Amer. Statist. Assoc. 56, 163-166. Pratt, J.W. (1965). Bayesian interpretation of standard inference statements (with discussion). J. Roy. Statist. Soc. (Ser. B) 27, 169-203. Raftery, A.E., Madigan, D. and Hoeting, J.A. (1997). Bayesian model averaging for linear regression models. J. Amer. Statist. Assoc. 92, 179-191. Raiffa, H. and Schlaifer, R. (1961). Applied Statistical Decision Theory. Division of Research, School of Business Administration, Harvard Univ. Ramsey, F.P. (1926). Truth and probability. Reprinted in: Kyburg, H.E. and Smokler, H.E. (eds) Studies in Subjective Probability. Wiley, New York. Rao, C.R. (1973). Linear Statistical Inference and Its Applications, 2nd Ed. Wiley, New York. Rao, C.R. (1982). Diversity: its measurement, decomposition, apportionment and analysis. Sankhyā (Ser. A) 44, 1-22. Richardson, S. and Green, P.J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Roy. Statist. Soc. (Ser. B) 59, 731-792.


Rissanen, J. (1987). Stochastic complexity. J. Roy. Statist. Soc. (Ser. B) 49, 223-239. Ríos Insua, D. and Criado, R. (2000). Topics on the foundations of robust Bayesian analysis. In: Ríos Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 33-44. Springer-Verlag, New York. Ríos Insua, D. and Martín, J. (1994). Robustness issues under imprecise beliefs and preferences. J. Statist. Plann. Inference 40, 383-389. Ríos Insua, D. and Ruggeri, F. (eds) (2000). Robust Bayesian Analysis. Lecture Notes in Statistics, 152. Springer-Verlag, New York. Robbins, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. Proc. Second Berkeley Symp. Math. Statist. Probab. 131-148. Univ. California Press, Berkeley. Robbins, H. (1955). An empirical Bayes approach to statistics. Proc. Third Berkeley Symp. Math. Statist. Probab. 1, 157-164. Univ. California Press, Berkeley. Robbins, H. (1964). The empirical Bayes approach to statistical decision problems. Ann. Math. Statist. 35, 1-20. Robert, C.P. (1994). The Bayesian Choice. Springer-Verlag, New York. Robert, C.P. (2001). The Bayesian Choice, 2nd Ed. Springer-Verlag, New York. Robert, C.P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer-Verlag, New York. Robinson, G.K. (1976). Conditional properties of Student's t and of the Behrens-Fisher solution to the two means problem. Ann. Statist. 4, 963-971. Robinson, G.K. (1979). Conditional properties of statistical procedures. Ann. Statist. 7, 742-755. Robinson, G.K. (1979). Conditional properties of statistical procedures for location and scale parameters. Ann. Statist. 7, 756-771. Roussas, G.G. (1972). Contiguity of Probability Measures: Some Applications in Statistics. Cambridge Tracts in Mathematics and Mathematical Physics, 63. Cambridge Univ. Press, London-New York. Rousseau, J. (2000). Coverage properties of one-sided intervals in the discrete case and application to matching priors. Ann. Inst. Statist. Math. 52, 28-42. Rubin, D.B. (1984).
Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Statist. 12, 1151-1172. Rubin, H. (1971). A decision-theoretic approach to the problem of testing a null hypothesis. In: Gupta, S.S. and Yackel, J. (eds) Statistical Decision Theory and Related Topics. Academic Press, New York. Rubin, H. and Sethuraman, J. (1965). Probabilities of moderate deviations. Sankhyā (Ser. A) 27, 325-346. Ruggeri, F. and Sivaganesan, S. (2000). On a global sensitivity measure for Bayesian inference. Sankhyā (Ser. A) 62, 110-127. Ruggeri, F. and Wasserman, L. (1995). Density based classes of priors: infinitesimal properties and approximations. J. Statist. Plann. Inference 46, 311-324. Sarkar, S. (2003). FDR-controlling stepwise procedures and their false negative rates. Tech. Report, Temple Univ. Satterthwaite, F.E. (1946). An approximate distribution of estimates of variance components. Biometrics Bull. (now called Biometrics), 2, 110-114. Savage, L.J. (1954). The Foundations of Statistics. Wiley, New York. Savage, L.J. (1972). The Foundations of Statistics, 2nd revised Ed. Dover, New York. Schervish, M.J. (1995). Theory of Statistics. Springer-Verlag, New York. Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464.


Scott, J.G. and Berger, J.O. (2005). An exploration of aspects of Bayesian multiple testing. To appear in J. Statist. Plann. Inference. Searle, S.R. (1982). Matrix Algebra Useful for Statistics. Wiley, New York. Seidenfeld, T., Schervish, M.J. and Kadane, J.B. (1995). A representation of partially ordered preferences. Ann. Statist. 23, 2168-2217. Seo, J. (2004). Some classical and Bayesian nonparametric regression methods in a longitudinal marginal model. Ph.D. thesis, Purdue Univ. Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York. Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton Univ. Press, NJ. Shafer, G. (1979). Allocations of Probability. Ann. Probab. 7, 827-839. Shafer, G. (1982). Lindley's paradox (with discussion). J. Amer. Statist. Assoc. 77, 325-351. Shafer, G. (1982). Belief functions and parametric models (with discussion). J. Roy. Statist. Soc. (Ser. B) 44, 322-352. Shafer, G. (1987). Probability judgment in artificial intelligence and expert systems (with discussion). Statist. Sci. 2, 3-44. Shannon, C.E. (1948). A mathematical theory of communication. Bell System Tech. J. 27, 379-423 and 623-656. Reprinted in The Mathematical Theory of Communication (Shannon, C.E. and Weaver, W., 1949). Univ. Illinois Press, Urbana, IL. Shao, J. (1997). An asymptotic theory for linear model selection. Statist. Sinica 7, 221-264. Shibata, R. (1981). An optimal selection of regression variables. Biometrika 68, 45-54. Shibata, R. (1983). Asymptotic mean efficiency of a selection of regression variables. Ann. Inst. Statist. Math. 35, 415-423. Shyamalkumar, N.D. (2000). Likelihood robustness. In: Ríos Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 109-126, Springer-Verlag, New York. Sivaganesan, S. (1989). Sensitivity of posterior mean to unimodality preserving contaminations. Statist. and Decisions 7, 77-93. Sivaganesan, S. (1993). Robust Bayesian diagnostics. J. Statist. Plann. Inference 35, 171-188.
Sivaganesan, S. (2000). Global and local robustness approaches: uses and limitations. In: Ríos Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 89-108, Springer-Verlag, New York. Sivaganesan, S. and Berger, J.O. (1989). Ranges of posterior means for priors with unimodal contaminations. Ann. Statist. 17, 868-889. Smith, A.F.M. and Roberts, G.O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). J. Roy. Statist. Soc. (Ser. B) 55, 3-24. Smith, A.F.M. and Spiegelhalter, D.J. (1980). Bayes factors and choice criteria for linear models. J. Roy. Statist. Soc. (Ser. B) 42, 213-220. Smith, C.A.B. (1965). Personal probability and statistical analysis. J. Roy. Statist. Soc. (Ser. A) 128, 469-499. Sorensen, D. and Gianola, D. (2002). Likelihood, Bayesian, and MCMC Methods in Quantitative Genetics. Springer-Verlag, New York. Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002). Bayesian measures of model complexity and fit. J. Roy. Statist. Soc. (Ser. B) 64, 583-639.


Spiegelhalter, D.J. and Smith, A.F.M. (1982). Bayes factors for linear and log-linear models with vague prior information. J. Roy. Statist. Soc. (Ser. B) 44, 377-387. Sprott, D.A. (2000). Statistical Inference in Science. Springer-Verlag, New York. Srinivasan, C. (1981). Admissible generalized Bayes estimators and exterior boundary value problems. Sankhyā (Ser. A) 43, 1-25. Stein, C. (1955). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proc. Third Berkeley Symp. Math. Statist. Probab. 1, 197-206. Univ. California Press, Berkeley. Stein, C. (1956). Some problems in multivariate analysis, Part I. Tech. Report, No. 6, Dept. Statistics, Stanford Univ. Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9, 1135-1151. Stigler, S.M. (1977). Do robust estimators work with real data? (with discussion). Ann. Statist. 5, 1055-1098. Stone, M. (1979). Comments on model selection criteria of Akaike and Schwarz. J. Roy. Statist. Soc. (Ser. B) 41, 276-278. Storey, J.D. (2002). A direct approach to false discovery rates. J. Roy. Statist. Soc. (Ser. B) 64, 479-498. Storey, J.D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Statist. 31, 2013-2035. Strawderman, W.E. (1971). Proper Bayes minimax estimators of the multivariate normal mean. Ann. Math. Statist. 42, 385-388. Strawderman, W.E. (1978). Minimax adaptive generalized ridge regression estimators. J. Amer. Statist. Assoc. 73, 623-627. Stromborg, K.L., Grue, C.E., Nichols, J.D., Hepp, G.R., Hines, J.E. and Bourne, H.C. (1988). Postfledging survival of European starlings exposed as nestlings to an organophosphorus insecticide. Ecology 69, 590-601. Sun, D. and Berger, J.O. (1998). Reference priors with partial information. Biometrika 85, 55-71. Tanner, M.A. (1991). Tools for Statistical Inference. Springer-Verlag, New York. Tierney, L. (1994).
Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22, 1701-1762. Tierney, L. and Kadane, J.B. (1986). Accurate approximations for posterior moments. J. Amer. Statist. Assoc. 81, 82-86. Tierney, L., Kass, R.E. and Kadane, J.B. (1989). Fully exponential Laplace approximations to expectations and variances of nonpositive functions. J. Amer. Statist. Assoc. 84, 710-716. Tversky, A. and Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science 211, 453-458. Vidakovic, B. (1999). Statistical Modeling by Wavelets. Wiley, New York. Vidakovic, B. (2000). Γ-minimax: A paradigm for conservative robust Bayesians. In: Ríos Insua, D. and Ruggeri, F. (eds) Robust Bayesian Analysis, 241-259, Springer-Verlag, New York. von Mises, R. (1931). Wahrscheinlichkeitsrechnung. Springer, Berlin. von Mises, R. (1957). Probability, Statistics and Truth, 2nd revised English Ed. (prepared by Hilda Geiringer). The Macmillan Company, New York. Waagepetersen, R. and Sorensen, D. (2001). A tutorial on reversible jump MCMC with a view towards applications in QTL-mapping. Int. Statist. Rev. 69, 49-61.


Wald, A. (1950). Statistical Decision Functions. Wiley, New York and Chapman & Hall, London. Walker, A.M. (1969). On the asymptotic behaviour of posterior distributions. J. Roy. Statist. Soc. (Ser. B) 31, 80-88. Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities. Chapman & Hall, London. Wasserman, L. (1990). Prior envelopes based on belief functions. Ann. Statist. 18, 454-464. Wasserman, L. (1992). Recent methodological advances in robust Bayesian inference. In: Bernardo, J.M. et al. (eds) Bayesian Statistics 4, 35-60. Oxford Univ. Press, Oxford. Wasserman, L. and Kadane, J. (1990). Bayes' theorem for Choquet capacities. Ann. Statist. 18, 1328-1339. Wasserman, L., Lavine, M. and Wolpert, R.L. (1993). Linearization of Bayesian robustness problems. J. Statist. Plann. Inference 37, 307-316. Weiss, R. (1996). An approach to Bayesian sensitivity analysis. J. Roy. Statist. Soc. (Ser. B) 58, 739-750. Welch, B.L. (1939). On confidence limits and sufficiency with particular reference to parameters of location. Ann. Math. Statist. 10, 58-69. Welch, B.L. (1949). Further notes on Mrs. Aspin's tables. Biometrika 36, 243-246. Welch, B.L. and Peers, H.W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Roy. Statist. Soc. (Ser. B) 25, 318-329. Wijsman, R.A. (1967). Cross-sections of orbits and their application to densities of maximal invariants. Proc. Fifth Berkeley Symp. Math. Statist. Probab. 1, 389-400. Univ. California Press, Berkeley. Wijsman, R.A. (1985). Proper action in steps, with application to density ratios of maximal invariants. Ann. Statist. 13, 395-402. Wijsman, R.A. (1986). Global cross sections as a tool for factorization of measures and distribution of maximal invariants. Sankhyā (Ser. A) 48, 1-42. Wijsman, R.A. (1990). Invariant Measures on Groups and Their Use in Statistics. IMS Lecture Notes-Monograph Ser. 14. IMS, Hayward, CA. Woods, H., Steinour, H.H. and Starke, H.R. (1932).
Effect of composition of Portland cement on heat evolved during hardening. Industrial and Engineering Chemistry, 24, 1207-1214. Woodward, G., Lange, S.W., Nelson, K.W., and Calvert, H.O. (1941). The acute oral toxicity of acetic, chloracetic, dichloracetic and trichloracetic acids. J. Industrial Hygiene and Toxicology, 23, 78-81. Young, G.A. and Smith, R.L. (2005). Essentials of Statistical Inference. Cambridge Univ. Press, Cambridge, U.K. Zellner, A. (1971). An Introduction to Bayesian Inference in Econometrics. Wiley, New York. Zellner, A. (1984). Posterior odds ratios for regression hypotheses: General considerations and some specific results. In: Zellner, A. (ed) Basic Issues in Econometrics, 275-305. Univ. of Chicago Press, Chicago. Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: Goel, P.K. and Zellner, A. (eds) Basic Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, 233-243. North-Holland, Amsterdam.


Zellner, A. and Siow, A. (1980). Posterior odds ratios for selected regression hypotheses. In: Bernardo, J.M. et al. (eds) Bayesian Statistics, 585-603. University Press, Valencia.

Author Index

Abramowitz, M., 274 Agresti, A., 34 Akaike, H., 284 Albert, J.H., 96, 251 Anderson, D.R., 284, 288 Andersson, S., 174, 175 Angers, J-F., 91, 92, 208, 295, 297-299 Arnold, S.F., 225 Athreya, K.B., 216, 217 Bahadur, R.R., 178, 179 Banerjee, S., 289-291 Barbieri, M.M., 276, 280-283 Barron, A., 127 Basu, D., 8, 10, 25, 37, 307, 308 Basu, R., 262 Basu, S., 285 Bayarri, M.J., 93, 182-184 Bayes, T., 32, 123 Benjamini, Y., 272, 273 Berger, J.O., 15, 30, 37, 38, 40, 53, 60, 61, 72-77, 83, 91, 93, 123, 126, 139, 140, 142, 144-146, 164, 166, 167, 169, 171, 172, 175, 176, 179, 182-184, 190, 191, 193, 194, 196, 197, 267, 268, 270, 273-275, 277, 279-283, 287, 298 Berger, R., 1, 176, 203 Berliner, M.L., 83 Bernardo, J.M., 30, 33, 123, 126, 128, 129, 140, 142, 144, 145, 149, 157, 175, 193, 269, 284 Bernoulli, J., 100 Bernstein, S., 103

Berti, P., 71 Besag, J., 222, 290 Best, N.G., 285 Betro, B., 72 Bhanja, J., 57 Bhattacharya, R.N., 155 Bhattacharya, S., 285 Bickel, P.J., 1, 103, 109, 155 Birmiwal, L., 86, 89, 96 Birnbaum, A., 38, 57, 307, 308 Bishop, Y.M.M., 34 Blackwell, D., 315 Bondar, J.V., 139, 174 Borwanker, J.D., 103 Bose, S., 92 Bourne, H.C., 288 Box, G., 30 Box, G.E.P., 91, 93, 180, 201, 245 Brons, H., 174 Brooks, S.P., 230 Brown, L.D., 4, 14, 34, 131, 267 Burnham, K.P., 284, 288 Caffo, B., 34 Cai, T.T., 34, 131, 262, 284 Calvert, H.O., 252 Carlin, B.P., 30, 148, 198, 232, 236, 261, 262, 285, 289-291 Carlin, J.B., 30, 161, 245, 257, 260, 285 Carnap, R., 30 Casella, G., 1, 9, 155-157, 164, 176,203, 215, 217, 223, 225-227, 230, 232, 234, 300 Cencov, N.N., 125


Chakrabarti, A., 275, 284, 285 Chao, M.T., 103 Chatterjee, S.K., 37 Chattopadhyay, G., 37 Chen, C.F., 103 Chib, S., 251, 285 Christiansen, C.L., 256 Clarke, B., 127 Clayton, D.G., 290, 291 Clyde, M.A., 274, 278, 279 Congdon, P., 30 Cox, D.R., 37, 51, 52 Csiszar, I., 86 Cuevas, A., 93 Dalal, S.R., 135 DasGupta, A., 34, 131, 155, 156, 170 Datta, G.S., 130-132, 139, 140, 148, 263 Daubechies, I., 293, 294 Dawid, A.P., 91, 136, 138 de Finetti, B., 54, 66, 149, 311 DeGroot, M.H., 30, 66, 175 Delampady, M., 89, 90, 96, 155, 156, 164, 168-172, 174, 176, 188, 208, 214, 216, 295, 297-299 Dempster, A.P., 57, 81, 175, 208, 209 DeRobertis, L., 75, 79 Dey, D.K., 85, 86, 89, 90, 92, 93, 96, 268, 285, 289 Dharmadhikari, S., 170 Diaconis, P., 85, 100, 133 Diamond, G.A., 175 Dickey, J.M., 121, 154, 155, 175, 176 Dmochowski, J., 196 Doksum, K.A., 1 Donoho, D., 273 Doss, H., 217

Ferguson, T.S., 66, 67, 69, 70 Fernandez, C., 93 Fienberg, S.E., 34 Finney, D.J., 247 Fishburn, P.C., 73 Fisher, R.A., 21, 37, 38, 51, 59, 123, 163 Flury, B., 233 Forrester, J.S., 175 Fortini, S., 149 Foster, D.P., 284 Fox, F.A., 152 Fraser, D.A.S., 142 Freedman, D.A., 70, 71, 100 Freedman, J., 85 French, S., 55, 58, 67, 68, 70

Eaton, M.L., 138, 140, 174, 267 Eddington, A.S., 48 Edwards, W., 164, 166, 167, 172, 175 Efron, B., 23, 255, 261, 269, 271, 272 Erkanli, A., 302 Ezekiel, M., 152

Gardner, M., 48 Garthwaite, P.H., 121, 154, 155 Gelfand, A.E., 37, 85, 222, 285, 289-291 Gelman, A., 30, 161, 181, 245, 257, 260, 285 Genest, C., 155 Genovese, C., 273 George, E.I., 225, 284 Geweke, J., 91 Ghosal, S., 101, 103, 125, 146 Ghosh, J.K., 37, 57, 100, 101, 103, 104, 108, 109, 123, 125, 130, 131, 140, 142, 145-147, 155, 164, 176, 177, 179, 194, 262, 271, 273-275, 284, 285, 288, 289, 302, 308 Ghosh, M., 36, 139, 140, 148, 256, 263 Ghosh, S.K., 93, 285 Gianola, D., 210, 230, 231 Giudici, P., 230 Goel, P., 86 Good, I.J., 77, 172, 175 Gorman, P.A., 269 Green, P.J., 230, 300 Greenberg, E., 285 Grue, C.E., 288 Gruet, M., 300 Gustafson, P., 85 Guttman, I., 181

Fan, T-H., 91 Fang, K.T., 170 Farrell, R.H., 140, 164, 174 Feller, W., 179, 203

Hajek, J., 178 Hall, G.J., 135 Halmos, P.R., 136 Hartigan, J.A., 75, 79, 125

Has'minskii, R.Z., 127, 265 Heath, D., 41, 70, 71, 138, 148 Hepp, G.R., 288 Hernandez, E., 293 Hewitt, E., 54 Hildreth, C., 175 Hill, B., 175 Hines, J.E., 288 Hochberg, Y., 272 Hoeting, J.A., 36, 278 Holland, P.W., 34 Hsu, J.S.J., 30 Huber, P.J., 93, 155 Hwang, J.T., 164 Ibragimov, I.A., 127, 265 Ickstadt, K., 91 James, W., 255, 265 Jeffreys, H., 30, 31, 41, 43, 46, 47, 60, 126, 175, 177, 189, 197 Jensen, S., 174 Jin, J., 273 Joag-Dev, K., 170 Johnson, R.A., 107-109 Joshi, S.N., 108, 109 Kadane, J.B., 57, 66, 72, 73, 81, 99, 109, 116, 117, 121, 154, 208 Kahneman, D., 72 Kaldor, J.M., 290, 291 Kallianpur, G., 103 Kariya, T., 174 Kass, R.E., 99, 116, 123 Keynes, L., 123 Kiefer, J., 37, 139, 176, 255 Kotz, S., 170 Krishnan, T., 208, 216 Ladelli, L., 149 Laird, N.M., 208, 209 Lange, S.W., 252 Laplace, P.S., 32, 99, 100, 103, 113, 115, 123 Lavine, M., 93 Le Cam, L., 103, 178 Leamer, E.E., 72, 75, 175, 245 Lehmann, E.L., 1, 9, 19, 37, 139, 157, 176


Lempers, F.B., 175 Leonard, T., 30 Li, K.C., 283 Liang, F., 274 Lindley, D.V., 107, 126, 129, 148, 175, 177, 189, 198, 297 Lindman, H., 164, 166, 167, 172, 175 Liseo, B., 53, 61, 83 Liu, R.C., 14 Liu, W., 273 Lou, K., 92, 93 Louis, T.A., 30, 148, 198, 232, 236, 261, 262 Low, M., 262, 284 Lu, K., 15 Müller, P., 289, 293, 296, 302 Madigan, D., 36, 278 Martin, J., 92 Martin, J.T., 247 Matsumoto, M., 215 McLachlan, G.J., 208 Meeden, G., 36, 256 Mendoza, M., 126 Meng, X.L., 181 Meyn, S.P., 217 Micheas, A.C., 92 Milnes, P., 139, 174 Molina, G., 274 Monette, G., 142 Moreno, E., 72 Morris, C., 36, 255-257, 259, 261-263, 265, 286 Mosteller, F., 242 Muirhead, R.J., 140 Mukerjee, R., 123, 130-132, 142, 262, 263 Mukhopadhyay, N., 273, 274, 279, 283, 284, 288 Nachbin, L., 136 Nelson, K.W., 252 Newton, M.A., 269 Neyman, J., 21, 52, 255 Ng, K.W., 142, 170 Nichols, J.D., 288 Nishimura, T., 215 O'Hagan, A., 30, 91, 92, 121, 154, 191, 194


Ogden, R.T., 293, 294 Osiewalski, J., 93 Perez, M.E., 93 Parmigiani, G., 279 Paulo, R., 274 Pearson, K., 32, 155 Pericchi, L.R., 72, 93, 190, 191, 193, 194, 196, 197 Peters, S.C., 121, 154 Pettit, L.I., 185 Petrella, L., 83 Phillips, L.D., 148, 198 Pitman, E.J.G., 14 Polasek, W., 75, 96 Pratt, J.W., 37, 38, 175 Purkayastha, S., 57, 164, 176, 177, 179 Purves, R.A., 70, 71

Rios Insua, D., 55, 67, 68, 70, 72, 92, 93 Raftery, A.E., 36, 278 Raiffa, H., 55, 175 Ramamoorthi, R.V., 100, 101, 103, 125, 140, 146, 271, 289, 302, 315 Ramsey, F.P., 65 Rao, B.L.S.P., 103 Rao, C.R., 8, 86, 125, 210, 234 Regazzini, E., 71, 149 Reid, N., 52 Richardson, S., 300 Rigo, P., 71 Rissanen, J., 285 Robbins, H., 255, 271 Robert, C.P., 15, 30, 164, 215, 217, 223, 226, 227, 230, 232, 234, 235, 267, 300 Roberts, G.O., 230 Robinson, G.K., 62 Roussas, G.G., 177 Rousseau, J., 131 Roylance, R.R., 269 Rubin, D.B., 30, 161, 181, 208, 209, 245, 257, 260, 285 Rubin, H., 155, 156, 175 Ruggeri, F., 72, 84, 92, 93 Sahu, S., 253 Salinetti, G., 72, 83

Samanta, T., 57, 101, 103, 147, 164, 176, 177, 179, 194, 284, 288 Sanz, P., 93 Sarkar, S., 272 Satterthwaite, F.E., 62 Savage, L.J., 30, 54, 66, 164, 166, 167, 172, 175 Schervish, M.J., 54, 66, 73, 157, 265, 268, 311 Schlaifer, R., 55, 175 Schwarz, G., 114, 118, 162, 179 Scott, E.L., 52, 255 Scott, J.G., 270, 273 Searle, S.R., 297 Seidenfeld, T., 66, 73 Sellke, T., 164, 166, 167, 179 Sengupta, S., 57 Seo, J., 302 Serfling, R.J., 178 Sethuraman, J., 217 Shafer, G., 57, 81, 175 Shannon, C.E., 124, 128, 129 Shao, J., 283 Shibata, R., 283 Shyamalkumar, N.D., 93 Sidak, Z.V., 178 Sinha, B.K., 108, 109, 174 Sinha, D., 289 Siow, A., 175, 273 Sivaganesan, S., 77, 84, 90, 93 Slovic, P., 72 Smith, A.F.M., 30, 37, 149, 175, 194, 222, 284, 297 Smith, C.A.B., 175 Smith, R.L., 272 Smith, W.S., 121, 154 Sorensen, D., 210, 230, 231 Spiegelhalter, D.J., 175, 194, 285 Sprott, D.A., 57 Srinivasan, C., 267 Starke, H.R., 203 Steel, M.F.J., 93 Stegun, I., 274 Stein, C., 174, 255, 261, 264-268, 286, 287 Steinour, H.H., 203 Stern, H.S., 30, 161, 181, 245, 257, 260, 285 Stone, M., 138, 273

Storey, J.D., 269, 271-273, 288 Strawderman, W.E., 155, 156, 267, 268 Stromborg, K.L., 288 Sudderth, W., 41, 70, 71, 138, 148 Sun, D., 123, 142, 146 Tanner, M.A., 208 Tiao, G., 30, 91, 93, 245 Tibshirani, R., 269, 271, 272 Tierney, L., 99, 109, 116, 117, 208, 217 Tomlinson, I., 269 Tukey, J.W., 242 Tusher, V., 269, 271, 272 Tversky, A., 72 Tweedie, R.L., 217


Wasserman, L., 57, 72, 81, 82, 85, 123, 273 Weiss, G., 293 Welch, B.L., 37, 38, 51, 59, 60, 62 Wells, M.T., 164 West, M., 302 Wijsman, R.A., 174, 175 Winkler, R.L., 121, 154 Wolfowitz, J., 255 Wolpert, R., 37, 53, 61 Woods, H., 203 Woodward, G., 252

van der Linde, A., 285 Varshavsky, J.A., 191 Verghese, A., 298 Vidakovic, B., 91, 293, 294, 296 Volinsky, C.T., 36, 278 von Mises, R., 58, 100, 103

Yahav, J., 103 Yang, G.L., 178 Yang, H., 269 Yee, I., 214 Yekutieli, D., 273 Ylvisaker, D., 133 York, J., 278 Young, G.A., 272 Young, K.D.S., 185

Waagepetersen, R., 230 Wald, A., 21, 23, 36 Walker, A.M., 103 Walley, P., 81

Zellner, A., 175, 273, 274, 283 Zhao, L., 262, 284 Zidek, J.V., 138, 214 Zoppe, A., 233

Subject Index

accept-reject method, 226, 233 action space, 22, 38 AIBF, 191, 192, 194, 195 AIC, 163 Akaike information criterion, 163, 283, 284 algorithm E-M, 23, 206, 208, 210, 259, 260 M-H, 206, 223, 229, 247 independent, 226, 232 reversible jump, 231, 236 Mersenne twister, 215 Metropolis-Hastings, 206, 218, 222, 260 alternative contiguous, 177, 178 parametric, 161 Pitman, 177, 179 amenable group, 139, 174, 176 analysis of variance, 224 ancillary statistic, 9, 10, 37 ANOVA, 227, 240, 260 approximation error in, 206, 211 Laplace, 99, 113, 115, 161, 207, 214, 274 large sample, 207 normal, 49, 247 saddle point, 161 Schwarz, 179 Stirling's, 114 Tierney-Kadane, 109 asymptotic expansion, 108, 110

asymptotic framework, 177, 178 asymptotic normality of posterior, 126 of posterior distribution, 103 average arithmetic, 191 geometric, 191 axiomatic approach, 67 Bahadur's approach, 178 basis functions, 293 Basu's theorem, 10 Bayes estimate, 106 expansion of, 109 Bayes factor, 43, 113, 118, 159-179, 185-200, 273, 279, 285, 302 conditional, 190 default, 194 expected intrinsic, 191 intrinsic, 191 Bayes formula, 30, 32 Bayes risk bounds, 109 expansion of, 109 Bayes rule, 23, 36, 39 for 0-1 loss, 42 Bayesian analysis default, 155 exploratory, 52 objective, 30, 36, 55, 121, 147 subjective, 55, 147 Bayesian approach subjective, 36 Bayesian computations, 222


Bayesian decision theory, 41 Bayesian inference, 41 Bayesian information criterion (BIC), 114, 162 Bayesian model averaging estimate, 278 Bayesian model averaging, BMA, 278 Bayesian paradigm, 57 Behrens-Fisher problem, 62, 240 belief functions, 57 best unbiased estimate, see estimate bias, 11 BIC, 160-163, 179, 274, 276 in high-dimensional problems, 274 Schwarz, 118 Bickel prior, 155 Birnbaum's theorem, 307 Bootstrap, 23 Box-Muller method, 233 calibration of P-value, 164 central limit theorem, 6, 212, 231 Chebyshev's inequality, 101 chi-squared test, 170 class of priors, 66, 73-97, 121, 155, 165, 186 conjugate, 74, 171 density ratio, 75 elliptically symmetric unimodal, 169, 170 ε-contamination, 75, 86 extreme points, 166 group invariant, 174 mixture of conjugate, 172 mixture of uniforms, 168 natural conjugate, 165 nonparametric, 89 normal, 74, 167, 188 parametric, 89 scale mixtures of normal, 167 spherically symmetric, 174 symmetric, 74 symmetric star-unimodal, 170 symmetric uniform, 166, 167 unimodal spherically symmetric, 90, 168, 170, 171 unimodal symmetric, 74, 167 classical statistics, 36, 159 coherence, 66, 70, 138, 147, 148, 311 complete class, 36

complete orthonormal system, 293 completeness, 10 compound decision problem, 286 conditional inference, 38 conditional prior density, 160 conditionality principle, 38 conditionally autoregressive, 290 conditioning, 38, 224 confidence interval, 20, 34 PEB, 262 confidence set, 21 conjugate prior, see prior, 242 consistency, 7 of posterior distribution, 100 convergence, 209, 213, 215, 218, 223, 231 in probability, 7, 211 correlation coefficient, 155 countable additivity, 311 countable state space, 216-218 coverage probability, 34, 262 credibility, 49 credible interval, 48, 258 HB, 262 HPD, 42, 49 predictive, 50 credible region HPD, 244 credible set, 48 cross validation, 36 curse of dimensionality, 206 data analysis, 57 data augmentation, 208 data smoothing, 289 Daubechies wavelets, 294 de Finetti's theorem, 54, 149 decision function, 22, 39, 68 decision problem, 21, 65, 67 decision rule, 22, 36 admissible, 36 Bayes, 39 minimax, 23 decision theory, 276 classical, 36 delta method, 15 density, 303 posterior, 210

predictive, 208 dichotomous, 245 Dirichlet multinomial allocation, 289, 299 prior, 300 discrete wavelet transform, 296 disease mapping, 289 distribution Bernoulli, 2, 306 Beta, 32, 304 binomial, 2, 306 Cauchy, 5, 304 chi-square, 304 conditional predictive, 183 Dirichlet, 305 double exponential, 303 exponential, 2, 303 F, 305 Gamma, 304 geometric, 3, 306 inverse Gamma, 305 Laplace, 303 limit, 215 logistic, 306 mixture of normal, 2 multinomial, 3, 306 multivariate t, 304 multivariate normal, 303 negative binomial, 3, 306 non-central Student's t, 175 noncentral chi-square, 174 normal, 1, 303 Poisson, 2, 306 posterior predictive, 182 predictive, 50 prior predictive, 180 Student's t, 304 t, 304 uniform, 5, 304 Wishart, 153, 305 divergence measure, 86 chi-squared, 86 directed, 86 generalized Bhattacharya, 86 Hellinger, 86 J-divergence, 86 Kagan's, 86 Kolmogorov's, 86 Kullback-Leibler, 86, 126, 146


power-weighted, 86 Doeblin irreducibility, 217 double use of data, 182, 183 elicitation nonparametric, 149 of hyperparameters, 150 of prior, 149 elliptical symmetry, 169 empirical Bayes, 274, 283, 290, 298 constrained, 283, 284 parametric, 36, 54, 255, 260 ergodic, 215 estimate √n-consistent, 8 approximately best unbiased, 15 best unbiased, 13 inconsistent, 52 method of moments, 270 Rao-Blackwellized, 235 shrinkage, 34 unbiased, 11 uniformly minimum variance unbiased, 13 estimation nonparametric, 289 simultaneous, 290 exchangeability, 29, 54, 122, 149, 256, 257, 265, 290 partial, 255 exploratory Bayesian analysis, see Bayesian analysis exponential family, 4, 6, 7, 10, 14, 17, 132 factorization theorem, 9, 315 false discovery, 53 false discovery rate, 272 father wavelet, 293 FBF, 191, 192 finite additivity, 311 Fisher information, 12, 47, 99, 102, 125 expected, 102 minimum, 155 observed, 99, 162 fractional Bayes factor, FBF, 191 frequency property, 50 frequentist, 170 conditional, 51


frequentist validation, 36, 58 of Bayesian analysis, 100 full conditionals, 222, 232, 290 fully exponential, 208 gamma minimax, 91 GBIC, 274, 276 gene expression, 53, 269, 314 generic target distribution, 219 geometric perturbation, 89 Gibbs sampler, 220-226, 232, 251 Gibbs sampling, 206, 220-222, 260 GIBF, 191, 192 global robustness, 76, 93 graphical model structure, 279, 281 group of transformations, 137, 172 Haar measure, 139 left invariant, 123, 136, 144 right invariant, 123, 136, 144, 146 Haar wavelet, 293, 302 Hammersley-Clifford theorem, 222 Harris irreducibility, 217 hierarchical Bayes, 22, 54, 222, 227, 240, 242, 260, 289, 290, 296, 298 hierarchical modeling, 215 hierarchical prior, see prior high-dimensional estimation, 276 PEB, HB, 269 high-dimensional multiple testing, 269 PEB, HB, 269 high-dimensional prediction, 276 high-dimensional problem, 15, 35, 140, 159, 214, 289 highest posterior density, 42 Hunt-Stein condition, 139 Hunt-Stein theorem, 139, 176 hypothesis testing, 11, 16-20, 41, 159, 163

IBF, 191, 192, 194 identifiability, 271 lack of, 53, 232 importance function, 214 importance sampling, 213 inequality Cramer-Rao, 12 information, 12 interval estimation, 22, 41

invariance, 51, 123, 136, 148 invariance principle, 124 invariant prior, 172 invariant test, 140, 172, 173, 176 inverse c.d.f. method, 233 James-Stein estimate, 262, 265, 267, 284 positive part, 265, 267 James-Stein-Lindley estimate, 261, 264, 268 positive part, 262 Kullback-Leibler distance, 209 latent variable, 251 law of large numbers, 6, 211, 231 for Markov chains, 217 second fundamental, 100 weak, 100 least squares, 58 likelihood equation, 8, 104 likelihood function, 7, 8, 29, 121 likelihood principle, 38, 147, 148, 307, 308 likelihood ratio, 7, 178, 179 weighted, 185 likelihood ratio statistic, 20 Lindley-Bernardo functional, 141 Lindley-Bernardo information, 142 linear model, 241, 263 generalized, 245 linear perturbation, 85, 86 linear regression, 160, 241 link function, 246 local robustness, 85, 93 location parameter, 40 location-scale family, 5-7, 136 log-concave, 8 logistic regression, 245, 247 logit model, 245, 246, 251 long run relative frequency, 29 loss, 29 0-1, 22, 276 absolute error, 41 posterior expected, 92 squared error, 22 Stein's, 267 loss function, 22, 38, 65-73, 92, 276 loss robustness, 92-93

low-dimensional problem, 121, 159 lower bound Cramer-Rao, 12 on Bayes factor, 167, 172 over classes of priors, 165, 166 machine learning, 57 marginalization paradox, 138 Markov chain, 215-234 aperiodic, 218-221 ergodic theorem, 231 irreducible, 217, 218, 220, 221, 223 stationary distribution of, 217 Markov Chain Monte Carlo, see MCMC Markov property, 215 maximal invariant, 172, 173 maximum likelihood estimate, see MLE MCMC, 37, 206, 215, 218, 223, 224, 240, 256, 259, 260, 263, 278, 291 convergence of, 217 independent runs, 218 reversible jump, 229 mean arithmetic, 146 geometric, 146 mean squared error, MSE, 39 measure of accuracy, 37 measure of information, 122 Bernardo's, 121, 129 in prior, 122 measures of evidence, 43, 163, 164, 166, 179 Bayesian, 165, 167 median probability model, 279-281, 283 metric Euclidean, 125 Hellinger, 125 Riemannian, 125 MIBF, 191, 192 microarray, 53, 269, 271, 272, 313, 314 minimal sufficiency, 308 minimax, 139 minimax estimate, 267 minimax test, 139 minimaxity, 176 MLE, 7, 8, 20, 102, 106, 162, 255, 290 model checking, 160, 161, 180 model criticism, 159 model departure statistic, 180, 183


model robustness, 23, 93 model selection, 36, 159-160, 185, 194, 229, 256, 273, 276, 283 moderate deviations, 179 monotone likelihood ratio, 25 Monte Carlo, 298 Monte Carlo importance sampling, 215 Monte Carlo sampling, 211, 214, 215, 240 mother wavelet, 293 MRA, 293-295, 302 multi-resolution analysis, see MRA multiple testing, 256, 273 high-dimensional, 272 NPEB, 271 multivariate symmetry, 170 multivariate unimodality, 170 nested model, 189, 273, 279 Neyman-Pearson lemma, 17 Neyman-Pearson theory, 17 Neyman-Scott problem, 52 nonparametric Bayes, 289 nonparametric estimate of prior, 271 nonparametric regression, 160, 279, 284, 292, 295 normal approximation to posterior distribution, 101 normal linear model, 251 NPEB, 272 nuisance parameter, 51 null hypothesis interval, 176 precise, 165, 186 sharp, 35, 41, 177 numerical integration, 205, 214, 226, 298 objective Bayesian analysis, see Bayesian analysis high-dimensional, 269 Occam's window, 278 odds ratio posterior, 31, 35, 160 prior, 160 one-sided test, 176 orthogonal parameters, 47 outlier, 6, 185-188


P-value, 26, 163-202 Bayesian, 159, 161, 181 conditional predictive, 183, 184 partial posterior predictive, 184 partial predictive, 185 posterior predictive, 181, 182 prior predictive, 180 paradox, 36 Jeffreys-Lindley, 177, 178 parameter space, 38 parametric alternative, 170 pay-off, 67 expected, 71 φ-divergence, 86 curvature of, 89 pivotal quantity, 21 point estimation, 41 positivity condition, 222, 223 posterior density, 31 posterior dispersion matrix, 42 posterior distribution, 31 conditional, 215 improper, 232 marginal, 215 proper, 33, 122, 291 quantiles of, 205 tails of, 213 target, 215 posterior mean, 31, 41 posterior median, 41 posterior mode, 41 posterior normality, 99, 103, 115, 258 in Kullback-Leibler sense, 127 posterior odds ratio, see odds ratio posterior quantiles, 41 posterior standard deviation, 31 posterior variance, 31 power of test, 16 prediction loss, 50 prediction rule, 54 Bayes, 62 predictive ability, 36 preference, 65 ordering, 65, 73 Bayesian, 73 coherent, 73 partial, 73 total, 73 relation, 65, 66

prevision, 311 prior, 30 compactly supported, 155 conjugate, 132, 134, 135, 215, 259 mixture of, 75, 135, 215 conventional, 29 default, 191 Dirichlet, 62 elicitation of, 121 finitely additive, 41, 71, 148 hierarchical, 53, 215, 222, 233, 242, 256 conjugate, 247 improper, 29, 40, 122, 147, 233 intrinsic, 191, 194-196 Jeffreys, 33, 34, 49, 56, 122, 125, 128-130, 134, 140, 144, 148 least favorable, 165 left invariant, 138, 140 modified Jeffreys, 193 multivariate Cauchy, 273 noninformative, 29, 121, 124, 147 objective, 29, 34, 36, 40, 49, 55, 121, 136, 140, 148, 155 probability matching, 36, 49, 56, 129, 131, 132, 148, 262 reference, 33, 34, 56, 123, 129, 140, 142, 148, 155 right invariant, 138, 140 smooth Cauchy, 274 subjective, 36, 121 uniform, 33, 34, 41, 122 Zellner's g-prior, 274 Zellner-Siow, 275 prior belief, 34, 66 quantification of, 23, 29 prior density, 31 prior distribution, 30 prior knowledge, 36 partial, 36 probability acceptance, 219, 220 objective, 29 subjective, 29 upper and lower, 57 probit model, 245, 246, 251 profile likelihood, 51, 52 proposal density, 247

random number generation, 214 randomization, 20 Rao-Blackwell theorem, 13, 223, 224 ratio of integrals, 206 rational behavior, 65, 66 rationality axioms, 55, 65, 66 regression locally linear, 300 regression function, 160, 292 regression model, 241 regular family, 6 regularity conditions Cramer-Rao type, 6, 8 relative risk, 289 resolution level, 295, 296, 302 right invariant, 136 risk frequentist, 91 integrated, 66 posterior, 39, 65 preposterior, 39 unbiased estimation of, 15 risk function, 22, 36 risk set, 20 robust Bayes, 57, 165, 185 robustness, 6, 36, 65, 71-96, 121, 155, 205, 296 Bayesian frequentist approach, 91 measures of, 74 of posterior inference, 101 scale parameter, 40 scaling function, 293, 295 sensitivity, 65, 66, 72 analysis, 72 local, 85, 89 measures of, 76 overall, 85 relative, 84 Shannon entropy, 123, 124, 126 shrinkage, 261 shrinkage estimate, 259, 264 simulation, 23, 218 single run, 224 smoothing techniques, 295 spatial correlation, 290 spatial modeling, 289 spectral decomposition, 297


spherical symmetry, 169 state space, 216 stationary, 216, 217 stationary distribution, 217, 220, 221 stationary transition kernel, 217 statistical computing, 206 statistical decision theory, 38 Bayesian, 39 classical, 38 statistical learning, 57 Stein's example, 255 Stein's identity, 265, 268 Stone's problem, 273, 274 stopping rule paradox, 38 stopping rule principle, 148 strongly consistent solution, 104 subjective probability, 66 elicitation of, 55 sufficiency Bayes, 315 sufficiency principle, 38, 308 sufficient statistic, 9, 10, 224, 315 complete, 10, 13, 261 minimal, 9, 10, 38, 132 tail area, 182, 183 target distribution, 220, 224 target model, 180 test conditional, 51 likelihood ratio, 179 minimax, 20 non-randomized, 20 randomized, 20 unbiased, 19 uniformly most powerful, 17 test for association, 117, 203 test of goodness of fit, 170, 203 test statistic, 164 testing for normality, 160 time homogeneous, 216 total variation, 85 training sample, 189 minimal, 190 transition function, 216, 220 transition kernel, 216 transition probability invariant, 216


matrix, 216, 217 proposal, 219 stationary, 216 type 1 error, 16 type 2 error, 16 type 2 maximum likelihood, 77 utility, 29, 65-73 variable selection, 279

variance reduction, 224 wavelet, 289, 292-299 compactly supported, 294 wavelet basis, 293 wavelet decomposition, 295 wavelet smoother, 298 weak conditionality principle, 308 weak sufficiency principle, 308 WinBUGS, 291

Springer Texts in Statistics

(continued from page ii)

Lehmann and Romano: Testing Statistical Hypotheses, Third Edition
Lehmann and Casella: Theory of Point Estimation, Second Edition
Lindman: Analysis of Variance in Experimental Design
Lindsey: Applying Generalized Linear Models
Madansky: Prescriptions for Working Statisticians
McPherson: Applying and Interpreting Statistics: A Comprehensive Guide, Second Edition
Mueller: Basic Principles of Structural Equation Modeling: An Introduction to LISREL and EQS
Nguyen and Rogers: Fundamentals of Mathematical Statistics: Volume I: Probability for Statistics
Nguyen and Rogers: Fundamentals of Mathematical Statistics: Volume II: Statistical Inference
Noether: Introduction to Statistics: The Nonparametric Way
Nolan and Speed: Stat Labs: Mathematical Statistics Through Applications
Peters: Counting for Something: Statistical Principles and Personalities
Pfeiffer: Probability for Applications
Pitman: Probability
Rawlings, Pantula and Dickey: Applied Regression Analysis
Robert: The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation, Second Edition
Robert and Casella: Monte Carlo Statistical Methods
Rose and Smith: Mathematical Statistics with Mathematica
Ruppert: Statistics and Finance: An Introduction
Santner and Duffy: The Statistical Analysis of Discrete Data
Saville and Wood: Statistical Methods: The Geometric Approach
Sen and Srivastava: Regression Analysis: Theory, Methods, and Applications
Shao: Mathematical Statistics, Second Edition
Shorack: Probability for Statisticians
Shumway and Stoffer: Time Series Analysis and Its Applications: With R Examples, Second Edition
Simonoff: Analyzing Categorical Data
Terrell: Mathematical Statistics: A Unified Introduction
Timm: Applied Multivariate Analysis
Toutenburg: Statistical Analysis of Designed Experiments, Second Edition
Wasserman: All of Nonparametric Statistics
Wasserman: All of Statistics: A Concise Course in Statistical Inference
Weiss: Modeling Longitudinal Data
Whittle: Probability via Expectation, Fourth Edition
Zacks: Introduction to Reliability Analysis: Probability Models and Statistical Methods
