
Large-Scale Simultaneous Hypothesis Testing:
The Choice of a Null Hypothesis
Bradley Efron
Abstract
Current scientific techniques in genomics and image processing routinely produce hypothesis testing problems with hundreds or thousands of cases to consider simultaneously. This poses new difficulties for the statistician, but also opens new opportunities. In particular it allows empirical estimation of an appropriate null hypothesis. The empirical null may be considerably more dispersed than the usual theoretical null distribution that would be used for any one case considered separately. An empirical Bayes analysis plan for this situation is developed, using a local version of the false discovery rate to examine the inference issues. Two genomics problems are used as examples to show the importance of correctly choosing the null hypothesis.
Key Words: local false discovery rate, empirical Bayes, microarray analysis, empirical null
1. Introduction
Until recently “simultaneous inference” meant considering two or five or perhaps ten hypothesis tests at the same time, as in Miller’s classic 1981 text. Rapid progress in technology, particularly in genomics and imaging, has vastly upped the ante for simultaneous inference problems: now 500 or 5000 or even 50,000 tests may need to be evaluated at once, raising new problems for the statistician, but also opening new analytic opportunities. This paper concerns the choice of an appropriate null hypothesis in large-scale testing situations, and how this choice affects well-known inference methods such as the false discovery rate.
Simultaneous hypothesis testing begins with a collection of null hypotheses $H_1, H_2, \ldots, H_N$, corresponding test statistics $Y_1, Y_2, \ldots, Y_N$, possibly not independent, and their p-values $P_1, P_2, \ldots, P_N$, with $P_i$ measuring how strongly $y_i$, the observed value of $Y_i$, contradicts $H_i$, for instance $P_i = \mathrm{prob}\{|Y_i| > |y_i|\}$. “Large-scale” means that $N$ is a big number, say at least $N > 100$.
It is convenient though not necessary to work with z-values instead of the $Y_i$’s or $P_i$’s,
$$z_i = \Phi^{-1}(P_i), \qquad i = 1, 2, \ldots, N,$$
$\Phi$ indicating the standard normal cumulative distribution function (cdf), $\Phi^{-1}(.95) = 1.645$ etc. If $H_i$ is exactly true then $z_i$ will have a standard normal distribution,
$$z_i \mid H_i \sim N(0, 1). \tag{1.4}$$
We will call (1.4) the theoretical null hypothesis.
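The conversion from p-values to z-values can be sketched in a few lines. This is only an illustration using Python's standard library; the function name `z_from_p` is hypothetical and not part of the paper.

```python
from statistics import NormalDist

def z_from_p(p: float) -> float:
    """Map a p-value in (0, 1) to a z-value with the inverse standard normal cdf."""
    return NormalDist().inv_cdf(p)

# The example quoted in the text: Phi^{-1}(.95) is about 1.645.
z_95 = z_from_p(0.95)
print(round(z_95, 3))
```

Note that `inv_cdf` raises an error for p equal to 0 or 1, so extreme p-values would need to be clipped before conversion.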
Our motivating example concerns an HIV study of 1391 patients, investigating which of 6 Protease Inhibitor (“PI”) drugs cause mutations at which of 74 sites on the viral genome.
Each patient provided a vector of predictors
$$x = (x_1, x_2, \ldots, x_6), \tag{1.5}$$
$x_j = 1$ or 0 indicating whether or not the patient used $PI_j$, and a vector of responses
$$v = (v_1, v_2, \ldots, v_{74}), \tag{1.6}$$
$v_k = 1$ or 0 indicating whether or not a mutation occurred at site $k$. Remark A of Section 7 describes the study in a little more detail.
For each of the 74 genomic sites, a separate logistic regression analysis was run using all 1391 cases, with that site’s mutation indicators as responses and the PI indicators as predictors. Together these yielded $444 = 6 \times 74$ z-values, one for testing each null hypothesis that drug $j$ does not cause mutations at site $k$, $j = 1, 2, \ldots, 6$ and $k = 1, 2, \ldots, 74$. The z-values were based on the usual approximation
$$z_i = y_i / \mathrm{se}_i, \qquad i = 1, 2, \ldots, 444,$$
(using a single subscript $i$ in place of $(j, k)$) where $y_i$ is the maximum likelihood estimate (MLE) of the logistic regression coefficient and $\mathrm{se}_i$ its approximate large-sample standard error.
Figure 1 shows a histogram of the 444 z-values, with negative $z_i$’s indicating greater mutational effects. The smooth curve $f(z)$ is a natural spline with seven degrees of freedom, fit to the histogram counts by Poisson regression. It emphasizes the central peak near $z = 0$, presumably the large majority of uninteresting drug-site combinations that have negligible mutation effects. Near its center the peak is well-described by a normal density having mean $-0.35$ and standard deviation 1.20, which we will call the empirical null hypothesis,
$$z_i \mid H_i \sim N(-0.35, 1.20^2). \tag{1.8}$$
Section 3 describes the estimation methodology for (1.8), with a brief discussion of the normality assumption in Remark D of Section 7.
The difference between the theoretical null $N(0, 1)$ and empirical null $N(-0.35, 1.20^2)$ may not seem worrisome here but we will see that it substantially affects any simultaneous inference procedure. A more dramatic example is given in Section 6, for a microarray analysis where going from the theoretical to empirical null totally negates any findings of significance.
Situations going in the reverse direction also occur.
In classic situations involving only a single hypothesis test we must, of necessity, employ the theoretical null hypothesis $z \sim N(0, 1)$. The main point of this paper is that large-scale testing situations permit empirical estimation of the null distribution. Sections 3 through 5 concern reasons why the empirical and theoretical null might differ, and which might be preferable in different situations.
Figure 1: Histogram of 444 z-values from the Drug-mutation analysis (horizontal axis: z-values, with more negative values indicating larger mutation effects); smooth curve $f(z)$ is natural spline fit to histogram counts. The central peak near $z = 0$ is approximately $N(-0.35, 1.20^2)$: the “empirical null hypothesis”. Simultaneous hypothesis tests for the 444 cases depend critically on the choice between the empirical or theoretical $N(0, 1)$ null.
There are scientific as well as statistical differences between small-scale and large-scale hypothesis testing situations. A single hypothesis test is most often run with the expectation and hope of rejecting the null, “with 80% power” in a typical clinical trial. Nobody wants to reject 80% of $N = 5000$ null hypotheses. The usual point of large-scale testing is to identify a small percentage of interesting cases that deserve further investigation. While not exactly looking for a needle in a haystack, we don’t want the whole haystack either. An important assumption of what follows is that the proportion of interesting cases is small, perhaps 1% or 5% of $N$, but not more than 10%. This is made explicit in Section 2, in our description of the local false discovery rate as an analytic tool for large-scale testing. There are situations where the 10% limit is irrelevant, for example in constructing prediction models, but these are not considered here.
The terminology “Interesting/Uninteresting” used in this paper in preference to “Significant/Nonsignificant” is discussed near the end of Section 5. We conclude in Sections 7 and 8 with remarks, including most of the technical details, and a summary.
2. The Local False Discovery Rate
It is convenient to discuss large-scale testing problems in terms of the local false discovery rate (fdr), an empirical Bayes version of Benjamini and Hochberg’s (1995) methodology focusing on densities rather than tail areas; see Efron et al. (2001) and Efron and Tibshirani (2002).
We begin with a simple Bayes model. Suppose that the $N$ z-values fall into two classes, “Uninteresting” or “Interesting”, corresponding to whether or not $z_i$ is generated according to the null hypothesis, with prior probabilities $p_0$ and $p_1 = 1 - p_0$ for the classes; and that $z_i$ has density either $f_0(z)$ or $f_1(z)$ depending on its class,
$$\begin{aligned} p_0 &= \mathrm{Prob}\{\text{Uninteresting}\}, &\quad f_0(z) &\ \text{density if Uninteresting (Null)} \\ p_1 &= \mathrm{Prob}\{\text{Interesting}\}, &\quad f_1(z) &\ \text{density if Interesting (Non-Null)}. \end{aligned} \tag{2.1}$$
The smooth curve in Figure 1 estimates the mixture density $f(z)$,
$$f(z) = p_0 f_0(z) + p_1 f_1(z). \tag{2.2}$$
According to Bayes theorem the a posteriori probability of being in the Uninteresting class given $z$ is
$$\mathrm{Prob}\{\text{Uninteresting} \mid z\} = p_0 f_0(z)/f(z). \tag{2.3}$$
Here we define the local false discovery rate to be
$$\mathrm{fdr}(z) \equiv f_0(z)/f(z), \tag{2.4}$$
ignoring the factor $p_0$ in (2.3), so $\mathrm{fdr}(z)$ is an upper bound on $\mathrm{Prob}\{\text{Uninteresting} \mid z\}$. In fact $p_0$ can be roughly estimated, see Remark E, but we are assuming that $p_0$ is near 1, say $p_0 \ge 0.90$, so $\mathrm{fdr}(z)$ is not a flagrant overestimator.
The local fdr provides a useful methodology for identifying Interesting cases in a situation like that of Figure 1: (1) estimate $f(z)$ from the observed ensemble of z-values, for example by the natural spline fit to the histogram counts; (2) assign a null density $f_0(z)$; (3) calculate $\mathrm{fdr}(z) = f_0(z)/f(z)$; (4) report as Interesting those cases with $\mathrm{fdr}(z_i)$ less than some threshold value, perhaps $\mathrm{fdr}(z_i) \le 0.10$. Remark B discusses the close connection between this algorithm and Benjamini and Hochberg’s (1995) method.
This paper concerns the choice of $f_0(z)$, the null hypothesis density. In the drug-mutation example it is crucial whether $f_0$ is taken to be the theoretical or empirical null, $N(0, 1)$ or $N(-0.35, 1.20^2)$. This is illustrated in Figure 2, a close-up view of Figure 1 focusing on the bin containing $z = -3$. The expected number of the 444 $z_i$ values falling into this bin is 6.37 for $f(z)$, and either 0.62 or 3.90 as $f_0(z)$ is $N(0, 1)$ or $N(-0.35, 1.20^2)$. Thus $\mathrm{fdr}(z) = f_0(z)/f(z)$ at $z = -3$ is estimated to be either
$$\mathrm{fdr}(-3) = \begin{cases} 0.097 & \text{using theoretical null } N(0, 1) \\ 0.612 & \text{using empirical null } N(-0.35, 1.20^2). \end{cases}$$
In this bin, changing from the theoretical to empirical null changes our inferences from Interesting to definitely Uninteresting.
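The two fdr estimates for the bin containing z = −3 follow directly from the expected bin counts quoted above. A minimal check, where the function name `local_fdr` is only illustrative:

```python
def local_fdr(null_count: float, mixture_count: float) -> float:
    """Local fdr for one bin: expected count under f0 divided by expected count under f."""
    return null_count / mixture_count

expected_mixture = 6.37                              # expected count under the fitted f(z)
fdr_theoretical = local_fdr(0.62, expected_mixture)  # f0 = N(0, 1)
fdr_empirical = local_fdr(3.90, expected_mixture)    # f0 = N(-0.35, 1.20^2)

print(round(fdr_theoretical, 3), round(fdr_empirical, 3))  # 0.097 0.612
```

With a reporting threshold of 0.10, the bin is reported as Interesting under the theoretical null and not under the empirical null.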
Figure 2: Close-up view of the bin containing $z = -3$ in Figure 1. Expected number in bin: 6.37 for $f(z)$, 0.62 for $f_0 = N(0, 1)$, 3.90 for $f_0 = N(-0.35, 1.20^2)$, the empirical null. Corresponding estimates of $\mathrm{fdr}(-3)$: 0.097 for $N(0, 1)$ versus 0.612 for $N(-0.35, 1.20^2)$. Should we report the cases in this bin as Interesting?
Figure 3 compares the two estimates of $\log \mathrm{fdr}(z)$ over most of the z scale. 18 of the 444 z-values have $\mathrm{fdr}(z) < 0.10$ for $f_0 = N(0, 1)$ but $> 0.10$ for $f_0 = N(-0.35, 1.20^2)$, with 17 of these at the left end of the scale. All told the empirical null yields only two-thirds as many cases with $\mathrm{fdr} < 0.10$ as the theoretical null, 35 compared to 53.
3. Estimating the Empirical Null Distribution
Our estimate of the empirical null distribution for the Drug-mutation data was obtained in two steps: the curve $f(z)$ shown in Figure 1 was fit to the histogram counts by Poisson regression, and then the center and half-width of the central peak, say $\delta_0$ and $\sigma_0$, were obtained from $f(z)$,
$$\delta_0 = \arg\max\{f(z)\} \quad\text{and}\quad \sigma_0 = \left[-\frac{d^2}{dz^2}\log f(z)\right]_{\delta_0}^{-1/2}, \tag{3.1}$$
yielding $(\delta_0, \sigma_0) = (-0.35, 1.20)$. Details are given in Remark D, where the possibility of a non-normal empirical null distribution is briefly discussed.
Figure 3: Comparison of estimates of $\log \mathrm{fdr}(z)$ for the Drug-mutation data; empirical null estimate (solid curve) declines more slowly than theoretical null estimate (dotted). Dashes indicate the 444 z-values. 17 cases on left have $\mathrm{fdr}(z) < 1/10$ for theoretical but $> 1/10$ for empirical null.
More direct estimation methods for $f_0$ seem possible, for example estimating $\delta_0$ by the median of the z-values. Suppose though that 10% of the z-values came from the non-null distribution and all of these were located at the far left end of Figure 1. Then the median of all the z’s would be the 4/9 quantile of the actual null distribution, not its median, yielding a badly biased estimate of $\delta_0$. Similar comments apply to estimating $\sigma_0$; see Remark D. Method (3.1) does not require preliminary estimates of the proportion $p_0$ in the null population of (2.1), a considerable practical advantage.
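The 4/9 figure can be verified with a short calculation. The sketch below assumes, as an idealization of the scenario above, that every non-null z-value lies to the left of every null z-value and that the null is normal; the function name is hypothetical.

```python
from statistics import NormalDist

def null_quantile_of_overall_median(frac_nonnull_left: float) -> float:
    """Quantile of the null distribution reached by the overall median when a
    given fraction of cases is non-null and lies entirely to the left."""
    p0 = 1.0 - frac_nonnull_left
    return (0.5 - frac_nonnull_left) / p0

q = null_quantile_of_overall_median(0.10)    # (0.5 - 0.1) / 0.9 = 4/9
shift_in_sd_units = NormalDist().inv_cdf(q)  # negative: the median sits left of delta_0

print(round(q, 4), round(shift_in_sd_units, 3))
```

Under these assumptions the median underestimates the null center by roughly 0.14 null standard deviations.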
How accurate are the estimates $(-0.35, 1.20)$? The usual standard error approximations for a Poisson regression fit are not appropriate here since the $z_i$’s are not independent of each other. A nonparametric bootstrap analysis was performed instead, with the 1391 80-dimensional vectors $(x, v)$, (1.5)-(1.6), as the resampling units. This gave .09 and .08 for the bootstrap standard errors of $\delta_0$ and $\sigma_0$ respectively, i.e.
$$(\delta_0, \sigma_0) = (-0.35, 1.20) \pm (.09, .08). \tag{3.2}$$
It seems quite unlikely that estimation error alone accounts for the difference between the empirical null and the theoretical values $(\delta_0, \sigma_0) = (0, 1)$. (Notice that this type of bootstrap analysis, which requires independent sampling units, is not applicable to the microarray example of Section 6, where we expect correlations among the genes.)
The next two sections concern other possible causes for empirical/theoretical differences, diagnostics for these causes, and their interpretations. Our list is not exhaustive and in fact the microarray example of Section 6 demonstrates another form of pathology.
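A nonparametric bootstrap over independent sampling units can be sketched generically as follows. This is not the code used for (3.2): `bootstrap_se`, its parameters, and the toy data are assumptions for illustration. In the Drug-mutation setting each unit would be one patient's (x, v) vector, and `statistic` would rerun the 74 logistic regressions and recompute the center or width of the central peak.

```python
import random
import statistics
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def bootstrap_se(units: Sequence[T],
                 statistic: Callable[[Sequence[T]], float],
                 n_boot: int = 200,
                 seed: int = 0) -> float:
    """Standard deviation of a statistic over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(units)
    replicates = []
    for _ in range(n_boot):
        resample = [units[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    return statistics.stdev(replicates)

# Toy usage: bootstrap standard error of a sample mean.
toy_units = [float(i % 7) for i in range(50)]
se_mean = bootstrap_se(toy_units, statistics.mean)
```

Resampling whole units keeps any dependence among the z-values computed from the same patients intact, which is why the units must themselves be independent.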
4. Permutation Tests and Unobserved Covariates
The theoretical $N(0, 1)$ null hypothesis (1.4) is usually based on asymptotic approximations like those for the logistic regression coefficients in the Drug-mutation study. Permutation methods can be used to avoid these approximations, perhaps in the hope that an improved theoretical null will more closely match the empirical null.
This was not the case for the Drug-mutation data. Permutation testing was implemented by randomly pairing the 1391 predictor vectors $x$, (1.5), with the 1391 response vectors $v$, (1.6), and recalculating the 444 z-values. This whole process was independently repeated 20 times, yielding a total of $20 \times 444$ permutation z’s. Their distribution was well approximated by a $N(0, 0.965^2)$ density (the “permutation null”) except for a prominent spike near $z = 0.3$. In this case the permutation-improved theoretical null differs more rather than less from the empirical null $N(-0.35, 1.20^2)$.
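The random re-pairing step can be sketched as below. The function name and the toy vectors are hypothetical; recomputing the 444 z-values from each permuted data set would require the logistic regression fits, which are omitted here.

```python
import random
from typing import List, Sequence, Tuple

def permute_pairing(xs: Sequence[Sequence[int]],
                    vs: Sequence[Sequence[int]],
                    rng: random.Random) -> List[Tuple[Sequence[int], Sequence[int]]]:
    """Pair each predictor vector with a randomly chosen response vector,
    using every response vector exactly once."""
    shuffled = list(vs)
    rng.shuffle(shuffled)
    return list(zip(xs, shuffled))

# Toy data: 4 "patients" with 2 predictors and 3 response sites each.
toy_x = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_v = [[0, 0, 1], [1, 0, 0], [0, 1, 1], [0, 0, 0]]
paired = permute_pairing(toy_x, toy_v, random.Random(1))
```

Shuffling whole response vectors breaks any association between drugs and mutations while preserving the correlation structure within x and within v.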
Permutation methods are popular in the microarray literature as a way of avoiding assumptions and approximations, see Efron et al. (2001) or Dudoit et al. (2003), but they do not automatically resolve the question of an appropriate null hypothesis. This can be seen in the following hypothetical example, which is a stylized version of the two-sample microarray testing problem in Section 6: the data $x_{ij}$ comes from $N$ simultaneous two-sample experiments, each comparing $2n$ subjects,
$$x_{ij}: \quad \begin{cases} \text{Controls} & j = 1, 2, \ldots, n \\ \text{Treatments} & j = n + 1, n + 2, \ldots, 2n \end{cases} \qquad (i = 1, \ldots, N).$$
The $i$th test statistic $Y_i$ is the usual two-sample t-statistic, comparing Treatments versus Controls for the $i$th experiment.
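Suppose that, unknown to the statistician, the data was actually generated from
$$x_{ij} = u_{ij} + \frac{I_j}{2}\beta_i, \qquad u_{ij} \sim N(0, 1), \quad \beta_i \sim N(0, \sigma^2), \tag{4.2}$$
with the $u_{ij}$ and $\beta_i$ mutually independent and
$$I_j = \begin{cases} -1 & j = 1, 2, \ldots, n \\ \phantom{-}1 & j = n + 1, \ldots, 2n. \end{cases} \tag{4.3}$$
Then it is easy to show that the statistics $Y_i$ follow a dilated t-distribution with $2n - 2$ degrees of freedom,
$$Y_i \sim \left(1 + \frac{n}{2}\sigma^2\right)^{1/2} t_{2n-2}, \tag{4.4}$$
while the permutation distribution, permuting Treatments and Controls within each experiment, has nearly a standard $t_{2n-2}$ null distribution. So for example if $\sigma^2 = 2/n$, the empirical null will be $\sqrt{2}$ times as wide as the permutation null.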
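A small simulation illustrates the dilation. It assumes the generating model is x_ij = u_ij + (I_j/2) β_i with u_ij ~ N(0, 1), β_i ~ N(0, σ²), and I_j = −1 for Controls and +1 for Treatments, so that the dilation factor is (1 + nσ²/2)^{1/2}. The function names, the seed, and the sample sizes are arbitrary choices for the sketch.

```python
import math
import random
import statistics
from typing import List

def dilation_factor(n: int, sigma2: float) -> float:
    """Width of the dilated t-distribution relative to a standard t_{2n-2}."""
    return math.sqrt(1.0 + n * sigma2 / 2.0)

def simulate_t_statistics(n_experiments: int, n: int, sigma2: float,
                          seed: int = 0) -> List[float]:
    """Two-sample t-statistics when each experiment carries a random effect beta_i."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    ys = []
    for _ in range(n_experiments):
        beta = rng.gauss(0.0, sigma)
        controls = [rng.gauss(0.0, 1.0) - beta / 2.0 for _ in range(n)]
        treatments = [rng.gauss(0.0, 1.0) + beta / 2.0 for _ in range(n)]
        # Pooled variance with equal group sizes: the average of the two sample variances.
        pooled_var = (statistics.variance(controls) + statistics.variance(treatments)) / 2.0
        se = math.sqrt(pooled_var * 2.0 / n)
        ys.append((statistics.mean(treatments) - statistics.mean(controls)) / se)
    return ys

n = 10
ys = simulate_t_statistics(3000, n, sigma2=2.0 / n)
t_sd = math.sqrt((2 * n - 2) / (2 * n - 4))     # standard deviation of a t_{2n-2} variate
observed_ratio = statistics.pstdev(ys) / t_sd   # should be close to sqrt(2)
```

Because β_i shifts both groups by a constant within an experiment, the pooled variance is unaffected and only the numerator of the t-statistic is inflated.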
The quantity $\beta_i$ in (4.2)-(4.3) causes the only consistent differences between Treatments and Controls in experiment $i$. If $\beta_i$ is a dependable feature of the $i$th experiment, and would appear again with the same value in a replication of the study, then the permutation null $t_{2n-2}$ is a reasonable basis for inference. With $n$ large and $\sigma^2 = 2/n$, it results in $\mathrm{fdr}(y_i) \le 0.10$ for the most extreme 2% of the observed t-statistics, favoring those with the largest values of $|\beta_i|$.
Suppose though that $\beta_i$ is not inherent to experiment $i$, but rather a purely random effect that would have a different value and perhaps a different sign if the study were repeated; that is, $\beta_i$ is part of the noise and not part of the signal. In this case the appropriate choice is the empirical null (4.4). The equivalent of Figure 1 will be all central peak, with no interesting outliers, and there will be no cases having small values of $\mathrm{fdr}(y_i)$. This is appropriate since now there is no real Treatment effect.
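The 2% figure for the permutation-null case can be checked under two assumptions made here for the sketch: n is large enough that t-distributions are effectively normal, so f0 = N(0, 1) and f = N(0, 2), and the reporting threshold is fdr ≤ 0.10. Then fdr(y) = √2 exp(−y²/4).

```python
import math
from statistics import NormalDist

def fdr_permutation_null(y: float, dilation: float = math.sqrt(2.0)) -> float:
    """fdr(y) = f0(y) / f(y) with f0 = N(0, 1) and f = N(0, dilation^2)."""
    null = NormalDist(0.0, 1.0)
    mixture = NormalDist(0.0, dilation)
    return null.pdf(y) / mixture.pdf(y)

threshold = 0.10
# Solve sqrt(2) * exp(-y^2 / 4) = threshold for the cutoff |y|.
y_cut = math.sqrt(4.0 * math.log(math.sqrt(2.0) / threshold))
# Fraction of the dilated statistics lying beyond the cutoff in either tail.
tail_fraction = 2.0 * (1.0 - NormalDist(0.0, math.sqrt(2.0)).cdf(y_cut))

print(round(y_cut, 2), round(tail_fraction, 3))
```

The cutoff comes out near |y| = 3.26 and the two-sided tail fraction near 0.02, in line with the statement about the most extreme 2%.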
In this last context βi acts as an unobserved covariate, a quantity which the statistician would use to correct the Treatment-Control comparison if it were observable. Unobserved covariates are ubiquitous in observational studies. There are several obvious ones in the Drug-mutation study: personal characteristics of the patients such as age and gender, prior use of AZT and other non-PI drugs, years since infection, geographical location, etc.
The effect of important unobserved covariates is to dilate the null hypothesis density $f_0(z)$, as happens in (4.4). Unobserved covariates will also dilate the “Interesting” density $f_1(z)$ in (2.1), and the mixture density $f(z)$, (2.2). However an empirical fitting method for estimating $f(z)$, such as the spline fit in Figure 1, automatically includes any dilation effects. In estimating $\mathrm{fdr}(z) = f_0(z)/f(z)$ it is important to also allow for dilation of the numerator $f_0$. This is a strong argument for preferring the empirical null hypothesis in observational studies.
5. A Structural Model for the z-values
The Bayesian specifications (2.1) underlying our fdr results have the advantage of not requiring a structural model for the z-values; in particular it is not necessary to motivate, or even describe, the non-null density $f_1(z)$. There is however a simple structural model that helps elucidate the Interesting-Uninteresting distinction.
The structural model assumes that $z_i$, the $i$th z-value, is normally distributed around a “true value” $\mu_i$, its expectation,
$$z_i \sim N(\mu_i, 1) \quad \text{for } i = 1, 2, \ldots, N, \tag{5.1}$$
with $\mu_i$ having some prior distribution $g(\mu)$,
$$\mu_i \sim g(\mu) \quad \text{for } i = 1, 2, \ldots, N. \tag{5.2}$$
Structure (5.1) is often a good approximation, see Section 4 of Efron (1988), and in fact proved reasonably accurate in the bootstrap experiment giving (3.2). Together (5.1)-(5.2) say that the mixture density $f(z)$, (2.2), is a convolution of $g(\mu)$ with the standard normal density $\varphi(z)$,
$$f(z) = \int_{-\infty}^{\infty} \varphi(z - \mu)\, g(\mu)\, d\mu \tag{5.3}$$
(with the understanding that $g(\mu)$ may include discrete probability atoms).
As a first application of the structural model, suppose we insist that $g(\mu)$ put probability $p_0$ on $\mu = 0$,
$$\mathrm{Prob}_g\{\mu = 0\} = p_0, \tag{5.4}$$
for some fixed value of $p_0$ between 0 and 1. This amounts to our original Bayes model (2.1) with $p_0 = \mathrm{Prob}\{\text{Uninteresting}\}$, $f_0(z)$ the theoretical null hypothesis $N(0, 1)$, and
$$f_1(z) = \int_{\mu \ne 0} \varphi(z - \mu)\, g(\mu)\, d\mu \,/\, (1 - p_0).$$
In the context of this paper, $p_0$ should be 0.90 or greater.
For any $f(z)$ of the convolution form (5.3) let $(\delta_g, \sigma_g)$ be the center and width parameters $(\delta_0, \sigma_0)$ defined by (3.1). Figure 4 answers the following question: for a given choice of $p_0$ in constraint (5.4), what are the maximum possible values of $|\delta_g|$ and of $\sigma_g$,
$$\delta_{\max} = \max\{|\delta_g| \mid p_0\} \quad\text{and}\quad \sigma_{\max} = \max\{\sigma_g \mid p_0\}. \tag{5.6}$$
Three curves appear for $\sigma_{\max}$: for the general case just described, for the case where the non-zero component of $g(\mu)$ is required to be symmetric around zero, and for the case where it is also required to be normal. Here we will only mention the general case. Remark F discusses the solution of (5.6), which turns out to have a simple “single-point” form.
Figure 4: Maximum possible values of the center and width parameters $(\delta_0, \sigma_0)$, (3.1), when the structural model (5.1)-(5.3) is constrained to put probability $p_0$ on $\mu = 0$. For $1 - p_0 \le 0.10$ the maxima are not much greater than the theoretical null values $(0, 1)$, as shown in Table 1.
The notable feature of Figure 4 is that for $p_0 \ge 0.90$, our preferred realm for large-scale hypothesis testing, $(\delta_{\max}, \sigma_{\max})$ must be quite near the theoretical null values $(0, 1)$:
$$\delta_{\max} \le 0.07 \quad\text{and}\quad \sigma_{\max} \le 1.04.$$
Table 1 shows $(\delta_{\max}, \sigma_{\max})$ for various choices of $p_0$. We see that the “Interesting” probability $1 - p_0$ would have to be nearly 0.30, very big by the standards of large-scale testing, in order to obtain the observed Drug-mutation values $(\delta_0, \sigma_0) = (-0.35, 1.20)$. The inference is that uninteresting effects, such as the unobserved covariates of Section 4, are dilating the null distribution.
Table 1: Value of $\sigma_{\max}$ and $\delta_{\max}$ as a function of $1 - p_0$, (5.4).
The main point here is that our measures (3.1) of center and width are quite robust to the arrangement of Interesting values $\mu_i$ as long as the Interesting percentage does not exceed 10%. If $(\delta_0, \sigma_0)$ for the central peak is much different from $(0, 1)$, as it is in Figure 1, then use of the theoretical null is bound to result in identifying an uncomfortably large percentage of supposedly Interesting cases.
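The robustness claim can be illustrated numerically with the single-point configuration quoted in Remark F, which puts probability 0.90 on µ = 0 and 0.10 on µ1 = 1.47. The sketch below evaluates the center and width of (3.1) for that two-point mixture; the ternary search and the finite-difference second derivative are implementation choices made here, not the paper's method.

```python
import math
from statistics import NormalDist
from typing import Tuple

PHI = NormalDist()  # standard normal

def log_f(z: float, p0: float, mu1: float) -> float:
    """Log of the mixture density p0 * phi(z) + (1 - p0) * phi(z - mu1)."""
    return math.log(p0 * PHI.pdf(z) + (1.0 - p0) * PHI.pdf(z - mu1))

def center_and_width(p0: float, mu1: float, lo: float = -1.0, hi: float = 1.0,
                     h: float = 1e-3) -> Tuple[float, float]:
    """Mode of f and [-(log f)'']^(-1/2) at the mode, as in (3.1)."""
    for _ in range(200):                      # ternary search for the mode on [lo, hi]
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_f(m1, p0, mu1) < log_f(m2, p0, mu1):
            lo = m1
        else:
            hi = m2
    delta = (lo + hi) / 2.0
    second = (log_f(delta + h, p0, mu1) - 2.0 * log_f(delta, p0, mu1)
              + log_f(delta - h, p0, mu1)) / h ** 2
    return delta, (-second) ** -0.5

delta_g, sigma_g = center_and_width(0.90, 1.47)
print(round(delta_g, 2), round(sigma_g, 2))
```

For this configuration the width comes out close to 1.04 and the center stays below 0.07, consistent with the bounds quoted for p0 ≥ 0.90.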
We can pursue this last point for the Drug-mutation data by removing constraint (5.4).
Figure 5 shows an unconstrained estimate of $g(\mu)$. For computational simplicity $g(\mu)$ was assumed to be discrete, with at most $J = 8$ support points $\mu_1, \mu_2, \ldots, \mu_J$, so that (5.3) becomes
$$f(z) = \sum_{j=1}^{J} \pi_j \varphi(z - \mu_j), \tag{5.8}$$
$\pi_j$ being the probability $g$ puts on $\mu_j$, with $\pi_j \ge 0$ and $\sum_j \pi_j = 1$. A non-linear minimization program was employed to find the best-fit curve of form (5.8) to the histogram counts in Figure 1, using Poisson deviance as the fitting criterion. The vertical bars in Figure 5 are located at the resulting 8 values $\mu_j$, with the bar’s height proportional to $\pi_j$. For example the little bar at far left represents an atom of probability $\pi_1 = .015$ at $\mu_1 = -10.9$. The resulting $f(z)$ estimate (5.8) closely resembles the natural spline fit of Figure 1. Table 2 shows all 8 $(\pi_j, \mu_j)$ pairs.
Suppose for a moment that the estimated $g(\mu)$ is exactly correct, so 1.5% of the 444 cases have their $\mu_i$’s equal to $-10.9$, 1.3% have $-7.0$, etc., and that an oracle has told us the eight $(\pi_j, \mu_j)$ values. Given an observed $z_i$ we can now calculate $\mathrm{Prob}\{\text{Uninteresting} \mid z\}$, (2.3), exactly, once the scientist specifies the definition of Uninteresting versus Interesting.
Figure 5: Best-fit discrete mixing function $g(\mu)$, (5.2), for Drug-mutation data; bars located at support points $\mu_j$, heights proportional to weights $\pi_j$; tall bar at $\mu_j = 0$ has weight $\pi_j = 0.61$. Solid curve is best-fit estimate $f(z) = \sum_j \pi_j \varphi(z - \mu_j)$; it closely matches the natural spline fit of Figure 1.
It seems obvious that the 60.8% at $\mu_j = 0$ are Uninteresting, and that the 10.6% at $\mu_j = -10.9, -7.0, -4.9$, and $6.1$ deserve Interesting status. However the status of the 28.6% at $\mu_j = -1.8, -1.1$, and $2.4$ is less clear.
If the 28.6% are deemed Interesting, this leaves only the 60.8% at $\mu_j = 0$ as Uninteresting. In terms of our Bayes model (2.1) we have $p_0 = .608$ and $f_0(z) = N(0, 1)$, the theoretical null. About 174 of the 444 cases will be identified as Interesting, too many for a typical screening exercise. Shifting the 28.6% to the Uninteresting classification increases $p_0$ to $.608 + .286 = .894$, a more manageable value, and changes $f_0(z)$ to the version of (5.8) supported on the four Uninteresting $\mu_j$’s; this is approximately $N(-0.34, 1.19^2)$, almost the same as the empirical null (1.8).
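The bookkeeping behind these two definitions of “Uninteresting” is simple arithmetic on the three group weights quoted in the text. The variable names below are illustrative only.

```python
N_CASES = 444
weight_at_zero = 0.608    # atom at mu = 0
weight_moderate = 0.286   # atoms at mu = -1.8, -1.1, 2.4
weight_extreme = 0.106    # atoms at mu = -10.9, -7.0, -4.9, 6.1

# Definition 1: only mu = 0 counts as Uninteresting (theoretical null).
p0_narrow = weight_at_zero
# Definition 2: the moderate atoms are also Uninteresting (close to the empirical null).
p0_wide = weight_at_zero + weight_moderate

interesting_narrow = N_CASES * (1.0 - p0_narrow)   # about 174 cases
interesting_wide = N_CASES * (1.0 - p0_wide)       # about 47 cases
print(round(interesting_narrow), round(interesting_wide))
```

Moving the moderate atoms into the null class cuts the expected number of Interesting cases from about 174 to about 47.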
In other words the definition of “Interesting” determines the relevant choice of the null hypothesis $f_0$. If we want to keep the proportion of Interesting cases manageably small then $f_0(z)$ has to grow wider than $N(0, 1)$.
Use of the term “Interesting” rather than “Significant” reflects a difference in intent between large-scale and classical testing. In the hypothetical context of Figure 5 and Table 2, all of the 39.2% of the cases with non-zero $\mu_i$’s would eventually be declared “significantly different from zero” if we vastly increased the sample size of patients. Section 4 suggests that minor deviations from $N(0, 1)$ might arise from scientifically uninteresting causes such as unobserved covariates. However even if a modestly non-zero $\mu_i$ is genuine in some sense, it may still be Uninteresting when viewed in comparison with an ensemble of more dramatic possibilities. Nonsignificant implies Uninteresting but not conversely.
6. A Microarray Example
Microarrays have become a prime source of large-scale simultaneous testing problems. Figure 6 relates to a well-known microarray experiment concerning differences between two types of genetic mutations causing increased breast cancer risk, “BRCA1” and “BRCA2”; see Hedenfalk et al. (2001), also Efron and Tibshirani (2002).
The experiment included 15 breast cancer patients, seven with the BRCA1 mutation and eight with BRCA2. Each woman’s tumor was analyzed on a separate microarray, each microarray reporting on the same set of $N = 3226$ genes. For each gene the two-sample t-statistic $y_i$ comparing the 7 BRCA1 responses with the 8 BRCA2’s was computed. The $y_i$’s were then converted to z-values,
$$z_i = \Phi^{-1}(F_{13}(y_i)), \tag{6.1}$$
where $F_{13}$ is the cdf of a standard t-distribution with 13 degrees of freedom. Figure 6 displays the histogram of the 3226 z-values.
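The t-to-z conversion can be sketched with the standard library alone. The t cdf is computed here by Simpson's rule, which is an implementation choice for the sketch (a statistics package would normally supply it); the function names are hypothetical.

```python
import math
from statistics import NormalDist

def t_pdf(t: float, df: int) -> float:
    """Density of a standard t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return c * (1.0 + t * t / df) ** (-(df + 1) / 2.0)

def t_cdf(y: float, df: int, steps: int = 2000) -> float:
    """t cdf via Simpson's rule on [0, |y|] plus symmetry (steps must be even)."""
    a = abs(y)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3.0
    return 0.5 + half if y >= 0 else 0.5 - half

def z_from_t(y: float, df: int = 13) -> float:
    """z = Phi^{-1}(F_df(y)); fails for |y| so large that the cdf rounds to 0 or 1."""
    return NormalDist().inv_cdf(t_cdf(y, df))

print(round(z_from_t(2.160), 2))  # the upper 2.5% point of t_13 maps to about 1.96
```

The transformation pulls in the heavy t tails, so a t-statistic of 2.16 on 13 degrees of freedom corresponds to a z-value of about 1.96.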
Table 2: Weights $\pi_j$ and locations $\mu_j$ for 8-point best-fit estimate $g(\mu)$ of Figure 5. Which locations we deem Interesting versus Uninteresting determines the choice between the theoretical or empirical null hypothesis. (Numerical results accurate to one decimal place.)
Figure 6: Histogram of $N = 3226$ z-values from breast cancer study. The theoretical $N(0, 1)$ null is much narrower than the central peak, which has $(\delta_0, \sigma_0) = (-0.02, 1.58)$. In this case the central peak seems to include the entire histogram.
The central peak is wider here than in Figure 1, with center-width estimates $(\delta_0, \sigma_0) = (-0.02, 1.58)$. More importantly, the histogram seems to be all central peak, with no interesting outliers such as those seen at the left of Figure 1. This was reflected in the local fdr calculations: using the theoretical $N(0, 1)$ null yielded 35 genes having $\mathrm{fdr}(z_i) < 0.1$, those with $|z_i| > 3.35$; using the empirical $N(-0.02, 1.58^2)$ null, no genes at all had $\mathrm{fdr} < 0.1$ (or for that matter $\mathrm{fdr} < 0.9$, the histogram in fact being a little short-tailed compared to $N(-0.02, 1.58^2)$).
There is ample reason to distrust the theoretical null in this case. The microarray experiment for all its impressive technology is still an observational study, with a wide range of unobserved covariates possibly distorting the BRCA1-BRCA2 comparison.
Another reason for doubt can be found in the data itself. The fdr methodology does not require independence of the $y_i$’s or $z_i$’s across genes. However it does require that the 15 measurements for each gene be independent across the microarrays. Otherwise the two-sample t-statistic $y_i$ will not have an $F_{13}$ null distribution, not even approximately.
Unfortunately the experimental methodology used in the breast cancer study seems to have induced substantial correlations among the various microarrays. In particular, as discussed in Remark G, the first four microarrays in the BRCA2 group were mutually correlated, and likewise the last four. Correlations reduce the effective sample size for a two-sample t-statistic, just the type of effect that would induce overdispersion in (6.1).
This does not say that there are no BRCA1-BRCA2 differences, only that it is dangerous to compare the t-statistics with a standard $t_{13}$ null distribution, even if simultaneous inference is not an issue.
7. Remarks
A. Drug-mutation Study
The database for the Drug-mutation study, Wu et al. (2002), included 2497 patients having HIV subtype B, of whom 1391 had received at least one of six popular Protease Inhibitor drugs: amprenavir, indinavir, lopinavir, nelfinavir, ritonavir, or saquinavir. Among the 1391, the mean number of PI drugs taken was 2.05 per patient.
Amino acid sequences were obtained at all 99 positions on the HIV protease gene, and mutations from wild-type recorded; 25 positions showed 3 or fewer mutations among the 1391 patients, deemed too few for analysis, leaving 74 positions for the investigation here.
Each of the 74 individual logistic regressions included an intercept term as well as the six PI main effects, but no other covariates.
B. The Local False Discovery Rate
The local fdr, (2.3) or (2.4), is closely related to Benjamini and Hochberg’s (1995) “tail-area” False Discovery Rate, as discussed in Efron et al. (2001) and Efron and Tibshirani (2002). Substituting cdf’s $F_0$ and $F$ for the densities $f_0$ and $f$, Bayes theorem gives a tail-area version of (2.3),
$$\mathrm{Prob}\{\text{Uninteresting} \mid z \le z_0\} = p_0 F_0(z_0)/F(z_0) \equiv \mathrm{FDR}(z_0).$$
$\mathrm{FDR}(z_0)$ turns out to be the conditional expectation of $\mathrm{fdr}(z) \equiv p_0 f_0(z)/f(z)$ given $z \le z_0$,
$$\mathrm{FDR}(z_0) = \int_{-\infty}^{z_0} \mathrm{fdr}(z) f(z)\, dz \Big/ \int_{-\infty}^{z_0} f(z)\, dz. \tag{7.2}$$
Benjamini and Hochberg work in a frequentist framework but their False Discovery Rate control rule can be stated in empirical Bayes terms: given $F_0$, which they usually take to be what we called the theoretical null, estimate $\mathrm{FDR}(z_0)$ by
$$\widehat{\mathrm{FDR}}(z_0) = p_0 F_0(z_0)/\bar{F}(z_0),$$
where $\bar{F}$ is the empirical cdf of the $z_i$’s; for a desired control level $\alpha$, say $\alpha = .05$, define
$$z_0 = \arg\max_z \{\widehat{\mathrm{FDR}}(z) \le \alpha\}; \tag{7.4}$$
then rejecting all cases with $z_i \le z_0$ gives an expected (frequentist) rate of false discoveries no greater than $\alpha$.
With $z_0$ as in (7.4), relation (7.2) (applied to the estimated versions of FDR, fdr, and $f$) says that the weighted average of $\mathrm{fdr}(z_i)$ for the cases rejected by the FDR level-$\alpha$ rule is itself $\alpha$. As an example take $\alpha = .05$ and $f_0$ equal to the theoretical $N(0, 1)$ null. Applying the FDR control rule to the negative side of Figure 1’s Drug-mutation data rejects the null hypothesis for the 56 cases having $z_i \le -2.61$; the corresponding 56 values of $\mathrm{fdr}(z_i)$ have weighted average $\alpha = .05$. They vary from nearly zero at the far left to .19 at the boundary value $z = -2.61$, justifying the name “local”: $z_i$’s near the boundary are more likely to be false discoveries than the overall .05 rate suggests.
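The left-tail control rule in its empirical Bayes form can be sketched as follows. The function name, the default p0 = 1, the handling of ties, and the toy z-values are assumptions for illustration; this is not the Drug-mutation computation.

```python
from statistics import NormalDist
from typing import Optional, Sequence

def fdr_cutoff(zs: Sequence[float], alpha: float = 0.05, p0: float = 1.0,
               null: NormalDist = NormalDist(0.0, 1.0)) -> Optional[float]:
    """Largest observed z0 whose estimated left-tail FDR, p0 * F0(z0) / Fbar(z0),
    is at most alpha; None if no observed value qualifies."""
    ordered = sorted(zs)
    n = len(ordered)
    best = None
    for rank, z0 in enumerate(ordered, start=1):
        f_bar = rank / n                       # empirical cdf at z0 (ties ignored)
        if p0 * null.cdf(z0) / f_bar <= alpha:
            best = z0
    return best

toy_zs = [-4.0, -3.5, -3.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
cut_theoretical = fdr_cutoff(toy_zs)                               # N(0, 1) null
cut_wide_null = fdr_cutoff(toy_zs, null=NormalDist(-0.02, 1.58))   # a wider null
```

On the toy data the N(0, 1) null rejects the three cases at or below −3.0, while the wider null rejects nothing, which mirrors the dependence on F0 discussed in this remark.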
Our concern with a correct choice of null hypothesis applies to FDR just as well as fdr. In the microarray study, FDR with $F_0 = N(0, 1)$ gives 24 significant genes at $\alpha = .05$, while $F_0 = N(-.02, 1.58^2)$ gives none. In fact any simultaneous testing procedure, the popular Westfall-Young method (1993) for example, will depend on a correct assessment of p-values for the individual cases, i.e. on the choice of $F_0$.
C. Estimating f(z)
The Poisson regression method used in Figure 1 to estimate the mixture density $f(z)$, (2.2), originates in an idea of Lindsey described in Section 2 of Efron and Tibshirani (1996): the range of the sample $z_1, z_2, \ldots, z_N$ is partitioned into $K$ equal intervals, with interval $k$ having midpoint $x_k$ and containing count $s_k$ of the $N$ z-values; the expectation $\lambda_k$ of $s_k$ is nearly proportional to $f_k \equiv f(x_k)$, and if the $z_i$’s are independent the counts approximate independent Poisson variates,
$$s_k \stackrel{\mathrm{ind}}{\sim} \mathrm{Po}(\lambda_k), \qquad \lambda_k = c f_k \qquad [k = 1, 2, \ldots, K], \tag{7.5}$$
$c$ a constant depending on $N$ and the interval length.
Lindsey’s method is to estimate the $\lambda_k$’s with a Poisson regression, which because of (7.5) amounts to estimating a scaled version of the $f_k$’s; in other words estimating $f(z)$. $K$ equals 60 in Figure 1, with the regression model being a natural spline with 7 degrees of freedom, roughly equivalent to a sixth degree polynomial fit in $z$.
Poisson regression based on (7.5) is almost fully efficient for estimating $f(z)$ if the $z_i$’s are independent. Here we do not expect independence but we still have the expectation of $s_k$ proportional to $f_k$. The Poisson regression method will still tend to unbiasedly estimate $f(z)$, assuming the regression model is sufficiently flexible, though we may lose estimating efficiency.
The bootstrap analysis that gave the standard errors in (3.2) was also used to check (7.5). This turned out to be surprisingly accurate for the Drug-mutation data. If not we might have used the bootstrap estimate of covariance for the $s_k$’s to motivate a more efficient estimation procedure, though this is unlikely to be important for large values of $N$. In any case bootstrap analyses as in (3.2) will provide legitimate standard errors for the Poisson regression whether or not (7.5) is valid.
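Lindsey's idea can be sketched with a log-linear Poisson regression on the bin counts. To keep the example short and dependency-free, a polynomial in the bin midpoint stands in for the natural spline used in the paper, and the fit uses iteratively reweighted least squares written out by hand; both are assumptions of this sketch, as are all function names and the synthetic counts.

```python
import math
from statistics import NormalDist
from typing import List, Sequence

def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def poisson_poly_fit(xs: Sequence[float], counts: Sequence[float],
                     degree: int = 2, n_iter: int = 50) -> List[float]:
    """IRLS fit of log(lambda_k) = sum_d beta_d * x_k**d to Poisson counts s_k."""
    p = degree + 1
    basis = [[x ** d for d in range(p)] for x in xs]
    eta = [math.log(s + 0.5) for s in counts]           # starting linear predictor
    beta = [0.0] * p
    for _ in range(n_iter):
        lam = [math.exp(e) for e in eta]
        work = [e + (s - l) / l for e, s, l in zip(eta, counts, lam)]  # working response
        xtwx = [[sum(l * row[i] * row[j] for l, row in zip(lam, basis))
                 for j in range(p)] for i in range(p)]
        xtwz = [sum(l * row[i] * w for l, row, w in zip(lam, basis, work))
                for i in range(p)]
        beta = solve_linear(xtwx, xtwz)
        eta = [sum(b * v for b, v in zip(beta, row)) for row in basis]
    return beta

# Synthetic check: expected counts for 444 values from N(-0.35, 1.20^2) in K = 60 bins.
K, lo, hi = 60, -3.95, 3.25
width = (hi - lo) / K
mids = [lo + width * (k + 0.5) for k in range(K)]
counts = [444 * width * NormalDist(-0.35, 1.20).pdf(x) for x in mids]
b0, b1, b2 = poisson_poly_fit(mids, counts)
delta_hat = -b1 / (2.0 * b2)
sigma_hat = (-2.0 * b2) ** -0.5
```

With a quadratic log-density the fitted coefficients give the center and width directly; a more flexible basis would be needed to capture the long tails seen in Figure 1.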
D. Estimating the Empirical Null Distribution
The main tactic of this paper is to estimate the null distribution $f_0(z)$ in (2.1) from the central peak in the z-values’ histogram. Assuming $f_0(z) = N(\delta_0, \sigma_0^2)$, we have
$$\log f(z) \doteq -\frac{1}{2}\left(\frac{z - \delta_0}{\sigma_0}\right)^2 + \text{constant} \tag{7.6}$$
for $z$ near zero, so that $\delta_0$ and $\sigma_0$ can be estimated by differentiating $\log f(z)$ as in (3.1). The constant depends on $N$ and $p_0$ but has no effect on the derivatives of (3.1).
Directly differentiating the spline estimate of $\log f(z)$ can give an overly variable estimate of $\sigma_0$. One more smoothing step was employed here: a quadratic curve $a_0 + a_1 x_k + a_2 x_k^2$ was fit by ordinary least squares to the estimated values $\log f_k$, for $x_k$ within 1.5 units of the maximum $\delta_0$, yielding $\sigma_0 = [-2a_2]^{-1/2}$ as in (3.1). This procedure gave the small bootstrap standard error estimate in (3.2).
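The quadratic smoothing step can be sketched as an ordinary least squares fit to log-density values near the peak. The function names and the synthetic input are illustrative; Cramer's rule is used only because the system is 3 by 3.

```python
from statistics import NormalDist
from typing import List, Sequence, Tuple

def det3(m: List[List[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def quadratic_ols(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Least squares coefficients (a0, a1, a2) of a0 + a1*x + a2*x^2."""
    sx = [sum(x ** k for x in xs) for k in range(5)]
    sy = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    a = [[sx[i + j] for j in range(3)] for i in range(3)]   # normal equations
    d = det3(a)
    coef = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = sy[r]
        coef.append(det3(m) / d)
    return coef

def center_and_width(xs: Sequence[float], log_f: Sequence[float],
                     window: float = 1.5) -> Tuple[float, float]:
    """delta_0 = grid point maximizing log f; sigma_0 = [-2*a2]^(-1/2) from a
    quadratic fit to the points within `window` of the maximum."""
    k_max = max(range(len(xs)), key=lambda k: log_f[k])
    delta0 = xs[k_max]
    near = [(x, lf) for x, lf in zip(xs, log_f) if abs(x - delta0) <= window]
    _, _, a2 = quadratic_ols([p[0] for p in near], [p[1] for p in near])
    return delta0, (-2.0 * a2) ** -0.5

# Synthetic check on an exactly normal log-density.
grid = [-4.0 + 0.05 * k for k in range(161)]
log_dens = [NormalDist(-0.35, 1.20).pdf(x) for x in grid]
log_dens = [__import__("math").log(v) for v in log_dens]
delta0_hat, sigma0_hat = center_and_width(grid, log_dens)
```

On real data the input would be the fitted values log f_k from the Poisson regression, evaluated at the bin midpoints.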
None of this methodology is crucial, though it is important that the estimates $\delta_0$ and $\sigma_0$ relate directly to $f_0(z)$, and are not much affected by the non-null distribution $f_1(z)$ in (2.1). As an example of what can go wrong suppose we try to estimate $\sigma_0$ by a “robust” scale measure such as (84th quantile minus 16th quantile)/2. This gives $\sigma_0 = 1.47$ for the Drug-mutation data, reflecting long tails due to the Interesting cases in Figure 1. Similar difficulties arise using the central slope of a qq plot. Basically a density estimate of the central peak is required, and then some assessment of its center and width.
More ambitiously, we might try extending the estimation of f0(z) to third moments, permitting a skew null distribution. Expression (7.6) could be generalized to

$$-\log f(z) \doteq c_0 + c_1 z + c_2 z^2/2 + c_3 z^3/6,$$

now requiring three derivatives to estimate the coefficients rather than the two of (3.1). This is an unexplored path, and in particular Table 1 has not been extended to include skewness.
Familiarity was the only reason for using z-values instead of t-values in Figures 1 and 6.
E. Estimating p0
We can obtain reasonable upper bounds for p0 in (2.1) from estimates of

$$\pi(c) \equiv \mathrm{Prob}_f\{z_i \in \delta_0 \pm c\sigma_0\}.$$

Supposing $f_0(z) = N(\delta_0, \sigma_0^2)$, define $G_0(c) = 2\Phi(c) - 1$ and $G_1(c)$ as the probabilities that zi ∈ δ0 ± cσ0 under f0 and f1 respectively. Since $\pi(c) = p_0 G_0(c) + (1 - p_0) G_1(c)$,

$$p_0 = \frac{\pi(c) - G_1(c)}{G_0(c) - G_1(c)} \le \frac{\pi(c)}{G_0(c)},$$

the inequality following from the assumption that G1(c) ≤ G0(c), i.e. that the f1 density is more dispersed than f0.
This leads to the estimated upper bound for p0,

$$\hat{p}_0 = \hat{\pi}(c)/G_0(c), \qquad \hat{\pi}(c) = \#\{z_i \in \delta_0 \pm c\sigma_0\}/N.$$

In particular if we assume G1(c) = 0, in other words that the Interesting zi's always fall outside δ0 ± cσ0, then $\hat{p}_0 = \hat{\pi}(c)/G_0(c)$ is unbiased. (This is the same estimate suggested in Remark F of Efron et al. (2001).) Choosing (δ0, σ0) = (−0.35, 1.20) and c = 1.5 gave $\hat{p}_0$ = 0.88 for the Drug-mutation data, with bootstrap standard error 0.024.
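The upper bound amounts to counting the z-values inside δ0 ± cσ0 and dividing by the normal probability G0(c) of that interval. A minimal Python sketch follows; the function name `p0_upper_bound` and the ten z-values are hypothetical and serve only to show the arithmetic.

```python
import math

def p0_upper_bound(z, delta0, sigma0, c=1.5):
    """Estimate the upper bound pi_hat(c) / G0(c) for p0.

    pi_hat(c) is the fraction of z-values within delta0 +/- c*sigma0, and
    G0(c) = 2*Phi(c) - 1 is the normal probability of that interval.
    """
    inside = sum(1 for zi in z if abs(zi - delta0) <= c * sigma0)
    pi_hat = inside / len(z)
    g0 = math.erf(c / math.sqrt(2.0))   # equals 2*Phi(c) - 1
    return pi_hat / g0

# Hypothetical z-values, for illustration only (not the Drug-mutation data).
z = [-2.6, -1.1, -0.9, -0.4, -0.3, 0.0, 0.2, 0.7, 1.3, 4.1]
bound = p0_upper_bound(z, delta0=-0.35, sigma0=1.20, c=1.5)
```

Here eight of the ten values fall inside −0.35 ± 1.8, so the bound is 0.8 divided by G0(1.5). Nothing in the formula prevents the ratio from exceeding 1 in small samples, in which case it would be natural to truncate it at 1.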
F. Single-point Solutions for (δmax, σmax)
The distributions g(µ) providing (δmax, σmax) in (5.6), as graphed in Figure 4, have their non-zero components supported at a single point µ1. For example, g(µ) for the entry giving σmax = 1.04 in Table 1 puts probability 0.90 at µ = 0 and 0.10 at µ1 = 1.47. Single-point optimality was proved for three of the four cases in Figure 4, and verified by numerical maximization for the “General” case. Here is the proof for the σmax “Symmetric” case, the other two proofs being similar.
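We consider symmetric distributions putting probability p0 on µ = 0 and probabilities pj on symmetric pairs (−µj, µj), j = 1, 2, . . . , J, so (5.3) becomes

$$f(z) = p_0\varphi(z) + \sum_{j=1}^{J} p_j\,[\varphi(z-\mu_j) + \varphi(z+\mu_j)]/2.$$

Defining $c_0 = p_0/(1-p_0)$, $r_j = p_j/p_0$, and $r_+ = \sum_{j=1}^{J} r_j = 1/c_0$, we can express σ0 in (3.1) as

$$\sigma_0 = (1 - Q)^{-1/2}, \qquad Q = \frac{\sum_{j=1}^{J} r_j \mu_j^2 e^{-\mu_j^2/2}}{c_0 r_+ + \sum_{j=1}^{J} r_j e^{-\mu_j^2/2}}.$$

Here we have used δ0 = 0, which is true by symmetry assuming p0 ≥ 1/2. Then σmax in (5.6) can be found by maximizing Q.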
We will show that with p0 (and c0) and µ1, µ2, . . . , µJ held fixed in (7.12), Q is maximized by a choice of p1, p2, . . . , pJ having J − 1 zero values; this is a stronger version of the single-point result. Because Q is homogeneous in r = (r1, r2, . . . , rJ) in (7.13), we can consider the unconstrained maximization of Q(r), subject only to rj ≥ 0 for j = 1, 2, . . . , J. Differentiation gives

$$\frac{\partial Q(r)}{\partial r_j} = \frac{1}{\mathrm{den}}\left[\mu_j^2 e^{-\mu_j^2/2} - Q(r)\left(c_0 + e^{-\mu_j^2/2}\right)\right],$$

“den” the denominator of Q. At a maximizing point r we must have

$$\frac{\partial Q(r)}{\partial r_j} \le 0, \quad \text{with equality if } r_j > 0.$$

Defining $R_j = \mu_j^2/(1 + c_0 e^{\mu_j^2/2})$, this gives

$$Q(r) \ge R_j, \quad \text{with equality if } r_j > 0.$$

Since Q(r) is the maximum, this says that rj, and pj, can only be non-zero if j maximizes Rj. In case of ties we can arbitrarily choose one of the maximizing j's.
All of this shows that we need only consider J = 1 in (7.12). The global maximized value of σ0 in (7.12) is

$$\sigma_{\max} = (1 - R_{\max})^{-1/2}, \qquad R_{\max} = \max_{\mu_1}\left\{\mu_1^2/(1 + c_0 e^{\mu_1^2/2})\right\}.$$

The maximizing argument µ1 ranges from 1.43 for p0 = .95 to 1.51 for p0 = .70. The corresponding result for δmax is simpler, µ1 = δmax + 1.
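The single-point result for the symmetric case can be checked numerically. The sketch below reads the result as σmax = (1 − Rmax)^(−1/2), with Rmax the maximum over µ1 of µ1²/(1 + c0 exp(µ1²/2)) and c0 = p0/(1 − p0); this reading is reconstructed from the derivation above, and the grid search and the function name `sigma_max_symmetric` are assumptions made for illustration.

```python
import numpy as np

def sigma_max_symmetric(p0, mu_grid=None):
    """Return (sigma_max, mu1) for the symmetric case with J = 1.

    Maximizes R(mu) = mu^2 / (1 + c0 * exp(mu^2 / 2)) over a grid of mu values,
    with c0 = p0 / (1 - p0), then sets sigma_max = (1 - R_max)^(-1/2).
    """
    if mu_grid is None:
        mu_grid = np.arange(0.001, 4.0, 0.001)   # assumed search grid
    c0 = p0 / (1.0 - p0)
    R = mu_grid ** 2 / (1.0 + c0 * np.exp(mu_grid ** 2 / 2.0))
    k = np.argmax(R)
    return (1.0 - R[k]) ** -0.5, mu_grid[k]

for p0 in (0.95, 0.90, 0.70):
    sigma_max, mu1 = sigma_max_symmetric(p0)
    print(f"p0 = {p0:.2f}: sigma_max = {sigma_max:.3f}, mu1 = {mu1:.3f}")
```

A grid search is used here only for transparency. Setting the derivative of R to zero gives the one-dimensional equation exp(s)(s − 1) = 1/c0 with s = µ1²/2, which could be handed to a root finder instead.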
G. Microarray Correlation in the Breast Cancer Study
This remark concerns the correlation structure among the eight BRCA2 microarrays. Let X be the 3226 × 8 matrix of BRCA2 data, with the columns of X standardized to have mean 0 and variance 1. A “de-gened” matrix X̃ was formed by subtracting row-wise averages from each element of X,

$$\tilde{x}_{ij} = x_{ij} - \bar{x}_{i\cdot}, \qquad \bar{x}_{i\cdot} = \frac{1}{8}\sum_{j=1}^{8} x_{ij}. \qquad (7.17)$$

Table 3 shows the 8 × 8 correlation matrix of X̃. With genuine gene effects subtracted out, the correlations should vary around −1/7 = −0.14 if the columns of X are independent.
Instead we see that the columns are correlated in blocks of four, with the oﬀ-diagonal block too negative and the on-diagonal blocks too positive.
Table 3: Correlation matrix for the BRCA2 data with row-wise means subtracted off, (7.17). It indicates positive correlations within the two blocks of four.
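The −1/7 benchmark for independent columns can be checked by simulation. The sketch below uses a simulated 3226 × 8 matrix with independent normal columns as a stand-in; it is not the BRCA2 data, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_arrays = 3226, 8
# Simulated stand-in for the BRCA2 matrix: independent columns, no gene effects.
X = rng.normal(size=(n_genes, n_arrays))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize columns
X_tilde = X - X.mean(axis=1, keepdims=True)       # subtract row-wise averages

corr = np.corrcoef(X_tilde, rowvar=False)         # 8 x 8 correlation matrix
off_diag = corr[~np.eye(n_arrays, dtype=bool)]
mean_off_diag = off_diag.mean()
```

With independent columns the off-diagonal entries scatter closely around −1/7. A block pattern like the one described for Table 3, with entries well above −1/7 within blocks and well below it between blocks, would therefore point to correlation among the microarrays.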
Large-scale simultaneous hypothesis testing, where the number of cases exceeds say 100, permits the empirical estimation of a null hypothesis distribution. The empirical null may be wider (more dispersed) than the theoretical null distribution that would ordinarily be used for a single hypothesis test. The choice between empirical and theoretical nulls can greatly influence which cases are identified as “Significant” or “Interesting”, as opposed to “Null” or “Uninteresting”, this being true no matter which simultaneous hypothesis testing method is used.
We present an analysis plan for large-scale testing situations:
• A density fitting technique is used to estimate the null hypothesis distribution f0.
• The local false discovery rate, an empirical Bayes version of standard FDR theory, provides inferences for the N cases, Figure 3 and Section 2.
There are many possible reasons for overdispersion of the empirical null distribution that would lead to the empirical null being preferred for simultaneous testing:
• Unobserved covariates in an observational study, Section 4.
• Hidden correlations, Section 6.
• A large proportion of genuine but uninterestingly small eﬀects, Figure 5.
Large-scale testing diﬀers in scientiﬁc intent from an individual hypothesis test. The latter is most often designed to reject the null hypothesis with high probability. Large-scale testing is usually more of a screening operation, intended to identify a small percentage of Interesting cases, assumed to be on the order of 10% or less in this paper. Our estimation technique for the empirical null hypothesis is designed to be accurate under this constraint, Figure 4. More traditional estimation methods, involving permutations or quantiles, give incorrect f0 estimates, Section 4 and Remark D.
Acknowledgment
I am grateful to Robert Shafer, David Katzenstein, and Rami Kantor for bringing the Drug-mutation data to my attention, and to Robert Tibshirani for several [...]
References
Benjamini, Y. and Hochberg, Y. (1995). “Controlling the false discovery rate: A practical and powerful approach to multiple testing”. J. R. Stat. Soc. Ser. B Stat. Methodol. 57
Dudoit, S., Shaffer, J. and Boldrick, J. (2003). “Multiple hypothesis testing in microarray experiments”. Statistical Science 18 71-103.
Efron, B. (2003). “Robbins, empirical Bayes, and microarrays”. Annals Stat. 31 366-378.
Efron, B. and Tibshirani, R. (2002). “Empirical Bayes methods and false discovery rates for microarrays”. Genetic Epidemiology 23 70-86.
Efron, B., Tibshirani, R., Storey, J. and Tusher, V. (2001). “Empirical Bayes analysis of a microarray experiment”. J. Amer. Statist. Assoc. 96 1151-1160.
Efron, B. and Tibshirani, R. (1996). “Using specially designed exponential families for density estimation”. Annals Stat. 24 2431-61.
Efron, B. (1988). “Three examples of computer-intensive statistical inference”. Sankhya 50
Hedenfalk, I., Duggan, D., Chen, Y., et al. (2001). “Gene expression profiles in hereditary breast cancer”. New Engl. Jour. Medicine 344 539-48.
Miller, R. (1981). Simultaneous Statistical Inference, Second Edition. Springer-Verlag, New York.
Westfall, P. and Young, S. (1993). Resampling-based multiple testing: examples and methods for p-value adjustments. Wiley, New York.
Wu, T., Schiffer, C., Shafer, R. et al. (2003). “Mutation patterns and structural correlates in Human Immunodeficiency Virus Type 1 Protease following different protease inhibitor treatments”. Jour. Virology 77(8) 4836-47.

Source: http://www.stats.org.uk/statistical-inference/Efron2004.pdf
