Large-Scale Simultaneous Hypothesis Testing: The Choice of a Null Hypothesis
Bradley Efron
Abstract
Current scientific techniques in genomics and image processing routinely produce hy-
pothesis testing problems with hundreds or thousands of cases to consider simultaneously.
This poses new difficulties for the statistician, but also opens new opportunities. In particu-
lar it allows empirical estimation of an appropriate null hypothesis. The empirical null may
be considerably more dispersed than the usual theoretical null distribution that would be
used for any one case considered separately. An empirical Bayes analysis plan for this situ-
ation is developed, using a local version of the false discovery rate to examine the inference
issues. Two genomics problems are used as examples to show the importance of correctly
choosing the null hypothesis.
Key Words: local false discovery rate, empirical Bayes, microarray analysis, empirical null
1. Introduction
Until recently “simultaneous inference” meant considering two or five or perhaps ten
hypothesis tests at the same time, as in Miller’s classic 1981 text. Rapid progress in tech-
nology, particularly in genomics and imaging, has vastly upped the ante for simultaneous
inference problems: now 500 or 5000 or even 50,000 tests may need to be evaluated at once,
raising new problems for the statistician, but also opening new analytic opportunities. This
paper concerns the choice of an appropriate null hypothesis in large-scale testing situations,
and how this choice affects well-known inference methods such as the false discovery rate.
Simultaneous hypothesis testing begins with a collection of null hypotheses
H1, H2, . . . , HN ,
corresponding test statistics, possibly not independent,
Y1, Y2, . . . , YN ,
and their p-values, P1, P2, . . . , PN , with Pi measuring how strongly yi, the observed value of
Yi, contradicts Hi, for instance Pi = prob {|Yi| > |yi|}. “Large-scale” means that N is a
big number, say at least N > 100.
It is convenient though not necessary to work with z-values instead of the Yi’s or Pi’s,
$$z_i = \Phi^{-1}(P_i), \qquad i = 1, 2, \ldots, N, \tag{1.3}$$
Φ indicating the standard normal cumulative distribution function (cdf), Φ−1(.95) = 1.645
etc. If Hi is exactly true then zi will have a standard normal distribution,
$$z_i \mid H_i \sim N(0, 1). \tag{1.4}$$
We will call (1.4) the theoretical null hypothesis.
Our motivating example concerns an HIV study of 1391 patients, investigating which of
6 Protease Inhibitor (“PI”) drugs cause mutations at which of 74 sites on the viral genome.
Each patient provided a vector of predictors
$$x = (x_1, x_2, \ldots, x_6), \tag{1.5}$$
xj = 1 or 0 indicating whether or not the patient used PIj (j = 1, 2, . . . , 6), and a vector of responses
$$v = (v_1, v_2, \ldots, v_{74}), \tag{1.6}$$
vk = 1 or 0 indicating whether or not a mutation occurred at site k. Remark A of Section 7
describes the study in a little more detail.
For each of the 74 genomic sites, a separate logistic regression analysis was run using
all 1391 cases, with that site’s mutation indicators as responses and the PI indicators as
predictors. Together these yielded 444 = 6 × 74 z-values, one for testing each null hypothesis
that drug j does not cause mutations at site k, j = 1, 2, . . . , 6 and k = 1, 2, . . . , 74. The z-values
were based on the usual approximation
$$z_i = y_i / \mathrm{se}_i \;\dot{\sim}\; N(0, 1), \qquad i = 1, 2, \ldots, 444, \tag{1.7}$$
(using a single subscript i in place of (j, k)) where yi is the maximum likelihood estimate
(MLE) of the logistic regression coefficient and sei its approximate large-sample standard
error.
Figure 1 shows a histogram of the 444 z-values, with negative zi’s indicating greater
mutational effects. The smooth curve f (z) is a natural spline with seven degrees of freedom,
fit to the histogram counts by Poisson regression. It emphasizes the central peak near z = 0,
presumably the large majority of uninteresting drug-site combinations that have negligible
mutation effects. Near its center the peak is well-described by a normal density having mean
-0.35 and standard deviation 1.20, which we will call the empirical null hypothesis,
$$z_i \mid H_i \sim N(-0.35, 1.20^2). \tag{1.8}$$
Section 3 describes the estimation methodology for (1.8), with a brief discussion of the
normality assumption in Remark D of Section 7.
The difference between the theoretical null N(0, 1) and empirical null N(−0.35, 1.20²)
may not seem worrisome here but we will see that it substantially affects any simultaneous
inference procedure. A more dramatic example is given in Section 6, for a microarray analysis
where going from the theoretical to empirical null totally negates any findings of significance.
Situations going in the reverse direction also occur.
In classic situations involving only a single hypothesis test we must, of necessity, employ
the theoretical null hypothesis z ∼ N(0, 1). The main point of this paper is that large-scale
testing situations permit empirical estimation of the null distribution. Sections 3 through
5 concern reasons why the empirical and theoretical null might differ, and which might be
preferable in different situations.
Figure 1: Histogram of 444 z-values from the Drug-mutation analysis; smooth curve f (z) is natural spline fit to histogram counts. Horizontal axis: z-values, with negative values indicating larger mutation effects. The central peak near z = 0 is approximately N(−0.35, 1.20²): the “empirical null hypothesis”. Simultaneous hypothesis tests for the 444 cases depend critically on the choice between the empirical or theoretical N(0, 1) null.
There are scientific as well as statistical differences between small-scale and large-scale
hypothesis testing situations. A single hypothesis test is most often run with the expectation
and hope of rejecting the null, “with 80% power” in a typical clinical trial. Nobody wants to
reject 80% of N = 5000 null hypotheses. The usual point of large-scale testing is to identify
a small percentage of interesting cases that deserve further investigation. While not exactly
looking for a needle in a haystack, we don’t want the whole haystack either. An important
assumption of what follows is that the proportion of interesting cases is small, perhaps 1%,
or 5% of N , but not more than 10%. This is made explicit in Section 2, in our description of
the local false discovery rate as an analytic tool for large-scale testing. There are situations
where the 10% limit is irrelevant, for example, in constructing prediction models, but these
are not our concern here.
The terminology “Interesting/Uninteresting” used in this paper in preference to “Sig-
nificant/Nonsignificant” is discussed near the end of Section 5. We conclude in Sections 7
and 8 with remarks, including most of the technical details, and a summary.
2. The Local False Discovery Rate
It is convenient to discuss large-scale testing problems
in terms of the local false discovery rate (fdr), an empirical Bayes version of Benjamini
and Hochberg’s (1995) methodology focusing on densities rather than tail areas; see Efron
et al. (2001) and Efron and Tibshirani (2002).
We begin with a simple Bayes model. Suppose that the N z-values fall into two classes,
“Uninteresting” or “Interesting”, corresponding to whether or not zi is generated according
to the null hypothesis, with prior probabilities p0 and p1 = 1 − p0 for the classes; and that
zi has density either f0(z) or f1(z) depending on its class,
$$\begin{aligned} p_0 &= \text{Prob}\{\text{Uninteresting}\}, &\quad& f_0(z) \text{ density if Uninteresting (Null)} \\ p_1 &= \text{Prob}\{\text{Interesting}\}, &\quad& f_1(z) \text{ density if Interesting (Non-Null).} \end{aligned} \tag{2.1}$$
The smooth curve in Figure 1 estimates the mixture density f (z),
$$f(z) = p_0 f_0(z) + p_1 f_1(z). \tag{2.2}$$
According to Bayes theorem the a posteriori probability of being in the Uninteresting class
given z is
$$\text{Prob}\{\text{Uninteresting} \mid z\} = p_0 f_0(z)/f(z). \tag{2.3}$$
Here we define the local false discovery rate to be
$$\mathrm{fdr}(z) \equiv f_0(z)/f(z), \tag{2.4}$$
ignoring the factor p0 in (2.3), so fdr(z) is an upper bound on Prob{Uninteresting|z}. In
fact p0 can be roughly estimated, see Remark E, but we are assuming that p0 is near 1, say
p0 ≥ 0.90, so fdr(z) is not a flagrant overestimator.
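The relation between (2.3) and (2.4) can be illustrated with a minimal Python sketch. The non-null density f1 = N(−3, 1) and the value p0 = 0.90 are assumptions chosen only for illustration; they are not estimates from the Drug-mutation data.

```python
from statistics import NormalDist

# Hypothetical two-groups setup; every number here is an illustrative assumption.
p0 = 0.90                      # prior probability of "Uninteresting"
f0 = NormalDist(0.0, 1.0)      # null density: theoretical N(0, 1)
f1 = NormalDist(-3.0, 1.0)     # assumed non-null density

def mixture_density(z):
    """f(z) = p0*f0(z) + p1*f1(z), as in (2.2)."""
    return p0 * f0.pdf(z) + (1.0 - p0) * f1.pdf(z)

def posterior_uninteresting(z):
    """Prob{Uninteresting | z} = p0*f0(z)/f(z), as in (2.3)."""
    return p0 * f0.pdf(z) / mixture_density(z)

def local_fdr(z):
    """fdr(z) = f0(z)/f(z), as in (2.4); it can exceed 1 near the null center."""
    return f0.pdf(z) / mixture_density(z)

for z in (-4.0, -3.0, -2.0, 0.0):
    print(f"z={z:5.1f}  posterior={posterior_uninteresting(z):.3f}  fdr={local_fdr(z):.3f}")
```

Because fdr(z) differs from the posterior probability only by the factor 1/p0, it overstates that probability by at most about 11% when p0 ≥ 0.90.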
The local fdr provides a useful methodology for identifying Interesting cases in a situ-
ation like that of Figure 1: (1) estimate f (z) from the observed ensemble of z-values, for
example by the natural spline fit to the histogram counts; (2) assign a null density f0(z); (3)
calculate fdr(z) = f0(z)/f (z); (4) report as Interesting those cases with fdr(zi) less than some
threshold value, perhaps fdr(zi) ≤ 0.10. Remark B discusses the close connection between
this algorithm and Benjamini and Hochberg’s (1995) method.
This paper concerns the choice of f0(z), the null hypothesis density. In the drug-mutation
example it is crucial whether f0 is taken to be the theoretical or empirical null, N(0, 1) or
N(−0.35, 1.20²). This is illustrated in Figure 2, a close-up view of Figure 1 focusing on
the bin containing z = −3. The expected number of the 444 zi values falling into this
bin is 6.37 for f (z), and either 0.62 or 3.90 as f0(z) is N(0, 1) or N(−0.35, 1.20²). Thus
fdr(z) = f0(z)/f (z) at z = −3 is estimated to be either
$$\mathrm{fdr}(-3) = \begin{cases} 0.097 & \text{using theoretical null } N(0, 1) \\ 0.612 & \text{using empirical null } N(-0.35, 1.20^2). \end{cases}$$
In this bin, changing from the theoretical to empirical null changes our inferences from
Interesting to definitely Uninteresting.
Figure 2: Close-up view of the bin containing z = −3 in Figure 1. Expected number in bin: 6.37 for f (z), 0.62 for f0 = N(0, 1), 3.90 for f0 = N(−0.35, 1.20²), the empirical null. Corresponding estimates of fdr(−3): 0.097 for N(0, 1) versus 0.612 for N(−0.35, 1.20²). Should we report the cases in this bin as Interesting?
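The arithmetic behind the two fdr(−3) estimates can be reproduced in a few lines of Python. The expected bin counts are the ones reported above; the variable names and the use of the 0.10 reporting threshold as a decision rule are illustrative assumptions.

```python
# Expected bin counts at z = -3, as reported for Figure 2.
expected_f = 6.37                 # from the spline estimate f(z)
expected_null = {
    "theoretical N(0, 1)": 0.62,
    "empirical N(-0.35, 1.20^2)": 3.90,
}
threshold = 0.10                  # reporting threshold suggested in the text

# fdr(-3) = expected null count / expected mixture count in the bin.
fdr_at_minus3 = {name: count / expected_f for name, count in expected_null.items()}
for name, fdr in fdr_at_minus3.items():
    verdict = "Interesting" if fdr <= threshold else "Uninteresting"
    print(f"{name}: fdr(-3) = {fdr:.3f} -> {verdict}")
```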
Figure 3 compares the two estimates of log fdr(z) over most of the z scale. 18 of the 444
z-values have fdr(z) < 0.10 for f0 = N(0, 1) but > 0.10 for f0 = N(−0.35, 1.20²), with 17 of
these at the left end of the scale. All told the empirical null yields only two-thirds as many
cases with fdr < 0.10 as the theoretical null, 35 compared to 53.
Figure 3: Comparison of estimates of log fdr(z) for the Drug-mutation data; empirical null estimate (solid curve) declines more slowly than theoretical null estimate (dotted). Dashes indicate the 444 z-values. 17 cases on left have fdr(z) < 1/10 for theoretical but > 1/10 for empirical null.
3. Estimating the Empirical Null Distribution
Our estimate of the empirical null
distribution for the Drug-mutation data was obtained in two steps: the curve f (z) shown
in Figure 1 was fit to the histogram counts by Poisson regression, and then the center and
half-width of the central peak, say δ0 and σ0, were obtained from f (z),
$$\delta_0 = \arg\max_z \{f(z)\} \quad\text{and}\quad \sigma_0 = \left[-\frac{d^2}{dz^2} \log f(z)\Big|_{\delta_0}\right]^{-1/2}, \tag{3.1}$$
yielding (δ0, σ0) = (−0.35, 1.20). Details are given in Remark D, where the possibility of a
non-normal empirical null distribution is briefly discussed.
More direct estimation methods for f0 seem possible, for example estimating δ0 by the
median of the z-values. Suppose though that 10% of the z-values came from the non-null
distribution and all of these were located at the far left end of Figure 1. Then the median of
all the z’s would be the 4/9 quantile of the actual null distribution, not its median, yielding
a badly biased estimate of δ0. Similar comments apply to estimating σ0, Remark D. Method
(3.1) does not require preliminary estimates of the proportion p0 in the null population of
(2.1), a considerable practical advantage.
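The size of the median bias in this thought experiment is easy to quantify. The sketch below assumes, as in the text, that the non-null cases all lie to the left of the null distribution, and additionally assumes a normal null so that the shift can be expressed in null standard deviations; the function name is hypothetical.

```python
from statistics import NormalDist

def null_quantile_hit_by_median(p1_left):
    """Quantile of the null distribution located at the overall median, assuming a
    fraction p1_left of all cases is non-null and lies entirely to the left."""
    p0 = 1.0 - p1_left
    return (0.5 - p1_left) / p0

q = null_quantile_hit_by_median(0.10)        # 0.4 / 0.9 = 4/9
bias_in_sd_units = NormalDist().inv_cdf(q)   # shift of the median for a normal null
print(f"null quantile = {q:.4f}, bias = {bias_in_sd_units:.3f} null standard deviations")
```

Under these assumptions the median underestimates δ0 by roughly 0.14 σ0, which is larger than the bootstrap standard error reported for δ0 below.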
How accurate are the estimates (−0.35, 1.20)? The usual standard error approximations
for a Poisson regression fit are not appropriate here since the zi’s are not independent of
each other. A nonparametric bootstrap analysis was performed instead, with the 1391 80-
dimensional vectors (x, v), (1.5-1.6) as the resampling units. This gave .09 and .08 for the
bootstrap standard errors of δ0 and σ0 respectively, i.e.
(δ0, σ0) = (−0.35, 1.20) ± (.09, .08) .
It seems quite unlikely that estimation error alone accounts for the difference between the
empirical null and the theoretical values (δ0, σ0) = (0, 1). (Notice that this type of bootstrap
analysis, which requires independent sampling units, is not applicable to the microarray
example of Section 6, where we expect correlations among the genes.)
The next two sections concern other possible causes for empirical/theoretical differences,
diagnostics for these causes, and their interpretations. Our list is not exhaustive and in fact
the microarray example of Section 6 demonstrates another form of pathology.
4. Permutation Tests and Unobserved Covariates
The theoretical N(0, 1) null hypothesis
(1.4) is usually based on asymptotic approximations like those for the logistic regression
coefficients in the Drug-mutation study. Permutation methods can be used to avoid
these approximations, perhaps in the hope that an improved theoretical null will more closely
match the empirical null.
This was not the case for the Drug-mutation data. Permutation testing was implemented
by randomly pairing the 1391 predictor vectors x, (1.5), with the 1391 response vectors v,
(1.6), and recalculating the 444 z-values. This whole process was independently repeated 20
times, yielding a total of 20 × 444 permutation z’s. Their distribution was well approximated
by a N(0, 0.965²) density (the “permutation null”) except for a prominent spike near z = 0.3.
In this case the permutation-improved theoretical null differs more rather than less from the
empirical null N(−0.35, 1.20²).
Permutation methods are popular in the microarray literature as a way of avoiding
assumptions and approximations, see Efron et al. (2001) or Dudoit et al. (2003), but they
do not automatically resolve the question of an appropriate null hypothesis. This can be
seen in the following hypothetical example, which is a stylized version of the two-sample
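microarray testing problem in Section 6: the data xij comes from N simultaneous two-sample
experiments, each comparing 2n subjects,
$$x_{ij}: \begin{cases} \text{Controls} & j = 1, 2, \ldots, n \\ \text{Treatments} & j = n + 1, n + 2, \ldots, 2n \end{cases} \qquad (i = 1, \ldots, N). \tag{4.1}$$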
The ith test statistic Yi is the usual two-sample t-statistic, comparing Treatments versus
Controls for the ith experiment.
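Suppose that, unknown to the statistician, the data was actually generated from
$$x_{ij} = u_{ij} + \frac{I_j}{2} \beta_i, \qquad u_{ij} \sim N(0, 1), \quad \beta_i \sim N(0, \sigma^2), \tag{4.2}$$
with the uij and βi mutually independent and
$$I_j = \begin{cases} -1 & j = 1, 2, \ldots, n \\ \phantom{-}1 & j = n + 1, \ldots, 2n. \end{cases} \tag{4.3}$$
Then it is easy to show that the statistics Yi follow a dilated t-distribution with 2n − 2
degrees of freedom,
$$Y_i \sim \left(1 + \frac{n}{2} \sigma^2\right)^{1/2} t_{2n-2}, \tag{4.4}$$
while the permutation distribution, permuting Treatments and Controls within each experiment,
has nearly a standard t2n−2 null distribution. So for example if σ² = 2/n, the empirical
null density will be √2 times as wide as the permutation null.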
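A small simulation can make the dilation visible. The sketch below assumes unit-variance normal errors and a random effect that shifts Treatments up and Controls down by half of βi, with βi ∼ N(0, σ²) and σ² = 2/n; the sample sizes, seed, and function names are illustrative choices, and a single random relabeling per experiment stands in for the full permutation distribution.

```python
import math
import random
from statistics import mean, stdev

def two_sample_t(controls, treatments):
    """Pooled-variance two-sample t-statistic, Treatments minus Controls."""
    n = len(controls)  # equal group sizes assumed
    mc, mt = mean(controls), mean(treatments)
    ss = sum((x - mc) ** 2 for x in controls) + sum((x - mt) ** 2 for x in treatments)
    pooled_var = ss / (2 * n - 2)
    return (mt - mc) / math.sqrt(pooled_var * 2.0 / n)

def simulate(N=4000, n=10, sigma2=None, seed=1):
    """Return the sample sd of the observed and of the permuted t-statistics."""
    rng = random.Random(seed)
    sigma2 = 2.0 / n if sigma2 is None else sigma2
    observed, permuted = [], []
    for _ in range(N):
        beta = rng.gauss(0.0, math.sqrt(sigma2))       # unobserved random effect
        controls = [rng.gauss(0.0, 1.0) - beta / 2 for _ in range(n)]
        treatments = [rng.gauss(0.0, 1.0) + beta / 2 for _ in range(n)]
        observed.append(two_sample_t(controls, treatments))
        pooled = controls + treatments
        rng.shuffle(pooled)                            # one random relabeling per experiment
        permuted.append(two_sample_t(pooled[:n], pooled[n:]))
    return stdev(observed), stdev(permuted)

sd_obs, sd_perm = simulate()
print(f"sd observed: {sd_obs:.2f}; sd permuted: {sd_perm:.2f}; ratio: {sd_obs / sd_perm:.2f}")
```

With σ² = 2/n the ratio of the two standard deviations should come out near √2 ≈ 1.41, up to simulation error.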
The quantity βi in (4.2)-(4.3) causes the only consistent differences between Treatments
and Controls in experiment i. If βi is a dependable feature of the ith experiment, and would
appear again with the same value in a replication of the study, then the permutation null t2n−2
is a reasonable basis for inference. With n large and σ² = 2/n, it results in fdr(yi) ≤ 0.10
for the most extreme 2% of the observed t-statistics, favoring those with the largest values
of |βi|.
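The "most extreme 2%" figure can be checked in the large-n limit, where the permutation null is close to N(0, 1) and, with σ² = 2/n, the observed statistics are spread like N(0, 2). The sketch below takes that limit and the 0.10 threshold used elsewhere in the paper as its assumptions.

```python
import math
from statistics import NormalDist

# Large-n limit with sigma^2 = 2/n: null f0 = N(0, 1), observed mixture f = N(0, 2).
f0 = NormalDist(0.0, 1.0)
f = NormalDist(0.0, math.sqrt(2.0))

def fdr(y):
    """fdr(y) = f0(y)/f(y), which here equals sqrt(2) * exp(-y**2 / 4)."""
    return f0.pdf(y) / f.pdf(y)

threshold = 0.10
# Solve sqrt(2) * exp(-y^2 / 4) = threshold for the cutoff on |y|.
cutoff = math.sqrt(4.0 * math.log(math.sqrt(2.0) / threshold))
fraction_flagged = 2.0 * f.cdf(-cutoff)   # two-sided tail area under f
print(f"|y| cutoff = {cutoff:.2f}, fraction with fdr <= {threshold}: {fraction_flagged:.3f}")
```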
Suppose though that βi is not inherent to experiment i, but rather a purely random effect
that would have a different value and perhaps a different sign if the study were repeated; that
is, βi is part of the noise and not part of the signal. In this case the appropriate choice is the
empirical null (4.4). The equivalent of Figure 1 will be all central peak, with no interesting
outliers, and there will be no cases having small values of fdr(yi). This is appropriate since
now there is no real Treatment effect.
In this last context βi acts as an unobserved covariate, a quantity which the statistician
would use to correct the Treatment-Control comparison if it were observable. Unobserved
covariates are ubiquitous in observational studies. There are several obvious ones in the
Drug-mutation study: personal characteristics of the patients such as age and gender, prior
use of AZT and other non-PI drugs, years since infection, geographical location, etc.
The effect of important unobserved covariates is to dilate the null hypothesis density
f0(z), as happens in (4.4). Unobserved covariates will also dilate the “Interesting” density
f1(z) in (2.1), and the mixture density f (z), (2.2). However an empirical fitting method for
estimating f (z), such as the spline fit in Figure 1, automatically includes any dilation effects.
In estimating fdr(z) = f0(z)/f (z) it is important to also allow for dilation of the numerator
f0. This is a strong argument for preferring the empirical null hypothesis in observational
studies.
5. A Structural Model for the z-values
The Bayesian specifications (2.1) underlying
our fdr results have the advantage of not requiring a structural model for the z-values;
in particular it is not necessary to motivate, or even describe, the non-null density f1(z).
There is however a simple structural model that helps elucidate the Interesting-Uninteresting
distinction.
“true value” µi, its expectation,
zi ∼ N(µi, 1) for i = 1, 2, . . . , N ,
with µi having some prior distribution g(µ),
µi ∼ g(µ) for i = 1, 2, . . . , N .
Structure (5.1) is often a good approximation, see Section 4 of Efron (1988), and in fact
proved reasonably accurate in the bootstrap experiment giving (3.2). Together (5.1)-(5.2)
say that the mixture density f (z), (2.2), is a convolution of g(µ) with the standard normal
density ϕ(z),
$$f(z) = \int_{-\infty}^{\infty} \varphi(z - \mu)\, g(\mu)\, d\mu \tag{5.3}$$
(with the understanding that g(µ) may include discrete probability atoms).
As a first application of the structural model, suppose we insist that g(µ) put probability
p0 on µ = 0,
$$g\{0\} = p_0, \tag{5.4}$$
for some fixed value of p0 between 0 and 1. This amounts to our original Bayes model (2.1)
with p0 = Prob{Uninteresting}, f0(z) the theoretical null hypothesis N(0, 1), and
$$f_1(z) = \int_{\mu \neq 0} \varphi(z - \mu)\, g(\mu)\, d\mu \Big/ (1 - p_0). \tag{5.5}$$
In the context of this paper, p0 should be 0.90 or greater.
For any f (z) of the convolution form (5.3) let (δg, σg) be the center and width parameters
(δ0, σ0) defined by (3.1). Figure 4 answers the following question: for a given choice of p0 in
constraint (5.4), what are the maximum possible values of |δg| and of σg,
$$\delta_{\max} = \max\{|\delta_g| \mid p_0\} \quad\text{and}\quad \sigma_{\max} = \max\{\sigma_g \mid p_0\}. \tag{5.6}$$
Figure 4: Maximum possible values of the center and width parameters (δ0, σ0), (3.1), when the structural model (5.1)-(5.3) is constrained to put probability p0 on µ = 0. For 1 − p0 ≤ 0.10 the maxima are not much greater than the theoretical null values (0, 1), as shown in Table 1.
Three curves appear for σmax, for the general case just described, for the case where the
non-zero component of g(µ) is required to be symmetric around zero, and for the case where
it is also required to be normal. Here we will only mention the general case. Remark F
discusses the solution of (5.6), which turns out to have a simple “single-point” form.
The notable feature of Figure 4 is that for p0 ≥ 0.90, our preferred realm for large-scale
hypothesis testing, (δmax, σmax) must be quite near the theoretical null values (0, 1):
δmax ≤ 0.07 and σmax ≤ 1.04 .
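The width bound can be checked numerically for the single-point configuration reported in Remark F, which puts probability 0.90 at µ = 0 and 0.10 at µ1 = 1.47. The sketch below applies definition (3.1) directly; the grid search, the finite-difference step, and the function names are implementation assumptions.

```python
import math
from statistics import NormalDist

phi = NormalDist().pdf

def f(z, p0=0.90, mu1=1.47):
    """Mixture density with probability p0 at mu = 0 and 1 - p0 at mu1."""
    return p0 * phi(z) + (1.0 - p0) * phi(z - mu1)

def center_and_width(density, lo=-1.0, hi=1.0, steps=20000, h=1e-3):
    """Definition (3.1): delta = argmax f, sigma = [-(d^2/dz^2) log f at delta]^(-1/2).
    Uses a grid search for the mode and a central finite difference for the curvature."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    delta = max(grid, key=density)
    log_f = lambda z: math.log(density(z))
    second = (log_f(delta + h) - 2.0 * log_f(delta) + log_f(delta - h)) / h ** 2
    return delta, (-second) ** -0.5

delta, sigma = center_and_width(f)
print(f"delta = {delta:.3f}, sigma = {sigma:.3f}")
```

The resulting width is close to 1.04, in line with the bound stated above for p0 = 0.90.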
Table 1 shows (δmax, σmax) for various choices of p0. We see that the “Interesting” probability
1 − p0 would have to be nearly 0.30, very big by the standards of large-scale testing, in order
to obtain the observed Drug-mutation values (δ0, σ0) = (−0.35, 1.20). The inference is that
uninteresting effects, such as the unobserved covariates of Section 4, are dilating the null
hypothesis.
Table 1: Value of σmax and δmax as a function of 1 − p0, (5.4).
The main point here is that our measures (3.1) of center and width are quite robust
to the arrangement of Interesting values µi as long as the Interesting percentage does not
exceed 10%. If (δ0, σ0) for the central peak is much different than (0, 1), as it is in Figure
1, then use of the theoretical null is bound to result in identifying an uncomfortably large
percentage of supposedly Interesting cases.
We can pursue this last point for the Drug-mutation data by removing constraint (5.4).
Figure 5 shows an unconstrained estimate of g(µ). For computational simplicity g(µ) was
assumed to be discrete, with at most J = 8 support points µ1, µ2, . . . , µJ , so that (5.3)
becomes
$$f(z) = \sum_{j=1}^{J} \pi_j \varphi(z - \mu_j),$$
πj being the probability g puts on µj, with πj ≥ 0 and Σj πj = 1. A non-linear minimization
program was employed to find the best-fit curve of form (5.8) to the histogram counts in
Figure 1, using Poisson deviance as the fitting criterion. The vertical bars in Figure 5 are
located at the resulting 8 values µj, with the bar’s height proportional to πj. For example
the little bar at far left represents an atom of probability π1 = .015 at µ1 = −10.9. The
resulting f (z) estimate (5.7) closely resembles the natural spline fit of Figure 1. Table 2
shows all 8 (πj, µj) pairs.
Figure 5: Best-fit discrete mixing function g(µ), (5.2), for Drug-mutation data; bars located at support points µj, heights proportional to weights πj; tall bar at µj = 0 has weight πj = 0.61. Solid curve is best-fit estimate f (z) = Σj πjϕ(z − µj); it closely matches natural spline fit of Figure 1.
Suppose for a moment that the estimated g(µ) is exactly correct, so 1.5% of the 444
cases have their µi’s equal to −10.9, 1.3% have −7.0, etc., and that an oracle has told us the
eight (πj, µj) values. Given an observed zi we can now calculate Prob{Uninteresting|z},
(2.3), exactly, once the scientist specifies the definition of Uninteresting versus Interesting.
It seems obvious that the 60.8% at µj = 0 are Uninteresting, and that the 10.6% at µj =
−10.9, −7.0, −4.9, and 6.1 deserve Interesting status. However the status of the 28.6% at
µj = −1.8, −1.1, and 2.4 is less clear.
If the 28.6% are deemed Interesting, this leaves only the 60.8% at µj = 0 as Uninteresting.
In terms of our Bayes model (2.1) we have p0 = .608 and f0(z) ∼ N(0, 1), the
theoretical null. About 174 of the 444 cases will be identified as Interesting, too many for
a typical screening exercise. Shifting the 28.6% to the Uninteresting classification increases
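p0 to .608 + .286 = .894, a more manageable value, and changes f0(z) to the version of (5.7)
supported on the four Uninteresting µj’s,
$$f_0(z) = \sum_{j \in U} \pi_j \varphi(z - \mu_j) \Big/ 0.894, \qquad U = \{j : \mu_j \in \{-1.8, -1.1, 0, 2.4\}\};$$
this is approximately N(−0.34, 1.19²), almost the same as the empirical null (1.8).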
In other words the definition of “Interesting” determines the relevant choice of the null
hypothesis f0. If we want to keep the proportion of Interesting cases manageably small then
f0(z) has to grow wider than N (0, 1).
Use of the term “Interesting” rather than “Significant” reflects a difference in intent
between large-scale and classical testing. In the hypothetical context of Figure 5 and Table 2,
all of the 39.2% of the cases with non-zero µi’s would eventually be declared as “significantly
different from zero” if we vastly increased the sample size of patients. Section 4 suggests
that minor deviations from N (0, 1) might arise from scientifically uninteresting causes such
as unobserved covariates. However even if a modestly non-zero µi is genuine in some sense,
it may still be Uninteresting when viewed in comparison with an ensemble of more dramatic
possibilities. Nonsignificant implies Uninteresting but not conversely.
6. A Microarray Example
Microarrays have become a prime source of large-scale simultaneous
testing problems. Figure 6 relates to a well-known microarray experiment concerning
differences between two types of genetic mutations causing increased breast cancer risk,
“BRCA1” and “BRCA2”; see Hedenfalk et al. (2001), also Efron and Tibshirani (2002), and
The experiment included 15 breast cancer patients, seven with the BRCA1 mutation
and eight with BRCA2. Each woman’s tumor was analyzed on a separate microarray, each
microarray reporting on the same set of N = 3226 genes. For each gene the two-sample
t-statistic yi comparing the 7 BRCA1 responses with the 8 BRCA2’s was computed. The
yi’s were then converted to z-values,
$$z_i = \Phi^{-1}(F_{13}(y_i)), \tag{6.1}$$
where F13 is the cdf of a standard t-distribution with 13 degrees of freedom. Figure 6 displays
the histogram of the 3226 z-values.
Table 2: Weights πj and locations µj for 8-point best-fit estimate g(µ) of Figure 5. Which
locations we deem Interesting versus Uninteresting determines the choice between the theoretical
or empirical null hypothesis. (Numerical results accurate to one decimal place.)
Figure 6: Histogram of N = 3226 z-values from breast cancer study. The theoretical N(0, 1) null is much narrower than the central peak, which has (δ0, σ0) = (−0.02, 1.58). In this case the central peak seems to include the entire histogram.
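The conversion from t-statistics to z-values in (6.1) can be sketched with the Python standard library alone. The t cdf below is obtained by Simpson's rule, which is an implementation assumption made to avoid external dependencies; a statistical library routine would normally be used instead.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1.0 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """cdf via Simpson's rule on [0, |t|], using symmetry about zero."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps                        # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3.0
    return 0.5 + area if t > 0 else 0.5 - area

def t_to_z(y, df=13):
    """z = Phi^{-1}(F_df(y)), the transformation in (6.1)."""
    return NormalDist().inv_cdf(t_cdf(y, df))

for y in (-4.0, -2.0, 0.0, 2.0, 4.0):
    print(f"t = {y:5.1f}  ->  z = {t_to_z(y):6.3f}")
```

The transformation pulls in the heavier tails of the t-distribution, so that a t-value has the same tail area under t13 as its z-value has under N(0, 1).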
The central peak is wider here than in Figure 1, with center-width estimates (δ0, σ0) =
(−0.02, 1.58). More importantly, the histogram seems to be all central peak, with no interesting
outliers such as those seen at the left of Figure 1. This was reflected in the local fdr
calculations: using the theoretical N(0, 1) null yielded 35 genes having fdr(zi) < 0.1, those
with |zi| > 3.35; using the empirical N(−0.02, 1.58²) null, no genes at all had fdr < 0.1
(or for that matter fdr < 0.9, the histogram in fact being a little short-tailed compared to
N(−0.02, 1.58²)).
There is ample reason to distrust the theoretical null in this case. The microarray
experiment for all its impressive technology is still an observational study, with a wide range
of unobserved covariates possibly distorting the BRCA1-BRCA2 comparison.
Another reason for doubt can be found in the data itself. The fdr methodology does
not require independence of the yi’s or zi’s across genes. However it does require that the
15 measurements for each gene be independent across the microarrays. Otherwise the two-
sample t-statistic yi will not have an F13 null distribution, not even approximately.
Unfortunately the experimental methodology used in the breast cancer study seems
to have induced substantial correlations among the various microarrays. In particular, as
discussed in Remark G, the first four microarrays in the BRCA2 groups were mutually
correlated, and likewise the last four. Correlations reduce the effective sample size for a
two-sample t-statistic, just the type of effect that would induce overdispersion in (6.1).
This does not say that there are no BRCA1-BRCA2 differences, only that it is dangerous
to compare the t-statistics with a standard t13 null distribution, even if simultaneous inference
7. Remarks
A. Drug-mutation Study
The data base for the Drug-mutation study, Wu et al. (2002),
included 2497 patients having HIV subtype B, of whom 1391 had received at least one of
six popular Protease Inhibitor drugs: amprenavir, indinavir, lopinavir, nelfinavir, ritonavir,
or saquinavir. Among the 1391, the mean number of PI drugs taken was 2.05 per patient.
Amino acid sequences were obtained at all 99 positions on the HIV protease gene, and
mutations from wild-type recorded; 25 positions showed 3 or fewer mutations among the
1391 patients, deemed too few for analysis, leaving 74 positions for the investigation here.
Each of the 74 individual logistic regressions included an intercept term as well as the six PI
main effects, but no other covariates.
B. The Local False Discovery Rate
The local fdr, (2.3) or (2.4), is closely related to Benjamini
and Hochberg’s (1995) “tail-area” False Discovery Rate, as discussed in Efron et al. (2001)
and Efron and Tibshirani (2002). Substituting cdf’s F0 and F for the densities f0 and f ,
Bayes theorem gives a tail-area version of (2.3),
$$\text{Prob}\{\text{Uninteresting} \mid z \le z_0\} = p_0 F_0(z_0)/F(z_0) \equiv \mathrm{FDR}(z_0). \tag{7.1}$$
FDR(z0) turns out to be the conditional expectation of fdr(z) ≡ p0f0(z)/f (z) given z ≤ z0,
$$\mathrm{FDR}(z_0) = \int_{-\infty}^{z_0} \mathrm{fdr}(z) f(z)\, dz \Big/ \int_{-\infty}^{z_0} f(z)\, dz. \tag{7.2}$$
Benjamini and Hochberg work in a frequentist framework but their False Discovery Rate
control rule can be stated in empirical Bayes terms: given F0, which they usually take to be
what we called the theoretical null, estimate FDR(z0) by
$$\widehat{\mathrm{FDR}}(z_0) = p_0 F_0(z_0)/\bar{F}(z_0), \tag{7.3}$$
where F̄ is the empirical cdf of the zi’s; for a desired control level α, say α = .05, define
$$z_0 = \arg\max_z \{\widehat{\mathrm{FDR}}(z) \le \alpha\}; \tag{7.4}$$
then rejecting all cases with zi ≤ z0 gives an expected (frequentist) rate of false discoveries
no greater than α.
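The empirical Bayes form of the control rule can be sketched as follows for the left tail. The synthetic z-values, the default p0 = 1 (a conservative choice), and the function names are assumptions made for illustration; this is not the Drug-mutation data.

```python
from statistics import NormalDist
import random

def fdr_hat(z0, z_values, F0=NormalDist(0.0, 1.0), p0=1.0):
    """Estimate FDR(z0) = p0 * F0(z0) / Fbar(z0), Fbar being the empirical cdf."""
    n_below = sum(1 for z in z_values if z <= z0)
    if n_below == 0:
        return 1.0
    return min(1.0, p0 * F0.cdf(z0) * len(z_values) / n_below)

def left_tail_cutoff(z_values, alpha=0.05, **kwargs):
    """Largest observed z0 with fdr_hat(z0) <= alpha; None if there is no such value."""
    candidates = [z for z in sorted(z_values) if fdr_hat(z, z_values, **kwargs) <= alpha]
    return candidates[-1] if candidates else None

rng = random.Random(0)
# Synthetic example: 900 null z-values and 100 non-null ones shifted to the left.
z = [rng.gauss(0.0, 1.0) for _ in range(900)] + [rng.gauss(-4.0, 1.0) for _ in range(100)]
z0 = left_tail_cutoff(z, alpha=0.05)
rejected = [zi for zi in z if zi <= z0]
print(f"z0 = {z0:.2f}; {len(rejected)} cases rejected")
```

Passing a different F0, for instance an empirical null, changes the cutoff and the number of rejections, which is the dependence on the choice of null that concerns us here.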
With z0 as in (7.4), relation (7.2) (applied to the estimated versions of FDR, fdr, and
f ) says that the weighted average of fdr(zi) for the cases rejected by the FDR level-α rule
is itself α. As an example take α = .05 and f0 equal the theoretical N (0, 1) null. Applying
the FDR control rule to the negative side of Figure 1’s Drug-mutation data rejects the null
hypothesis for the 56 cases having zi ≤ −2.61; the corresponding 56 values of fdr(zi) have
weighted average α = .05. They vary from nearly zero at the far left to .19 at the boundary
value z = −2.61, justifying the name “local”: zi’s near the boundary are more likely to be
false discoveries than the overall .05 rate suggests.
Our concern with a correct choice of null hypothesis applies to FDR just as well as fdr.
In the microarray study, FDR with F0 = N (0, 1) gives 24 significant genes at α = .05, while
F0 = N(−.02, 1.58²) gives none. In fact any simultaneous testing procedure, the popular
Westfall-Young method (1993) for example, will depend on a correct assessment of p-values
for the individual cases, i.e. on the choice of F0.
C. Estimating f (z)
The Poisson regression method used in Figure 1 to estimate the mixture
density f (z), (2.2), originates in an idea of Lindsey described in Section 2 of Efron and
Tibshirani (1996): the range of the sample z1, z2, . . . zN is partitioned into K equal intervals,
with interval k having midpoint xk and containing count sk of the N z-values; the expectation
λk of sk is nearly proportional to fk ≡ f (xk), and if the zi’s are independent the counts
approximate independent Poisson variates,
$$s_k \overset{\text{ind}}{\sim} \text{Po}(\lambda_k), \qquad \lambda_k = c f_k \qquad [k = 1, 2, \ldots, K], \tag{7.5}$$
c a constant depending on N and the interval length.
Lindsey’s method is to estimate the λk’s with a Poisson regression, which because of
(7.5) amounts to estimating a scaled version of the fk’s; in other words estimating f (z). K equals 60 in Figure 1, with the regression model being a natural spline with 7 degrees of
freedom, roughly equivalent to a sixth degree polynomial fit in z.
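Lindsey's method can be sketched in plain Python. As a simplifying assumption the sketch replaces the natural spline basis with a fourth-degree polynomial in the bin midpoint, fits the Poisson regression by iteratively reweighted least squares, and uses synthetic N(0, 1) z-values rather than the Drug-mutation data; all function names are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def lindsey_fit(z_values, K=60, degree=4, iterations=30):
    """Bin z into K equal intervals and fit log(lambda_k) as a polynomial in the
    bin midpoint by Poisson regression (iteratively reweighted least squares)."""
    lo, hi = min(z_values), max(z_values)
    width = (hi - lo) / K
    counts = [0] * K
    for z in z_values:
        counts[min(int((z - lo) / width), K - 1)] += 1
    mids = [lo + (k + 0.5) * width for k in range(K)]
    scale = max(abs(lo), abs(hi))                      # rescale for numerical stability
    X = [[(x / scale) ** p for p in range(degree + 1)] for x in mids]
    mu = [s + 0.5 for s in counts]                     # common GLM starting values
    for _ in range(iterations):
        eta = [math.log(m) for m in mu]
        work = [e + (s - m) / m for e, s, m in zip(eta, counts, mu)]
        A = [[sum(m * xi[i] * xi[j] for xi, m in zip(X, mu)) for j in range(degree + 1)]
             for i in range(degree + 1)]
        b = [sum(m * xi[i] * w for xi, m, w in zip(X, mu, work)) for i in range(degree + 1)]
        beta = solve(A, b)
        mu = [math.exp(sum(bj * xj for bj, xj in zip(beta, xi))) for xi in X]
    f_hat = [m / (len(z_values) * width) for m in mu]  # rescale lambda_k to a density
    return mids, counts, f_hat

rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(2000)]         # synthetic z-values
mids, counts, f_hat = lindsey_fit(z)
k_mid = min(range(len(mids)), key=lambda k: abs(mids[k]))
print(f"estimated f near z = 0: {f_hat[k_mid]:.3f}")
```

Because the model includes an intercept, the fitted counts sum to N at convergence, so the rescaled estimate integrates to one over the binned range.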
Poisson regression based on (7.5), is almost fully efficient for estimating f (z) if the zi’s
are independent. Here we do not expect independence but we still have the expectation of
sk proportional to fk. The Poisson regression method will still tend to unbiasedly estimate
f (z), assuming the regression model is sufficiently flexible, though we may lose estimating
efficiency.
The bootstrap analysis that gave the standard errors in (3.2) was also used to check
(7.5). This turned out to be surprisingly accurate for the Drug-mutation data. If not we
might have used the bootstrap estimate of covariance for the sk’s to motivate a more efficient
estimation procedure, though this is unlikely to be important for large values of N . In any
case bootstrap analyses as in (3.2) will provide legitimate standard errors for the Poisson
regression whether or not (7.5) is valid.
D. Estimating the Empirical Null Distribution
The main tactic of this paper is to estimate
the null distribution f0(z) in (2.1) from the central peak in the z-values’ histogram. Assuming
f0 is normal, N(δ0, σ0²), we have
$$\log f(z) \doteq -\frac{1}{2} \left(\frac{z - \delta_0}{\sigma_0}\right)^2 + \text{constant} \tag{7.6}$$
for z near zero, so that δ0 and σ0 can be estimated by differentiating log f (z) as in (3.1).
The constant depends on N and p0 but the constant has no effect on the derivatives of (3.1).
Directly differentiating the spline estimate of log f (z) can give an overly variable estimate
of σ0. One more smoothing step was employed here: a quadratic curve a0 + a1xk + a2xk² was
fit by ordinary least squares to the estimated values log fk, for xk within 1.5 units of the
maximum δ0, yielding σ0 = [−2a2]^(−1/2) as in (3.1). This procedure gave the small bootstrap
standard error estimate in (3.2).
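The quadratic smoothing step can be sketched as follows. The input here is the exact log density of N(−0.35, 1.20²) on a grid of midpoints, an assumption standing in for the estimated values log fk; the function names and the use of Cramer's rule for the 3 × 3 normal equations are implementation choices.

```python
import math
from statistics import NormalDist

def quadratic_ols(xs, ys):
    """Least-squares fit ys ~ a0 + a1*x + a2*x^2 via the 3x3 normal equations."""
    S = [sum(x ** p for x in xs) for p in range(5)]            # power sums S0..S4
    T = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]
    A = [[S[i + j] for j in range(3)] for i in range(3)]
    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det3(A)
    coefs = []
    for k in range(3):                                          # Cramer's rule
        Ak = [[T[i] if j == k else A[i][j] for j in range(3)] for i in range(3)]
        coefs.append(det3(Ak) / d)
    return coefs

def center_and_width(mids, log_f, half_window=1.5):
    """delta0 = midpoint with the largest log f; sigma0 = [-2*a2]^(-1/2) from a
    quadratic fitted to log f within half_window units of delta0."""
    delta = mids[max(range(len(mids)), key=lambda k: log_f[k])]
    near = [(x, y) for x, y in zip(mids, log_f) if abs(x - delta) <= half_window]
    a0, a1, a2 = quadratic_ols([x for x, _ in near], [y for _, y in near])
    return delta, (-2.0 * a2) ** -0.5

null = NormalDist(-0.35, 1.20)
mids = [-4.0 + 0.1 * k for k in range(81)]
log_f = [math.log(null.pdf(x)) for x in mids]
delta0, sigma0 = center_and_width(mids, log_f)
print(f"delta0 = {delta0:.2f}, sigma0 = {sigma0:.3f}")
```

For an exactly normal log density the quadratic fit is exact, so the width is recovered without error, while the center is resolved only to the spacing of the midpoints.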
None of this methodology is crucial, though it is important that the estimates δ0 and
σ0 relate directly to f0(z), and are not much affected by the non-Null distribution f1(z) in
(2.1). As an example of what can go wrong suppose we try to estimate σ0 by a “robust”
scale measure such as (84th quantile minus 16th quantile)/2. This gives σ0 = 1.47 for the
Drug-mutation data, reflecting long tails due to the Interesting cases in Figure 1. Similar
difficulties arise using the central slope of a qq plot. Basically a density estimate of the
central peak is required, and then some assessment of its center and width.
More ambitiously, we might try extending the estimation of f0(z) to third moments,
permitting a skew null distribution. Expression (7.6) could be generalized to
\[ -\log f(z) \doteq c_0 + c_1 z + c_2 z^2/2 + c_3 z^3/6 , \]
now requiring three derivatives to estimate the coefficients rather than the two of (3.1). This
is an unexplored path, and in particular Table 1 has not been extended to include skewness
Familiarity was the only reason for using z-values instead of t-values in Figures 1 and 6.

E. Estimating p0. We can obtain reasonable upper bounds for p0 in (2.1) from estimates of
\[ \pi(c) \equiv \mathrm{Prob}_f\{ z_i \in \delta_0 \pm c\sigma_0 \} . \]
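Supposing f0(z) = N(δ0, σ0^2), define
\[ G_0(c) = 2\Phi(c) - 1 \quad\text{and}\quad G_1(c) = \int_{\delta_0 - c\sigma_0}^{\delta_0 + c\sigma_0} f_1(z)\,dz , \]
the probabilities that zi ∈ δ0 ± cσ0 under f0 and f1 respectively. Then π(c) = p0 G0(c) + (1 − p0) G1(c), so
\[ p_0 = \frac{\pi(c) - G_1(c)}{G_0(c) - G_1(c)} \le \frac{\pi(c)}{G_0(c)} , \]
the inequality following from the assumption that G1(c) ≤ G0(c), i.e. that the f1 density is
more dispersed than f0.
This leads to the estimated upper bound for p0,
\[ \hat p_0 = \hat\pi(c)/G_0(c), \qquad \hat\pi(c) = \#\{ z_i \in \delta_0 \pm c\sigma_0 \}/N . \]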
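The estimated upper bound amounts to counting the z-values in the central interval and dividing by the normal coverage G0(c). A minimal sketch follows, assuming the z-values are held in a plain list and that δ0 and σ0 have already been estimated; the function names are illustrative only.

```python
import math


def normal_central_prob(c):
    """G0(c) = 2*Phi(c) - 1, the N(delta0, sigma0^2) probability of delta0 +/- c*sigma0."""
    return math.erf(c / math.sqrt(2.0))


def p0_upper_bound(z_values, delta0, sigma0, c=1.5):
    """Estimated upper bound for p0: (fraction of z_i in delta0 +/- c*sigma0) / G0(c)."""
    n_inside = sum(1 for z in z_values if abs(z - delta0) <= c * sigma0)
    pi_hat = n_inside / len(z_values)
    return pi_hat / normal_central_prob(c)


# Hypothetical z-values, used only to show the call.
example_z = [-1.9, -0.8, -0.4, -0.3, 0.1, 0.6, 1.2, 3.5, 4.1, -0.2]
bound = p0_upper_bound(example_z, delta0=-0.35, sigma0=1.20, c=1.5)
```

The result can exceed 1 in small samples, since it is an upper bound estimate rather than a constrained probability; no clipping is applied in this sketch.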
In particular if we assume G1(c) = 0, in other words that the Interesting zi’s always fall
outside δ0 ± cσ0, then p̂0 = π̂(c)/G0(c) is unbiased. (This is the same estimate suggested
in Remark F of Efron et al. (2001).) Choosing (δ0, σ0) = (−0.35, 1.20) and c = 1.5 gave
p̂0 = 0.88 for the Drug-mutation data, with bootstrap standard error 0.024.

F. Single-point Solutions for (δmax, σmax). The distributions g(µ) providing (δmax, σmax) in
(5.6), as graphed in Figure 4, have their non-zero components supported at a single point
µ1. For example, g(µ) for the entry giving σmax = 1.04 in Table 1 puts probability 0.90 at
µ = 0 and 0.10 at µ1 = 1.47. Single-point optimality was proved for three of the four cases in
Figure 4, and verified by numerical maximization for the “General” case. Here is the proof
for the σmax “Symmetric” case, the other two proofs being similar.
We consider symmetric distributions putting probability p0 on µ = 0 and probabilities
pj on symmetric pairs (−µj, µj), j = 1, 2, . . . J, so (5.3) becomes
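\[ f(z) = p_0\varphi(z) + \sum_{j=1}^{J} p_j\,[\varphi(z-\mu_j) + \varphi(z+\mu_j)]/2 . \qquad (7.12) \]
Defining c0 = p0/(1 − p0), rj = pj/p0, and r+ = Σj rj = 1/c0, we can express σ0 in (3.1) as
\[ \sigma_0 = (1 - Q)^{-1/2}, \qquad Q = \frac{\sum_{j=1}^{J} r_j \mu_j^2 e^{-\mu_j^2/2}}{\sum_{j=1}^{J} r_j\,(c_0 + e^{-\mu_j^2/2})} , \qquad (7.13) \]
where the denominator uses c0 r+ = 1.
Here we have used δ0 = 0, which is true by symmetry assuming p0 ≥ 1/2. Then σmax in
(5.6) can be found by maximizing Q.
We will show that with p0 (and c0) and µ1, µ2, . . . , µJ held fixed in (7.12), Q is maximized
by a choice of p1, p2, . . . , pJ having J − 1 zero values; this is a stronger version of the single-point
result. Because Q is homogeneous in r = (r1, r2, . . . , rJ) in (7.13), we can consider the
unconstrained maximization of Q(r), subject only to rj ≥ 0 for j = 1, 2, . . . , J. Differentiation gives
\[ \frac{\partial Q(r)}{\partial r_j} = \frac{1}{\mathrm{den}}\left[ \mu_j^2 e^{-\mu_j^2/2} - Q(r)\,(c_0 + e^{-\mu_j^2/2}) \right] , \]
with “den” the denominator of Q. At a maximizing point r we must have ∂Q(r)/∂rj ≤ 0, with
equality if rj > 0. Defining Rj = µj^2/(1 + c0 e^(µj^2/2)), this gives
\[ Q(r) \ge R_j \quad \text{with equality if } r_j > 0 . \]
Since Q(r) is the maximum, this says that rj, and pj, can only be non-zero if j maximizes Rj.
In case of ties we can arbitrarily choose one of the maximizing j’s.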
All of this shows that we need only consider J = 1 in (7.12). The global maximized
value of σ0 in (7.12) is σmax = (1 − Rmax)^(−1/2), where
\[ R_{\max} = \max_{\mu_1} \left\{ \mu_1^2 / (1 + c_0 e^{\mu_1^2/2}) \right\} . \]
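The one-dimensional maximization defining Rmax is easy to check numerically. The sketch below is illustrative only: it assumes the Symmetric case treated in this proof, uses a golden-section search (a choice made here, relying on R being unimodal in µ1 on the search interval), and the function names are hypothetical.

```python
import math


def R(mu, c0):
    """R(mu) = mu^2 / (1 + c0 * exp(mu^2 / 2))."""
    return mu ** 2 / (1.0 + c0 * math.exp(mu ** 2 / 2.0))


def maximize_R(p0, lo=0.0, hi=4.0, tol=1e-10):
    """Golden-section search for the maximizing mu1; returns (mu1, Rmax, sigma_max)."""
    c0 = p0 / (1.0 - p0)
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        x1 = b - g * (b - a)
        x2 = a + g * (b - a)
        if R(x1, c0) < R(x2, c0):
            a = x1          # the maximum lies to the right of x1
        else:
            b = x2          # the maximum lies to the left of x2
    mu1 = 0.5 * (a + b)
    r_max = R(mu1, c0)
    sigma_max = (1.0 - r_max) ** -0.5   # sigma_max = (1 - Rmax)^(-1/2)
    return mu1, r_max, sigma_max


# Evaluate at the two endpoints of the p0 range discussed in this remark.
result_p95 = maximize_R(0.95)
result_p70 = maximize_R(0.70)
```

The maximizing µ1 values returned for p0 = .95 and p0 = .70 can be compared with the range quoted in this remark.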
The maximizing argument µ1 ranges from 1.43 for p0 = .95 to 1.51 for p0 = .70. The
corresponding result for δmax is simpler, µ1 = δmax + 1.

G. Microarray Correlation in the Breast Cancer Study
correlation structure among the eight BRCA2 microarrays. Let X be the 3226 × 8 matrix
of BRCA2 data, with the columns of X standardized to have mean 0 and variance 1. A
“de-gened” matrix X̃ was formed by subtracting row-wise averages from each element of X,
\[ \tilde x_{ij} = x_{ij} - \bar x_{i\cdot}, \qquad \bar x_{i\cdot} = \sum_{j=1}^{8} x_{ij}/8 . \qquad (7.17) \]
Table 3 shows the 8 × 8 correlation matrix of X̃. With genuine gene effects subtracted out,
the correlations should vary around −1/7 = −0.14 if the columns of X are independent.
Instead we see that the columns are correlated in blocks of four, with the off-diagonal block
too negative and the on-diagonal blocks too positive.

Table 3: Correlation matrix for the BRCA2 data with row-wise means subtracted off, (7.17).
It indicates positive correlations within the two blocks of four.
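The −1/7 benchmark used in Remark G can be checked by simulation. The sketch below is not the BRCA2 analysis: it assumes a hypothetical 3226 × 8 matrix of independent N(0, 1) entries (so the columns are only approximately standardized), subtracts row-wise averages, and computes the column correlation matrix in plain Python.

```python
import math
import random


def degened_correlations(n_rows=3226, n_cols=8, seed=0):
    """Column correlations of independent normal data after removing row means."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    # Subtract the row-wise average from each element.
    X_tilde = []
    for row in X:
        row_mean = sum(row) / n_cols
        X_tilde.append([x - row_mean for x in row])
    # Center each column and compute its Euclidean norm.
    cols = list(zip(*X_tilde))
    col_means = [sum(col) / n_rows for col in cols]
    centered = [[v - m for v in col] for col, m in zip(cols, col_means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    # Correlation matrix as normalized inner products of centered columns.
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
            for j in range(n_cols)
        ]
        for i in range(n_cols)
    ]


corr = degened_correlations()
off_diagonal = [corr[i][j] for i in range(8) for j in range(8) if i != j]
mean_off_diagonal = sum(off_diagonal) / len(off_diagonal)
```

With independent columns the off-diagonal entries scatter around −1/(8 − 1), so a block pattern like the one described for Table 3 would not be expected under independence.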
Large-scale simultaneous hypothesis testing, where the number of cases
exceeds say 100, permits the empirical estimation of a null hypothesis distribution. The empirical
null may be wider (more dispersed) than the theoretical null distribution that would
ordinarily be used for a single hypothesis test. The choice between empirical and theoretical
nulls can greatly influence which cases are identified as “Significant” or “Interesting”, as opposed
to “Null” or “Uninteresting”, this being true no matter which simultaneous hypothesis
testing method is used.
We present an analysis plan for large-scale testing situations:
• A density fitting technique is used to estimate the null hypothesis distribution f0,
• The local false discovery rate, an empirical Bayes version of standard FDR theory,
provides inferences for the N cases, Figure 3 and Section 2.
There are many possible reasons for overdispersion of the empirical null distribution that
would lead to the empirical null being preferred for simultaneous testing:
• Unobserved covariates in an observational study, Section 4.
• Hidden correlations, Section 6.
• A large proportion of genuine but uninterestingly small effects, Figure 5.
Large-scale testing differs in scientific intent from an individual hypothesis test. The
latter is most often designed to reject the null hypothesis with high probability. Large-scale
testing is usually more of a screening operation, intended to identify a small percentage of
Interesting cases, assumed to be on the order of 10% or less in this paper. Our estimation
technique for the empirical null hypothesis is designed to be accurate under this constraint,
Figure 4. More traditional estimation methods, involving permutations or quantiles, give
incorrect f0 estimates, Section 4 and Remark D.

Acknowledgment. I am grateful to Robert Shafer, David Katzenstein, and Rami Kantor
for bringing the Drug-mutation data to my attention, and to Robert Tibshirani for several
References
Benjamini, Y. and Hochberg, Y. (1995). “Controlling the false discovery rate: A practical
and powerful approach to multiple testing”. J.R. Stat. Soc. Ser. B Stat. Methodol. 57
Dudoit, S., Shaffer, J., and Boldrick, J. (2003). “Multiple hypothesis testing in microarray
experiments”. Statistical Science 18 71-103.
Efron, B. (2003). “Robbins, empirical Bayes, and microarrays”. Annals Stat. 31 366-378.
Efron, B. and Tibshirani, R. (2002). “Empirical Bayes methods and false discovery rates for
microarrays”. Genetic Epidemiology 23 70-86.
Efron, B., Tibshirani, R., Storey, J. and Tusher, V. (2001). “Empirical Bayes analysis of a
microarray experiment”. J. Amer. Statist. Assoc. 96 1151-1160.
Efron, B. and Tibshirani, R. (1996). “Using specially designed exponential families for
density estimation”. Annals Stat. 24 2431-61.
Efron, B. (1988). “Three examples of computer-intensive statistical inference”. Sankhya 50
Hedenfalk, I., Duggan, D., Chen, Y., et al. (2001). “Gene expression profiles in hereditary
breast cancer”. New Engl. Jour. Medicine 344 539-48.
Miller, R. (1981). Simultaneous Statistical Inference, Second Edition, Springer-Verlag, New York.
Westfall, P. and Young, S. (1993). Resampling-based multiple testing: examples and methods
for p-value adjustments. Wiley, New York.
Wu, T., Schiffer, C., Shafer, R. et al. (2003). “Mutation patterns and structural correlates
in Human Immunodeficiency Virus Type 1 Protease following different protease inhibitor
treatments”. Jour. Virology 77(8) 4836-47.