## Maths.lth.se

Regression Course – Additional Notes I & II These consolidate the earlier separate sets of additional notes – I & II Balanced vs Unbalanced Data – What is the Effect? This section illustrates, in a very simple case, confounding effects that may arise when data are unbal-anced. It demonstrates, also, how the estimate for a term may change when a further term is includedin the model.
In a game of cricket the bowler’s aim is to give away as few runs as possible, while taking the maximumpossible number of wickets. (A wicket is taken when a batsman is given out). The game is divided intotwo innings, and the conditions can be very different between the two innings. Here the interest is incomparing the performances of different bowlers.
Thus Bowler A does better than Bowler B in both innings (gives away fewer runs for each wicket taken), but ends up giving away more runs per wicket overall. Observe that Bowler A did more of thebowling in the first innings, when there were more runs per wicket. Thus Bowler A’s runs per wicketaverage is skewed towards the first innings runs per wicket rate.
Here is the 2 by 2 table of runs per wicket information. Adding the total of runs for each bowler, and dividing in each case by the corresponding total of wickets is equivalent to taking averages for eachbowler of runs per wicket, with number of wickets as weight.
The default contrasts used by the function lm() take the baseline for all effects (the “intercept”) as the outcome for Innings 1 and Bowler A. With innings 1 is taken as the baseline, the innings effect(Innings 2 - Innings 1) appears as -33, as in the above table. With Bowler A as the baseline, the bowlereffect (Bowler B - Bowler A) appears as -7.
BALANCED VS UNBALANCED DATA – WHAT IS THE EFFECT? > cricket <- data.frame(Innings=factor(rep(c("Inn1","Inn2"),2)),+ Bowler=factor(rep(c("A","B"), c(2,2)))) > cricket\$rw <- c(50,14,40,10)> cricket\$w <- c(1,5,6,4)> cricket > cricket.lm <- lm(rw ~ Innings+Bowler, data=cricket)> round(coef(cricket.lm), 2)(Intercept) InningsInn2 Now omit consideration of the innings effect: > cricket1.lm <- lm(rw ~ Bowler, data=cricket)> round(coef(cricket1.lm), 2)(Intercept) Observe that the estimate of the bowler effect is unchanged.
> cricket.wlm <- lm(rw ~ Bowler+innings, weight=w, data=cricket)> round(coef(cricket.wlm), 2) Observe that the estimate of the bowler effect has changed, but is still fair to Bowler B.
Now omit consideration of the innings effect: > cricket1.wlm <- lm(rw ~ Bowler, weight=w, data=cricket)> round(coef(cricket1.wlm), 2)(Intercept) This is equivalent to dividing the overall number of runs, for each bowler, by the overall number ofwickets. As above, Bowler B is estimated as costing 8 runs per wicket more than Bowler A. Thecombination of unequal weights and omission of a relevant factor generates this misleading result.
Calculations using Householder reflections First, form the matrix on which the calculations will be performed.
> Xy <- cbind(model.matrix(~ Bowler + Innings, data=cricket), Now use my function house() for the calculation.
To obtain estimates of the model parameters, solve: Here are the calculations for the weighted analysis: > house(Xy*sqrt(cricket\$w), showzeros=3) A key difference is that the R matrix no longer has a zero in the (2, 3) position. The consequence isthat the estimate of the bowler effect changes (and changes in sign) if there is no allowance for Innings.
The table shows, separately for males and females, the effect of pentazocine on post-operative pain pro-files (average VAS scores), with (mbac and fbac) and without (mpl and fpl) preoperatively administeredbaclofen. Pain scores are recorded every 20 minutes, from 10 minutes to 170 minutes. Results are shownfor 30 minutes and 50 minutes only.
15 females were given baclofen, as against 3 males. 7 females received the placebo, as against 16 males. Averages for the two treatments (baclofen/placebo), taken over all trial participants and ignoringsex, are misleading. Notice that, both for males and for females, the scores are lower when baclofen is administered. Reliance on the overall average would suggest that the scores are higher when baclofenis administered.
Here our interest will be in comparing residual sums of squares for models that are not nested. Themethodology will be demonstrated using the fruitohms data in the DAAG package.
For this initial discussion, denote the models as M1 and M2. We first fit the model M1 to the originaldata, then obtaining fitted values and residuals. Also, we fit M2, and note the change as measured bya statistic ˜ F12 that is calculated just as for the usual F -statistic.
If M2 is not giving an improvement, then ˜ F12 will differ only by statistical error from the F12 observed when a random set of residuals are added. It will, in effect, be a random sample from the distributionobtained when repeated random sets of residuals are added on to the fitted values, and F12 is calculatedfor each such new set of “observations”. We can then locate ˜ above its 95th percentile then we will judge that, at the 5% significance level, M2 improves on M1. Thechange when the residuals are correctly attached is, on this criterion, more than statistical variation.
With slight modification, the argument applies if the resdiduals are obtained by permuting the original residuals. If model M2 is not improving on M1, then the change when new “observations”are derived by adding randomly permuted residuals back onto the fitted values will be attributable tostatistical variation.
It also applies, again in slightly modified from, if repeated boostrap samples of the original residuals are used. Note that we are not, here, deriving bootstrap estimates of some parameter. While there aretheoretical issues even with this use of the bootstrap, we do not have the issues of bias that commonlyarise when the bootstrap is used to estimate variance-like statistics.
Note that the three distributions contemplated above are different. There will, accordingly, be differences of power between the three approaches.
The methodology can be varied or extended in various ways: • We can choose M2 if it appears, on average, to improve on M1, i.e., choose M2 if the p-value is • This same style of approach can be used if M2 is selected from a wider class of models. The model selection step must then be repeated for each new bootstrap or permutation sample.
Here are functions that can handle the calculations: ## mmat1 & mmat2 must be model matricesM1 <- mmat1M2 <- mmat2n <- dim(M1)qrM1 <- qr(M1)qrM2 <- qr(M2)ss <- c(sum(qr.resid(qrM1, y)^2), df <- dim(mmat1) - c(qrM1\$rank, qrM2\$rank)ssd <- ss - ssF12 <- ssd/ss*df print(paste("Estimates of sigma^2 (Xply by ", divide, ")", sep=""))names(ss) <- paste(df, "df", sep="")print(ss/df/divide)print(paste("’F-statistic’ = ", round(F12,4), " (df=", df-df, " & ", df,")", sep="")) First try running funF() with the initial data: > M1.lm <- lm(ohms ~ poly(juice, 3), data=fruitohms)> M2.lm <- lm(ohms ~ ns(juice, 4), data=fruitohms)> M1 <- model.matrix(M1.lm)> M2 <- model.matrix(M2.lm)> funF(M1, M2, y=fruitohms\$ohms,showstats=TRUE) "Estimates of sigma^2 (Xply by 1e+06)" 0.9723468 0.9343716 "’F-statistic’ = 6.0397 (df=1 & 123)" The function bootF now shown can be used for the calculations. The default is to use bootstrap sam- ples of the residuals (type="ordinary"). Other options are type="permutation" and type="parametric".
function(data=fruitohms, statistic=funF, R=999, form1= ohms ~ poly(juice, 3), form2= ohms ~ ns(juice,4),type=c("ordinary", "parametric", "permutation")){ ## By default, type="ordinary" is used.
## Alternatives are type="parametric" or type="permutation"M1mod <- lm(form1, data=data)sigma <- summary(M1mod)\$sigman <- dim(data)M1 <- model.matrix(M1mod)M2 <- model.matrix(form2, data=data)M1fit <- fitted(M1mod)M1resid <- resid(M1mod)R2 <- R+1y <- M1fit+M1residboot.out <- numeric(R2)boot.out <- funF(mmat1=M1, mmat2=M2, y=y, showstats=TRUE)for(i in 2:R2){ index <- resid <- M1resid[sample(1:n)] else "ordinary") resid <- M1resid[sample(1:n, replace=TRUE)] else if(type =="parametric") resid <- rnorm(n, sd=sigma)y <- M1fit+residboot.out[i] <- statistic(mmat1=M1, mmat2=M2, y=y) }testval <- boot.outpval <- sum(boot.out >= testval)/R2print(c("p-value" = round(pval,5)))invisible(boot.out) We will now test the use of ordinary bootstrap samples in a situation where we pretty much know theanswer. For this purpose, we take the models to be ohms\$poly(juice,2) and ohms\$poly(juice,3),so that the models are nested.
> poly45stats <- bootF(form1=ohms ~ poly(juice,4), form2 = ohms ~ poly(juice,5)) "Estimates of sigma^2 (Xply by 1e+06)" 0.8878564 0.8859704 "’F-statistic’ = 1.2618 (df=1 & 122)"p-value Here, this statistic is an analysis of variance F -statistic. Thus, we may, provided iid normality assumptions are acceptable, refer it to the relevant theoretical F -distribution. This gives a p-value thatis essentially the same as the bootstrap distribution p-value.
Saving the values in poly45stats allows us, if we wish, to examine other percentiles of the bootstrapdistribution.
For the comparison between poly(juice,3) and ns(juice, 4) we have: > poly3ns4ord <- bootF(form1=ohms ~ poly(juice,3), form2 = ohms ~ ns(juice, 4)) "Estimates of sigma^2 (Xply by 1e+06)" 0.9723468 0.9343716 "’F-statistic’ = 6.0397 (df=1 & 123)"p-value This is not, strictly, a bootstrap. Rather it is a simulation that is based on a theoretical distribution.
Here, the residuals are sampled from the theoretical normal, with the standard deviation taken to bethe square root of the mean residual sum of squares from fitting the model M1.
> poly3ns4sim <- bootF(form1=ohms ~ poly(juice,3),+ form2 = ohms ~ ns(juice, 4), type="parametric")  "Estimates of sigma^2 (Xply by 1e+06)" 0.9723468 0.9343716 "’F-statistic’ = 6.0397 (df=1 & 123)"p-value > poly3ns4perm <- bootF(form1=ohms ~ poly(juice,3),+ form2 = ohms ~ ns(juice, 4), type="permutation")  "Estimates of sigma^2 (Xply by 1e+06)" 0.9723468 0.9343716 "’F-statistic’ = 6.0397 (df=1 & 123)"p-value The permutation distribution is widely useful in contexts where the asymptotic or other theory is in doubt, and where the null hypothesis implies that permuting the y-values, or permuting the values of LOGISTIC REGRESSION VS MULTI-WAY TABULATION an explanatory variable, should on average not affect the model’s fitted values. It can be useful in thefitting of logistic regression models.
Logistic Regression vs Multi-way Tabulation Models for multi-way tables that allow or all interactions, at all levels, are said to be “saturated”. Fittedvalues from a logistic regression, with a model that is thus “saturated”, equal the probabilities that mayalternatively be derived from the equivalent multi-way table. Code will be given that may be used toverify the equivalence of the multi-way table to fitted values from a logistic model. (Actually, it is forthis purpose immaterial what link is used with the binomial or quasibinomial model.) Data, in the data frame nassCDS in the DAAGxtras package, were collected according to a samplingdesign where different cells had different weights. Whether using the logistic regression or tabulatingthe proportions, it is necessary to take account of the weights.
The data have had a central role in a controversy, debated in three articles in the journal Chance, on the effectiveness of airbags. References, both to these articles and to relevant web pages, are givenon the help page for nassCDS.
Various biases may affect the result. There are alternatives to the style of analysis discussed here.
Farmer (2006) discusses an approach that is preferred by the National Highway and Traffic SafetyAdministraton (NHTSA), and which gives an answer that is favourable to the use of airbags.
Notwithstanding care to consider all relevant effects, it remains possible that there will be relevantfactors of interactions that have not been considered. The relevant information may not be included inthe data. A useful strategy, with data such as these, may be: • Account first for those effects that on prior grounds (relevant science, previous experience with related data), seem certain to have a role. Such arguments justify use, for the present data, of thefactors seatbelt, airbag and dvcat.
• Investigate addition of other possible effects one at a time.
Stepwise and best subsets automatic variable selection procedures have a much more limited useful- ness than older textbooks on regression may have suggested. Cross-validation or bootstrap approachesshould be used to check out the stability of the selection with respect to statistical variation. Notethat the use of automatic selection priocedures invalidates, in general, estimates that classical linearmodel theory gives for the standard errors of parameters. Bootstrap estimates of the standard errors,perhaps obtained along with the procedure used to check out the stability of the selection, may providea workaround.
The analysis here will be limited to the factors seatbelt and airbag, leaving as an exercise extensionto account for the force of impact measure (dvcat). Such a more extended analysis makes it clear thatany defendable analysis in the style of the analysis discussed here must, as a minimum, include thesethree factors and their interactions.
Almost certainly, there are other factors, not considered in any of the analyses presented in the Chance articles, that have affected results. A more complete analysis will require consideration of furtherpossible effects for which data are available. If more than one or two of those factors are included thereis a risk, even with this relatively large data set, that it will become impossible to distinguish the likelyeffects of airbags from those of other factors.
Here is the code for the limited (and misleading!) tabulations presented here: tot <- xtabs(weight ~ seatbelt+airbag, data=nassCDS)dead <- xtabs(I(weight*(unclass(dead)-1)) ~ seatbelt+airbag, data=nassCDS) nass.glm <- glm(dead ~ seatbelt*airbag, weight=weight, family=binomial, ## Reconcile with tablated resultdf <- with(nassCDS, expand.grid(seatbelt=factor(levels(seatbelt), df\$tot <- as.vector(xtabs(weight ~ seatbelt+airbag, data=nassCDS))nasshat <- predict(nass.glm, newdata=df, type="response", se=TRUE)df\$estdead <- nasshat\$fit*df\$totxtabs(tot ~ seatbelt+airbag, data=df) xtabs(estdead ~ seatbelt+airbag, data=df) Bootstrap esimates of the excess risk from airbags can be obtained thus: xtra <- matrix(0, nrow=2, ncol=1000)nass <- nassCDS[nassCDS\$weight>0,]prob=with(nass, weight/sum(weight))for(i in 1:1000){ nrows <- sample(1:dim(nass), prob=prob, replace=TRUE)xtra[,i] <- excessRisk(form = weight ~ seatbelt + airbag, Calculations can take a long time. Run this to calculate 10 or 100 bootstrap samples before running it forthe full 1000 samples. Percentile estimates of the confidence limits may in this instance be satisfactory.
Propensities are one of a number of devices that may be used in the attempt to reduce the number ofexplanatory variables that need to be considered in a regression. The method is intended for a veryspecific, but important, context where there are two (or, potentially, more) levels of a treatment factor.
The aim is to investigate treatment effects, after adjustment for covariate effects.
The attempt to adjust for multiple potential covariate effects has a variety of complications. The correct functional form must be used – it may not be adequate to assume additive linear effects, evenafter transformation of covariates in cases where this seems desirable. Diagnostic checking may bedifficult; failure to account adequately for the effect of one or more variable may lead to misleadingdiagnostics for other variables.
The derivation and use of propensity scores can simplify the model fitting process. The complications that arise from the attempt to adjust for multiple covariates are limited to the modeling used to predictthe propensity scores. Having derived propensity scores, the regression model that incorporates thetreatment effect has two terms – a treatment effect, and a single covariate adjustment term.
Discussion will be limited to the case where there is a binary treatment assignment. For the purposes ofthe data that will be studied here, this will be a control and a treatment. A propensity is the conditional probability λ(x) of assignment to a particular treatment given a vector of observed covariates x. Themethodology requires that treatment assignment should be ignorable given the propensity, i.e., treatmentassignment should be unrelated to potential outcomes within strata defined by λ(x). Conditional on thepropensity score, the distributions of the observed covariates are independent of the binary treatmentassignment.1 This allows use of the propensity score as a balancing score. See Rosenbaum (2002, pp.296-297), perhaps supplemented by Rosenbaum (1999) and Rosenbaum and Rubin (1983), for adiscussion.
The propensity score, or a monotone function of the score, can be estimated using discriminant analysis methodology, independently of the outcome yi. The regression equation becomes where the functional form of φ() has to be estimated or guessed. Scores from use of the logit transfor-mation are often used as a starting point.
Compare this with the use of regression adjustments of the form where in the simplest situation it might be hoped that f (x1, x2, . . . , xk) = a1x1 + +a2x2 + . . . + +akxk This requires the stronger condition that treatment assignment should be ignorable given the observedcovariates x. i.e., treatment assignment should be unrelated to potential outcomes within strata definedby x.
The propensity score approach reduces the regression equation that is of primary interest to a simple form. Decisions on which variables and interactions to include, and on transformation and/ormodeling using spline terms where this seems required, is relegated to the earlier discriminant functioncalculations. Diagnostics for the model for yi need be studied for one covariate only.
Scores calculated using a linear discriminant In Maindonald & Braun (2007, pp.412-419), we discuss the use, as propensity scores, of functionsf (x1, x2, . . . , xk) that are linear in covariates x1, x2, . . . , xk. This may be too restrictive. Even afteruse of spline terms (how many d.f.? what interactions, if any, should be included?) the model may beunable to capture well the nuances of the regression dependence in cases where there are more than oneor two explanatory variables.
# Mild preliminary filtering on# variables educ, re74 and re75 ## Calculate propensity scores: data frame nsw74psidA (DAAG)disc.glm <- glm(formula = trt ~ age + educ + black + hisp + marr + re74 + re75, family = binomial,data = nsw74psid1, subset=common) Pscores <- predict(disc.glm)## Now filter further, based on values of Pscoresxchop <- with(subset(nsw74psid1, common), overlapDensity(Pscores[trt==0], Pscores[trt==1], compare.numbers=FALSE,ratio=c(1/30, 30))) overlap <- commonoverlap[common] <- Pscores > xchop & Pscores < xchopnsw74psidC <- subset(nsw74psid1, overlap)Pscores <- Pscores[Pscores > xchop & Pscores < xchop] 1The ignorability assumption seems to me implausible for the present data.
SCORES CALCULATED FROM RANDOMFOREST ANALYSIS The hope is that, conditional on values of Pscores, controls and treated are now relatively similar withrespect to the various covariates. This can be checked directly, by splitting the data set up in to, e.g.,5 parts, based on values of Pscores.
cut5 <- cut(Pscores, breaks=5)for (cutlev in levels(cut5)){print(cutlev)nsw74 <- subset(nsw74psidC, cutlev==cut5)print(sapply(nsw74[, c("black","hisp","marr","nodeg")], function(x){tab <- table(nsw74[,"trt"], x) The balancing is, in most cases, reasonable. There is however a big disparity in the numbers of hispanicsin one of the categories, and none of the treated group in the final category were married. Similarcomparisons can be done for the continuous variables.
nsw.lm <- lm(log(re78+25) ~ trt + propensity, data=nsw74psidC)nsw.glm <- glm(I(re78>0) ~ trt + propensity, Scores Calculated from randomForest Analysis This is at the other extreme, relative to linear models. These models builds in as little structure aspossible. There is no insistence on continuous forms of dependence on continuous variables. Remarkablyas it seems to me, these models can, in some classification problems, do very well. It may be that theloss from ignoring of obvious structure is more than compensated by the ability to handle complexinteractions and non-linear responses.
The following is exploratory. As a technique for revealing structure in data, it can work well. Its use for deriving propensity scores requires futher exploration and theoretical elucidation.
Here, we use the larger data set in which not all observations have information on re75. We therefore omit this variable from consideration.
nsw <- rbind(psid1, nswdemo[nswdemo\$trt==1,])nsw\$trt <- factor(nsw\$trt)nswx <- nsw[, c(1:7,9)]nsw.rf <- randomForest(trt ~ ., data=nswx, proximity=TRUE)distmat <- 1-nsw.rf\$proximitydistmat[distmat==0] <- 0.001 ## Apply arcsine transformation to stretch the scale out at both endsdistmat <- asin(distmat)## Start with classical multi-dimensional scaling (Euclidean distances)nsw.cmd <- cmdscale(distmat)plot(nsw.cmd, col=unclass(nsw\$trt))## Apply Sammon (semi-metric) scalingnsw.sam <- sammon(distmat, nsw.cmd)plot(nsw.sam\$points, col=unclass(nsw\$trt)) The plot suggests that the two groups are not well matched. There may however be subgroups that overlap substantially. One might try to separate out those regions of the plot where both controls and(especially, as there are many fewer of these) treatment observations are reasonably well represented.
A more direct approach is to use the predict method for randomForest objects to return a matrix ofclass probabilities, with one row for each data point.
## Take the second column; the first would do equally well.
prob <- predict(nsw.rf, type="prob")[,2]prob[prob==0] <- 0.5*min(prob[prob>0])prob[prob==1] <- 1-0.5*min(1-prob[prob<1])scores <- log(prob/(1-prob)) Now find the range of values of scores where the ratios of the densities are in the range (0.025, 40), and try the regressions with attention limited to data where the scores are in that range: z <- with(nsw, overlapDensity(ratio=c(.025,40), retain <- scores>z & scores<znsw.lm <- lm(log(re78+25) ~ trt + scores, data=nsw, subset=retain)termplot(nsw.lm, par=T, smooth=panel.smooth)nsw.glm <- glm(I(re78>0) ~ trt + scores, family=binomial, data=nsw, subset=retain) The results depend on the chosen range of values of the scores.
reproduce the results from the experimental comparison; in fact they suggest a negative effect fromtraining. The estimates from glm() do favour the treatment group, providing the range of ratios is setsmall enough. The difference does not, however, reach the 5% level of statistical significance.
We measure, not x, but w = x + u, where u is “measurement error”.
Now assume that w is unbiased for x and that u is independent of x and .
Then, conditional on x, instead of This happens because y is independent of u.
Then β∗ is a consistent estimate of λβ, where Use the function g6.17, included in the image file figs6.RData that is available from:http://www.maths.anu.edu.au/~johnm/r-book/2edn/figures/ The function errorsINxGP(), available from http://www.maths.anu.edu.au/~johnm/r/functions/,simulates the effect when the variables that are measured without error code for a categorical effect.
In addition to the notes in Maindonald & Braun (2007, p.186), note the following: • Consider data where one variable, or a small number of variables jointly, have effects that, in the preferred model, are large relative to statistical error, while other variables have effects that, atbest, are marginally detectable. Then classical selection techniques (stepwise regression, etc.) arelikely to find those variables that have large effects, and their coefficients will be estimated withoutselection bias.
• In contexts where automatic selection techniques are tested more severely, they may not do much • To see the potential, with automatic selection algorithms, to get highly significant effects from random data, run the function bestset.noise(), from the DAAG package.
• There may be no unique “best” set of explanatory variables.
• The paper by Zhu and Chipman (2006) is interesting. The key here seems to be the incorporation of a random element. I suspect that a bootstrap approach, used in a similar way as in the randomforests algorithm, would do as well or better.
• The selection problem is fraught with further hazards when one or more of the variables is measured Attempts to interpret regression coefficients raise further hazards. Conditions that may make co- efficients interpretable include: a) It is possible to identify a few variables that have large effects; b)the data allow their contribution to the regression to be estimated accurately; c) there is good reasonto believe that no variables or interactions with substantial effects have been left out; d) there is acontext of scientific understanding that supports any interpretations that are proposed. Note furtherthe Bradford Hill criteria, in Hill (1965).
Farmer, C.H. 2006. Another look at Meyer and Finney’s ‘Who wants airbags?’. Chance 19:15-22.
Hill, A.B. 1965. The environment and disease. Association or causation? Maindonald, J. H. and Braun, W.J. 2007. Data Analysis and Graphics Using R – An Example-Based Approach. 2nd edition, Cambridge University Press.
URL:http://wwwmaths.anu.edu.au/~johnm/r-book.html Rosenbaum, P. and Rubin, D., 1983. The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55.
Rosenbaum, P. R., 1999. Choice as an alternative to control in observational studies. Statistical Science, 14:259–278. With following discussion, pp.279–304.
Rosenbaum, P. R., 2002. Observational Studies. Springer-Verlag, 2 edition.
Zhu, M. and Chipman, H.A. 2006 Darwinian Evolution in Parallel Universes: A Parallel Genetic Algorithm for Variable Selection. Technometrics 48, 491–502.