Description
This question is adapted from Exercise 1 on page 46 of Faraway. We will analyze the wbca data
from the faraway package. After loading the faraway library, type ?wbca to see descriptions of
the study, the response variable and covariates, and the goals of the analysis.
(a) First, we will examine the associations between the individual covariates and the response
(Class, where 0 is a malignant tumor and 1 is benign). From the perspective of interpretation,
it makes more sense to model the probability of a malignant tumor rather than the
probability of a benign tumor, so create a new response variable called y where y=1 if malignant
and y=0 if benign. For each covariate, plot the sample proportions p̂ vs. the covariate
values (there should be 9 plots, since there are 9 covariates). Does the probability of a malignant
tumor appear to be associated with any of the covariates? If so, for which covariates
does the association appear strongest? [Hint: When looking at the univariate associations
between the response and individual covariates, there are only 10 unique covariate classes,
because each covariate can only take integer values between 1 and 10. Thus, we can summarize
the binary response as counts of 1 vs. 0 for each unique value of the covariate, and use
the counts to calculate the p̂s. There are many ways to do this, but table(y,covariate) is
perhaps the easiest object to work with. Note that logistic regression models with multiple
covariates will have more and more unique covariate classes as the number of covariates
increases, so it makes less sense to calculate and plot the p̂s, since most will be 0 or 1.]
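The hint above can be sketched in R as follows. This is only a minimal illustration on simulated data, not the wbca analysis itself: the covariate x, the response y, and the coefficients used to generate them are made up for the example.

```r
# Simulated stand-in for one covariate (integers 1..10) and a binary response.
# The names x and y and the generating coefficients are hypothetical.
set.seed(1)
n <- 500
x <- sample(1:10, n, replace = TRUE)
y <- rbinom(n, size = 1, prob = plogis(-3 + 0.6 * x))

# Rows of the table are the y values (0, 1); columns are the covariate values.
tab <- table(y, x)

# Sample proportion of y = 1 within each covariate class.
x_vals <- as.numeric(colnames(tab))
p_hat <- as.numeric(tab["1", ] / colSums(tab))

# Plot the proportions against the covariate values.
plot(x_vals, p_hat, ylim = c(0, 1),
     xlab = "x", ylab = "sample proportion of y = 1")
```

The same pattern could be repeated for each covariate, for example inside a loop over column names.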
(b) Fit 9 different logistic regression models, one model describing the relationship between
the response and each individual covariate in the dataset. Use the test statistic or P-value
corresponding to each covariate (e.g., from the Wald tests shown in the summary of the
fitted model) to order the covariates from least to most strongly associated with the response
(where a larger absolute test statistic or smaller P-value indicates stronger association). Which three
covariates are most strongly associated with the probability of a tumor being malignant?
Is the ordering by the strength of the association consistent with what you determined by
plotting the p̂s in the previous part?
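One possible way to collect the Wald statistics from several single-covariate fits is sketched below. It uses a simulated data frame; the column names x1, x2, x3 and the generating coefficients are hypothetical and stand in for the real covariates.

```r
# Simulated data frame; all names and coefficients are hypothetical.
set.seed(2)
n <- 400
dat <- data.frame(x1 = sample(1:10, n, replace = TRUE),
                  x2 = sample(1:10, n, replace = TRUE),
                  x3 = sample(1:10, n, replace = TRUE))
dat$y <- rbinom(n, 1, plogis(-4 + 0.7 * dat$x1 + 0.2 * dat$x2))

covariates <- setdiff(names(dat), "y")

# Fit one single-covariate logistic regression per covariate and keep the
# Wald z statistic and P-value from the coefficient table of summary().
wald <- t(sapply(covariates, function(v) {
  fit <- glm(reformulate(v, response = "y"), family = binomial, data = dat)
  summary(fit)$coefficients[v, c("z value", "Pr(>|z|)")]
}))

# Order from least to most strongly associated (smallest to largest |z|).
wald[order(abs(wald[, "z value"])), ]
```

Ordering by the absolute z statistic avoids the problem that very small P-values can be hard to distinguish once they are rounded or underflow.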
(c) Make a scatterplot matrix of all of the covariates. Are any of the covariates correlated among
themselves? If we were to fit a logistic regression model with multiple covariates and compare
to the models with a single covariate, how might correlation among the covariates affect the
results? For example, would the covariate most strongly associated with the response on its
own be most strongly associated with the response in a model that includes all 9 covariates?
Is it possible that a covariate strongly associated with the response on its own has little to
no association with the response when other covariates are included in the model? If the
relationship between a covariate and the response changes depending on whether and which
other covariates are included in the model, how do you explain that phenomenon?
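To make the question concrete, here is a small simulated sketch (not the wbca data) in which the response depends only on x1, while x2 is merely correlated with x1. The variable names, sample size, and coefficients are assumptions chosen for illustration.

```r
# Simulated example: y depends on x1 only; x2 is a noisy copy of x1.
# All names and numbers here are hypothetical.
set.seed(3)
n <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)
y <- rbinom(n, 1, plogis(1.5 * x1))
dat <- data.frame(y, x1, x2)

# Sample correlation between the two covariates.
r12 <- cor(dat$x1, dat$x2)

# Wald z for x2 when it is the only covariate in the model...
z_alone <- summary(glm(y ~ x2, family = binomial, data = dat))$coefficients["x2", "z value"]

# ...and when x1 is also included.
z_joint <- summary(glm(y ~ x1 + x2, family = binomial, data = dat))$coefficients["x2", "z value"]

c(correlation = r12, z_alone = z_alone, z_joint = z_joint)
```

Comparing z_alone with z_joint shows how the apparent strength of one covariate can depend on which other covariates are in the model.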
(d) Fit a logistic regression model that includes all 9 covariates. Use the step function to select
the best model according to AIC (using the default algorithm given by direction="both").
Use the summary function to display the fitted model. Which covariates are included in that
model? Are the strength and nature of the associations with the response consistent with the
individual associations determined in part (b)?
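The mechanics of calling step on a fitted glm can be sketched as below, again on simulated data with hypothetical column names and coefficients. The argument trace = 0 only suppresses the printed search path and is optional.

```r
# Simulated data frame; names and coefficients are hypothetical.
set.seed(4)
n <- 400
dat <- data.frame(x1 = sample(1:10, n, replace = TRUE),
                  x2 = sample(1:10, n, replace = TRUE),
                  x3 = sample(1:10, n, replace = TRUE))
dat$y <- rbinom(n, 1, plogis(-4 + 0.7 * dat$x1 + 0.2 * dat$x2))

# Full model with every covariate in the data frame.
full <- glm(y ~ ., family = binomial, data = dat)

# Stepwise search starting from the full model, using the default penalty (AIC).
sel_aic <- step(full, direction = "both", trace = 0)

summary(sel_aic)
```

The object returned by step is itself a fitted glm, so summary, coef, and formula can be applied to it directly.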
(e) Repeat the previous part, but use BIC to select the model. Do AIC and BIC agree on what
is the best model? If not, describe how the selected models differ. Mathematically, why
might BIC select a different model than AIC? [Hint: To use BIC instead of AIC, type ?step
and read what the k argument does. The documentation should also give you a hint as
to why AIC and BIC sometimes select different models, but you may want to google and
read some external sources of information, since the information in R isn't very detailed and
Faraway discusses AIC, but not BIC (or at least, not until much later in the book).]
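For reference when reading ?step, the two criteria are commonly written as follows. The notation is introduced here and is not taken from Faraway: \hat{L} is the maximized likelihood, p the number of estimated parameters, and n the number of observations.

```latex
\mathrm{AIC} = -2\log\hat{L} + 2p,
\qquad
\mathrm{BIC} = -2\log\hat{L} + p\,\log n
```

Both have the form -2 log L + k p, so the criteria differ only in the penalty multiplier k applied to the number of parameters.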