Description

This question is adapted from Exercise 1 on page 46 of Faraway. We will analyze the wbca data
from the faraway package. After loading the faraway library, type ?wbca to see descriptions of
the study, the response variable and covariates, and the goals of the analysis.

(a) First, we will examine the associations between the individual covariates and the response
(Class, where 0 is a malignant tumor and 1 is benign). From the perspective of interpretation,
it makes more sense to model the probability of a malignant tumor rather than the probability
of a benign tumor, so create a new response variable called y where y=1 if malignant and y=0
if benign. For each covariate, plot the sample proportions p̂ vs. the covariate values (there
should be 9 plots, since there are 9 covariates). Does the probability of a malignant tumor
appear to be associated with any of the covariates? If so, for which covariates does the
association appear strongest? [Hint: When looking at the univariate associations between the
response and individual covariates, there are only 10 unique covariate classes, because each
covariate can only take integer values between 1 and 10. Thus, we can summarize the binary
response as counts of 1 vs. 0 for each unique value of the covariate, and use the counts to
calculate the p̂'s. There are many ways to do this, but table(y,covariate) is perhaps the
easiest object to work with. Note that logistic regression models with multiple covariates
will have more and more unique covariate classes as the number of covariates increases, so it
makes less sense to calculate and plot the p̂'s, since most will be 0 or 1.]
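
The hint above can be sketched in R roughly as follows. This is a minimal sketch, assuming the
faraway package is installed; the helper name phat_by_value is hypothetical and is introduced
only for illustration.

```r
library(faraway)   # provides the wbca data frame

data(wbca)
wbca$y <- 1 - wbca$Class   # y = 1 if malignant, y = 0 if benign

# Every column other than the two response codings is treated as a covariate
covariates <- setdiff(names(wbca), c("Class", "y"))

# Hypothetical helper: proportion of y = 1 at each observed covariate value
phat_by_value <- function(x, y) {
  counts <- table(y, x)              # rows: y = 0 and 1; columns: covariate values
  counts["1", ] / colSums(counts)
}

op <- par(mfrow = c(3, 3))           # one panel per covariate
for (v in covariates) {
  phat <- phat_by_value(wbca[[v]], wbca$y)
  plot(as.numeric(names(phat)), phat, ylim = c(0, 1),
       xlab = v, ylab = "Proportion malignant")
}
par(op)
```

Fixing the vertical axis at 0 to 1 in every panel makes the nine plots directly comparable.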

(b) Fit 9 different logistic regression models: one model describing the relationship between
the response and each individual covariate in the dataset. Use the test statistic or P-value
corresponding to each covariate (e.g., from the Wald tests shown in the summary of the fitted
model) to order the covariates from least to most strongly associated with the response (where
a larger test statistic or smaller P-value indicates stronger association). Which three
covariates are most strongly associated with the probability of a tumor being malignant? Is
the ordering by the strength of the association consistent with what you determined by
plotting the p̂'s in the previous part?
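
One possible way to collect the nine Wald tests in a single table is sketched below. It
assumes the faraway package is installed and rebuilds y as in part (a); the object name wald
is an illustrative choice, not something prescribed by the exercise.

```r
library(faraway)

data(wbca)
wbca$y <- 1 - wbca$Class   # y = 1 if malignant, y = 0 if benign
covariates <- setdiff(names(wbca), c("Class", "y"))

# Fit one single-covariate logistic regression per covariate and keep
# the slope estimate, its Wald z statistic, and the matching P-value
wald <- t(sapply(covariates, function(v) {
  fit <- glm(reformulate(v, response = "y"), family = binomial, data = wbca)
  coef(summary(fit))[v, c("Estimate", "z value", "Pr(>|z|)")]
}))

# Order from least to most strongly associated (smallest to largest |z|)
wald[order(abs(wald[, "z value"])), ]
```

Sorting on the absolute z statistic rather than the P-value avoids ties when several P-values
are numerically indistinguishable from zero.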

(c) Make a scatterplot matrix of all of the covariates. Are any of the covariates correlated
among themselves? If we were to fit a logistic regression model with multiple covariates and
compare to the models with a single covariate, how might correlation among the covariates
affect the results? For example, would the covariate most strongly associated with the
response on its own be most strongly associated with the response in a model that includes all
9 covariates? Is it possible that a covariate strongly associated with the response on its own
has little to no association with the response when other covariates are included in the
model? If the relationship between a covariate and the response changes depending on whether
and which other covariates are included in the model, how do you explain that phenomenon?
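
For the first step of this part, a short sketch (assuming the faraway package is installed)
draws the scatterplot matrix and prints the pairwise correlations. The correlation matrix is
an addition for convenience and is not asked for by the exercise.

```r
library(faraway)

data(wbca)
covariates <- setdiff(names(wbca), "Class")

# Scatterplot matrix of the 9 covariates
pairs(wbca[, covariates])

# Pairwise Pearson correlations, rounded for readability
cor_mat <- round(cor(wbca[, covariates]), 2)
cor_mat
```

Because every covariate takes integer values from 1 to 10, many points overlap in each panel,
so the numeric correlations can be easier to read than the plots.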

(d) Fit a logistic regression model that includes all 9 covariates. Use the step function to
select the best model according to AIC (using the default algorithm given by
direction="both"). Use the summary function to display the fitted model. Which covariates are
included in that model? Are the strength and nature of the associations with the response
consistent with the individual associations determined in part (b)?
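
A minimal sketch of this step, assuming the faraway package is installed and y is defined as
in part (a). The object names full and aic_fit are illustrative.

```r
library(faraway)

data(wbca)
wbca$y <- 1 - wbca$Class   # y = 1 if malignant, y = 0 if benign
covariates <- setdiff(names(wbca), c("Class", "y"))

# Full model with all 9 covariates (Class is left out of the model data)
full <- glm(y ~ ., family = binomial, data = wbca[, c("y", covariates)])

# Stepwise selection by AIC; k = 2 is the default penalty per parameter.
# trace = 0 suppresses the step-by-step printout.
aic_fit <- step(full, direction = "both", trace = 0)

summary(aic_fit)
```

Dropping trace = 0 prints the AIC of each candidate model at every step, which can help when
explaining why a covariate was removed.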

(e) Repeat the previous part, but use BIC to select the model. Do AIC and BIC agree on what is
the best model? If not, describe how the selected models differ. Mathematically, why might BIC
select a different model than AIC? [Hint: To use BIC instead of AIC, type ?step and read what
the k argument does. The documentation should also give you a hint as to why AIC and BIC
sometimes select different models, but you may want to google and read some external sources
of information, since the information in R isn't very detailed and Faraway discusses AIC, but
not BIC (or at least, not until much later in the book).]
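
As a sketch of the hint: step scores each candidate model by -2 log L + k * p, where p is the
number of estimated parameters. The default k = 2 gives AIC, and k = log(n) gives BIC, with n
the number of observations. The code below assumes the faraway package is installed; the
object names full and bic_fit are illustrative.

```r
library(faraway)

data(wbca)
wbca$y <- 1 - wbca$Class   # y = 1 if malignant, y = 0 if benign
covariates <- setdiff(names(wbca), c("Class", "y"))

full <- glm(y ~ ., family = binomial, data = wbca[, c("y", covariates)])

n <- nrow(wbca)

# Same stepwise search as in part (d), but with the BIC penalty k = log(n)
bic_fit <- step(full, direction = "both", k = log(n), trace = 0)

summary(bic_fit)

# Penalty per parameter under each criterion
c(AIC = 2, BIC = log(n))
```

The last line prints the two penalties side by side, which is the quantity to compare when
explaining any difference between the selected models.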