We learned previously that there are 4 things we need to know about any statistical test:
Let’s talk about these 4 needs in relation to our next test: regression.
Like all the tests we’ve seen so far, a regression usually takes a single numeric dependent variable.
BUT there are exceptions: multivariate regression can have multiple dependent variables. I don’t see this much in my discipline, but perhaps you will in yours. Once you know how to do a regular regression, multivariate regression isn’t too hard.
Most of the time, a regression is done on non-experimental data. For this reason, we usually use the word ‘predictor’ to describe the predictive variables in our regression models, rather than ‘independent variable’. This isn’t an absolute rule, but rather a strong trend.
Regressions can take both categorical variables (i.e. factors) and continuous variables (i.e. numbers) as predictors. For categorical predictors, there is no limit on the number of levels (except the limits that you should impose for your own sanity).
How many predictors can we include? As many as you want/need. Just be cautious - more variables = more potential problems AND more interpretation headaches for you.
Regression uses the t statistic. Since we already learned about this when we talked about t-tests, you have a conceptual head start here.
BUT the t statistic is calculated differently in regression than it is in the t-test. There are 4 steps:
Regression is often called linear regression. The reason for this is that we are trying to find a line that best fits our data.
Think of it this way:
Each variable is one dimension. If we want to summarize a single variable (one-dimensional data), such as an exam score, we can do so with a single point - the mean. If we want to summarize the relationship between two variables (two-dimensional data), we can’t use a point, we have to move up a dimension and use a line. And this is what regression does - it tests the strength of a relationship between two variables.
The mean is a number that sits as close as possible to all of the data points; if we subtract the mean from each data point, then add up those differences, we end up with 0:
| Number | Mean | Difference |
|---|---|---|
| 2 | 5.1 | -3.1 |
| 4 | 5.1 | -1.1 |
| 6 | 5.1 | 0.9 |
| 7 | 5.1 | 1.9 |
| 7 | 5.1 | 1.9 |
| 5 | 5.1 | -0.1 |
| 3 | 5.1 | -2.1 |
| 1 | 5.1 | -4.1 |
| 7 | 5.1 | 1.9 |
| 9 | 5.1 | 3.9 |
| Sum of Differences | | 0 |
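If you want to check this yourself in R, here is a quick sketch using the numbers from the table:

x <- c(2, 4, 6, 7, 7, 5, 3, 1, 7, 9) # the numbers from the table above
mean(x) # 5.1
sum(x - mean(x)) # 0 (give or take a tiny bit of floating-point error)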
Another way to think about the mean is as the point that minimizes the summed squared distances to the data points. In other words:
xdata <- data.frame()
for (i in (-20:20)/10) { # try reference values from 2 below the mean to 2 above it
  y <- c(i, sum((x - (mean(x) + i))^2)) # x is the vector of numbers defined above
  xdata <- rbind(xdata, y)
}
colnames(xdata) <- c("MeanDifference", "SumofSquares")
ggplot(data = xdata, aes(x = MeanDifference, y = SumofSquares)) +
  geom_line() +
  geom_point(size = 3, alpha = 1, color = "blue4", fill = "cornflowerblue", shape = 21) +
  theme_bw() +
  labs(
    x = "Difference Between Reference Value and Mean Value",
    y = "Sum of Squared Deviations from the Reference Value"
  )
So that’s how the mean works. As mentioned, regression operates on the same principle, but in 2 dimensions instead of one. We want to find the line that is as close as possible to the data points, like so:
mod <- lm(price ~ carat, data = filter(diamonds, color == "E" & cut == "Fair"))
ggplot(data = filter(diamonds, color == "E" & cut == "Fair"), aes(x = carat, y = price)) +
  geom_segment(aes(x = carat, y = price, xend = carat, yend = fitted(mod)), color = "gray50") +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE, color = "black") +
  geom_point(size = 3, alpha = 1, color = "blue4", fill = "cornflowerblue", shape = 21) +
  labs(
    x = "Carats (weight of the diamond)",
    y = "Price (in $)"
  ) +
  theme_bw()
This line represents our estimate of the relationship between the predictor (carats) and the dependent variable (price).
The blue circles represent the actual data points.
The vertical gray lines represent the ‘residuals’, the distance of the data points from the line. These residuals represent the unexplained variance left in the data after we apply our line. Our goal is to find the line that makes these residuals (or rather, the sum of the squares of these residuals) as small as possible.
To see regression in action, copy this code chunk and run it! Notice how the residuals get bigger as the black line moves away from the best-fitting slope (the red line).
if (!require("shiny")) install.packages("shiny")
library(shiny)
filtered_diamonds <- filter(diamonds, color == "E" & cut == "Fair")
filtered_diamonds %>%
  summarize(slope =
              sum((carat - mean(carat)) * (price - mean(price))) /
              sum((carat - mean(carat))^2),
            intercept =
              mean(price) - slope * mean(carat)
  ) -> diamond_summary
shinyApp(
  ui = fluidPage(
    fluidRow(
      column(12, wellPanel(
        sliderInput("slope", label = h3("Slope"), min = -150000,
                    max = 150000, value = 7348.326)
      ))
    ),
    fluidRow(
      column(12, offset = 0,
             plotOutput("plot")
      )
    )
  ),
  server = function(input, output) {
    output$plot = renderPlot({
      slope <- input$slope
      intercept <- mean(filtered_diamonds$price) - slope * mean(filtered_diamonds$carat)
      yend <- intercept + slope * (filtered_diamonds$carat)
      SS <- sum((filtered_diamonds$price - (intercept + slope * filtered_diamonds$carat))^2)
      ggplot(data = filtered_diamonds, aes(x = carat, y = price)) +
        geom_segment(aes(x = carat, y = price, xend = carat, yend = yend), color = "gray50") +
        geom_abline(slope = slope, intercept = intercept, color = "black") +
        geom_smooth(method = lm, formula = y ~ x, se = FALSE, color = "red") +
        geom_point(size = 3, alpha = 1, color = "blue4", fill = "cornflowerblue", shape = 21) +
        labs(
          x = "Carats (weight of the diamond)",
          y = "Price (in $)"
        ) +
        theme_bw() +
        theme(text = element_text(size = 30)) +
        coord_cartesian(ylim = c(0, 16000)) +
        annotate(geom = "text", x = 0.5, y = 15000,
                 label = paste("SS Residual:", formatC(SS, format = "e", digits = 2)), size = 15)
    })
  },
  options = list(height = 900)
)
But how do we get this line?
Remember, a line has 2 properties:
If you are interested, you can compute the slope as follows (using the diamonds data set):
diamonds %>%
summarize(slope =
sum((carat - mean(carat)) * (price - mean(price))) /
sum((carat - mean(carat))^2))
It’s a little more complicated than calculating the mean, but it’s the same idea, just in 2 dimensions.
Once we have the slope, we just need to calculate where the line would cross the Y axis (i.e. what Y will be when X is 0):
diamonds %>%
summarize(slope = # Compute the slope - we need it to compute the intercept
sum((carat - mean(carat)) * (price - mean(price))) /
sum((carat - mean(carat))^2),
intercept = # Compute the intercept
mean(price) - slope * mean(carat))
This gives us the following numbers:
| slope | intercept |
|---|---|
| 7756.426 | -2256.361 |
Some facts about the slope:
Some facts about the intercept:
The standard error of the regression coefficient is computed by:
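standard error of the slope = sqrt( (sum of squared residuals) / (N - 2) ) / sqrt( sum of squared deviations of the predictor from its mean )

This is exactly the calculation the R code below performs, piece by piece.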
So, the less accurate our predictions are, the larger the standard error will be. Not really a surprise.
Also, the larger the standard deviation of our predictor variable is, the smaller our standard error will be. In other words, more variability in our predictor variable is a good thing - it increases our confidence that we’ve found the best-fitting line.
diamonds %>%
summarize(slope = sum((carat - mean(carat)) * (price - mean(price))) / sum((carat - mean(carat))^2),
intercept = mean(price) - slope * mean(carat),
se_slope = sqrt(
sum(
(price - (intercept + slope * carat))^2
)/
(n() - 2)) / sqrt(sum((carat - mean(carat))^2)))
## # A tibble: 1 × 3
## slope intercept se_slope
## <dbl> <dbl> <dbl>
## 1 7756. -2256. 14.1
Now we compute the t statistic.
Slope / Standard Error = t
diamonds %>%
summarize(slope = sum((carat - mean(carat)) * (price - mean(price))) / sum((carat - mean(carat))^2),
intercept = mean(price) - slope * mean(carat),
se_slope = sqrt(
sum(
(price - (intercept + slope * carat))^2
)/
(n() - 2)) / sqrt(sum((carat - mean(carat))^2)),
t = slope / se_slope)
## # A tibble: 1 × 4
## slope intercept se_slope t
## <dbl> <dbl> <dbl> <dbl>
## 1 7756. -2256. 14.1 551.
That’s a pretty big t value!
Let’s consider what influences our t score:
The Degrees of Freedom for the regression model = N (the number of observations) - (the number of regression coefficients) - 1
Since we only have one regression coefficient (one slope), this is equal to:
diamonds %>%
summarize(slope = sum((carat - mean(carat)) * (price - mean(price))) / sum((carat - mean(carat))^2),
intercept = mean(price) - slope * mean(carat),
se_slope = sqrt(
sum(
(price - (intercept + slope * carat))^2
)/
(n() - 2)) / sqrt(sum((carat - mean(carat))^2)),
t = slope / se_slope,
df = n() - 1 - 1)
## # A tibble: 1 × 5
## slope intercept se_slope t df
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7756. -2256. 14.1 551. 53938
Imagine that I have the following regression model:
fakemodel <- lm(weight ~ height + shoesize, data = fakeweightdata)
I have data from 120 people.
Since we have a t statistic, we will be using the t distribution here. Everything you learned about this distribution for the t test applies here.
So how do we set up our regression? You’ll be happy to learn that it is very similar to an ANOVA. But instead of aov(), we use lm(). lm stands for linear, I guess. (Just kidding. It stands for linear model.)
simple_regression <- lm(price ~ carat, data = diamonds)
summary(simple_regression)
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <0.0000000000000002 ***
## carat 7756.43 14.07 551.4 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 0.00000000000000022
Scroll up a bit and look at the slope, intercept, etc. that we computed. Then look at the R regression output. Did we do our math right? What does R call the slope, intercept, standard error, etc.?
The most important part of this output is the Coefficients: section.
Down below the coefficients is some further information.
In order to do the regressions that follow, you will need to install and load the Stat2Data package. Add this code to your setup chunk.
if (!require("Stat2Data")) install.packages("Stat2Data")
library(Stat2Data) # Add these lines to your setup chunk!
Consider the following regression output as you answer the next few questions:
##
## Call:
## lm(formula = sales ~ budget, data = marketing %>% mutate(budget = youtube +
## facebook + newspaper))
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6655 -1.5685 0.1408 1.9153 8.6274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.091634 0.526230 9.676 <0.0000000000000002 ***
## budget 0.048688 0.001982 24.564 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.12 on 198 degrees of freedom
## Multiple R-squared: 0.7529, Adjusted R-squared: 0.7517
## F-statistic: 603.4 on 1 and 198 DF, p-value: < 0.00000000000000022
When we include a continuous variable in our model, the intercept is the predicted value of the dependent variable when that continuous variable equals 0. Sometimes this makes sense, because 0 is a real part of that variable:
In these cases, the 0 is actually interpretable, and there are likely 0 values as part of the data.
But in many cases, there isn’t a single 0 in our predictor, and the 0 value is meaningless. Consider our diamond example, above. What is a 0 carat diamond? It’s nothing - there’s no such thing. Try proposing with a 0 carat diamond.
So what should we do in such cases? We should center our predictor. We subtract the mean of that predictor from each of its values. This shifts the values so that the mean becomes 0, data points below the mean become negative, and data points above the mean become positive.
What does centering do?
How do you center a variable? There are two ways:
1. Create a new, centered variable: data %>% mutate(newvariable = variable - mean(variable))
2. Center right inside the model formula: lm(dv ~ scale(variable, scale = FALSE), data = data)
Let’s redo our diamond regression, this time centering the variable.
simple_regression_centered <- lm(price ~ scale(carat, scale = FALSE), data = diamonds)
summary(simple_regression)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.361 13.05535 -172.8304 0
## carat 7756.426 14.06658 551.4081 0
summary(simple_regression_centered)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3932.800 6.667655 589.8325 0
## scale(carat, scale = FALSE) 7756.426 14.066579 551.4081 0
If we compare the original model to the new one, we see that the slope etc. for carat is the same before and after centering.
BUT the intercept is different.
In the original model, the intercept said: The value of a 0 carat diamond is -2256.
In the centered model, the intercept says: The value of an average diamond is 3932.8. That’s a more useful number.
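You can check this for yourself: when the predictor is centered, the intercept is just the mean of the dependent variable.

mean(diamonds$price) # 3932.8 - the same number as the intercept in the centered model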
So, when your predictor doesn’t have a meaningful 0 value, center it!
Consider the following regression model. The variables are as follows:
data(ChildSpeaks)
Talk.lm <- lm(Gesell ~ scale(Age, scale = FALSE), data = ChildSpeaks)
summary(Talk.lm)
##
## Call:
## lm(formula = Gesell ~ scale(Age, scale = FALSE), data = ChildSpeaks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.604 -8.731 1.396 4.523 30.285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.6667 2.4054 38.940 < 0.0000000000000002 ***
## scale(Age, scale = FALSE) -1.1270 0.3102 -3.633 0.00177 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.02 on 19 degrees of freedom
## Multiple R-squared: 0.41, Adjusted R-squared: 0.3789
## F-statistic: 13.2 on 1 and 19 DF, p-value: 0.001769
But how do we do a linear regression when our predictor is categorical? In other words, how do we find a line that fits a factor with multiple levels?
Consider this simple example, the SocialCog data set from the heplots package. This data set contains data from 3 groups: a control group, a group diagnosed with schizophrenia, and a group diagnosed with schizoaffective disorder. Each group took some tests to assess their ability to manage emotions, among other social abilities.
If we wanted to include this categorical Diagnosis variable in a regression, how would we do it? In other words, how do we draw a line when the predictor is Diagnosis (Levels: Control, Schizophrenia, Schizoaffective)? The different levels aren’t numbers, there is no obvious intercept, and what is the distance between control, schizophrenia, and schizoaffective disorder anyway?
For categorical predictors, R picks one of the levels to be the intercept (AKA the baseline or comparison group). By default, R takes the first level alphabetically, but this can be changed. Then a separate ‘slope’ is computed from this baseline to each of the other levels. Each slope represents the difference between that group’s mean and the baseline group’s mean.
| Dx | Mean | Difference from Control |
|---|---|---|
| Control | 45.05 | 0.00 |
| Schizophrenia | 35.84 | -9.21 |
| Schizoaffective | 44.53 | -0.52 |
Compare the data in the table above to the regression coefficients below.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.05 1.55 29.03 0.00
## DxSchizophrenia -9.21 2.47 -3.73 0.00
## DxSchizoaffective -0.51 2.78 -0.18 0.85
If you only have categorical predictors, like we have here, it is unusual to do a regression. Consider doing an ANOVA instead.
If you want to change the ‘baseline’ level of a factor, you can! Here is one way:
SocialCog <- SocialCog %>%
mutate(Dx = factor(Dx), Dx = relevel(Dx, "Control"))
In the code above, we turned Diagnosis into a factor and then used the relevel function to change the baseline to “Control”.
You can use the levels() function to see what the levels of a variable are, like this:
levels(SocialCog$Dx)
## [1] "Control" "Schizophrenia" "Schizoaffective"
The first one listed is the ‘baseline’ condition.
Let’s talk about teen pregnancy. More specifically, let’s find out how the Civil War affected teen pregnancy rates in 2010. Why? I cannot think of a single good reason, except that it will illustrate how to do a regression.
![Don’t dishonor their sacrifice with teen pregnancy!](https://en.wikipedia.org/wiki/American_Civil_War#/media/File:Battle_of_Gettysburg,_by_Currier_and_Ives.png)
Our variables are:
data(TeenPregnancy)
TeenPregnancy <- TeenPregnancy %>%
  mutate(CivilWar = case_when(
    CivilWar == "B" ~ "Border States",
    CivilWar == "C" ~ "Confederate States",
    CivilWar == "U" ~ "Union States",
    CivilWar == "O" ~ "Other States"),
    CivilWar = as.factor(CivilWar)) # recode the abbreviations, then make the result a factor
Teen.lm <- lm(Teen ~ CivilWar, data = TeenPregnancy)
summary(Teen.lm)
##
## Call:
## lm(formula = Teen ~ CivilWar, data = TeenPregnancy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2381 -6.1952 -0.1515 8.1061 24.9333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.667 5.745 10.735 0.0000000000000407 ***
## CivilWarConfederate States 2.970 6.481 0.458 0.6489
## CivilWarOther States -6.600 6.293 -1.049 0.2997
## CivilWarUnion States -13.429 6.141 -2.187 0.0339 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.95 on 46 degrees of freedom
## Multiple R-squared: 0.3167, Adjusted R-squared: 0.2721
## F-statistic: 7.106 on 3 and 46 DF, p-value: 0.0005073
This section has valuable but somewhat advanced information. Read it if you choose - you may need to understand it later. For now, just know that we’ve been talking about dummy coding of categorical variables, which is the default in R.
When we talk about contrasts in regression, we are referring to the way we set up the comparisons between the groups in a categorical variable. The default way of dealing with categorical variables (picking a baseline and comparing the other levels to it) is called dummy coding. But it’s not the only way to code categorical variables in R. We can also use deviation coding, Helmert coding, and many others.
Changing the contrasts changes the intercept as well.
Dummy coding is also sometimes called treatment coding, because it’s the contrast we would use if we have a clear control group and treatment group.
As we discussed, in dummy coding the intercept is the baseline level/group that we select. The slopes represent the difference of the other levels from this one.
If we use the contrasts() function, we can see how dummy coding is set up.
contrasts(SocialCog$Dx)
## Schizophrenia Schizoaffective
## Control 0 0
## Schizophrenia 1 0
## Schizoaffective 0 1
Some things to note:
We can create this contrast matrix ourselves using contr.treatment()
z <- contr.treatment(3)
colnames(z) <- c("Schizophrenia", "Schizoaffective")
rownames(z) <- c("Control", "Schizophrenia", "Schizoaffective")
z
## Schizophrenia Schizoaffective
## Control 0 0
## Schizophrenia 1 0
## Schizoaffective 0 1
In deviation coding, we are comparing each level to the grand mean (the average of all the levels). So, our intercept is NOT some baseline group; it’s the average of all the groups. With deviation coding, we’re asking “does this group deviate significantly from the average?”
When would we want to use deviation coding? When we don’t have a clear control/comparison group. In our SocialCog example, if there was no ‘Control’ level, then we wouldn’t have any clear rationale for choosing a ‘baseline’. So, instead of using dummy coding, which prioritizes one level of our variable over the other(s), we could just say “was one of these conditions a lot different from the other(s)?” That’s the question that deviation coding answers.
In deviation coding, the intercept is the grand mean. The slopes represent the difference of the levels from this grand mean.
We can change our contrasts to Deviation coding using contr.sum(), like this:
contrasts(SocialCog$Dx) <- contr.sum(3)
colnames(contrasts(SocialCog$Dx)) <- c("Control", "Schizophrenia")
contrasts(SocialCog$Dx)
## Control Schizophrenia
## Control 1 0
## Schizophrenia 0 1
## Schizoaffective -1 -1
Notice the row of -1 next to Schizoaffective. Since we can only do a number of contrasts equal to the number of levels -1, we are leaving this group out. Deviation coding will always leave out the last level; if you want to change this, reorder the levels.
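For example, here is one small sketch of how you could make Schizophrenia the omitted level instead (using a copy of the data, SocialCog_reordered, so we don’t disturb the ordering used in the rest of this section):

SocialCog_reordered <- SocialCog %>%
  mutate(Dx = factor(Dx, levels = c("Control", "Schizoaffective", "Schizophrenia")))
contrasts(SocialCog_reordered$Dx) <- contr.sum(3)
contrasts(SocialCog_reordered$Dx) # now the -1, -1 row belongs to Schizophrenia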
## # A tibble: 3 × 4
## Dx means grand_mean deviation
## <fct> <dbl> <dbl> <dbl>
## 1 Control 45.0 41.8 3.24
## 2 Schizophrenia 35.8 41.8 -5.97
## 3 Schizoaffective 44.5 41.8 2.72
In the table above, you can see the means of each group, the grand mean, and the deviation of each group from the grand mean. Now look at the regression output below. Do you see what Deviation coding is doing?
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.81 1.13 37.14 0.00
## DxControl 3.24 1.44 2.25 0.03
## DxSchizophrenia -5.97 1.58 -3.78 0.00
In Helmert Coding, we compare each group to the mean of all the groups that come before it. Why would we want to do this? Consider a study where we are studying heart health, and have the following groups:
In this study, Helmert coding would compare Light Cardio to Medication, to see if cardio was better than drugs. Then the second contrast would compare Light Cardio + Medication to the average of Medication and Light Cardio, to see if the combination of the two is better than the average effect of each one separately.
So maybe you’ll encounter a need for this someday. Maybe.
In the SocialCog data, we could compare Schizoaffective to Schizophrenia, to see if these two groups are different from each other. Then we could compare the mean of both groups to the Control group, to see if the two disordered groups are, considered together, different from the control. Or maybe not.
You can set up the contrasts using contr.helmert(). But you might prefer to create your own custom contrasts, which keeps the coefficients on the original scale so the slopes are still interpretable:
contrasts(SocialCog$Dx) <- contr.helmert(3) # Use built-in function
custom.helmert = matrix(c( # Create custom Helmert contrasts
c(0, -1/2, 1/2),
c(2/3, -1/3, -1/3)
), ncol = 2)
contrasts(SocialCog$Dx) <- custom.helmert
colnames(contrasts(SocialCog$Dx)) <- c("SchizophreniavsSchizoaffective", "BothSchizovsControl")
contrasts(SocialCog$Dx)
## SchizophreniavsSchizoaffective BothSchizovsControl
## Control 0.0 0.6666667
## Schizophrenia -0.5 -0.3333333
## Schizoaffective 0.5 -0.3333333
In Helmert coding, like in Deviation coding, the intercept is the grand mean.
The slope in the ‘SchizophreniavsSchizoaffective’ row compares ‘Schizophrenia’ to ‘Schizoaffective’. The next row compares Control to the mean of both ‘Schizo-’ groups. Using this contrast here doesn’t really make sense, but I present the results for the purposes of illustration:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.81 1.13 37.14 0.00
## DxSchizophreniavsSchizoaffective 8.70 3.00 2.90 0.00
## DxBothSchizovsControl 4.86 2.16 2.25 0.03
In regression, we’re not restricted to just one predictor. We can put in a few of them.
When we do, we connect predictors using ‘+’ or ‘*’, just like in ANOVA.
Let’s try another data set from the heplots package, NeuroCog. This data set is a lot like SocialCog, in that it has 3 groups: Control, Schizophrenia, and Schizoaffective. And it has an overall measure of social cognition. But it also includes multiple cognitive variables, including:
Let’s set up a regression where we see what factors predict Social Cognition in this sample. We’ll start by adding diagnosis and attention.
multiple_regression <- lm(SocialCog ~ Dx + Attention, data = NeuroCog)
summary(multiple_regression)
##
## Call:
## lm(formula = SocialCog ~ Dx + Attention, data = NeuroCog)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.6259 -7.8433 -0.0711 8.0173 28.7274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.02672 2.62865 12.184 < 0.0000000000000002 ***
## DxSchizophrenia -7.51657 1.80995 -4.153 0.00004577197 ***
## DxSchizoaffective -2.08454 2.05708 -1.013 0.312
## Attention 0.35330 0.05641 6.263 0.00000000175 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.04 on 238 degrees of freedom
## Multiple R-squared: 0.2574, Adjusted R-squared: 0.248
## F-statistic: 27.5 on 3 and 238 DF, p-value: 0.000000000000002639
When we put more predictors into our model, we change the equation and so we change the intercept.
Takeaway: As you start adding predictors, the intercept often becomes difficult to interpret.
As we add more predictors to our model, the coefficients for the individual predictors often change.
This is because regression is different from ANOVA.
In ANOVA we test the Main Effects of each factor, checking to see if each has a separate and independent influence on the dependent variable.
In regression, by contrast, we are checking to see if each predictor we add has an influence on the dependent variable after accounting for the influence of all other predictors. So, this test is NOT separate. Instead, if a given predictor is significant, it means that this predictor gives us information about the dependent variable that the other predictors don’t. This can be quite useful, as we’ll see below.
For multiple regressions, once we have put in the predictors we know we want in order to test our hypotheses, it’s time to check our model. We’ll do two things:
This process is called model fitting.
We want to remove a predictor (or an interaction) from our model if:
How can we tell if a given predictor (or interaction) contributes to model fit? There are 2 ways:
Let’s suppose we want to assess whether we should keep the predictor Attention in our model. We would run two regressions - one without Attention and one with it - then compare them using the anova() function. Used in this way, the anova() function performs a likelihood ratio test, comparing the two models to see which leaves less variance unexplained (RSS means residual sum of squares).
NeuroCogRegression <- lm(SocialCog ~ Dx + scale(Verbal, scale = FALSE),
                         data = NeuroCog) # A model without the predictor 'Attention'
NeuroCogRegression2 <- lm(SocialCog ~ Dx + scale(Verbal, scale = FALSE) + scale(Attention, scale = FALSE),
                          data = NeuroCog) # A model with the predictor 'Attention'
anova(NeuroCogRegression, NeuroCogRegression2) # Did adding Attention improve the model fit?
## Analysis of Variance Table
##
## Model 1: SocialCog ~ Dx + scale(Verbal, scale = FALSE)
## Model 2: SocialCog ~ Dx + scale(Verbal, scale = FALSE) + scale(Attention,
## scale = FALSE)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 238 31406
## 2 237 28797 1 2609 21.472 0.000005927 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F statistic is large, and the p value is < .001. So we conclude that the predictor Attention is necessary, and we should include it in the model.
Or we can just do this:
NeuroCogRegressionFull <- lm(SocialCog ~ Dx +
scale(Speed, scale = FALSE) +
scale(Attention, scale = FALSE) +
scale(Memory, scale = FALSE) +
scale(Verbal, scale = FALSE) +
scale(Visual, scale = FALSE) +
scale(ProbSolv, scale = FALSE) +
scale(Age, scale = FALSE) +
Sex,
data = NeuroCog)
# Start with the most complicated model you are willing to accept. Ours includes all possible predictors and co-variates.
NeuroCogRegressionStep <- step(NeuroCogRegressionFull)
## Start: AIC=1164.53
## SocialCog ~ Dx + scale(Speed, scale = FALSE) + scale(Attention,
## scale = FALSE) + scale(Memory, scale = FALSE) + scale(Verbal,
## scale = FALSE) + scale(Visual, scale = FALSE) + scale(ProbSolv,
## scale = FALSE) + scale(Age, scale = FALSE) + Sex
##
## Df Sum of Sq RSS AIC
## - scale(Age, scale = FALSE) 1 1.41 27179 1162.5
## - scale(Visual, scale = FALSE) 1 4.54 27182 1162.6
## - scale(Verbal, scale = FALSE) 1 7.12 27185 1162.6
## - scale(ProbSolv, scale = FALSE) 1 111.36 27289 1163.5
## <none> 27178 1164.5
## - scale(Speed, scale = FALSE) 1 263.11 27441 1164.9
## - Sex 1 621.97 27800 1168.0
## - scale(Memory, scale = FALSE) 1 826.04 28004 1169.8
## - Dx 2 1570.83 28749 1174.1
## - scale(Attention, scale = FALSE) 1 1360.07 28538 1174.3
##
## Step: AIC=1162.55
## SocialCog ~ Dx + scale(Speed, scale = FALSE) + scale(Attention,
## scale = FALSE) + scale(Memory, scale = FALSE) + scale(Verbal,
## scale = FALSE) + scale(Visual, scale = FALSE) + scale(ProbSolv,
## scale = FALSE) + Sex
##
## Df Sum of Sq RSS AIC
## - scale(Visual, scale = FALSE) 1 5.73 27185 1160.6
## - scale(Verbal, scale = FALSE) 1 7.27 27186 1160.6
## - scale(ProbSolv, scale = FALSE) 1 112.96 27292 1161.5
## <none> 27179 1162.5
## - scale(Speed, scale = FALSE) 1 268.89 27448 1162.9
## - Sex 1 645.00 27824 1166.2
## - scale(Memory, scale = FALSE) 1 828.67 28008 1167.8
## - Dx 2 1602.30 28782 1172.4
## - scale(Attention, scale = FALSE) 1 1419.62 28599 1172.9
##
## Step: AIC=1160.6
## SocialCog ~ Dx + scale(Speed, scale = FALSE) + scale(Attention,
## scale = FALSE) + scale(Memory, scale = FALSE) + scale(Verbal,
## scale = FALSE) + scale(ProbSolv, scale = FALSE) + Sex
##
## Df Sum of Sq RSS AIC
## - scale(Verbal, scale = FALSE) 1 4.44 27189 1158.6
## - scale(ProbSolv, scale = FALSE) 1 107.54 27292 1159.5
## <none> 27185 1160.6
## - scale(Speed, scale = FALSE) 1 278.22 27463 1161.1
## - Sex 1 639.33 27824 1164.2
## - scale(Memory, scale = FALSE) 1 828.87 28014 1165.9
## - Dx 2 1604.89 28790 1170.5
## - scale(Attention, scale = FALSE) 1 1415.80 28601 1170.9
##
## Step: AIC=1158.64
## SocialCog ~ Dx + scale(Speed, scale = FALSE) + scale(Attention,
## scale = FALSE) + scale(Memory, scale = FALSE) + scale(ProbSolv,
## scale = FALSE) + Sex
##
## Df Sum of Sq RSS AIC
## - scale(ProbSolv, scale = FALSE) 1 111.17 27300 1157.6
## <none> 27189 1158.6
## - scale(Speed, scale = FALSE) 1 273.91 27463 1159.1
## - Sex 1 683.89 27873 1162.7
## - scale(Memory, scale = FALSE) 1 911.70 28101 1164.6
## - Dx 2 1666.25 28856 1169.0
## - scale(Attention, scale = FALSE) 1 1529.52 28719 1169.9
##
## Step: AIC=1157.63
## SocialCog ~ Dx + scale(Speed, scale = FALSE) + scale(Attention,
## scale = FALSE) + scale(Memory, scale = FALSE) + Sex
##
## Df Sum of Sq RSS AIC
## - scale(Speed, scale = FALSE) 1 199.08 27500 1157.4
## <none> 27300 1157.6
## - Sex 1 639.28 27940 1161.2
## - scale(Memory, scale = FALSE) 1 1195.08 28496 1166.0
## - scale(Attention, scale = FALSE) 1 1486.64 28787 1168.5
## - Dx 2 1795.37 29096 1169.0
##
## Step: AIC=1157.38
## SocialCog ~ Dx + scale(Attention, scale = FALSE) + scale(Memory,
## scale = FALSE) + Sex
##
## Df Sum of Sq RSS AIC
## <none> 27500 1157.4
## - Sex 1 639.47 28139 1161.0
## - scale(Memory, scale = FALSE) 1 998.35 28498 1164.0
## - scale(Attention, scale = FALSE) 1 1292.67 28792 1166.5
## - Dx 2 1605.03 29105 1167.1
# This command will remove predictors and do model comparisons until it finds the simplest, best-fitting model
summary(NeuroCogRegressionStep)
##
## Call:
## lm(formula = SocialCog ~ Dx + scale(Attention, scale = FALSE) +
## scale(Memory, scale = FALSE) + Sex, data = NeuroCog)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.4269 -7.3055 -0.2892 7.4942 26.3815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.90041 1.27267 37.638 < 0.0000000000000002 ***
## DxSchizophrenia -6.58970 1.78777 -3.686 0.000283 ***
## DxSchizoaffective -1.49693 2.02007 -0.741 0.459413
## scale(Attention, scale = FALSE) 0.23058 0.06923 3.331 0.001005 **
## scale(Memory, scale = FALSE) 0.22098 0.07550 2.927 0.003756 **
## SexMale -3.39423 1.44890 -2.343 0.019980 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.79 on 236 degrees of freedom
## Multiple R-squared: 0.2964, Adjusted R-squared: 0.2815
## F-statistic: 19.88 on 5 and 236 DF, p-value: < 0.00000000000000022
The step() function has tested all the predictors and removed those that don’t contribute to the model. The output shows us the final model, as well as the intermediate steps to get there.
Run ?step to see the different arguments this function can take.
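For example, a small sketch using two of step()’s standard arguments: trace = 0 suppresses the long step-by-step printout, and direction controls whether terms can be dropped, added, or both.

NeuroCogRegressionStep <- step(NeuroCogRegressionFull, direction = "both", trace = 0)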
Collinearity is when two or more predictors in the model are strongly correlated with each other. When this happens, including them both can have strange effects on the regression parameters.
Let’s explore this with our flights data. We’ll examine 3 models, each with arrival delay as the dependent variable. The first includes both flight distance and air time as predictors; the other two include each predictor on its own. Distance and air time are highly correlated (r = 0.99).
library(nycflights13)
flights_regression_all <- lm(arr_delay ~ distance + air_time , data = flights)
flights_regression_distance <- lm(arr_delay ~ distance, data = flights)
flights_regression_airtime <- lm(arr_delay ~ air_time , data = flights)
Here are the outputs of the 3 models:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.455936 0.172887 -8.421332 0
## distance -0.087655 0.000761 -115.145312 0
## air_time 0.665263 0.005980 111.256476 0
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.829198 0.135521 79.90781 0
## distance -0.003752 0.000106 -35.46495 0
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.429260 0.147654 63.86058 0
## air_time -0.016816 0.000832 -20.20746 0
First, let’s examine the effect of distance. The coefficient in the combined model is -0.088, but when we take air_time out, the coefficient changes to -0.0038 - more than an order of magnitude smaller!
Things are even worse for air_time: the coefficient in the combined model is 0.67. When we take distance away, this changes - to -0.017! It gets a lot smaller AND changes direction from positive to negative.
This is the danger of collinearity - it can mess up your coefficients so much that you draw completely wrong conclusions. How embarrassed would you be if you tried to report that first model?!
Fortunately, we can check for collinearity, using the vif() function:
if (!require("car")) install.packages("car")
library(car) # Add these two lines to your setup chunk!
vif(flights_regression_all)
## distance air_time
## 53.72509 53.72509
The variance inflation factor for both predictors in this model is above 50; we want to see numbers < 5.
How do we fix collinearity?
In the case of our flights example, the collinearity is so bad that removing either air_time or distance is the best solution.
In other cases, centering your variables will alleviate this problem. Let’s return to our miles per gallon regression. First, we’ll look at the vif without centering:
vif_uncentered <- lm(mpg ~ disp * hp, data = mtcars)
vif(vif_uncentered)
## disp hp disp:hp
## 13.60363 12.31598 37.82176
That’s >5, much higher than we want. This is because adding an interaction term increases the collinearity of the model; we’re telling the model to test how these two variables are related, so we shouldn’t be surprised that it does.
Let’s see what happens when we center our variables:
vif_centered <- lm(mpg ~ scale(disp, scale = FALSE) * scale(hp, scale = FALSE), data = mtcars)
vif(vif_centered)
## scale(disp, scale = FALSE) scale(hp, scale = FALSE)
## 2.672299 2.738923
## scale(disp, scale = FALSE):scale(hp, scale = FALSE)
## 1.084565
And the collinearity is gone!
Collinearity that arises from strongly intercorrelated predictors is best dealt with by removing one of them. Collinearity that arises from the structure of our model (i.e. interactions) is best dealt with by centering variables.
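One way to screen for the first kind of collinearity before you fit anything is to check the correlation between candidate predictors. A quick sketch with the flights data (use = "complete.obs" drops flights with missing air times):

cor(flights$distance, flights$air_time, use = "complete.obs") # ~0.99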
Now we should have our final model, the one we want to report. Before we’re ready to do that, we need to assess our model fit.
There are two measures that are commonly used to assess model fit:
The AIC is a number that represents our prediction error - how much information our model did NOT account for in the data.
So, smaller AIC is better. It means that the model fits the data better.
To get the AIC, use this function: AIC(modelname)
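You can also hand AIC() more than one model at a time to compare them directly. A quick sketch using the two NeuroCog models we compared earlier (lower is better):

AIC(NeuroCogRegression, NeuroCogRegression2) # the model with Attention should come out with the lower AIC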
R2 is the opposite of AIC; it measures how much variance our model DID account for.
So, bigger R2 is better. It means that the model fits the data better.
You can think of R2 as a proportion of the variance explained.
To get R2, just look in the model output.
Use adjusted R2; this accounts for the number of predictors in your model and is a better metric for comparing different models.
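If you want to pull these numbers out of a model programmatically rather than reading them off the summary, a small sketch:

summary(simple_regression)$r.squared # multiple R-squared
summary(simple_regression)$adj.r.squared # adjusted R-squared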
Now that we have finalized our regression model, it’s time to report our results. We need to report information for the model as a whole, as well as for the individual predictors:
Model as a Whole:
For each predictor and interaction, we need to report the following:
So, let’s look at our diamonds regression one more time:
##
## Call:
## lm(formula = price ~ scale(carat, scale = FALSE), data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3932.800 6.668 589.8 <0.0000000000000002 ***
## scale(carat, scale = FALSE) 7756.426 14.067 551.4 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 0.00000000000000022
First, let’s describe the model as a whole:
The regression model included price (in dollars) as the dependent variable and carat as the predictor. Prior to analysis, carat was centered. The adjusted R2 of the model was 0.85.
Next, let’s describe the result of the carat predictor:
The effect of carat was significant (Estimate = 7756, SE = 14.07, t = 551.4, p < .001), indicating that for each increase of 1 carat in diamond weight, the price increased by $7,756.
Now let’s do one that’s a bit more complicated.
summary(vif_centered)
##
## Call:
## lm(formula = mpg ~ scale(disp, scale = FALSE) * scale(hp, scale = FALSE),
## data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5153 -1.6315 -0.6346 0.9038 5.7030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.20207699 0.73949796 24.614 < 0.0000000000000002 ***
## scale(disp, scale = FALSE) -0.03082639 0.00637635 -4.834 0.0000436 ***
## scale(hp, scale = FALSE) -0.03097348 0.01166910 -2.654 0.01296 *
## scale(disp, scale = FALSE):scale(hp, scale = FALSE) 0.00029005 0.00008694 3.336 0.00241 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.692 on 28 degrees of freedom
## Multiple R-squared: 0.8198, Adjusted R-squared: 0.8005
## F-statistic: 42.48 on 3 and 28 DF, p-value: 0.0000000001499
First, let’s describe the model as a whole:
The regression model included miles per gallon as the dependent variable. The predictors included the total displacement of the engine (in cubic inches) and the overall horsepower of the engine (in horses), as well as the interaction between them. Prior to analysis, both variables were centered. Likelihood ratio tests confirmed that both predictors and the interaction term contributed to model fit. The adjusted R2 of the model was 0.80, and the AIC was 159.91.
Next, let’s describe the result of each predictor:
The effect of displacement was significant (Estimate = -0.03, SE = 0.0064, t = -4.83, p < .001), indicating that for each increase of 1 cubic inch of displacement, the miles per gallon decreased slightly. Similarly, the effect of horsepower (Estimate = -0.03, SE = 0.012, t = -2.65, p = .013) indicated that as horsepower increased, miles per gallon decreased. The interaction of the two predictors was also significant (Estimate = 0.0003, SE = 0.00009, t = 3.34, p = .0024), indicating that as horsepower increased, the effect of displacement on miles per gallon got weaker, and vice versa.
This might be a good place to include a graph illustrating the interaction.
Now let’s do a regression, start to finish. Run the following code:
data(FirstYearGPA)
?FirstYearGPA
head(FirstYearGPA)
## GPA HSGPA SATV SATM Male HU SS FirstGen White CollegeBound
## 1 3.06 3.83 680 770 1 3.0 9.0 1 1 1
## 2 4.15 4.00 740 720 0 9.0 3.0 0 1 1
## 3 3.41 3.70 640 570 0 16.0 13.0 0 0 1
## 4 3.21 3.51 740 700 0 22.0 0.0 0 1 1
## 5 3.48 3.83 610 610 0 30.5 1.5 0 1 1
## 6 2.95 3.25 600 570 0 18.0 3.0 0 1 1
Why do a multiple regression? Do a multiple regression when you want to…
We’ve talked about Type I error and how doing more tests increases Type I error probability. If we can do all our ‘tests’ in a single model, it only counts as 1 test, so we avoid this problem. We saw this in the example above, when we evaluated the influence of many different factors on college GPA.
In the real world, different variables are often highly inter-correlated, co-occurring together willy-nilly in ways that are outside of our control. Perhaps you have heard of confounds: two predictors that are so overlapping that it is difficult to know which is influencing our dependent variable. For example:
If we can’t (or shouldn’t) do an experiment where we exert control by manipulating variables, we are faced with a challenge; how do we know that this variable is really influencing our dependent variable, and not some other confound?
Regression gives us a chance to test this. If we have a variable we think might be a confound, we can add it to the regression. Since each predictor is only significant if it adds new predictive power over and above all other predictors, then our original predictor will only be significant if it provides unique information about the dependent variable. If our original predictor only seemed like it was significant because it was correlated with the new predictor we added, it will no longer be statistically significant.
Each new predictor we add to our model in this way is called a co-variate, because we are checking to see whether it varies together with (co-varies with) our original predictor of interest.
Here’s an example from the NeuroCog data set. We might hypothesize that people with good Verbal skills would have better overall social abilities. So we could regress SocialCog scores on Verbal (a continuous variable, which we have centered), like so:
multiple_regression <- lm(SocialCog ~ Dx + scale(Verbal, scale = FALSE), data = NeuroCog)
summary(multiple_regression)
##
## Call:
## lm(formula = SocialCog ~ Dx + scale(Verbal, scale = FALSE), data = NeuroCog)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.4408 -6.9409 -0.8932 8.5092 26.2735
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.22542 0.99659 46.384 < 0.0000000000000002 ***
## DxSchizophrenia -7.93779 1.93509 -4.102 0.0000563 ***
## DxSchizoaffective -2.41438 2.17986 -1.108 0.269
## scale(Verbal, scale = FALSE) 0.34286 0.08039 4.265 0.0000289 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.49 on 238 degrees of freedom
## Multiple R-squared: 0.1964, Adjusted R-squared: 0.1863
## F-statistic: 19.39 on 3 and 238 DF, p-value: 0.00000000002774
#summary(NeuroCog)
It’s quite significant; as Verbal scores go up, so does Social Cognition. But wait! Since cognitive abilities are inter-related, maybe better Verbal scores are really just the result of other abilities, such as Attention. Verbal ability and Attention ARE correlated (r = 0.59).
To test this, let’s add Attention as a co-variate to the model:
multiple_regression_w_covariate <- lm(SocialCog ~ Dx + scale(Verbal, scale = FALSE) + scale(Attention, scale = FALSE), data = NeuroCog)
summary(multiple_regression_w_covariate)
##
## Call:
## lm(formula = SocialCog ~ Dx + scale(Verbal, scale = FALSE) +
## scale(Attention, scale = FALSE), data = NeuroCog)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.8208 -7.3469 -0.2813 8.1393 29.0185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.81316 0.96044 47.700 < 0.0000000000000002 ***
## DxSchizophrenia -6.84981 1.87166 -3.660 0.000311 ***
## DxSchizoaffective -1.47423 2.10157 -0.701 0.483686
## scale(Verbal, scale = FALSE) 0.12345 0.09052 1.364 0.173900
## scale(Attention, scale = FALSE) 0.30617 0.06607 4.634 0.00000593 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.02 on 237 degrees of freedom
## Multiple R-squared: 0.2632, Adjusted R-squared: 0.2508
## F-statistic: 21.16 on 4 and 237 DF, p-value: 0.000000000000006158
Attention is significant, but Verbal isn’t anymore. This tells us that Verbal wasn’t independently predicting Social Cognition; what we thought was an effect of Verbal was really just an effect of Attention, since the two variables were correlated.
Let’s try another one! For this example, we’ll use a different data set from heplots: TIPI, a set of 1799 people who took a personality test and answered some demographic questions. We’ll use Agreeableness as our dependent variable, and married (Levels: Never Married, Currently Married, Previously Married) as our first predictor.
multiple_regression_w_covariate_2 <- lm(Agreeableness ~ married,
data = filter(TIPI, age < 100)) # One of the respondents put their age at 5555. I removed that.
summary(multiple_regression_w_covariate_2)
##
## Call:
## lm(formula = Agreeableness ~ married, data = filter(TIPI, age <
## 100))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8065 -0.7548 -0.1653 0.8347 2.8347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.16534 0.03723 111.882 < 0.0000000000000002 ***
## marriedCurrently married 0.58944 0.08652 6.812 0.0000000000131 ***
## marriedPreviously married 0.64111 0.14827 4.324 0.0000161643021 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.384 on 1786 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.03221, Adjusted R-squared: 0.03112
## F-statistic: 29.72 on 2 and 1786 DF, p-value: 0.0000000000002014
Look at that! It looks like people who are currently married or previously married are more agreeable than people who were never married. That makes sense, kinda - agreeable people are more likely to find marriage partners. But wait! Let’s add in age as a co-variate!
multiple_regression_w_covariate_2 <- lm(Agreeableness ~ married + scale(age, scale = FALSE),
data = filter(TIPI, age < 100)) # One of the respondents put their age at 5555. I removed that.
summary(multiple_regression_w_covariate_2)
##
## Call:
## lm(formula = Agreeableness ~ married + scale(age, scale = FALSE),
## data = filter(TIPI, age < 100))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6949 -0.9136 -0.0520 0.9480 3.0484
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.261699 0.039652 107.478 < 0.0000000000000002 ***
## marriedCurrently married 0.192474 0.104938 1.834 0.0668 .
## marriedPreviously married 0.137727 0.165607 0.832 0.4057
## scale(age, scale = FALSE) 0.020088 0.003077 6.529 0.0000000000858 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.368 on 1785 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.05478, Adjusted R-squared: 0.05319
## F-statistic: 34.48 on 3 and 1785 DF, p-value: < 0.00000000000000022
Now the married variable is no longer significant. It turns out that people who are or were married are just OLDER than never-married people on average, and older people tend to be more agreeable. So it’s not the marital status that really predicts agreeableness, it’s the age.
Multiple regression allows us to control statistically for variables we can’t control experimentally. Neat!
Now let’s do another to help us learn about co-variates. Run the following code:
data(ArcheryData)
?ArcheryData
ArcheryData
## Attendance Average Sex Day1 LastDay Improvement Improve
## 1 21 229.619 m 195 226 31 1
## 2 20 188.900 m 140 251 111 1
## 3 20 91.650 f 51 142 91 1
## 4 22 188.227 m 133 255 122 1
## 5 21 162.048 m 151 182 31 1
## 6 21 169.952 m 201 253 52 1
## 7 21 95.381 m 42 184 142 1
## 8 22 116.500 f 62 85 23 1
## 9 19 188.368 m 174 191 17 1
## 10 22 122.727 f 56 168 112 1
## 11 22 123.818 f 86 173 87 1
## 12 20 97.500 f 86 131 45 1
## 13 22 95.500 f 109 99 -10 0
## 14 21 165.143 m 141 183 42 1
## 15 21 193.762 m 174 173 -1 0
## 16 20 98.400 f 51 49 -2 0
## 17 16 158.625 m 164 232 68 1
## 18 22 66.818 f 16 78 62 1
Just like with ANOVAs, we can look for interactions in regression.
For this example, we’ll use the TIPI data set again. We’ll use Agreeableness as our dependent variable, and gender and age as our predictors, to see if women are more agreeable than men, and if people become more or less agreeable as they age (“Get off my lawn!”). We’ll also add in the interaction with the asterisk (*).
regression_with_interaction <- lm(Agreeableness ~ gender * scale(age, scale = FALSE),
data = filter(TIPI, age < 100)) # One of the respondents put their age at 5555. I removed that.
summary(regression_with_interaction)
We set up the formula in the same way as ANOVA. The interpretation of interactions in regression is a bit more complicated. Remember that ANOVA was an omnibus test - it just told us if something somewhere was significant, and we had to do follow-up tests to figure out what. In regression, we skip straight to the follow-up tests. This is both good and bad. It’s good because there are no extra steps. It’s bad because we have to learn to read and interpret the regression results directly.
Something to keep in mind:
Hopefully this will make more sense as we go through some examples.
Let’s start with the interaction of two continuous variables. We’ll use the cars data for that:
continuous_interaction <- lm(mpg ~ scale(disp, scale = FALSE) * scale(hp, scale = FALSE), data = mtcars)
summary(continuous_interaction)
##
## Call:
## lm(formula = mpg ~ scale(disp, scale = FALSE) * scale(hp, scale = FALSE),
## data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5153 -1.6315 -0.6346 0.9038 5.7030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.20207699 0.73949796 24.614 < 0.0000000000000002 ***
## scale(disp, scale = FALSE) -0.03082639 0.00637635 -4.834 0.0000436 ***
## scale(hp, scale = FALSE) -0.03097348 0.01166910 -2.654 0.01296 *
## scale(disp, scale = FALSE):scale(hp, scale = FALSE) 0.00029005 0.00008694 3.336 0.00241 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.692 on 28 degrees of freedom
## Multiple R-squared: 0.8198, Adjusted R-squared: 0.8005
## F-statistic: 42.48 on 3 and 28 DF, p-value: 0.0000000001499
Notice how I centered these variables: What’s a 0 horsepower car, anyway?
Let’s walk through these results. Our dependent variable is Miles Per Gallon (mpg), and our two predictors are Displacement (disp; the total volume of all cylinders in the engine in cubic inches) and Horsepower (hp). Let’s discuss each finding:
This interaction term also applies to the effect of hp: the slope of horsepower is affected by displacement, so that for each unit increase in displacement, the slope of horsepower becomes more positive. Overall, this interaction term indicates that as one value (displacement OR horsepower) gets big, the influence of the other becomes weaker.
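To make this concrete, here is a small sketch that plugs the coefficients from the output above into the regression equation, showing the slope of displacement at three levels of (centered) horsepower:

b_disp <- -0.03082639 # slope of centered disp from the model above
b_inter <- 0.00029005 # interaction coefficient from the model above
hp_centered <- c(-50, 0, 50) # 50 hp below average, average, 50 hp above average
b_disp + b_inter * hp_centered # the disp slope moves toward 0 as horsepower increases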
Looking for interactions between categorical and continuous variables is - perhaps - the most intuitive type of interaction in regression. What we’re asking is: Does the slope of our continuous variable differ for the different levels of our categorical variable?
Now let’s go back to our personality data. By modeling the interaction, we’re asking: Is the effect of age on Agreeableness different for males and females?
categorical_continuous_interaction <- lm(Agreeableness ~ gender * scale(age, scale = FALSE), data = filter(TIPI, age < 100))
summary(categorical_continuous_interaction)
##
## Call:
## lm(formula = Agreeableness ~ gender * scale(age, scale = FALSE),
## data = filter(TIPI, age < 100))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3303 -0.8795 -0.0146 0.8922 3.2363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.061613 0.043707 92.927 < 0.0000000000000002 ***
## genderF 0.535813 0.063782 8.401 < 0.0000000000000002 ***
## scale(age, scale = FALSE) 0.019297 0.003159 6.109 0.00000000123 ***
## genderF:scale(age, scale = FALSE) 0.014691 0.004641 3.165 0.00158 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.34 on 1794 degrees of freedom
## Multiple R-squared: 0.09443, Adjusted R-squared: 0.09291
## F-statistic: 62.36 on 3 and 1794 DF, p-value: < 0.00000000000000022
#summary(filter(TIPI, age < 100))
Let’s walk through these results. Our dependent variable is Agreeableness, and our two predictors are Gender (Male vs Female) and age (13-75), which I have centered. Let’s discuss each finding:
See the figure below for a visualization of these results. Notice that:
ggplot(data = filter(TIPI, age < 100), aes(x = age, y = Agreeableness)) +
  geom_jitter(aes(color = gender, shape = gender, fill = gender), size = 3) +
  geom_smooth(method = "lm", formula = y ~ x, aes(color = gender, lty = gender),
              fill = alpha("gray98", 1), se = TRUE, size = 3) +
  theme_bw() +
  theme(axis.text = element_text(angle = 90)) +
  labs(
    color = "Gender", shape = "Gender",
    lty = "Gender", fill = "Gender",
    x = "Age (Years)",
    y = "Agreeableness"
  ) +
  scale_color_manual(values = c(alpha("blue", 1), alpha("pink2", 1))) +
  scale_shape_manual(values = c(22, 21)) +
  scale_fill_manual(values = c(alpha("blue", 0.1), alpha("pink2", 0.1)))
When we include the interaction of two categorical variables, we’re asking: is the effect of one variable different for the different levels of the other variable?
In the regression model below, we’re asking two questions:
Also note that I’ve left age in as a co-variate, since I want to be sure that any effect of the married variable is NOT really due to age.
categorical_interaction <- lm(Neuroticism ~ gender * married + age, data = filter(TIPI, age < 100))
summary(categorical_interaction)
##
## Call:
## lm(formula = Neuroticism ~ gender * married + age, data = filter(TIPI,
## age < 100))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.764 -1.363 0.067 1.283 4.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.308433 0.115090 37.435 < 0.0000000000000002 ***
## genderF 0.760614 0.091370 8.325 < 0.0000000000000002 ***
## marriedCurrently married -0.005425 0.155992 -0.035 0.97226
## marriedPreviously married 0.513589 0.284352 1.806 0.07106 .
## age -0.023462 0.003824 -6.136 0.00000000104 ***
## genderF:marriedCurrently married -0.299254 0.213284 -1.403 0.16077
## genderF:marriedPreviously married -1.179285 0.364415 -3.236 0.00123 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.68 on 1782 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.09217, Adjusted R-squared: 0.08911
## F-statistic: 30.15 on 6 and 1782 DF, p-value: < 0.00000000000000022
Let’s walk through these results. Our dependent variable is Neuroticism, and our two predictors are gender (Male, Female) and married (Never Married, Currently married, Previously married). Let’s discuss each finding:
See the figure below for a visualization of these effects. Notice how the difference between bars is opposite in the ‘Previously Married’ condition. It looks like divorced males revert back to Never Married levels of neuroticism, while divorced females become less neurotic than any other group of females. Weird.
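If you want to see the pattern behind this interaction in numbers rather than a figure, here is a quick sketch of the group means (assuming dplyr is loaded, as elsewhere in this tutorial):

filter(TIPI, age < 100) %>%
  group_by(gender, married) %>%
  summarize(mean_neuroticism = mean(Neuroticism, na.rm = TRUE), .groups = "drop")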
Now let’s do a regression that involves interactions. Run the following code:
data(FaithfulFaces)
?FaithfulFaces
head(FaithfulFaces)
## SexDimorph Attract Cheater Trust Faithful FaceSex RaterSex
## 1 4.900 2.000 1 4.118 6.294 F M
## 2 2.583 2.083 1 4.000 6.588 F M
## 3 4.300 2.667 1 3.941 5.824 F M
## 4 4.083 2.917 0 4.118 5.412 F M
## 5 4.500 3.500 0 3.235 4.000 F M
## 6 2.000 1.833 0 4.647 6.647 F M
Regression works great on between-subjects data. Remember that between-subjects data means that we only have one observation per participant.
If you only have categorical variables, just do an ANOVA. But regression is good if you have:
You should NOT use regression when you have within-subjects data. Recall that in within-subjects data, each subject provides more than one data point, and our statistical model needs to account for the inter-relatedness of these data points: data points that come from the same source are not independent. Regression has no way to account for this, so it is a bad tool for within-subjects data. So, when you have within-subjects AKA repeated-measures AKA nested data, don’t use regression. Use mixed-effects modeling for this instead. What’s mixed-effects modeling? Stay tuned!
Also, it’s a bit weird to use a regression when you only have categorical predictors. You COULD do regression, it’s not wrong, but then ANOVA has nothing to do.
For this task, we will be using data from this paper:
Soto, C. J. (2021). Do links between personality and life outcomes generalize? Testing the robustness of trait–outcome associations across gender, age, ethnicity, and analytic approaches. Social Psychological and Personality Science, 12(1), 118-130.
The Open Science Framework Page for this data can be found here:
Using the paper linked above as a reference, answer the following questions:
Note. Download the file, NOT the metadata.
Test the following research questions:
The codebook for this study is at https://osf.io/auxzt
To do this, do the following:
Write a brief APA-style paragraph describing the results.