Intro to Regression

We learned previously that there are 4 things we need to know about any statistical test:

  • What variables can the test handle?
  • What statistic does the test generate?
  • What distribution does the test use?
  • What arguments does the R function require?

Let’s talk about these 4 needs in relation to our next test: regression.

Variables

Dependent Variables

Like all the tests we’ve seen so far, a regression usually takes a single numeric dependent variable.

BUT there are exceptions: multivariate regression can have multiple dependent variables. I don’t see this much in my discipline, but perhaps you will in yours. Once you know how to do a regular regression, multivariate regression isn’t too hard.

Independent/Predictor Variables

Most of the time, a regression is done on non-experimental data. For this reason, we usually use the word ‘predictor’ to describe the predictive variables in our regression models, rather than ‘independent variable’. This isn’t an absolute rule, but rather a strong trend.

What Kind?

Regressions can take both categorical variables (i.e. factors) and continuous variables (i.e. numbers) as predictors. For categorical predictors, there is no limit on the number of levels (except the limits that you should impose for your own sanity).

How Many?

As many as you want/need. Just be cautious - more variables = more potential problems AND more interpretation headaches for you.

  1. A regression needs a dependent variable. What sort of variable (number, factor, string) should this dependent variable be?
  2. How many dependent variables can a regression take?
  3. A regression needs predictor/independent variable(s). What sort of variable (number, factor, string) should these be?
  4. How many variables can a regression take as predictor/independent variable(s)?
  5. How many levels can a given predictor/independent variable have in regression?
  6. Describe a hypothetical study where a regression would be the appropriate test to use and an ANOVA would not be.
  7. What is the main weakness of a regression? Why?
    • Hint: Between- or within-subjects data?
  8. Define the following terms in your own words:
    • Correlational Study/Non-Experimental Data
    • Experiment
  9. If you see that a paper has used a regression, is the data most probably from an experiment or a correlational study?

Statistic

Regression uses the t statistic. Since we already learned about this when we talked about t-tests, you have a conceptual head start here.

BUT the t statistic is calculated differently in regression than it is in the t-test. There are 4 steps:

  1. Compute the regression coefficient.
  2. Compute the standard error of the regression coefficient.
  3. Divide the regression coefficient by its standard error.
  4. Figure out the degrees of freedom.
  1. WITHOUT LOOKING, see if you can write down the 4 steps to compute the t statistic in a linear regression. Points for a good effort.

1. Compute the regression coefficient

Regression is often called linear regression. The reason for this is that we are trying to find a line that best fits our data.

Think of it this way:

Each variable is one dimension. If we want to summarize a single variable (one-dimensional data), such as an exam score, we can do so with a single point - the mean. If we want to summarize the relationship between two variables (two-dimensional data), we can’t use a point, we have to move up a dimension and use a line. And this is what regression does - it tests the strength of a relationship between two variables.

The mean is the number that balances the data: if we subtract the mean from each data point and then add up the differences, we end up with 0:

Number   Mean   Difference
2        5.1         -3.1
4        5.1         -1.1
6        5.1          0.9
7        5.1          1.9
7        5.1          1.9
5        5.1         -0.1
3        5.1         -2.1
1        5.1         -4.1
7        5.1          1.9
9        5.1          3.9
Sum of Differences      0

Another way to think about the mean is as the point that minimizes the summed squared distances to the data points. In other words:

x <- c(2, 4, 6, 7, 7, 5, 3, 1, 7, 9) # The numbers from the table above
xdata <- data.frame()

for (i in (-20:20)/10) {
  y <- c(i, sum((x - (mean(x) + i))^2))
  xdata <- rbind(xdata, y)
}
colnames(xdata) <- c("MeanDifference", "SumofSquares")

ggplot(data = xdata, aes(x = MeanDifference, y = SumofSquares)) +
  geom_line() + 
  geom_point(size = 3, alpha = 1, color = "blue4", fill = "cornflowerblue", shape = 21) +
  theme_bw() +
  labs(
    x = "Difference Between Reference Value and Mean Value",
    y = "Sum of Squared Deviations from the Reference Value"
  )

So that’s how the mean works. As mentioned, regression operates on the same principle, but in 2 dimensions instead of one. We want to find the line that is as close as possible to the data points, like so:

mod <- lm(price ~ carat, data = filter(diamonds, color == "E" & cut == "Fair"))
#diamonds <- transform(price, Fitted = fitted(mod))

ggplot(data = filter(diamonds, color == "E" & cut == "Fair"), aes(x= carat, y = price)) +  
  geom_segment(aes(x = carat, y = price, xend = carat, yend = fitted(mod)), color = "gray50") +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE, color = "black") +
  geom_point(size = 3, alpha = 1, color = "blue4", fill = "cornflowerblue", shape = 21) + 
  labs(
    x = "Carats (weight of the diamond)",
    y = "Price (in $)"
  ) + theme_bw()

This line represents our estimate of the relationship between the predictor (carats) and the dependent variable (price).

The blue circles represent the actual data points.

The vertical gray lines represent the ‘residuals’, the distance of the data points from the line. These residuals represent the unexplained variance left in the data after we apply our line. Our goal is to find the line that makes these residuals (or rather, the sum of the squares of these residuals) as small as possible.

Try it yourself! (Optional fun)

To see regression in action, copy this code chunk and run it! Notice how the residuals get bigger as the line moves away from the best-fitting slope.

if (!require("shiny")) install.packages("shiny")
library(shiny)
library(tidyverse) # For filter(), %>%, ggplot(), and the diamonds data (if not already loaded in your setup chunk)

filtered_diamonds <- filter(diamonds, color == "E" & cut == "Fair")

filtered_diamonds %>%
  summarize(slope = 
              sum((carat - mean(carat)) * (price - mean(price))) / 
              sum((carat - mean(carat))^2),
            intercept = 
              mean(price) - slope * mean(carat)
            ) -> diamond_summary

shinyApp(
  ui = fluidPage(
    fluidRow(
      column(12, wellPanel(
        sliderInput("slope", label = h3("Slope"), min = -150000,
                    max = 150000, value = 7348.326)
      ))),
      fluidRow(
      column(12, offset = 0,
        plotOutput("plot")
      )
    )),
  server = function(input, output) {
      output$plot = renderPlot({
        slope <- input$slope
        intercept <- mean(filtered_diamonds$price) - slope * mean(filtered_diamonds$carat)  
        yend <-  intercept + slope * (filtered_diamonds$carat)
        SS <- sum((filtered_diamonds$price - (intercept + slope * filtered_diamonds$carat))^2)
        ggplot(data = filtered_diamonds, aes(x= carat, y = price)) +  
          geom_segment(aes(x = carat, y = price, xend = carat, yend = yend), color = "gray50") +
          geom_abline(slope=slope, intercept=intercept, color = "black") +
          geom_smooth(method = lm, formula = y ~ x, se = FALSE, color = "red") +
          geom_point(size = 3, alpha = 1, color = "blue4", fill = "cornflowerblue", shape = 21) + 
          labs(
            x = "Carats (weight of the diamond)",
            y = "Price (in $)"
          ) + theme_bw() + 
          theme(text = element_text(size = 30)) +
          coord_cartesian(ylim = c(0, 16000)) + 
          annotate(geom = "text", x = 0.5, y = 15000, label = paste("SS Residual:", formatC(SS, format = "e", digits = 2)), size = 15)
      }) 
},
options = list(height = 900)
)

But how do we get this line?

Remember, a line has 2 properties:

  • An intercept (where it crosses the y axis, i.e. the value of Y when X is 0)
  • A slope (how steep it is, i.e. rise over run)

Computing the slope

If you are interested, you can compute the slope as follows (using the diamonds data set):

diamonds %>%
  summarize(slope = 
              sum((carat - mean(carat)) * (price - mean(price))) / 
              sum((carat - mean(carat))^2))

It’s a little more complicated than calculating the mean, but it’s the same idea, just in 2 dimensions.
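In symbols, the code above computes

\[ b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \]

where \(x\) is the predictor (carat), \(y\) is the dependent variable (price), and \(b\) is the slope.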

Computing the Intercept

Once we have the slope, we just need to calculate where the line would cross the Y axis (i.e. what Y will be when X is 0):

diamonds %>%
  summarize(slope = # Compute the slope - we need it to compute the intercept
              sum((carat - mean(carat)) * (price - mean(price))) / 
              sum((carat - mean(carat))^2),
            intercept = # Compute the intercept
              mean(price) - slope * mean(carat))
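In symbols, the intercept is

\[ a = \bar{y} - b\,\bar{x} \]

that is, the mean of the dependent variable minus the slope times the mean of the predictor.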

What does this mean?

This gives us the following numbers:

   slope    intercept
7756.426    -2256.361

Some facts about the slope:

  • The slope represents how much Y changes for each unit change in X. In our diamonds example, this means that the price of a diamond increases by $7,756 for each increase of 1 carat in weight.
  • This slope is commonly called the regression coefficient. You may see it referred to as b, the estimate, or (if it has been standardized) as beta or \(\beta\).

Some facts about the intercept:

  • The intercept represents what Y would be if X were 0. In our diamonds example, this means that a 0 carat diamond would be worth -$2,256. If you try to sell a weightless diamond, you’ll end up owing someone money.
  1. Why is regression often called ‘linear regression’? What role do lines play in all this?
  2. Geometry question. Define the following terms:
    • Slope
    • Intercept
  3. List 3 other ‘names’ often given to the slope of a line in linear regression.
  4. Complete the following sentences:
    • A bigger ______ means a ______ relationship between the predictor and the dependent variable.
    • A smaller ______ (i.e. closer to 0) indicates ______ relationship between the predictor and the dependent variable.
    • Differences in the ______ are usually not of interest to us in regression.

2. Calculate the standard error

The standard error of the regression coefficient is computed by:

  1. Calculating the uncertainty (standard error) of our predictions. In other words, how far were our estimated Y values from the actual Y values?
  2. Dividing this by the standard deviation of our predictor.

So, the less accurate our predictions are, the larger the standard error will be. Not really a surprise.

Also, the larger the standard deviation of our predictor variable is, the smaller our standard error will be. In other words, more variability in our predictor variable is a good thing - it increases our confidence that we’ve found the best-fitting line.
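In symbols, the calculation in the code below is

\[ SE(b) = \frac{\sqrt{\sum_i (y_i - \hat{y}_i)^2 \, / \, (n - 2)}}{\sqrt{\sum_i (x_i - \bar{x})^2}} \]

where \(\hat{y}_i\) is the value of Y that our line predicts for observation \(i\).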

diamonds %>%
  summarize(slope = sum((carat - mean(carat)) * (price - mean(price))) / sum((carat - mean(carat))^2), 
            intercept = mean(price) - slope * mean(carat), 
            se_slope = sqrt(
              sum(
                (price - (intercept + slope * carat))^2
                )/
                (n() - 2)) / sqrt(sum((carat - mean(carat))^2)))
## # A tibble: 1 × 3
##   slope intercept se_slope
##   <dbl>     <dbl>    <dbl>
## 1 7756.    -2256.     14.1
  1. Complete the following sentences:
    • The farther, on average, the dependent variable values are from our line, the ______ the standard error will be. This will make it ______ to achieve statistical significance.
    • The more variability in our predictor variable, the ______ the standard error will be. This will make it ______ to achieve statistical significance.
    • The bigger our standard error is, the ______ accurate our predictions are. In other words, bigger standard error means our line is a ______ fit to the data.

3. Compute the t statistic by dividing the regression coefficient by its standard error.

Now we compute the t statistic.

Slope / Standard Error = t

diamonds %>%
  summarize(slope = sum((carat - mean(carat)) * (price - mean(price))) / sum((carat - mean(carat))^2), 
            intercept = mean(price) - slope * mean(carat), 
            se_slope = sqrt(
              sum(
                (price - (intercept + slope * carat))^2
                )/
                (n() - 2)) / sqrt(sum((carat - mean(carat))^2)),
            t = slope / se_slope)
## # A tibble: 1 × 4
##   slope intercept se_slope     t
##   <dbl>     <dbl>    <dbl> <dbl>
## 1 7756.    -2256.     14.1  551.

That’s a pretty big t value!

Let’s consider what influences our t score:

  • A larger slope means a larger t score
    • Our slope is larger if there is a bigger change in Y for each change in X.
  • A smaller standard error means a larger t score
    • The more confident we are in our slope estimate, the larger the t
  1. What two numbers do we divide to get our t statistic?
  2. Complete the following sentences:
    • The larger the slope (regression coefficient), the ______ the t statistic.
    • The larger the standard error, the ______ the t statistic.

4. Degrees of Freedom

The Degrees of Freedom for the regression model =

  • Number of Observations - Number of Regression Coefficients - 1

Since we only have one regression coefficient (one slope), this is equal to:

diamonds %>%
  summarize(slope = sum((carat - mean(carat)) * (price - mean(price))) / sum((carat - mean(carat))^2), 
            intercept = mean(price) - slope * mean(carat), 
            se_slope = sqrt(
              sum(
                (price - (intercept + slope * carat))^2
                )/
                (n() - 2)) / sqrt(sum((carat - mean(carat))^2)),
            t = slope / se_slope, 
            df = n() - 1 - 1)
## # A tibble: 1 × 5
##   slope intercept se_slope     t    df
##   <dbl>     <dbl>    <dbl> <dbl> <dbl>
## 1 7756.    -2256.     14.1  551. 53938

Imagine that I have the following regression model:

fakemodel <- lm(weight ~ height + shoesize, data = fakeweightdata)

I have data from 120 people.

  1. What are the degrees of freedom for my regression model?

Distribution

Since we have a t statistic, we will be using the t distribution here. Everything you learned about this distribution for the t test applies here.
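For example, if you ever want to compute the p value by hand, you can feed a t statistic and its degrees of freedom to the pt() function (here, the carat numbers we computed above; the doubling makes the test two-tailed):

t_value <- 551.4          # The t statistic for the carat slope, computed above
df <- 53938               # The degrees of freedom, computed above
2 * pt(-abs(t_value), df) # Two-tailed p value (effectively 0 here)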

Arguments

So how do we set up our regression? You’ll be happy to learn that it is very similar to an ANOVA. But instead of aov(), we use lm(). lm stands for linear, I guess. (Just kidding. It stands for ‘linear model’.)

simple_regression <- lm(price ~ carat, data = diamonds)
summary(simple_regression)
## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -2256.36      13.06  -172.8 <0.0000000000000002 ***
## carat        7756.43      14.07   551.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 0.00000000000000022

Interpreting Regression Output

Scroll up a bit and look at the slope, intercept, etc. that we computed. Then look at the R regression output. Did we do our math right? What does R call the slope, intercept, standard error, etc.?

The most important part of this output is the Coefficients: section.

  • The first column tells us which model parameter is being tested.
    • The first row is the model intercept.
    • The second row is the carat predictor.
  • The second column, Estimate, gives us the model parameter itself.
    • In the Intercept row, this is the intercept.
    • In every other row, this will be a slope.
  • The third column, Std. Error, gives the standard errors.
  • The fourth column, t value, gives the t values.
  • The fifth column, Pr(>|t|), gives the p values.

Down below the coefficients is some further information.

  • 1st row: The residual standard error and the model’s residual degrees of freedom
  • 2nd row: different measures of R-squared. This is a measure of the goodness of fit of the model. We’ll return to this when we discuss reporting the model. Just know that we like the second number best.
  • 3rd row: an ANOVA on the model as a whole. If this is significant, we can conclude that the model fit the data better than an empty model with no predictors would have. How exciting!
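If you would rather pull these numbers out of the model programmatically than read them off the printout, a few base R accessors will do it (a quick sketch using the simple_regression model from above):

coef(simple_regression)                  # The intercept and slope
summary(simple_regression)$coefficients  # Estimates, standard errors, t values, p values
summary(simple_regression)$adj.r.squared # Adjusted R-squared
confint(simple_regression)               # 95% confidence intervals for the coefficients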

In order to do the regressions that follow, you will need to install and load the Stat2Data package. Add this code to your setup chunk.

if (!require("Stat2Data")) install.packages("Stat2Data")
library(Stat2Data) # Add these lines to your setup chunk!

Consider the following regression output as you answer the next few questions:

## 
## Call:
## lm(formula = sales ~ budget, data = marketing %>% mutate(budget = youtube + 
##     facebook + newspaper))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6655 -1.5685  0.1408  1.9153  8.6274 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 5.091634   0.526230   9.676 <0.0000000000000002 ***
## budget      0.048688   0.001982  24.564 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.12 on 198 degrees of freedom
## Multiple R-squared:  0.7529, Adjusted R-squared:  0.7517 
## F-statistic: 603.4 on 1 and 198 DF,  p-value: < 0.00000000000000022
  1. How many observations were there in this data set?
  2. How well does this model fit the data? How do you know this?
  3. If no money was spent on advertising, how many sales would you expect?
  4. How much did the budget have to increase for each additional sale? Is this a good return on investment?

Centering: ‘Releveling’ continuous variables

When we include a continuous variable in our model, the intercept is the model’s prediction for the dependent variable when that continuous variable equals 0. Sometimes this makes sense, because 0 is a real part of that variable:

  • A 0 for Volume means the sound was off.
  • A 0 for departure delay means the flight took off on time.
  • A 0 for Number of Pets means the person has no pets.

In these cases, the 0 is actually interpretable, and there are likely 0 values as part of the data.

But in many cases, there isn’t a single 0 in our predictor, and the 0 value is meaningless. Consider our diamond example, above. What is a 0 carat diamond? It’s nothing - there’s no such thing. Try proposing with a 0 carat diamond.

So what should we do in such cases? We should center our predictor. We subtract the mean of that predictor from each of its values. This shifts the values so that the mean becomes 0, data points below the mean become negative, and data points above the mean are positive.

What does centering do?

  • It does NOT change the slope for that predictor: we’ve moved the line left or right (x axis), but have not adjusted the Y at all. So the rise over the run is the same.
  • It DOES change the intercept. Now, the intercept is the model’s prediction for the dependent variable when the predictor is at its mean value.

How do you center a variable? There are two ways:

  • Do the math yourself, creating a new variable: data %>% mutate(newvariable = variable - mean(variable))
  • Use the scale() function in your model: lm(dv ~ scale(variable, scale = FALSE), data = data)
    • We set scale = FALSE so that the function only centers the data. If scale = TRUE, the function also transforms the variable into z scores. If you want this to happen, go ahead! It will make different coefficients comparable, but will take them out of their native units.
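For example, the do-it-yourself approach on the diamonds data might look like this (diamonds_centered and carat_c are names made up for illustration):

diamonds_centered <- diamonds %>%
  mutate(carat_c = carat - mean(carat)) # Mean carat becomes 0

centered_by_hand <- lm(price ~ carat_c, data = diamonds_centered)

This should give the same slope as the scale() version we fit next.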

Let’s redo our diamond regression, this time centering the variable.

simple_regression_centered <- lm(price ~ scale(carat, scale = FALSE), data = diamonds)
summary(simple_regression)$coefficients
##              Estimate Std. Error   t value Pr(>|t|)
## (Intercept) -2256.361   13.05535 -172.8304        0
## carat        7756.426   14.06658  551.4081        0
summary(simple_regression_centered)$coefficients
##                             Estimate Std. Error  t value Pr(>|t|)
## (Intercept)                 3932.800   6.667655 589.8325        0
## scale(carat, scale = FALSE) 7756.426  14.066579 551.4081        0

If we compare the original model to the new one, we see that the slope etc. for carat is the same before and after centering.

BUT the intercept is different.

In the original model, the intercept said: The value of a 0 carat diamond is -2256.

In the centered model, the intercept says: The value of an average diamond is 3932.8. That’s a more useful number.

So, when your predictor doesn’t have a meaningful 0 value, center it!

Consider the following regression model. The variables are as follows:

  • Age: The age in months when the child first started to talk.
  • Gesell: A score on the Gesell aptitude test, a measure of intelligence.
data(ChildSpeaks)
Talk.lm <- lm(Gesell ~ scale(Age, scale = FALSE), data = ChildSpeaks)
summary(Talk.lm)
## 
## Call:
## lm(formula = Gesell ~ scale(Age, scale = FALSE), data = ChildSpeaks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.604  -8.731   1.396   4.523  30.285 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                93.6667     2.4054  38.940 < 0.0000000000000002 ***
## scale(Age, scale = FALSE)  -1.1270     0.3102  -3.633              0.00177 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.02 on 19 degrees of freedom
## Multiple R-squared:   0.41,  Adjusted R-squared:  0.3789 
## F-statistic:  13.2 on 1 and 19 DF,  p-value: 0.001769
  1. What is the relationship between age of first speaking and intelligence in this sample? Select the option that describes these findings:
  • As age of first speaking (increases/decreases) by 1 month, intelligence (increases/decreases) by 1 point.
  2. What does the intercept represent in this model?
  • Hint: Remember that Age is centered.

What about categorical variables (i.e. factors)?

But how do we do a linear regression when our predictor is categorical? In other words, how do we find a line that fits a factor with multiple levels?

Consider this simple example, the SocialCog data set from the heplots package. This data set contains data from 3 groups: a control group, a group diagnosed with schizophrenia, and a group diagnosed with schizoaffective disorder. Each group took some tests to assess their ability to manage emotions, among other social abilities.

If we wanted to include this categorical Diagnosis variable in a regression, how would we do it? In other words, how do we draw a line when the predictor is Diagnosis (Levels: Control, Schizophrenia, Schizoaffective)? The different levels aren’t numbers, there is no obvious intercept, and what is the distance between control, schizophrenia, and schizoaffective disorder anyway?

For categorical predictors, R picks one of the levels to be the intercept (AKA the baseline or comparison group). By default, R picks the level that comes first alphabetically, but this can be changed. Then a separate ‘slope’ is computed from this ‘intercept’ to each of the other levels, as shown in the table below. Each slope represents the difference between that group’s mean and the baseline group’s mean.

Dx                 Mean    Difference
Control            45.05         0.00
Schizophrenia      35.84        -9.21
Schizoaffective    44.53        -0.52

Compare the data in the table above to the regression coefficients below.

  • The intercept is the mean for the “Control” level of the Dx (Diagnosis) variable
  • The slope for DxSchizophrenia is the difference between the mean of “Control” and the mean of “Schizophrenia”.
  • A separate slope is computed for each level of the Dx variable. In each case, we are comparing that level to the intercept: “Control”.
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)          45.05       1.55   29.03     0.00
## DxSchizophrenia      -9.21       2.47   -3.73     0.00
## DxSchizoaffective    -0.51       2.78   -0.18     0.85

If you only have categorical predictors, like we have here, it is unusual to do a regression. Consider doing an ANOVA instead.

Releveling factors

If you want to change the ‘baseline’ level of a factor, you can! Here is one way:

SocialCog <- SocialCog %>% 
  mutate(Dx = factor(Dx), Dx = relevel(Dx, "Control"))

In the code above, we turned Diagnosis into a factor and then used the relevel function to change the baseline to “Control”.

You can use the levels() function to see what the levels of a variable are, like this:

levels(SocialCog$Dx)
## [1] "Control"         "Schizophrenia"   "Schizoaffective"

The first one listed is the ‘baseline’ condition.

Let’s talk about teen pregnancy. More specifically, let’s find out how the Civil War affected teen pregnancy rates in 2010. Why? I cannot think of a single good reason, except that it will illustrate how to do a regression.

![Don’t dishonor their sacrifice with teen pregnancy!](https://en.wikipedia.org/wiki/American_Civil_War#/media/File:Battle_of_Gettysburg,_by_Currier_and_Ives.png)

Our variables are:

  • CivilWar: 4 groups of states:
    • Border States: KY, MO, OK
    • Confederate States: AL, AR, FL, GA, LA, MS, NC, SC, TN, TX, VA
    • Union States: CA, CT, DE, IA, IL, IN, KS, MA, MD, ME, MI, MN, NH, NJ, NY, OH, OR, PA, RI, VT, WI
    • Other States: AK, AZ, CO, HI, ID, MT, ND, NE, NM, NV, SD, UT, WA, WV, WY
  • Teen: Number of teen pregnancies per 1000 girls in the state
data(TeenPregnancy)
TeenPregnancy <-  TeenPregnancy %>%  mutate(CivilWar = as.factor(CivilWar), CivilWar = case_when(
                              CivilWar == "B" ~ "Border States",
                              CivilWar == "C" ~ "Confederate States",
                              CivilWar == "U" ~ "Union States",
                              CivilWar == "O" ~ "Other States"))
Teen.lm <- lm(Teen ~ CivilWar, data = TeenPregnancy)
summary(Teen.lm)
## 
## Call:
## lm(formula = Teen ~ CivilWar, data = TeenPregnancy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2381  -6.1952  -0.1515   8.1061  24.9333 
## 
## Coefficients:
##                            Estimate Std. Error t value           Pr(>|t|)    
## (Intercept)                  61.667      5.745  10.735 0.0000000000000407 ***
## CivilWarConfederate States    2.970      6.481   0.458             0.6489    
## CivilWarOther States         -6.600      6.293  -1.049             0.2997    
## CivilWarUnion States        -13.429      6.141  -2.187             0.0339 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.95 on 46 degrees of freedom
## Multiple R-squared:  0.3167, Adjusted R-squared:  0.2721 
## F-statistic: 7.106 on 3 and 46 DF,  p-value: 0.0005073
  1. Which of the four groups of states is the ‘baseline’ group to which all others are being compared? How do you know this?
  2. What is the average teen pregnancy rate for this baseline group? What number tells you this?
  3. What is the average teen pregnancy rate for the union states? How can you calculate this?
  4. Copy the code above, but expand the mutate to relevel CivilWar so that Union States is the new baseline.

Contrasts (Optional!)

This section has valuable but somewhat advanced information. Read it if you choose - you may need to understand it later. For now, just know that we’ve been talking about dummy coding of categorical variables, the default in R.

When we talk about Contrasts in regression, we are referring to the way we set up the comparisons between the groups in a categorical variable. The default way of dealing with categorical variables (picking a baseline and comparing the other levels to it) is called dummy coding. But it’s not the only way to code for categorical variables in R. We can also do

  • Deviation coding
  • Helmert coding

And many others.

Changing the contrasts changes the intercept as well.

Dummy Coding

Dummy coding is also sometimes called treatment coding, because it’s the contrast we would use if we have a clear control group and a treatment group.

As we discussed, in dummy coding the intercept is the baseline level/group that we select. The slopes represent the difference of the other levels from this one.

If we use the contrasts() function, we can see how dummy coding is set up.

contrasts(SocialCog$Dx)
##                 Schizophrenia Schizoaffective
## Control                     0               0
## Schizophrenia               1               0
## Schizoaffective             0               1

Some things to note:

  • The ‘Control’ row has no 1 in it; this is our baseline.
  • Each of the columns has a 1 in only one position. This represents the level that is being compared to the baseline in this contrast. So, the ‘Schizophrenia’ column is comparing ‘Control’ to ‘Schizophrenia’.

We can create this contrast matrix ourselves using contr.treatment()

z <- contr.treatment(3) 
colnames(z) <- c("Schizophrenia", "Schizoaffective")
rownames(z) <- c("Control", "Schizophrenia", "Schizoaffective")
z
##                 Schizophrenia Schizoaffective
## Control                     0               0
## Schizophrenia               1               0
## Schizoaffective             0               1

Deviation Coding

In deviation coding, we are comparing each level to the grand mean (the average of all the levels). So, our intercept is NOT some baseline group, it’s the average of all the groups. With deviation coding, we’re asking “does this group deviate significantly from the average?”

When would we want to use deviation coding? When we don’t have a clear control/comparison group. In our SocialCog example, if there was no ‘Control’ level, then we wouldn’t have any clear rationale for choosing a ‘baseline’. So, instead of using dummy coding, which prioritizes one level of our variable over the other(s), we could just say “was one of these conditions a lot different from the other(s)?” That’s the question that deviation coding answers.

In deviation coding, the intercept is the grand mean. The slopes represent the difference of the levels from this grand mean.

We can change our contrasts to Deviation coding using contr.sum(), like this:

contrasts(SocialCog$Dx) <- contr.sum(3)
colnames(contrasts(SocialCog$Dx)) <- c("Control", "Schizophrenia")
contrasts(SocialCog$Dx)
##                 Control Schizophrenia
## Control               1             0
## Schizophrenia         0             1
## Schizoaffective      -1            -1

Notice the row of -1 next to Schizoaffective. Since we can only do a number of contrasts equal to the number of levels -1, we are leaving this group out. Deviation coding will always leave out the last level; if you want to change this, reorder the levels.
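For example, here is a minimal sketch of such a reordering, done on a copy so that the models below are unaffected (Dx_reordered is just an illustrative name):

Dx_reordered <- factor(SocialCog$Dx,
                       levels = c("Schizoaffective", "Control", "Schizophrenia"))
contrasts(Dx_reordered) <- contr.sum(3)
contrasts(Dx_reordered) # Now Schizophrenia, the new last level, gets the row of -1s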

## # A tibble: 3 × 4
##   Dx              means grand_mean deviation
##   <fct>           <dbl>      <dbl>     <dbl>
## 1 Control          45.0       41.8      3.24
## 2 Schizophrenia    35.8       41.8     -5.97
## 3 Schizoaffective  44.5       41.8      2.72

In the table above, you can see the means of each group, the grand mean, and the deviation of each group from the grand mean. Now look at the regression output below. Do you see what Deviation coding is doing?

##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)        41.81       1.13   37.14     0.00
## DxControl           3.24       1.44    2.25     0.03
## DxSchizophrenia    -5.97       1.58   -3.78     0.00

Helmert Coding

In Helmert Coding, we compare each group to the mean of all the groups that come before it. Why would we want to do this? Consider a study where we are studying heart health, and have the following groups:

  • Medication
  • Light Cardio
  • Light Cardio + Medication

In this study, Helmert coding would compare Light Cardio to Medication, to see if cardio was better than drugs. Then the second contrast would compare Light Cardio + Medication to the average of Medication and Light Cardio, to see if the combination of the two is better than the average effect of each one separately.

So maybe you’ll encounter a need for this someday. Maybe.

In the SocialCog data, we could compare Schizoaffective to Schizophrenia, to see if these two groups are different from each other. Then we could compare the mean of both groups to the Control group, to see if the two disordered groups are, considered together, different from the control. Or maybe not.

You can set up the contrasts using contr.helmert(). But you might prefer to create your own custom contrasts, which keeps the coefficients on the original scale so the slopes are still interpretable:

contrasts(SocialCog$Dx) <- contr.helmert(3) # Use built-in function

custom.helmert = matrix(c( # Create custom Helmert contrasts
  c(0, -1/2, 1/2), 
  c(2/3, -1/3, -1/3)
  ), ncol = 2)

contrasts(SocialCog$Dx) <- custom.helmert

colnames(contrasts(SocialCog$Dx)) <- c("SchizophreniavsSchizoaffective", "BothSchizovsControl")
contrasts(SocialCog$Dx)
##                 SchizophreniavsSchizoaffective BothSchizovsControl
## Control                                    0.0           0.6666667
## Schizophrenia                             -0.5          -0.3333333
## Schizoaffective                            0.5          -0.3333333

In Helmert coding, like in Deviation coding, the intercept is the grand mean.

The slope in the ‘SchizophreniavsSchizoaffective’ row compares ‘Schizophrenia’ to ‘Schizoaffective’. The next row compares ‘Control’ to the mean of both ‘Schizo-’ groups. Using this contrast here doesn’t really make sense, but I present the results for the purposes of illustration:

##                                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)                         41.81       1.13   37.14     0.00
## DxSchizophreniavsSchizoaffective     8.70       3.00    2.90     0.00
## DxBothSchizovsControl                4.86       2.16    2.25     0.03

Multiple Regression

In regression, we’re not restricted to just one predictor. We can put in a few of them.

When we do, we connect predictors using ‘+’ or ‘*’, just like in ANOVA.
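For example, using the diamonds data from earlier (just a sketch of the syntax, not a model we will interpret):

lm(price ~ carat + depth, data = diamonds) # Two predictors, no interaction
lm(price ~ carat * depth, data = diamonds) # Two predictors plus their interaction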

Let’s try another data set from the heplots package, NeuroCog. This data set is a lot like SocialCog, in that it has 3 groups: Control, Schizophrenia, and Schizoaffective. And it has an overall measure of social cognition. But it also includes multiple cognitive variables, including:

  • Processing Speed
  • Attention
  • Working Memory
  • Verbal learning
  • Visual Learning
  • Problem Solving
  • Age

Let’s set up a regression where we see what factors predict Social Cognition in this sample. We’ll start by adding diagnosis and attention.

multiple_regression <- lm(SocialCog ~ Dx + Attention, data = NeuroCog)
summary(multiple_regression)
## 
## Call:
## lm(formula = SocialCog ~ Dx + Attention, data = NeuroCog)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.6259  -7.8433  -0.0711   8.0173  28.7274 
## 
## Coefficients:
##                   Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       32.02672    2.62865  12.184 < 0.0000000000000002 ***
## DxSchizophrenia   -7.51657    1.80995  -4.153        0.00004577197 ***
## DxSchizoaffective -2.08454    2.05708  -1.013                0.312    
## Attention          0.35330    0.05641   6.263        0.00000000175 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.04 on 238 degrees of freedom
## Multiple R-squared:  0.2574, Adjusted R-squared:  0.248 
## F-statistic:  27.5 on 3 and 238 DF,  p-value: 0.000000000000002639

The Intercept in Multiple Regression

When we put more predictors into our model, we change the equation and so we change the intercept.

  • In a model with a single continuous predictor, the intercept is the point where the predictor = 0.
  • In a model with a single categorical predictor, the intercept is the control/comparison group’s mean.
  • In a model with multiple continuous predictors, the intercept is the point where all predictors are 0.
  • In a model with multiple categorical predictors, the intercept is the mean of the data from all baseline conditions (i.e. if the baseline is “No Music” and “No Lyrics”, the intercept is the mean of the data from the No Music and No Lyrics conditions).
  • In a model with multiple predictors, the intercept is the point where all continuous predictors = 0 AND all categorical predictors = their baseline condition.

Takeaway: As you start adding predictors, the intercept often becomes difficult to interpret.

Coefficients in Multiple Regression

As we add more predictors to our model, the coefficients for the individual predictors often change.

This is because regression is different from ANOVA.

In ANOVA we test the Main Effects of each factor, checking to see if each has a separate and independent influence on the dependent variable.

In regression, by contrast, we are checking to see if each predictor we add has an influence on the dependent variable after accounting for the influence of all other predictors. So, this test is NOT separate. Instead, if a given predictor is significant, it means that this predictor gives us information about the dependent variable that the other predictors don’t. This can be quite useful, as we’ll see below.
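You can see this for yourself by fitting Attention on its own and then alongside Dx, and comparing the coefficients (a quick sketch using the NeuroCog data from above; the Attention coefficient will generally differ between the two models):

coef(lm(SocialCog ~ Attention, data = NeuroCog))      # Attention on its own
coef(lm(SocialCog ~ Dx + Attention, data = NeuroCog)) # Attention after accounting for Dx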

Evaluating Your Model

For multiple regressions, once we have put in the predictors we know we want in order to test our hypotheses, it’s time to check our model. We’ll do three things:

  1. Remove predictors that don’t contribute to the model (model fitting)
  2. Check for (and fix) collinearity
  3. Assess model fit

Model Fitting: Removing Predictors that don’t contribute to the model

This process is called model fitting.

We want to remove a predictor (or an interaction) from our model if:

  • It doesn’t contribute to the overall predictive power of the model (what we call model fit) AND
  • We don’t have to leave it in for hypothesis/reporting reasons.

How can we tell if a given predictor (or interaction) contributes to model fit? There are 2 ways:

  1. Compare two models that differ in only one parameter
  2. Use the step() function.

Comparing two models using likelihood ratio tests

Let’s suppose we want to assess whether we should include the predictor Attention in our model of the NeuroCog data. We would run two regressions, one with and one without it, then compare them using the anova() function. Used in this way, the anova() function performs a likelihood ratio test, comparing the two models to see which leaves less variance unexplained (RSS means residual sum of squares).

NeuroCogRegression <- lm(SocialCog ~ Dx + scale(Verbal, scale = FALSE), data = NeuroCog) # A model without the predictor 'Attention'
NeuroCogRegression2 <- lm(SocialCog ~ Dx + scale(Verbal, scale = FALSE) + scale(Attention, scale = FALSE), data = NeuroCog) # A model with the predictor 'Attention'
anova(NeuroCogRegression, NeuroCogRegression2) # Did adding Attention improve the model fit?
## Analysis of Variance Table
## 
## Model 1: SocialCog ~ Dx + scale(Verbal, scale = FALSE)
## Model 2: SocialCog ~ Dx + scale(Verbal, scale = FALSE) + scale(Attention, 
##     scale = FALSE)
##   Res.Df   RSS Df Sum of Sq      F      Pr(>F)    
## 1    238 31406                                    
## 2    237 28797  1      2609 21.472 0.000005927 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F statistic is large, and the p value is < .001. So we conclude that the predictor Attention is necessary, and we should include it in the model.

Using the step() function

Or we can just do this:

NeuroCogRegressionFull <- lm(SocialCog ~ Dx + 
                            scale(Speed, scale = FALSE) + 
                            scale(Attention, scale = FALSE) +
                            scale(Memory, scale = FALSE) + 
                            scale(Verbal, scale = FALSE) + 
                            scale(Visual, scale = FALSE) + 
                            scale(ProbSolv, scale = FALSE) + 
                            scale(Age, scale = FALSE) + 
                            Sex, 
                          data = NeuroCog) 
# Start with the most complicated model you are willing to accept. Ours includes all possible predictors and co-variates.
NeuroCogRegressionStep <- step(NeuroCogRegressionFull) 
## Start:  AIC=1164.53
## SocialCog ~ Dx + scale(Speed, scale = FALSE) + scale(Attention, 
##     scale = FALSE) + scale(Memory, scale = FALSE) + scale(Verbal, 
##     scale = FALSE) + scale(Visual, scale = FALSE) + scale(ProbSolv, 
##     scale = FALSE) + scale(Age, scale = FALSE) + Sex
## 
##                                   Df Sum of Sq   RSS    AIC
## - scale(Age, scale = FALSE)        1      1.41 27179 1162.5
## - scale(Visual, scale = FALSE)     1      4.54 27182 1162.6
## - scale(Verbal, scale = FALSE)     1      7.12 27185 1162.6
## - scale(ProbSolv, scale = FALSE)   1    111.36 27289 1163.5
## <none>                                         27178 1164.5
## - scale(Speed, scale = FALSE)      1    263.11 27441 1164.9
## - Sex                              1    621.97 27800 1168.0
## - scale(Memory, scale = FALSE)     1    826.04 28004 1169.8
## - Dx                               2   1570.83 28749 1174.1
## - scale(Attention, scale = FALSE)  1   1360.07 28538 1174.3
## 
## Step:  AIC=1162.55
## SocialCog ~ Dx + scale(Speed, scale = FALSE) + scale(Attention, 
##     scale = FALSE) + scale(Memory, scale = FALSE) + scale(Verbal, 
##     scale = FALSE) + scale(Visual, scale = FALSE) + scale(ProbSolv, 
##     scale = FALSE) + Sex
## 
##                                   Df Sum of Sq   RSS    AIC
## - scale(Visual, scale = FALSE)     1      5.73 27185 1160.6
## - scale(Verbal, scale = FALSE)     1      7.27 27186 1160.6
## - scale(ProbSolv, scale = FALSE)   1    112.96 27292 1161.5
## <none>                                         27179 1162.5
## - scale(Speed, scale = FALSE)      1    268.89 27448 1162.9
## - Sex                              1    645.00 27824 1166.2
## - scale(Memory, scale = FALSE)     1    828.67 28008 1167.8
## - Dx                               2   1602.30 28782 1172.4
## - scale(Attention, scale = FALSE)  1   1419.62 28599 1172.9
## 
## Step:  AIC=1160.6
## SocialCog ~ Dx + scale(Speed, scale = FALSE) + scale(Attention, 
##     scale = FALSE) + scale(Memory, scale = FALSE) + scale(Verbal, 
##     scale = FALSE) + scale(ProbSolv, scale = FALSE) + Sex
## 
##                                   Df Sum of Sq   RSS    AIC
## - scale(Verbal, scale = FALSE)     1      4.44 27189 1158.6
## - scale(ProbSolv, scale = FALSE)   1    107.54 27292 1159.5
## <none>                                         27185 1160.6
## - scale(Speed, scale = FALSE)      1    278.22 27463 1161.1
## - Sex                              1    639.33 27824 1164.2
## - scale(Memory, scale = FALSE)     1    828.87 28014 1165.9
## - Dx                               2   1604.89 28790 1170.5
## - scale(Attention, scale = FALSE)  1   1415.80 28601 1170.9
## 
## Step:  AIC=1158.64
## SocialCog ~ Dx + scale(Speed, scale = FALSE) + scale(Attention, 
##     scale = FALSE) + scale(Memory, scale = FALSE) + scale(ProbSolv, 
##     scale = FALSE) + Sex
## 
##                                   Df Sum of Sq   RSS    AIC
## - scale(ProbSolv, scale = FALSE)   1    111.17 27300 1157.6
## <none>                                         27189 1158.6
## - scale(Speed, scale = FALSE)      1    273.91 27463 1159.1
## - Sex                              1    683.89 27873 1162.7
## - scale(Memory, scale = FALSE)     1    911.70 28101 1164.6
## - Dx                               2   1666.25 28856 1169.0
## - scale(Attention, scale = FALSE)  1   1529.52 28719 1169.9
## 
## Step:  AIC=1157.63
## SocialCog ~ Dx + scale(Speed, scale = FALSE) + scale(Attention, 
##     scale = FALSE) + scale(Memory, scale = FALSE) + Sex
## 
##                                   Df Sum of Sq   RSS    AIC
## - scale(Speed, scale = FALSE)      1    199.08 27500 1157.4
## <none>                                         27300 1157.6
## - Sex                              1    639.28 27940 1161.2
## - scale(Memory, scale = FALSE)     1   1195.08 28496 1166.0
## - scale(Attention, scale = FALSE)  1   1486.64 28787 1168.5
## - Dx                               2   1795.37 29096 1169.0
## 
## Step:  AIC=1157.38
## SocialCog ~ Dx + scale(Attention, scale = FALSE) + scale(Memory, 
##     scale = FALSE) + Sex
## 
##                                   Df Sum of Sq   RSS    AIC
## <none>                                         27500 1157.4
## - Sex                              1    639.47 28139 1161.0
## - scale(Memory, scale = FALSE)     1    998.35 28498 1164.0
## - scale(Attention, scale = FALSE)  1   1292.67 28792 1166.5
## - Dx                               2   1605.03 29105 1167.1
# This command will remove predictors and do model comparisons until it finds the simplest, best-fitting model
summary(NeuroCogRegressionStep) 
## 
## Call:
## lm(formula = SocialCog ~ Dx + scale(Attention, scale = FALSE) + 
##     scale(Memory, scale = FALSE) + Sex, data = NeuroCog)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -31.4269  -7.3055  -0.2892   7.4942  26.3815 
## 
## Coefficients:
##                                 Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                     47.90041    1.27267  37.638 < 0.0000000000000002 ***
## DxSchizophrenia                 -6.58970    1.78777  -3.686             0.000283 ***
## DxSchizoaffective               -1.49693    2.02007  -0.741             0.459413    
## scale(Attention, scale = FALSE)  0.23058    0.06923   3.331             0.001005 ** 
## scale(Memory, scale = FALSE)     0.22098    0.07550   2.927             0.003756 ** 
## SexMale                         -3.39423    1.44890  -2.343             0.019980 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.79 on 236 degrees of freedom
## Multiple R-squared:  0.2964, Adjusted R-squared:  0.2815 
## F-statistic: 19.88 on 5 and 236 DF,  p-value: < 0.00000000000000022

The step() function has tested all the predictors and removed those that don’t contribute to the model. The output shows us the final model, as well as the intermediate steps to get there.

Run ?step() to see different arguments this function can take.
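For instance, two arguments you may find useful are direction (whether step() only removes terms, only adds them, or both) and trace (set it to 0 to hide the intermediate steps). A quick sketch, using the full model from above (the object name is just for illustration):

NeuroCogRegressionQuiet <- step(NeuroCogRegressionFull, direction = "backward", trace = 0)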

Check for (and fix) collinearity

Collinearity is when two or more predictors in the model are strongly correlated with each other. When this happens, including them both can have strange effects on the regression parameters.

Let’s explore this with our flights data. We’ll examine 3 models. Each has a dependent variable of arrival delay. The first includes both flight distance and air time. These two variables are highly correlated (r = 0.9906496).
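If you’d like to check that correlation yourself, cor() will compute it; use = "complete.obs" drops the flights with missing air times, so your exact value may differ slightly:

library(nycflights13) # For the flights data
cor(flights$distance, flights$air_time, use = "complete.obs")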

library(nycflights13)
flights_regression_all <- lm(arr_delay ~ distance + air_time , data = flights) 
flights_regression_distance <- lm(arr_delay ~ distance, data = flights) 
flights_regression_airtime <- lm(arr_delay ~ air_time , data = flights) 

Here are the outputs of the 3 models:

##              Estimate Std. Error     t value Pr(>|t|)
## (Intercept) -1.455936   0.172887   -8.421332        0
## distance    -0.087655   0.000761 -115.145312        0
## air_time     0.665263   0.005980  111.256476        0
##              Estimate Std. Error   t value Pr(>|t|)
## (Intercept) 10.829198   0.135521  79.90781        0
## distance    -0.003752   0.000106 -35.46495        0
##              Estimate Std. Error   t value Pr(>|t|)
## (Intercept)  9.429260   0.147654  63.86058        0
## air_time    -0.016816   0.000832 -20.20746        0

First, let’s examine the effect of distance. The coefficient in the combined model is -0.088, but when we take air_time out, the coefficient changes to -0.0038 - that’s an order of magnitude smaller!

Things are even worse for air_time: the coefficient in the combined model is 0.67. When we take distance away, this changes - to -0.017! It gets a lot smaller AND changes direction from positive to negative.

This is the danger of collinearity - it can mess up your coefficients so much that you draw completely wrong conclusions. How embarrassed would you be if you tried to report that first model?!

Fortunately, we can check for collinearity, using the vif() function:

if (!require("car")) install.packages("car") 
library(car) # Add these two lines to your setup chunk!

vif(flights_regression_all)
## distance air_time 
## 53.72509 53.72509

The variance inflation factor for both predictors in this model is above 50; we want to see numbers < 5.

How do we fix collinearity?

  • Remove one of the variables
  • Center your variables

In the case of our flights example, the collinearity is so bad that removing either air_time or distance is the best solution.

In other cases, centering your variables will alleviate this problem. Let’s return to our miles per gallon regression. First, we’ll look at the vif without centering:

vif_uncentered <- lm(mpg ~ disp * hp, data = mtcars)
vif(vif_uncentered)
##     disp       hp  disp:hp 
## 13.60363 12.31598 37.82176

That’s >5, much higher than we want. This is because adding an interaction term increases the collinearity of the model; we’re telling the model to test how these two variables are related, so we shouldn’t be surprised that it does.

Let’s see what happens when we center our variables:

vif_centered <- lm(mpg ~ scale(disp, scale = FALSE) * scale(hp, scale = FALSE), data = mtcars)
vif(vif_centered)
##                          scale(disp, scale = FALSE)                            scale(hp, scale = FALSE) 
##                                            2.672299                                            2.738923 
## scale(disp, scale = FALSE):scale(hp, scale = FALSE) 
##                                            1.084565

And the collinearity is gone!

Collinearity that arises from strongly intercorrelated predictors is best dealt with by removing one of them. Collinearity that arises from the structure of our model (i.e. interactions) is best dealt with by centering variables.

Assess Model Fit

Now we should have our final model, the one we want to report. Before we’re ready to do that, we need to assess our model fit.

There are two measures that are commonly used to assess model fit:

  • Akaike Information Criterion (AIC)
  • R2

AIC

The AIC is a number that represents our prediction error - how much information our model did NOT account for in the data.

So, smaller AIC is better. It means that the model fits the data better.

To get AIC, use this function: AIC(modelname)

R2

R2 is the opposite of AIC; it measures how much variance our model DID account for.

So, bigger R2 is better. It means that the model fits the data better.

You can think of R2 as a proportion of the variance explained.

To get R2, just look in the model output.

Use adjusted R2; this accounts for the number of predictors in your model and is a better metric for comparing different models.
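For example, using the centered mtcars model we fit above (vif_centered), you could pull both numbers like this:

AIC(vif_centered)                   # Prediction error: smaller is better
summary(vif_centered)$adj.r.squared # Variance explained: bigger is better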

Describing Regression Results

Now that we have finalized our regression model, it’s time to report our results. We need to report information for the model as a whole, as well as for the individual predictors:

Model as a Whole:

  • What was the dependent variable?
  • What were the predictors?
  • What transformations (such as centering) were applied?
  • What interactions were included?
  • What predictors and/or interactions were discarded during model fitting?
    • How was model fitting accomplished?
  • How well does the model fit the data?
    • Report AIC and/or adjusted R2 - see below.

For each predictor and interaction, we need to report the following:

  • The regression coefficient
  • The standard error of the regression coefficient
  • The t value
  • The p value
  • A plain-language description of the interpretation.

Reporting the Diamond Regression

So, let’s look at our diamonds regression one more time:

## 
## Call:
## lm(formula = price ~ scale(carat, scale = FALSE), data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##                             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)                 3932.800      6.668   589.8 <0.0000000000000002 ***
## scale(carat, scale = FALSE) 7756.426     14.067   551.4 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 0.00000000000000022

First, let’s describe the model as a whole:

The regression model included price (in dollars) as the dependent variable and carat as the predictor. Prior to analysis, carat was centered. The adjusted R2 of the model was 0.85.

Next, let’s describe the result of the carat predictor:

The effect of carat was significant (Estimate = 7756, SE = 14.07, t = 551.4, p < .001), indicating that for each increase of 1 carat in diamond weight, the price increased by $7,756.

Reporting the Miles per Gallon Regression

Now let’s do one that’s a bit more complicated.

summary(vif_centered)
## 
## Call:
## lm(formula = mpg ~ scale(disp, scale = FALSE) * scale(hp, scale = FALSE), 
##     data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5153 -1.6315 -0.6346  0.9038  5.7030 
## 
## Coefficients:
##                                                        Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)                                         18.20207699  0.73949796  24.614 < 0.0000000000000002 ***
## scale(disp, scale = FALSE)                          -0.03082639  0.00637635  -4.834            0.0000436 ***
## scale(hp, scale = FALSE)                            -0.03097348  0.01166910  -2.654              0.01296 *  
## scale(disp, scale = FALSE):scale(hp, scale = FALSE)  0.00029005  0.00008694   3.336              0.00241 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.692 on 28 degrees of freedom
## Multiple R-squared:  0.8198, Adjusted R-squared:  0.8005 
## F-statistic: 42.48 on 3 and 28 DF,  p-value: 0.0000000001499

First, let’s describe the model as a whole:

The regression model included miles per gallon as the dependent variable. The predictors included the total displacement of the engine (in cubic inches) and the overall horsepower of the engine (in horses), as well as the interaction between them. Prior to analysis, both variables were centered. Likelihood ratio tests confirmed that both predictors and the interaction term contributed to model fit. The adjusted R2 of the model was 0.80, and the AIC was 159.91.

Next, let’s describe the result of each predictor:

The effect of displacement was significant (Estimate = -0.03, SE = 0.0064, t = -4.83, p < .001), indicating that for each increase of 1 cubic inch of displacement, the miles per gallon decreased slightly. Similarly, the effect of horsepower (Estimate = -0.03, SE = 0.012, t = -2.65, p = .013) indicated that as horsepower increased, miles per gallon decreased. The interaction of the two predictors was also significant (Estimate = 0.0003, SE = 0.00009, t = 3.34, p = .0024), indicating that as horsepower increased, the effect of displacement on miles per gallon got weaker, and vice versa.

This might be a good place to include a graph illustrating the interaction.

Predicting First-Year GPA

Now let’s do a regression, start to finish. Run the following code:

data(FirstYearGPA)
?FirstYearGPA
head(FirstYearGPA)
##    GPA HSGPA SATV SATM Male   HU   SS FirstGen White CollegeBound
## 1 3.06  3.83  680  770    1  3.0  9.0        1     1            1
## 2 4.15  4.00  740  720    0  9.0  3.0        0     1            1
## 3 3.41  3.70  640  570    0 16.0 13.0        0     0            1
## 4 3.21  3.51  740  700    0 22.0  0.0        0     1            1
## 5 3.48  3.83  610  610    0 30.5  1.5        0     1            1
## 6 2.95  3.25  600  570    0 18.0  3.0        0     1            1
  1. Set up a regression with:
    • Dependent Variable: GPA
    • Predictors: All the other variables! Don’t forget to center continuous predictors, and make categorical predictors into factors.
  2. Now use the step() function to fit the model. What should the final model be?
  3. Show how you would check for collinearity. Is this model OK, or is the collinearity too high?
  4. Using only the output of your final model, answer the following questions:
    • If your High School GPA were 1 point higher than mine, how much higher would we expect your first-year college GPA to be than mine?
    • What appears to be more important for predicting GPA - verbal or math skills?
    • What is the average GPA for a nonwhite student? For a white student?
  5. Give a description of a student who you would expect to have a high first year GPA, based on these results.
  6. Now write up an APA-style paragraph reporting the results of this regression. First describe the model as a whole:
    • Dependent variable
    • Predictor(s)
    • Any transformations to the variables
    • How was the model fitted? What predictors were kept?
    • How well did the model fit the data?
    For each predictor, include:
    • Estimate
    • SE
    • t value
    • p value
    • A description of the results in human words.

Uses of multiple regression

Why do a multiple regression? Do a multiple regression when you want to…

  1. perform multiple tests in a single model
  2. statistically control for co-variates
  3. look for interactions between predictors

Perform Multiple Tests

We’ve talked about Type I error and how doing more tests increases Type I error probability. If we can do all our ‘tests’ in a single model, it only counts as 1 test, so we avoid this problem. We saw this in the example above, when we evaluated the influence of many different factors on college GPA.

Statistically Control for Co-Variates

In the real world, different variables are often highly inter-correlated, co-occurring together willy-nilly in ways that are outside of our control. Perhaps you have heard of confounds: two predictors that are so overlapping that it is difficult to know which is influencing our dependent variable. For example:

  • Couples with more education are less likely to get divorced. Does education really prevent divorce, or is it the case that more educated people make more money, and it’s really the Income that protects against divorce (by reducing financial conflict in marriage)? In this case, Income might be a confound.
  • People with bigger shoes commit more violent crimes. But is it really the case that shoe size is predictive of violence, or is there perhaps another variable (gender, height, being a clown) that is truly predictive of violent behavior, and shoe size is simply correlated with that other variable? In this case, gender, height, and clownhood are all possible confounds, other variables that might be actually responsible for the apparent relationship between shoe size and violence.

If we can’t (or shouldn’t) do an experiment where we exert control by manipulating variables, we are faced with a challenge: how do we know that this variable is really influencing our dependent variable, and not some other confound?

Regression gives us a chance to test this. If we have a variable we think might be a confound, we can add it to the regression. Since each predictor is only significant if it adds new predictive power over and above all other predictors, then our original predictor will only be significant if it provides unique information about the dependent variable. If our original predictor only seemed like it was significant because it was correlated with the new predictor we added, it will no longer be statistically significant.

Each new predictor we add to our model in this way is called a co-variate, because we are checking to see if it varies with (co-varies) our original predictor of interest.

Here’s an example from the NeuroCog data set. We might hypothesize that people with good Verbal skills would have better overall social abilities. So we could regress SocialCog scores on Verbal (a continuous variable, which we have centered), like so:

multiple_regression <- lm(SocialCog ~ Dx + scale(Verbal, scale = FALSE), data = NeuroCog)
summary(multiple_regression)
## 
## Call:
## lm(formula = SocialCog ~ Dx + scale(Verbal, scale = FALSE), data = NeuroCog)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.4408  -6.9409  -0.8932   8.5092  26.2735 
## 
## Coefficients:
##                              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                  46.22542    0.99659  46.384 < 0.0000000000000002 ***
## DxSchizophrenia              -7.93779    1.93509  -4.102            0.0000563 ***
## DxSchizoaffective            -2.41438    2.17986  -1.108                0.269    
## scale(Verbal, scale = FALSE)  0.34286    0.08039   4.265            0.0000289 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.49 on 238 degrees of freedom
## Multiple R-squared:  0.1964, Adjusted R-squared:  0.1863 
## F-statistic: 19.39 on 3 and 238 DF,  p-value: 0.00000000002774

The Verbal effect is quite significant; as Verbal scores go up, so does Social Cognition. But wait! Since cognitive abilities are inter-related, maybe better Verbal scores are really just the result of other abilities, such as Attention. Verbal ability and Attention ARE correlated (r = .59).
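
If you want to check that correlation yourself, here is a quick sketch (assuming the NeuroCog data frame is loaded, as in the models above; the use argument just guards against any missing values):

# Correlation between the two predictors (should come out around .59)
cor(NeuroCog$Verbal, NeuroCog$Attention, use = "pairwise.complete.obs")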

To test this, let’s add Attention as a co-variate to the model:

multiple_regression_w_covariate <- lm(SocialCog ~ Dx + scale(Verbal, scale = FALSE) + scale(Attention, scale = FALSE), data = NeuroCog)
summary(multiple_regression_w_covariate)
## 
## Call:
## lm(formula = SocialCog ~ Dx + scale(Verbal, scale = FALSE) + 
##     scale(Attention, scale = FALSE), data = NeuroCog)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.8208  -7.3469  -0.2813   8.1393  29.0185 
## 
## Coefficients:
##                                 Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                     45.81316    0.96044  47.700 < 0.0000000000000002 ***
## DxSchizophrenia                 -6.84981    1.87166  -3.660             0.000311 ***
## DxSchizoaffective               -1.47423    2.10157  -0.701             0.483686    
## scale(Verbal, scale = FALSE)     0.12345    0.09052   1.364             0.173900    
## scale(Attention, scale = FALSE)  0.30617    0.06607   4.634           0.00000593 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.02 on 237 degrees of freedom
## Multiple R-squared:  0.2632, Adjusted R-squared:  0.2508 
## F-statistic: 21.16 on 4 and 237 DF,  p-value: 0.000000000000006158

Attention is significant, but Verbal isn’t anymore. This tells us that Verbal wasn’t independently predicting Social Cognition; what we thought was an effect of Verbal was really just an effect of Attention, since the two variables were correlated.

Let’s try another one! For this example, we’ll use a different data set from heplots: TIPI, a set of 1799 people who took a personality test and answered some demographic questions. We’ll use Agreeableness as our dependent variable, and married (Levels: Never Married, Currently Married, Previously Married) as our first predictor.

multiple_regression_w_covariate_2 <- lm(Agreeableness ~ married, 
                                  data = filter(TIPI, age < 100)) # One of the respondents put their age at 5555. I removed that.
summary(multiple_regression_w_covariate_2)
## 
## Call:
## lm(formula = Agreeableness ~ married, data = filter(TIPI, age < 
##     100))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8065 -0.7548 -0.1653  0.8347  2.8347 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                4.16534    0.03723 111.882 < 0.0000000000000002 ***
## marriedCurrently married   0.58944    0.08652   6.812      0.0000000000131 ***
## marriedPreviously married  0.64111    0.14827   4.324      0.0000161643021 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.384 on 1786 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.03221,    Adjusted R-squared:  0.03112 
## F-statistic: 29.72 on 2 and 1786 DF,  p-value: 0.0000000000002014

Look at that! It looks like people who are currently married or previously married are more agreeable than people who have never married. That makes sense, kinda - agreeable people are more likely to find marriage partners. But wait! Let’s add in age as a co-variate!

multiple_regression_w_covariate_2 <- lm(Agreeableness ~ married + scale(age, scale = FALSE), 
                                  data = filter(TIPI, age < 100)) # One of the respondents put their age at 5555. I removed that.
summary(multiple_regression_w_covariate_2)
## 
## Call:
## lm(formula = Agreeableness ~ married + scale(age, scale = FALSE), 
##     data = filter(TIPI, age < 100))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6949 -0.9136 -0.0520  0.9480  3.0484 
## 
## Coefficients:
##                           Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)               4.261699   0.039652 107.478 < 0.0000000000000002 ***
## marriedCurrently married  0.192474   0.104938   1.834               0.0668 .  
## marriedPreviously married 0.137727   0.165607   0.832               0.4057    
## scale(age, scale = FALSE) 0.020088   0.003077   6.529      0.0000000000858 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.368 on 1785 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.05478,    Adjusted R-squared:  0.05319 
## F-statistic: 34.48 on 3 and 1785 DF,  p-value: < 0.00000000000000022

Now the married variable is no longer significant. It turns out that people who are or were married are just OLDER than never-married people on average, and older people tend to be more agreeable. So it’s not the marital status that really predicts agreeableness, it’s the age.
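
If you want to see that confound for yourself, here is a quick sketch (assuming dplyr is loaded, since we’ve already been using filter()): the average age (and agreeableness) in each marital-status group.

# Average age and agreeableness by marital status
filter(TIPI, age < 100) %>%
  group_by(married) %>%
  summarise(mean_age = mean(age, na.rm = TRUE),
            mean_agreeableness = mean(Agreeableness, na.rm = TRUE))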

Multiple regression allows us to control statistically for variables we can’t control experimentally. Neat!

Archery Data

Now let’s do another to help us learn about co-variates. Run the following code:

data(ArcheryData)
?ArcheryData
ArcheryData
##    Attendance Average Sex Day1 LastDay Improvement Improve
## 1          21 229.619   m  195     226          31       1
## 2          20 188.900   m  140     251         111       1
## 3          20  91.650   f   51     142          91       1
## 4          22 188.227   m  133     255         122       1
## 5          21 162.048   m  151     182          31       1
## 6          21 169.952   m  201     253          52       1
## 7          21  95.381   m   42     184         142       1
## 8          22 116.500   f   62      85          23       1
## 9          19 188.368   m  174     191          17       1
## 10         22 122.727   f   56     168         112       1
## 11         22 123.818   f   86     173          87       1
## 12         20  97.500   f   86     131          45       1
## 13         22  95.500   f  109      99         -10       0
## 14         21 165.143   m  141     183          42       1
## 15         21 193.762   m  174     173          -1       0
## 16         20  98.400   f   51      49          -2       0
## 17         16 158.625   m  164     232          68       1
## 18         22  66.818   f   16      78          62       1
  1. Set up and run a regression with:
    • Dependent Variable: Archery score on the last day
    • Predictor 1: Number of days in class
    • Predictor 2: Archery score on the first day
  2. Interpret your results so far, answering the following questions:
    • How is Number of days in class related to scores on the last day? Is this relationship statistically significant?
    • How are scores on day 1 related to scores on the last day? Is this relationship statistically significant?
  3. Define co-variate in your own words. What are co-variates used for in regression?
  4. Re-run your model from item 1, this time adding Sex as a predictor.
    • How did your results change?
    • What does this tell you?
  5. Now use the step() function to fit the model. What should the final model be?
  6. Show how you would check for collinearity. Is this model OK, or is the collinearity too high?
  7. Now write up an APA-style paragraph reporting the results of this regression. First describe the model as a whole:
    • Dependent variable
    • Predictor(s)
    • Any transformations to the variables
    • How was the model fitted? What predictors were kept?
    • How well did the model fit the data?
    For each predictor, include:
    • Estimate
    • SE
    • t value
    • p value
    • A description of the results in human words.

Look for Interactions

Just like with ANOVAs, we can look for interactions in regression.

For this example, we’ll use the TIPI data set again. We’ll use Agreeableness as our dependent variable, and gender and age as our predictors, to see if women are more agreeable than men, and if people become more or less agreeable as they age (“Get off my lawn!”). We’ll also add in the interaction with the asterisk (*).

regression_with_interaction <- lm(Agreeableness ~ gender * scale(age, scale = FALSE), 
                                  data = filter(TIPI, age < 100)) # One of the respondents put their age at 5555. I removed that.
summary(regression_with_interaction)

We set up the formula in the same way as ANOVA. The interpretation of interactions in regression is a bit more complicated. Remember that ANOVA was an omnibus test - it just told us if something somewhere was significant, and we had to do follow-up tests to figure out what. In regression, we skip straight to the follow-up tests. This is both good and bad. It’s good because there are no extra steps. It’s bad because we have to learn to read and interpret the regression results directly.

Something to keep in mind:

  • The simple coefficients (the non-interaction terms in the model) are absolute slopes - they reflect the actual slope of the regression line when the other variable in the interaction is at its baseline (zero, which is the mean for a centered variable).
  • The interaction coefficients (the ones with a :) are relative slopes - they tell us how much a simple coefficient needs to be adjusted as the other variable changes.

Hopefully this will make more sense as we go through some examples.

Interactions of Two Continuous Variables

Let’s start with the interaction of two continuous variables. We’ll use the cars data for that:

continuous_interaction <- lm(mpg ~ scale(disp, scale = FALSE) * scale(hp, scale = FALSE), data = mtcars)
summary(continuous_interaction)
## 
## Call:
## lm(formula = mpg ~ scale(disp, scale = FALSE) * scale(hp, scale = FALSE), 
##     data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5153 -1.6315 -0.6346  0.9038  5.7030 
## 
## Coefficients:
##                                                        Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)                                         18.20207699  0.73949796  24.614 < 0.0000000000000002 ***
## scale(disp, scale = FALSE)                          -0.03082639  0.00637635  -4.834            0.0000436 ***
## scale(hp, scale = FALSE)                            -0.03097348  0.01166910  -2.654              0.01296 *  
## scale(disp, scale = FALSE):scale(hp, scale = FALSE)  0.00029005  0.00008694   3.336              0.00241 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.692 on 28 degrees of freedom
## Multiple R-squared:  0.8198, Adjusted R-squared:  0.8005 
## F-statistic: 42.48 on 3 and 28 DF,  p-value: 0.0000000001499

Notice how I centered these variables: What’s a 0 horsepower car, anyway?

Let’s walk through these results. Our dependent variable is Miles Per Gallon (mpg), and our two predictors are Displacement (disp; the total volume of all cylinders in the engine in cubic inches) and Horsepower (hp). Let’s discuss each finding:

  1. disp: The slope is -0.031. This means that for each cubic inch increase in displacement, the miles per gallon decreases by 0.031.
  2. hp: The slope is also -0.031. This means that for each unit increase in horsepower, miles per gallon decreases by 0.031. More horses = more gas consumed.
  3. disp:hp: This is the interaction term. It says that the slope of displacement is affected by horsepower: for each unit increase in horsepower, the slope of displacement becomes more positive (by 0.0003). Since the slope started out negative, it gets closer to 0. See the figure below for a visualization of this - the displacement line starts out steep for the lowest hp, but gets more shallow for higher hp values.

This interaction term also applies to the effect of hp: the slope of horsepower is affected by displacement, so that for each unit increase in displacement, the slope of horsepower becomes more positive. Overall, this interaction term indicates that as one value (displacement OR horsepower) gets big, the influence of the other becomes weaker.
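
To make the ‘relative slope’ idea concrete, here is a small sketch that pulls the coefficients out of the fitted model and computes the slope of displacement when centered horsepower is one standard deviation below the mean, at the mean, and one standard deviation above the mean:

b <- coef(continuous_interaction)
hp_sd <- sd(mtcars$hp)

# slope of (centered) displacement at low, average, and high horsepower
# (roughly -0.05, -0.03, and -0.01)
b["scale(disp, scale = FALSE)"] +
  b["scale(disp, scale = FALSE):scale(hp, scale = FALSE)"] * c(-hp_sd, 0, hp_sd)

The slope of displacement starts out steeply negative for low-horsepower cars and shrinks toward zero for high-horsepower cars - exactly the ‘line gets shallower’ pattern described above.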

Interactions: One Continuous, One Categorical

Looking for interactions between categorical and continuous variables is - perhaps - the most intuitive type of interaction in regression. What we’re asking is: Does the slope of our continuous variable differ for the different levels of our categorical variable?

Now let’s go back to our Personality data. By modeling the interaction, we’re asking: Is the effect of age on Agreeableness different for males and females?

categorical_continuous_interaction <- lm(Agreeableness ~ gender * scale(age, scale = FALSE), data = filter(TIPI, age < 100))
summary(categorical_continuous_interaction)
## 
## Call:
## lm(formula = Agreeableness ~ gender * scale(age, scale = FALSE), 
##     data = filter(TIPI, age < 100))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3303 -0.8795 -0.0146  0.8922  3.2363 
## 
## Coefficients:
##                                   Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                       4.061613   0.043707  92.927 < 0.0000000000000002 ***
## genderF                           0.535813   0.063782   8.401 < 0.0000000000000002 ***
## scale(age, scale = FALSE)         0.019297   0.003159   6.109        0.00000000123 ***
## genderF:scale(age, scale = FALSE) 0.014691   0.004641   3.165              0.00158 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.34 on 1794 degrees of freedom
## Multiple R-squared:  0.09443,    Adjusted R-squared:  0.09291 
## F-statistic: 62.36 on 3 and 1794 DF,  p-value: < 0.00000000000000022

Let’s walk through these results. Our dependent variable is Agreeableness, and our two predictors are Gender (Male vs Female) and age (13-75), which I have centered. Let’s discuss each finding:

  1. genderF: The slope is 0.54, and is significant. This means that females are 0.54 points more agreeable than males. Because the interaction term is included, this only applies when Age = the mean age (0 because of centering). So, this number says: for a person of average age, females are more agreeable than males.
  2. age: The slope is 0.019. This means that for each unit increase in age (each year), agreeableness scores increased by 0.019 points. Since we have included the interaction, this slope only applies in the Males (baseline).
  3. genderF:age: This is the interaction term. It says that the slope of age depended on gender: The slope for the females is more positive than the slope for the males. Specifically, it’s 0.015 more positive. This means that the slope for age was 0.019 for males, but 0.034 (0.019+0.015) for females.

See the figure below for a visualization of these results. Notice that:

  • The line for females is always above the males.
  • The male line is sloped upwards.
  • The female line is sloped upwards more steeply.
ggplot(data = filter(TIPI, age < 100), aes(x = age, y = Agreeableness)) +
  geom_jitter(aes(color = gender, shape = gender, fill = gender), size = 3) +
  geom_smooth(method = "lm", formula = y ~ x,
              aes(color = gender, lty = gender),
              fill = alpha("gray98", 1), se = TRUE, size = 3) +
  theme_bw() +
  theme(axis.text = element_text(angle = 90)) +
  labs(
    color = "Gender", shape = "Gender",
    lty = "Gender", fill = "Gender",
    x = "Age (Years)",
    y = "Agreeableness"
  ) +
  scale_color_manual(values = c(alpha("blue", 1), alpha("pink2", 1))) +
  scale_shape_manual(values = c(22, 21)) +
  scale_fill_manual(values = c(alpha("blue", 0.1), alpha("pink2", 0.1)))

Interaction of Two Categorical Variables

When we include the interaction of two categorical variables, we’re asking: is the effect of one variable different for the different levels of the other variable?

In the regression model below, we’re asking two questions:

  • Do the two genders differ in Neuroticism?
  • Does the effect of gender on Neuroticism depend on whether someone is married or not?

Also note that I’ve left age in as a co-variate, since I want to be sure that any effect of the married variable is NOT really due to age.

categorical_interaction <- lm(Neuroticism ~ gender * married + age, data = filter(TIPI, age < 100))
summary(categorical_interaction)
## 
## Call:
## lm(formula = Neuroticism ~ gender * married + age, data = filter(TIPI, 
##     age < 100))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.764 -1.363  0.067  1.283  4.292 
## 
## Coefficients:
##                                    Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)                        4.308433   0.115090  37.435 < 0.0000000000000002 ***
## genderF                            0.760614   0.091370   8.325 < 0.0000000000000002 ***
## marriedCurrently married          -0.005425   0.155992  -0.035              0.97226    
## marriedPreviously married          0.513589   0.284352   1.806              0.07106 .  
## age                               -0.023462   0.003824  -6.136        0.00000000104 ***
## genderF:marriedCurrently married  -0.299254   0.213284  -1.403              0.16077    
## genderF:marriedPreviously married -1.179285   0.364415  -3.236              0.00123 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.68 on 1782 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.09217,    Adjusted R-squared:  0.08911 
## F-statistic: 30.15 on 6 and 1782 DF,  p-value: < 0.00000000000000022

Let’s walk through these results. Our dependent variable is Neuroticism, and our two predictors are gender (Male, Female) and married (Never Married, Currently married, Previously married). Let’s discuss each finding:

  1. genderF: The slope is 0.76. This means that Neuroticism scores for females were 0.76 points higher than scores for males. Because the interaction term is included, this only applies to Never Married people (the baseline for married). So, this number says: single women are 0.76 points more neurotic, on average, than single men.
  2. marriedCurrently married: The non-significant slope is -0.005. This means currently married people are not more neurotic than never married people. Since we have included the interaction, this only applies for the males (baseline of gender). So, this number says: currently married males are NOT significantly more neurotic than single males.
  3. marriedPreviously married: Similarly, this effect is not significant. This number says: previously married males are NOT significantly more neurotic than single males.
  4. age: Our co-variate is significant and negative (-0.023), indicating that neuroticism goes down as age goes up. This number says: young people are more twitchy than old people.
  5. genderF:marriedCurrently married: This is the first interaction term. It is not significant, so this says that the effect of gender DOES NOT depend on whether the person is currently married. In other words, we already know that single females are more neurotic than single males, and this doesn’t change for currently married females and males.
  6. genderF:marriedPreviously married: This is the second interaction term. It IS statistically significant. This effect says that the difference between males and females is not the same for previously married people as it is for single people. Specifically, for previously married people, the effect of gender changed by -1.18 compared to single people. So, the effect of gender for single people, as we have already seen, was 0.76. The effect of gender for previously married people is -0.42 (0.76 + -1.18). In other words, while single females are more neurotic than single males, the opposite appears to be true for previously married people!

See the figure below for a visualization of these effects. Notice how the difference between bars is opposite in the ‘Previously Married’ condition. It looks like divorced males revert back to Never Married levels of neuroticism, while divorced females become less neurotic than any other group of females. Weird.
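
Here is one possible way to build a figure like that (a sketch, not necessarily how the original figure was made). It assumes dplyr and ggplot2 are loaded, as elsewhere in this document: compute the mean Neuroticism for each gender-by-marital-status group and draw grouped bars.

filter(TIPI, age < 100, !is.na(married)) %>%
  group_by(gender, married) %>%
  summarise(mean_neuroticism = mean(Neuroticism, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = married, y = mean_neuroticism, fill = gender)) +
  geom_col(position = "dodge") +
  labs(x = "Marital Status", y = "Mean Neuroticism", fill = "Gender") +
  theme_bw()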

Can You Spot A Cheating Heart?

Now let’s do a regression that involves interactions. Run the following code:

data(FaithfulFaces)
?FaithfulFaces
head(FaithfulFaces)
##   SexDimorph Attract Cheater Trust Faithful FaceSex RaterSex
## 1      4.900   2.000       1 4.118    6.294       F        M
## 2      2.583   2.083       1 4.000    6.588       F        M
## 3      4.300   2.667       1 3.941    5.824       F        M
## 4      4.083   2.917       0 4.118    5.412       F        M
## 5      4.500   3.500       0 3.235    4.000       F        M
## 6      2.000   1.833       0 4.647    6.647       F        M
  1. Set up a regression with:
    • Dependent Variable: Faithfulness, a rating of how faithful a person probably is
    • Predictor 1: Trust, a rating of trustworthiness
    • Predictor 2: Attractiveness
    • Predictor 3: Cheater - was this person unfaithful to a partner? Make this into a factor!
  2. Run your regression, and answer the following questions:
    • Do Trust ratings predict Faithfulness ratings? In what way? Describe the relationship of these variables.
    • Do Attractiveness ratings predict Faithfulness ratings? In what way? Describe the relationship of these variables.
    • Do people rate Cheaters as more or less Faithful than non-cheaters? Describe the relationship of these variables.
  3. Now add the following interactions involving FaceSex, an awkwardly named variable that represents the Sex of the individual in the photograph being rated. The goal of these interactions is to see if the three Predictors have a differential effect on Faithfulness ratings for Males vs. Females.
    • Interaction 1: Trust and FaceSex
    • Interaction 2: Attractiveness and FaceSex
    • Interaction 3: Cheater and FaceSex
  4. Use the step() function to remove unnecessary predictors and interactions.
  5. Show how you would check for collinearity. Is this model OK, or is the collinearity too high? If too high, how can you fix it?
    • Hint: Did you center your continuous variables?
  6. Run your final model and answer the following questions:
    • Do Trust ratings predict Faithfulness ratings differently for Males vs. Females? If so, how? Describe the interaction.
    • Do Attractiveness ratings predict Faithfulness ratings differently for Males vs. Females? If so, how? Describe the interaction.
    • Do people rate Cheaters as more or less Faithful than non-cheaters when the photo is a Male vs. a Female? Describe the interaction.
  7. Make a graph of one of the statistically significant interactions.
  8. Now write up an APA-style paragraph reporting the results of this regression. First describe the model as a whole:
    • Dependent variable
    • Predictor(s)
    • Interactions
    • Any transformations to the variables
    • How was the model fitted? What predictors were kept?
    • How well did the model fit the data?
    For each predictor AND interaction, include:
    • Estimate
    • SE
    • t value
    • p value
    • A description of the results in human words.
  9. Now show how you would relevel FaceSex so that M is the baseline instead of F.

When should we (and shouldn’t we) use regression?

When do we use Regression?

Between-subjects data

Regression works great on between-subjects data. Remember that between-subjects data means that we only have one observation per participant.

Continuous variables that are central to our research question

If you only have categorical variables, just do an ANOVA. But regression is good if you have:

  • continuous predictors we want to test
  • a mixture of continuous and categorical predictors we want to test
  • continuous predictors we want to test, plus categorical and/or continuous co-variates we want to control for statistically

When do we NOT use Regression?

You should NOT use regression when you have within-subjects data. Recall that in within-subjects data, each subject provides more than one data point, and our statistical model needs to account for the inter-relatedness of these data points: data points that come from the same source are not independent. Regression has no way to account for this, so it is a bad tool for within-subjects data. So, when you have within-subjects AKA repeated-measures AKA nested data, don’t use regression. Use mixed-effects modeling for this instead. What’s mixed-effects modeling? Stay tuned!

Also, it’s a bit weird to use a regression when you only have categorical predictors. You COULD do a regression - it’s not wrong - but an ANOVA is the more conventional tool when every predictor is categorical.

Real Data: Personality, Shmersonality

For this task, we will be using data from this paper:

Soto, C. J. (2021). Do links between personality and life outcomes generalize? Testing the robustness of trait–outcome associations across gender, age, ethnicity, and analytic approaches. Social Psychological and Personality Science, 12(1), 118-130.

https://osf.io/z3xhy

The Open Science Framework Page for this data can be found here:

https://osf.io/d3xb7/

What’s this data about?

Using the paper linked above as a reference, answer the following questions:

  1. In a couple of sentences, describe the goal(s) of this study.
  2. Where did this data come from?
  3. What are the key variables (probably)? Give a couple of examples.

Get the data and prep it

  1. Download the file Survey 1 - Merged - Working data file.sac to your project folder:

https://osf.io/crk92

Note. Download the file, NOT the metadata.

  2. Load the data into R, calling it ‘personalitydata’. As part of this code, use a pipe to do the following:
  3. All the columns start with “S1_”. This is annoying. Let’s rename all the columns to remove this.
    • Hint - you only need one command (and the gsub function) to make this happen. A little googling will help.
  4. Get rid of columns we don’t need.
    • We only need to keep the column Sex and the columns from #284 to the end.
  5. Currently, Sex is coded as -1 and 1, with -1 being Male. Let’s fix this so -1 is changed to ‘Male’ and 1 to ‘Female.’

Life Satisfaction

  1. Now run a regression to answer the following questions:
    • Which of the big 5 personality traits (PersonalityExtraversion, PersonalityAgreeableness, PersonalityConscientiousness, PersonalityNegativeEmotionality, PersonalityOpenMindedness) predict LifeSatisfaction?
    • Do men and women differ in how these traits relate to LifeSatisfaction?
  2. Write an APA-style paragraph reporting the results of this regression.

Romantic Satisfaction

  1. Now run a regression to answer the following questions:
    • Which of the big 5 personality traits (PersonalityExtraversion, PersonalityAgreeableness, PersonalityConscientiousness, PersonalityNegativeEmotionality, PersonalityOpenMindedness) predict RomanticSatisfaction?
    • Do men and women differ in how these traits relate to RomanticSatisfaction?
  2. If you found any significant interactions, pick one and graph it!
  3. Write an APA-style paragraph reporting the results of this regression.

Criminality

  1. Now run a regression to answer the following questions:
    • Which of the big 5 personality traits (PersonalityExtraversion, PersonalityAgreeableness, PersonalityConscientiousness, PersonalityNegativeEmotionality, PersonalityOpenMindedness) predict CriminalBehavior?
    • Do men and women differ in how these traits relate to CriminalBehavior?
  2. If you found any significant interactions, pick one and graph it!
  3. Write an APA-style paragraph reporting the results of this regression.

Religiosity

  1. Now run a regression to answer the following questions:
    • Which of the big 5 personality traits (PersonalityExtraversion, PersonalityAgreeableness, PersonalityConscientiousness, PersonalityNegativeEmotionality, PersonalityOpenMindedness) predict Religiousness?
    • Do men and women differ in how these traits relate to Religiousness?
  2. If you found any significant interactions, pick one and graph it!
  3. What is releveling? What kind of variable can you relevel? Why might you want to?
  4. Now relevel Sex and run the regression again!
    • Note: relevel() only works if the variable is already a factor. You might have to do this first! (See the sketch after this list.)
  5. Write an APA-style paragraph reporting the results of this regression. To be consistent with the other paragraphs, report the original model (before releveling).
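
If the note about factors above is unclear, here is a generic sketch. The data frame and column names come from the prep steps earlier (personalitydata, Sex); the choice of reference level is just an example - pick whichever level you want as the baseline.

# relevel() needs a factor, so convert first if Sex is not one already
personalitydata$Sex <- factor(personalitydata$Sex)

# then pick the reference (baseline) level - "Male" here is just an example
personalitydata$Sex <- relevel(personalitydata$Sex, ref = "Male")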

Your Choice! (Bonus)

  1. Now run a regression to answer the following questions:
    • Which of the big 5 personality traits (PersonalityExtraversion, PersonalityAgreeableness, PersonalityConscientiousness, PersonalityNegativeEmotionality, PersonalityOpenMindedness) predict some variable of your choice!
    • Do men and women differ in how these traits relate to some variable of your choice?
  2. Write an APA-style paragraph reporting the results of this regression.

Real Data: Marriage and Other Disasters (Bonus)

  1. Test the following research questions:

    • How does perceived social support influence relationship satisfaction?
    • How does perceived stress influence relationship satisfaction?
    • Do the effects of these predictors differ for husbands and wives?

    The codebook for this study is at https://osf.io/auxzt

    To do this, do the following:

    • Read in ‘marriagedata.csv’ from Checklist 4.
    • Filter the data to include ONLY TimePoint 3.
    • Set up your regression, following the steps above.
  2. Write a brief APA-style paragraph describing the results.

Real Data: Honestly Hot! (Bonus)

  1. Read in the data. If you didn’t do the bonus for Checklist 7, see the instructions there. Otherwise, read in the file “honestlyhotdata.csv”
  2. Do a regression. Include attractiveness as your dependent variable. Include PGender as a categorical predictor. Include all 7 personality variables. Add the interaction of PGender with each personality variable. Run the regression.
  3. Interpret the interaction and write an APA-style paragraph describing the results of your analysis.
  4. Make a graph illustrating the interaction.

Functions! (Bonus)

  1. Using scale() to center variables is kind of cumbersome. Make a function called ctr() that takes a variable and centers it.

Go To Home Page