
Intro to ANOVAs

We learned previously that there are 4 things we need to know about any statistical test:

  • What variables can the test handle?
  • What statistic does the test generate?
  • What distribution does the test use?
  • What arguments does the R function require?

Let’s talk about these 4 needs in relation to our next test: ANOVAs.

  1. What does ANOVA stand for? Any guesses?

Don’t peek!

ANOVA stands for ANalysis Of VAriance.

Variables

Dependent Variables

Like the t-test, an ANOVA usually takes a single numeric dependent variable.

Predictor Variables

Predictor/independent variables in ANOVA are called FACTORS.

In R, these variables must be factors (do you see the connection?).

In other words, ANOVA can only take categorical predictor variables. No numbers allowed.
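
For example, if your grouping variable was read in as text or as numbers, you can convert it to a factor before running the ANOVA. Here is a minimal sketch (mydata and condition are hypothetical placeholder names):

mydata$condition <- as.factor(mydata$condition)  # convert a character/numeric column into a factor
levels(mydata$condition)                         # check which levels R found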

How Many Factors?

An ANOVA can have several different Factors. But try not to go overboard. ANOVA can handle it - but can your brain?

How Many Levels Per Factor?

This is a good question, since we know that the t-test can only accept one 2-level ‘factor’.

An ANOVA’s factors can have 2 or more levels. There’s really no computational limit. But there are practical limits - again, don’t add too many levels, or you will drive yourself insane.

ANOVA Variants

The rules for ANOVA variables described above don’t always apply; there are variants of the classic ANOVA.

MANOVA

MANOVA stands for Multivariate analysis of variance. MANOVA is a variant of ANOVA that can take multiple dependent variables.

  • If you want to know more about this, explore the manova() function in R
  • Once you know how to do a regular ANOVA, MANOVA isn’t too hard

ANCOVA

ANCOVA stands for analysis of covariance. ANCOVA can take factors as predictors, as well as continuous (numeric) predictors. These continuous predictors are called covariates.

  • A covariate is a variable that you want to control for, statistically
  • You aren’t usually directly interested in analyzing a covariate BUT
  • You want to see if the effect of your other variables is still significant when this covariate is accounted for
  • In other words, covariates are usually potential confounds.
  • We’ll talk more about covariates when we talk about regression

ANCOVAs are pretty easy in R – you don’t even need a special function.
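
For instance, here is a minimal sketch of what an ANCOVA call can look like with aov(). The names are placeholders: score is a numeric dependent variable, group is a factor, and age is a numeric covariate in a hypothetical data frame my_data.

# The covariate is simply added to the formula alongside the factor
My_ANCOVA <- aov(score ~ group + age, data = my_data)
summary(My_ANCOVA)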

  1. An ANOVA needs a dependent variable. What sort of variable (number, factor, string) should this dependent variable be?
  2. How many dependent variables can a regular ANOVA take? If you want more dependent variables, what sort of ANOVA-like test should you perform?
  3. An ANOVA needs predictor/independent variable(s). What sort of variable (number, factor, string) should these be? Give an example of a variable that would work and a variable that would NOT work.
  4. What is the specific term for a predictor/independent variable in an ANOVA?
  5. If you want to include a different kind of variable in an ANOVA, what sort of ANOVA-like test should you perform?
  6. How many variables can an ANOVA take as predictor/independent variable(s)?
  7. What are the levels of the following variables?
  • Sex: Male, Female
  • Airports: JFK, LaGuardia, Newark
  • Car: Mazda, Cadillac, Dodge, Toyota
  8. How many levels can a given predictor/independent variable have in ANOVA?
  9. Describe a hypothetical study where an ANOVA would be the appropriate test to use and a t-test would not be.

The F Statistic

ANOVA uses the F statistic.

Computing the F statistic takes 4 steps:

  1. Compute Sum of Squares
  • SSbetween = Sum of Squares Between
  • SSwithin = Sum of Squares Within
  2. Figure out degrees of freedom
  • dfbetween = degrees of freedom between
  • dfwithin = degrees of freedom within
  3. Compute Mean Squares
  • MSbetween = SSb / dfb
  • MSwithin = SSw / dfw
  4. Compute the F statistic
  • F = MSb / MSw

Step 1: Compute Sum of Squares

Total Sum of Squares

Computing total sum of squares (SStot) isn’t necessary for ANOVA, but you might find this conceptually helpful. Keep in mind that:

SStot = SSbetween + SSwithin

if (!require("nycflights13")) install.packages("nycflights13")
library(nycflights13) # provides the 'flights' data set used in the examples below

In the plot above, the black vertical line represents the Grand Mean, the average of all data points.

The gray dots represent the individual data points. The blue line is a violin plot showing the distribution of data.

The thin horizontal black lines represent the distance of some of the data points (gray dots) from the grand mean (vertical black line). To compute the total sum of squares (SStot), we square all these distances, then add them all up. We have summed the squares of the distances of each data point from the grand mean. See where the name sum of squares comes from?

SSbetween

To compute SSbetween, we do this same thing, except instead of using the individual data points, we use the mean values for the different groups.

Here’s a violin plot of the flights data, but now we’ve divided it into three groups, based on the three NYC airports.

To compute the SSbetween, we subtract each group’s mean from the grand mean (the black horizontal lines in the zoomed-in plot below). Then we square these differences and add them all up (we sum the squares).

The SSbetween acts as an index of how separated the different groups are from each other:

  • If SSbetween is small, the groups are all packed pretty tightly together (and closer to the grand mean).
  • If SSbetween is big, the groups are spaced far apart from each other (and far from the grand mean).

In other words, SSbetween is a number that expresses the variance between groups.

SSwithin

And now on to SSwithin. For this one, we are back to looking at the individual data points. BUT instead of finding the distance from each point to the grand mean, like we did for SStotal, we’ll be finding the distance from each data point to the mean of its group (note how the black horizontal lines in the plot below go from a data point to the group mean, NOT to the grand mean). Then we square these differences and add them all up (we sum the squares).

The SSwithin acts as an index of how spread out the data points are inside (within) the groups:

  • If SSwithin is small, the data points are all packed pretty tightly together, so the group’s distribution is narrow.
  • If SSwithin is big, the data points are spaced far apart, so the group’s distribution is broad.

In other words, SSwithin is a number that expresses the variance within groups.
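
If it helps to see all of this as arithmetic, here is a minimal sketch with made-up numbers that computes the three sums of squares by hand and checks that they add up:

# Made-up data: 3 groups with 4 observations each
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 4),
  score = c(5, 4, 7, 6,  7, 6, 6, 7,  3, 3, 3, 5)
)

grand_mean  <- mean(toy$score)             # the grand mean
group_means <- ave(toy$score, toy$group)   # each data point's own group mean

SS_total   <- sum((toy$score - grand_mean)^2)    # data points vs. grand mean
SS_between <- sum((group_means - grand_mean)^2)  # group means vs. grand mean (counted once per data point)
SS_within  <- sum((toy$score - group_means)^2)   # data points vs. their own group mean

SS_total
SS_between + SS_within   # should equal SS_total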

  1. What is a ‘Sum of Squares’? What specifically is summed and then squared?

Below is some made-up data. Use it to answer the next two questions.

Observation Group A Group B Group C
1 5 7 3
2 4 6 3
3 7 6 3
4 6 7 5
5 5 7 4
6 5 5 1
7 4 8 5
8 5 3 2
9 6 6 4
10 5 8 4
Group Means: 5.2 6.3 3.4
Grand Mean: 4.9666667
  1. How would you calculate the SSBetween? What numbers would you use (individual observations, group means, grand mean)?
  2. How would you calculate the SSwithin? What numbers would you use (individual observations, group means, grand mean)?

  1. Make and complete the following table, placing the graph labels (A, B, C, or D) in the appropriate cell.
Bigger SSBetween Smaller SSBetween
Bigger SSWithin
Smaller SSWithin

Step 2: Figure out degrees of freedom

Now that we’ve computed our SSbetween and SSwithin, we need our degrees of freedom for these statistics.

  • DFbetween = the number of groups - 1
  • DFwithin = the number of observations - the number of groups.
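
For example, in the flights data we will analyze below there are 3 airports (groups) and 328,521 flights with usable data, so DFbetween = 3 - 1 = 2 and DFwithin = 328,521 - 3 = 328,518. Those are exactly the two Df values you will see in the ANOVA output later on this page.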

Simple enough. Now you have 2 choices:

  1. Accept on faith that these numbers are what they are and go on with your life.
  2. Read my long-winded explanation of degrees of freedom below.

How do Degrees of Freedom work, anyway?

I’m glad you asked. Let’s start with a concrete example.

Parable of the Pumpkins

Degrees of Freedom

Elsa goes to a pumpkin patch to buy three pumpkins. They cost $1 per pound. She spends $10.

How much did the first pumpkin cost?

Not sure? That’s correct. All we know is that it must be less than $10 - at most $8, if each pumpkin costs at least $1.

But it could be 2, or 4, or 6. We just don’t know.

How much did the second pumpkin cost?

Still not sure? Again, that’s correct. How could you be?

What if I told you that the first pumpkin cost $5? Does that help?

Maybe some - we know that the second pumpkin can’t cost more than $4. But is it $1, or $2, or $3, or $4? It could be any of those numbers.

How much does the third and final pumpkin cost?

Not sure?

What if I told you that the first pumpkin cost $5. Also, the second cost $3.

Now it’s pretty obvious, right? The last pumpkin has to cost $2.

Why?

Because we know the total is $10.

Degrees of Freedom.

Degrees of freedom represent the number of independent data points that contribute to some statistic.

According to wikipedia, a statistic is “any quantity computed from values in a sample”. Our total pumpkin price of $10 is exactly that, a quantity (10) computed from a sample (the prices of 3 pumpkins).

Any given statistic has degrees of freedom: “the number of values in the final calculation of a statistic that are free to vary” (wikipedia again). This number is usually equal to the number of observations in the sample MINUS 1.

Why minus 1? Let’s consider our pumpkins again:

  • Was the first pumpkin’s cost “free to vary”? Sure, it could have been anywhere from $1 to $8, and the total would have still added up to $10. So that’s 1 degree of freedom.
  • Was the second pumpkin’s cost “free to vary”? Sure! Even after we knew that the first pumpkin cost $5, the second still had room to vary: it could have been anywhere from $1 to $4, and the total could have still added up to $10. So that’s a second degree of freedom.
  • Was the third pumpkin’s cost “free to vary”? NO! Once we knew the prices of the other two pumpkins ($5 and $3), there was no more wiggle room; in order to get to $10, the last pumpkin HAD TO cost $2. So no degrees of freedom here.

In sum, for a given statistic, the final value in the sample that gave us that statistic is NOT FREE TO VARY, so it doesn’t add to the degrees of freedom.

DFbetween

Let’s return to ANOVA and talk about degrees of freedom.

The Sum of Squares (Between) is a statistic - a quantity computed from a sample. How many degrees of freedom does this statistic have, and why?

We have 3 groups. And we’ve been told that the degrees of freedom (between) is equal to the number of groups - 1. But why? Take a moment to try and explain it to yourself before going on.

The sum of squares (between) AKA SSbetween is made up of 3 numbers:

  1. The mean of the LGA group minus the Grand Mean, squared
  2. The mean of the JFK group minus the Grand Mean, squared
  3. The mean of the EWR group minus the Grand Mean, squared

If we know that the SSbetween is 11.63, can we figure out the LGA mean? No. It is still free to vary.

If we know the SSbetween is 11.63 AND we know that the LGASS (LGA mean minus the grand mean, squared) is 5.25, can we figure out the JFK mean? No. We have a reasonable idea of the upper limit of that mean, but we can’t identify precisely what it is. It is still free to vary.

What if we know three things: that the SSbetween is 11.63, that the LGASS is 5.25, AND that the JFKSS is 0.28? Can we figure out the EWR mean?

YES! It has to be about 15.11. Because SSbetween (11.63) = LGASS (5.25) + JFKSS (0.28) + EWRSS, EWRSS must be 6.1. A simple square root plus the grand mean gives 15.11. So, given the other numbers, this number is not free to vary - the last number in the sample is fixed.

But what about DFwithin?

We also learned that DFwithin is equal to the number of observations minus the number of groups.

Why isn’t it just N - 1, as it was for DFbetween and for the t-test? Take a moment and try to figure this out for yourself.

No seriously, stop reading and think.


OK, now let’s see if your explanation meshes with mine:

So far, we’ve learned that the last observation in a sample doesn’t count towards degrees of freedom.

We also know that to compute the SSwithin, we subtract each data point from the mean for its group. For example, if a flight departed from JFK, we use the JFK mean.

In other words, we have separated the data into three groups. So, we are treating the data as if it were 3 different samples. And the last observation in each sample doesn’t contribute to degrees of freedom. Since there are 3 groups, that’s three observations that don’t count.

This means that the degrees of freedom for SSwithin is reduced by 1 for each group, so dfwithin = the number of observations - the number of groups.

Observation Group A Group B Group C
1 5 7 3
2 4 6 3
3 7 6 3
4 6 7 5
5 5 7 4
6 5 5 1
7 4 8 5
8 5 3 2
9 6 6 4
10 5 8 4
Group Means: 5.2 6.3 3.4
Grand Mean: 4.9666667
  1. What is the DFBetween for the data in the table above?
  2. What is the DFWithin for the data in the table above?

Now suppose I have a data set of Instagram Usage (in minutes) for four age groups: 45 High Schoolers, 62 Undergraduates, 82 Young Professionals, and 31 Middle-aged Parents, and I want to see if these groups differ in Instagram Usage.

  1. What is the DFBetween for the data in the example above?
  2. What is the DFWithin for the data in the example above?

Step 3. Compute Mean Squares

Now to compute the mean squares. This part is just math:

  • MSbetween = SSb / dfb
  • MSwithin = SSw / dfw

Remember that the mean is the sum of a set of data points divided by the total number of data points. Similarly, the mean square is the sum of squares divided by the degrees of freedom (the number of observations that were free to vary).
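
For example, using the numbers from the flights ANOVA output shown later on this page: MSbetween = 1,280,515 / 2 ≈ 640,258, and MSwithin = 529,886,717 / 328,518 ≈ 1,613.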

Some important notes:

  • If you have more than one factor in your ANOVA, each will have its own mean square. You might see these referred to by their factor names, e.g. ‘mean square airport’ or MSairport.
  • MSwithin is also often referred to as ‘mean square error’ (MSe) or ‘mean square residual’ (MSr).
  1. Complete the following sentence: To compute MSBetween, divide ______ by ______.

Imagine you are reading a journal article about the Instagram study described above. Answer the following questions:

  1. You see the term MSAgeGroup. What Mean Square is this referring to? MSBetween or MSWithin?
  2. You see the term ‘mean square error’ (MSe) or ‘mean square residual’ (MSr). What Mean Square is this referring to? MSBetween or MSWithin?

Step 4. Compute the F statistic

Now, finally, we can compute the F statistic. If we have more than one factor, we do this separately for each one.

F = MSb / MSw

F Statistic

The F statistic will be bigger if MSbetween (numerator of the fraction) is bigger. In other words, F is big if the different groups are farther apart.

The F statistic will be smaller if MSwithin (denominator of the fraction) is bigger. In other words, F is big when the groups are more compact and don’t overlap as much.

  1. Complete the following sentence: To compute the F Statistic, divide ______ by ______.
  2. Place the following things in the appropriate slots in the table below:
    • Bigger difference between the groups (Bigger MSBetween)
    • Larger standard deviation within each group (Bigger MSWithin)
    • Bigger N (sample size; Smaller MSWithin)
Make F statistic bigger Make F statistic smaller
?
?

The F Distribution

Of course, now that we have our F statistic, we need a distribution to compare it to. Unsurprisingly, we’ll use the F distribution. The shape of the F distribution depends on both the dfbetween and the dfwithin. It looks like this:

Copy the code chunk below into your R markdown document. Run it. Set the Alpha Level to 0.05, then play around with the Degrees of Freedom Between and Degrees of Freedom Within. Then answer the questions below.

if (!require("shiny")) install.packages("shiny")
library(shiny)

shinyApp(
  ui = fluidPage(
    fluidRow(
      column(4, wellPanel(
        sliderInput("df1", label = h3("Degrees of Freedom (Between)"), min = 1,
            max = 10, value = 1)
      )),
      column(4, wellPanel(
        sliderInput("df2", label = h3("Degrees of Freedom (Within)"), min = 1,
            max = 1000, value = 1)
      )),
      column(4, wellPanel(
        sliderInput("alpha", label = h3("Alpha Level"), min = 0,
            max = 0.5, value = 0)
    ))),
      fluidRow(
      column(12, offset = 0,
        plotOutput("plot")
      )
    )),
  server = function(input, output) {
      output$plot = renderPlot({
        ggplot(data.frame(x = c(-1, 8)), aes(x=x)) + theme_bw() +
        stat_function(fun = df, geom = "area", fill = "red1", 
                  xlim = c(qf(1 - input$alpha, df1 = input$df1, df2 = input$df2), 8), 
                  args = list(df1 = input$df1, df2 = input$df2), 
                  color = "red", size = 2) +
        stat_function(fun = df, args = list(df1 = input$df1, df2 = input$df2), color = "black", size = 2) + 
        scale_x_continuous(breaks = -1:8, labels = -1:8) + 
        labs(
          x = expression(italic("F")),
          y = expression(paste("P(", italic("F"), ")")),
          title = expression(paste("The ", italic("F"), " Distribution"))
        ) + coord_cartesian(xlim = c(-1, 8)) + theme(text = element_text(size = 30))
      }) 
},
options = list(height = 900)
)
  1. What happens to the shape of the F distribution as you increase DFBetween?
  2. What happens to the red area (alpha level) as you increase the DFBetween?
  3. What happens to the shape of the F distribution as you increase DFWithin?
  4. What happens to the red area (alpha level) as you increase the DFWithin?
  5. In general, does increasing the degrees of freedom make it easier or harder to get a statistically significant result? Why?
  6. If you are in charge of collecting data for a study, what can you do to increase DFBetween? What about DFWithin? Which change do you think is more impactful in helping you find a significant result?
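
By the way, if you want the same information as numbers rather than a picture, base R’s qf() and pf() functions can give it to you directly. Here is a quick sketch using the degrees of freedom from the flights ANOVA you’ll see below:

qf(0.95, df1 = 2, df2 = 328518)                        # critical F value for alpha = .05
pf(396.9, df1 = 2, df2 = 328518, lower.tail = FALSE)   # p value for an observed F of 396.9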

Arguments

Now that we know how ANOVA works, let’s learn about how to implement them in R. We’ll start simple, using the example from the flights data.

Flights_ANOVA <- aov(dep_delay ~ origin, data = flights)

Note the following:

  • We are using the aov() function to run the ANOVA. There are other functions we could use, but this works fine for a design with a single variable.
  • We have to specify a formula, just like we did with the t-test.
  • We have to specify what data we are using, just like we did with the t-test.

Here is what the output looks like. Note we use the summary() function here; it makes the results more readable.

summary(Flights_ANOVA)
##                 Df    Sum Sq Mean Sq F value              Pr(>F)    
## origin           2   1280515  640258   396.9 <0.0000000000000002 ***
## Residuals   328518 529886717    1613                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 8255 observations deleted due to missingness

Notice:

  • There are 5 columns:
    • Df = degrees of freedom
    • Sum Sq = Sum of Squares
    • Mean Sq = Mean Square
    • F value = the F value. I feel like this one was obvious.
    • Pr(>F) = the p value. Remember that <0.0000000000000002 (that’s 2e-16 in scientific notation) means “a really tiny number”.
  • And 2 rows:
    • origin. This row reports dfbetween, SSbetween, MSbetween, and the F statistic and p values for our origin factor.
    • Residuals. This row reports dfwithin, SSwithin, and MSwithin.
      • These numbers were used to compute the F statistic. Check for yourself: 640258 / 1613 ≈ 396.9

Was the result statistically significant?

ANOVA is an omnibus test.

Notice that even though we have 3 levels in our variable, ANOVA only reported 1 p-value. What does this mean?

ANOVAs don’t compare every level to every other level. Instead, the ANOVA looks for evidence that at least one of the levels is different from at least one of the other levels. It indicates to you that “one of these things is not like the others”. BUT it doesn’t tell you which.

Big Bird is an ANOVA

Following up on ANOVA results

If you get a statistically significant result in ANOVA, do the following:

  1. Graph your data to see which level of the variable appears to be different.
  2. If necessary, do some t-tests to compare pairs of levels.

Graph the data and look at the graph

Let’s start with a graph. Armed with the knowledge that there are differences between at least two of the groups, we can interpret this graph with confidence.

It looks like all the groups are different from each other.

Follow-up t-tests.

We can confirm our observations using the t_test() function from the rstatix package:

if (!require("rstatix")) install.packages("rstatix")
library(rstatix)

t.test.results <- flights %>%
  t_test(dep_delay ~ origin) %>%
  adjust_pvalue(method = "none") %>%
  add_significance()

knitr::kable(t.test.results)
.y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
dep_delay EWR JFK 117596 109416 17.76196 226958.1 0 0 ****
dep_delay EWR LGA 117596 101509 27.36163 216266.2 0 0 ****
dep_delay JFK LGA 109416 101509 10.24619 208866.4 0 0 ****

We’ll use this one later, when we talk about interactions.

Describing ANOVA results

Now that we understand our ANOVA results, we want to communicate them to others. At a minimum, we need to include the following:

  • The F statistic, with both degrees of freedom.
  • the Mean Square Residual, usually called the Mean Square Error or MSe
  • The p value.
  • A description of any follow-up tests.
  • A plain-language description of the interpretation.

You might be asked to include other information, such as

  • partial eta squared, a measure of effect size.

So, our paragraph might look something like this:

There was a significant main effect of Airport (F(2, 328518) = 396.9, MSe = 1613, p < .001, partial η² = 0.0024). Post-hoc comparisons revealed that EWR has the longest delays, JFK the second longest, and LGA the shortest.
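
If you are asked to report partial eta squared, one option (a sketch, not the only way) is the effectsize package, which is not otherwise used in this lesson:

if (!require("effectsize")) install.packages("effectsize")
library(effectsize)
eta_squared(Flights_ANOVA, partial = TRUE)  # partial eta squared for each term in the model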

In order to do the ANOVAs that follow, you will need to install and load the datarium and Stat2Data packages. Add this code to your setup chunk.

if (!require("datarium")) install.packages("datarium")
library(datarium)
if (!require("Stat2Data")) install.packages("Stat2Data")
library(Stat2Data)

Doing a One-Way ANOVA

A One-Way ANOVA has a single predictor/independent variable. We’ll start with a data set about Alzheimer’s. Run the following code:

data(Amyloid)
?Amyloid
Amyloid
##    Group Abeta
## 1    NCI   114
## 2    NCI    41
## 3    NCI   276
## 4    NCI     0
## 5    NCI    16
## 6    NCI   228
## 7    NCI   927
## 8    NCI     0
## 9    NCI   211
## 10   NCI   829
## 11   NCI  1561
## 12   NCI     0
## 13   NCI   276
## 14   NCI   959
## 15   NCI    16
## 16   NCI    24
## 17   NCI   325
## 18   NCI    49
## 19   NCI   537
## 20   MCI    73
## 21   MCI    33
## 22   MCI    16
## 23   MCI     8
## 24   MCI   276
## 25   MCI   537
## 26   MCI     0
## 27   MCI   569
## 28   MCI   772
## 29   MCI     0
## 30   MCI   260
## 31   MCI   423
## 32   MCI   780
## 33   MCI  1610
## 34   MCI     0
## 35   MCI   309
## 36   MCI   512
## 37   MCI   797
## 38   MCI    24
## 39   MCI    57
## 40   MCI   106
## 41   mAD   407
## 42   mAD   390
## 43   mAD  1154
## 44   mAD   138
## 45   mAD   634
## 46   mAD   919
## 47   mAD  1415
## 48   mAD   390
## 49   mAD  1024
## 50   mAD  1154
## 51   mAD   195
## 52   mAD   715
## 53   mAD  1496
## 54   mAD   407
## 55   mAD  1171
## 56   mAD   439
## 57   mAD   894

Now examine the data and read the help page in the ‘help’ tab. Then answer the following questions:

  1. What is the Dependent variable?
  2. What is the Predictor variable? How many levels does it have, and what are they?
  3. Is this data within- or between-subjects?
  4. Now set up an ANOVA to test whether the levels of the Predictor variable are different.
    • Don’t forget to use the summary() function to show your results.
    • Your output should look like this:
##             Df  Sum Sq Mean Sq F value  Pr(>F)   
## Group        2 2129969 1064985   5.971 0.00454 **
## Residuals   54 9632060  178371                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpreting ANOVA Output

  1. What is the DFBetween and DFWithin for this data?
  2. Which Mean Squares is MSBetween and which is MSWithin?
  3. Which two numbers were divided to get the F value of 5.971?
  4. Are the levels of the Predictor statistically different? How do you know?

Following Up

  1. Explain what the following statement means in your own words: “ANOVA is an omnibus test”.
  2. Show how you can get summary statistics for the levels of the Predictor, like this:
## # A tibble: 3 × 3
##   Group mean_Amyloid    sd
##   <fct>        <dbl> <dbl>
## 1 mAD           761.  427.
## 2 MCI           341.  406.
## 3 NCI           336.  436.
  1. Now copy the following code into your document. Use comments to explain what each line of code is doing:
Amyloid %>% t_test(Abeta ~ Group) %>%
  adjust_pvalue(method = "none") %>%
  add_significance()
## # A tibble: 3 × 10
##   .y.   group1 group2    n1    n2 statistic    df     p p.adj p.adj.signif
##   <chr> <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl> <dbl> <chr>       
## 1 Abeta mAD    MCI       17    21    3.08    33.6 0.004 0.004 **          
## 2 Abeta mAD    NCI       17    19    2.95    33.7 0.006 0.006 **          
## 3 Abeta MCI    NCI       21    19    0.0358  36.9 0.972 0.972 ns
  1. Now write a brief 1-2 sentence APA-style summary of the results. When you interpret ANOVA results, you want to do 3 things:
    • Describe the results numerically. For this ANOVA, you must include the F statistic, both degrees of freedom, the MSe (mean square error, AKA MSWithin) and the p value. Like this: There was a main effect of Group (F(2,12) = 4.938, MSe = 1.067, p = .027).
    • Describe any follow-up t-tests you did. Like this: Follow-up t-tests indicated that …
    • Describe the results in plain language. Like with words that a human would use to talk to another human. Like this: These results show that XXX Group had higher Amyloid Plaque than …

Within-subjects vs. Between-subjects Designs

So that’s how you do a basic one-way ANOVA. Before we proceed, we need to (re)learn the distinction between within-subjects and between-subjects data.

What is the subject in between- and within-subjects designs?

The subject is the person (or thing) providing the data.

  • All swans are white. The subjects are the specific swans that we examine.
  • Every bald person is smart. The subjects are the specific bald people that we give IQ tests to.
  • Do men, on average, weigh more than women? The subjects are the specific men and women we weigh.
  • Are the mice heavier post-treatment than pre-treatment? The subjects are the specific mice that we weigh.

Between-Subjects Data

In between-subjects data, each subject is compared to a different subject.

The data we used above - flights departing from 3 different airports - is between-subjects data. The subjects - the specific thing providing data - are the individual flights; each flight had its own departure delay (dependent variable) and each flight had an airport of origin (factor). We know that each flight left either EWR, JFK, or LGA; it is impossible for a given airplane flight to leave from two different airports! So when we compared the average departure delay between these three airports, we were comparing different flights to each other. That’s between-subjects data.

Another way to think about between-subjects data is to ask “How many data points did each subject give us?”. Since each flight only gave one data point (i.e. it only had one departure delay value), this is between-subjects data.

When the study’s purpose is to compare two different groups (e.g. ADHD vs. control, Americans vs. Europeans, rural vs. urban), the design and data are usually between-subjects.

Let’s consider some of the examples given above:

  • All swans are white. Each swan only gives us one data point - black or white - so this is between-swans data.
  • Every bald person is smart. Each person is either bald or not; we’re comparing different people when we compare baldies and hairies. Also, each person’s IQ is only going to be measured once.
  • Do men, on average, weigh more than women? When we are comparing men to women, we are comparing two groups of different people. So this is between-subjects.
  • Are the mice heavier post-treatment than pre-treatment? In this case, we are measuring each mouse twice - before and after some treatment. This means that we are comparing each mouse to itself pre- vs. post-treatment. This is NOT between-subjects data.

Within-Subjects Data

The mouse example above represents within-subjects data. Within-subjects data is often called repeated-measures data. In this kind of data, each subject gives more than one data point, so that an individual subject is being compared to him/herself. Within-subjects data is data where participants experienced more than one level of a variable. In other words, 2 or more data points came from the same participant. This is conceptually much like the paired-samples t-test we saw earlier.

NOTE: More than one data point means more than one measurement of the same variable. If I take an IQ test and a personality test, that’s not within-subjects data because those are different variables. If I take an IQ test before and after I drink a bunch of mountain dew, that’s within-subjects data because the SAME variable is measured twice.

For example, suppose our participant Tim was part of a study about listening to music while studying. He came in and studied for an hour while listening to instrumental classical music. Then he took a test on the material he studied. The next week he studied for an hour listening to classical music with lyrics. Then he took a test on that material. For Tim, Lyrics is a within-subjects variable - he experienced Lyrics AND No Lyrics. Because of this, 2 different data points both came from Tim. And because these data points are both from the same person, they are not truly independent.

Other examples:

  • Are my students learning? If I compare the same students at the beginning and end of the semester, that’s within-subjects data.
  • Do I look better in blue? If you try on 6 blue outfits and 6 red ones and get people’s opinion, that’s within-subjects data: since you wore both the blue and the red, you’re comparing blue you to red you.

What do we do with within-subjects data?

The big difference between between-subjects and within-subjects data is that within-subjects data are not independent; because multiple data points come from the same person, those data points “belong together” - they are not free to vary. We already saw this in the last section when we learned about paired t-tests, where we wanted to link up the data points that belonged in neat pairs.

So when we do a within-subjects AKA repeated-measures ANOVA, we have to tell the ANOVA which data points belong to the same subjects. In the next section, we’ll see how that is done. But first, I have some questions:

  1. What is a between-subjects variable? Give a definition AND an example.
  2. What is a within-subjects variable? Give a definition AND an example.
  3. Define the term repeated-measures. Give an example of a repeated-measures study. Is a repeated-measures variable between- or within-subjects?
  4. Look at this data. What is/are the variables? Is each variable between- or within-subjects?
##    Quiz Johnny_Appleseed Paul_Bunyan John_Henry
## 1     1               61          53         50
## 2     2               60          50         51
## 3     3               60          50         48
## 4     4               64          51         51
## 5     5               61          51         53
## 6     6               61          51         49
## 7     7               58          50         48
## 8     8               62          49         49
## 9     9               62          50         51
## 10   10               61          47         49
  1. Now look at this data about cars. Is this between- or within-subjects data?
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  hp_cat
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4     Low
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4     Low
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1  Lowest
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1     Low
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 Average
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1     Low
  1. Now look at this data about dogsledding. Is this between- or within-subjects data?
Musher Checkpoint Time
Jerry Sousa 1 243
Jerry Sousa 2 176
Jerry Sousa 3 304
Jerry Sousa 4 201
Melissa Owens 1 215
Melissa Owens 2 421
Melissa Owens 3 334
Melissa Owens 4 220
  1. Now look at this data about diamonds. Is this between- or within-subjects data?
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
  1. Generally speaking, if a data set is in wide format, is it most probably between- or within-subjects?

Repeated-Measures One-Way ANOVA

For this analysis, I’ll be using data from a reading study:

Vasilev, M. R., Hitching, L., & Tyrrell, S. (2023). What makes background music distracting? Investigating the role of song lyrics using self-paced reading. Journal of Cognitive Psychology, 1-27.

The OSF page is here: https://osf.io/8zw4x/
The specific data we will use is here: https://osf.io/7y3v9

This study tested people’s reading rates under 3 conditions:

  • Silence
  • Lyrical Music
  • Instrumental Music

The goal of the study was to test whether background music affected how quickly people read, and whether it matters whether the music had lyrics or not.

Here’s the data for one subject (the authors conveniently use the word subject to label the column that indicates which subject):

subject item sound reading_time_minutes number_of_words wpm Experiment
1 13 silence 0.7016369 53 75.53765 Experiment 1a
1 10 silence 0.5253559 49 93.27011 Experiment 1a
1 4 silence 0.5805464 53 91.29331 Experiment 1a
1 7 silence 0.5305413 52 98.01310 Experiment 1a
1 1 silence 0.6064904 58 95.63217 Experiment 1a
1 5 lyrical 0.5067230 52 102.62017 Experiment 1a
1 14 lyrical 0.4038367 50 123.81242 Experiment 1a
1 11 lyrical 0.4633005 53 114.39659 Experiment 1a
1 8 lyrical 0.4066892 50 122.94400 Experiment 1a
1 2 lyrical 0.5125489 57 111.20890 Experiment 1a
1 9 instrumental 0.3108060 45 144.78486 Experiment 1a
1 12 instrumental 0.5378115 61 113.42264 Experiment 1a
1 6 instrumental 0.3790384 52 137.18925 Experiment 1a
1 3 instrumental 0.4976833 56 112.52136 Experiment 1a
1 15 instrumental 0.3870599 45 116.26107 Experiment 1a

Notice that this person read some passages in silence, others while listening to lyrical music, and still others while listening to instrumental music. So, when we are comparing reading rates for silence, lyrical, and instrumental music, we will be comparing subject 1 to subject 1 to subject 1 (and the same for the other subjects). This is within-subjects data for this reason.

Doing a repeated-measures ANOVA

So let’s do the ANOVA.

NOTE: There were multiple passages read in each sound condition. Some data points were missing (see “The Horrors of Unbalanced Data”, below) so I averaged the observations together.

NOTE: We’ve filtered the data because this study contains multiple experiments, and the last one included an extra level of the sound factor called “speech”, which I have removed.

NOTE: I’ve created a new subject label called subjectALL, which is a combination of subject number and experiment number. I did this because subject 1 in experiment 2 is NOT the same person as subject 1 in Experiment 3, so they need to be labeled uniquely.

Reading_ANOVA <- aov(wpm ~ sound + Error(subjectALL), data = readingrate)
# NOTE: WPM stands for Words Per Minute, a measure of reading rate.

Before we get to the output, let’s consider the difference here. The main change that makes this a within-subjects ANOVA is the addition of the Error() term to the formula. This term says “use the subjectALL column to group this data, and treat the sound variable as a within-subjects variable”.

Interpreting repeated-measures output

## 
## Error: subjectALL
##            Df  Sum Sq Mean Sq F value Pr(>F)
## Residuals 819 4679985    5714               
## 
## Error: Within
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## sound        2   5288  2643.8   5.913 0.00276 **
## Residuals 1638 732401   447.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that the ANOVA is split into two parts. We are only interested in the “Error: Within” part. Is there a significant main effect of sound? What can you conclude from the follow-up t-tests below?

library(rstatix)

t.test.results <- readingrate %>% 
  t_test(wpm ~ sound, paired = TRUE) %>% # paired = TRUE because this is within-subjects data
  adjust_pvalue(method = "none") %>%
  add_significance() 

knitr::kable(t.test.results)
.y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
wpm instrumental lyrical 820 820 3.414152 819 0.000671 0.000671 ***
wpm instrumental silence 820 820 1.670099 819 0.095000 0.095000 ns
wpm lyrical silence 820 820 -1.793848 819 0.073000 0.073000 ns

One-Way Within-Subjects (Repeated Measures)

Now let’s do an ANOVA with within-subjects data!

Run the following code, then look at the data and the help page.

data(Fingers)
?Fingers
Fingers
  1. Can you tell that this data is within-subjects? How?
  2. Now set up an ANOVA to test whether the levels of the Predictor variable are different.
    • To tell R that the data is within-subjects, add this to the Formula: Error(as.factor(Subject))
    • Don’t forget to use the summary() function to show your results.
    • Your output should look like this:
## 
## Error: Subject
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals  3   5478    1826               
## 
## Error: Within
##           Df Sum Sq Mean Sq F value Pr(>F)  
## Drug       2    872   436.0    7.88  0.021 *
## Residuals  6    332    55.3                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Get the means and standard deviations for the three types of Drugs. Show your code and the output.
  2. Look at the following code. How is it different from the code in 40? More importantly, WHY is it different from the code in 40?
Fingers %>% t_test(TapRate ~ Drug, paired = TRUE) %>%
  adjust_pvalue(method = "none") %>%
  add_significance()
## # A tibble: 3 × 10
##   .y.     group1   group2         n1    n2 statistic    df     p p.adj p.adj.signif
##   <chr>   <chr>    <chr>       <int> <int>     <dbl> <dbl> <dbl> <dbl> <chr>       
## 1 TapRate Caffeine Placebo         4     4     4.08      3 0.026 0.026 *           
## 2 TapRate Caffeine Theobromine     4     4    -0.289     3 0.791 0.791 ns          
## 3 TapRate Placebo  Theobromine     4     4    -4.50      3 0.02  0.02  *
  1. Now write a brief 1-2 sentence APA-style summary of the results.

Factorial Designs

So you can use an ANOVA with a single factor, as in the example we’ve already seen:

  • Hypothesis: Listening to music while studying will affect reading rate.
  • Dependent Variable: Reading Rate.
  • Factor: Sound: Silence, Lyrical Music, Instrumental Music.

An ANOVA with a single factor is called a one-way ANOVA. This simple experiment can’t be analyzed using a t-test, because the predictor variable (Sound) has 3 levels.

However, ANOVAs are ideal for factorial designs, experiments with more than one factor.

If we were to change our example ANOVA to be a factorial design, we would add a second factor:

  • Hypothesis 1: Listening to music while studying will affect reading rate.
  • Hypothesis 2: Familiar songs will be less distracting
  • Dependent Variable: Reading Rate.
  • Factor 1: Sound: Silence, Lyrical Music, Instrumental Music.
  • Factor 2: Familiarity (Levels: participants know the songs, songs are unknown to the participants)
                     Familiar Songs                    Unfamiliar Songs
Lyrical Music        “Stayin’ Alive” by the BeeGees    Pretty much anything else by the BeeGees
Instrumental Music   Theme from Star Wars              Theme from 80’s TV show Airwolf
readingrate <- readingrate %>%
  mutate(MusicFamiliarity = case_when(
    Experiment == "Experiment 1a" | Experiment == "Experiment 1b" ~ "Familiar",
    TRUE ~ "Unfamiliar"
  )) # Here I make the new Factor 'MusicFamiliarity'

Main Effects

The separate effects of the different factors in an ANOVA are called Main Effects. A main effect is the independent effect of a factor on the dependent variable, separate from any other factor.

A one-way ANOVA has only one factor, so only one main effect. A two-way ANOVA has two factors, so two main effects. And so on.

Interactions

Main effects are fine, but the real reason to do a factorial design is to look at Interactions. While a main effect tells us how one factor influences the dependent variable, an interaction explores how two (or more) factors work together to influence the dependent variable. For example, we might ask:

  • Does the effect of music on reading rate get stronger when the music is familiar?

We are asking if Sound (Factor 1) affects reading rate (dependent variable) differently for Familiar vs. Unfamiliar songs (Factor 2).

While main effects test each factor separately, an Interaction is a test of how two factors work together to influence the dependent variable.

Arguments in a Factorial Design

How would we set up this factorial design in ANOVA? We can start with what we know how to do: An ANOVA with a single factor.

Reading_ANOVA_Factorial <- aov(wpm ~ sound + Error(subjectALL), data = readingrate)

Now we add our other factors, separating each with a plus (+).

Reading_ANOVA_Factorial <- aov(wpm ~ sound + MusicFamiliarity + Error(subjectALL), data = readingrate)

And then we tell R to give us the output.

summary(Reading_ANOVA_Factorial)
## 
## Error: subjectALL
##                   Df  Sum Sq Mean Sq F value              Pr(>F)    
## MusicFamiliarity   1  344160  344160   64.93 0.00000000000000274 ***
## Residuals        818 4335825    5301                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## sound        2   5288  2643.8   5.913 0.00276 **
## Residuals 1638 732401   447.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that MusicFamiliarity, which is a between-subjects factor, is in the top section, while sound, still a within-subjects factor, is in the bottom section as usual.

What does the Main Effect of MusicFamiliarity mean? Use the graph below to guide your interpretation.

Adding the Interaction

Let’s see how we would set this ANOVA up to incorporate an interaction:

Reading_ANOVA_Factorial_wInteraction <- aov(wpm ~ sound * MusicFamiliarity + Error(subjectALL), data = readingrate)
  • I’ve linked sound and MusicFamiliarity with an asterisk (*) instead of a plus (+). This tells R to consider BOTH the main effects AND the interaction of these two variables.

Here’s the output:

summary(Reading_ANOVA_Factorial_wInteraction)
## 
## Error: subjectALL
##                   Df  Sum Sq Mean Sq F value              Pr(>F)    
## MusicFamiliarity   1  344160  344160   64.93 0.00000000000000274 ***
## Residuals        818 4335825    5301                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##                          Df Sum Sq Mean Sq F value  Pr(>F)   
## sound                     2   5288  2643.8   5.908 0.00278 **
## sound:MusicFamiliarity    2    258   129.2   0.289 0.74932   
## Residuals              1636 732142   447.5                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interaction is represented as sound:MusicFamiliarity. It is not statistically significant. This tells us that the effect of sound (probably) did not depend on music familiarity - or at least we have no evidence that it does. In other words, whether the music was familiar or not, the effect of sound on reading rate was the same. See the graph below. Notice that even though the bars on the left are lower, the instrumental bar is a bit higher and the lyrical bar a bit lower on both sides.

I’ve seen that word before vs. that word’s hard to see

Let’s see another example of a factorial design. Hopefully this time the interaction will be significant!

Here is the original study that explains this data: http://germel.dyndns.org/psyling/pdf/2008_Yap_et_al_SQ_frequency.pdf

Here is the replication study that re-created the original study: https://osf.io/ahpik/

Here is the replication data we are analyzing: https://osf.io/6kaw2

Explanation of the Data

In this study, participants completed a ‘lexical decision task’: they were shown a string of letters and had to decide, as quickly as possible, whether it was a real word or not.

Two factors were manipulated by the researchers:

  • Word Frequency: How often a given word is used in speech and writing. ‘House’ is a much more common word than ‘Hobby’, and it should be recognized faster.
  • Clarity: How easy a word is to see. Some of the words alternated on the screen between the word itself and a string of random symbols. This flickering made the word hard to see.

The authors of the study wanted to look for a:

  • Main Effect of Frequency. Are more common words easier (faster) to recognize? Probably!
  • Main Effect of Clarity. Are words that are harder to see also harder to recognize? I’m betting yes.
  • An interaction of frequency and clarity. If I make a word harder to see, does frequency matter more or less?
           High Frequency Word   Low Frequency Word
Clear      HOUSE                 HOBBY
Degraded   HOUSE vs. @%?&!       HOBBY vs. @#%?&!

Running the ANOVA

Here’s our ANOVA:

Word_ANOVA <- aov(Mu ~ Frequency * Clarity + Error(subject), data = worddata)
summary(Word_ANOVA)
## 
## Error: subject
##           Df  Sum Sq Mean Sq F value Pr(>F)
## Residuals 70 6384156   91202               
## 
## Error: Within
##                    Df  Sum Sq Mean Sq F value            Pr(>F)    
## Frequency           1 1747096 1747096  60.008 0.000000000000397 ***
## Clarity             1  172073  172073   5.910            0.0159 *  
## Frequency:Clarity   1  115052  115052   3.952            0.0481 *  
## Residuals         210 6114014   29114                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It looks like both main effects are significant, and the interaction is too!

Frequency Main Effect

Let’s interpret the main effect of Frequency. Here’s a graph.

if (!require("plotrix")) install.packages("plotrix") # std.error() below comes from plotrix (skip if it's already in your setup chunk)
library(plotrix)

worddata %>% 
  group_by(Frequency) %>%
  summarise(delay = mean(Mu, na.rm = TRUE), se = std.error(Mu)) %>%
ggplot(aes(x = Frequency, y = delay)) +
  geom_bar(stat = "identity", position = position_dodge(), aes(fill = Frequency), color = "black") + 
  geom_errorbar(aes(ymin=delay-se, ymax=delay+se),
                width=.2, # Width of the error bars
                position=position_dodge(.9)) + theme(legend.position = "none", text = element_text(size = 20)) + labs(
                  x = "Frequency",
                  y = "Response Time"
                ) + theme_bw() + theme(legend.position = "none") #+ coord_cartesian(ylim = c(140, 190))

And here’s how I would write up the results for this part.

There was a significant main effect of Word Frequency (F(1, 210) = 60, MSe = 29114, p < .001), indicating that high frequency words were recognized more quickly than low frequency words.

Clarity Main Effect

Now let’s do the same thing for Clarity.

worddata %>% 
  group_by(Clarity) %>%
  summarise(delay = mean(Mu, na.rm = TRUE), se = std.error(Mu)) %>%
ggplot(aes(x = Clarity, y = delay)) +
  geom_bar(stat = "identity", position = position_dodge(), aes(fill = Clarity), color = "black") + 
  geom_errorbar(aes(ymin=delay-se, ymax=delay+se),
                width=.2, # Width of the error bars
                position=position_dodge(.9)) + theme(legend.position = "none", text = element_text(size = 20)) + labs(
                  x = "Clarity",
                  y = "Response Time"
                ) + theme_bw() + theme(legend.position = "none") #+ coord_cartesian(ylim = c(140, 190))

There was a significant main effect of Clarity (F(1, 210) = 5.91, MSe = 29114, p = .016), indicating that clear words were recognized more quickly than degraded words.

Interaction

And finally, let’s interpret the interaction. We’ll start with a graph.

worddata %>% 
  group_by(Clarity, Frequency) %>%
  summarise(delay = mean(Mu, na.rm = TRUE), se = std.error(Mu)) %>%
ggplot(aes(x = Frequency, y = delay, group= Clarity)) +
  geom_bar(stat = "identity", position = position_dodge(), aes(fill = Clarity), color = "black") + 
  geom_errorbar(aes(ymin=delay-se, ymax=delay+se),
                width=.2, # Width of the error bars
                position=position_dodge(.9)) + theme(legend.position = "none", text = element_text(size = 20)) + labs(
                  x = "Clarity",
                  y = "Response Time"
                ) + theme_bw() + theme(legend.position = "none") #+ coord_cartesian(ylim = c(140, 190))

And some follow-up t-tests.

library(rstatix)

t.test.results <- worddata %>% 
  group_by(Frequency) %>%
  t_test(Mu ~ Clarity, paired = TRUE) %>% # paired = TRUE because this is within-subjects data
  adjust_pvalue(method = "none") %>%
  add_significance() 

knitr::kable(t.test.results)
Frequency .y. group1 group2 n1 n2 statistic df p p.adj p.adj.signif
High Mu clear degraded 71 71 -0.8483092 70 0.399000 0.399000 ns
Low Mu clear degraded 71 71 -3.6352108 70 0.000526 0.000526 ***

And here is what I would conclude:

These two main effects were qualified by a significant interaction (F(1, 210) = 3.952, MSe = 29114, p = .048). Follow-up t-tests indicated that the effect of Clarity was significant for Low Frequency words, but not for High Frequency Words.

Two-Way ANOVA

A two-way ANOVA has 2 Factors, instead of just one. Two-way ANOVAs let us test 3 things:

  • Does Factor A predict the dependent variable?
  • Does Factor B predict the dependent variable?
  • Do Factor A and Factor B interact (work together) to predict the dependent variable?
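
In aov() terms, the full two-way model looks something like the following sketch. FactorA, FactorB, dv, and mydata are placeholder names; the asterisk asks for both main effects AND the interaction, just like in the reading-rate example above.

My_TwoWay_ANOVA <- aov(dv ~ FactorA * FactorB, data = mydata)  # same as FactorA + FactorB + FactorA:FactorB
summary(My_TwoWay_ANOVA)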

Run the code below:

data(stress)
?stress
stress
## # A tibble: 60 × 5
##       id score treatment exercise   age
##    <int> <dbl> <fct>     <fct>    <dbl>
##  1     1  95.6 yes       low         59
##  2     2  82.2 yes       low         65
##  3     3  97.2 yes       low         70
##  4     4  96.4 yes       low         66
##  5     5  81.4 yes       low         61
##  6     6  83.6 yes       low         65
##  7     7  89.4 yes       low         57
##  8     8  83.8 yes       low         61
##  9     9  83.3 yes       low         58
## 10    10  85.7 yes       low         55
## # ℹ 50 more rows

Now examine the data and read the help page in the ‘help’ tab. Then answer the following questions:

  1. What variable would make the best dependent variable?
  2. What variables would make acceptable ANOVA Predictors? I think there are two.
  3. Is this data within- or between-subjects?
  4. Now set up an ANOVA to test whether your two Predictors affect the dependent variable.
    • Don’t forget to include the Interaction!
    • Don’t forget to use the summary() function to show your results.
    • Your output should look something like this:
##                    Df Sum Sq Mean Sq F value        Pr(>F)    
## treatment           1  351.4   351.4  12.295      0.000923 ***
## exercise            2 1776.3   888.1  31.076 0.00000000104 ***
## treatment:exercise  2  217.3   108.7   3.802      0.028522 *  
## Residuals          54 1543.3    28.6                          
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice that there are 3 lines of results instead of just 1. The first line is the ‘main effect’ of treatment, the second line is the ‘main effect’ of exercise, and the third line is the interaction of treatment and exercise.

Main Effects

  1. In your own words, define ‘main effect’.
  2. Use the t-test results below to help you write a brief 2-4 sentence APA-style summary of the two main effects.
stress %>% group_by(treatment) %>% t_test(score ~ exercise) %>%
  adjust_pvalue(method = "none") %>%
  add_significance()
## # A tibble: 6 × 11
##   treatment .y.   group1   group2      n1    n2 statistic    df          p      p.adj p.adj.signif
##   <fct>     <chr> <chr>    <chr>    <int> <int>     <dbl> <dbl>      <dbl>      <dbl> <chr>       
## 1 yes       score low      moderate    10    10    0.388   17.8 0.703      0.703      ns          
## 2 yes       score low      high        10    10    6.65    16.0 0.00000562 0.00000562 ****        
## 3 yes       score moderate high        10    10    6.65    16.8 0.00000437 0.00000437 ****        
## 4 no        score low      moderate    10    10    0.0809  17.4 0.936      0.936      ns          
## 5 no        score low      high        10    10    3.36    17.2 0.004      0.004      **          
## 6 no        score moderate high        10    10    3.01    18.0 0.007      0.007      **

Interaction

  1. An interaction means that the effect of one Factor depends on the other Factor. Consider the graph below. How is the effect of the Factor exercise different for Treatment=yes than for Treatment=no?

  1. Now write a brief 1-2 sentence APA-style summary of the interaction effect. Remember, you need numbers AND words.

Two-Way Repeated Measures ANOVA

Now let’s do a repeated-measures ANOVA!

data(selfesteem2)
?selfesteem2
selfesteem2
## # A tibble: 24 × 5
##    id    treatment    t1    t2    t3
##    <fct> <fct>     <dbl> <dbl> <dbl>
##  1 1     ctr          83    77    69
##  2 2     ctr          97    95    88
##  3 3     ctr          93    92    89
##  4 4     ctr          92    92    89
##  5 5     ctr          77    73    68
##  6 6     ctr          72    65    63
##  7 7     ctr          92    89    79
##  8 8     ctr          92    87    81
##  9 9     ctr          95    91    84
## 10 10    ctr          92    84    81
## # ℹ 14 more rows

Now examine the data and read the help page in the ‘help’ tab. Then do the following:

  1. How do you know that the variables (time or treatment) are within-subjects?
  2. Pivot the data so it is in long format.
    • Only pivot columns 3:5 (the time columns). Send the column names to a new column called “time” and the values to “self_esteem_score”
  3. Now do an ANOVA on the pivoted data. You should get the results shown below:
    • You’ll need to tell R that this is within-subjects data by adding something like this to your formula: Error(VariableThatIdentifiesTheSubject)
  4. Now do the follow-up t-tests.
    • Hint: Do you need to group your data before you do the t-tests? What variable should you group by?
    • Hint: Should you use paired=TRUE or paired=FALSE?

  1. Based on the ANOVA results, follow-up t-tests, and the graph provided, write an APA-style summary of the results.
    • Describe both main effects in numbers and in words. Use the closest mean square error.
    • Describe the interaction in numbers and in words.

When do we NOT use ANOVAs (but we could, if we had to)

Below are two situations where I don’t think you should use ANOVAs:

  • When you have within-subjects data.
  • When you have unbalanced data.

First I’ll explain how you COULD do an ANOVA in these situations. Then we’ll talk about why I don’t think you SHOULD.

Within-subjects data

I can hear you saying, “What do you mean, I shouldn’t use ANOVAs with within-subjects data? Why’d you teach me about it, then?”

Here are 3 good reasons why I taught you about within-subjects data here:

  • It IS important to understand the difference between between- and within-subjects data, and now seemed as good a time as any to teach you.
  • You may be called upon to do an ANOVA, or at least to interpret one. It’s still important to know what a good ANOVA looks like.
  • Remember when you were in grade school and they taught you about drugs? Same reason - so you could stay away.

The Horrors of Unbalanced Data

Since ANOVA was made for factorial designs, which are usually experimental or quasi-experimental in nature, ANOVA expects that the data will be balanced: there will be (close to) the same number of observations in each cell.
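
A quick way to check whether your own data are balanced is simply to count the observations per cell. For example (assuming the flights data and the tidyverse are still loaded from earlier on this page):

flights %>% count(origin, carrier)       # observations per origin-by-carrier cell
table(flights$origin, flights$carrier)   # the same thing in base R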

There are a couple of reasons why you might have unbalanced data:

  • You have a lot of missing data
  • Your data are from a naturally occurring data set, so there are more data points in some cells than others by chance.

If your data is unbalanced, factorial designs require a special variation of ANOVA. If you are interested in the how and why, google “Type I, II, and III sums of squares”. If not, just know that you can do a factorial ANOVA on unbalanced data this way:

if (!require("car")) install.packages("car")
library(car)

Flights_ANOVA_Unbalanced <- Anova(lm(dep_delay ~ origin + carrier, data = flights))
Flights_ANOVA_Unbalanced
## Anova Table (Type II tests)
## 
## Response: dep_delay
##              Sum Sq     Df F value                Pr(>F)    
## origin       111096      2  34.777  0.000000000000000791 ***
## carrier     5178615     15 216.144 < 0.00000000000000022 ***
## Residuals 524708102 328503                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Why Not ANOVA?

There’s a simple reason why you shouldn’t do an ANOVA in these situations: a better test is available. Mixed-effects models handle repeated-measures data better than ANOVAs do, and they deal with unbalanced designs better, too.

Where do I get these ‘mixed-effects models’, you ask? Stick around - we’ll get to them soon enough.
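
Just as a preview, here is roughly what the repeated-measures reading-rate ANOVA from earlier would look like as a mixed-effects model. This is only a sketch using the lme4 package, which we haven’t covered yet, so don’t worry about the details:

if (!require("lme4")) install.packages("lme4")
library(lme4)

# 'sound' is the fixed effect; (1 | subjectALL) tells the model which observations
# came from the same subject (the same job the Error() term did in aov()).
Reading_Mixed <- lmer(wpm ~ sound + (1 | subjectALL), data = readingrate)
summary(Reading_Mixed)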

Real Data: Do People Want to Be More Moral?

  1. Test the following research question: Which work-related trait do People want to change the most: Organization, Productiveness, Assertiveness, or Responsibility? To do this, do the following:
    • Read in ‘moraldatalong.csv’ from Checklist 4.
    • Filter the data to include ONLY Organization, Productiveness, Assertiveness, and Responsibility from the Trait variable.
    • Decide if the ANOVA is between-, within-, or mixed.
    • Perform the appropriate ANOVA
    • Perform follow-up t-tests as needed.
  2. Make a graph comparing the four Traits.
  3. Write an APA-style paragraph describing the results of your analysis.

(Bonus) Do an ANCOVA

An ANCOVA is an ANOVA that includes one or more continuous predictors. The purpose of including this additional predictor is to statistically control for it. In other words, an ANCOVA checks to see if the Factors are significant even after accounting for some potential confounding factors.

  1. Redo the ANOVA of the stress data from item 58 as an ANCOVA, this time including age as a covariate.
##                    Df Sum Sq Mean Sq F value         Pr(>F)    
## treatment           1  351.4   351.4  14.141       0.000425 ***
## exercise            2 1776.3   888.1  35.743 0.000000000149 ***
## age                 1  222.7   222.7   8.964       0.004177 ** 
## treatment:exercise  2  220.9   110.5   4.446       0.016409 *  
## Residuals          53 1316.9    24.8                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Re-make the graph I made for item 61.

Real Data: Honestly Hot! (Bonus)

For this task, we will be replicating part of this paper:

Niimi, R., & Goto, M. (2023). Good conduct makes your face attractive: The effect of personality perception on facial attractiveness judgments. Plos one, 18(2), e0281758.

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0281758

The Open Science Framework Page for this data can be found here:

https://osf.io/rysnm/

Let’s use the data from Experiment 1:

Data: https://osf.io/5qx9j Codebook: https://osf.io/szn2y

  1. Read in the data. Create a new variable called attractiveness that is the inverse of Phys1 (see the note in the codebook about this).
    • Hint: Google “formula to reverse code Likert scale” or something similar.
    • Once you’ve gotten the data ready, save it as “honestyhotnessdata.csv” so we can use it later.
  2. Do an appropriate ANOVA. Include attractiveness as your dependent variable. Include 3 factors: StimHonesty, StimAtty, and StimGender, as well as the interactions of all 3 factors and the three-way interaction.
  3. Using graphs and follow-up t-tests, interpret the ANOVA results and write up a paragraph describing what you found.

Real Data: Autism (Bonus! But not easy!)

For this task, we will be replicating part of this paper:

Birmingham, E., Stanley, D., Nair, R., & Adolphs, R. (2015). Implicit social biases in people with autism. Psychological Science, 26(11), 1693-1705.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4636978/

The Open Science Framework Page for this data can be found here:

https://osf.io/9tu5r/

  1. Find the file “iatDataForSPSS.txt”. It’s hidden deep within a zip file. Move that file into your project folder.
  2. Read in the data and prep it for analysis:
    • Remember, it’s a .txt, NOT a .csv
    • Create a new variable called BiasType. If Exp contains the words “Flower” or “Shoes”, BiasType should be “Non-Social”. If Exp contains the words “Gender” or “Race”, BiasType should be “Social”. Otherwise it should be NA.
  3. Set up and run your ANOVA, using BiasType and Group as your Factors and D as the dependent variable.
    • Hint: One of the Factors is within-subjects. Can you figure out which one?
## 
## Error: Subj
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## BiasType        1  0.147  0.1469   0.797 0.375630    
## Group           1  2.723  2.7230  14.768 0.000292 ***
## BiasType:Group  1  0.010  0.0104   0.057 0.812869    
## Residuals      61 11.248  0.1844                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Error: Within
##                 Df Sum Sq Mean Sq F value              Pr(>F)    
## BiasType         1  7.421   7.421  73.275 0.00000000000000108 ***
## BiasType:Group   1  0.404   0.404   3.992              0.0468 *  
## Residuals      255 25.826   0.101                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Conduct follow-up t-tests to explore the ANOVA results.
  2. Make a graph to explore the ANOVA results
  3. Write an APA-style paragraph describing the outcome of this analysis.
