Here’s a little code I wrote.
Part of it might look familiar - it creates scores for 4 different Dungeons and Dragons characters.
It also randomly makes one of these players a dirty, dirty CHEATER by adding +1 to all of their ability scores. Shame!
Except that sometimes it doesn’t make anyone a cheater. Can you tell why?
Players <- c("Travis", "Laura", "Liam", "Ashley")
CharacterStatsNames <- c("Strength", "Dexterity", "Constitution", "Intelligence", "Wisdom", "Charisma")
CharacterStats <- c(0, 0, 0, 0, 0, 0)
Which_Cheater <- sample(1:5, 1, replace = TRUE) # Suspicious...
Player_Data <- data.frame()
for (Which_Player in 1:length(Players)) { # Make ability scores for each player
  for (Which_Stat in 1:6) { # Create a for loop that will iterate 6 times
    sample(1:6, 4, replace = TRUE) %>% # Roll the die
      sort(., decreasing = TRUE) %>% # Sort vector from largest to smallest
      .[-4] %>% # Drop the last (smallest) number
      sum(.) %>% # Add up elements in vector
      { if (Which_Player == Which_Cheater) # Add 1 to all the cheater's rolls!
          . + 1 else . } ->
      CharacterStats[Which_Stat] # Input score into CharacterStats vector
  } # Repeat this process 6 times, so that you have 6 different numbers in a vector
  Player_Data <- rbind(Player_Data, c(Players[Which_Player], CharacterStats))
}
Player_Data <- as_tibble(Player_Data)
colnames(Player_Data) <- c("Player_Name", "Strength", "Dexterity", "Constitution", "Intelligence", "Wisdom", "Charisma")
Player_Data <- pivot_longer(Player_Data, 2:7, names_to = "Stat", values_to = "Score") %>%
  mutate(Player_Name = as.factor(Player_Name), Stat = as.factor(Stat), Score = as.numeric(Score))
Here are the scores:
Let’s re-roll the scores a few times and see if it’s always obvious who the cheater is. Add this code to your setup chunk:
if (!require("shiny")) install.packages("shiny")
library(shiny)
Now copy and run this code chunk.
shinyApp(
  ui = fluidPage(
    actionButton("action", "Re-roll Scores"),
    plotOutput("plot")
  ),
  server = function(input, output) {
    observeEvent(input$action, {
      output$plot = renderPlot({
        Players <- c("Travis", "Laura", "Liam", "Ashley", "Nobody")
        CharacterStatsNames <- c("Strength", "Dexterity", "Constitution", "Intelligence", "Wisdom", "Charisma")
        CharacterStats <- c(0, 0, 0, 0, 0, 0)
        Which_Cheater <- sample(1:5, 1, replace = TRUE) # Suspicious...
        Player_Data2 <- data.frame()
        for (Which_Player in 1:(length(Players) - 1)) { # Make ability scores for each player
          for (Which_Stat in 1:6) { # Create a for loop that will iterate 6 times
            sample(1:6, 4, replace = TRUE) %>% # Roll the die
              sort(., decreasing = TRUE) %>% # Sort vector from largest to smallest
              .[-4] %>% # Drop the last (smallest) number
              sum(.) %>% # Add up elements in vector
              { if (Which_Player == Which_Cheater) # Add 1 to all the cheater's rolls!
                  . + 1 else . } ->
              CharacterStats[Which_Stat] # Input score into CharacterStats vector
          } # Repeat this process 6 times, so that you have 6 different numbers in a vector
          Player_Data2 <- rbind(Player_Data2, c(Players[Which_Player], CharacterStats))
        }
        Player_Data2 <- as_tibble(Player_Data2)
        colnames(Player_Data2) <- c("Player_Name", "Strength", "Dexterity", "Constitution", "Intelligence", "Wisdom", "Charisma")
        Player_Data2 <- pivot_longer(Player_Data2, 2:7, names_to = "Stat", values_to = "Score") %>%
          mutate(Player_Name = as.factor(Player_Name), Stat = as.factor(Stat), Score = as.numeric(Score))
        ggplot(data = Player_Data2, aes(x = Stat, y = Score)) +
          geom_bar(stat = "identity", aes(color = Stat, fill = Stat)) +
          theme_bw() +
          scale_color_manual(values = c("red4", "orange4", "yellow4", "green4", "blue4", "purple4")) +
          scale_fill_manual(values = c("red1", "orange1", "yellow1", "green1", "blue1", "purple1")) +
          facet_grid(. ~ Player_Name) +
          labs(
            x = NULL,
            title = "Who is the Cheater?",
            caption = paste("The cheater is", Players[Which_Cheater])
          ) +
          theme(legend.position = "none",
                axis.text.x = element_text(angle = 90, size = 15),
                axis.text.y = element_text(size = 15),
                strip.text = element_text(size = 25))
      })
    })
  },
  options = list(height = 500)
)
Press “Re-roll Scores” a few times and try to guess who the cheater is. The answer is in the bottom right, but don’t peek until you’ve guessed.
Can you tell who the cheater is? How sure are you? Sometimes you’re probably pretty sure, but other times, you might not be. Or you might be sure, but wrong!
Sometimes it feels pretty obvious who the cheater is. But usually it’s not - and it is hard to ever be 100% certain. Why? Because…
Consider other, real-world data sets. What sources of variability might hide group differences?
In this unit, we’ll start to talk about how we can use inferential statistics to see through variability to find systematic patterns.
Let’s explore more specifically why it is hard to see group differences in noise.
quizmeans <- c(50, 50, 60) # Sets the 3 means
quizsd <- 20 # Sets the standard deviation
quizmeans <- sample(quizmeans) # Randomizes the order of the means
Quiz <- c(1:100) # Numbers the quizzes 1-100, to match the 100 grades per student below
Johnny_Appleseed <- round(rnorm(100, mean = quizmeans[1], sd = quizsd))
Paul_Bunyan <- round(rnorm(100, mean = quizmeans[2], sd = quizsd))
John_Henry <- round(rnorm(100, mean = quizmeans[3], sd = quizsd))
# Combines the four vectors into a data frame
Quizdata <- data.frame(Quiz, Johnny_Appleseed, Paul_Bunyan, John_Henry)
# Prints the data
Quizdata
# Makes a density plot
Quizdata %>%
  pivot_longer(cols = 2:4, names_to = "Student", values_to = "Grade") %>%
  ggplot(aes(x = Grade)) +
  geom_density(aes(fill = Student), alpha = 0.5) +
  geom_density(aes(color = Student)) +
  theme_bw()
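If you want to see the numbers behind that plot, here is one quick way to compare each student’s average grade and spread to the true means we set above (my addition; it assumes the tidyverse is already loaded, as in the chunk above):
Quizdata %>%
  pivot_longer(cols = 2:4, names_to = "Student", values_to = "Grade") %>%
  group_by(Student) %>% # One row per student
  summarize(Mean_Grade = mean(Grade), SD_Grade = sd(Grade)) # Compare these to the means of 50, 50, and 60 we set above
The averages land near the means we chose, but with a standard deviation of 20 the individual grades overlap heavily - that’s why the density curves are so hard to tell apart.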
Any time we do statistics, we are working with at least 1 variable. We’ll return to this when we discuss individual statistical tests, because different tests apply to different sets of variables.
In our cheater example, what is/are the variable(s)?
What is the population in our cheater example? What is the sample?
| Research Question | Population | (Possible) Sample |
|---|---|---|
| Do people want to be more moral? | | |
| Does experiencing a natural disaster temporarily boost relationship satisfaction in newlywed couples? | | |
| Do behavioral interventions improve sustained attention in those diagnosed with ADHD? | | |
| Is it Great being Nate? Rates of narcissism among individuals named Nate. | | |
| Who survived the sinking of the Titanic? | | |
We start with a null hypothesis. We assume that our variable does not differ from whatever we’re comparing it to. This might be a fixed number, a couple of different groups, etc.
In other words, we assume that there’s nothing to see here.
Why do we start with this null hypothesis? What’s the logic here?
Is it possible to prove that something is true? What would you have to do to prove the following?
Hopefully you see that this would be a tall order:
In other words, to prove that a universal statement (ALL, EVERY) is TRUE, you have to examine the population. We can’t do that, usually - if we had access to the population, we wouldn’t need statistics!
Is it possible to prove that something is false? What would you have to do to disprove the following?
Right! Much easier:
So, you can disprove a universal statement by examining a sample. This is much more doable.
So, how do we test whether someone in our D&D group is a cheater?
First, we assume that there is no cheater - everyone in our D&D group is honest.
We won’t believe that someone is a cheater unless we have strong evidence.
Then we plan to look for evidence to contradict this assumption. That’s step 2.
| Research Question | Null Hypothesis |
|---|---|
| Do people want to improve moral traits more than morally neutral traits? | |
| Does experiencing a natural disaster temporarily boost relationship satisfaction in newlywed couples? | |
| Do behavioral interventions improve sustained attention in those diagnosed with ADHD? | |
| Is it Great being Nate? Rates of narcissism among individuals named Nate. | |
| Did proportionately more women than men survive the sinking of the Titanic? | |
Gathering evidence to test our null hypothesis is an art unto itself.
If you want to know how to do this, you’ll have to take a different class.
But remember this:
In this step, we look at our data and ask ourselves - do we have any evidence against the null hypothesis?
What counts as evidence against the null hypothesis? Let’s consider our D&D cheater. If we compare their scores to the expected scores for non-cheating players, how different are they?
To answer this question, we have to know what is expected. We need a distribution. So, I rolled up 100,000 D&D characters and computed their average scores (the mean of all 6 of their ability scores). The average was about 12.24. Then I compared each individual set of scores to the average of all the scores to see how far it was from 12.24. In other words, how different was each score from the mean score? The graph below shows what proportion of scores were within 1 point of 12.24, 2 points, 3 points, etc.
Notice how most scores were pretty close to the mean. These scores happen a lot, while other scores farther from the mean are quite rare. Importantly, you can see that even when no cheating is involved, there is a lot of variability in the ability scores of D&D characters, with extreme scores happening once in a while.
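The simulation code itself isn’t shown here, but a minimal sketch of the idea might look like this (my reconstruction, not necessarily the exact code that made the graph):
n_characters <- 100000 # Roll up 100,000 honest characters
honest_means <- replicate(n_characters, {
  ability_scores <- replicate(6, sum(sort(sample(1:6, 4, replace = TRUE), decreasing = TRUE)[-4])) # 6 scores: roll 4d6, drop the lowest
  mean(ability_scores) # This character's average ability score
})
mean(honest_means) # Comes out close to 12.24
hist(abs(honest_means - mean(honest_means))) # How far each character's average falls from the overall mean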
Let’s consider some specific examples. If a player gets an average score around 12, that’s only about 0.25 points below the average, so that player has a score that occurs quite often for non-cheaters. We have no reason to believe that this player is a cheater.
BUT, what if a player got an average score of about 16? That’s 3.75 points higher than average. Look at the graph - scores 3.75 points above the average don’t happen often in non-cheaters. So, we have reason to suspect that the player is, in fact, a cheater.
So, in sum, to find evidence against the null hypothesis, we need two things:
1. A statistic - a single number that summarizes our data (here, a player’s average ability score).
2. A distribution that tells us how likely different values of that statistic are when the null hypothesis is true (here, when nobody is cheating).
This is true for all the tests we will talk about going forward.
So, let’s look at our players, shall we? We have:
1. Each player’s mean ability score.
2. The distribution of mean scores from the 100,000 honestly-rolled characters above.
Using these two bits of information, we can figure out how unexpected each player’s mean score is.
This graph shows the different probabilities for each of our players’ mean scores. We’ll need these probabilities for Step 4.
Now we have to decide. We have two choices: reject the null hypothesis (accuse someone of cheating), or retain the null hypothesis (keep assuming everyone is honest).
What is the basis for our decision? We use the probabilities we got in step 3. If those probabilities are small enough, we reject, otherwise we retain.
What does “small enough” mean? Usually, this means a probability of 0.05, or 5%.
Why? Let’s talk about this in the next section.
Take a look at this histogram. The white bars represent the distribution of ‘honest’ scores - scores rolled in the usual way without cheating. The red bars show the distribution of ‘cheaty’ scores - scores where the dishonest player added +1 to each roll. These two distributions are highly overlapping (the pink area).
Remember that this is the population distribution - we don’t have any real access to this. We only have a sample - a player who may or may not be a cheater.
We need to set a decision threshold that helps us identify cheaters. Specifically, we want to accuse players who really are cheating. We do not want to falsely accuse players who are honest.
That black dashed line shows a decision threshold of about 5%, meaning that about 5% of the ‘honest scores’ are above this threshold; we will falsely accuse honest players about 5% of the time. But most of the scores above this threshold are ‘cheaty scores’, so usually we will only accuse cheaters of being cheaters.
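If you’d like to see where a threshold like that comes from, here’s a rough sketch using the same kind of simulated honest scores (my addition, not the code that made the histogram):
honest_means <- replicate(10000, mean(replicate(6, sum(sort(sample(1:6, 4, replace = TRUE))[-1])))) # 10,000 honest average scores
threshold <- quantile(honest_means, 0.95) # 95% of honest averages fall below this value
threshold
mean(honest_means > threshold) # About 0.05: how often an honest player would land above the line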
What would happen if we moved this line to the right? (i.e. lower than 5%)
What would happen if we moved our decision threshold line to the left? (i.e. higher than 5%)
So you see there is no perfect solution. As we try to avoid false accusations, we make getting away with cheating more likely. And as we try to crack down on cheaters, we make more false accusations. We can’t eliminate these two mistakes, only try to balance them against each other.
| Decision | Honest Player | Cheaty Player |
|---|---|---|
| Accuse (Reject Null Hypothesis) | FALSE ALARM (Type 1 Error) | Correct! |
| Don’t Accuse (Retain Null Hypothesis) | Correct! | MISS (Type 2 Error) |
A Type 1 Error is a false accusation - we say we found a cheater when we really didn’t.
A Type 2 Error is a miss - we continue to believe our cheating player is honest.
So we have to pick a decision threshold that balances these two types of errors. And 0.05 (5%) is a pretty good balance.
So, if our threshold is 0.05, is there anyone we want to accuse of cheating?
| | Retain Null Hypothesis | Reject Null Hypothesis |
|---|---|---|
| Null Hypothesis is Really False | | |
| Null Hypothesis is Really True | | |
So, we’ve learned that there are 4 steps to making a statistical inference:
1. Assume the null hypothesis - there’s nothing to see here.
2. Gather evidence (data).
3. Evaluate the evidence - how unexpected is our data if the null hypothesis is true?
4. Decide - reject the null hypothesis, or retain it.
We’ve also learned that we need four things to make this decision:
1. A variable (or variables).
2. A statistic that summarizes our data.
3. A distribution that tells us how expected (or unexpected) that statistic is.
4. A decision threshold - an alpha level.
Now let’s apply what we’ve learned to some actual statistical tests, starting with the ever-popular t-test
Let’s talk about these 4 things in relation to the t-test.
We can divide our variables into dependent and independent/predictor variables.
A dependent variable is the variable we are measuring or testing. It should always be a number!
What’s the dependent variable in our cheater example?
An independent/predictor variable is a variable that we think might be influencing the dependent variable, so that different values of this independent/predictor variable are associated with different values of the dependent variable.
First question: What’s the difference between a predictor and an independent variable?
We treat them the same in R, but it’s important to make this distinction when we describe our analyses.
In a t-test, the independent/predictor variable can only be a factor. AND IT CAN NEVER HAVE MORE THAN TWO LEVELS. This is a peculiarity of the t-test; you can never compare three or more groups with a single t-test.
What statistic do you think the t-test uses?
Right, the t statistic!
Here is one version of the formula used to compute the t statistic. The t statistic is a way to express, in a single number, how different two groups are from each other.
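The figure with the formula isn’t reproduced here, but one common version - the independent-samples t with unequal variances, which is what the t_test function uses by default - looks roughly like this:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$

Here $\bar{x}_1$ and $\bar{x}_2$ are the two group means, $s_1^2$ and $s_2^2$ are the group variances, and $n_1$ and $n_2$ are the group sizes.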
Note three things:
1. The bigger the difference between the two group means, the bigger the t statistic.
2. The more variability there is within the groups, the smaller the t statistic.
3. The bigger the samples, the bigger the t statistic (all else being equal).
You don’t usually have much control over #1 or #2, although good experimental design choices can help here (see Step 2, above). But researchers have more control over #3; by increasing the sample size, a researcher can make it easier to see differences between groups, if they exist.
| Make t statistic bigger | Make t statistic smaller |
|---|---|
| ? | ? |
| ? | ? |
Once we have a statistic, we have to compare it to a distribution to figure out how expected that t statistic is. Others have blazed this trail for us: here is the t distribution:
The t distribution isn’t always exactly the same, though. It changes shape depending on the degrees of freedom
Degrees of Freedom are related to our sample size: the larger the sample size, the more degrees of freedom we have. (For example, a classic two-sample t-test with equal variances has n1 + n2 - 2 degrees of freedom.)
Alpha Level is the threshold we set (meaning we choose it ourselves) for deciding if we should retain or reject the null hypothesis. (Remember step 4, above?) If our t value falls into the red area, we will reject the null hypothesis.
Most of the time, you set the Alpha Level to 0.05, so that only a small part of the t distribution is red. If you were to set it to 0.25, a lot more of the distribution would become red.
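You can see both of these effects for yourself with qt(), the same function the app below uses to find where the red area starts (these particular numbers are just illustrations):
qt(1 - 0.05, df = 10)  # The cutoff where the red area starts at an Alpha Level of .05 with 10 degrees of freedom
qt(1 - 0.25, df = 10)  # A bigger Alpha Level moves the cutoff toward the middle, so more of the curve is red
qt(1 - 0.05, df = 100) # More degrees of freedom pulls the tails in, so the cutoff moves in slightly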
Review: Assuming we keep the Alpha Level the same but increase the degrees of freedom, what type of error do we reduce? Type 1 or Type 2?
Copy the code chunk below into your R markdown document. Run it. Set the Alpha Level to 0.05, then play around with the Degrees of Freedom. Then answer the questions below.
if (!require("shiny")) install.packages("shiny")
library(shiny)
shinyApp(
  ui = fluidPage(
    fluidRow(
      column(5, wellPanel(
        sliderInput("df", label = h3("Degrees of Freedom"), min = 1,
                    max = 25, value = 1)
      )),
      column(5, wellPanel(
        sliderInput("alpha", label = h3("Alpha Level"), min = 0,
                    max = 0.5, value = 0)
      ))
    ),
    fluidRow(
      column(10, offset = 0,
             plotOutput("plot")
      )
    )
  ),
  server = function(input, output) {
    output$plot = renderPlot({
      ggplot(data.frame(x = c(-8, 8)), aes(x = x)) + theme_bw() +
        stat_function(fun = dt, geom = "area", fill = "red1", # This creates the red area under the curve
                      xlim = c(qt(1 - input$alpha, df = input$df), 8), # What does this red area represent?
                      args = list(df = input$df),
                      color = "red", size = 2) +
        stat_function(fun = dt, args = list(df = input$df), color = "black", size = 2) + # This creates the t distribution
        scale_x_continuous(breaks = -8:8, labels = -8:8) +
        labs(
          x = expression(italic("t")),
          y = expression(paste("P(", italic("t"), ")")),
          title = expression(paste("The ", italic("t"), " Distribution"))
        ) + coord_cartesian(xlim = c(-8, 8))
    })
  },
  options = list(height = 900)
)
OK, fine.
Start by loading the rstatix package, which is the easiest way I’ve found to do t-tests. Add this code to your setup chunk:
if (!require("rstatix")) install.packages("rstatix")
library(rstatix)
This gives us the t_test function, which has the following arguments:
t_test(data = my_data, # What data are you using?
       DV ~ GroupVariable, # This is the formula, with the dependent variable on the left
                           # and our grouping variable on the right of the ~
       mu = 0, # This is only for one-sample t-tests.
               # Leave it out if you're not doing a one-sample test
       paired = TRUE # This is only for paired t-tests.
                     # Leave it out if you're not doing a paired test
)
There are 3 kinds of t-tests: one-sample, two-sample (independent), and paired.
Let’s start with a one-sample t test. In this kind of test, we only have 1 group, which we compare to a fixed expected value.
Like this:
library(nycflights13) # Remember this?
t_test(data = flights, # Using the flights data
       dep_delay ~ 1, # DV is dep_delay. 1 means we're doing a one-sample t-test
       mu = 0) # Comparing mean dep_delay to 0
Before we run this command, let’s look at the arguments:
What question are we trying to answer with this test? (Hint: Is/are ___ different from ___?)
Now let’s see the output of our test.
## # A tibble: 1 × 7
## .y. group1 group2 n statistic df p
## * <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 dep_delay 1 null model 328521 180. 328520 0
There’s a lot of information here. The most important bits of info are:
* statistic is the t statistic
* df is the degrees of freedom
* p is the p-value
  * The p-value is the probability of getting a t statistic at least that extreme in a distribution with those degrees of freedom - that is, if the null hypothesis were true.
  * This p-value is so small, they just gave us a 0. It’s a very small number.
In order to do the three t-tests that follow, you will need to install and load the following packages:
Add this code to your setup chunk.
if (!require("rstatix")) install.packages("rstatix")
library(rstatix)
if (!require("datarium")) install.packages("datarium")
library(datarium)
## # A tibble: 1 × 7
## .y. group1 group2 n statistic df p
## * <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 weight 1 null model 10 -8.10 9 0.00002
Now let’s do a two-sample t-test. In this kind of test, we compare two groups to each other.
We’ll start by doing the test like this:
t_test(data = flights,
       dep_delay ~ origin)
# formula: dep_delay is the dependent variable
# origin is the grouping variable
Well, it looks like it did 3 t-tests, comparing the mean departure delays of each airport to each other airport. This isn’t what we want right now, but it will come in handy later.
For now let’s just look at one pair of airports, by filtering out the New Jersey one.
filter(flights, origin != "EWR") %>%
  # I have to filter the data because there are three airports, but
  # a t-test can only compare 2 things
  t_test(data = ., dep_delay ~ origin)
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 dep_delay JFK LGA 109416 101509 10.2 208866. 1.25e-24
Notice the differences in arguments compared to the one-sample test. The data argument works the same way, but now the right side of the formula is a grouping variable (origin) instead of 1, and there is no mu argument.
The output should look pretty familiar:
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 weight F M 20 20 -20.8 26.9 4.30e-18
A paired t-test is a special type of two-sample t-test. This type of test is used when the observations are not independent. Instead, they come in neat little pairs. Usually, this means each pair of data points came from the same person, or the same place, or something similar. For example, look at this data:
| Musher | Checkpoint | Time |
|---|---|---|
| Jerry Sousa | 1 | 243 |
| Jerry Sousa | 2 | 176 |
| Jerry Sousa | 3 | 304 |
| Jerry Sousa | 4 | 201 |
| Melissa Owens | 1 | 215 |
| Melissa Owens | 2 | 421 |
| Melissa Owens | 3 | 334 |
| Melissa Owens | 4 | 220 |
Jerry’s times and Melissa’s times match up neatly into pairs - Checkpoint 1 goes with Checkpoint 1, Checkpoint 2 with Checkpoint 2, and so on. So before we compare the two mushers’ times, we should tell R to match those times up into pairs. The t statistic for a paired test is computed differently, and there are fewer degrees of freedom (degrees of freedom are based on the number of pairs).
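One snag: the iditaroddata object isn’t created anywhere above. If you want to run the next chunk yourself, here is one way to build it from the table (the values come straight from the table, but the construction itself is my sketch; it assumes the tidyverse is already loaded):
iditaroddata <- tibble(
  Musher = rep(c("Jerry Sousa", "Melissa Owens"), each = 4), # 4 checkpoint times per musher
  Checkpoint = rep(1:4, times = 2),
  Time = c(243, 176, 304, 201, 215, 421, 334, 220) # Times from the table, in checkpoint order
)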
To make R do a paired t-test, we just add one argument to our function:
iditaroddata %>%
  t_test(data = ., # I used a . because it's in a pipe. Otherwise just type the data name
         Time ~ Musher, # This is the formula. Dependent variable on the left of the ~,
                        # predictor variable on the right
         paired = TRUE) # This is the argument that makes it a paired t-test
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Time Jerry Sousa Melissa Owens 4 4 -1.09 3 0.354
The output is the same as the two-sample t-test.
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 weight after before 10 10 25.5 9 0.00000000104
For this task, we will be replicating part of this unpublished paper:
Protzko, J., Zedelius, C. M., & Schooler, J. (2018). Rushing towards virtue: time pressure increases socially desirable responding.
https://osf.io/preprints/psyarxiv/y9vpj
The Open Science Framework Page for this data can be found here: