Here’s a little code I wrote.
Part of it might look familiar - it creates scores for 4 different Dungeons and Dragons characters.
It also randomly makes one of these players a dirty, dirty CHEATER by adding +1 to all of their ability scores. Shame!
Except that sometimes it doesn’t make anyone a cheater. Can you tell why?
Players <- c("Travis", "Laura", "Liam", "Ashley")
CharacterStatsNames <- c("Strength", "Dexterity", "Constitution", "Intelligence", "Wisdom", "Charisma")
CharacterStats <- c(0, 0, 0, 0, 0, 0)
Which_Cheater <- sample(1:5, 1, replace = TRUE) # Suspicious...
Player_Data <- data.frame()
for (Which_Player in 1:length(Players)) { # Make ability scores for each player
  for (Which_Stat in 1:6) { # Create a for loop that will iterate 6 times
    sample(1:6, 4, replace = TRUE) %>% # Roll the die
      sort(., decreasing = TRUE) %>% # Sort vector from largest to smallest
      .[-4] %>% # Drop the last (smallest) number
      sum(.) %>% # Add up elements in vector
      { if (Which_Player == Which_Cheater) # Add 1 to all the cheater's rolls!
          . + 1 else . } ->
      CharacterStats[Which_Stat] # Input score into CharacterStats vector
  } # Repeat this process 6 times, so that you have 6 different numbers in a vector
  Player_Data <- rbind(Player_Data, c(Players[Which_Player], CharacterStats))
}
Player_Data <- as_tibble(Player_Data)
colnames(Player_Data) <- c("Player_Name", "Strength", "Dexterity", "Constitution", "Intelligence", "Wisdom", "Charisma")
Player_Data <- pivot_longer(Player_Data, 2:7, names_to = "Stat", values_to = "Score") %>%
  mutate(Player_Name = as.factor(Player_Name), Stat = as.factor(Stat), Score = as.numeric(Score))
Here are the scores:
Let’s re-roll the scores a few times and see if it’s always obvious who the cheater is. Add this code to your setup chunk:
if (!require("shiny")) install.packages("shiny")
library(shiny)
Now copy and run this code chunk.
shinyApp(
  ui = fluidPage(
    actionButton("action", "Re-roll Scores"),
    plotOutput("plot")
  ),
  server = function(input, output) {
    observeEvent(input$action, {
      output$plot = renderPlot({
        Players <- c("Travis", "Laura", "Liam", "Ashley", "Nobody")
        CharacterStatsNames <- c("Strength", "Dexterity", "Constitution", "Intelligence", "Wisdom", "Charisma")
        CharacterStats <- c(0, 0, 0, 0, 0, 0)
        Which_Cheater <- sample(1:5, 1, replace = TRUE) # Suspicious...
        Player_Data2 <- data.frame()
        for (Which_Player in 1:(length(Players) - 1)) { # Make ability scores for each player
          for (Which_Stat in 1:6) { # Create a for loop that will iterate 6 times
            sample(1:6, 4, replace = TRUE) %>% # Roll the die
              sort(., decreasing = TRUE) %>% # Sort vector from largest to smallest
              .[-4] %>% # Drop the last (smallest) number
              sum(.) %>% # Add up elements in vector
              { if (Which_Player == Which_Cheater) # Add 1 to all the cheater's rolls!
                  . + 1 else . } ->
              CharacterStats[Which_Stat] # Input score into CharacterStats vector
          } # Repeat this process 6 times, so that you have 6 different numbers in a vector
          Player_Data2 <- rbind(Player_Data2, c(Players[Which_Player], CharacterStats))
        }
        Player_Data2 <- as_tibble(Player_Data2)
        colnames(Player_Data2) <- c("Player_Name", "Strength", "Dexterity", "Constitution", "Intelligence", "Wisdom", "Charisma")
        Player_Data2 <- pivot_longer(Player_Data2, 2:7, names_to = "Stat", values_to = "Score") %>%
          mutate(Player_Name = as.factor(Player_Name), Stat = as.factor(Stat), Score = as.numeric(Score))
        ggplot(data = Player_Data2, aes(x = Stat, y = Score)) +
          geom_bar(stat = "identity", aes(color = Stat, fill = Stat)) +
          theme_bw() +
          scale_color_manual(values = c("red4", "orange4", "yellow4", "green4", "blue4", "purple4")) +
          scale_fill_manual(values = c("red1", "orange1", "yellow1", "green1", "blue1", "purple1")) +
          facet_grid(. ~ Player_Name) +
          labs(
            x = NULL,
            title = "Who is the Cheater?",
            caption = paste("The cheater is", Players[Which_Cheater])
          ) +
          theme(legend.position = "none",
                axis.text.x = element_text(angle = 90, size = 15),
                axis.text.y = element_text(size = 15),
                strip.text = element_text(size = 25))
      })
    })
  },
  options = list(height = 500)
)
Press “Re-roll Scores” a few times and try to guess who the cheater is. The answer is in the bottom right, but don’t peek until you’ve guessed.
Can you tell who the cheater is? How sure are you? Sometimes you’re probably pretty sure, but other times, you might not be. Or you might be sure, but wrong!
Sometimes it feels pretty obvious who the cheater is. But usually it’s not - and it is hard to ever be 100% certain. Why? Because…
Consider other, real-world data sets. What sources of variability might hide group differences?
In this unit, we’ll start to talk about how we can use inferential statistics to see through variability to find systematic patterns.
Let’s explore more specifically why it is hard to see group differences in noise.
quizmeans <- c(50, 50, 60) # Sets the 3 means
quizsd <- 20 # Sets the standard deviation
quizmeans <- sample(quizmeans) # Randomizes the order of the means
Quiz <- c(1:100) # Numbers the quizzes 1-100, to match the 100 grades per student below
Johnny_Appleseed <- round(rnorm(100, mean = quizmeans[1], sd = quizsd))
Paul_Bunyan <- round(rnorm(100, mean = quizmeans[2], sd = quizsd))
John_Henry <- round(rnorm(100, mean = quizmeans[3], sd = quizsd))
# Combines the four vectors into a data frame
Quizdata <- data.frame(Quiz, Johnny_Appleseed, Paul_Bunyan, John_Henry)
# Prints the data
Quizdata
# Makes a density plot
Quizdata %>%
  pivot_longer(cols = 2:4, names_to = "Student", values_to = "Grade") %>%
  ggplot(aes(x = Grade)) +
  geom_density(aes(fill = Student), alpha = 0.5) +
  geom_density(aes(color = Student)) +
  theme_bw()
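If you want to see the numbers behind that plot, here is one quick way to compare each student’s average grade and spread to the true means we set above (my addition; it assumes the tidyverse is already loaded, as in the chunk above):
Quizdata %>%
  pivot_longer(cols = 2:4, names_to = "Student", values_to = "Grade") %>%
  group_by(Student) %>% # One row per student
  summarize(Mean_Grade = mean(Grade), SD_Grade = sd(Grade)) # Compare these to the means of 50, 50, and 60 we set above
The averages land near the means we chose, but with a standard deviation of 20 the individual grades overlap heavily - that’s why the density curves are so hard to tell apart.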
Any time we do statistics, we are working with at least 1 variable. We’ll return to this when we discuss individual statistical tests, because different tests apply to different sets of variables.
In our cheater example, what is/are the variable(s)?
What is the population in our cheater example? What is the sample?
| Research Question | Population | (Possible) Sample |
|---|---|---|
| Do people want to be more moral? | | |
| Does experiencing a natural disaster temporarily boost relationship satisfaction in newlywed couples? | | |
| Do behavioral interventions improve sustained attention in those diagnosed with ADHD? | | |
| Is it Great being Nate? Rates of narcissism among individuals named Nate. | | |
| Who survived the sinking of the Titanic? | | |
We start with a null hypothesis. We assume that our variable does not differ from whatever we’re comparing it to. This might be a fixed number, a couple of different groups, etc.
In other words, we assume that there’s nothing to see here.
Why do we start with this null hypothesis? What’s the logic here?
Is it possible to prove that something is true? What would you have to do to prove the following?
Hopefully you see that this would be a tall order:
In other words, to prove that a universal statement (ALL, EVERY) is TRUE, you have to examine the population. We can’t do that, usually - if we had access to the population, we wouldn’t need statistics!
Is it possible to prove that something is false? What would you have to do to disprove the following?
Right! Much easier:
So, you can disprove a universal statement by examining a sample. This is much more doable.
So, how do we test whether someone in our D&D group is a cheater?
First, we assume that there is no cheater - everyone in our D&D group is honest.
We won’t believe that someone is a cheater unless we have strong evidence.
Then we plan to look for evidence to contradict this assumption. That’s step 2.
| Research Question | Null Hypothesis |
|---|---|
| Do people want to improve moral traits more than morally neutral traits? | |
| Does experiencing a natural disaster temporarily boost relationship satisfaction in newlywed couples? | |
| Do behavioral interventions improve sustained attention in those diagnosed with ADHD? | |
| Is it Great being Nate? Rates of narcissism among individuals named Nate. | |
| Did proportionately more women than men survive the sinking of the Titanic? | |
Gathering evidence to test our null hypothesis is an art unto itself.
If you want to know how to do this, you’ll have to take a different class.
But remember this:
In this step, we look at our data and ask ourselves - do we have any evidence against the null hypothesis?
What counts as evidence against the null hypothesis? Let’s consider our D&D cheater. If we compare their scores to the expected scores for non-cheating players, how different are they?
To answer this question, we have to know what is expected. We need a distribution. So, I rolled up 100,000 D&D characters and computed their average scores (the mean of all 6 of their ability scores). The average was about 12.24. Then I compared each individual set of scores to the average of all the scores to see how far it was from 12.24. In other words, how different was each score from the mean score? The graph below shows what proportion of scores were within 1 point of 12.24, 2 points, 3 points, etc.
Notice how most scores were pretty close to the mean. These scores happen a lot, while other scores farther from the mean are quite rare. Importantly, you can see that even when no cheating is involved, there is a lot of variability in the ability scores of D&D characters, with extreme scores happening once in a while.
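The simulation code itself isn’t shown here, but a minimal sketch of the idea might look like this (my reconstruction, not necessarily the exact code that made the graph):
n_characters <- 100000 # Roll up 100,000 honest characters
honest_means <- replicate(n_characters, {
  ability_scores <- replicate(6, sum(sort(sample(1:6, 4, replace = TRUE), decreasing = TRUE)[-4])) # 6 scores: roll 4d6, drop the lowest
  mean(ability_scores) # This character's average ability score
})
mean(honest_means) # Comes out close to 12.24
hist(abs(honest_means - mean(honest_means))) # How far each character's average falls from the overall mean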
Let’s consider some specific examples. If a player gets an average score around 12, that’s only about 0.25 points below the average, so that player has a score that occurs quite often for non-cheaters. We have no reason to believe that this player is a cheater.
BUT, what if a player got an average score of about 16? That’s 3.75 points higher than average. Look at the graph - scores 3.75 points above the average don’t happen often in non-cheaters. So, we have reason to suspect that the player is, in fact, a cheater.
So, in sum, to find evidence against the null hypothesis, we need two things:
1. A statistic - a single number that summarizes our data (here, a player’s average ability score).
2. A distribution that tells us how likely different values of that statistic are when the null hypothesis is true (here, when nobody is cheating).
This is true for all the tests we will talk about going forward.
So, let’s look at our players, shall we? We have:
1. Each player’s mean ability score.
2. The distribution of mean scores from the 100,000 honestly-rolled characters above.
Using these two bits of information, we can figure out how unexpected each player’s mean score is.
This graph shows the different probabilities for each of our players’ mean scores. We’ll need these probabilities for Step 4.
Now we have to decide. We have two choices: reject the null hypothesis (accuse someone of cheating), or retain the null hypothesis (keep assuming everyone is honest).
What is the basis for our decision? We use the probabilities we got in step 3. If those probabilities are small enough, we reject, otherwise we retain.
What does “small enough” mean? Usually, this means a probability of 0.05, or 5%.
Why? Let’s talk about this in the next section.
Take a look at this histogram. The white bars represent the distribution of ‘honest’ scores - scores rolled in the usual way without cheating. The red bars show the distribution of ‘cheaty’ scores - scores where the dishonest player added +1 to each roll. These two distributions are highly overlapping (the pink area).
Remember that this is the population distribution - we don’t have any real access to this. We only have a sample - a player who may or may not be a cheater.
We need to set a decision threshold that helps us identify cheaters. Specifically, we want to accuse players who really are cheating. We do not want to falsely accuse players who are honest.
That black dashed line shows a decision threshold of about 5%, meaning that about 5% of the ‘honest scores’ are above this threshold; we will falsely accuse honest players about 5% of the time. But most of the scores above this threshold are ‘cheaty scores’, so usually we will only accuse cheaters of being cheaters.
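If you’d like to see where a threshold like that comes from, here’s a rough sketch using the same kind of simulated honest scores (my addition, not the code that made the histogram):
honest_means <- replicate(10000, mean(replicate(6, sum(sort(sample(1:6, 4, replace = TRUE))[-1])))) # 10,000 honest average scores
threshold <- quantile(honest_means, 0.95) # 95% of honest averages fall below this value
threshold
mean(honest_means > threshold) # About 0.05: how often an honest player would land above the line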
What would happen if we moved this line to the right? (i.e. lower than 5%)
What would happen if we moved our decision threshold line to the left? (i.e. higher than 5%)
So you see there is no perfect solution. As we try to avoid false accusations, we make getting away with cheating more likely. And as we try to crack down on cheaters, we make more false accusations. We can’t eliminate these two mistakes, only try to balance them against each other.
| Decision | Honest Player | Cheaty Player |
|---|---|---|
| Accuse (Reject Null Hypothesis) | FALSE ALARM (Type 1 Error) | Correct! |
| Don’t Accuse (Retain Null Hypothesis) | Correct! | MISS (Type 2 Error) |
A Type 1 Error is a false accusation - we say we found a cheater when we really didn’t.
A Type 2 Error is a miss - we continue to believe our cheating player is honest.
So we have to pick a decision threshold that balances these two types of errors. And 0.05 (5%) is a pretty good balance.
So, if our threshold is 0.05, is there anyone we want to accuse of cheating?
| | Retain Null Hypothesis | Reject Null Hypothesis |
|---|---|---|
| Null Hypothesis is Really False | | |
| Null Hypothesis is Really True | | |
So, we’ve learned that there are 4 steps to making a statistical inference:
1. Assume the null hypothesis - there’s nothing to see here.
2. Gather evidence (data).
3. Evaluate the evidence - how unexpected is our data if the null hypothesis is true?
4. Decide - reject the null hypothesis, or retain it.
We’ve also learned that we need four things to make this decision:
1. A variable (or variables).
2. A statistic that summarizes our data.
3. A distribution that tells us how expected (or unexpected) that statistic is.
4. A decision threshold - an alpha level.
Now let’s apply what we’ve learned to some actual statistical tests, starting with the ever-popular t-test
Let’s talk about these 4 things in relation to the t-test.
We can divide our variables into dependent and independent/predictor variables.
A dependent variable is the variable we are measuring or testing. It should always be a number!
What’s the dependent variable in our cheater example?
An independent/predictor variable is a variable that we think might be influencing the dependent variable, so that different values of this independent/predictor variable are associated with different values of the dependent variable.
First question: What’s the difference between a predictor and an independent variable?
We treat them the same in R, but it’s important to make this distinction when we describe our analyses.
In a t-test, the independent/predictor variable can only be a factor. AND IT CAN NEVER HAVE MORE THAN TWO LEVELS. This is a peculiarity of the t-test; you can never compare three or more groups with a single t-test.
What statistic do you think the t-test uses?
Right, the t statistic!
Here is one version of the formula used to compute the t statistic. The t statistic is a way to express, in a single number, how different two groups are from each other.
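The figure with the formula isn’t reproduced here, but one common version - the independent-samples t with unequal variances, which is what the t_test function uses by default - looks roughly like this:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$

Here $\bar{x}_1$ and $\bar{x}_2$ are the two group means, $s_1^2$ and $s_2^2$ are the group variances, and $n_1$ and $n_2$ are the group sizes.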
Note three things:
1. The bigger the difference between the two group means, the bigger the t statistic.
2. The more variability there is within the groups, the smaller the t statistic.
3. The bigger the samples, the bigger the t statistic (all else being equal).
You don’t usually have much control over #1 or #2, although good experimental design choices can help here (see Step 2, above). But researchers have more control over #3; by increasing the sample size, a researcher can make it easier to see differences between groups, if they exist.
| Make t statistic bigger | Make t statistic smaller |
|---|---|
| ? | ? |
| ? | ? |
Once we have a statistic, we have to compare it to a distribution to figure out how expected that t statistic is. Others have blazed this trail for us: here is the t distribution:
The t distribution isn’t always exactly the same, though. It changes shape depending on the degrees of freedom
Degrees of Freedom are related to our sample size: the larger the sample size, the more degrees of freedom we have. (For example, a classic two-sample t-test with equal variances has n1 + n2 - 2 degrees of freedom.)
Alpha Level is the threshold we set (meaning we choose it ourselves) for deciding if we should retain or reject the null hypothesis. (Remember step 4, above?) If our t value falls into the red area, we will reject the null hypothesis.
Most of the time, you set the Alpha Level to 0.05, so that only a small part of the t distribution is red. If you were to set it to 0.25, a lot more of the distribution would become red.
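You can see both of these effects for yourself with qt(), the same function the app below uses to find where the red area starts (these particular numbers are just illustrations):
qt(1 - 0.05, df = 10)  # The cutoff where the red area starts at an Alpha Level of .05 with 10 degrees of freedom
qt(1 - 0.25, df = 10)  # A bigger Alpha Level moves the cutoff toward the middle, so more of the curve is red
qt(1 - 0.05, df = 100) # More degrees of freedom pulls the tails in, so the cutoff moves in slightly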
Review: Assuming we keep the Alpha Level the same but increase the degrees of freedom, what type of error do we reduce? Type 1 or Type 2?
Copy the code chunk below into your R markdown document. Run it. Set the Alpha Level to 0.05, then play around with the Degrees of Freedom. Then answer the questions below.
if (!require("shiny")) install.packages("shiny")
library(shiny)
shinyApp(
  ui = fluidPage(
    fluidRow(
      column(5, wellPanel(
        sliderInput("df", label = h3("Degrees of Freedom"), min = 1,
                    max = 25, value = 1)
      )),
      column(5, wellPanel(
        sliderInput("alpha", label = h3("Alpha Level"), min = 0,
                    max = 0.5, value = 0)
      ))
    ),
    fluidRow(
      column(10, offset = 0,
             plotOutput("plot")
      )
    )
  ),
  server = function(input, output) {
    output$plot = renderPlot({
      ggplot(data.frame(x = c(-8, 8)), aes(x = x)) + theme_bw() +
        stat_function(fun = dt, geom = "area", fill = "red1", # This creates the red area under the curve
                      xlim = c(qt(1 - input$alpha, df = input$df), 8), # What does this red area represent?
                      args = list(df = input$df),
                      color = "red", size = 2) +
        stat_function(fun = dt, args = list(df = input$df), color = "black", size = 2) + # This creates the t distribution
        scale_x_continuous(breaks = -8:8, labels = -8:8) +
        labs(
          x = expression(italic("t")),
          y = expression(paste("P(", italic("t"), ")")),
          title = expression(paste("The ", italic("t"), " Distribution"))
        ) + coord_cartesian(xlim = c(-8, 8))
    })
  },
  options = list(height = 900)
)
OK, fine.
Start by loading the rstatix package, which is the easiest way I’ve found to do t-tests. Add this code to your setup chunk:
if (!require("rstatix")) install.packages("rstatix")
library(rstatix)
This gives us the t_test function, which has the following arguments:
t_test(data = my_data, # What data are you using?
       DV ~ GroupVariable, # This is the formula, with the dependent variable on the left
                           # and our grouping variable on the right of the ~
       mu = 0, # This is only for one-sample t-tests.
               # Leave it out if you're not doing a one-sample test
       paired = TRUE # This is only for paired t-tests.
                     # Leave it out if you're not doing a paired test
)
There are 3 kinds of t-tests: one-sample, two-sample (independent), and paired.
Let’s start with a one-sample t test. In this kind of test, we only have 1 group, which we compare to a fixed expected value.
Like this:
library(nycflights13) # Remember this?
t_test(data = flights, # Using the flights data
       dep_delay ~ 1, # DV is dep_delay. 1 means we're doing a one-sample t-test
       mu = 0) # Comparing mean dep_delay to 0
Before we run this command, let’s look at the arguments:
What question are we trying to answer with this test? (Hint: Is/are ___ different from ___?)
Now let’s see the output of our test.
## # A tibble: 1 × 7
## .y. group1 group2 n statistic df p
## * <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 dep_delay 1 null model 328521 180. 328520 0
There’s a lot of information here. The most important bits of info are:
* statistic is the t statistic
* df is the degrees of freedom
* p is the p-value
  * The p-value is the probability of getting a t statistic at least that extreme in a distribution with those degrees of freedom - that is, if the null hypothesis were true.
  * This p-value is so small, they just gave us a 0. It’s a very small number.
In order to do the three t-tests that follow, you will need to install and load the following packages:
Add this code to your setup chunk.
if (!require("rstatix")) install.packages("rstatix")
library(rstatix)
if (!require("datarium")) install.packages("datarium")
library(datarium)
## # A tibble: 1 × 7
## .y. group1 group2 n statistic df p
## * <chr> <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 weight 1 null model 10 -8.10 9 0.00002
Now let’s do a two-sample t-test. In this kind of test, we compare two groups to each other.
We’ll start by doing the test like this:
t_test(data = flights,
       dep_delay ~ origin)
# formula: dep_delay is the dependent variable
# origin is the grouping variable
Well, it looks like it did 3 t-tests, comparing the mean departure delays of each airport to each other airport. This isn’t what we want right now, but it will come in handy later.
For now let’s just look at one pair of airports, by filtering out the New Jersey one.
filter(flights, origin != "EWR") %>%
  # I have to filter the data because there are three airports, but
  # a t-test can only compare 2 things
  t_test(data = ., dep_delay ~ origin)
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 dep_delay JFK LGA 109416 101509 10.2 208866. 1.25e-24
Notice the differences in arguments compared to the one-sample test. The data argument works the same way, but now the right side of the formula is a grouping variable (origin) instead of 1, and there is no mu argument.
The output should look pretty familiar:
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 weight F M 20 20 -20.8 26.9 4.30e-18
A paired t-test is a special type of two-sample t-test. This type of test is used when the observations are not independent. Instead, they come in neat little pairs. Usually, this means each pair of data points came from the same person, or the same place, or something similar. For example, look at this data:
| Musher | Checkpoint | Time |
|---|---|---|
| Jerry Sousa | 1 | 243 |
| Jerry Sousa | 2 | 176 |
| Jerry Sousa | 3 | 304 |
| Jerry Sousa | 4 | 201 |
| Melissa Owens | 1 | 215 |
| Melissa Owens | 2 | 421 |
| Melissa Owens | 3 | 334 |
| Melissa Owens | 4 | 220 |
Jerry’s times and Melissa’s times match up neatly into pairs - Checkpoint 1 goes with Checkpoint 1, Checkpoint 2 with Checkpoint 2, and so on. So before we compare the two mushers’ times, we should tell R to match those times up into pairs. The t statistic for a paired test is computed differently, and there are fewer degrees of freedom (degrees of freedom are based on the number of pairs).
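One snag: the iditaroddata object isn’t created anywhere above. If you want to run the next chunk yourself, here is one way to build it from the table (the values come straight from the table, but the construction itself is my sketch; it assumes the tidyverse is already loaded):
iditaroddata <- tibble(
  Musher = rep(c("Jerry Sousa", "Melissa Owens"), each = 4), # 4 checkpoint times per musher
  Checkpoint = rep(1:4, times = 2),
  Time = c(243, 176, 304, 201, 215, 421, 334, 220) # Times from the table, in checkpoint order
)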
To make R do a paired t-test, we just add one argument to our function:
iditaroddata %>%
  t_test(data = ., # I used a . because it's in a pipe. Otherwise just type the data name
         Time ~ Musher, # This is the formula. Dependent variable on the left of the ~,
                        # predictor variable on the right
         paired = TRUE) # This is the argument that makes it a paired t-test
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 Time Jerry Sousa Melissa Owens 4 4 -1.09 3 0.354
The output is the same as the two-sample t-test.
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 weight after before 10 10 25.5 9 0.00000000104
For this task, we will be replicating part of this unpublished paper:
Protzko, J., Zedelius, C. M., & Schooler, J. (2018). Rushing towards virtue: time pressure increases socially desirable responding.
https://osf.io/preprints/psyarxiv/y9vpj
The Open Science Framework Page for this data can be found here: