As (budding) scientists, we want to be able to work with real data.
This data will come from different places, and we will need to be able to
read it into R before we can analyse the data.
Before we even think about bringing the data into R, we should take some time to understand our data. You’ll want to ask and answer some or all of the following questions:
Wide data does NOT mean the data is big or has lots of columns! Data is wide if the same variable is measured multiple times and each measure is in the same row. Consider the example below, in which we measured Tim’s quiz scores 4 times, and put all the scores in the same row. That’s wide data.
| Student | Quiz 1 | Quiz 2 | Quiz 3 | Quiz 4 |
|---|---|---|---|---|
| Tim | 84 | 76 | 81 | 67 |
| Anna | 98 | 45 | 78 | 61 |
Here’s another example:
| Participant | Age | Sex | Anxiety Score Time 1 | Anxiety Score Time 2 | Treatment Group |
|---|---|---|---|---|---|
| Joanna | 25 | F | 8 | 7 | Control |
| Diego | 31 | M | 6 | 4 | Intervention |
This data has a lot of columns, but that’s not what makes it wide. It’s wide because Anxiety Score was measured twice, at Time 1 and Time 2. Since there are two columns that measure the same thing, we have wide data.
Wide data is easier for humans to read, but not so great for computers. So, we do NOT want wide data in R. If we have wide data, we know that we will have to turn it into long data after we read the data into R.
Long data does not mean there are a lot of rows in the data! It means that each row is a unique observation - a single data point. There are no repeated measurements within a row; each row holds one observation per subject.
This IS what we want in R. Long data = happy R.
Here is the quiz data converted to long format. Each row has a single quiz on it. Some of the other variables are repeated, but that’s OK - computer memory is cheap and plentiful these days.
| Student | Quiz_Number | Score |
|---|---|---|
| Tim | Quiz 1 | 84 |
| Tim | Quiz 2 | 76 |
| Tim | Quiz 3 | 81 |
| Tim | Quiz 4 | 67 |
| Anna | Quiz 1 | 98 |
| Anna | Quiz 2 | 45 |
| Anna | Quiz 3 | 78 |
| Anna | Quiz 4 | 61 |
Here’s the anxiety data in long format. Again, some of the variables are repeated across rows, but that’s OK because we got what we wanted: each row represents a unique data point.
| Participant | Age | Sex | Time | Anxiety_Score | Treatment_Group |
|---|---|---|---|---|---|
| Joanna | 25 | F | 1 | 8 | Control |
| Joanna | 25 | F | 2 | 7 | Control |
| Diego | 31 | M | 1 | 6 | Intervention |
| Diego | 31 | M | 2 | 4 | Intervention |
If you know the difference between within-subjects and between-subjects data, congratulations on being super cool and popular. You may recognize that when we talked about wide and long data we were talking about within-subjects data where each participant has 2 or more observations. What about between-subjects data, where there is only one observation from each participant? In that case, there are no observations to repeat, so the data is already in a good format for R. Hooray!
| Musher | Checkpoint 1 Time | Checkpoint 2 Time | Checkpoint 3 Time | Checkpoint 4 Time |
|---|---|---|---|---|
| Jerry Sousa | 04:03 | 02:56 | 05:04 | 03:21 |
| Melissa Owens | 03:35 | 07:01 | 05:34 | 03:40 |
Study variables can take one of 3 data types:
Column names must conform to the same naming rules as variables (no spaces, no weird characters); there’s a quick sketch of this right after these questions.
Are there any variables/columns that are missing? Do we need to create or compute them ourselves?
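Not sure whether a column name is ‘legal’? A quick check - just a sketch, using the column names from the examples above - is base R’s make.names(), which shows how names with spaces or odd characters would get sanitized:
make.names(c("Quiz 1", "Anxiety Score Time 1", "Participant")) # spaces become dots
# [1] "Quiz.1"  "Anxiety.Score.Time.1"  "Participant"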
Once you feel that you adequately understand the data, you can bring it into R.
Ideally, you want your data to be a comma-separated (.csv) or tab-delimited (.txt) file.
We’ll need to make some data so we can practice working with it.
| Participant | Condition | Observation |
|---|---|---|
| 1 | Condition A | 5 |
| 2 | Condition A | 2 |
| 3 | Condition A | 7 |
| 4 | Condition A | 4 |
| 5 | Condition A | 1 |
| 1 | Condition B | 10 |
| 2 | Condition B | 4 |
| 3 | Condition B | 10 |
| 4 | Condition B | 5 |
| 5 | Condition B | 9 |
| 1 | Condition C | 1 |
| 2 | Condition C | 3 |
| 3 | Condition C | 8 |
| 4 | Condition C | 2 |
| 5 | Condition C | 5 |
Now save the data as a .csv file called FakeData.csv.
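If you’d rather build and save the practice file from R instead of a spreadsheet program, here’s a minimal sketch (the values match the table above):
library(tidyverse)
fake_data <- tibble(
  Participant = rep(1:5, times = 3), # participants 1-5, once per condition
  Condition = rep(c("Condition A", "Condition B", "Condition C"), each = 5),
  Observation = c(5, 2, 7, 4, 1, 10, 4, 10, 5, 9, 1, 3, 8, 2, 5)
)
write_csv(fake_data, "FakeData.csv") # saves the file next to your R project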
# This looks for the file in the same directory as our R project
my_data <- read_csv("FakeData.csv")
# Notice that read_csv has an underscore, not a period (.). read.csv() is an existing function in base R, but read_csv() is a bit better.
# my_file <- read_tsv("FakeData.txt") # read in a tab-delimited .txt file
# my_file <- read_delim("FakeData.csv", delim = "/") # read in a file with a unique delimiter (e.g. "/")
my_data <- read_csv("./data/FakeData.csv")
# This looks for the file in the "data" subdirectory of our R project folder
When we import our data into R, we sometimes need to tell R what missing data looks like. We can do this by adding na = to our read command:
my_data <- read_csv("FakeData.csv", na = c("", "NA", ".", "#REF!"))
This tells R to treat any of the following as NA (missing data): empty cells (""), the text "NA", a period ("."), and Excel’s "#REF!" error. We can add anything we need to this vector.
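For example, if your file happened to use -999 as a missing-data code (a made-up convention here), you could add it:
my_data <- read_csv("FakeData.csv", na = c("", "NA", ".", "#REF!", "-999"))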
In a new code chunk, do the following:
You can see how many rows and columns your data has by looking in the ‘Environment’ tab (top right).
There are also several commands that allow you to examine your data:
ncol(my_data) # Number of columns
nrow(my_data) # Number of rows
dim(my_data) # Get rows and columns
colnames(my_data) # What are the columns/variables
summary(my_data) # Shows a summary of each column in the data
head(my_data) # Shows the first few rows
tail(my_data) # Shows the last few rows
If you want to see your whole dataset, go to the ‘Environment’ tab and click on it there, or use the View() command. RStudio will open the data in a separate tab.
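For example (note the capital V - that’s the base R spelling):
View(my_data) # Opens my_data in its own spreadsheet-style tab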
Usually, we will need to do some things to our data before it is fully ready to be analyzed.
my_data_long <- pivot_longer(my_data_wide, # What data are you starting with?
cols = c(2:4), # This vector says which columns should be pivoted. Any column not listed is not pivoted.
# Typically, participant and demographic columns are not pivoted.
# Only the columns with repeated values are pivoted
names_to = "ColumnNamesGoHere", # What new column should the existing column names go into?
values_to = "CellValuesGoHere") # What new column should the existing column names go into?
This command can do some complex things, like splitting column names up into separate columns, selecting a subset of the data to pivot, and more. See ?pivot_longer for more info.
Here’s a specific example. We’re starting with this wide data, which we’ll call Quiz_Data:
| Student | Quiz 1 | Quiz 2 | Quiz 3 | Quiz 4 |
|---|---|---|---|---|
| Tim | 84 | 76 | 81 | 67 |
| Anna | 98 | 45 | 78 | 61 |
Here’s the command:
Quiz_Data_Long <- pivot_longer(Quiz_Data,
cols = c(2:5), # We only want to pivot the repeated quiz data (columns 2 to 5), not the participant name (column 1)
names_to = "Quiz_Number", # Take the column names ("Quiz 1", "Quiz 2", etc.) and put them in a column called "Quiz_Number"
values_to = "Score") # Take the quiz scores and put them in a column called "Score"
And here’s what we would get:
| Student | Quiz_Number | Score |
|---|---|---|
| Tim | Quiz 1 | 84 |
| Tim | Quiz 2 | 76 |
| Tim | Quiz 3 | 81 |
| Tim | Quiz 4 | 67 |
| Anna | Quiz 1 | 98 |
| Anna | Quiz 2 | 45 |
| Anna | Quiz 3 | 78 |
| Anna | Quiz 4 | 61 |
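By the way, you don’t have to pick the columns to pivot by number. Here’s a sketch of the same pivot using a tidyselect helper (this assumes all the quiz columns start with “Quiz”, as they do above):
Quiz_Data_Long <- pivot_longer(Quiz_Data,
                               cols = starts_with("Quiz"), # pick columns by name instead of position
                               names_to = "Quiz_Number",
                               values_to = "Score")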
Here’s some data we’ll call Anxiety_Study:
| Participant | Age | Sex | Anxiety Score Time 1 | Anxiety Score Time 2 | Treatment Group |
|---|---|---|---|---|---|
| Joanna | 25 | F | 8 | 7 | Control |
| Diego | 31 | M | 6 | 4 | Intervention |
Here’s the command:
Anxiety_Study_Long <- pivot_longer(Anxiety_Study,
cols = c(4:5), # We only want to pivot the anxiety score columns (columns 4 and 5), not the other ones
names_to = "Time",
# Take the column names ("Anxiety Score Time 1", "Anxiety Score Time 2") and put them in a column called "Time"
values_to = "Anxiety_Score") # Take the anxiety scores in the cells and put them in a column called "Anxiety_Score"
Anxiety_Study_Long <- mutate(Anxiety_Study_Long, Time = gsub("Anxiety Score Time ", "", Time), Time = as.numeric(Time))
#Take the words out of the time column and make it a number. You'll learn more about this later!
And here’s what we would get:
| Participant | Age | Sex | Time | Anxiety_Score | Treatment_Group |
|---|---|---|---|---|---|
| Joanna | 25 | F | 1 | 8 | Control |
| Joanna | 25 | F | 2 | 7 | Control |
| Diego | 31 | M | 1 | 6 | Intervention |
| Diego | 31 | M | 2 | 4 | Intervention |
Now you try: convert this wide data to long format using pivot_longer().
| Participant | Condition A | Condition B | Condition C |
|---|---|---|---|
| 6 | 8 | 5 | 1 |
| 7 | 6 | 2 | 8 |
| 8 | 7 | 7 | 0 |
| 9 | 5 | 4 | 0 |
| 10 | 3 | 1 | 8 |
| 11 | 0 | 10 | 0 |
| 12 | 9 | 4 | 0 |
| 13 | 4 | 10 | 1 |
| 14 | 1 | 5 | 3 |
| 15 | 1 | 9 | 4 |
If you’ve done it right, the data should look like this (I’ve shown the first few rows only):
| Participant | Condition | Observation |
|---|---|---|
| 6 | Condition A | 8 |
| 6 | Condition B | 5 |
| 6 | Condition C | 1 |
| 7 | Condition A | 6 |
| 7 | Condition B | 2 |
| 7 | Condition C | 8 |
So you want to combine two data sets together? Here’s how you do it!
i.e. if you have similar data (the same columns) from two different sets of participants or groups, use bind_rows():
my_file_combined <- bind_rows(my_file, my_file_2)
i.e. if you have the data FROM the participants (such as their performance on some test or measure) and data ABOUT the participants (such as demographic data) that you need to combine, use left_join() or one of its variants (right_join(), inner_join(), full_join()):
my_file_combined <- left_join(data_from_participants, data_about_participants, by = "Participant")
There must be at least one column in each data set with the same information (e.g. a unique participant ID number). Use by = to specify which column.
If the matching columns do not have the same name, you must tell R how to match them up, like so:
my_file_combined <- left_join(data_from_participants, data_about_participants, by = c("Participant" = "Respondent"))
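Here’s a minimal sketch with made-up data, just to show the matching in action (the tibbles and values below are hypothetical):
scores <- tibble(Participant = c(1, 2, 3), Score = c(12, 15, 9)) # data FROM the participants
demographics <- tibble(Respondent = c(1, 2, 3), Age = c(25, 31, 28)) # data ABOUT the participants, with a differently named ID column
combined <- left_join(scores, demographics, by = c("Participant" = "Respondent"))
# combined now has one row per participant, with both Score and Age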
The pipe operator [%>%, shortcut CTRL + Shift + M] lets you chain a series of commands together. As long as these commands all operate on the same data set, you can create a ‘pipeline’ where the data goes in one end and comes out the other end transformed.
Like so:
my_file_new <- read_csv("FakeData.csv", na = c("", "NA", ".", "#REF!")) %>%
# This line reads in the data
filter(Participant != 1) # This line filters the data
You can also write it like this, which makes the process clearer:
read_csv("FakeData.csv", na = c("", "NA", ".", "#REF!")) %>%
# This line reads in the data
filter(Participant != 1) -> # This line filters the data
my_file_new # New Name
(Image: a tangled mess of pipes, from https://www.dynamicdrainstx.com/wp-content/uploads/2018/12/pipe-mess.jpg)
Below are some examples of useful pipe commands. There are a lot more
than these! Google can teach you about them.
my_data_combined <- my_data_combined %>%
# we use the assigner like this when we want to change the data
rename(Participant = Subject) %>% # Change a column name from Subject to Participant
rename_with(toupper, starts_with("C")) %>% # Change the name of column(s) that start with 'C' to uppercase
rename_with(~ gsub("o", "O", .x)) # Replace all lower-case "o" strings with upper-case "O" strings in all column names
mutate() changes columns or their contents, and can also create new columns.
my_data_combined <- my_data_combined %>%
mutate(Participant = as.factor(Participant)) %>%
# Change participant to a factor
# as.numeric() or as.character() can be used here, too, to try and change a variable to number or string, respectively.
# Below, creates a new column called NewCondition
mutate(NewCondition = case_when(
Condition == "A" ~ "Control",
# if Condition equals "A", NewCondition is given the value "Control"
Condition == "B" ~ "Experimental",
# if Condition equals "B", NewCondition is given the value "Experimental"
TRUE ~ "Other" # What if it's not A or B?
# if Condition is not "A" or "B", NewCondition is given the value "Other"
), NewCondition = as.factor(NewCondition))
# And then we make NewCondition a factor
Note the use of the equal sign “=” instead of the assignment operator “<-” inside mutate().
select includes or excludes certain columns from our data
my_data_combined <- my_data_combined %>%
select(c(2:3)) # selects only columns 2 to 3, leaving out column 1
my_data_combined <- my_data_combined %>%
select(-c(3)) # selects all columns EXCEPT column 3
my_data_combined <- my_data_combined %>%
select(c("Participant", "Condition")) # selects only named columns
Filter keeps only rows for which a conditional statement is true
my_data_combined <- my_data_combined %>%
filter(Participant == 10) # only keeps participant 10
my_data_combined <- my_data_combined %>%
filter(Participant < 10) # only keeps participants 1 to 9
Use a pipe when you want to apply a series of functions to a single input.
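For example, here’s a minimal sketch that chains several of the verbs above (it assumes the Participant, Condition, and Observation columns from the practice data; your own pipeline will look different):
my_data_practice <- my_data_combined %>%
  filter(Participant != 1) %>% # drop participant 1
  mutate(Participant = as.factor(Participant)) %>% # make Participant a factor
  select(Participant, Condition, Observation) # keep only these columns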
Use a pipe to transform my_data_combined in the following ways:
If you want to create a backup of your data, use the following command:
write_csv(my_data_transformed, "FakeData2.csv")
It’s good practice to do this after you’ve done some work to prep the data file, because…
BUT, don’t ever overwrite your original data!
To get summary stats for a variable (or variables), use the pipes and summarise().
my_data_transformed %>% # Notice this is different than above
# we're not changing the data, just looking at it
group_by(IndependentVariable) %>%
# group_by groups our data by 1 or more variables, giving a summary for each variable level
# IndependentVariable has 3 levels, so we get 3 means, SDs, etc. - one for each level.
summarise(myMean = mean(DependentVariable), mySD = sd(DependentVariable))
How does this work?
These functions can be used inside summarise() to get descriptive statistics. We’ve already seen how to use some of these; there’s a short example right after this list.
length() # To get a count of # of elements
mean()
median()
sd() #standard deviation
std.error() # Requires plotrix package
min()
max()
range() # max and min
cor(Observation1, Observation2) # Correlation requires 2 columns
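Here’s a short sketch putting a few of these together (again assuming the Condition and Observation columns from the practice data):
my_data_transformed %>%
  group_by(Condition) %>%
  summarise(n = length(Observation), # how many observations per condition
            myMean = mean(Observation),
            myMedian = median(Observation),
            mySD = sd(Observation),
            myMin = min(Observation),
            myMax = max(Observation))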
Using pipes, use my_data_transformed to get summary statistics:
Now that we’ve learned to work with fake data, let’s look at some real data and apply the skills we have learned. We’ll be using this data in future units, so be sure to get this part right!
For this task, we will be replicating part of this paper:
Sun, J., & Goodwin, G. P. (2020). Do people want to be more moral?. Psychological Science, 31(3), 243-257.
https://jessiesun.me/publication/sun-2020c/sun-2020c.pdf
The Open Science Framework Page for this data can be found here:
Using the paper linked above as a reference, answer the following questions:
Note. Download the file, NOT the metadata.
Note. Download the file, NOT the metadata.
moraldata <- read_csv("sample1-dat-clean.csv") %>%
select(c(1,
tc.EXT.S, # Sociability
tc.EXT.A, # Assertiveness
tc.EXT.E, # Energy_Level
tc.AGR.C, # Compassion
tc.AGR.R, # Respectfulness
tc.AGR.T, # Trust
tc.CON.O, # Organization
tc.CON.P, # Productiveness
tc.CON.R, # Responsibility
tc.NEG.A, # Anxiety
tc.NEG.D, # Depression
tc.NEG.E, # Emotional_Volatility
tc.anger, # Anger
tc.OPE.I, # Intellectual_Curiosity
tc.OPE.A, # Aesthetic_Sensitivity
tc.OPE.C, # Creative_Imagination
tc.MCQ.GM, # General_Morality
tc.MCQ.H, # Honesty
tc.MCQ.F, # Fairness
tc.MCQ.L, # Loyalty
tc.MCQ.P # Purity
))
Summarize the data, grouping by Trait. Show the mean and standard deviation of Desired_Change for each trait.
Which trait do people want to increase the most?
Which trait do people want to decrease the most?
In APA style, means and standard deviations should be rounded to 1 decimal place. Change your code to do this.
write_csv(moraldatalong, "moraldatalong.csv")
Now, let’s take a look at the data from this paper:
Williamson, H. C., Bradbury, T. N., & Karney, B. R. (2021). Experiencing a natural disaster temporarily boosts relationship satisfaction in newlywed couples. Psychological Science, 32(11), 1709-1719.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8907491/
We’ll be using this data in future units, so be sure to get this part right.
The Open Science Framework Page for this data can be found here:
Using the paper linked above as a reference, answer the following questions:
Open the codebook for this study, which is at https://osf.io/auxzt.
marriagedata <- marriagedata %>%
pivot_longer(.,
c(2:13), # the 12 relationship-satisfaction columns (husband and wife at 6 time points)
names_to = c("Spouse", "TimePoint"),
names_sep = "relsat", # split names like "hrelsat1" into Spouse ("h") and TimePoint ("1")
values_to = "RelSat"
) %>%
mutate(time = case_when( # pull the value from the matching time column for each row
TimePoint == 1 ~ time1,
TimePoint == 2 ~ time2,
TimePoint == 3 ~ time3,
TimePoint == 5 ~ time5,
TimePoint == 6 ~ time6,
TimePoint == 7 ~ time7
)) %>%
select(-c("time1", "time2", "time3", "time5", "time6", "time7")) %>% # drop the original wide time columns
mutate(exposure = case_when( # keep the husband's or wife's value, depending on Spouse
Spouse == "h" ~ hexposure,
Spouse == "w" ~ wexposure
)) %>%
select(-c("hexposure", "wexposure")) %>% # the same pattern repeats below for supp3 and ps3
mutate(supp3 = case_when(
Spouse == "h" ~ hsupp3,
Spouse == "w" ~ wsupp3
)) %>%
select(-c("hsupp3", "wsupp3")) %>%
mutate(ps3 = case_when(
Spouse == "h" ~ hps3,
Spouse == "w" ~ wps3
)) %>%
select(-c("hps3", "wps3"))
If you understand your data and your goals, you will know how your data needs to look. If you don’t, you will get frustrated very quickly.