As (budding) scientists, we want to be able to work with real data.
This data will come from different places, and we will need to be able to
read it into R before we can analyse the data.
Before we even think about bringing the data into R, we should take some time to understand our data. You’ll want to ask and answer some or all of the following questions:
Wide data does NOT mean the data is big or has lots of columns! Data is wide if the same variable is measured multiple times and each measure is in the same row. Consider the example below, in which we measured Tim’s quiz scores 4 times, and put all the scores in the same row. That’s wide data.
| Student | Quiz 1 | Quiz 2 | Quiz 3 | Quiz 4 |
|---|---|---|---|---|
| Tim | 84 | 76 | 81 | 67 |
| Anna | 98 | 45 | 78 | 61 |
Here’s another example:
| Participant | Age | Sex | Anxiety Score Time 1 | Anxiety Score Time 2 | Treatment Group |
|---|---|---|---|---|---|
| Joanna | 25 | F | 8 | 7 | Control |
| Diego | 31 | M | 6 | 4 | Intervention |
This data has a lot of columns, but that’s not what makes it wide. It’s wide because Anxiety Score was measured twice, at Time 1 and Time 2. Since there are two columns that measure the same thing, we have wide data.
Wide data is easier for humans to read, but not so great for computers. So, we do NOT want wide data in R. If we have wide data, we know that we will have to turn it into long data after we read the data into R.
Long data does not mean there are a lot of rows in the data! It means that each row is a unique observation - a single data point. There are no repeated measurements within a row; each row holds one observation per subject.
This IS what we want in R. Long data = happy R.
Here is the quiz data converted to long format. Each row has a single quiz on it. Some of the other variables are repeated, but that’s OK - computer memory is cheap and plentiful these days.
| Student | Quiz_Number | Score |
|---|---|---|
| Tim | Quiz 1 | 84 |
| Tim | Quiz 2 | 76 |
| Tim | Quiz 3 | 81 |
| Tim | Quiz 4 | 67 |
| Anna | Quiz 1 | 98 |
| Anna | Quiz 2 | 45 |
| Anna | Quiz 3 | 78 |
| Anna | Quiz 4 | 61 |
Here’s the anxiety data in long format. Again, some of the variables are repeated across rows, but that’s OK because we got what we wanted: each row represents a unique data point.
| Participant | Age | Sex | Time | Anxiety_Score | Treatment_Group |
|---|---|---|---|---|---|
| Joanna | 25 | F | 1 | 8 | Control |
| Joanna | 25 | F | 2 | 7 | Control |
| Diego | 31 | M | 1 | 6 | Intervention |
| Diego | 31 | M | 2 | 4 | Intervention |
If you know the difference between within-subjects and between-subjects data, congratulations on being super cool and popular. You may recognize that when we talked about wide and long data we were talking about within-subjects data where each participant has 2 or more observations. What about between-subjects data, where there is only one observation from each participant? In that case, there are no observations to repeat, so the data is already in a good format for R. Hooray!
| Musher | Checkpoint 1 Time | Checkpoint 2 Time | Checkpoint 3 Time | Checkpoint 4 Time |
|---|---|---|---|---|
| Jerry Sousa | 04:03 | 02:56 | 05:04 | 03:21 |
| Melissa Owens | 03:35 | 07:01 | 05:34 | 03:40 |
Study variables can take one of 3 data types:
Column names must conform to the same naming rules as variables (no spaces, no weird characters); there’s a quick sketch of this right after these questions.
Are there any variables/columns that are missing? Do we need to create or compute them ourselves?
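Not sure whether a column name is ‘legal’? A quick check - just a sketch, using the column names from the examples above - is base R’s make.names(), which shows how names with spaces or odd characters would get sanitized:
make.names(c("Quiz 1", "Anxiety Score Time 1", "Participant")) # spaces become dots
# [1] "Quiz.1"  "Anxiety.Score.Time.1"  "Participant"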
Once you feel that you adequately understand the data, you can bring it into R.
Ideally, you want your data to be a comma-separated (.csv) or tab-delimited (.txt) file.
We’ll need to make some data so we can practice working with it.
| Participant | Condition | Observation |
|---|---|---|
| 1 | Condition A | 5 |
| 2 | Condition A | 2 |
| 3 | Condition A | 7 |
| 4 | Condition A | 4 |
| 5 | Condition A | 1 |
| 1 | Condition B | 10 |
| 2 | Condition B | 4 |
| 3 | Condition B | 10 |
| 4 | Condition B | 5 |
| 5 | Condition B | 9 |
| 1 | Condition C | 1 |
| 2 | Condition C | 3 |
| 3 | Condition C | 8 |
| 4 | Condition C | 2 |
| 5 | Condition C | 5 |
Now save the data as a .csv file called FakeData.csv.
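If you’d rather build and save the practice file from R instead of a spreadsheet program, here’s a minimal sketch (the values match the table above):
library(tidyverse)
fake_data <- tibble(
  Participant = rep(1:5, times = 3), # participants 1-5, once per condition
  Condition = rep(c("Condition A", "Condition B", "Condition C"), each = 5),
  Observation = c(5, 2, 7, 4, 1, 10, 4, 10, 5, 9, 1, 3, 8, 2, 5)
)
write_csv(fake_data, "FakeData.csv") # saves the file next to your R project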
# This looks for the file in the same directory as our R project
my_data <- read_csv("FakeData.csv")
# Notice that read_csv has an underscore, not a period (.). read.csv() is an existing function in base R, but read_csv() is a bit better.
# my_file <- read_tsv("FakeData.txt") # read in a tab-delimited .txt file
# my_file <- read_delim("FakeData.csv", delim = "/") # read in a file with a unique delimiter (e.g. "/")
my_data <- read_csv("./data/FakeData.csv")
# This looks for the file in the "data" subdirectory of our R project folder
When we import our data into R, we sometimes need to tell R what missing data looks like. We can do this by adding na = to our read command:
my_data <- read_csv("FakeData.csv", na = c("", "NA", ".", "#REF!"))
This tells R to treat any of the following as NA (missing data): empty cells (""), the text "NA", a period ("."), and Excel’s "#REF!" error. We can add anything we need to this vector.
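For example, if your file happened to use -999 as a missing-data code (a made-up convention here), you could add it:
my_data <- read_csv("FakeData.csv", na = c("", "NA", ".", "#REF!", "-999"))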
In a new code chunk, do the following:
You can see how many rows and columns your data has by looking in the ‘Environment’ tab (top right).
There are also several commands that allow you to examine your data:
ncol(my_data) # Number of columns
nrow(my_data) # Number of rows
dim(my_data) # Get rows and columns
colnames(my_data) # What are the columns/variables
summary(my_data) # Shows a summary of each column in the data
head(my_data) # Shows the first few rows
tail(my_data) # Shows the last few rows
If you want to see your whole dataset, go to the ‘Environment’ tab and click on it there, or use the View() command. RStudio will open the data in a separate tab.
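For example (note the capital V - that’s the base R spelling):
View(my_data) # Opens my_data in its own spreadsheet-style tab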
Usually, we will need to do some things to our data before it is fully ready to be analyzed.
my_data_long <- pivot_longer(my_data_wide, # What data are you starting with?
cols = c(2:4), # This vector says which columns should be pivoted. Any column not listed is not pivoted.
# Typically, participant and demographic columns are not pivoted.
# Only the columns with repeated values are pivoted
names_to = "ColumnNamesGoHere", # What new column should the existing column names go into?
values_to = "CellValuesGoHere") # What new column should the existing column names go into?
This command can do some complex things, like splitting column names up into separate columns, selecting a subset of the data to pivot, and more. See ?pivot_longer for more info.
Here’s a specific example. We’re starting with this wide data, which we’ll call Quiz_Data:
| Student | Quiz 1 | Quiz 2 | Quiz 3 | Quiz 4 |
|---|---|---|---|---|
| Tim | 84 | 76 | 81 | 67 |
| Anna | 98 | 45 | 78 | 61 |
Here’s the command:
Quiz_Data_Long <- pivot_longer(Quiz_Data,
cols = c(2:5), # We only want to pivot the repeated quiz data (columns 2 to 5), not the participant name (column 1)
names_to = "Quiz_Number", # Take the column names ("Quiz 1", "Quiz 2", etc.) and put them in a column called "Quiz_Number"
values_to = "Score") # Take the quiz scores and put them in a column called "Score"
And here’s what we would get:
| Student | Quiz_Number | Score |
|---|---|---|
| Tim | Quiz 1 | 84 |
| Tim | Quiz 2 | 76 |
| Tim | Quiz 3 | 81 |
| Tim | Quiz 4 | 67 |
| Anna | Quiz 1 | 98 |
| Anna | Quiz 2 | 45 |
| Anna | Quiz 3 | 78 |
| Anna | Quiz 4 | 61 |
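By the way, you don’t have to pick the columns to pivot by number. Here’s a sketch of the same pivot using a tidyselect helper (this assumes all the quiz columns start with “Quiz”, as they do above):
Quiz_Data_Long <- pivot_longer(Quiz_Data,
                               cols = starts_with("Quiz"), # pick columns by name instead of position
                               names_to = "Quiz_Number",
                               values_to = "Score")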
Here’s some data we’ll call Anxiety_Study:
| Participant | Age | Sex | Anxiety Score Time 1 | Anxiety Score Time 2 | Treatment Group |
|---|---|---|---|---|---|
| Joanna | 25 | F | 8 | 7 | Control |
| Diego | 31 | M | 6 | 4 | Intervention |
Here’s the command:
Anxiety_Study_Long <- pivot_longer(Anxiety_Study,
cols = c(4:5), # We only want to pivot the anxiety score columns (columns 4 and 5), not the other ones
names_to = "Time",
# Take the column names ("Anxiety Score Time 1", "Anxiety Score Time 2") and put them in a column called "Time"
values_to = "Anxiety_Score") # Take the anxiety scores in the cells and put them in a column called "Anxiety_Score"
Anxiety_Study_Long <- mutate(Anxiety_Study_Long, Time = gsub("Anxiety Score Time ", "", Time), Time = as.numeric(Time))
#Take the words out of the time column and make it a number. You'll learn more about this later!
And here’s what we would get:
| Participant | Age | Sex | Time | Anxiety_Score | Treatment_Group |
|---|---|---|---|---|---|
| Joanna | 25 | F | 1 | 8 | Control |
| Joanna | 25 | F | 2 | 7 | Control |
| Diego | 31 | M | 1 | 6 | Intervention |
| Diego | 31 | M | 2 | 4 | Intervention |
Now you try: convert this wide data to long format using pivot_longer().
| Participant | Condition A | Condition B | Condition C |
|---|---|---|---|
| 6 | 8 | 5 | 1 |
| 7 | 6 | 2 | 8 |
| 8 | 7 | 7 | 0 |
| 9 | 5 | 4 | 0 |
| 10 | 3 | 1 | 8 |
| 11 | 0 | 10 | 0 |
| 12 | 9 | 4 | 0 |
| 13 | 4 | 10 | 1 |
| 14 | 1 | 5 | 3 |
| 15 | 1 | 9 | 4 |
If you’ve done it right, the data should look like this (I’ve shown the first few rows only):
| Participant | Condition | Observation |
|---|---|---|
| 6 | Condition A | 8 |
| 6 | Condition B | 5 |
| 6 | Condition C | 1 |
| 7 | Condition A | 6 |
| 7 | Condition B | 2 |
| 7 | Condition C | 8 |
So you want to combine two data sets together? Here’s how you do it!
i.e. if you have similar data (the same columns) from two different sets of participants or groups, use bind_rows():
my_file_combined <- bind_rows(my_file, my_file_2)
i.e. if you have the data FROM the participants (such as their performance on some test or measure) and data ABOUT the participants (such as demographic data) that you need to combine, use left_join() or one of its variants (right_join(), inner_join(), full_join()):
my_file_combined <- left_join(data_from_participants, data_about_participants, by = "Participant")
There must be at least one column in each data set with the same information (e.g. a unique participant ID number). Use by = to specify which column.
If the matching columns do not have the same name, you must tell R how to match them up, like so:
my_file_combined <- left_join(data_from_participants, data_about_participants, by = c("Participant" = "Respondent"))
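Here’s a minimal sketch with made-up data, just to show the matching in action (the tibbles and values below are hypothetical):
scores <- tibble(Participant = c(1, 2, 3), Score = c(12, 15, 9)) # data FROM the participants
demographics <- tibble(Respondent = c(1, 2, 3), Age = c(25, 31, 28)) # data ABOUT the participants, with a differently named ID column
combined <- left_join(scores, demographics, by = c("Participant" = "Respondent"))
# combined now has one row per participant, with both Score and Age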
The pipe operator [%>%, shortcut CTRL + Shift + M] lets you chain a series of commands together. As long as these commands all operate on the same data set, you can create a ‘pipeline’ where the data goes in one end and comes out the other end transformed.
Like so:
my_file_new <- read_csv("FakeData.csv", na = c("", "NA", ".", "#REF!")) %>%
# This line reads in the data
filter(Participant != 1) # This line filters the data
You can also write it like this, which makes the process clearer:
read_csv("FakeData.csv", na = c("", "NA", ".", "#REF!")) %>%
# This line reads in the data
filter(Participant != 1) -> # This line filters the data
my_file_new # New Name
(Image: a tangled mess of pipes, from https://www.dynamicdrainstx.com/wp-content/uploads/2018/12/pipe-mess.jpg)
Below are some examples of useful pipe commands. There are a lot more
than these! Google can teach you about them.
my_data_combined <- my_data_combined %>%
# we use the assigner like this when we want to change the data
rename(Participant = Subject) %>% # Change a column name from Subject to Participant
rename_with(toupper, starts_with("C")) %>% # Change the name of column(s) that start with 'C' to uppercase
rename_with(~ gsub("o", "O", .x)) # Replace all lower-case "o" strings with upper-case "O" strings in all column names
mutate() changes columns or their contents, and can also create new columns.
my_data_combined <- my_data_combined %>%
mutate(Participant = as.factor(Participant)) %>%
# Change participant to a factor
# as.numeric() or as.character() can be used here, too, to try and change a variable to number or string, respectively.
# Below, creates a new column called NewCondition
mutate(NewCondition = case_when(
Condition == "A" ~ "Control",
# if Condition equals "A", NewCondition is given the value "Control"
Condition == "B" ~ "Experimental",
# if Condition equals "B", NewCondition is given the value "Experimental"
TRUE ~ "Other" # What if it's not A or B?
# if Condition is not "A" or "B", NewCondition is given the value "Other"
), NewCondition = as.factor(NewCondition))
# And then we make NewCondition a factor
Note the use of the equal sign “=” instead of the assignment operator “<-” inside mutate().
select includes or excludes certain columns from our data
my_data_combined <- my_data_combined %>%
select(c(2:3)) # selects only columns 2 to 3, leaving out column 1
my_data_combined <- my_data_combined %>%
select(-c(3)) # selects all columns EXCEPT column 3
my_data_combined <- my_data_combined %>%
select(c("Participant", "Condition")) # selects only named columns
Filter keeps only rows for which a conditional statement is true
my_data_combined <- my_data_combined %>%
filter(Participant == 10) # only keeps participant 10
my_data_combined <- my_data_combined %>%
filter(Participant < 10) # only keeps participants 1 to 9
Use a pipe when you want to apply a series of functions to a single input.
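For example, here’s a minimal sketch that chains several of the verbs above (it assumes the Participant, Condition, and Observation columns from the practice data; your own pipeline will look different):
my_data_practice <- my_data_combined %>%
  filter(Participant != 1) %>% # drop participant 1
  mutate(Participant = as.factor(Participant)) %>% # make Participant a factor
  select(Participant, Condition, Observation) # keep only these columns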
Use a pipe to transform my_data_combined in the following ways:
If you want to create a backup of your data, use the following command:
write_csv(my_data_transformed, "FakeData2.csv")
It’s good practice to do this after you’ve done some work to prep the data file, because…
BUT, don’t ever overwrite your original data!
To get summary stats for a variable (or variables), use the pipes and summarise().
my_data_transformed %>% # Notice this is different than above
# we're not changing the data, just looking at it
group_by(IndependentVariable) %>%
# group_by groups our data by 1 or more variables, giving a summary for each variable level
# IndependentVariable has 3 levels, so we get 3 means, SDs, etc. - one for each level.
summarise(myMean = mean(DependentVariable), mySD = sd(DependentVariable))
How does this work?
These functions can be used inside summarise() to get descriptive statistics. We’ve already seen how to use some of these; there’s a short example right after this list.
length() # To get a count of # of elements
mean()
median()
sd() #standard deviation
std.error() # Requires plotrix package
min()
max()
range() # max and min
cor(Observation1, Observation2) # Correlation requires 2 columns
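Here’s a short sketch putting a few of these together (again assuming the Condition and Observation columns from the practice data):
my_data_transformed %>%
  group_by(Condition) %>%
  summarise(n = length(Observation), # how many observations per condition
            myMean = mean(Observation),
            myMedian = median(Observation),
            mySD = sd(Observation),
            myMin = min(Observation),
            myMax = max(Observation))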
Using pipes, use my_data_transformed to get summary statistics:
Now that we’ve learned to work with fake data, let’s look at some real data and apply the skills we have learned. We’ll be using this data in future units, so be sure to get this part right!
For this task, we will be replicating part of this paper:
Sun, J., & Goodwin, G. P. (2020). Do people want to be more moral?. Psychological Science, 31(3), 243-257.
https://jessiesun.me/publication/sun-2020c/sun-2020c.pdf
The Open Science Framework Page for this data can be found here:
Using the paper linked above as a reference, answer the following questions:
Note. Download the file, NOT the metadata.
Note. Download the file, NOT the metadata.
moraldata <- read_csv("sample1-dat-clean.csv") %>%
select(c(1,
tc.EXT.S, # Sociability
tc.EXT.A, # Assertiveness
tc.EXT.E, # Energy_Level
tc.AGR.C, # Compassion
tc.AGR.R, # Respectfulness
tc.AGR.T, # Trust
tc.CON.O, # Organization
tc.CON.P, # Productiveness
tc.CON.R, # Responsibility
tc.NEG.A, # Anxiety
tc.NEG.D, # Depression
tc.NEG.E, # Emotional_Volatility
tc.anger, # Anger
tc.OPE.I, # Intellectual_Curiosity
tc.OPE.A, # Aesthetic_Sensitivity
tc.OPE.C, # Creative_Imagination
tc.MCQ.GM, # General_Morality
tc.MCQ.H, # Honesty
tc.MCQ.F, # Fairness
tc.MCQ.L, # Loyalty
tc.MCQ.P # Purity
))
Summarize the data, grouping by Trait. Show the mean and standard deviation of Desired_Change for each trait.
Which trait do people want to increase the most?
Which trait do people want to decrease the most?
In APA style, means and standard deviations should be rounded to 1 decimal place. Change your code to do this.
write_csv(moraldatalong, "moraldatalong.csv")
Now, let’s take a look at the data from this paper:
Williamson, H. C., Bradbury, T. N., & Karney, B. R. (2021). Experiencing a natural disaster temporarily boosts relationship satisfaction in newlywed couples. Psychological Science, 32(11), 1709-1719.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8907491/
We’ll be using this data in future units, so be sure to get this part right.
The Open Science Framework Page for this data can be found here:
Using the paper linked above as a reference, answer the following questions:
Open the codebook for this study, which is at https://osf.io/auxzt.
marriagedata <- marriagedata %>%
pivot_longer(.,
c(2:13), # the 12 relationship-satisfaction columns (husband and wife at 6 time points)
names_to = c("Spouse", "TimePoint"),
names_sep = "relsat", # split names like "hrelsat1" into Spouse ("h") and TimePoint ("1")
values_to = "RelSat"
) %>%
mutate(time = case_when( # pull the value from the matching time column for each row
TimePoint == 1 ~ time1,
TimePoint == 2 ~ time2,
TimePoint == 3 ~ time3,
TimePoint == 5 ~ time5,
TimePoint == 6 ~ time6,
TimePoint == 7 ~ time7
)) %>%
select(-c("time1", "time2", "time3", "time5", "time6", "time7")) %>% # drop the original wide time columns
mutate(exposure = case_when( # keep the husband's or wife's value, depending on Spouse
Spouse == "h" ~ hexposure,
Spouse == "w" ~ wexposure
)) %>%
select(-c("hexposure", "wexposure")) %>% # the same pattern repeats below for supp3 and ps3
mutate(supp3 = case_when(
Spouse == "h" ~ hsupp3,
Spouse == "w" ~ wsupp3
)) %>%
select(-c("hsupp3", "wsupp3")) %>%
mutate(ps3 = case_when(
Spouse == "h" ~ hps3,
Spouse == "w" ~ wps3
)) %>%
select(-c("hps3", "wps3"))
If you understand your data and your goals, you will know how your data needs to look. If you don’t, you will get frustrated very quickly.