Open RStudio.
Open a new R script and save it as final_project_LastFirst.R (where Last and First are your last and first name).
Be careful about capitalization, the order of last and first name, and using _ instead of -.
At the top of your script, write the following (with appropriate changes):
# Final Project
# Name: Laura Fontanesi
# Date: 24 May 2022
The final project is divided into two parts:
Deadline (for both parts): 3rd of June
In your final project, you will analyze data from a random dot motion task, a classical paradigm in perceptual decision making research, published in 2017 by Nathan Evans and Scott Brown; link to the paper.
In this task, participants were required to decide whether a cloud of 40 moving white dots on a black background was moving towards the top-left or top-right of the screen. Eighty-five participants were randomly assigned to one of six groups formed by the factorial combination of two factor variables: block type (fixed time vs. fixed trial) and feedback type (low, medium, high). All participants completed 24 blocks in total.
In the fixed trial condition, there were 40 trials in each block, while in the fixed time condition each block lasted 1 min.
Participants in the low feedback detail condition received only the trial-by-trial feedback on correctness. Participants in the medium feedback condition received the same trial-by-trial feedback, plus information on their average accuracy and response time every 200 trials. Participants in the high feedback condition received the trial-by-trial feedback and the feedback every 200 trials, plus extra guidance on how they could change their speed–accuracy tradeoff to achieve an optimal reward rate.
In sum, participants were randomly assigned to the following conditions (you should use these labels in the parsed dataset):
Note: In this data set each subject participated in only one condition, so each condition can be inferred from the file names, according to the following labels:
For example: the condition of ‘10014_Optim_Time_Friday 29th of May 2015 11/40/00 AM.txt’ corresponds to Fixed_Time_High_Information.
You can find the data at the OSF repository: https://osf.io/8592y/ (only download the Data folder), or on Dropbox: https://www.dropbox.com/sh/pz275wapxekjh6z/AAAsLcgz6hAWLSg8cm5rrdVPa?dl=0
Note that the data of each participant are in a separate file. We therefore need to import them and merge them into a single dataset. Instead of writing out all the data paths yourself (which would take forever), you can use the list.files() function: https://www.math.ucla.edu/~anderson/rw1001/library/base/html/list.files.html
Here is my example code on how to import the files. You only need to adapt the path to the folder on your system where you downloaded the raw data.
library(tidyverse)
# define the data folder
data_folder = '~/Dropbox/teaching/r-course22/data/final_project/'
# define lists of files
list_files = paste(data_folder, list.files(path=data_folder), sep="")
# apply the read_delim function to each element in the list
data_list = lapply(list_files, function(x) read_delim(x, delim='\t'))
# loop through all the participants to add the participant number and the
# condition (parsed from the file name) as columns, since we have no other
# way to know which condition each participant was in
data = NULL
participant = 0
for (n in 1:length(data_list)) {
    participant = participant + 1
    cond_list = strsplit(list_files[n], "_")
    condition = paste(cond_list[[1]][3], "_", cond_list[[1]][4], sep="")
    data = rbind(data, cbind(participant=participant, condition=condition, data_list[[n]]))
}
head(data)
## participant condition blkNum trlNum coherentDots numberofDots percentCoherence winningDirection response correct
## 1 1 Norm_Time 1 1 4 40 10 left left 1
## 2 1 Norm_Time 1 2 4 40 10 left right 0
## 3 1 Norm_Time 1 3 4 40 10 right left 0
## 4 1 Norm_Time 1 4 4 40 10 right left 0
## 5 1 Norm_Time 1 5 4 40 10 left left 1
## 6 1 Norm_Time 1 6 4 40 10 left left 1
## eventCount averageFrameRate RT
## 1 51 15.389 3314
## 2 40 15.250 2623
## 3 11 16.492 667
## 4 19 15.690 1211
## 5 99 15.175 6524
## 6 15 16.060 934
glimpse(data)
## Rows: 73,075
## Columns: 13
## $ participant <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ condition <chr> "Norm_Time", "Norm_Time", "Norm_Time", "Norm_Time", "Norm_Time", "Norm_Time", "Norm_Time",…
## $ blkNum <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ trlNum <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,…
## $ coherentDots <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ numberofDots <dbl> 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40…
## $ percentCoherence <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ winningDirection <chr> "left", "left", "right", "right", "left", "left", "left", "right", "left", "right", "left"…
## $ response <chr> "left", "right", "left", "left", "left", "left", "right", "right", "left", "right", "right…
## $ correct <dbl> 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, …
## $ eventCount <dbl> 51, 40, 11, 19, 99, 15, 10, 16, 10, 13, 21, 9, 18, 10, 5, 6, 14, 56, 41, 24, 15, 21, 8, 26…
## $ averageFrameRate <dbl> 15.389, 15.250, 16.492, 15.690, 15.175, 16.060, 16.474, 16.097, 15.291, 15.971, 15.825, 15…
## $ RT <dbl> 3314, 2623, 667, 1211, 6524, 934, 607, 994, 654, 814, 1327, 579, 1151, 659, 323, 347, 863,…
As you can see, there are a few things to fix to make the dataset look a bit cleaner. It’s on you now to do the rest: The final data should consist of the following columns:
You should remove the columns in the dataset that are not in this list, and rename the columns whose labels differ to the ones below.
Hint: Note that the accuracy column corresponds to the correct column in the original dataset.
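A minimal sketch of this cleaning step on a toy data frame (the target names used here, such as block_number and trial_number, are my assumptions — use the exact column names listed in the assignment):

```r
library(tidyverse)

# Toy data frame mimicking the imported raw columns; the target names
# (block_number, trial_number, accuracy, rt) are assumptions
raw = tibble(participant = 1, condition = "Norm_Time",
             blkNum = 1, trlNum = 1:3,
             correct = c(1, 0, 1), RT = c(3314, 2623, 667))

clean = raw %>%
  rename(block_number = blkNum, trial_number = trlNum,
         accuracy = correct, rt = RT) %>%
  select(participant, condition, block_number, trial_number, accuracy, rt)
```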
The recoding of the conditions should follow this scheme:
Hint: Try summarizing the data using n_distinct(participant).
Hint: First group the data by condition and then summarize using n_distinct(participant).
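On a small toy tibble, this grouping pattern looks like the following (toy values, not the real data):

```r
library(tidyverse)

# Toy data: participants 1 and 2 in condition "A", participant 3 in "B"
toy = tibble(participant = c(1, 1, 2, 2, 3),
             condition   = c("A", "A", "A", "A", "B"))

# count distinct participants per condition
counts = toy %>%
  group_by(condition) %>%
  summarize(n_participants = n_distinct(participant))
```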
Hint: First, you can create a new column called block_type with the following recoding of the condition column:
Then, you can calculate the number of trials per participant (e.g., using group_by(data, participant) and then summarizing with the count function n()). You should obtain a tibble with as many rows as the number of participants and a column containing the number of trials per participant.
Then, you need to add, to the previous output (of the summarize function), the information of which condition each participant was assigned to. An easy way is to use the distinct function (https://dplyr.tidyverse.org/reference/distinct.html), specifically distinct(data, participant, condition), which will output a similar tibble as before (i.e., with the same number of rows: one per participant) with the name of the condition each participant was assigned to.
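A minimal sketch of these two summaries and the join, on toy data (column names are my assumptions):

```r
library(tidyverse)

# Toy data: participant 1 has 3 trials, participant 2 has 2
toy = tibble(participant = rep(c(1, 2), times = c(3, 2)),
             condition   = rep(c("Norm_Time", "Count_Trial"), times = c(3, 2)))

# number of trials per participant
n_trials_tbl = summarize(group_by(toy, participant), n_trials = n())

# condition of each participant (one row per participant)
conditions_tbl = distinct(toy, participant, condition)

# join the two tibbles by participant
joined = full_join(n_trials_tbl, conditions_tbl, by = "participant")
```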
Now, you can join the 2 datasets; it should look like this:
# A tibble: 85 x 3
participant n_trials block_type
<dbl> <int> <chr>
1 1 911 FixedTime
2 2 960 FixedTrial
3 3 753 FixedTime
4 4 960 FixedTrial
...
Finally, you can group by block_type and calculate the mean number of trials in each level (time vs. trial), using again group_by and summarize.
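Using a joined tibble of the shape shown above (the values here are made up for illustration), this last step could look like:

```r
library(tidyverse)

# Made-up joined tibble: one row per participant
joined = tibble(participant = 1:4,
                n_trials   = c(911, 960, 753, 960),
                block_type = c("FixedTime", "FixedTrial", "FixedTime", "FixedTrial"))

# mean number of trials per block type
mean_trials = joined %>%
  group_by(block_type) %>%
  summarize(mean_n_trials = mean(n_trials))
```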
Calculate a summary which includes the average and SD of accuracy and response time per condition (6 in total), and use this to describe the overall performance across participants in these conditions (i.e., which conditions produced higher accuracy, and which produced faster responses? Do accuracy and speed always trade off?).
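One way to sketch such a summary, on toy values (accuracy coded 0/1, rt in seconds; column names are my assumptions):

```r
library(tidyverse)

# Toy trial-level data with two conditions
toy = tibble(condition = rep(c("A", "B"), each = 3),
             accuracy  = c(1, 0, 1, 1, 1, 1),
             rt        = c(0.5, 0.7, 0.6, 0.9, 1.1, 1.0))

# mean and SD of accuracy and response time per condition
perf = toy %>%
  group_by(condition) %>%
  summarize(mean_acc = mean(accuracy), sd_acc = sd(accuracy),
            mean_rt  = mean(rt),       sd_rt  = sd(rt))
```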
Calculate a summary which includes the average accuracy and average response time per participant, as well as the percentage of trials below 150 ms (too-fast trials) and above 5000 ms (too-slow trials). Are there any participants with more than 10% fast or slow trials?
Hint: Create two new columns in the dataset, called slow_trials and fast_trials, using case_when: assign a value of 1 in slow_trials when rt >= 5 and 0 otherwise, and a value of 1 in fast_trials when rt <= .15 and 0 otherwise. Now you have two columns that, when their average is calculated per participant and multiplied by 100, give you the percentage of slow and fast trials for each participant.
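A sketch of this case_when hint on a toy tibble (assuming rt is in seconds, consistent with the thresholds above):

```r
library(tidyverse)

# Toy data: one fast trial (0.1 s) and one slow trial (6 s) out of four
toy = tibble(participant = c(1, 1, 1, 1),
             rt = c(0.1, 0.5, 6, 1.2))

# flag slow and fast trials
toy = mutate(toy,
             slow_trials = case_when(rt >= 5   ~ 1, TRUE ~ 0),
             fast_trials = case_when(rt <= .15 ~ 1, TRUE ~ 0))

# percentage of slow/fast trials per participant
pct = toy %>%
  group_by(participant) %>%
  summarize(pct_slow = mean(slow_trials) * 100,
            pct_fast = mean(fast_trials) * 100)
```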
Exclude from the dataset the participants that have less than 60% accuracy. The trials with a response time of less than 150 ms or greater than 5000 ms should also be excluded. Check that the accuracy variable only contains 0 and 1.
Hint: Create a vector by using the filter function on the output of the last question in Part 1.2 and extract the participants that have a mean accuracy lower than .6, like so:
participants_to_exclude = filter(..., meanACC < .6)$participant
Since you are likely going to have many participants, it’s a bit of a
hassle to write filter(data, participant != X)
for every
participant number X. Therefore, you can build a loop that iterates
through the participants to exclude and filters them out one by one,
like so:
for (n in 1:length(participants_to_exclude)) {
    data = data %>%
        filter(participant != participants_to_exclude[n])
}
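As an alternative to the loop, filter with %in% removes all excluded participants in one call (toy example):

```r
library(tidyverse)

toy = tibble(participant = c(1, 2, 3, 3), accuracy = c(1, 0, 1, 1))
participants_to_exclude = c(2)

# keep only the participants NOT in the exclusion vector
toy = filter(toy, !(participant %in% participants_to_exclude))
```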
Now it’s time to visualize the data set.
Hint: First, calculate the mean response times and accuracy per condition and block number, using again group_by and summarize. Then, use the geom_line() function with the summarized data:
ggplot(data = ..., mapping = aes(x = block_number, y = ..., color=condition)) +
geom_line()
In the paper, they define the reward rate as:
\(\frac {PC}{MRT + ITI + FDT + (1-PC)*ET}\)
where MRT and PC refer to the mean correct response time and probability of a correct response, ITI is the inter-trial interval (i.e., 100 ms), FDT is the feedback display time (i.e., 300 ms), and ET is the error timeout (i.e., 500 ms).
By calculating the MRT and PC per block per participant (i.e., create
a summary using participant
and block_number
as grouping variables), you can calculate the reward_rate
by filling in the rest of the equation above:
### group data to calculate mean correct response times
grouped_data_pos = group_by(filter(data, accuracy > 0), participant, block_number)
### group data to calculate mean accuracy
grouped_data = group_by(data, participant, block_number)
### join the two
data_summary = full_join(summarise(grouped_data_pos, MRT = mean(rt)),
summarise(grouped_data, PC = mean(accuracy)))
### add condition variable for plotting
data_summary = full_join(data_summary,
distinct(data, participant, block_number, condition))
### and now calculate the reward rate as in the equation
data_summary = mutate(data_summary,
reward_rate = PC/(...))
When you are done, you can plot the reward rate as you did in Part 1.4 for response times and accuracy (so recreate the full Figure 2 from the paper), and also plot the mean reward rate per condition.
We now want to run 2 separate ANOVA models to test the three main effects (block number, block type, and feedback type) on response times and accuracy (if you calculated the reward rate, you have the option of also calculating the ANOVA on the reward rates).
This includes 3 steps:
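As a minimal sketch of such an ANOVA with base R's aov() on made-up aggregated data (the variable names mean_rt, block_type, and feedback_type are assumptions; adapt them to your own summary):

```r
# Made-up aggregated data: one mean RT per participant
toy = data.frame(
  mean_rt       = c(0.8, 0.9, 1.1, 1.0, 0.7, 0.85, 1.2, 0.95),
  block_type    = rep(c("FixedTime", "FixedTrial"), 4),
  feedback_type = rep(c("Low", "High"), each = 4))

# ANOVA with two between-subjects main effects
model = aov(mean_rt ~ block_type + feedback_type, data = toy)
summary(model)
```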
Finally, we want to run a multilevel regression to test the three main effects (block number, block type, and feedback type) on response times, without aggregating the data. This means that
you will have repeated measurements for each participant, block number,
block type, and feedback type and that you have to include
participant
as a grouping/hierarchical variable in your
multilevel model.
This includes 2 steps:
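A minimal sketch of such a multilevel model using the lme4 package (assuming lme4 is installed; the data and variable names are made up for illustration):

```r
library(lme4)

set.seed(1)
# Made-up trial-level data: 10 participants x 4 blocks
toy = data.frame(
  rt            = rnorm(40, mean = 1, sd = 0.2),
  block_number  = rep(1:4, times = 10),
  block_type    = rep(c("FixedTime", "FixedTrial"), each = 20),
  feedback_type = rep(rep(c("Low", "High"), each = 4), 5),
  participant   = rep(1:10, each = 4))

# random intercept per participant captures the repeated measurements
model = lmer(rt ~ block_number + block_type + feedback_type + (1 | participant),
             data = toy)
summary(model)
```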