In this WPA, you will analyze data from another fake study. The researchers were interested in whether playing video games has cognitive benefits compared to other leisure activities. In the study, 90 university students were asked to do one of 3 leisure activities for 1 hour a day for a month: 30 participants were asked to play video games, 30 to read, and 30 to juggle. At the end of the month, each participant did 3 cognitive tests: a problem solving test (logic), a reflex/response test (reflex), and a written comprehension test (comprehension).
The data file has 90 rows and 7 columns (an additional index column with the row number also appears when the file is read into R). Here are the columns:
id: The participant ID
age: The age of the participant
gender: The gender of the participant
activity: Which leisure activity the participant was assigned for the last month (“reading”, “juggling”, “gaming”)
logic: Score out of 120 on a problem solving task. Higher is better.
reflex: Score out of 25 on a reflex test. Higher indicates faster reflexes.
comprehension: Score out of 100 on a reading comprehension test. Higher is better.
Task A
Load the data_wpa7.txt dataset in R (find it on GitHub) and save it as a new object called leisure. Inspect the dataset first.
library(tidyverse)
leisure = read_delim('https://raw.githubusercontent.com/laurafontanesi/r-seminar22/master/data/data_wpa7.txt', delim='\t')
##
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## index = col_double(),
## id = col_double(),
## age = col_double(),
## gender = col_character(),
## activity = col_character(),
## logic = col_double(),
## reflex = col_double(),
## comprehension = col_double()
## )
head(leisure)
## # A tibble: 6 x 8
## index id age gender activity logic reflex comprehension
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 1 26 m reading 88 13.7 72
## 2 2 2 31 m reading 85 11.8 83
## 3 3 3 38 m reading 82 5.8 67
## 4 4 4 24 m reading 102 18 66
## 5 5 5 30 f reading 48 14 62
## 6 6 6 31 m reading 61 14.1 58
Write a function called feed_me() that takes a string food as an argument and returns (in case food = 'pizza') the sentence “I love to eat pizza”. Try your function by running feed_me("apples") (it should then return “I love to eat apples”).
feed_me = function(food) {
  # paste() joins the fixed text and the food name with a space
  print(paste("I love to eat", food))
}
feed_me("apples")
## [1] "I love to eat apples"
Without using the mean() function, calculate the mean of the vector vec_1 = seq(1, 100, 5).
Write a function called my_mean() that takes a vector x as an argument and returns the mean of the vector x. Use your code for task A3 (the previous task) as your starting point. Test it on the vector from task A3.
vec_1 = seq(1, 100, 5)
my_mean = function(x) {
  # Mean = sum of the elements divided by the number of elements
  return(sum(x) / length(x))
}
my_mean(vec_1)
## [1] 48.5
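The function above wraps the task A3 calculation. On its own, before it is turned into a function, that calculation might look like the following sketch (the name manual_mean is only illustrative and does not come from the worksheet):

```r
# Task A3 sketch: mean of vec_1 without calling mean()
vec_1 = seq(1, 100, 5)                     # 1, 6, 11, ..., 96
manual_mean = sum(vec_1) / length(vec_1)   # total divided by number of elements
print(manual_mean)
```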
Use your my_mean() function to calculate the mean ‘logic’ rating of participants in the leisure dataset and compare the result to the built-in mean() function (using ==) to make sure you get the same result.
my_mean(leisure$logic) == mean(leisure$logic)
## [1] TRUE
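A side note on the == comparison: it tests for exact equality, which can fail for floating-point numbers even when two results agree to many decimal places. A minimal sketch of a tolerance-based check with all.equal(), using the six logic scores shown by head(leisure) in place of the full column (scores and same_result are illustrative names):

```r
my_mean = function(x) {
  return(sum(x) / length(x))
}

# The first six logic scores, standing in for leisure$logic
scores = c(88, 85, 82, 102, 48, 61)

# all.equal() compares with a small numerical tolerance;
# isTRUE() turns its result into a single TRUE/FALSE
same_result = isTRUE(all.equal(my_mean(scores), mean(scores)))
print(same_result)

# Classic case where exact equality fails but all.equal() succeeds
print(0.1 + 0.2 == 0.3)
print(isTRUE(all.equal(0.1 + 0.2, 0.3)))
```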
Create a loop that prints the squares of the integers from 1 to 10.
for (i in 1:10) {
  print(i^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
Modify the loop so that it stores the squares in a vector called squares. You’ll need to pre-create a vector, and use indexing to update it.
squares = rep(NA, 10)  # pre-allocate the result vector
for (i in 1:10) {
  squares[i] = i^2
}
print(squares)
## [1] 1 4 9 16 25 36 49 64 81 100
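Because arithmetic in R is vectorised, the same vector can also be built without a loop. This is an aside rather than the intended solution, since the task is about practising loops; squares_vec is an illustrative name:

```r
# Loop version from the task
squares = rep(NA, 10)
for (i in 1:10) {
  squares[i] = i^2
}

# Vectorised version: ^ is applied to every element of 1:10 at once
squares_vec = (1:10)^2

# Check that both approaches give the same values
print(all(squares == squares_vec))
```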
Task B
Create a function called standardize that, given an input vector, returns its standardized version. Remember that to normalize a score, also called z-transforming it, you first subtract the mean score from the individual scores and then divide by the standard deviation.
standardize = function(x) {
  demeaned = x - mean(x, na.rm = TRUE)  # center the scores at 0
  st_deviation = sd(x, na.rm = TRUE)
  z_score = demeaned / st_deviation  # rescale to unit standard deviation
  return(z_score)
}
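One way to sanity-check standardize() is to confirm that its output has mean 0 and standard deviation 1, and that it agrees with base R's scale(). A small sketch, using the six comprehension scores shown by head(leisure) as test input (scores and z are illustrative names):

```r
standardize = function(x) {
  demeaned = x - mean(x, na.rm = TRUE)
  st_deviation = sd(x, na.rm = TRUE)
  return(demeaned / st_deviation)
}

scores = c(72, 83, 67, 66, 62, 58)
z = standardize(scores)

print(round(mean(z), 10))   # should be 0 up to rounding error
print(sd(z))                # should be 1 up to rounding error

# scale() returns a one-column matrix, so convert it to a plain vector first
print(all.equal(z, as.numeric(scale(scores))))
```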
Create a copy of the leisure dataset. Call this copy z_leisure. Normalise the logic, reflex, age and comprehension columns with the standardize function, using a for loop. In each iteration of the loop, you should standardize one of these 4 columns. You can first create a vector called columns_to_standardize, where you store the names of these columns, and use it later in the loop. You should not add them as additional columns, but overwrite the original columns.
z_leisure = leisure
columns_to_standardize = c("logic", "reflex", "age", "comprehension")
for (col in columns_to_standardize) {
  # NOTE: the pull function is the safest way to get a vector out of a tibble
  z_leisure[, col] = standardize(pull(leisure[, col]))
}
head(z_leisure)
## # A tibble: 6 x 8
## index id age gender activity logic reflex comprehension
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 1 -0.683 m reading 1.14 -0.424 0.576
## 2 2 2 0.190 m reading 0.962 -1.02 1.29
## 3 3 3 1.41 m reading 0.785 -2.89 0.252
## 4 4 4 -1.03 m reading 1.97 0.920 0.187
## 5 5 5 0.0155 f reading -1.23 -0.330 -0.0728
## 6 6 6 0.190 m reading -0.457 -0.299 -0.332
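The same column-wise standardization can be written without an explicit loop. The sketch below uses base R's lapply() on a small data frame built from a few values in the head(leisure) output (toy and z_toy are hypothetical stand-ins for leisure and z_leisure), so it is an alternative to the for loop the task asks for, not a replacement for it:

```r
standardize = function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Small stand-in data frame, only for illustration
toy = data.frame(
  id = 1:4,
  age = c(26, 31, 38, 24),
  logic = c(88, 85, 82, 102)
)

columns_to_standardize = c("age", "logic")

z_toy = toy
# lapply() applies standardize() to each selected column and
# the result overwrites those columns in the copy
z_toy[columns_to_standardize] = lapply(toy[columns_to_standardize], standardize)

print(z_toy)
```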
Task C
Create a scatterplot of the age and reflex of participants in the leisure dataset. Customise it and add a regression line.
ggplot(data = leisure, mapping = aes(x = age, y = reflex)) +
  geom_point(alpha = 0.2, size = 3) +
  geom_smooth(method = lm, color = 'indianred') +
  labs(x = 'Age', y = 'Reflex')
## `geom_smooth()` using formula 'y ~ x'
Create a function called my_plot() that takes arguments x and y and returns a scatterplot with your customizations and the regression line.
my_plot = function(x, y) {
  g = ggplot(mapping = aes(x = x, y = y)) +
    geom_point(alpha = 0.2, size = 3) +
    geom_smooth(method = lm, color = 'indianred')
  return(g)
}
Now test your my_plot() function on the age and reflex of participants in the leisure dataset.
my_plot(leisure$age, leisure$reflex)
## `geom_smooth()` using formula 'y ~ x'
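One limitation of my_plot() as written is that the axes are labelled x and y, because the function only sees the vectors and not their names. A possible variant, sketched here under the assumption that ggplot2 is installed, takes a data frame plus two column names. The names my_plot2, x_col and y_col are hypothetical, and toy holds the six rows shown by head(leisure):

```r
library(ggplot2)

my_plot2 = function(data, x_col, y_col) {
  # .data[[...]] looks up a column by its name given as a string
  g = ggplot(data = data, mapping = aes(x = .data[[x_col]], y = .data[[y_col]])) +
    geom_point(alpha = 0.2, size = 3) +
    geom_smooth(method = "lm", formula = y ~ x, color = "indianred") +
    labs(x = x_col, y = y_col)  # label the axes with the column names
  return(g)
}

toy = data.frame(age = c(26, 31, 38, 24, 30, 31),
                 reflex = c(13.7, 11.8, 5.8, 18, 14, 14.1))
p = my_plot2(toy, "age", "reflex")
```

Printing p draws the plot. Passing formula = y ~ x explicitly also avoids the message that geom_smooth() otherwise prints.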
Task D
Create a loop that calculates the sum of the integers in 1:10 (i.e. don’t use the existing sum function).
final_sum = 0
for (i in 1:10) {
  final_sum = final_sum + i  # running total
  print(final_sum)
}
## [1] 1
## [1] 3
## [1] 6
## [1] 10
## [1] 15
## [1] 21
## [1] 28
## [1] 36
## [1] 45
## [1] 55
final_sum
## [1] 55
Write a function called my_sum that returns the sum of any vector x. Test it on the logic ratings.
my_sum = function(x) {
  final_sum = 0
  for (i in x) {
    final_sum = final_sum + i
  }
  return(final_sum)
}
my_sum(leisure$logic)
## [1] 6186
sum(leisure$logic)
## [1] 6186
Now write a function that calculates the mean of a vector using a loop. Call it my_mean2 and compare it to both the my_mean function you created and the in-built mean function. (Bonus: Can you also think of a way to do this without using the length function?)
my_mean2 = function(x) {
  final_sum = 0
  final_length = 0  # counts the elements, so length() is not needed
  for (i in x) {
    final_sum = final_sum + i
    final_length = final_length + 1
  }
  return(final_sum / final_length)
}
my_mean(leisure$logic)
## [1] 68.73333
my_mean2(leisure$logic)
## [1] 68.73333
mean(leisure$logic)
## [1] 68.73333
Task E
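Create a vector called p_values with 100 NA values.
Draw a sample of 10 values from a normal distribution with mean 0 and standard deviation 1, run a one-sample t-test on it, and save the resulting p-value in the first position of p_values.
Using a loop, repeat this 100 times so that you fill p_values with 100 p-values.
Take the vector p_values and calculate the proportion of p-values that are significant at the .05 level.
p_values = rep(NA, 100)
sample = rnorm(mean=0, sd=1, n=10)
p_values[1] = t.test(sample)$p.value
p_values[1]
## [1] 0.7385729
p_values = rep(NA, 100)
for (i in 1:100) {
  sample = rnorm(mean=0, sd=1, n=10)
  p_values[i] = t.test(sample)$p.value
}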
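To calculate the proportion of significant p-values, a logical comparison combined with mean() is enough, because TRUE counts as 1 and FALSE as 0. A sketch that reruns the loop with a fixed seed (the seed value 123 and the name prop_significant are arbitrary choices for illustration, not part of the worksheet):

```r
set.seed(123)  # arbitrary seed so the simulation is reproducible

p_values = rep(NA, 100)
for (i in 1:100) {
  x = rnorm(n = 10, mean = 0, sd = 1)   # sample under a true mean of 0
  p_values[i] = t.test(x)$p.value       # one-sample t-test against 0
}

# Proportion of p-values below .05
prop_significant = mean(p_values < .05)
print(prop_significant)
```

Since the true mean is 0, every significant result here is a false positive, so the proportion should land near .05, with noticeable run-to-run variation at only 100 simulations.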
Create a function called p_simulation with 4 arguments: sim: the number of simulations, samplesize: the sample size, mu_true: the true mean, and sd_true: the true standard deviation. Your function should repeat the simulation from the previous question with the given arguments. That is, it should calculate sim p-values, each testing whether the mean of a sample of size samplesize from a normal distribution with mean = mu_true and standard deviation = sd_true is significantly different from 0. The function should return a vector of p-values.
Note: to get the p-value of a t-test:
p_value = t.test(x)$p.value # Calculate the p-value for the sample x
p_simulation = function(sim, samplesize, mu_true, sd_true) {
  p_values = rep(NA, sim)
  for (i in seq_len(sim)) {  # seq_len() also behaves sensibly when sim is 0
    sample = rnorm(mean = mu_true, sd = sd_true, n = samplesize)
    p_values[i] = t.test(sample)$p.value
  }
  return(p_values)
}
p_values = p_simulation(sim=1000, samplesize=100, mu_true=0, sd_true=1)
mean(p_values < .05)*100
## [1] 5.2
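With mu_true = 0, the percentage above estimates the false-positive rate. Setting mu_true away from 0 turns the same simulation into an estimate of statistical power. The sketch below uses a compact replicate()-based version of the function; p_simulation_rep, the seed, and the chosen argument values are illustrative assumptions rather than part of the worksheet:

```r
set.seed(1)  # arbitrary seed for reproducibility

# Same idea as p_simulation(), with replicate() in place of the for loop
p_simulation_rep = function(sim, samplesize, mu_true, sd_true) {
  replicate(sim, t.test(rnorm(n = samplesize, mean = mu_true, sd = sd_true))$p.value)
}

p_null = p_simulation_rep(sim = 500, samplesize = 30, mu_true = 0, sd_true = 1)
p_effect = p_simulation_rep(sim = 500, samplesize = 30, mu_true = 0.5, sd_true = 1)

print(mean(p_null < .05) * 100)    # percentage of false positives
print(mean(p_effect < .05) * 100)  # estimated power, in percent
```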