Open RStudio.
Open a new R script and save it as wpa_4_LastFirst.R (where Last and First are your last and first name).
Be careful about capitalization, the order of last and first name, and using _ instead of -.
At the top of your script, write the following (with appropriate changes):
# Assignment: WPA 4
# Name: Laura Fontanesi
# Date: 5 April 2022
Download the data ccam_original.sav from the data folder on Github and load it in R.
library(tidyverse)
library(haven)
# Load data in R
survey_data_raw = read_spss("data/ccam_original.sav")
Data wrangling is the process of transforming data from a raw format (in which they were collected) into another format with the intent of making it more appropriate and valuable for exploratory and confirmatory data analyses.
Today’s class is based on specific functions in the tidyverse that serve exactly this purpose.
Virtually all you need to know is in this cheatsheet.
The most frequently used functions are:
- %>%: the pipe operator, to chain functions (not recommended at the beginning).
- arrange: to order the dataframe based on the values of variables.
- rename: to rename specific columns.
- filter and slice: to select subsets of rows.
- select: to select subsets of columns.
- mutate: to make new columns based on modifications of existing ones.
- mutate_all: to make the same modification to all columns.
- mutate_if: to make the same modification to columns satisfying specific conditions (see the sketch after this list). Look here for other kinds of mutate.
- summarize: to calculate summary statistics.
- summarize_all: to calculate summary statistics on all columns.
- summarize_if: to calculate summary statistics on columns satisfying specific conditions. Look here for other kinds of summarize.
- group_by: to group the data based on some variables, so that subsequent calculations are done separately per group.
- full_join, left_join, right_join: to join separate dataframes.
- bind_rows and bind_cols: to append dataframes vertically or horizontally.
- gather and spread: to bring wide-form data to long form and vice versa.
- unite: to make a column from multiple columns.
- separate: to make multiple columns from one column.
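A few of these verbs (mutate_if, summarise_if, spread) are not demonstrated on the survey data below, so here is a minimal sketch on a toy tibble (invented values, purely for illustration):
toy_data = tibble(
  id = c("a", "b", "c", "d"),        # character column
  group = c(1, 1, 2, 2),             # numeric column
  score = c(2.41, 3.07, 1.86, 2.95)  # numeric column
)

# mutate_if: round all numeric columns (the condition here is is.numeric)
mutate_if(toy_data, is.numeric, round)

# summarise_if: compute the mean of all numeric columns
summarise_if(toy_data, is.numeric, mean)

# spread: bring long form to wide form (one column per value of id)
spread(select(toy_data, id, score), key = id, value = score)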
The first thing you want to do when importing data is to inspect it visually, to get to know the information it contains. You can always click on the data in the Environment tab, but there are also other useful functions that allow us to see the data types (glimpse), to sort the data based on the values of one or more columns (arrange), and to visualize the first or the last n rows (head and tail, respectively).
glimpse(survey_data_raw)
## Rows: 20,024
## Columns: 54
## $ case_ID <dbl> 2, 3, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26…
## $ wave <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ year <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ weight_wave <dbl> 0.54, 0.85, 0.49, 0.29, 1.29, 2.56, 0.23, 0.82, 0.38, 0.35, 0.91, 1.01, 0.82, 0…
## $ weight_aggregate <dbl> 0.29392628, 0.46266174, 0.26671088, 0.15784930, 0.70215723, 1.39342829, 0.12519…
## $ happening <dbl+lbl> 3, 2, 2, 3, 3, 2, 3, 1, 1, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 1, 3, 3, 3, …
## $ cause_original <dbl+lbl> 1, 1, 2, 2, 1, 2, 1, 1, 4, 2, 1, 1, 1, 1, 3, 2, 1, 1, 1, 3, 1, 2, 2, 2, 1, …
## $ cause_other_text <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "both of the above", ""…
## $ cause_recoded <dbl+lbl> 6, 6, 4, 4, 6, 4, 6, 6, 3, 4, 6, 6, 6, 6, 5, 4, 6, 6, 6, 5, 6, 4, 4, 4, 6, …
## $ sci_consensus <dbl+lbl> 4, 1, 2, 4, 2, 2, 2, 3, 2, 4, 1, 2, 4, 4, 4, 2, 2, 1, 4, 4, 4, 2, 2, 2, 4, …
## $ worry <dbl+lbl> 3, 2, 1, 3, 3, 2, 3, 2, 1, 3, 2, 3, 3, 3, 4, 2, 4, 2, 4, 1, 3, 1, 2, 2, 2, …
## $ harm_personally <dbl+lbl> 2, 2, 1, 2, 0, 0, 2, 3, 1, 0, 0, 1, 3, 2, 4, 2, 3, 3, 3, 1, 3, 1, 2, 2, 1, …
## $ harm_US <dbl+lbl> 3, -1, 1, 2, 0, 0, 3, 3, 1, 0, 0, 0, 3, 2, 4, 2, 3, 3, 3,…
## $ harm_dev_countries <dbl+lbl> 4, 2, 1, 3, 0, 0, 4, 3, 1, 0, 0, 0, 4, 2, 4, 2, 4, 0, 3, 1, 4, 1, 2, 2, 3, …
## $ harm_future_gen <dbl+lbl> 4, 3, 1, 3, 0, 0, 4, 3, 1, 4, 0, 0, 4, 3, 4, 3, 4, 0, 4,…
## $ harm_plants_animals <dbl+lbl> 4, 3, 1, 3, 3, 0, 4, 3, 1, 4, 0, 3, 4, 2, 4, 3, 4, 3, 4, 1, 4, 1, 2, 3, 3, …
## $ when_harm_US <dbl+lbl> 5, 3, 1, 4, 2, 2, 5, 4, 1, 3, 3, 3, 5, 4, 5, 2, 5, 6, 6, 1, 6, 1, 2, 2, 4, …
## $ reg_CO2_pollutant <dbl+lbl> 4, 3, 2, 3, 3, 1, 3, 2, 2, 3, 3, 4, 4, 3, 3, 2, 3, 4, 4,…
## $ reg_utilities <dbl+lbl> 4, 3, 1, 4, 1, 1, 2, 2, 2, 4, 3, 1, 4, 3, 2, 2, 3, 1, 4,…
## $ fund_research <dbl+lbl> 4, 3, 1, 4, 4, 3, 4, 2, 3, 4, 3, 1, 3, 3, 4, 3, 4, 3, 4,…
## $ reg_coal_emissions <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ discuss_GW <dbl+lbl> 3, 2, 1, 2, 1, 2, 3, 2, 3, 3, 2, 2, 2, 1, 3, 2, 2, 1, 3, 1, 4, 1, 2, 2, 2, …
## $ hear_GW_media <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ gender <dbl+lbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 2, …
## $ age <dbl> 78, 45, 54, 71, 26, 29, 29, 34, 46, 60, 65, 67, 68, 60, 63, 32, 44, 76, 38, 45,…
## $ age_category <dbl+lbl> 3, 2, 2, 3, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 1, 2, 3, 2, 2, 2, 3, 2, 1, 3, …
## $ generation <dbl+lbl> 5, 4, 4, 5, 2, 3, 3, 3, 4, 4, 5, 5, 5, 4, 5, 3, 4, 5, 3, 4, 3, 5, 3, 3, 5, …
## $ educ <dbl+lbl> 9, 6, 14, 13, 10, 12, 12, 12, 12, 12, 9, 14, 9, 10, 9, 13, 12, 9, 12,…
## $ educ_category <dbl+lbl> 2, 1, 4, 4, 3, 4, 4, 4, 4, 4, 2, 4, 2, 3, 2, 4, 4, 2, 4, 3, 4, 2, 4, 2, 4, …
## $ income <dbl+lbl> 12, 9, 9, 16, 13, 14, 3, 12, 16, 17, 8, 7, 16, 1, 7, 16, 16, 7, 8,…
## $ income_category <dbl+lbl> 2, 1, 1, 3, 2, 2, 1, 2, 3, 3, 1, 1, 3, 1, 1, 3, 3, 1, 1, 2, 3, 1, 2, 2, 1, …
## $ race <dbl+lbl> 1, 1, 4, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ ideology <dbl+lbl> 4, 3, 4, 4, 4, 5, 4, 2, 4, 5, 3, 4, 3, 3, 1, 4, 2, 3, 2, 4, 2, 5, 5, 3, 1, …
## $ party <dbl+lbl> 1, 5, 1, 3, 1, 3, 1, 3, 3, 1, 2, 1, 3, 2, 2, 1, 2, 2, 3, 5, 2, 1, 1, 3, 2, …
## $ party_w_leaners <dbl+lbl> 1, 4, 1, 1, 1, 1, 1, 3, 3, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 4, 2, 1, 1, 1, 2, …
## $ party_x_ideo <dbl+lbl> 5, -2, 5, 5, 5, 5, 5, 3, 3, 5, 2, 5, 2, 2, 1, 5, 1, 2, 1,…
## $ registered_voter <dbl+lbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, …
## $ region9 <dbl+lbl> 5, 3, 8, 5, 6, 9, 8, 8, 2, 7, 5, 7, 2, 4, 7, 7, 4, 2, 1, 5, 8, 5, 3, 7, 5, …
## $ region4 <dbl+lbl> 3, 2, 4, 3, 3, 4, 4, 4, 1, 3, 3, 3, 1, 2, 3, 3, 2, 1, 1, 3, 4, 3, 2, 3, 3, …
## $ religion <dbl+lbl> 2, 2, 4, 2, 1, 11, 4, 15, 2, 1, 2, 2, 5, 3, 1, 11, 2, 2, 15,…
## $ religion_other_nonchristian <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ evangelical <dbl+lbl> 2, 3, 2, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 3, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, …
## $ service_attendance <dbl+lbl> 5, 2, 5, 2, 5, 5, 5, 1, 5, 4, 3, 2, 2, 1, 5, 5, 5, 3, 1, 5, 3, 6, 5, 2, 2, …
## $ marit_status <dbl+lbl> 2, 5, 1, 1, 1, 1, 1, 3, 1, 1, 2, 3, 1, 3, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 3, …
## $ employment <dbl+lbl> 5, 6, 4, 5, 1, 2, 1, 1, 2, 7, 1, 5, 5, 5, 1, 7, 1, 5, 1, 1, 1, 5, 1, 1, 5, …
## $ house_head <dbl+lbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ house_size <dbl> 3, 2, 2, 2, 2, 3, 2, 2, 2, 4, 1, 1, 2, 1, 2, 3, 4, 2, 1, 5, 3, 3, 3, 4, 1, 1, 2…
## $ house_ages0to1 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ house_ages2to5 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ house_ages6to12 <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 1, 0, 1, 2, 0, 0, 0…
## $ house_ages13to17 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ house_ages18plus <dbl> 3, 2, 2, 2, 2, 2, 2, 2, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 1, 3, 2, 3, 2, 2, 1, 1, 2…
## $ house_type <dbl+lbl> 1, 4, 1, 1, 1, 2, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 3, 2, 1, 1, 1, 1, 3, …
## $ house_own <dbl+lbl> 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, …
# sort by one variable, and only show the first 6 values
head(arrange(survey_data_raw, wave), 6)
## # A tibble: 6 x 54
## case_ID wave year weight_wave weight_aggregate happening cause_original cause_other_text cause_recoded
## <dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <chr> <dbl+lbl>
## 1 2 1 [Nov 2008] 1 [2008] 0.54 0.294 3 [Yes] 1 [Caused mos… "" 6 [Caused mo…
## 2 3 1 [Nov 2008] 1 [2008] 0.85 0.463 2 [Don't… 1 [Caused mos… "" 6 [Caused mo…
## 3 5 1 [Nov 2008] 1 [2008] 0.49 0.267 2 [Don't… 2 [Caused mos… "" 4 [Caused mo…
## 4 6 1 [Nov 2008] 1 [2008] 0.29 0.158 3 [Yes] 2 [Caused mos… "" 4 [Caused mo…
## 5 7 1 [Nov 2008] 1 [2008] 1.29 0.702 3 [Yes] 1 [Caused mos… "" 6 [Caused mo…
## 6 8 1 [Nov 2008] 1 [2008] 2.56 1.39 2 [Don't… 2 [Caused mos… "" 4 [Caused mo…
## # … with 45 more variables: sci_consensus <dbl+lbl>, worry <dbl+lbl>, harm_personally <dbl+lbl>, harm_US <dbl+lbl>,
## # harm_dev_countries <dbl+lbl>, harm_future_gen <dbl+lbl>, harm_plants_animals <dbl+lbl>, when_harm_US <dbl+lbl>,
## # reg_CO2_pollutant <dbl+lbl>, reg_utilities <dbl+lbl>, fund_research <dbl+lbl>, reg_coal_emissions <dbl+lbl>,
## # discuss_GW <dbl+lbl>, hear_GW_media <dbl+lbl>, gender <dbl+lbl>, age <dbl>, age_category <dbl+lbl>,
## # generation <dbl+lbl>, educ <dbl+lbl>, educ_category <dbl+lbl>, income <dbl+lbl>, income_category <dbl+lbl>,
## # race <dbl+lbl>, ideology <dbl+lbl>, party <dbl+lbl>, party_w_leaners <dbl+lbl>, party_x_ideo <dbl+lbl>,
## # registered_voter <dbl+lbl>, region9 <dbl+lbl>, region4 <dbl+lbl>, religion <dbl+lbl>, …
# sort by 2 variables (one in opposite order), and only show the last 10 values
tail(arrange(survey_data_raw, wave, desc(year)), 10)
## # A tibble: 10 x 54
## case_ID wave year weight_wave weight_aggregate happening cause_original cause_other_text cause_recoded
## <dbl> <dbl+lbl> <dbl+lbl> <dbl> <dbl> <dbl+lbl> <dbl+lbl> <chr> <dbl+lbl>
## 1 35442 17 [Oct 2017] 9 [2017] 1.29 1.17 3 [Yes] 1 [Caused mos… "" 6 [Caused mo…
## 2 35443 17 [Oct 2017] 9 [2017] 0.502 0.454 3 [Yes] 1 [Caused mos… "" 6 [Caused mo…
## 3 35445 17 [Oct 2017] 9 [2017] 0.879 0.794 3 [Yes] 2 [Caused mos… "" 4 [Caused mo…
## 4 35446 17 [Oct 2017] 9 [2017] 0.759 0.686 3 [Yes] 2 [Caused mos… "" 4 [Caused mo…
## 5 35448 17 [Oct 2017] 9 [2017] 1.39 1.25 3 [Yes] 1 [Caused mos… "" 6 [Caused mo…
## 6 35450 17 [Oct 2017] 9 [2017] 1.35 1.22 3 [Yes] 2 [Caused mos… "" 4 [Caused mo…
## 7 35451 17 [Oct 2017] 9 [2017] 0.911 0.823 3 [Yes] 3 [Other (Ple… "combination of… 5 [Caused by…
## 8 35452 17 [Oct 2017] 9 [2017] 1.04 0.936 3 [Yes] 1 [Caused mos… "" 6 [Caused mo…
## 9 35454 17 [Oct 2017] 9 [2017] 0.606 0.547 2 [Don't… 2 [Caused mos… "" 4 [Caused mo…
## 10 35455 17 [Oct 2017] 9 [2017] 1.18 1.06 3 [Yes] 1 [Caused mos… "" 6 [Caused mo…
## # … with 45 more variables: sci_consensus <dbl+lbl>, worry <dbl+lbl>, harm_personally <dbl+lbl>, harm_US <dbl+lbl>,
## # harm_dev_countries <dbl+lbl>, harm_future_gen <dbl+lbl>, harm_plants_animals <dbl+lbl>, when_harm_US <dbl+lbl>,
## # reg_CO2_pollutant <dbl+lbl>, reg_utilities <dbl+lbl>, fund_research <dbl+lbl>, reg_coal_emissions <dbl+lbl>,
## # discuss_GW <dbl+lbl>, hear_GW_media <dbl+lbl>, gender <dbl+lbl>, age <dbl>, age_category <dbl+lbl>,
## # generation <dbl+lbl>, educ <dbl+lbl>, educ_category <dbl+lbl>, income <dbl+lbl>, income_category <dbl+lbl>,
## # race <dbl+lbl>, ideology <dbl+lbl>, party <dbl+lbl>, party_w_leaners <dbl+lbl>, party_x_ideo <dbl+lbl>,
## # registered_voter <dbl+lbl>, region9 <dbl+lbl>, region4 <dbl+lbl>, religion <dbl+lbl>, …
In basic R, we saw that we can select specific columns and rows using indexing. In tidyverse, we have select (for columns based on column labels, similar to indexing with character vectors or with the $ operator), filter (for rows based on values, similar to indexing with logical vectors), and slice (for rows based on row numbers, similar to indexing with a numerical vector).
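To make the parallel with base R indexing concrete, here is a minimal sketch on a toy tibble (invented values):
toy = tibble(a = 1:5, b = letters[1:5])

select(toy, b)      # similar to toy[, 'b'] or toy$b
filter(toy, a > 3)  # similar to toy[toy$a > 3, ]
slice(toy, 1:2)     # similar to toy[1:2, ]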
interesting_columns = c('wave', 'year', 'happening', 'cause_recoded', 'sci_consensus', 'worry', 'harm_personally',
'harm_US', 'harm_dev_countries', 'harm_future_gen', 'harm_plants_animals', 'when_harm_US',
'reg_CO2_pollutant', 'reg_utilities', 'fund_research', 'discuss_GW',
'gender', 'age_category', 'educ_category', 'income_category', 'race', 'party_x_ideo', 'region4', 'employment')
columns_to_drop = c('case_ID', 'weight_wave', 'weight_aggregate', 'cause_original', 'cause_other_text',
'reg_coal_emissions', 'hear_GW_media', 'age', 'generation', 'educ', 'income', 'ideology',
'party', 'party_w_leaners', 'registered_voter', 'region9', 'religion', 'religion_other_nonchristian',
'evangelical', 'service_attendance', 'marit_status', 'house_head', 'house_size',
'house_ages0to1', 'house_ages2to5', 'house_ages6to12', 'house_ages13to17', 'house_ages18plus',
'house_type', 'house_own')
# use all_of to include columns
survey_data = select(survey_data_raw, all_of(interesting_columns))
# use -all_of to exclude columns (here this gives the same result as above)
survey_data = select(survey_data_raw, -all_of(columns_to_drop))
# equivalent to survey_data[200:210,]
slice(survey_data, 200:210)
## # A tibble: 11 x 24
## wave year happening cause_recoded sci_consensus worry harm_personally harm_US harm_dev_countr…
## <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+lbl> <dbl+l> <dbl+lbl> <dbl+l> <dbl+lbl>
## 1 1 [Nov 2008] 1 [2008] 3 [Yes] 5 [Caused by… 2 [There is … 1 [Not… 1 [Not at all] 3 [A m… 3 [A moderate a…
## 2 1 [Nov 2008] 1 [2008] 2 [Don't know] 4 [Caused mo… 2 [There is … 3 [Som… 0 [Don't know] 0 [Don… 0 [Don't know]
## 3 1 [Nov 2008] 1 [2008] 1 [No] 4 [Caused mo… 2 [There is … 1 [Not… 1 [Not at all] 1 [Not… 1 [Not at all]
## 4 1 [Nov 2008] 1 [2008] 2 [Don't know] 4 [Caused mo… 2 [There is … 2 [Not… 1 [Not at all] 1 [Not… 1 [Not at all]
## 5 1 [Nov 2008] 1 [2008] 1 [No] 4 [Caused mo… 2 [There is … 1 [Not… 1 [Not at all] 3 [A m… 1 [Not at all]
## 6 1 [Nov 2008] 1 [2008] 3 [Yes] 6 [Caused mo… 4 [Most scie… 3 [Som… 3 [A moderate … 3 [A m… 3 [A moderate a…
## 7 1 [Nov 2008] 1 [2008] 3 [Yes] 6 [Caused mo… 4 [Most scie… 4 [Ver… 2 [Only a litt… 3 [A m… 4 [A great deal]
## 8 1 [Nov 2008] 1 [2008] 3 [Yes] 5 [Caused by… 4 [Most scie… 3 [Som… 3 [A moderate … 3 [A m… 3 [A moderate a…
## 9 1 [Nov 2008] 1 [2008] 3 [Yes] 6 [Caused mo… 4 [Most scie… 1 [Not… 1 [Not at all] 1 [Not… 1 [Not at all]
## 10 1 [Nov 2008] 1 [2008] 3 [Yes] 6 [Caused mo… 4 [Most scie… 3 [Som… 3 [A moderate … 4 [A g… 4 [A great deal]
## 11 1 [Nov 2008] 1 [2008] 1 [No] 4 [Caused mo… 2 [There is … 1 [Not… 1 [Not at all] 1 [Not… 1 [Not at all]
## # … with 15 more variables: harm_future_gen <dbl+lbl>, harm_plants_animals <dbl+lbl>, when_harm_US <dbl+lbl>,
## # reg_CO2_pollutant <dbl+lbl>, reg_utilities <dbl+lbl>, fund_research <dbl+lbl>, discuss_GW <dbl+lbl>,
## # gender <dbl+lbl>, age_category <dbl+lbl>, educ_category <dbl+lbl>, income_category <dbl+lbl>, race <dbl+lbl>,
## # party_x_ideo <dbl+lbl>, region4 <dbl+lbl>, employment <dbl+lbl>
# eliminate people (i.e., rows) who didn't reply in some variables of interest (an absence of reply was coded as -1 here)
survey_data = filter(survey_data,
happening > 0,
cause_recoded > 0,
sci_consensus > 0,
worry > 0)
In basic R, we saw that we can change information in the dataset by indexing specific columns and/or rows and reassigning values to them. In tidyverse, we have specific functions that do this directly: the mutate functions. mutate_all applies one transforming function to all columns, while mutate allows us to make specific changes to single columns.
# convert all columns to integer type
survey_data = mutate_all(survey_data,
as.integer)
Note that when you use an existing column label, the column is replaced with the new values. If you want to simply add a column based on the values of an existing one, use a non-existing label instead.
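For example, on a toy tibble (invented values), the same modification either replaces or adds a column depending on the label:
toy = tibble(x = 1:3)

mutate(toy, x = x + 10)      # existing label: the column x is replaced
mutate(toy, x_new = x + 10)  # new label: x is kept and x_new is added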
survey_data = mutate(survey_data,
happening_cont = happening + rnorm(mean=0, sd=.5, n=nrow(survey_data)),
worry_cont = worry + rnorm(mean=0, sd=.5, n=nrow(survey_data)),
year=recode(year,
`1` = 2008,
`2` = 2010,
`3` = 2011,
`4` = 2012,
`5` = 2013,
`6` = 2014,
`7` = 2015,
`8` = 2016,
`9` = 2017),
happening_labels=recode(happening,
`1` = "no",
`2` = "dont know",
`3` = "yes"),
cause_recoded=recode(cause_recoded,
`1` = "dont know",
`2` = "other",
`3` = "not happening",
`4` = "natural",
`5` = "human",
`6` = "natural and human"),
sci_consensus=recode(sci_consensus,
`1` = "dont know",
`2` = "disagreement",
`3` = "not happening",
`4` = "happening"),
gender=recode(gender,
`1` = "male",
`2` = "female"),
age_category_labels=recode(age_category,
`1` = "18-34",
`2` = "35-54",
`3` = "55+"),
educ_category_labels=recode(educ_category,
`1` = "no highschool",
`2` = "highschool",
`3` = "college",
`4` = "bachelor or higher"),
income_category_labels=recode(income_category,
`1` = "less 50000",
`2` = "50000-99999",
`3` = "more 100000"),
race=recode(race,
`1` = 'white non hisp',
`2` = 'black non hisp',
`3` = 'other non hisp',
`4` = 'hisp'),
party_x_ideo=recode(party_x_ideo,
`-2` = "no interest",
`-1` = "refused",
`1` = "liberal democrat",
`2` = "moderate democrate",
`3` = "independent",
`4` = "moderate republican",
`5` = "conservative republican"),
region4 = recode(region4,
`1` = "northeast",
`2` = "midwest",
`3` = "south",
`4` = "west"),
employment = recode(employment,
`1` = "Working/as a paid employee",
`2` = "Working/selfemploye",
`3` = "Not working/temporary",
`4` = "Not working/looking",
`5` = "Not working/retired",
`6` = "Not working/disabled",
`7` = "Not working/other"))
Another way to inspect the data is to compute summary statistics (like mean, SD, quantiles, percentages, etc.) on each column. In tidyverse, the function summarise allows you to specify what to compute for each column:
summarise(survey_data,
count=n(),
min_worry=min(worry_cont),
quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
mean_worry=mean(worry_cont),
quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
max_worry=max(worry_cont),
min_happening=min(happening_cont),
quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
mean_happening=mean(happening_cont),
quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
max_happening=max(happening_cont))
## # A tibble: 1 x 11
## count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happening
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 18514 -0.935 1.74 2.52 3.29 6.24 -0.669 1.93
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>
summarise(survey_data,
mean_happening = mean(happening))
## # A tibble: 1 x 1
## mean_happening
## <dbl>
## 1 2.50
By default, summarise computes statistics on the whole dataset. However, when you have different experimental conditions and/or hierarchical data (repeated measures, longitudinal data, multi-site data, etc.), you might want to compute separate statistics per group/level defined by a categorical variable (the one that specifies the condition/group). In tidyverse, this can be achieved by using the group_by function together with the summarise function:
grouped_data = group_by(survey_data, year)
summarise(grouped_data,
count=n(),
min_worry=min(worry_cont),
quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
mean_worry=mean(worry_cont),
quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
max_worry=max(worry_cont),
min_happening=min(happening_cont),
quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
mean_happening=mean(happening_cont),
quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
max_happening=max(happening_cont))
## # A tibble: 9 x 12
## year count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happening
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2008 2130 -0.460 1.90 2.64 3.33 5.55 -0.353 2.09
## 2 2010 1993 -0.574 1.63 2.42 3.22 5.51 -0.338 1.65
## 3 2011 1946 -0.935 1.71 2.43 3.19 5.52 -0.215 1.84
## 4 2012 2050 -0.484 1.75 2.50 3.29 5.35 -0.283 2.03
## 5 2013 1858 -0.436 1.67 2.46 3.24 5.83 -0.131 1.85
## 6 2014 2282 -0.304 1.72 2.50 3.30 5.32 -0.334 1.82
## 7 2015 1263 -0.512 1.64 2.40 3.17 5.43 -0.223 1.74
## 8 2016 2427 -0.291 1.76 2.56 3.36 6.24 -0.669 2.11
## 9 2017 2565 -0.499 1.85 2.63 3.42 5.64 -0.360 2.08
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>
Sometimes, we get data with categorical variables that contain more than one variable. For example, in our dataset, the employment column contains information about whether the person is working or unemployed (one categorical variable with 2 levels) and about the type of work (another categorical variable with as many levels as there are employment types).
With the separate function we can separate this information based on a specific text separator (in this case, /):
survey_data = separate(survey_data,
col = employment,
into = c('working', 'working_type'),
sep = '/',
remove = FALSE)
head(select(survey_data, employment, working, working_type), 10)
## # A tibble: 10 x 3
## employment working working_type
## <chr> <chr> <chr>
## 1 Not working/retired Not working retired
## 2 Not working/disabled Not working disabled
## 3 Not working/looking Not working looking
## 4 Not working/retired Not working retired
## 5 Working/as a paid employee Working as a paid employee
## 6 Working/selfemploye Working selfemploye
## 7 Working/as a paid employee Working as a paid employee
## 8 Working/as a paid employee Working as a paid employee
## 9 Working/selfemploye Working selfemploye
## 10 Not working/other Not working other
You can also create a new column via mutate and the case_when function for the specific categories you are interested in for your analysis:
survey_data = mutate(survey_data,
working_recoded = case_when(working == "Not working" & working_type == 'looking' ~ 0,
working == "Not working" & working_type != 'looking' ~ 1,
working == "Working" ~ 2))
head(select(survey_data, employment, working, working_type, working_recoded), 10)
## # A tibble: 10 x 4
## employment working working_type working_recoded
## <chr> <chr> <chr> <dbl>
## 1 Not working/retired Not working retired 1
## 2 Not working/disabled Not working disabled 1
## 3 Not working/looking Not working looking 0
## 4 Not working/retired Not working retired 1
## 5 Working/as a paid employee Working as a paid employee 2
## 6 Working/selfemploye Working selfemploye 2
## 7 Working/as a paid employee Working as a paid employee 2
## 8 Working/as a paid employee Working as a paid employee 2
## 9 Working/selfemploye Working selfemploye 2
## 10 Not working/other Not working other 1
Once you have these groups, you can calculate the statistics you are interested in:
grouped_data = group_by(survey_data, working)
summarise(grouped_data,
count=n(),
min_worry=min(worry_cont),
quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
mean_worry=mean(worry_cont),
quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
max_worry=max(worry_cont),
min_happening=min(happening_cont),
quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
mean_happening=mean(happening_cont),
quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
max_happening=max(happening_cont))
## # A tibble: 2 x 12
## working count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happen…
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Not working 7996 -0.935 1.72 2.51 3.30 5.83 -0.334 1.91
## 2 Working 10518 -0.798 1.76 2.52 3.28 6.24 -0.669 1.95
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>
# first filter, and then get summary statistics per group
not_working_data = filter(survey_data, working == "Not working")
grouped_data = group_by(not_working_data, working_type)
summarise(grouped_data,
count=n(),
min_worry=min(worry_cont),
quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
mean_worry=mean(worry_cont),
quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
max_worry=max(worry_cont),
min_happening=min(happening_cont),
quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
mean_happening=mean(happening_cont),
quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
max_happening=max(happening_cont))
## # A tibble: 5 x 12
## working_type count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happe…
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 disabled 1259 -0.526 1.84 2.66 3.51 5.52 -0.263 2.00
## 2 looking 1112 -0.935 1.88 2.59 3.33 5.83 -0.215 2.02
## 3 other 1370 -0.574 1.62 2.45 3.24 5.64 -0.334 1.80
## 4 retired 4099 -0.470 1.68 2.45 3.24 5.42 -0.260 1.89
## 5 temporary 156 0.157 2.03 2.66 3.32 5.06 -0.0173 1.82
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>
As we have done in basic R with the paste function, we can also merge the values of two columns to create new groups we might be interested in. In tidyverse we use unite:
survey_data = unite(survey_data,
"race_gender",
race,
gender,
sep = "_",
remove=FALSE)
head(select(survey_data, race, gender, race_gender))
## # A tibble: 6 x 3
## race gender race_gender
## <chr> <chr> <chr>
## 1 white non hisp female white non hisp_female
## 2 white non hisp male white non hisp_male
## 3 hisp female hisp_female
## 4 white non hisp male white non hisp_male
## 5 white non hisp female white non hisp_female
## 6 white non hisp male white non hisp_male
grouped_data = group_by(survey_data, race_gender)
summarise(grouped_data,
count=n(),
min_worry=min(worry_cont),
quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
mean_worry=mean(worry_cont),
quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
max_worry=max(worry_cont),
min_happening=min(happening_cont),
quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
mean_happening=mean(happening_cont),
quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
max_happening=max(happening_cont))
## # A tibble: 8 x 12
## race_gender count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happ…
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 black non hisp_female 875 -0.463 1.93 2.62 3.35 5.21 0.267 2.26
## 2 black non hisp_male 724 -0.798 1.94 2.59 3.35 5.20 0.0225 2.16
## 3 hisp_female 875 -0.436 2.14 2.86 3.63 5.64 -0.142 2.21
## 4 hisp_male 919 -0.935 1.92 2.69 3.50 5.22 -0.669 2.12
## 5 other non hisp_female 562 -0.460 2.02 2.72 3.36 5.47 -0.156 2.14
## 6 other non hisp_male 570 -0.204 1.82 2.63 3.49 5.59 -0.0173 2.11
## 7 white non hisp_female 7000 -0.574 1.82 2.56 3.32 5.55 -0.549 1.98
## 8 white non hisp_male 6989 -0.526 1.54 2.35 3.14 6.24 -0.558 1.70
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>
Or you can just use 2 grouping variables to compute statistics:
grouped_data = group_by(survey_data, race, gender)
summarise(grouped_data,
count=n(),
min_worry=min(worry_cont),
quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
mean_worry=mean(worry_cont),
quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
max_worry=max(worry_cont),
min_happening=min(happening_cont),
quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
mean_happening=mean(happening_cont),
quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
max_happening=max(happening_cont))
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
## # A tibble: 8 x 13
## # Groups: race [4]
## race gender count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happ…
## <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 black non hisp female 875 -0.463 1.93 2.62 3.35 5.21 0.267 2.26
## 2 black non hisp male 724 -0.798 1.94 2.59 3.35 5.20 0.0225 2.16
## 3 hisp female 875 -0.436 2.14 2.86 3.63 5.64 -0.142 2.21
## 4 hisp male 919 -0.935 1.92 2.69 3.50 5.22 -0.669 2.12
## 5 other non hisp female 562 -0.460 2.02 2.72 3.36 5.47 -0.156 2.14
## 6 other non hisp male 570 -0.204 1.82 2.63 3.49 5.59 -0.0173 2.11
## 7 white non hisp female 7000 -0.574 1.82 2.56 3.32 5.55 -0.549 1.98
## 8 white non hisp male 6989 -0.526 1.54 2.35 3.14 6.24 -0.558 1.70
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>
Until now, we have used the tidyverse functions like all the other functions we have used in basic R:
new_object = function1(old_object, some_argument)
new_object = function2(new_object, some_argument)
new_object = function3(new_object, some_argument)
However, in tidyverse, we can compact this notation using pipes:
new_object = old_object %>%
function1(some_argument) %>%
function2(some_argument) %>%
function3(some_argument)
As you can see, this allows us to define an analysis pipeline in which each computation is done on the result of the previous one.
Here is how we could rewrite all previous computations from loading the raw data to the grouped summary statistics:
survey_data_raw = read_spss("data/ccam_original.sav")
survey_data = survey_data_raw %>%
select(all_of(interesting_columns)) %>%
filter(happening > 0,
cause_recoded > 0,
sci_consensus > 0,
worry > 0) %>%
mutate_all(as.integer) %>%
mutate(happening_cont = happening + rnorm(mean=0, sd=.5, n=n()),
worry_cont = worry + rnorm(mean=0, sd=.5, n=n()),
year=recode(year,
`1` = 2008,
`2` = 2010,
`3` = 2011,
`4` = 2012,
`5` = 2013,
`6` = 2014,
`7` = 2015,
`8` = 2016,
`9` = 2017),
happening_labels=recode(happening,
`1` = "no",
`2` = "dont know",
`3` = "yes"),
cause_recoded=recode(cause_recoded,
`1` = "dont know",
`2` = "other",
`3` = "not happening",
`4` = "natural",
`5` = "human",
`6` = "natural and human"),
sci_consensus=recode(sci_consensus,
`1` = "dont know",
`2` = "disagreement",
`3` = "not happening",
`4` = "happening"),
gender=recode(gender,
`1` = "male",
`2` = "female"),
age_category_labels=recode(age_category,
`1` = "18-34",
`2` = "35-54",
`3` = "55+"),
educ_category_labels=recode(educ_category,
`1` = "no highschool",
`2` = "highschool",
`3` = "college",
`4` = "bachelor or higher"),
income_category_labels=recode(income_category,
`1` = "less 50000",
`2` = "50000-99999",
`3` = "more 100000"),
race=recode(race,
`1` = 'white non hisp',
`2` = 'black non hisp',
`3` = 'other non hisp',
`4` = 'hisp'),
party_x_ideo=recode(party_x_ideo,
`-2` = "no interest",
`-1` = "refused",
`1` = "liberal democrat",
`2` = "moderate democrate",
`3` = "independent",
`4` = "moderate republican",
`5` = "conservative republican"),
region4 = recode(region4,
`1` = "northeast",
`2` = "midwest",
`3` = "south",
`4` = "west"),
employment = recode(employment,
`1` = "Working/as a paid employee",
`2` = "Working/selfemploye",
`3` = "Not working/temporary",
`4` = "Not working/looking",
`5` = "Not working/retired",
`6` = "Not working/disabled",
`7` = "Not working/other")) %>%
separate(col = employment,
into = c('working', 'working_type'),
sep = '/') %>%
filter(working == "Not working") %>%
group_by(working_type) %>%
summarise(count=n(),
min_worry=min(worry_cont),
quantile25_worry=quantile(worry_cont, .25),
mean_worry=mean(worry_cont),
quantile75_worry=quantile(worry_cont, .75),
max_worry=max(worry_cont),
min_happening=min(happening_cont),
quantile25_happening=quantile(happening_cont, .25),
mean_happening=mean(happening_cont),
quantile75_happening=quantile(happening_cont, .75),
max_happening=max(happening_cont))
print(survey_data)
## # A tibble: 5 x 12
## working_type count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happe…
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 disabled 1259 -0.0784 1.85 2.65 3.47 5.90 -0.437 2.07
## 2 looking 1112 -0.487 1.84 2.59 3.38 5.24 -0.560 2.04
## 3 other 1370 -0.357 1.64 2.45 3.23 5.37 -0.388 1.77
## 4 retired 4099 -0.646 1.66 2.45 3.23 5.70 -0.472 1.91
## 5 temporary 156 -0.347 2.03 2.68 3.29 5.01 0.325 1.85
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>
In basic R, we have seen how we can combine separate dataframes using the cbind and rbind functions. The corresponding functions in tidyverse are bind_cols and bind_rows:
# Create fake data:
N = 15
mean_score_students = runif(N, 5, 10)
fake_data_first_batch = tibble( # same as data.frame but in tidyverse!
student = 1:N,
age = round(rnorm(N, 30, 5)),
score_wpa1 = mean_score_students,
score_wpa2 = mean_score_students*0.9 + rnorm(N, 0, 0.1),
score_wpa3 = mean_score_students*0.5 + rnorm(N, 0, 1),
score_wpa4 = mean_score_students*0.7 + rnorm(N, 0, 0.2)
)
print(fake_data_first_batch)
## # A tibble: 15 x 6
## student age score_wpa1 score_wpa2 score_wpa3 score_wpa4
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 28 9.84 8.84 5.99 6.79
## 2 2 36 6.27 5.65 3.05 4.58
## 3 3 33 7.55 6.81 2.48 5.15
## 4 4 27 5.48 5.05 2.19 4.03
## 5 5 28 8.41 7.68 3.47 5.58
## 6 6 37 8.70 7.88 6.00 5.92
## 7 7 30 8.52 7.69 3.25 5.82
## 8 8 27 9.58 8.48 3.78 6.55
## 9 9 26 6.49 5.94 1.86 4.29
## 10 10 27 5.68 5.04 2.72 3.86
## 11 11 32 8.73 7.89 4.80 6.38
## 12 12 26 8.64 7.87 4.38 5.89
## 13 13 26 9.80 8.97 4.26 6.95
## 14 14 31 9.06 8.08 5.82 6.41
## 15 15 31 7.09 6.20 3.52 4.87
mean_score_students = runif(N, 5, 10)
fake_data_second_batch = tibble( # same as data.frame but in tidyverse!
student = (N+1):(2*N),
age = round(rnorm(N, 30, 5)),
score_wpa1 = mean_score_students,
score_wpa2 = mean_score_students*0.9 + rnorm(N, 0, 0.1),
score_wpa3 = mean_score_students*0.5 + rnorm(N, 0, 1)
)
print(fake_data_second_batch)
## # A tibble: 15 x 5
## student age score_wpa1 score_wpa2 score_wpa3
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 16 25 7.92 7.23 2.59
## 2 17 31 9.58 8.61 5.86
## 3 18 35 9.82 8.95 3.48
## 4 19 30 5.28 4.63 3.18
## 5 20 26 7.53 6.90 3.88
## 6 21 35 7.41 6.79 2.47
## 7 22 26 7.05 6.40 3.83
## 8 23 33 9.21 8.25 3.82
## 9 24 32 6.66 5.92 3.02
## 10 25 30 6.66 5.96 2.20
## 11 26 30 9.77 8.57 3.71
## 12 27 38 7.38 6.60 3.33
## 13 28 34 6.80 6.07 2.96
## 14 29 28 9.50 8.49 5.21
## 15 30 22 6.53 5.75 4.40
fake_data = bind_rows(fake_data_first_batch, fake_data_second_batch)
fake_data
## # A tibble: 30 x 6
## student age score_wpa1 score_wpa2 score_wpa3 score_wpa4
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 28 9.84 8.84 5.99 6.79
## 2 2 36 6.27 5.65 3.05 4.58
## 3 3 33 7.55 6.81 2.48 5.15
## 4 4 27 5.48 5.05 2.19 4.03
## 5 5 28 8.41 7.68 3.47 5.58
## 6 6 37 8.70 7.88 6.00 5.92
## 7 7 30 8.52 7.69 3.25 5.82
## 8 8 27 9.58 8.48 3.78 6.55
## 9 9 26 6.49 5.94 1.86 4.29
## 10 10 27 5.68 5.04 2.72 3.86
## # … with 20 more rows
# new variables
new_info = tibble(
student = 1:(2*N),
score_wpa5 = mean_score_students*0.5 + rnorm(2*N, 0, 1),
gender = rbinom(2*N, 1, .5)
)
print(new_info)
## # A tibble: 30 x 3
## student score_wpa5 gender
## <int> <dbl> <int>
## 1 1 3.84 1
## 2 2 6.37 1
## 3 3 3.48 1
## 4 4 4.24 0
## 5 5 4.55 1
## 6 6 5.11 1
## 7 7 2.82 0
## 8 8 5.95 1
## 9 9 2.25 1
## 10 10 1.75 0
## # … with 20 more rows
fake_data = bind_cols(fake_data,
new_info)
## New names:
## * student -> student...1
## * student -> student...7
print(fake_data)
## # A tibble: 30 x 9
## student...1 age score_wpa1 score_wpa2 score_wpa3 score_wpa4 student...7 score_wpa5 gender
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int>
## 1 1 28 9.84 8.84 5.99 6.79 1 3.84 1
## 2 2 36 6.27 5.65 3.05 4.58 2 6.37 1
## 3 3 33 7.55 6.81 2.48 5.15 3 3.48 1
## 4 4 27 5.48 5.05 2.19 4.03 4 4.24 0
## 5 5 28 8.41 7.68 3.47 5.58 5 4.55 1
## 6 6 37 8.70 7.88 6.00 5.92 6 5.11 1
## 7 7 30 8.52 7.69 3.25 5.82 7 2.82 0
## 8 8 27 9.58 8.48 3.78 6.55 8 5.95 1
## 9 9 26 6.49 5.94 1.86 4.29 9 2.25 1
## 10 10 27 5.68 5.04 2.72 3.86 10 1.75 0
## # … with 20 more rows
However, a better way to join dataframes is to use the join functions, as these allow us to join dataframes based on a specific ID. This is very important when there is repeated and/or missing information:
N = 15
mean_score_students = runif(N, 5, 10)
first_batch = tibble(
student = 1:N,
age = round(rnorm(N, 30, 5)),
score_wpa1 = mean_score_students,
score_wpa2 = mean_score_students*0.9 + rnorm(N, 0, 0.1),
score_wpa3 = mean_score_students*0.5 + rnorm(N, 0, 1),
score_wpa4 = mean_score_students*0.7 + rnorm(N, 0, 0.2)
)
print(first_batch)
## # A tibble: 15 x 6
## student age score_wpa1 score_wpa2 score_wpa3 score_wpa4
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 26 8.91 7.98 4.33 6.05
## 2 2 32 9.50 8.60 5.62 6.91
## 3 3 40 8.22 7.38 2.69 5.97
## 4 4 38 8.52 7.55 3.60 5.61
## 5 5 37 7.38 6.57 4.90 5.08
## 6 6 25 8.22 7.32 3.42 5.57
## 7 7 30 8.58 7.63 3.11 5.95
## 8 8 36 9.23 8.46 4.29 6.34
## 9 9 29 7.52 6.81 4.50 5.46
## 10 10 27 8.21 7.42 3.01 6.06
## 11 11 32 6.47 5.79 2.71 4.58
## 12 12 26 9.93 8.97 4.55 7.04
## 13 13 24 5.08 4.65 3.27 3.02
## 14 14 23 5.77 5.34 4.19 3.95
## 15 15 30 8.84 7.95 4.42 5.77
M = 5
mean_score_students = runif(M, 8, 10)
second_batch = tibble(
student = (N+1):(N+M),
age = round(rnorm(M, 30, 5)),
score_wpa1 = mean_score_students,
score_wpa2 = mean_score_students*0.9 + rnorm(M, 0, 0.1),
score_wpa5 = mean_score_students*0.5 + rnorm(M, 0, 1)
)
print(second_batch)
## # A tibble: 5 x 5
## student age score_wpa1 score_wpa2 score_wpa5
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 16 34 9.70 8.79 4.40
## 2 17 29 9.24 8.29 4.66
## 3 18 36 9.06 8.18 4.48
## 4 19 30 8.66 8.00 3.09
## 5 20 24 9.42 8.48 5.38
new_info = tibble(
student = 1:(N+M),
score_wpa6 = runif(N+M, 6, 10)*0.5 + rnorm(N+M, 0, 1),
gender = rbinom(N+M, 1, .5)
)
print(new_info)
## # A tibble: 20 x 3
## student score_wpa6 gender
## <int> <dbl> <int>
## 1 1 2.24 0
## 2 2 3.31 1
## 3 3 4.20 0
## 4 4 2.36 0
## 5 5 4.81 0
## 6 6 3.11 0
## 7 7 1.84 0
## 8 8 2.92 1
## 9 9 2.33 1
## 10 10 3.69 1
## 11 11 4.91 1
## 12 12 3.35 1
## 13 13 5.14 0
## 14 14 4.38 0
## 15 15 3.83 0
## 16 16 2.87 0
## 17 17 1.85 1
## 18 18 4.10 0
## 19 19 5.16 1
## 20 20 3.97 1
full_join(first_batch, second_batch, by='student')
## # A tibble: 20 x 10
## student age.x score_wpa1.x score_wpa2.x score_wpa3 score_wpa4 age.y score_wpa1.y score_wpa2.y score_wpa5
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 26 8.91 7.98 4.33 6.05 NA NA NA NA
## 2 2 32 9.50 8.60 5.62 6.91 NA NA NA NA
## 3 3 40 8.22 7.38 2.69 5.97 NA NA NA NA
## 4 4 38 8.52 7.55 3.60 5.61 NA NA NA NA
## 5 5 37 7.38 6.57 4.90 5.08 NA NA NA NA
## 6 6 25 8.22 7.32 3.42 5.57 NA NA NA NA
## 7 7 30 8.58 7.63 3.11 5.95 NA NA NA NA
## 8 8 36 9.23 8.46 4.29 6.34 NA NA NA NA
## 9 9 29 7.52 6.81 4.50 5.46 NA NA NA NA
## 10 10 27 8.21 7.42 3.01 6.06 NA NA NA NA
## 11 11 32 6.47 5.79 2.71 4.58 NA NA NA NA
## 12 12 26 9.93 8.97 4.55 7.04 NA NA NA NA
## 13 13 24 5.08 4.65 3.27 3.02 NA NA NA NA
## 14 14 23 5.77 5.34 4.19 3.95 NA NA NA NA
## 15 15 30 8.84 7.95 4.42 5.77 NA NA NA NA
## 16 16 NA NA NA NA NA 34 9.70 8.79 4.40
## 17 17 NA NA NA NA NA 29 9.24 8.29 4.66
## 18 18 NA NA NA NA NA 36 9.06 8.18 4.48
## 19 19 NA NA NA NA NA 30 8.66 8.00 3.09
## 20 20 NA NA NA NA NA 24 9.42 8.48 5.38
full_join(first_batch, second_batch, by=c('student', 'age', "score_wpa1", "score_wpa2"))
## # A tibble: 20 x 7
## student age score_wpa1 score_wpa2 score_wpa3 score_wpa4 score_wpa5
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 26 8.91 7.98 4.33 6.05 NA
## 2 2 32 9.50 8.60 5.62 6.91 NA
## 3 3 40 8.22 7.38 2.69 5.97 NA
## 4 4 38 8.52 7.55 3.60 5.61 NA
## 5 5 37 7.38 6.57 4.90 5.08 NA
## 6 6 25 8.22 7.32 3.42 5.57 NA
## 7 7 30 8.58 7.63 3.11 5.95 NA
## 8 8 36 9.23 8.46 4.29 6.34 NA
## 9 9 29 7.52 6.81 4.50 5.46 NA
## 10 10 27 8.21 7.42 3.01 6.06 NA
## 11 11 32 6.47 5.79 2.71 4.58 NA
## 12 12 26 9.93 8.97 4.55 7.04 NA
## 13 13 24 5.08 4.65 3.27 3.02 NA
## 14 14 23 5.77 5.34 4.19 3.95 NA
## 15 15 30 8.84 7.95 4.42 5.77 NA
## 16 16 34 9.70 8.79 NA NA 4.40
## 17 17 29 9.24 8.29 NA NA 4.66
## 18 18 36 9.06 8.18 NA NA 4.48
## 19 19 30 8.66 8.00 NA NA 3.09
## 20 20 24 9.42 8.48 NA NA 5.38
left_join(first_batch, new_info, by=c('student'))
## # A tibble: 15 x 8
## student age score_wpa1 score_wpa2 score_wpa3 score_wpa4 score_wpa6 gender
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 26 8.91 7.98 4.33 6.05 2.24 0
## 2 2 32 9.50 8.60 5.62 6.91 3.31 1
## 3 3 40 8.22 7.38 2.69 5.97 4.20 0
## 4 4 38 8.52 7.55 3.60 5.61 2.36 0
## 5 5 37 7.38 6.57 4.90 5.08 4.81 0
## 6 6 25 8.22 7.32 3.42 5.57 3.11 0
## 7 7 30 8.58 7.63 3.11 5.95 1.84 0
## 8 8 36 9.23 8.46 4.29 6.34 2.92 1
## 9 9 29 7.52 6.81 4.50 5.46 2.33 1
## 10 10 27 8.21 7.42 3.01 6.06 3.69 1
## 11 11 32 6.47 5.79 2.71 4.58 4.91 1
## 12 12 26 9.93 8.97 4.55 7.04 3.35 1
## 13 13 24 5.08 4.65 3.27 3.02 5.14 0
## 14 14 23 5.77 5.34 4.19 3.95 4.38 0
## 15 15 30 8.84 7.95 4.42 5.77 3.83 0
right_join(new_info, second_batch, by=c('student'))
## # A tibble: 5 x 7
## student score_wpa6 gender age score_wpa1 score_wpa2 score_wpa5
## <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 16 2.87 0 34 9.70 8.79 4.40
## 2 17 1.85 1 29 9.24 8.29 4.66
## 3 18 4.10 0 36 9.06 8.18 4.48
## 4 19 5.16 1 30 8.66 8.00 3.09
## 5 20 3.97 1 24 9.42 8.48 5.38
Now you will analyze data from Matthews et al. (2016): Why do we overestimate others’ willingness to pay? The purpose of this research was to test whether our beliefs about other people’s affluence (i.e., wealth) affect how much we think they will be willing to pay for items. You can find the full paper at http://journal.sjdm.org/15/15909/jdm15909.pdf.
Variable descriptions:
Here are the descriptions of the data variables (taken from the author’s dataset notes available at http://journal.sjdm.org/15/15909/Notes.txt):
- id: participant id code
- gender: participant gender. 1 = male, 2 = female
- age: participant age
- income: participant annual household income on a categorical scale with 8 options: Less than 5,000; 15,001–25,000; 25,001–35,000; 35,001–50,000; 50,001–75,000; 75,001–100,000; 100,001–150,000; greater than 150,000
- p1-p10: whether the “typical” survey respondent would pay more (coded 1) or less (coded 0) than oneself, for each of the 10 products
- task: whether the participant had to judge the proportion of other people who “have more money than you do” (coded 1) or the proportion who “have less money than you do” (coded 0)
- havemore: participant’s response when task = 1
- haveless: participant’s response when task = 0
- pcmore: participant’s estimate of the proportion of people who have more than they do (calculated as 100-haveless when task = 0)
Task A
First, download the data_wpa4.csv and matthews_demographics.csv datasets from the data folder on Github and load them in R.
Note: do not use pipes in tasks 1 to 4.
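As a minimal sketch of this loading step (assuming the files are saved in your data folder; the object names are only suggestions):
matthews_data = read_csv("data/data_wpa4.csv")
demographics = read_csv("data/matthews_demographics.csv")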
1. Currently gender is coded as 1 and 2. Create a new dataframe called new_matthews_data, in which there is a new column called gender_labels that codes gender as “male” and “female”. Do it using mutate. Then, rename the original gender column to gender_binary using rename. Subtract 1 from all values of gender_binary, so that it is coded as 0 and 1 instead of 1 and 2, using mutate again.
2. In new_matthews_data, create a new column called income_labels that codes income based on the data description above, using mutate. Then, create a new column, called income_recoded, where you only have 4 income categories (coded as numbers from 1 to 4): below 25,000; 25,000-50,000; 50,000-100,000; and above 100,000, using case_when. How many observations are there for each of these 4 categories? Use summarise to reply.
3. In new_matthews_data, transform all numeric columns into integers using mutate_if.
4. From new_matthews_data, create a summary of the dataset using summarise, to answer the following questions: What percent of participants were female? What was the minimum, mean, and maximum income? What was the 25th percentile, median, and 75th percentile of age? Use good names for the columns.
5. Repeat steps 1 to 4 (apart from the summarise in point 2) using pipes and assign the result to new_matthews_data_summary.
Task B
1. From new_matthews_data, calculate the mean of p1 to p10 across participants using summarise_all and select. Which product scored the highest? Do it again, grouping the data by gender. Is there a difference across genders? What is the mean of the mean p1 to p10 across participants? Calculate it on the result of the previous step. You can do these either using pipes or not.
2. Transform the data from wide to long format. In particular, you want 10 rows per subject, with their responses on products 1 to 10 in a column called wtp, and the product label in a column called product. Call the resulting dataframe new_matthews_data_long. Re-order it by id. Print the first 20 cases to check this worked. Check that new_matthews_data_long has 10 times more rows than new_matthews_data.
# these are the solutions for B2, as we will cover wide to long data transformation later in the seminar
new_matthews_data_long = gather(new_matthews_data,
key='product',
value='wtp',
p1:p10)
new_matthews_data_long = arrange(new_matthews_data_long, id)
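A quick sketch of the checks described in B2 (run after creating new_matthews_data_long):
# print the first 20 cases
print(head(new_matthews_data_long, 20))

# the long data should have 10 times as many rows as the wide data
nrow(new_matthews_data_long) == 10 * nrow(new_matthews_data)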
Task C
1. Drop the X1 column in demographics using select.
2. Join new_matthews_data_long and demographics based on the id, in order to retain as many rows and columns as possible. Call the resulting dataframe matthews_data_all.
3. Calculate the mean wtp per subject using group_by. You can use pipes or not. Call the resulting dataframe mean_matthews_data_all. This should have as many rows as the number of subjects and 2 columns (id and mean wtp). Add height and race as a third and fourth column using one of the join functions.
4. Using mean_matthews_data_all, make a barplot showing the mean wtp across ethnic groups. Plot confidence intervals. Give appropriate labels to the plot. Do you think there is a difference in willingness to pay across groups?
# these are the solutions for C4, as we will cover plotting later in the seminar
ggplot(data = mean_matthews_data_all, mapping = aes(x = factor(race), y = mean_wtp)) +
stat_summary(fun = "mean", geom="bar") +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar", size=1, width=.4) +
labs(x = 'Race', y = 'Mean WTP') +
theme(axis.text.x = element_text(angle = 90))
5. Using mean_matthews_data_all, make a scatterplot showing the wtp on the y-axis and the height on the x-axis. Add a regression line. Do you think height predicts willingness to pay?
# these are the solutions for C5, as we will cover plotting later in the seminar
ggplot(data = mean_matthews_data_all, mapping = aes(x = height, y = mean_wtp)) +
geom_point(alpha = 0.3, size= 2) +
geom_smooth(method = lm, color='grey') +
labs(x='Height', y='Mean WTP') +
ggtitle("Relationship betweenheight and WTP")
Save and email your script to me at laura.fontanesi@unibas.ch by the end of Friday.