Open RStudio.

Open a new R script and save it as wpa_4_LastFirst.R (where Last and First are your last and first names).

Be careful about capitalization, the order of last and first name, and using _ instead of -.

At the top of your script, write the following (with appropriate changes):

# Assignment: WPA 4
# Name: Laura Fontanesi
# Date: 5 April 2022

Load data in R

Download the dataset ccam_original.sav from the data folder on GitHub and load it in R.

library(tidyverse)
library(haven)

# Load data in R
survey_data_raw = read_spss("data/ccam_original.sav")

1. What is data wrangling?

Data wrangling is the process of transforming data from the raw format in which they were collected into another format, with the intent of making them more appropriate and valuable for exploratory and confirmatory data analyses.

Today’s class is based on specific functions in the tidyverse that serve exactly this purpose.

2. Functions to know and some examples

Virtually all you need to know is in this cheatsheet.

The functions you will use most frequently are the ones covered below: glimpse, arrange, select, filter, slice, mutate, summarise, group_by, separate, unite, and the join functions.

Look here for other kinds of mutate.

Look here for other kinds of summarise.
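
For example, here is a minimal sketch of two of these variants on a small made-up tibble (example_data is hypothetical, not part of the survey data):

example_data = tibble(id = 1:3, score = c(2.4, 3.1, 5.8), label = c("a", "b", "c"))

# mutate_if: apply a function to every column that satisfies a predicate
mutate_if(example_data, is.numeric, round)

# summarise_if: compute a summary for every column that satisfies a predicate
summarise_if(example_data, is.numeric, mean)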

Ways to “see” the data: glimpse, arrange, head, and tail

The first thing you want to do when importing data is to inspect them visually, to get to know the information they contain. You can always click on the data in the Environment tab, but there are also other useful functions that allow us to see the data types (glimpse), to sort the data based on the values of one or more columns (arrange), and to visualize the first or the last n rows (head and tail, respectively).

glimpse(survey_data_raw)
## Rows: 20,024
## Columns: 54
## $ case_ID                     <dbl> 2, 3, 5, 6, 7, 8, 9, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26…
## $ wave                        <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ year                        <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ weight_wave                 <dbl> 0.54, 0.85, 0.49, 0.29, 1.29, 2.56, 0.23, 0.82, 0.38, 0.35, 0.91, 1.01, 0.82, 0…
## $ weight_aggregate            <dbl> 0.29392628, 0.46266174, 0.26671088, 0.15784930, 0.70215723, 1.39342829, 0.12519…
## $ happening                   <dbl+lbl> 3, 2, 2, 3, 3, 2, 3, 1, 1, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 1, 3, 3, 3, …
## $ cause_original              <dbl+lbl> 1, 1, 2, 2, 1, 2, 1, 1, 4, 2, 1, 1, 1, 1, 3, 2, 1, 1, 1, 3, 1, 2, 2, 2, 1, …
## $ cause_other_text            <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "both of the above", ""…
## $ cause_recoded               <dbl+lbl> 6, 6, 4, 4, 6, 4, 6, 6, 3, 4, 6, 6, 6, 6, 5, 4, 6, 6, 6, 5, 6, 4, 4, 4, 6, …
## $ sci_consensus               <dbl+lbl> 4, 1, 2, 4, 2, 2, 2, 3, 2, 4, 1, 2, 4, 4, 4, 2, 2, 1, 4, 4, 4, 2, 2, 2, 4, …
## $ worry                       <dbl+lbl> 3, 2, 1, 3, 3, 2, 3, 2, 1, 3, 2, 3, 3, 3, 4, 2, 4, 2, 4, 1, 3, 1, 2, 2, 2, …
## $ harm_personally             <dbl+lbl> 2, 2, 1, 2, 0, 0, 2, 3, 1, 0, 0, 1, 3, 2, 4, 2, 3, 3, 3, 1, 3, 1, 2, 2, 1, …
## $ harm_US                     <dbl+lbl>  3, -1,  1,  2,  0,  0,  3,  3,  1,  0,  0,  0,  3,  2,  4,  2,  3,  3,  3,…
## $ harm_dev_countries          <dbl+lbl> 4, 2, 1, 3, 0, 0, 4, 3, 1, 0, 0, 0, 4, 2, 4, 2, 4, 0, 3, 1, 4, 1, 2, 2, 3, …
## $ harm_future_gen             <dbl+lbl>  4,  3,  1,  3,  0,  0,  4,  3,  1,  4,  0,  0,  4,  3,  4,  3,  4,  0,  4,…
## $ harm_plants_animals         <dbl+lbl> 4, 3, 1, 3, 3, 0, 4, 3, 1, 4, 0, 3, 4, 2, 4, 3, 4, 3, 4, 1, 4, 1, 2, 3, 3, …
## $ when_harm_US                <dbl+lbl> 5, 3, 1, 4, 2, 2, 5, 4, 1, 3, 3, 3, 5, 4, 5, 2, 5, 6, 6, 1, 6, 1, 2, 2, 4, …
## $ reg_CO2_pollutant           <dbl+lbl>  4,  3,  2,  3,  3,  1,  3,  2,  2,  3,  3,  4,  4,  3,  3,  2,  3,  4,  4,…
## $ reg_utilities               <dbl+lbl>  4,  3,  1,  4,  1,  1,  2,  2,  2,  4,  3,  1,  4,  3,  2,  2,  3,  1,  4,…
## $ fund_research               <dbl+lbl>  4,  3,  1,  4,  4,  3,  4,  2,  3,  4,  3,  1,  3,  3,  4,  3,  4,  3,  4,…
## $ reg_coal_emissions          <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ discuss_GW                  <dbl+lbl> 3, 2, 1, 2, 1, 2, 3, 2, 3, 3, 2, 2, 2, 1, 3, 2, 2, 1, 3, 1, 4, 1, 2, 2, 2, …
## $ hear_GW_media               <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ gender                      <dbl+lbl> 2, 1, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 2, 2, …
## $ age                         <dbl> 78, 45, 54, 71, 26, 29, 29, 34, 46, 60, 65, 67, 68, 60, 63, 32, 44, 76, 38, 45,…
## $ age_category                <dbl+lbl> 3, 2, 2, 3, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 1, 2, 3, 2, 2, 2, 3, 2, 1, 3, …
## $ generation                  <dbl+lbl> 5, 4, 4, 5, 2, 3, 3, 3, 4, 4, 5, 5, 5, 4, 5, 3, 4, 5, 3, 4, 3, 5, 3, 3, 5, …
## $ educ                        <dbl+lbl>  9,  6, 14, 13, 10, 12, 12, 12, 12, 12,  9, 14,  9, 10,  9, 13, 12,  9, 12,…
## $ educ_category               <dbl+lbl> 2, 1, 4, 4, 3, 4, 4, 4, 4, 4, 2, 4, 2, 3, 2, 4, 4, 2, 4, 3, 4, 2, 4, 2, 4, …
## $ income                      <dbl+lbl> 12,  9,  9, 16, 13, 14,  3, 12, 16, 17,  8,  7, 16,  1,  7, 16, 16,  7,  8,…
## $ income_category             <dbl+lbl> 2, 1, 1, 3, 2, 2, 1, 2, 3, 3, 1, 1, 3, 1, 1, 3, 3, 1, 1, 2, 3, 1, 2, 2, 1, …
## $ race                        <dbl+lbl> 1, 1, 4, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ ideology                    <dbl+lbl> 4, 3, 4, 4, 4, 5, 4, 2, 4, 5, 3, 4, 3, 3, 1, 4, 2, 3, 2, 4, 2, 5, 5, 3, 1, …
## $ party                       <dbl+lbl> 1, 5, 1, 3, 1, 3, 1, 3, 3, 1, 2, 1, 3, 2, 2, 1, 2, 2, 3, 5, 2, 1, 1, 3, 2, …
## $ party_w_leaners             <dbl+lbl> 1, 4, 1, 1, 1, 1, 1, 3, 3, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 4, 2, 1, 1, 1, 2, …
## $ party_x_ideo                <dbl+lbl>  5, -2,  5,  5,  5,  5,  5,  3,  3,  5,  2,  5,  2,  2,  1,  5,  1,  2,  1,…
## $ registered_voter            <dbl+lbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, …
## $ region9                     <dbl+lbl> 5, 3, 8, 5, 6, 9, 8, 8, 2, 7, 5, 7, 2, 4, 7, 7, 4, 2, 1, 5, 8, 5, 3, 7, 5, …
## $ region4                     <dbl+lbl> 3, 2, 4, 3, 3, 4, 4, 4, 1, 3, 3, 3, 1, 2, 3, 3, 2, 1, 1, 3, 4, 3, 2, 3, 3, …
## $ religion                    <dbl+lbl>  2,  2,  4,  2,  1, 11,  4, 15,  2,  1,  2,  2,  5,  3,  1, 11,  2,  2, 15,…
## $ religion_other_nonchristian <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ evangelical                 <dbl+lbl> 2, 3, 2, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 3, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, …
## $ service_attendance          <dbl+lbl> 5, 2, 5, 2, 5, 5, 5, 1, 5, 4, 3, 2, 2, 1, 5, 5, 5, 3, 1, 5, 3, 6, 5, 2, 2, …
## $ marit_status                <dbl+lbl> 2, 5, 1, 1, 1, 1, 1, 3, 1, 1, 2, 3, 1, 3, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 3, …
## $ employment                  <dbl+lbl> 5, 6, 4, 5, 1, 2, 1, 1, 2, 7, 1, 5, 5, 5, 1, 7, 1, 5, 1, 1, 1, 5, 1, 1, 5, …
## $ house_head                  <dbl+lbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ house_size                  <dbl> 3, 2, 2, 2, 2, 3, 2, 2, 2, 4, 1, 1, 2, 1, 2, 3, 4, 2, 1, 5, 3, 3, 3, 4, 1, 1, 2…
## $ house_ages0to1              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ house_ages2to5              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ house_ages6to12             <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 1, 0, 1, 2, 0, 0, 0…
## $ house_ages13to17            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ house_ages18plus            <dbl> 3, 2, 2, 2, 2, 2, 2, 2, 2, 4, 1, 1, 2, 1, 2, 2, 2, 2, 1, 3, 2, 3, 2, 2, 1, 1, 2…
## $ house_type                  <dbl+lbl> 1, 4, 1, 1, 1, 2, 3, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3, 3, 2, 1, 1, 1, 1, 3, …
## $ house_own                   <dbl+lbl> 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, …
# sort by one variable, and only show the first 6 values
head(arrange(survey_data_raw, wave), 6)
## # A tibble: 6 x 54
##   case_ID         wave      year weight_wave weight_aggregate happening cause_original cause_other_text cause_recoded
##     <dbl>    <dbl+lbl> <dbl+lbl>       <dbl>            <dbl> <dbl+lbl>      <dbl+lbl> <chr>                <dbl+lbl>
## 1       2 1 [Nov 2008]  1 [2008]        0.54            0.294 3 [Yes]   1 [Caused mos… ""               6 [Caused mo…
## 2       3 1 [Nov 2008]  1 [2008]        0.85            0.463 2 [Don't… 1 [Caused mos… ""               6 [Caused mo…
## 3       5 1 [Nov 2008]  1 [2008]        0.49            0.267 2 [Don't… 2 [Caused mos… ""               4 [Caused mo…
## 4       6 1 [Nov 2008]  1 [2008]        0.29            0.158 3 [Yes]   2 [Caused mos… ""               4 [Caused mo…
## 5       7 1 [Nov 2008]  1 [2008]        1.29            0.702 3 [Yes]   1 [Caused mos… ""               6 [Caused mo…
## 6       8 1 [Nov 2008]  1 [2008]        2.56            1.39  2 [Don't… 2 [Caused mos… ""               4 [Caused mo…
## # … with 45 more variables: sci_consensus <dbl+lbl>, worry <dbl+lbl>, harm_personally <dbl+lbl>, harm_US <dbl+lbl>,
## #   harm_dev_countries <dbl+lbl>, harm_future_gen <dbl+lbl>, harm_plants_animals <dbl+lbl>, when_harm_US <dbl+lbl>,
## #   reg_CO2_pollutant <dbl+lbl>, reg_utilities <dbl+lbl>, fund_research <dbl+lbl>, reg_coal_emissions <dbl+lbl>,
## #   discuss_GW <dbl+lbl>, hear_GW_media <dbl+lbl>, gender <dbl+lbl>, age <dbl>, age_category <dbl+lbl>,
## #   generation <dbl+lbl>, educ <dbl+lbl>, educ_category <dbl+lbl>, income <dbl+lbl>, income_category <dbl+lbl>,
## #   race <dbl+lbl>, ideology <dbl+lbl>, party <dbl+lbl>, party_w_leaners <dbl+lbl>, party_x_ideo <dbl+lbl>,
## #   registered_voter <dbl+lbl>, region9 <dbl+lbl>, region4 <dbl+lbl>, religion <dbl+lbl>, …
# sort by 2 variables (one in opposite order), and only show the last 10 values
tail(arrange(survey_data_raw, wave, desc(year)), 10)
## # A tibble: 10 x 54
##    case_ID          wave      year weight_wave weight_aggregate happening cause_original cause_other_text cause_recoded
##      <dbl>     <dbl+lbl> <dbl+lbl>       <dbl>            <dbl> <dbl+lbl>      <dbl+lbl> <chr>                <dbl+lbl>
##  1   35442 17 [Oct 2017]  9 [2017]       1.29             1.17  3 [Yes]   1 [Caused mos… ""               6 [Caused mo…
##  2   35443 17 [Oct 2017]  9 [2017]       0.502            0.454 3 [Yes]   1 [Caused mos… ""               6 [Caused mo…
##  3   35445 17 [Oct 2017]  9 [2017]       0.879            0.794 3 [Yes]   2 [Caused mos… ""               4 [Caused mo…
##  4   35446 17 [Oct 2017]  9 [2017]       0.759            0.686 3 [Yes]   2 [Caused mos… ""               4 [Caused mo…
##  5   35448 17 [Oct 2017]  9 [2017]       1.39             1.25  3 [Yes]   1 [Caused mos… ""               6 [Caused mo…
##  6   35450 17 [Oct 2017]  9 [2017]       1.35             1.22  3 [Yes]   2 [Caused mos… ""               4 [Caused mo…
##  7   35451 17 [Oct 2017]  9 [2017]       0.911            0.823 3 [Yes]   3 [Other (Ple… "combination of… 5 [Caused by…
##  8   35452 17 [Oct 2017]  9 [2017]       1.04             0.936 3 [Yes]   1 [Caused mos… ""               6 [Caused mo…
##  9   35454 17 [Oct 2017]  9 [2017]       0.606            0.547 2 [Don't… 2 [Caused mos… ""               4 [Caused mo…
## 10   35455 17 [Oct 2017]  9 [2017]       1.18             1.06  3 [Yes]   1 [Caused mos… ""               6 [Caused mo…
## # … with 45 more variables: sci_consensus <dbl+lbl>, worry <dbl+lbl>, harm_personally <dbl+lbl>, harm_US <dbl+lbl>,
## #   harm_dev_countries <dbl+lbl>, harm_future_gen <dbl+lbl>, harm_plants_animals <dbl+lbl>, when_harm_US <dbl+lbl>,
## #   reg_CO2_pollutant <dbl+lbl>, reg_utilities <dbl+lbl>, fund_research <dbl+lbl>, reg_coal_emissions <dbl+lbl>,
## #   discuss_GW <dbl+lbl>, hear_GW_media <dbl+lbl>, gender <dbl+lbl>, age <dbl>, age_category <dbl+lbl>,
## #   generation <dbl+lbl>, educ <dbl+lbl>, educ_category <dbl+lbl>, income <dbl+lbl>, income_category <dbl+lbl>,
## #   race <dbl+lbl>, ideology <dbl+lbl>, party <dbl+lbl>, party_w_leaners <dbl+lbl>, party_x_ideo <dbl+lbl>,
## #   registered_voter <dbl+lbl>, region9 <dbl+lbl>, region4 <dbl+lbl>, religion <dbl+lbl>, …

Select or drop specific columns and rows (indexing)

In basic R, we saw that we can select specific columns and rows using indexing. In tidyverse, we have select (for columns based on column labels, similar to indexing with character vectors or with the $ operator), filter (for rows based on values, similar to indexing with logical vectors) and slice (for rows based on row numbers, similar to indexing with a numerical vector).

interesting_columns = c('wave', 'year', 'happening', 'cause_recoded', 'sci_consensus', 'worry', 'harm_personally', 
                        'harm_US', 'harm_dev_countries', 'harm_future_gen', 'harm_plants_animals', 'when_harm_US',
                        'reg_CO2_pollutant', 'reg_utilities', 'fund_research', 'discuss_GW',
                        'gender', 'age_category', 'educ_category', 'income_category', 'race', 'party_x_ideo', 'region4', 'employment')

columns_to_drop = c('case_ID', 'weight_wave', 'weight_aggregate', 'cause_original', 'cause_other_text',
                    'reg_coal_emissions', 'hear_GW_media', 'age', 'generation', 'educ', 'income', 'ideology',
                    'party', 'party_w_leaners', 'registered_voter', 'region9', 'religion', 'religion_other_nonchristian',
                    'evangelical', 'service_attendance', 'marit_status', 'house_head', 'house_size',
                    'house_ages0to1', 'house_ages2to5', 'house_ages6to12', 'house_ages13to17', 'house_ages18plus',
                    'house_type', 'house_own')

# use all_of to include columns
survey_data = select(survey_data_raw, all_of(interesting_columns))

# use -all_of to exclude columns
survey_data = select(survey_data_raw, -all_of(columns_to_drop))
# equivalent to survey_data[200:210,]
slice(survey_data, 200:210)
## # A tibble: 11 x 24
##            wave      year      happening cause_recoded sci_consensus   worry harm_personally harm_US harm_dev_countr…
##       <dbl+lbl> <dbl+lbl>      <dbl+lbl>     <dbl+lbl>     <dbl+lbl> <dbl+l>       <dbl+lbl> <dbl+l>        <dbl+lbl>
##  1 1 [Nov 2008]  1 [2008] 3 [Yes]        5 [Caused by… 2 [There is … 1 [Not… 1 [Not at all]  3 [A m… 3 [A moderate a…
##  2 1 [Nov 2008]  1 [2008] 2 [Don't know] 4 [Caused mo… 2 [There is … 3 [Som… 0 [Don't know]  0 [Don… 0 [Don't know]  
##  3 1 [Nov 2008]  1 [2008] 1 [No]         4 [Caused mo… 2 [There is … 1 [Not… 1 [Not at all]  1 [Not… 1 [Not at all]  
##  4 1 [Nov 2008]  1 [2008] 2 [Don't know] 4 [Caused mo… 2 [There is … 2 [Not… 1 [Not at all]  1 [Not… 1 [Not at all]  
##  5 1 [Nov 2008]  1 [2008] 1 [No]         4 [Caused mo… 2 [There is … 1 [Not… 1 [Not at all]  3 [A m… 1 [Not at all]  
##  6 1 [Nov 2008]  1 [2008] 3 [Yes]        6 [Caused mo… 4 [Most scie… 3 [Som… 3 [A moderate … 3 [A m… 3 [A moderate a…
##  7 1 [Nov 2008]  1 [2008] 3 [Yes]        6 [Caused mo… 4 [Most scie… 4 [Ver… 2 [Only a litt… 3 [A m… 4 [A great deal]
##  8 1 [Nov 2008]  1 [2008] 3 [Yes]        5 [Caused by… 4 [Most scie… 3 [Som… 3 [A moderate … 3 [A m… 3 [A moderate a…
##  9 1 [Nov 2008]  1 [2008] 3 [Yes]        6 [Caused mo… 4 [Most scie… 1 [Not… 1 [Not at all]  1 [Not… 1 [Not at all]  
## 10 1 [Nov 2008]  1 [2008] 3 [Yes]        6 [Caused mo… 4 [Most scie… 3 [Som… 3 [A moderate … 4 [A g… 4 [A great deal]
## 11 1 [Nov 2008]  1 [2008] 1 [No]         4 [Caused mo… 2 [There is … 1 [Not… 1 [Not at all]  1 [Not… 1 [Not at all]  
## # … with 15 more variables: harm_future_gen <dbl+lbl>, harm_plants_animals <dbl+lbl>, when_harm_US <dbl+lbl>,
## #   reg_CO2_pollutant <dbl+lbl>, reg_utilities <dbl+lbl>, fund_research <dbl+lbl>, discuss_GW <dbl+lbl>,
## #   gender <dbl+lbl>, age_category <dbl+lbl>, educ_category <dbl+lbl>, income_category <dbl+lbl>, race <dbl+lbl>,
## #   party_x_ideo <dbl+lbl>, region4 <dbl+lbl>, employment <dbl+lbl>
# eliminate people (i.e., rows) who didn't reply in some variables of interest (an absence of reply was coded as -1 here)
survey_data = filter(survey_data, 
                     happening > 0, 
                     cause_recoded > 0, 
                     sci_consensus > 0, 
                     worry > 0)

Change values or value types (indexing and reassigning)

In basic R, we saw that we can change information in the dataset by indexing specific columns and/or rows and reassigning values to them. In tidyverse, we have specific functions that do this directly: the mutate functions. mutate_all applies one transforming function to all columns, while mutate allows you to make specific changes to single columns.

# convert all columns to integer type
survey_data = mutate_all(survey_data, 
                         as.integer)

Note that when you use an existing column label, the column is replaced with the new values. If you want to simply add a column based on the values of an existing one, use a non-existing label instead.
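
As a minimal sketch of this difference (worry_centered is a hypothetical new label, and the call is not assigned, so survey_data is left unchanged):

# 'worry' is an existing label, so the column is replaced in place;
# 'worry_centered' is a new label, so a new column is added
mutate(survey_data,
       worry = as.numeric(worry),
       worry_centered = worry - mean(worry))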

survey_data = mutate(survey_data, 
                     happening_cont = happening + rnorm(mean=0, sd=.5, n=nrow(survey_data)),
                     worry_cont = worry + rnorm(mean=0, sd=.5, n=nrow(survey_data)),
                     year=recode(year, 
                                 `1` = 2008,
                                 `2` = 2010,
                                 `3` = 2011,
                                 `4` = 2012, 
                                 `5` = 2013,
                                 `6` = 2014,
                                 `7` = 2015,
                                 `8` = 2016,
                                 `9` = 2017),
                     happening_labels=recode(happening,
                                             `1` = "no",
                                             `2` = "dont know",
                                             `3` = "yes"),
                     cause_recoded=recode(cause_recoded,
                                          `1` = "dont know",
                                          `2` = "other",
                                          `3` = "not happening",
                                          `4` = "natural",
                                          `5` = "human",
                                          `6` = "natural and human"),
                     sci_consensus=recode(sci_consensus,
                                          `1` = "dont know",
                                          `2` = "disagreement",
                                          `3` = "not happening",
                                          `4` = "happening"),
                     gender=recode(gender,
                                   `1` = "male",
                                   `2` = "female"),
                     age_category_labels=recode(age_category,
                                                `1` = "18-34",
                                                `2` = "35-54",
                                                `3` = "55+"),
                     educ_category_labels=recode(educ_category,
                                                 `1` = "no highschool",
                                                 `2` = "highschool",
                                                 `3` = "college",
                                                 `4` = "bachelor or higher"),
                     income_category_labels=recode(income_category,
                                                   `1` = "less 50000",
                                                   `2` = "50000-99999",
                                                   `3` = "more 100000"),
                     race=recode(race,
                                 `1` = 'white non hisp',
                                 `2` = 'black non hisp',
                                 `3` = 'other non hisp',
                                 `4` = 'hisp'),
                     party_x_ideo=recode(party_x_ideo,
                                         `-2` = "no interest",
                                         `-1` = "refused",
                                         `1` = "liberal democrat",
                                         `2` = "moderate democrate",
                                         `3` = "independent",
                                         `4` = "moderate republican",
                                         `5` = "conservative republican"),
                     region4 = recode(region4,
                                      `1` = "northeast",
                                      `2` = "midwest",
                                      `3` = "south",
                                      `4` = "west"),
                     employment = recode(employment,
                                         `1` = "Working/as a paid employee",
                                         `2` = "Working/selfemploye",
                                         `3` = "Not working/temporary",
                                         `4` = "Not working/looking",
                                         `5` = "Not working/retired",
                                         `6` = "Not working/disabled",
                                         `7` = "Not working/other"))

Compute summary statistics

Another way to inspect the data is to compute summary statistics (like mean, SD, quantiles, percentages, etc.) on each column. In tidyverse, the summarise function allows you to specify what to compute for each column:

summarise(survey_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))
## # A tibble: 1 x 11
##   count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happening
##   <int>     <dbl>            <dbl>      <dbl>            <dbl>     <dbl>         <dbl>                <dbl>
## 1 18514    -0.935             1.74       2.52             3.29      6.24        -0.669                 1.93
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>
summarise(survey_data,
          mean_happening = mean(happening))
## # A tibble: 1 x 1
##   mean_happening
##            <dbl>
## 1           2.50

By default, summarise computes statistics on the whole dataset. However, when you have different experimental conditions and/or hierarchical data (repeated measures, longitudinal data, multi-site data, etc.), you might want to compute separate statistics per group or level, as defined by a categorical variable (the one specifying the condition/group). In tidyverse, this can be achieved by using the group_by function together with the summarise function:

grouped_data = group_by(survey_data, year)

summarise(grouped_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))
## # A tibble: 9 x 12
##    year count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happening
##   <dbl> <int>     <dbl>            <dbl>      <dbl>            <dbl>     <dbl>         <dbl>                <dbl>
## 1  2008  2130    -0.460             1.90       2.64             3.33      5.55        -0.353                 2.09
## 2  2010  1993    -0.574             1.63       2.42             3.22      5.51        -0.338                 1.65
## 3  2011  1946    -0.935             1.71       2.43             3.19      5.52        -0.215                 1.84
## 4  2012  2050    -0.484             1.75       2.50             3.29      5.35        -0.283                 2.03
## 5  2013  1858    -0.436             1.67       2.46             3.24      5.83        -0.131                 1.85
## 6  2014  2282    -0.304             1.72       2.50             3.30      5.32        -0.334                 1.82
## 7  2015  1263    -0.512             1.64       2.40             3.17      5.43        -0.223                 1.74
## 8  2016  2427    -0.291             1.76       2.56             3.36      6.24        -0.669                 2.11
## 9  2017  2565    -0.499             1.85       2.63             3.42      5.64        -0.360                 2.08
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>

Split character columns

Sometimes we get data in which one categorical column actually contains more than one variable. For example, in our dataset, the employment column carries information both about whether the person is working (a categorical variable with 2 levels) and about the type of employment (another categorical variable with as many levels as there are employment types).

With the separate function, we can split this information based on a specific text separator (in this case, /):

survey_data = separate(survey_data,
                       col = employment,
                       into = c('working', 'working_type'),
                       sep = '/',
                       remove = FALSE)
head(select(survey_data, employment, working, working_type), 10)
## # A tibble: 10 x 3
##    employment                 working     working_type      
##    <chr>                      <chr>       <chr>             
##  1 Not working/retired        Not working retired           
##  2 Not working/disabled       Not working disabled          
##  3 Not working/looking        Not working looking           
##  4 Not working/retired        Not working retired           
##  5 Working/as a paid employee Working     as a paid employee
##  6 Working/selfemployed       Working     selfemployed
##  7 Working/as a paid employee Working     as a paid employee
##  8 Working/as a paid employee Working     as a paid employee
##  9 Working/selfemployed       Working     selfemployed
## 10 Not working/other          Not working other

You can also use mutate together with the case_when function to create a new column encoding the specific categories you are interested in for your analysis:

survey_data = mutate(survey_data,
                     working_recoded = case_when(working == "Not working" & working_type == 'looking' ~ 0,
                                                 working == "Not working" & working_type != 'looking' ~ 1,
                                                 working == "Working" ~ 2))
                     
head(select(survey_data, employment, working, working_type, working_recoded), 10)
## # A tibble: 10 x 4
##    employment                 working     working_type       working_recoded
##    <chr>                      <chr>       <chr>                        <dbl>
##  1 Not working/retired        Not working retired                          1
##  2 Not working/disabled       Not working disabled                         1
##  3 Not working/looking        Not working looking                          0
##  4 Not working/retired        Not working retired                          1
##  5 Working/as a paid employee Working     as a paid employee               2
##  6 Working/selfemployed       Working     selfemployed                     2
##  7 Working/as a paid employee Working     as a paid employee               2
##  8 Working/as a paid employee Working     as a paid employee               2
##  9 Working/selfemployed       Working     selfemployed                     2
## 10 Not working/other          Not working other                            1

Once you have these groups, you can calculate the statistics you are interested in:

grouped_data = group_by(survey_data, working)

summarise(grouped_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))
## # A tibble: 2 x 12
##   working     count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happen…
##   <chr>       <int>     <dbl>            <dbl>      <dbl>            <dbl>     <dbl>         <dbl>              <dbl>
## 1 Not working  7996    -0.935             1.72       2.51             3.30      5.83        -0.334               1.91
## 2 Working     10518    -0.798             1.76       2.52             3.28      6.24        -0.669               1.95
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>
# first filter, and then get summary statistics per group
not_working_data = filter(survey_data, working == "Not working")

grouped_data = group_by(not_working_data, working_type)

summarise(grouped_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))
## # A tibble: 5 x 12
##   working_type count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happe…
##   <chr>        <int>     <dbl>            <dbl>      <dbl>            <dbl>     <dbl>         <dbl>             <dbl>
## 1 disabled      1259    -0.526             1.84       2.66             3.51      5.52       -0.263               2.00
## 2 looking       1112    -0.935             1.88       2.59             3.33      5.83       -0.215               2.02
## 3 other         1370    -0.574             1.62       2.45             3.24      5.64       -0.334               1.80
## 4 retired       4099    -0.470             1.68       2.45             3.24      5.42       -0.260               1.89
## 5 temporary      156     0.157             2.03       2.66             3.32      5.06       -0.0173              1.82
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>

Merge character columns

As we have done in basic R with the paste function, we can also merge the values of two columns to create new groups we might be interested in. In tidyverse we use unite:

survey_data = unite(survey_data,
                    "race_gender",
                    race, 
                    gender,
                    sep = "_",
                    remove=FALSE)

head(select(survey_data, race, gender, race_gender))
## # A tibble: 6 x 3
##   race           gender race_gender          
##   <chr>          <chr>  <chr>                
## 1 white non hisp female white non hisp_female
## 2 white non hisp male   white non hisp_male  
## 3 hisp           female hisp_female          
## 4 white non hisp male   white non hisp_male  
## 5 white non hisp female white non hisp_female
## 6 white non hisp male   white non hisp_male
grouped_data = group_by(survey_data, race_gender)

summarise(grouped_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))
## # A tibble: 8 x 12
##   race_gender           count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happ…
##   <chr>                 <int>     <dbl>            <dbl>      <dbl>            <dbl>     <dbl>         <dbl>            <dbl>
## 1 black non hisp_female   875    -0.463             1.93       2.62             3.35      5.21        0.267              2.26
## 2 black non hisp_male     724    -0.798             1.94       2.59             3.35      5.20        0.0225             2.16
## 3 hisp_female             875    -0.436             2.14       2.86             3.63      5.64       -0.142              2.21
## 4 hisp_male               919    -0.935             1.92       2.69             3.50      5.22       -0.669              2.12
## 5 other non hisp_female   562    -0.460             2.02       2.72             3.36      5.47       -0.156              2.14
## 6 other non hisp_male     570    -0.204             1.82       2.63             3.49      5.59       -0.0173             2.11
## 7 white non hisp_female  7000    -0.574             1.82       2.56             3.32      5.55       -0.549              1.98
## 8 white non hisp_male    6989    -0.526             1.54       2.35             3.14      6.24       -0.558              1.70
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>

Or you can simply use two grouping variables to compute the statistics:

grouped_data = group_by(survey_data, race, gender)

summarise(grouped_data, 
          count=n(), 
          min_worry=min(worry_cont),
          quantile25_worry=quantile(worry_cont, .25, na.rm = TRUE),
          mean_worry=mean(worry_cont), 
          quantile75_worry=quantile(worry_cont, .75, na.rm = TRUE),
          max_worry=max(worry_cont),
          min_happening=min(happening_cont),
          quantile25_happening=quantile(happening_cont, .25, na.rm = TRUE),
          mean_happening=mean(happening_cont),
          quantile75_happening=quantile(happening_cont, .75, na.rm = TRUE),
          max_happening=max(happening_cont))
## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.
## # A tibble: 8 x 13
## # Groups:   race [4]
##   race           gender count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happ…
##   <chr>          <chr>  <int>     <dbl>            <dbl>      <dbl>            <dbl>     <dbl>         <dbl>            <dbl>
## 1 black non hisp female   875    -0.463             1.93       2.62             3.35      5.21        0.267              2.26
## 2 black non hisp male     724    -0.798             1.94       2.59             3.35      5.20        0.0225             2.16
## 3 hisp           female   875    -0.436             2.14       2.86             3.63      5.64       -0.142              2.21
## 4 hisp           male     919    -0.935             1.92       2.69             3.50      5.22       -0.669              2.12
## 5 other non hisp female   562    -0.460             2.02       2.72             3.36      5.47       -0.156              2.14
## 6 other non hisp male     570    -0.204             1.82       2.63             3.49      5.59       -0.0173             2.11
## 7 white non hisp female  7000    -0.574             1.82       2.56             3.32      5.55       -0.549              1.98
## 8 white non hisp male    6989    -0.526             1.54       2.35             3.14      6.24       -0.558              1.70
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>

3. Pipes

Until now, we have used the tidyverse functions in the same way as all the other functions we have seen in basic R:

new_object = function1(old_object, some_argument)
new_object = function2(new_object, some_argument)
new_object = function3(new_object, some_argument)

However, in tidyverse, we can compact this notation using pipes:

new_object = old_object %>%
  function1(some_argument) %>%
  function2(some_argument) %>%
  function3(some_argument)

As you can see, this allows us to define an analysis pipeline in which each computation is performed on the result of the previous one.
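
For example, the head(arrange(survey_data_raw, wave), 6) call from the beginning of this page can be rewritten as a pipeline that produces the same output:

# same result as head(arrange(survey_data_raw, wave), 6), written with pipes
survey_data_raw %>%
    arrange(wave) %>%
    head(6)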

Here is how we could rewrite all previous computations from loading the raw data to the grouped summary statistics:

survey_data_raw = read_spss("data/ccam_original.sav")

survey_data = survey_data_raw %>%
    select(all_of(interesting_columns)) %>%
    filter(happening > 0, 
           cause_recoded > 0, 
           sci_consensus > 0, 
           worry > 0)  %>%
    mutate_all(as.integer) %>%
    mutate(happening_cont = happening + rnorm(mean=0, sd=.5, n=n()),
           worry_cont = worry + rnorm(mean=0, sd=.5, n=n()),
           year=recode(year, 
                       `1` = 2008, 
                       `2` = 2010,
                       `3` = 2011,
                       `4` = 2012, 
                       `5` = 2013,
                       `6` = 2014,
                       `7` = 2015,
                       `8` = 2016,
                       `9` = 2017),
           happening_labels=recode(happening,
                                   `1` = "no",
                                   `2` = "dont know",
                                   `3` = "yes"),
           cause_recoded=recode(cause_recoded,
                                `1` = "dont know",
                                `2` = "other",
                                `3` = "not happening",
                                `4` = "natural",
                                `5` = "human",
                                `6` = "natural and human"),
           sci_consensus=recode(sci_consensus,
                                `1` = "dont know",
                                `2` = "disagreement",
                                `3` = "not happening",
                                `4` = "happening"),
           gender=recode(gender,
                         `1` = "male",
                         `2` = "female"),
           age_category_labels=recode(age_category,
                                      `1` = "18-34",
                                      `2` = "35-54",
                                      `3` = "55+"),
           educ_category_labels=recode(educ_category,
                                       `1` = "no highschool",
                                       `2` = "highschool",
                                       `3` = "college",
                                       `4` = "bachelor or higher"),
           income_category_labels=recode(income_category,
                                         `1` = "less 50000",
                                         `2` = "50000-99999",
                                         `3` = "more 100000"),
           race=recode(race,
                       `1` = 'white non hisp',
                       `2` = 'black non hisp',
                       `3` = 'other non hisp',
                       `4` = 'hisp'),
           party_x_ideo=recode(party_x_ideo,
                               `-2` = "no interest",
                               `-1` = "refused",
                               `1` = "liberal democrat",
                               `2` = "moderate democrate",
                               `3` = "independent",
                               `4` = "moderate republican",
                               `5` = "conservative republican"),
           region4 = recode(region4,
                            `1` = "northeast",
                            `2` = "midwest",
                            `3` = "south",
                            `4` = "west"),
           employment = recode(employment,
                               `1` = "Working/as a paid employee",
                               `2` = "Working/selfemploye",
                               `3` = "Not working/temporary",
                               `4` = "Not working/looking",
                               `5` = "Not working/retired",
                               `6` = "Not working/disabled",
                               `7` = "Not working/other")) %>%
    separate(col = employment,
             into = c('working', 'working_type'),
             sep = '/') %>%
    filter(working == "Not working")  %>%
    group_by(working_type) %>%
    summarise(count=n(), 
              min_worry=min(worry_cont),
              quantile25_worry=quantile(worry_cont, .25),
              mean_worry=mean(worry_cont), 
              quantile75_worry=quantile(worry_cont, .75),
              max_worry=max(worry_cont),
              min_happening=min(happening_cont),
              quantile25_happening=quantile(happening_cont, .25),
              mean_happening=mean(happening_cont),
              quantile75_happening=quantile(happening_cont, .75),
              max_happening=max(happening_cont))
print(survey_data)
## # A tibble: 5 x 12
##   working_type count min_worry quantile25_worry mean_worry quantile75_worry max_worry min_happening quantile25_happe…
##   <chr>        <int>     <dbl>            <dbl>      <dbl>            <dbl>     <dbl>         <dbl>             <dbl>
## 1 disabled      1259   -0.0784             1.85       2.65             3.47      5.90        -0.437              2.07
## 2 looking       1112   -0.487              1.84       2.59             3.38      5.24        -0.560              2.04
## 3 other         1370   -0.357              1.64       2.45             3.23      5.37        -0.388              1.77
## 4 retired       4099   -0.646              1.66       2.45             3.23      5.70        -0.472              1.91
## 5 temporary      156   -0.347              2.03       2.68             3.29      5.01         0.325              1.85
## # … with 3 more variables: mean_happening <dbl>, quantile75_happening <dbl>, max_happening <dbl>

4. Join dataframes

In basic R, we saw how to combine separate dataframes using the cbind and rbind functions. The corresponding functions in tidyverse are bind_cols and bind_rows:

# Create fake data:
N = 15

mean_score_students = runif(N, 5, 10)
fake_data_first_batch = tibble( # same as data.frame but in tidyverse!
    student = 1:N,
    age = round(rnorm(N, 30, 5)),
    score_wpa1 = mean_score_students,
    score_wpa2 = mean_score_students*0.9 + rnorm(N, 0, 0.1),
    score_wpa3 = mean_score_students*0.5 + rnorm(N, 0, 1),
    score_wpa4 = mean_score_students*0.7 + rnorm(N, 0, 0.2)
)
print(fake_data_first_batch)
## # A tibble: 15 x 6
##    student   age score_wpa1 score_wpa2 score_wpa3 score_wpa4
##      <int> <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1       1    28       9.84       8.84       5.99       6.79
##  2       2    36       6.27       5.65       3.05       4.58
##  3       3    33       7.55       6.81       2.48       5.15
##  4       4    27       5.48       5.05       2.19       4.03
##  5       5    28       8.41       7.68       3.47       5.58
##  6       6    37       8.70       7.88       6.00       5.92
##  7       7    30       8.52       7.69       3.25       5.82
##  8       8    27       9.58       8.48       3.78       6.55
##  9       9    26       6.49       5.94       1.86       4.29
## 10      10    27       5.68       5.04       2.72       3.86
## 11      11    32       8.73       7.89       4.80       6.38
## 12      12    26       8.64       7.87       4.38       5.89
## 13      13    26       9.80       8.97       4.26       6.95
## 14      14    31       9.06       8.08       5.82       6.41
## 15      15    31       7.09       6.20       3.52       4.87
mean_score_students = runif(N, 5, 10)
fake_data_second_batch = tibble( # same as data.frame but in tidyverse!
    student = (N+1):(2*N),
    age = round(rnorm(N, 30, 5)),
    score_wpa1 = mean_score_students,
    score_wpa2 = mean_score_students*0.9 + rnorm(N, 0, 0.1),
    score_wpa3 = mean_score_students*0.5 + rnorm(N, 0, 1)
)
print(fake_data_second_batch)
## # A tibble: 15 x 5
##    student   age score_wpa1 score_wpa2 score_wpa3
##      <int> <dbl>      <dbl>      <dbl>      <dbl>
##  1      16    25       7.92       7.23       2.59
##  2      17    31       9.58       8.61       5.86
##  3      18    35       9.82       8.95       3.48
##  4      19    30       5.28       4.63       3.18
##  5      20    26       7.53       6.90       3.88
##  6      21    35       7.41       6.79       2.47
##  7      22    26       7.05       6.40       3.83
##  8      23    33       9.21       8.25       3.82
##  9      24    32       6.66       5.92       3.02
## 10      25    30       6.66       5.96       2.20
## 11      26    30       9.77       8.57       3.71
## 12      27    38       7.38       6.60       3.33
## 13      28    34       6.80       6.07       2.96
## 14      29    28       9.50       8.49       5.21
## 15      30    22       6.53       5.75       4.40
fake_data = bind_rows(fake_data_first_batch, fake_data_second_batch)
fake_data
## # A tibble: 30 x 6
##    student   age score_wpa1 score_wpa2 score_wpa3 score_wpa4
##      <int> <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1       1    28       9.84       8.84       5.99       6.79
##  2       2    36       6.27       5.65       3.05       4.58
##  3       3    33       7.55       6.81       2.48       5.15
##  4       4    27       5.48       5.05       2.19       4.03
##  5       5    28       8.41       7.68       3.47       5.58
##  6       6    37       8.70       7.88       6.00       5.92
##  7       7    30       8.52       7.69       3.25       5.82
##  8       8    27       9.58       8.48       3.78       6.55
##  9       9    26       6.49       5.94       1.86       4.29
## 10      10    27       5.68       5.04       2.72       3.86
## # … with 20 more rows
# new variables
new_info = tibble(
    student = 1:(2*N),
    score_wpa5 = mean_score_students*0.5 + rnorm(2*N, 0, 1),
    gender = rbinom(2*N, 1, .5)
)

print(new_info)
## # A tibble: 30 x 3
##    student score_wpa5 gender
##      <int>      <dbl>  <int>
##  1       1       3.84      1
##  2       2       6.37      1
##  3       3       3.48      1
##  4       4       4.24      0
##  5       5       4.55      1
##  6       6       5.11      1
##  7       7       2.82      0
##  8       8       5.95      1
##  9       9       2.25      1
## 10      10       1.75      0
## # … with 20 more rows
fake_data = bind_cols(fake_data,
                      new_info)
## New names:
## * student -> student...1
## * student -> student...7
print(fake_data)
## # A tibble: 30 x 9
##    student...1   age score_wpa1 score_wpa2 score_wpa3 score_wpa4 student...7 score_wpa5 gender
##          <int> <dbl>      <dbl>      <dbl>      <dbl>      <dbl>       <int>      <dbl>  <int>
##  1           1    28       9.84       8.84       5.99       6.79           1       3.84      1
##  2           2    36       6.27       5.65       3.05       4.58           2       6.37      1
##  3           3    33       7.55       6.81       2.48       5.15           3       3.48      1
##  4           4    27       5.48       5.05       2.19       4.03           4       4.24      0
##  5           5    28       8.41       7.68       3.47       5.58           5       4.55      1
##  6           6    37       8.70       7.88       6.00       5.92           6       5.11      1
##  7           7    30       8.52       7.69       3.25       5.82           7       2.82      0
##  8           8    27       9.58       8.48       3.78       6.55           8       5.95      1
##  9           9    26       6.49       5.94       1.86       4.29           9       2.25      1
## 10          10    27       5.68       5.04       2.72       3.86          10       1.75      0
## # … with 20 more rows

However, a better way to join dataframes is to use the join functions, as these allow us to join dataframes based on a specific ID. This is very important when there is repeated and/or missing information. full_join keeps all rows from both dataframes, left_join keeps all rows of the first dataframe, and right_join keeps all rows of the second; unmatched cells are filled with NA:

N = 15
mean_score_students = runif(N, 5, 10)
first_batch = tibble(
    student = 1:N,
    age = round(rnorm(N, 30, 5)),
    score_wpa1 = mean_score_students,
    score_wpa2 = mean_score_students*0.9 + rnorm(N, 0, 0.1),
    score_wpa3 = mean_score_students*0.5 + rnorm(N, 0, 1),
    score_wpa4 = mean_score_students*0.7 + rnorm(N, 0, 0.2)
)
print(first_batch)
## # A tibble: 15 x 6
##    student   age score_wpa1 score_wpa2 score_wpa3 score_wpa4
##      <int> <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1       1    26       8.91       7.98       4.33       6.05
##  2       2    32       9.50       8.60       5.62       6.91
##  3       3    40       8.22       7.38       2.69       5.97
##  4       4    38       8.52       7.55       3.60       5.61
##  5       5    37       7.38       6.57       4.90       5.08
##  6       6    25       8.22       7.32       3.42       5.57
##  7       7    30       8.58       7.63       3.11       5.95
##  8       8    36       9.23       8.46       4.29       6.34
##  9       9    29       7.52       6.81       4.50       5.46
## 10      10    27       8.21       7.42       3.01       6.06
## 11      11    32       6.47       5.79       2.71       4.58
## 12      12    26       9.93       8.97       4.55       7.04
## 13      13    24       5.08       4.65       3.27       3.02
## 14      14    23       5.77       5.34       4.19       3.95
## 15      15    30       8.84       7.95       4.42       5.77
M = 5
mean_score_students = runif(M, 8, 10)
second_batch = tibble( 
    student = (N+1):(N+M),
    age = round(rnorm(M, 30, 5)),
    score_wpa1 = mean_score_students,
    score_wpa2 = mean_score_students*0.9 + rnorm(M, 0, 0.1),
    score_wpa5 = mean_score_students*0.5 + rnorm(M, 0, 1)
)
print(second_batch)
## # A tibble: 5 x 5
##   student   age score_wpa1 score_wpa2 score_wpa5
##     <int> <dbl>      <dbl>      <dbl>      <dbl>
## 1      16    34       9.70       8.79       4.40
## 2      17    29       9.24       8.29       4.66
## 3      18    36       9.06       8.18       4.48
## 4      19    30       8.66       8.00       3.09
## 5      20    24       9.42       8.48       5.38
new_info = tibble(
    student = 1:(N+M),
    score_wpa6 = runif(N+M, 6, 10)*0.5 + rnorm(N+M, 0, 1),
    gender = rbinom(N+M, 1, .5)
)
print(new_info)
## # A tibble: 20 x 3
##    student score_wpa6 gender
##      <int>      <dbl>  <int>
##  1       1       2.24      0
##  2       2       3.31      1
##  3       3       4.20      0
##  4       4       2.36      0
##  5       5       4.81      0
##  6       6       3.11      0
##  7       7       1.84      0
##  8       8       2.92      1
##  9       9       2.33      1
## 10      10       3.69      1
## 11      11       4.91      1
## 12      12       3.35      1
## 13      13       5.14      0
## 14      14       4.38      0
## 15      15       3.83      0
## 16      16       2.87      0
## 17      17       1.85      1
## 18      18       4.10      0
## 19      19       5.16      1
## 20      20       3.97      1
full_join(first_batch, second_batch, by='student')
## # A tibble: 20 x 10
##    student age.x score_wpa1.x score_wpa2.x score_wpa3 score_wpa4 age.y score_wpa1.y score_wpa2.y score_wpa5
##      <int> <dbl>        <dbl>        <dbl>      <dbl>      <dbl> <dbl>        <dbl>        <dbl>      <dbl>
##  1       1    26         8.91         7.98       4.33       6.05    NA        NA           NA         NA   
##  2       2    32         9.50         8.60       5.62       6.91    NA        NA           NA         NA   
##  3       3    40         8.22         7.38       2.69       5.97    NA        NA           NA         NA   
##  4       4    38         8.52         7.55       3.60       5.61    NA        NA           NA         NA   
##  5       5    37         7.38         6.57       4.90       5.08    NA        NA           NA         NA   
##  6       6    25         8.22         7.32       3.42       5.57    NA        NA           NA         NA   
##  7       7    30         8.58         7.63       3.11       5.95    NA        NA           NA         NA   
##  8       8    36         9.23         8.46       4.29       6.34    NA        NA           NA         NA   
##  9       9    29         7.52         6.81       4.50       5.46    NA        NA           NA         NA   
## 10      10    27         8.21         7.42       3.01       6.06    NA        NA           NA         NA   
## 11      11    32         6.47         5.79       2.71       4.58    NA        NA           NA         NA   
## 12      12    26         9.93         8.97       4.55       7.04    NA        NA           NA         NA   
## 13      13    24         5.08         4.65       3.27       3.02    NA        NA           NA         NA   
## 14      14    23         5.77         5.34       4.19       3.95    NA        NA           NA         NA   
## 15      15    30         8.84         7.95       4.42       5.77    NA        NA           NA         NA   
## 16      16    NA        NA           NA         NA         NA       34         9.70         8.79       4.40
## 17      17    NA        NA           NA         NA         NA       29         9.24         8.29       4.66
## 18      18    NA        NA           NA         NA         NA       36         9.06         8.18       4.48
## 19      19    NA        NA           NA         NA         NA       30         8.66         8.00       3.09
## 20      20    NA        NA           NA         NA         NA       24         9.42         8.48       5.38
full_join(first_batch, second_batch, by=c('student', 'age', "score_wpa1", "score_wpa2"))
## # A tibble: 20 x 7
##    student   age score_wpa1 score_wpa2 score_wpa3 score_wpa4 score_wpa5
##      <int> <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1       1    26       8.91       7.98       4.33       6.05      NA   
##  2       2    32       9.50       8.60       5.62       6.91      NA   
##  3       3    40       8.22       7.38       2.69       5.97      NA   
##  4       4    38       8.52       7.55       3.60       5.61      NA   
##  5       5    37       7.38       6.57       4.90       5.08      NA   
##  6       6    25       8.22       7.32       3.42       5.57      NA   
##  7       7    30       8.58       7.63       3.11       5.95      NA   
##  8       8    36       9.23       8.46       4.29       6.34      NA   
##  9       9    29       7.52       6.81       4.50       5.46      NA   
## 10      10    27       8.21       7.42       3.01       6.06      NA   
## 11      11    32       6.47       5.79       2.71       4.58      NA   
## 12      12    26       9.93       8.97       4.55       7.04      NA   
## 13      13    24       5.08       4.65       3.27       3.02      NA   
## 14      14    23       5.77       5.34       4.19       3.95      NA   
## 15      15    30       8.84       7.95       4.42       5.77      NA   
## 16      16    34       9.70       8.79      NA         NA          4.40
## 17      17    29       9.24       8.29      NA         NA          4.66
## 18      18    36       9.06       8.18      NA         NA          4.48
## 19      19    30       8.66       8.00      NA         NA          3.09
## 20      20    24       9.42       8.48      NA         NA          5.38
left_join(first_batch, new_info, by=c('student'))
## # A tibble: 15 x 8
##    student   age score_wpa1 score_wpa2 score_wpa3 score_wpa4 score_wpa6 gender
##      <int> <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>  <int>
##  1       1    26       8.91       7.98       4.33       6.05       2.24      0
##  2       2    32       9.50       8.60       5.62       6.91       3.31      1
##  3       3    40       8.22       7.38       2.69       5.97       4.20      0
##  4       4    38       8.52       7.55       3.60       5.61       2.36      0
##  5       5    37       7.38       6.57       4.90       5.08       4.81      0
##  6       6    25       8.22       7.32       3.42       5.57       3.11      0
##  7       7    30       8.58       7.63       3.11       5.95       1.84      0
##  8       8    36       9.23       8.46       4.29       6.34       2.92      1
##  9       9    29       7.52       6.81       4.50       5.46       2.33      1
## 10      10    27       8.21       7.42       3.01       6.06       3.69      1
## 11      11    32       6.47       5.79       2.71       4.58       4.91      1
## 12      12    26       9.93       8.97       4.55       7.04       3.35      1
## 13      13    24       5.08       4.65       3.27       3.02       5.14      0
## 14      14    23       5.77       5.34       4.19       3.95       4.38      0
## 15      15    30       8.84       7.95       4.42       5.77       3.83      0
right_join(new_info, second_batch, by=c('student'))
## # A tibble: 5 x 7
##   student score_wpa6 gender   age score_wpa1 score_wpa2 score_wpa5
##     <int>      <dbl>  <int> <dbl>      <dbl>      <dbl>      <dbl>
## 1      16       2.87      0    34       9.70       8.79       4.40
## 2      17       1.85      1    29       9.24       8.29       4.66
## 3      18       4.10      0    36       9.06       8.18       4.48
## 4      19       5.16      1    30       8.66       8.00       3.09
## 5      20       3.97      1    24       9.42       8.48       5.38
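
Finally, besides full_join, left_join, and right_join, there is also an inner_join function, which keeps only the rows whose ID appears in both dataframes. A minimal sketch on the dataframes above:

# only students 16-20 appear in both second_batch and new_info, so 5 rows are kept
inner_join(second_batch, new_info, by='student')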

5. Now it’s your turn

Now you will analyze data from Matthews et al. (2016): Why do we overestimate others’ willingness to pay? The purpose of this research was to test whether our beliefs about other people’s affluence (i.e., wealth) affect how much we think they will be willing to pay for items. You can find the full paper at http://journal.sjdm.org/15/15909/jdm15909.pdf.

Variable descriptions:

Here are the descriptions of the data variables (taken from the authors’ dataset notes available at http://journal.sjdm.org/15/15909/Notes.txt)

Task A

First, download the data_wpa4.csv and matthews_demographics.csv datasets from the data folder on GitHub and load them in R.

Note: do not use pipes in steps 1 to 4.

  1. Currently, gender is coded as 1 and 2. Create a new dataframe called new_matthews_data, in which there is a new column called gender_labels that codes gender as “male” and “female”. Do it using mutate. Then, rename the original gender column to gender_binary using rename. Subtract 1 from all values of gender_binary, so that it is coded as 0 and 1 instead of 1 and 2, using mutate again.

  2. In new_matthews_data, create a new column called income_labels that codes income based on the data description above using mutate. Then, create a new column, called income_recoded, where you only have 4 income categories (coded as numbers from 1 to 4): below 25,000, 25,000-50,000, 50,000-100,000, and above 100,000, using case_when. How many observations are there for each of these 4 categories? Use summarise to reply.

  3. In new_matthews_data, transform all numeric columns into integers using mutate_if.

  4. From new_matthews_data, create a summary of the dataset using summarise to answer the following questions: What percent of participants were female? What was the minimum, mean, and maximum income? What was the 25th percentile, median, and 75th percentile of age? Use informative column names.

  5. Repeat steps 1 to 4 (apart from the summarise in step 2) using pipes, and assign the result to new_matthews_data_summary.

Task B

  1. From new_matthews_data, calculate the mean of p1 to p10 across participants using summarise_all and select. Which product scored the highest? Do it again, grouping the data by gender. Is there a difference across genders? What is the mean of the mean p1 to p10 across participants? Calculate it on the result of the previous step. You can do these either with or without pipes.

  2. Transform the data from wide to long format. In particular, you want 10 rows per subject, with their responses on products 1 to 10 in a column called wtp, and the product label in a column called product. Call the resulting dataframe new_matthews_data_long. Re-order it by id. Print the first 20 cases to check this worked. Check that new_matthews_data_long has 10 times more rows than new_matthews_data.

# these are the solutions for B2, as we will cover wide to long data transformation later in the seminar
new_matthews_data_long = gather(new_matthews_data,
                                key='product',
                                value='wtp',
                                p1:p10)

new_matthews_data_long = arrange(new_matthews_data_long, id)
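
Note: in recent versions of tidyr, pivot_longer supersedes gather; a sketch of the equivalent call would be:

# equivalent to the gather call above, written with the newer pivot_longer
new_matthews_data_long = pivot_longer(new_matthews_data,
                                      cols = p1:p10,
                                      names_to = 'product',
                                      values_to = 'wtp')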

Task C

  1. Drop the X1 column in demographics using select.

  2. Join new_matthews_data_long and demographics based on the id, in order to retain as many rows and columns as possible. Call the resulting dataframe matthews_data_all.

  3. Calculate the mean wtp per subject using group_by. You can use pipes or not. Call the resulting dataframe mean_matthews_data_all. This should have as many rows as the number of subjects and 2 columns (id and mean wtp). Add height and race as third and fourth columns using one of the join functions.

  4. Using mean_matthews_data_all, make a barplot showing the mean wtp across ethnic groups. Plot confidence intervals. Give appropriate labels to the plot. Do you think there is a difference in willingness to pay across groups?

# these are the solutions for C4, as we will cover plotting later in the seminar
ggplot(data = mean_matthews_data_all, mapping = aes(x = factor(race), y = mean_wtp)) +
    stat_summary(fun = "mean", geom="bar") +
    stat_summary(fun.data = mean_cl_normal, geom = "errorbar", size=1, width=.4) +
    labs(x = 'Race', y = 'Mean WTP') + 
    theme(axis.text.x = element_text(angle = 90))

  5. Using mean_matthews_data_all, make a scatterplot showing the wtp on the y-axis and the height on the x-axis. Add a regression line. Do you think height predicts willingness to pay?

# these are the solutions for C5, as we will cover plotting later in the seminar
ggplot(data = mean_matthews_data_all, mapping = aes(x = height, y = mean_wtp)) + 
    geom_point(alpha = 0.3, size= 2) +
    geom_smooth(method = lm, color='grey') +
    labs(x='Height', y='Mean WTP') +
    ggtitle("Relationship betweenheight and WTP")

Submit your assignment

Save and email your script to me at laura.fontanesi@unibas.ch by the end of Friday.