6. Now it’s your turn

Now you will analyze data from Matthews et al. (2016): Why do we overestimate others’ willingness to pay? The purpose of this research was to test if our beliefs about other people’s affluence (i.e.; wealth) affect how much we think they will be willing to pay for items. You can find the full paper at http://journal.sjdm.org/15/15909/jdm15909.pdf.

Variables description:

Here are descriptions of the data variables (taken from the author’s dataset notes available at http://journal.sjdm.org/15/15909/Notes.txt)

Task A

First, download the data_wpa4.csv and matthews_demographics.csv datasets from the data folder on Github and load them in R.

library(tidyverse)

matthews_data = read_csv('https://raw.githubusercontent.com/laurafontanesi/r-seminar22/master/data/data_wpa4.csv')
## 
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   id = col_character(),
##   gender = col_double(),
##   age = col_double(),
##   income = col_double(),
##   p1 = col_double(),
##   p2 = col_double(),
##   p3 = col_double(),
##   p4 = col_double(),
##   p5 = col_double(),
##   p6 = col_double(),
##   p7 = col_double(),
##   p8 = col_double(),
##   p9 = col_double(),
##   p10 = col_double(),
##   task = col_double(),
##   havemore = col_double(),
##   haveless = col_double(),
##   pcmore = col_double()
## )
demographics = read_csv("https://raw.githubusercontent.com/laurafontanesi/r-seminar22/master/data/matthews_demographics.csv")
## Warning: Missing column names filled in: 'X1' [1]
## 
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   X1 = col_double(),
##   id = col_character(),
##   height = col_double(),
##   race = col_character()
## )
glimpse(matthews_data)
## Rows: 190
## Columns: 18
## $ id       <chr> "R_3PtNn51LmSFdLNM", "R_2AXrrg62pgFgtMV", "R_cwEOX3HgnMeVQHL", "R_d59iPwL4W6BH8qx", "R_1f3K2HrGzFG…
## $ gender   <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1,…
## $ age      <dbl> 26, 32, 25, 33, 24, 22, 47, 26, 29, 32, 29, 28, 31, 24, 25, 28, 20, 41, 44, 23, 49, 26, 19, 57, 30…
## $ income   <dbl> 7, 4, 2, 5, 1, 2, 3, 4, 1, 7, 4, 3, 2, 2, 6, 3, 2, 2, 1, 3, 3, 6, 4, 5, 5, 2, 3, 2, 5, 4, 6, 7, 5,…
## $ p1       <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,…
## $ p2       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ p3       <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,…
## $ p4       <dbl> 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ p5       <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ p6       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ p7       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,…
## $ p8       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ p9       <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1,…
## $ p10      <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ task     <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,…
## $ havemore <dbl> NA, NA, NA, NA, 99, NA, 95, NA, 70, 25, 50, 50, NA, NA, NA, 95, NA, NA, 85, NA, 45, 25, NA, 50, 50…
## $ haveless <dbl> 50, 25, 10, 50, NA, 20, NA, 30, NA, NA, NA, NA, 44, 20, 65, NA, 50, 2, NA, 85, NA, NA, 25, NA, NA,…
## $ pcmore   <dbl> 50, 75, 90, 50, 99, 80, 95, 70, 70, 25, 50, 50, 56, 80, 35, 95, 50, 98, 85, 15, 45, 25, 75, 50, 50…
glimpse(demographics)
## Rows: 190
## Columns: 4
## $ X1     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2…
## $ id     <chr> "R_2zOA03wY3kZPHKj", "R_1f30AbYQR0D1wpt", "R_2YgEERKijE2Hz6K", "R_1CdL2n22srUR26d", "R_1HnsqBRimnKph…
## $ height <dbl> 182, 184, 175, 177, 187, 178, 168, 193, 180, 159, 160, 172, 165, 162, 169, 169, 167, 168, 178, 167, …
## $ race   <chr> "black", "white", "hispanic", "white", "white", "white", "white", "white", "white", "hispanic", "whi…

Note: do not use pipes from 1 to 4.

  1. Currently gender is coded as 1 and 2. Create a new dataframe called new_matthews_data, in which there is a new column called gender_labels that codes gender as “male” and “female”. Do it using mutate. Then, rename the original gender column to gender_binary using rename. Subtract 1 to all values of gender_binary, so that it is coded as 0 and 1 instead of 1 and 2 using mutate again.
new_matthews_data = mutate(matthews_data,
                           gender_labels = recode(gender,
                                                  `1`="male",
                                                  `2`="female"))
new_matthews_data = rename(new_matthews_data,
                           gender_binary = gender)

new_matthews_data = mutate(new_matthews_data,
                           gender_binary = gender_binary - 1)
head(select(new_matthews_data, gender_binary, gender_labels))
## # A tibble: 6 x 2
##   gender_binary gender_labels
##           <dbl> <chr>        
## 1             1 female       
## 2             1 female       
## 3             0 male         
## 4             0 male         
## 5             0 male         
## 6             0 male
  1. In new_matthews_data, create new column called income_labels that codes income based on the data description above using mutate. Then, create a new column, called income_recoded, where you only have 4 income categories (coded as numbers from 1 to 4): below 25,000, 25,000-50,000, 50,000-100,000, and above 100,000 using case_when. How many observations are there for each of these 4 categories? Use summarise to reply.
new_matthews_data = mutate(new_matthews_data,
                           income_lables = recode(income,
                                                  `1`='less5,000',
                                                  `2`='15,001–25,000',
                                                  `3`='25,001–35,000',
                                                  `4`='35,001–50,000',
                                                  `5`='50,001–75,000',
                                                  `6`='75,001–100,000',
                                                  `7`='100,001–150,000',
                                                  `8`='greater150,000'
                                                 ),
                           income_recoded = case_when(income < 3 ~ 1,
                                                      income >=3 & income < 5 ~ 2,
                                                      income >= 5 & income < 7 ~ 3,
                                                      income >=7 ~ 4))

summarise(group_by(new_matthews_data, income_recoded), 
          n = n())
## # A tibble: 4 x 2
##   income_recoded     n
##            <dbl> <int>
## 1              1    72
## 2              2    56
## 3              3    52
## 4              4    10
  1. In new_matthews_data, transform all numeric columns into integers numbers using mutate_if.
new_matthews_data = mutate_if(new_matthews_data,
                              is.numeric, 
                              as.integer)
head(new_matthews_data)
## # A tibble: 6 x 21
##   id            gender_binary   age income    p1    p2    p3    p4    p5    p6    p7    p8    p9   p10  task havemore
##   <chr>                 <int> <int>  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>    <int>
## 1 R_3PtNn51LmS…             1    26      7     1     1     1     1     1     1     1     1     1     1     0       NA
## 2 R_2AXrrg62pg…             1    32      4     1     1     1     1     1     1     1     1     1     1     0       NA
## 3 R_cwEOX3HgnM…             0    25      2     0     1     1     1     1     1     1     1     0     0     0       NA
## 4 R_d59iPwL4W6…             0    33      5     1     1     1     1     1     1     1     1     1     1     0       NA
## 5 R_1f3K2HrGzF…             0    24      1     1     1     0     1     1     1     1     1     1     1     1       99
## 6 R_3oN5ijzTfo…             0    22      2     1     1     0     0     1     1     1     1     0     1     0       NA
## # … with 5 more variables: haveless <int>, pcmore <int>, gender_labels <chr>, income_lables <chr>,
## #   income_recoded <int>
  1. From new_matthews_data, create a summary of the dataset using summarise, to answer the following questions: What percent of participants were female? What was the minimum, mean, and maximum income? What was the 25th percentile, median, and the 75th percentile of age? Use good names for columns.
summarise(new_matthews_data,
          perc_female = mean(gender_binary)*100,
          min_income = min(income),
          mean_income = mean(income),
          max_income = max(income),
          perc25_age = quantile(age, .25),
          median_age = quantile(age, .5),
          perc75_age = quantile(age, .75)
          )
## # A tibble: 1 x 7
##   perc_female min_income mean_income max_income perc25_age median_age perc75_age
##         <dbl>      <int>       <dbl>      <int>      <dbl>      <dbl>      <dbl>
## 1        37.4          1        3.53          8         25         30         35
  1. Repeat steps from 1 to 4 (apart from the summarise in point 2) using pipes and assign the result to new_matthews_data_summary.
new_matthews_data_summary = matthews_data %>% 
    mutate(gender_labels = recode(gender,
                                 `1`="male",
                                 `2`="female"))  %>% 
    rename(gender_binary = gender)  %>% 
    mutate(gender_binary = gender_binary - 1) %>%
    mutate(income_lables = recode(income,
                                  `1`='less5,000',
                                  `2`='15,001–25,000',
                                  `3`='25,001–35,000',
                                  `4`='35,001–50,000',
                                  `5`='50,001–75,000',
                                  `6`='75,001–100,000',
                                  `7`='100,001–150,000',
                                  `8`='greater150,000'),
           income_recoded = case_when(income < 3 ~ 1,
                                      income >=3 & income < 5 ~ 2,
                                      income >= 5 & income < 7 ~ 3,
                                      income >=7 ~ 4)) %>%
    mutate_if(is.numeric, 
              as.integer) %>%
    summarise(perc_female = mean(gender_binary)*100,
              min_income = min(income),
              mean_income = mean(income),
              max_income = max(income),
              perc25_age = quantile(age, .25),
              median_age = quantile(age, .5),
              perc75_age = quantile(age, .75))

print(new_matthews_data_summary)
## # A tibble: 1 x 7
##   perc_female min_income mean_income max_income perc25_age median_age perc75_age
##         <dbl>      <int>       <dbl>      <int>      <dbl>      <dbl>      <dbl>
## 1        37.4          1        3.53          8         25         30         35

Task B

  1. From new_matthews_data, calculate the mean p1 to p10 across participants using summarise_all and select. Which product scored the highest? Do it again, grouping the data by gender. Is there a difference across gender? What is the mean of the mean p1 to p10 across participants? Calculate it on the result of the previous step. You can do these either using pipes or not.
# with pipes
new_matthews_data %>%
    group_by(gender_binary) %>%
    select(p1:p10) %>%
    summarise_all(mean)
## Adding missing grouping variables: `gender_binary`
## # A tibble: 2 x 11
##   gender_binary    p1    p2    p3    p4    p5    p6    p7    p8    p9   p10
##           <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1             0 0.723 0.849 0.630 0.857 0.790 0.782 0.866 0.874 0.689 0.706
## 2             1 0.648 0.887 0.831 0.845 0.746 0.831 0.859 0.944 0.648 0.775
# without pipes
selected_data = select(new_matthews_data, p1:p10, gender_binary)
grouped_data = group_by(selected_data, gender_binary)
summarise_all(grouped_data, mean)
## # A tibble: 2 x 11
##   gender_binary    p1    p2    p3    p4    p5    p6    p7    p8    p9   p10
##           <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1             0 0.723 0.849 0.630 0.857 0.790 0.782 0.866 0.874 0.689 0.706
## 2             1 0.648 0.887 0.831 0.845 0.746 0.831 0.859 0.944 0.648 0.775
  1. Transform the data from wide to long format. In particular, you want 10 rows per subjects, with their responses on the products 1 to 10 in a column called wtp, and the product label in a column called product. Call the resulting dataframe new_matthews_data_long. Re-order it by id. Print the first 20 cases to check this worked. Check that new_matthews_data_long has 10 times more rows than new_matthews_data.
# these are the solutions for B2, as we will cover wide to long data transformation later in the seminar
new_matthews_data_long = gather(new_matthews_data,
                                key='product',
                                value='wtp',
                                p1:p10)

new_matthews_data_long = arrange(new_matthews_data_long, id)

head(new_matthews_data_long)
## # A tibble: 6 x 13
##   id     gender_binary   age income  task havemore haveless pcmore gender_labels income_lables income_recoded product
##   <chr>          <int> <int>  <int> <int>    <int>    <int>  <int> <chr>         <chr>                  <int> <chr>  
## 1 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p1     
## 2 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p2     
## 3 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p3     
## 4 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p4     
## 5 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p5     
## 6 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p6     
## # … with 1 more variable: wtp <int>

Task C

  1. Drop the X1 column in demographics using select.
demographics = select(demographics, -c(X1))

head(demographics)
## # A tibble: 6 x 3
##   id                height race    
##   <chr>              <dbl> <chr>   
## 1 R_2zOA03wY3kZPHKj    182 black   
## 2 R_1f30AbYQR0D1wpt    184 white   
## 3 R_2YgEERKijE2Hz6K    175 hispanic
## 4 R_1CdL2n22srUR26d    177 white   
## 5 R_1HnsqBRimnKphfg    187 white   
## 6 R_12PUXbJhfNF9GAE    178 white
  1. Join new_matthews_data_long and demographics based on the id, in order to retain as many rows and columns as possible. Call the resulting dataframe matthews_data_all.
matthews_data_all = full_join(new_matthews_data_long, demographics, by="id")

head(matthews_data_all)
## # A tibble: 6 x 15
##   id     gender_binary   age income  task havemore haveless pcmore gender_labels income_lables income_recoded product
##   <chr>          <int> <int>  <int> <int>    <int>    <int>  <int> <chr>         <chr>                  <int> <chr>  
## 1 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p1     
## 2 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p2     
## 3 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p3     
## 4 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p4     
## 5 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p5     
## 6 R_0JL…             1    45      4     1       80       NA     80 female        35,001–50,000              2 p6     
## # … with 3 more variables: wtp <int>, height <dbl>, race <chr>
  1. Calculate the mean wtp per subject using group_by. You can use pipes or not. Called the resulting dataframe mean_matthews_data_all. This should have as many rows as the number of subjects and 2 columns (id and mean wtp). Add as a third and fourth columns heigth and race using one of the join functions.
mean_matthews_data_all = matthews_data_all %>%
    group_by(id) %>%
    summarise(mean_wtp=mean(wtp))

mean_matthews_data_all = full_join(mean_matthews_data_all, demographics, by="id")

head(mean_matthews_data_all)
## # A tibble: 6 x 4
##   id                mean_wtp height race    
##   <chr>                <dbl>  <dbl> <chr>   
## 1 R_0JLtfRpOyh8pOCN      0.9    176 black   
## 2 R_0O52mja29pBTphb      0.6    171 asian   
## 3 R_0OnZ5ZwmVAU6w2V      0.8    150 hispanic
## 4 R_0oXJeSTrYNzQlRH      0.8    180 asian   
## 5 R_10CVQqqmFiczGPS      0.7    173 asian   
## 6 R_10HKEOMi0HhWe7T      0.7    177 asian
  1. Using mean_matthews_data_all, make a barplot showing the mean wtp across ethnic groups. Plot confidence intervals. Give appropriate labels to the plot. Do you think there is a difference in willingness to pay across groups?
# these are the solutions for B4, as we will cover plotting later in the seminar
ggplot(data = mean_matthews_data_all, mapping = aes(x = factor(race), y = mean_wtp)) +
    stat_summary(fun = "mean", geom="bar") +
    stat_summary(fun.data = mean_cl_normal, geom = "errorbar", size=1, width=.4) +
    labs(x = 'Race', y = 'Mean WTP') + 
    theme(axis.text.x = element_text(angle = 90))

  1. Using mean_matthews_data_all, make a scatterplot showing the wtp on the y-axis and the height on the x-axis. Add a regression line. Do you think height predicts willingness to pay?
# these are the solutions for B5, as we will cover plotting later in the seminar
ggplot(data = mean_matthews_data_all, mapping = aes(x = height, y = mean_wtp)) + 
    geom_point(alpha = 0.3, size= 2) +
    geom_smooth(method = lm, color='grey') +
    labs(x='Height', y='Mean WTP') +
    ggtitle("Relationship betweenheight and WTP")
## `geom_smooth()` using formula 'y ~ x'

Submit your assignment

Save and email your script to me at laura.fontanesi@unibas.ch by the end of Friday.