Open RStudio.

Open a new R script in RStudio and save it as wpa_3_LastFirst.R (where Last and First are your last and first name).

Be careful about: capitalization, the order of last and first name, and using _ instead of -.

At the top of your script, write the following (with appropriate changes):

# Assignment: WPA 3
# Name: Laura Fontanesi
# Date: 29 March 2022

1. Importing data

Up to this point, I gave you the code to load datasets in R.

Say instead you have your own data saved on your computer or somewhere online. How can you analyze this data in R?

You have two main ways to do it:

  • using the “Import Dataset” button in the “Environment” tab in RStudio

  • using code

Today, we will learn how to import data using code from the tidyverse package. This allows us to import data directly into tidyverse objects (i.e., tibbles, as we will see in the next two lessons).

The specific sub-package for importing data is called readr.

Data can come from different sources, e.g.:

  • text files stored locally

  • text files from a website

The functions you will use depend on the specific format the data were written in:

Different functions to read tabular data from file:

  • read_delim() is the principal and most general function for reading tabular data into R

  • read_csv() sets the default separator to a comma

  • read_csv2() is its European cousin, using a comma for decimal places and a semicolon as a separator

  • read_tsv() imports tab-delimited files
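As a quick illustration of how these functions relate (using a throwaway file written to a temporary location, not one of the course datasets), read_csv() is simply read_delim() with the delimiter fixed to a comma:

```r
library(tidyverse)

# Write a tiny comma-separated file to a temporary location (illustration only)
tmp = tempfile(fileext = '.csv')
writeLines(c('id,score', '1,3.5', '2,4.2'), tmp)

# read_csv() and read_delim(delim = ',') parse it the same way
by_csv = read_csv(tmp, col_types = cols())
by_delim = read_delim(tmp, delim = ',', col_types = cols())

all.equal(as.data.frame(by_csv), as.data.frame(by_delim))  # TRUE
```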

Different functions to read other data formats:

  • read_excel() (from the readxl package) imports Excel files (.xls and .xlsx)

  • read_sav() (from the haven package) imports SPSS files (.sav)

  • read_dta() (from the haven package) imports Stata files (.dta)

2. File paths for local files

At this point, you should have a folder on your laptop for our R course, where you stored all your scripts. Create a subfolder called data.

When you are done, download the content of this folder ‘https://www.dropbox.com/sh/kw6o7ztouwpiawk/AACG5YtjeF58YaKjkK9h428Ka?dl=0’ into your data folder, so that the 5 data files are in your data folder.

To load these files in R, we need to write the path to your data folder. We can do this using code completion (Tab key):

  • On Mac, you can start from read_delim('~/') (or similar functions for loading data) and press Tab, to start navigating from your home folder

  • On Windows, you can do the same, but starting from read_delim('C:/Users/') (note that inside R strings you should use forward slashes or doubled backslashes, because a single \ starts an escape sequence)
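If you are unsure where R is currently looking for files, getwd() shows the working directory, and file.path() joins path components with the right separator on every system (the file name below is just a hypothetical example):

```r
# getwd() shows your current working directory;
# relative paths are resolved starting from here
getwd()

# file.path() joins path components portably (Mac, Linux, and Windows)
path_to_file = file.path('data', 'data_to_import_a.txt')
path_to_file
## [1] "data/data_to_import_a.txt"
```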

In my case, this folder was on Dropbox:

library(tidyverse)

data_a = read_delim('~/Dropbox/teaching/r-course22/data/data_to_import_a.txt', delim='\t')
## 
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   index = col_double(),
##   participant = col_double(),
##   gender = col_character(),
##   age = col_double(),
##   options = col_character(),
##   accuracy = col_double(),
##   RT_msec = col_double()
## )
head(data_a)
## # A tibble: 6 x 7
##   index participant gender   age options accuracy RT_msec
##   <dbl>       <dbl> <chr>  <dbl> <chr>      <dbl>   <dbl>
## 1     1           8 male      18 CD             1    2381
## 2     2           8 male      18 CD             1    1730
## 3     3           8 male      18 AB             1    1114
## 4     4           8 male      18 AC             1     600
## 5     5           8 male      18 CD             1     683
## 6     6           8 male      18 AC             0     854
data_b = read_csv('~/Dropbox/teaching/r-course22/data/data_to_import_b.csv')
## 
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   id = col_character(),
##   gender = col_double(),
##   age = col_double(),
##   income = col_double(),
##   p1 = col_double(),
##   p2 = col_double(),
##   p3 = col_double(),
##   p4 = col_double(),
##   p5 = col_double(),
##   p6 = col_double(),
##   p7 = col_double(),
##   p8 = col_double(),
##   p9 = col_double(),
##   p10 = col_double(),
##   task = col_double(),
##   havemore = col_double(),
##   haveless = col_double(),
##   pcmore = col_double()
## )
# same as: data_b = read_delim('~/Dropbox/teaching/r-course22/data/data_to_import_b.csv', delim = ",")

head(data_b)
## # A tibble: 6 x 18
##   id          gender   age income    p1    p2    p3    p4    p5    p6    p7    p8    p9   p10  task havemore haveless
##   <chr>        <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>    <dbl>
## 1 R_3PtNn51L…      2    26      7     1     1     1     1     1     1     1     1     1     1     0       NA       50
## 2 R_2AXrrg62…      2    32      4     1     1     1     1     1     1     1     1     1     1     0       NA       25
## 3 R_cwEOX3Hg…      1    25      2     0     1     1     1     1     1     1     1     0     0     0       NA       10
## 4 R_d59iPwL4…      1    33      5     1     1     1     1     1     1     1     1     1     1     0       NA       50
## 5 R_1f3K2HrG…      1    24      1     1     1     0     1     1     1     1     1     1     1     1       99       NA
## 6 R_3oN5ijzT…      1    22      2     1     1     0     0     1     1     1     1     0     1     0       NA       20
## # … with 1 more variable: pcmore <dbl>
data_c = read_csv2('~/Dropbox/teaching/r-course22/data/data_to_import_c.csv')
## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
## 
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   .default = col_character(),
##   age = col_double(),
##   Medu = col_double(),
##   Fedu = col_double(),
##   traveltime = col_double(),
##   studytime = col_double(),
##   failures = col_double(),
##   famrel = col_double(),
##   freetime = col_double(),
##   goout = col_double(),
##   Dalc = col_double(),
##   Walc = col_double(),
##   health = col_double(),
##   absences = col_double(),
##   G1 = col_double(),
##   G2 = col_double(),
##   G3 = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
head(data_c)
## # A tibble: 6 x 33
##   school sex     age address famsize Pstatus  Medu  Fedu Mjob     Fjob     reason guardian traveltime studytime failures
##   <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr>    <chr>    <chr>  <chr>         <dbl>     <dbl>    <dbl>
## 1 GP     F        18 U       GT3     A           4     4 at_home  teacher  course mother            2         2        0
## 2 GP     F        17 U       GT3     T           1     1 at_home  other    course father            1         2        0
## 3 GP     F        15 U       LE3     T           1     1 at_home  other    other  mother            1         2        0
## 4 GP     F        15 U       GT3     T           4     2 health   services home   mother            1         3        0
## 5 GP     F        16 U       GT3     T           3     3 other    other    home   father            1         2        0
## 6 GP     M        16 U       LE3     T           4     3 services other    reput… mother            1         2        0
## # … with 18 more variables: schoolsup <chr>, famsup <chr>, paid <chr>, activities <chr>, nursery <chr>,
## #   higher <chr>, internet <chr>, romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>, Walc <dbl>,
## #   health <dbl>, absences <dbl>, G1 <dbl>, G2 <dbl>, G3 <dbl>
library(readxl)

# maybe try first: install.packages("readxl")

data_d = read_excel('~/Dropbox/teaching/r-course22/data/data_to_import_d.xls')

head(data_d)
## # A tibble: 6 x 9
##    Year `Average population` `Live births` Deaths `Natural change` `Crude birth ra… `Crude death ra… `Natural change…
##   <dbl> <chr>                <chr>         <chr>  <chr>                       <dbl>            <dbl>            <dbl>
## 1  1900 3,300,000            94,316        63,606 30,710                       28.6             19.3              9.3
## 2  1901 3,341,000            97,028        60,018 37,010                       29               18               11.1
## 3  1902 3,384,000            96,480        57,702 38,778                       28.5             17.1             11.5
## 4  1903 3,428,000            93,824        59,626 34,198                       27.4             17.4             10  
## 5  1904 3,472,000            94,867        60,857 34,010                       27.3             17.5              9.8
## 6  1905 3,516,000            94,653        61,800 32,853                       26.9             17.6              9.3
## # … with 1 more variable: Total fertility rates <dbl>
library(haven)
# maybe first: install.packages("haven")

data_e = read_sav('~/Dropbox/teaching/r-course22/data/data_to_import_e.sav')

head(data_e)
## # A tibble: 6 x 54
##   case_ID         wave      year weight_wave weight_aggregate happening cause_original cause_other_text cause_recoded
##     <dbl>    <dbl+lbl> <dbl+lbl>       <dbl>            <dbl> <dbl+lbl>      <dbl+lbl> <chr>                <dbl+lbl>
## 1       2 1 [Nov 2008]  1 [2008]        0.54            0.294 3 [Yes]   1 [Caused mos… ""               6 [Caused mo…
## 2       3 1 [Nov 2008]  1 [2008]        0.85            0.463 2 [Don't… 1 [Caused mos… ""               6 [Caused mo…
## 3       5 1 [Nov 2008]  1 [2008]        0.49            0.267 2 [Don't… 2 [Caused mos… ""               4 [Caused mo…
## 4       6 1 [Nov 2008]  1 [2008]        0.29            0.158 3 [Yes]   2 [Caused mos… ""               4 [Caused mo…
## 5       7 1 [Nov 2008]  1 [2008]        1.29            0.702 3 [Yes]   1 [Caused mos… ""               6 [Caused mo…
## 6       8 1 [Nov 2008]  1 [2008]        2.56            1.39  2 [Don't… 2 [Caused mos… ""               4 [Caused mo…
## # … with 45 more variables: sci_consensus <dbl+lbl>, worry <dbl+lbl>, harm_personally <dbl+lbl>, harm_US <dbl+lbl>,
## #   harm_dev_countries <dbl+lbl>, harm_future_gen <dbl+lbl>, harm_plants_animals <dbl+lbl>, when_harm_US <dbl+lbl>,
## #   reg_CO2_pollutant <dbl+lbl>, reg_utilities <dbl+lbl>, fund_research <dbl+lbl>, reg_coal_emissions <dbl+lbl>,
## #   discuss_GW <dbl+lbl>, hear_GW_media <dbl+lbl>, gender <dbl+lbl>, age <dbl>, age_category <dbl+lbl>,
## #   generation <dbl+lbl>, educ <dbl+lbl>, educ_category <dbl+lbl>, income <dbl+lbl>, income_category <dbl+lbl>,
## #   race <dbl+lbl>, ideology <dbl+lbl>, party <dbl+lbl>, party_w_leaners <dbl+lbl>, party_x_ideo <dbl+lbl>,
## #   registered_voter <dbl+lbl>, region9 <dbl+lbl>, region4 <dbl+lbl>, religion <dbl+lbl>, …

3. File on a website: load and save

Let’s say we want to load some data into R directly from a website (without saving it to a file first). In this case, we get some data from the website “https://support.spatialkey.com/spatialkey-sample-csv-data/”. Instead of writing a local path, we can simply pass the web address to the same functions.

data_transactions = read_csv("https://support.spatialkey.com/wp-content/uploads/2021/02/Sacramentorealestatetransactions.csv")
## 
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   street = col_character(),
##   city = col_character(),
##   zip = col_double(),
##   state = col_character(),
##   beds = col_double(),
##   baths = col_double(),
##   sq__ft = col_double(),
##   type = col_character(),
##   sale_date = col_character(),
##   price = col_double(),
##   latitude = col_double(),
##   longitude = col_double()
## )
head(data_transactions)
## # A tibble: 6 x 12
##   street             city         zip state  beds baths sq__ft type        sale_date         price latitude longitude
##   <chr>              <chr>      <dbl> <chr> <dbl> <dbl>  <dbl> <chr>       <chr>             <dbl>    <dbl>     <dbl>
## 1 3526 HIGH ST       SACRAMENTO 95838 CA        2     1    836 Residential Wed May 21 00:00… 59222     38.6     -121.
## 2 51 OMAHA CT        SACRAMENTO 95823 CA        3     1   1167 Residential Wed May 21 00:00… 68212     38.5     -121.
## 3 2796 BRANCH ST     SACRAMENTO 95815 CA        2     1    796 Residential Wed May 21 00:00… 68880     38.6     -121.
## 4 2805 JANETTE WAY   SACRAMENTO 95815 CA        2     1    852 Residential Wed May 21 00:00… 69307     38.6     -121.
## 5 6001 MCMAHON DR    SACRAMENTO 95824 CA        2     1    797 Residential Wed May 21 00:00… 81900     38.5     -121.
## 6 5828 PEPPERMILL CT SACRAMENTO 95841 CA        3     1   1122 Condo       Wed May 21 00:00… 89921     38.7     -121.

If we want, we can then save it to a file from R, using a similar set of functions that start with write_ instead of read_. You can also use these functions to save your data in a different format than the original for later use.

# save it to file
write_csv(data_transactions, file = "~/Dropbox/teaching/r-course22/data/Sacramentorealestatetransactions.csv")
# load it again
data_transactions = read_csv("~/Dropbox/teaching/r-course22/data/Sacramentorealestatetransactions.csv")
## 
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   street = col_character(),
##   city = col_character(),
##   zip = col_double(),
##   state = col_character(),
##   beds = col_double(),
##   baths = col_double(),
##   sq__ft = col_double(),
##   type = col_character(),
##   sale_date = col_character(),
##   price = col_double(),
##   latitude = col_double(),
##   longitude = col_double()
## )
head(data_transactions)
## # A tibble: 6 x 12
##   street             city         zip state  beds baths sq__ft type        sale_date         price latitude longitude
##   <chr>              <chr>      <dbl> <chr> <dbl> <dbl>  <dbl> <chr>       <chr>             <dbl>    <dbl>     <dbl>
## 1 3526 HIGH ST       SACRAMENTO 95838 CA        2     1    836 Residential Wed May 21 00:00… 59222     38.6     -121.
## 2 51 OMAHA CT        SACRAMENTO 95823 CA        3     1   1167 Residential Wed May 21 00:00… 68212     38.5     -121.
## 3 2796 BRANCH ST     SACRAMENTO 95815 CA        2     1    796 Residential Wed May 21 00:00… 68880     38.6     -121.
## 4 2805 JANETTE WAY   SACRAMENTO 95815 CA        2     1    852 Residential Wed May 21 00:00… 69307     38.6     -121.
## 5 6001 MCMAHON DR    SACRAMENTO 95824 CA        2     1    797 Residential Wed May 21 00:00… 81900     38.5     -121.
## 6 5828 PEPPERMILL CT SACRAMENTO 95841 CA        3     1   1122 Condo       Wed May 21 00:00… 89921     38.7     -121.
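To illustrate saving in a different format (a sketch using a small made-up tibble rather than the course data), you could write a dataset out as tab-separated instead of comma-separated and read it back in:

```r
library(tidyverse)

# A small made-up tibble, just for illustration
small_data = tibble(id = c(1, 2, 3), score = c(3.5, 4.2, 2.8))

# Save it as tab-separated to a temporary location, then read it back
tmp = tempfile(fileext = '.tsv')
write_tsv(small_data, tmp)
small_data_back = read_tsv(tmp, col_types = cols())

all.equal(as.data.frame(small_data), as.data.frame(small_data_back))  # TRUE
```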

4. Get raw files from GitHub

Go to the data folder where I keep the datasets for the seminar: ‘https://github.com/laurafontanesi/r-seminar22/tree/main/data’.

Click on tdcs.csv.

To be able to load these data in R, we first need to get to the raw data.

You can get them by clicking on View raw. Note that for some files, instead of getting to the raw data page, you will directly download them to a local directory. From there, you can simply load them in R using the appropriate read_ function.

Copy the address of the page containing the raw data. It should be https://raw.githubusercontent.com/laurafontanesi/r-seminar22/main/data/tdcs.csv

You can now use this url with one of our read_ functions:

data_tdcs = read_csv("https://raw.githubusercontent.com/laurafontanesi/r-seminar22/main/data/tdcs.csv")
## 
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   RT = col_double(),
##   acc_spd = col_character(),
##   accuracy = col_double(),
##   angle = col_double(),
##   block = col_double(),
##   coherence = col_double(),
##   dataset = col_character(),
##   id = col_character(),
##   left_right = col_double(),
##   subj_idx = col_double(),
##   tdcs = col_character(),
##   trial_NR = col_double()
## )
head(data_tdcs)
## # A tibble: 6 x 12
##      RT acc_spd accuracy angle block coherence dataset  id    left_right subj_idx tdcs  trial_NR
##   <dbl> <chr>      <dbl> <dbl> <dbl>     <dbl> <chr>    <chr>      <dbl>    <dbl> <chr>    <dbl>
## 1   799 spd            1   180     1     0.417 berkeley S1.1           2        1 sham         1
## 2   613 spd            1   180     1     0.417 berkeley S1.1           2        1 sham         2
## 3   627 spd            1   180     1     0.417 berkeley S1.1           1        1 sham         3
## 4  1280 acc            0   180     1     0.417 berkeley S1.1           1        1 sham         4
## 5   800 spd            1   180     1     0.417 berkeley S1.1           2        1 sham         5
## 6   760 acc            1   180     1     0.417 berkeley S1.1           2        1 sham         6

5. Now it’s your turn

Task A

From the data folder on GitHub, get the data sets in the list below. Load them in R, assigning them the respective names: qualtrics_data, data_f, data_g, data_h. Inspect them using head() or glimpse(). Finally, save them to your local data directory (which you should have as a sub-directory of your R course directory) as csv files.

  1. 20180321_qualtrics_managers_historical_social_comparisons.dta
  2. data_to_import_f.csv
  3. data_to_import_g.csv
  4. data_to_import_h.csv

Task B

Go to this website: https://www.britishelectionstudy.com/data-objects/cross-sectional-data/ (you can register for free).

Download the 2017 Face-to-face Post-election Survey Version 1.5 SPSS file into your local data directory (see above). Then load it in R, assigning it to the name british_cross_sectional_data, using the appropriate function for SPSS files, and inspect it using head() or glimpse().

Submit your assignment

Save and email your script to me at laura.fontanesi@unibas.ch by the end of Friday.