2.8 How to read and wrangle data

The following example uses the tidyverse to read in data (with the readr package) and then manipulate it (with the dplyr and lubridate packages). We will walk through these steps during our meeting.

library(tidyverse)
library(lubridate)

url <- "http://bit.ly/raw-train-data-csv"

all_stations <- 
  # Step 1: Read in the data.
  readr::read_csv(url) %>% 
  # Step 2: Select the columns to keep and rename stationname to station.
  dplyr::select(station = stationname, date, rides) %>% 
  # Step 3: Convert the character date field to a date encoding.
  # Also, put the data in units of 1K rides
  dplyr::mutate(date = lubridate::mdy(date), rides = rides / 1000) %>% 
  # Step 4: Summarize the multiple records using the maximum.
  dplyr::group_by(date, station) %>% 
  dplyr::summarize(rides = max(rides), .groups = "drop")
head(all_stations, 10)
## # A tibble: 10 × 3
##    date       station             rides
##    <date>     <chr>               <dbl>
##  1 2001-01-01 18th                0    
##  2 2001-01-01 35-Bronzeville-IIT  0.448
##  3 2001-01-01 35th/Archer         0.318
##  4 2001-01-01 43rd                0.211
##  5 2001-01-01 47th-Dan Ryan       0.787
##  6 2001-01-01 47th-South Elevated 0.427
##  7 2001-01-01 51st                0.364
##  8 2001-01-01 54th/Cermak         0    
##  9 2001-01-01 63rd-Dan Ryan       1.37 
## 10 2001-01-01 69th                2.37
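
To try the same steps without downloading anything, here is a minimal sketch on a small made-up data set. The values in `raw` and the names `raw`, `station_id`, and `toy_stations` are assumptions for illustration only; the column names `stationname`, `date`, and `rides` mirror the ones used in the pipeline above.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data (invented values), with dates stored as
# month/day/year character strings and one duplicated station-day.
raw <- tibble(
  station_id  = c(1, 1, 2, 2),
  stationname = c("18th", "18th", "43rd", "43rd"),
  date        = c("01/01/2001", "01/01/2001", "01/01/2001", "01/02/2001"),
  rides       = c(500, 750, 211, 300)
)

toy_stations <- raw %>%
  # Keep three columns; station_id is dropped and stationname is renamed.
  select(station = stationname, date, rides) %>%
  # mdy() parses month/day/year strings into Date values;
  # dividing by 1000 puts rides in units of 1K.
  mutate(date = mdy(date), rides = rides / 1000) %>%
  # Collapse duplicate station-day records, keeping the largest value.
  group_by(date, station) %>%
  summarize(rides = max(rides), .groups = "drop")

toy_stations
```

The two records for 18th on the same day collapse into a single row holding the larger value, so the result has three rows instead of four. Setting `.groups = "drop"` returns an ungrouped tibble, which avoids carrying grouping into later steps by accident.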

“This pipeline of operations illustrates why the tidyverse is popular. A series of data manipulations is used that have simple and easy to understand user interfaces; the series is bundled together in a streamlined and readable way. The focus is on how the user interacts with the software. This approach enables more people to learn R and achieve their analysis goals, and adopting these same principles for modeling in R has the same benefits.”

- Max Kuhn and Julia Silge in Tidy Modeling with R