2.8 How to read and wrangle data
The following example shows how to use the tidyverse to read in data (with the readr package) and easily manipulate it (using the dplyr and lubridate packages). We will walk through these steps during our meeting.
library(tidyverse)
library(lubridate)

url <- "http://bit.ly/raw-train-data-csv"

all_stations <-
  # Step 1: Read in the data.
  readr::read_csv(url) %>%
  # Step 2: filter columns and rename stationname
  dplyr::select(station = stationname, date, rides) %>%
  # Step 3: Convert the character date field to a date encoding.
  # Also, put the data in units of 1K rides
  dplyr::mutate(date = lubridate::mdy(date), rides = rides / 1000) %>%
  # Step 4: Summarize the multiple records using the maximum.
  dplyr::group_by(date, station) %>%
  dplyr::summarize(rides = max(rides), .groups = "drop")

head(all_stations, 10)
## # A tibble: 10 × 3
##    date       station             rides
##    <date>     <chr>               <dbl>
##  1 2001-01-01 18th                0
##  2 2001-01-01 35-Bronzeville-IIT  0.448
##  3 2001-01-01 35th/Archer         0.318
##  4 2001-01-01 43rd                0.211
##  5 2001-01-01 47th-Dan Ryan       0.787
##  6 2001-01-01 47th-South Elevated 0.427
##  7 2001-01-01 51st                0.364
##  8 2001-01-01 54th/Cermak         0
##  9 2001-01-01 63rd-Dan Ryan       1.37
## 10 2001-01-01 69th                2.37
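
To try Steps 2 through 4 without a network connection, the same verbs can be run on a small hand-made table. This is only a sketch: the `toy` data below is invented for illustration (including the `daytype` column and all ride counts), and it assumes, as the use of `mdy()` above suggests, that the raw dates are month/day/year strings. One station/date pair is deliberately duplicated so the effect of summarizing with the maximum is visible.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: same column names the pipeline expects,
# with "43rd" recorded twice on 01/01/2001.
toy <- tibble::tribble(
  ~stationname, ~date,        ~daytype, ~rides,
  "18th",       "01/01/2001", "U",      0,
  "43rd",       "01/01/2001", "U",      211,
  "43rd",       "01/01/2001", "U",      200,
  "43rd",       "01/02/2001", "W",      850
)

toy_stations <-
  toy %>%
  # Keep three columns and rename stationname to station.
  dplyr::select(station = stationname, date, rides) %>%
  # Parse the character dates and rescale rides to thousands.
  dplyr::mutate(date = lubridate::mdy(date), rides = rides / 1000) %>%
  # Collapse duplicate station/date records, keeping the largest value.
  dplyr::group_by(date, station) %>%
  dplyr::summarize(rides = max(rides), .groups = "drop")

toy_stations
```

The four input rows collapse to three, because the duplicated record for "43rd" is reduced to its larger value. Setting `.groups = "drop"` returns an ungrouped tibble, so later steps do not inherit the grouping by accident.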
“This pipeline of operations illustrates why the tidyverse is popular. A series of data manipulations is used that have simple and easy to understand user interfaces; the series is bundled together in a streamlined and readable way. The focus is on how the user interacts with the software. This approach enables more people to learn R and achieve their analysis goals, and adopting these same principles for modeling in R has the same benefits.”
- Max Kuhn and Julia Silge in Tidy Modeling with R