Day 3 Lab Data Wrangling and Viz

```{r}
library(tidyverse)
```
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

NYC Citi Bikes Dataset

The dataset can be found in the data folder. However, take a few minutes to visit the Citi-Bikes website (https://citibikenyc.com/system-data) to learn about their data.

```{r}
citi <- read_csv("data/nyc_bikes.csv")
```
Rows: 16843 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Start Time, Stop Time, Start Station Name, End Station Name, User T...
dbl (8): Start Station ID, End Station ID, Bike ID, Birth Year, Age, Trip Du...
num (1): Trip_Duration_in_min

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Examining our data

Question 1

Begin by reading in the data. Use glimpse() to get a preview of the data and view the columns.

```{r}
glimpse(citi)
```
Rows: 16,843
Columns: 17
$ `Start Time`         <chr> "10-3-17 17:39", "16-01-17 16:36", "17-02-17 22:0…
$ `Stop Time`          <chr> "13-03-17 08:12", "18-1-17 08:58", "19-2-17 14:27…
$ `Start Station ID`   <dbl> 3185, 3203, 3267, 3202, 3186, 3203, 3192, 3269, 3…
$ `Start Station Name` <chr> "City Hall", "Hamilton Park", "Morris Canal", "Ne…
$ `End Station ID`     <dbl> 3277, 3213, 3183, 3185, 3192, 3205, 3275, 3273, 3…
$ `End Station Name`   <chr> "Communipaw & Berry Lane", "Van Vorst Park", "Exc…
$ `Bike ID`            <dbl> 24462, 24716, 24418, 24495, 26170, 26254, 26250, …
$ `User Type`          <chr> "Subscriber", "Subscriber", "Subscriber", "Subscr…
$ `Birth Year`         <dbl> 1995, 1983, 1941, 1970, 1957, 1989, 1984, 1984, 1…
$ Age                  <dbl> 26, 38, 80, 51, 64, 32, 37, 37, 37, 40, 37, 36, 2…
$ `Age Groups`         <chr> "25-34", "35-44", "75+", "45-54", "55-64", "25-34…
$ `Trip Duration`      <dbl> 221604, 145342, 145296, 90394, 89479, 85624, 8461…
$ Trip_Duration_in_min <dbl> 3693, 2422, 2422, 1507, 1491, 1427, 1410, 1331, 1…
$ Month                <dbl> 3, 1, 2, 1, 3, 1, 2, 2, 3, 2, 3, 2, 2, 3, 1, 3, 3…
$ Season               <chr> "Spring", "Winter", "Winter", "Winter", "Spring",…
$ Temperature          <dbl> 19, 14, 15, 14, 16, 14, 11, 15, 16, 11, 19, 11, 1…
$ Weekday              <chr> "Friday", "Monday", "Friday", "Sunday", "Monday",…

Clean the names so that they are consistent with tidyverse conventions.

```{r}
citi <- janitor::clean_names(citi)
```

How many rows and columns does the dataset have? What do they represent?

Question 2

Let’s now learn something about the data.

  1. We have not learned much about Date manipulation in R. However, in this dataset, there should be enough information to figure out what time period the dataset covers. Can you tell which time period this dataset covers?

    The data spans from January of 2017 to March of 2017

```{r}
citi |>
  summarize(
    min_time = min(start_time),
    max_time = max(stop_time),
  )
```
# A tibble: 1 × 2
  min_time       max_time    
  <chr>          <chr>       
1 01-01-17 00:38 9-3-17 21:30
  1. What are the most popular pick-up locations across the city for NY Citi Bike rental during this time period? What is the post popular drop-off location?
```{r}
# two possible solutions
citi |>
  group_by(start_station_name) |> 
  summarize(number_of_trips = n()) |>
  ungroup() |>
  arrange(desc(number_of_trips))


citi |>
  group_by(start_station_name) |> 
  summarize(number_of_trips = n()) |>
  arrange(desc(number_of_trips),.by_group = T)
```
# A tibble: 50 × 2
   start_station_name number_of_trips
   <chr>                        <int>
 1 Grove St PATH                 2115
 2 Exchange Place                1224
 3 Sip Ave                       1084
 4 Hamilton Park                 1069
 5 Morris Canal                   710
 6 Newport PATH                   651
 7 City Hall                      576
 8 Van Vorst Park                 530
 9 Newark Ave                     510
10 Warren St                      481
# ℹ 40 more rows
# A tibble: 50 × 2
   start_station_name number_of_trips
   <chr>                        <int>
 1 Grove St PATH                 2115
 2 Exchange Place                1224
 3 Sip Ave                       1084
 4 Hamilton Park                 1069
 5 Morris Canal                   710
 6 Newport PATH                   651
 7 City Hall                      576
 8 Van Vorst Park                 530
 9 Newark Ave                     510
10 Warren St                      481
# ℹ 40 more rows
```{r}
citi |>
  group_by(end_station_name) |> 
  summarize(number_of_trips = n()) |>
  arrange(desc(number_of_trips),.by_group = T)
```
# A tibble: 55 × 2
   end_station_name number_of_trips
   <chr>                      <int>
 1 Grove St PATH               2743
 2 Exchange Place              1254
 3 Sip Ave                      955
 4 Hamilton Park                949
 5 Newport PATH                 683
 6 City Hall                    672
 7 Morris Canal                 591
 8 Warren St                    519
 9 Newark Ave                   505
10 Paulus Hook                  469
# ℹ 45 more rows
  1. Which age group rents the most bikes?
```{r}
count(citi, age_groups,sort=T)
```
# A tibble: 7 × 2
  age_groups     n
  <chr>      <int>
1 35-44       7697
2 25-34       4002
3 45-54       2973
4 55-64       1447
5 65-74        615
6 75+           55
7 18-24         54
  1. What is the longest trip recorder? Where did this user start from and end to?
```{r}
citi |> 
  slice_max(trip_duration, n=5) |>
  dplyr::select(start_station_name, end_station_name,trip_duration_in_min) |>
  mutate(trip_duration_in_hours = trip_duration_in_min/60)
```
# A tibble: 5 × 4
  start_station_name end_station_name        trip_duration_in_min
  <chr>              <chr>                                  <dbl>
1 City Hall          Communipaw & Berry Lane                 3693
2 Hamilton Park      Van Vorst Park                          2422
3 Morris Canal       Exchange Place                          2422
4 Newport PATH       City Hall                               1507
5 Grove St PATH      Liberty Light Rail                      1491
# ℹ 1 more variable: trip_duration_in_hours <dbl>

Question 3

Let’s learn some information about this data. a. What are the most popular pick-up locations across the city for NY Citi Bike rental? (see before)

  1. Does this change between depending on whether the trip happens on a weekday or a weekend?

Hint: to figure this out, you will first need to create a new column using mutate(). This will need to classify a day in weekday or weekend.

```{r}
citi |>
  mutate(is_weekend = ifelse(weekday %in% c("Saturday","Sunday"),T,F)) |>
  group_by(is_weekend, start_station_name) |> 
  summarize(number_of_trips = n()) |>
  slice_max(number_of_trips,n=1) 
```
`summarise()` has grouped output by 'is_weekend'. You can override using the
`.groups` argument.
# A tibble: 2 × 3
# Groups:   is_weekend [2]
  is_weekend start_station_name number_of_trips
  <lgl>      <chr>                        <int>
1 FALSE      Grove St PATH                 1760
2 TRUE       Grove St PATH                  355

The station does not change, but the number of trips does significantly!!

Question 4

Let’s look only at trips that happen in the weekend. How does the average trip duration vary across different age groups in these days?

Do this using the filter(), group_by(), and summarise() functions.

```{r}
citi |>
  filter((weekday %in% c("Saturday","Sunday")))

# alternatively
citi |>
  mutate(is_weekend = ifelse(weekday %in% c("Saturday","Sunday"),1,0)) |>
  filter(is_weekend == T) |>
  group_by(age_groups) |>
  summarize(avg_trip_length = mean(trip_duration_in_min))
```
# A tibble: 3,155 × 17
   start_time     stop_time   start_station_id start_station_name end_station_id
   <chr>          <chr>                  <dbl> <chr>                       <dbl>
 1 15-01-17 17:20 16-01-17 1…             3202 Newport PATH                 3185
 2 15-1-17 11:39  16-01-17 1…             3203 Hamilton Park                3205
 3 25-02-17 13:14 26-2-17 12…             3192 Liberty Light Rail           3275
 4 25-03-17 14:18 26-03-17 1…             3203 Hamilton Park                3206
 5 25-02-17 16:53 26-2-17 12…             3185 City Hall                    3278
 6 25-02-17 15:32 26-2-17 04…             3209 Brunswick St                 3215
 7 14-01-17 11:21 14-01-17 1…             3199 Newport Pkwy                 3199
 8 25-02-17 15:03 25-2-17 20…             3186 Grove St PATH                3273
 9 25-03-17 17:48 25-03-17 2…             3203 Hamilton Park                3196
10 25-02-17 14:34 25-02-17 1…             3192 Liberty Light Rail           3192
# ℹ 3,145 more rows
# ℹ 12 more variables: end_station_name <chr>, bike_id <dbl>, user_type <chr>,
#   birth_year <dbl>, age <dbl>, age_groups <chr>, trip_duration <dbl>,
#   trip_duration_in_min <dbl>, month <dbl>, season <chr>, temperature <dbl>,
#   weekday <chr>
# A tibble: 7 × 2
  age_groups avg_trip_length
  <chr>                <dbl>
1 18-24                 8.73
2 25-34                10.8 
3 35-44                12.9 
4 45-54                13.0 
5 55-64                 8.56
6 65-74                 8.34
7 75+                   4.7 

In the weekdays :

```{r}
citi |>
  mutate(is_weekend = ifelse(weekday %in% c("Saturday","Sunday"),1,0)) |>
  filter(is_weekend == F) |>
  group_by(age_groups) |>
  summarize(avg_trip_length = mean(trip_duration_in_min))
```
# A tibble: 7 × 2
  age_groups avg_trip_length
  <chr>                <dbl>
1 18-24                12.8 
2 25-34                 8.71
3 35-44                 8.56
4 45-54                 7.34
5 55-64                 9.57
6 65-74                 7.25
7 75+                  58.7 
Trouble with filter

There are some other packages with functions named filter() or select() and if those packages are loaded most recently then sometimes you can get problems. In the future when you are doing data cleaning if filter() or select() are not working but you are confident you have called them correctly, check to make sure tidyverse is your most recently loaded package.

Question 5

For the most popular start location, find what are the 3 top drop off station. For each one of them, find the average length of the trip between the start and the end location.

```{r}
top_pickup <- citi |> group_by(start_station_name) |> summarize(number_of_trips =n()) |> slice_max(number_of_trips,n=1) |> pull(start_station_name)

citi |>
  filter(start_station_name == top_pickup) |>
  group_by(end_station_name) |>
  summarize(n_trips = n(), avg_trip_duration_in_min = mean(trip_duration)) |>
  slice_max(n_trips, n=3)
```
# A tibble: 3 × 3
  end_station_name n_trips avg_trip_duration_in_min
  <chr>              <int>                    <dbl>
1 Hamilton Park        273                     327.
2 Dixon Mills          202                     235.
3 Brunswick St         182                     290.

Data Visualization

Using the skills you have learned during lecture, try making plots that answer the following questions.

Question 6

How does bike rental vary across the two user groups (one-time users vs subscribers) on different days of the week?

```{r}
citi |>
  group_by(user_type,weekday) |>
  summarize(n_trips = n()) |>
  ggplot() +
  geom_col(aes(y=n_trips,x = weekday,fill=user_type),position="dodge") 
```
`summarise()` has grouped output by 'user_type'. You can override using the
`.groups` argument.

What’s wrong with this graph and how can we improve it?

```{r}
citi |>
  group_by(user_type,weekday) |>
  summarize(n_trips = n()) |>
  mutate(weekday = factor(weekday, levels=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))) |>
  ggplot() +
  geom_col(aes(y=n_trips,x = weekday,fill=user_type),position = "dodge") + 
  ylab("Number of Trips") + 
  xlab("") +
  theme_bw() +
  scale_fill_brewer(palette = "Set2",name = "User Type")
```
`summarise()` has grouped output by 'user_type'. You can override using the
`.groups` argument.

Question 7

Filter only the trips that last less than 30 minutes. Create a visualization to see whether the rider’s age is correlated with the duration trip (in minutes). Look at the same information for trips that happen during the weekend and during the weekday.

[As always, there is multiple possible solutions to this question. Here, I look at a scatterplot between the two variables. age and trip_duration_in_min.]

```{r}
citi |> 
  mutate(is_weekend = ifelse(weekday %in% c("Saturday","Sunday"),"Yes","No")) |>
  filter(trip_duration_in_min<30) |>
  ggplot() +
  geom_point(aes(x=age,y = trip_duration_in_min,color=is_weekend)) +
  ylab("Tripe Duration (min)") + 
  xlab("Age of rider") + 
  ggtitle("Age vs Trip Duration",subtitle ="Citi Bike Trips Jan-Mar 2017" ) + 
  scale_color_brewer(palette = "Set2",name = "Weekend?") +
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5),plot.subtitle = element_text(hjust=0.5)) 
```

Question 9 (Extra exercises for practice)

  1. Create a bar plot illustrating the number of trips that start at each of the 10 most popular pick up locations. Make sure to include titles, labels, and to make the colors clear.
```{r}
citi |>
  group_by(start_station_name) |> 
  summarize(number_of_trips = n()) |>
  slice_max(number_of_trips,n=10) |>
  mutate(start_station_name = factor(start_station_name)) |>
  ggplot() + 
  geom_col(aes(x=reorder(start_station_name,-number_of_trips),y=number_of_trips),col="black",fill="gray") +
  xlab("") +
  ylab("Number of Trips") + 
  ggtitle("Top 10 Pick-Up Stations") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 25, hjust=1)) 
```

  1. Earlier, you found the average trip length from the original station to the top three drop off - now we are interested instead in visualizing the distribution of these trip lengths. As always, there are many ways you can represent this! Try out of few, with boxplots, histograms, or density plots (geom_density) Note , you can filter out outliers if it will make your plot more understandable.
```{r}
top_three_dropoffs<- citi |>
  filter(start_station_name == top_pickup) |>
  group_by(end_station_name) |>
  summarize(n_trips = n(), avg_trip_duration_in_min = mean(trip_duration)) |>
  slice_max(n_trips, n=3) |>
  pull(end_station_name)

citi |>
  filter(start_station_name == top_pickup, end_station_name %in% top_three_dropoffs) |>
  filter(trip_duration_in_min < 30) |>
  ggplot()+
  geom_histogram(aes(trip_duration_in_min),fill="gray",col="black",bins=10) + 
  facet_wrap(~end_station_name) + 
  theme_bw() + xlab("Trip Duration Time") + 
  labs(title="Trip duration times from Grove Street to:",caption = "Trips lasting less than 30 minutes only")
```