March 5, 2019

Tidy New York: Wrangling Pedestrian Count Data

I’m a huge fan of open data. I am especially fascinated by data related to transportation and urbanism, anyhting that I can put on a map that would represent some part of how we move and live in cities. That is why I was super excited to find that New York City collects and shares data on pedestrian counts in more than 100 locations across the city. However, the CSV dataset I downloaded needed a lot of love and care -in the form of wrangling- before it reached tidy formatting that could allow me to easily explore it by plotting it in maps and charts. In this post, I will share this tidying process.

To start, I’ll load the needed packages:

library(tidyverse)
library(reshape2)
library(lubridate)

#These last two are needed for building the post itself, rather than loading and cleaning the data
library(knitr)
library(kableExtra)

Now, let’s load the data (which I downloaded from the NYC Open Data portal as a CSV):

pedcount <- read_csv("PedCount.csv")
Take a look at the first few rows of the dataset in its original form:
the_geom Loc OBJECTID Borough Street_Nam From_Stree To_Street Index May07_D May07_AM May07_PM May07_D2 May07_MD Sept07_D Sept07_AM Sept07_PM Sept07_D2 Sept07_MD May08_D May08_AM May08_PM May08_D2 May08_MD Sept08_D Sept08_AM Sept08_PM Sept08_D2 Sept08_MD May09_D May09_AM May09_PM May09_D2 May09_MD Sept09_D Sept09_AM Sept09_PM Sept09_D2 Sept09_MD May10_D May10_AM May10_PM May10_D2 May10_MD Sept10_D Sept10_AM Sept10_PM Sept10_D2 Sept10_MD May11_D May11_AM May11_PM May11_D2 May11_MD Sept11_D Sept11_AM Sept11_PM Sept11_D2 Sept11_MD May12_D May12_AM May12_PM May12_D2 May12_MD Sept12_D Sept12_AM Sept12_PM Sept12_D2 Sept12_MD May13_D May13_AM May13_PM May13_D2 May13_MD Sept13_D Sept13_AM Sept13_PM Sept13_D2 Sept13_MD May14_D May14_AM May14_PM May14_D2 May14_MD Sept14_D Sept14_AM Sept14_PM Sept14_D2 Sept14_MD May15_D May15_AM May15_PM May15_D2 May15_MD Sept15_D Sept15_AM Sept15_PM Sept15_D2 Sept15_MD May16_D May16_AM May16_PM May16_D2 May16_MD Sept16_D Sept16_AM Sept16_PM Sept16_D2 Sept16_MD May17_D May17_AM May17_PM May17_D3 May17_MD Sept17_D Sept17_AM Sept17_PM Sept17_D2 Sept17_MD
POINT (-73.90459140730678 40.87919896648571) 1 1 Bronx Broadway West 231st Street Naples Terrace N 5/10/2007 1189 4094 5/5/2007 2508 9/25/2007 734 2646 9/29/2007 2939 5/1/2008 802 4015 5/17/2008 2631 9/23/2008 1125 4310 10/4/2008 3420 5/5/2009 1001 3475 5/2/2009 2832 9/10/2009 991 4262 9/26/2009 2469 5/6/2010 1010 3609 5/1/2010 3128 9/8/2010 863 4119 9/11/2010 2217 5/10/2011 997 4440 4/30/2011 2687 9/8/2011 1328 3820 9/10/2011 2428 5/3/2012 1288 3328 6/2/2012 3365 9/12/2012 1268 4315 9/15/2012 2276 5/7/2013 1210 4710 5/18/2013 3825 10/1/2013 1206 4590 10/5/2013 3008 5/8/2014 1220 4384 5/17/2014 2641 9/4/2014 1450 4646 9/6/2014 2996 5/13/2015 1788 4980 5/16/2015 3033 9/17/2015 1204 4520 9/12/2015 2999 05/05/16 1246 4531 05/07/16 2686 09/27/16 1309 3642 09/10/16 2830 05/09/17 1916 5893 05/06/17 2776 09/12/17 1111 4044 09/23/17 2731
POINT (-73.92188432870218 40.82662794123289) 2 2 Bronx East 161st Street Grand Concourse Sheridan Avenue Y 5/10/2007 1511 3184 5/5/2007 1971 9/25/2007 1855 3754 9/29/2007 1183 5/6/2008 1136 2638 5/17/2008 1522 9/23/2008 1939 3283 10/4/2008 1383 5/5/2009 1351 3111 5/2/2009 1304 9/10/2009 1227 3137 9/26/2009 2762 5/6/2010 2077 3283 5/1/2010 1409 9/14/2010 1007 3069 9/11/2010 1477 5/5/2011 1734 3333 4/30/2011 1772 9/14/2011 2051 3525 9/10/2011 1752 5/3/2012 1233 1875 6/2/2012 1912 9/12/2012 2113 4099 9/22/2012 1970 5/1/2013 2278 4215 5/4/2013 2288 9/18/2013 2071 3890 9/21/2013 1832 5/8/2014 2206 4363 5/17/2014 2315 9/11/2014 1949 4435 9/13/2014 2388 5/19/2015 2318 4589 5/16/2015 2483 9/17/2015 2005 4790 9/26/2015 2512 05/10/16 2053 4721 05/14/16 2311 09/20/16 2109 5485 09/10/16 2548 05/09/17 1848 4920 05/06/17 2143 09/16/17 2389 5952 09/23/17 2832
POINT (-73.89535781584335 40.86215460031514) 3 3 Bronx East Fordham Road Valentine Avenue Tiebout Avenue Y 5/10/2007 1832 12311 5/5/2007 14391 9/25/2007 1829 9215 9/29/2007 15065 5/1/2008 1061 8013 5/17/2008 7927 9/23/2008 2497 10010 10/4/2008 8013 5/5/2009 1689 8374 5/2/2009 7472 9/10/2009 2387 10655 9/26/2009 6614 5/6/2010 2285 10943 5/1/2010 10424 9/8/2010 2807 12658 9/11/2010 7714 5/10/2011 2289 15890 4/30/2011 9167 9/8/2011 2679 13548 9/10/2011 8140 5/3/2012 2125 7635 6/9/2012 7022 9/11/2012 2704 10486 9/15/2012 7421 5/1/2013 2710 12723 5/4/2013 8872 10/1/2013 2651 10613 10/5/2013 8241 5/8/2014 2699 12490 5/17/2014 6339 9/4/2014 2752 10408 9/6/2014 8266 5/13/2015 2382 11216 5/16/2015 7574 9/17/2015 2733 10586 9/12/2015 9072 05/05/16 2540 11272 05/07/16 8422 09/27/16 2422 11145 09/10/16 9775 05/09/17 2557 12125 05/06/17 6499 09/12/17 2783 12388 09/23/17 7076

Now, there are a number of different ways that this table doesn’t really fit the format I need to explore the data in the way that I want. So let’s tackle them one by one.

To start, I know that the way I will map the datapoints will require me to have the latitude and longitude coordinates as separate numeric variables. So, I’ll separate the string that is stored as the_geom and then I’ll change the datatype to numeric.

pedcount <- pedcount %>%
  separate(the_geom, c("temp", "lon", "lat"), sep = " ") %>%
             select(-temp)
pedcount <- mutate(pedcount, lon = as.numeric(substr(pedcount$lon,2,nchar(pedcount$lon))),
         lat = as.numeric(substr(pedcount$lat,1,nchar(pedcount$lat)-1)))
Here’s what our table looks like now:
lon lat Loc OBJECTID Borough Street_Nam From_Stree To_Street Index May07_D May07_AM May07_PM May07_D2 May07_MD Sept07_D Sept07_AM Sept07_PM Sept07_D2 Sept07_MD May08_D May08_AM May08_PM May08_D2 May08_MD Sept08_D Sept08_AM Sept08_PM Sept08_D2 Sept08_MD May09_D May09_AM May09_PM May09_D2 May09_MD Sept09_D Sept09_AM Sept09_PM Sept09_D2 Sept09_MD May10_D May10_AM May10_PM May10_D2 May10_MD Sept10_D Sept10_AM Sept10_PM Sept10_D2 Sept10_MD May11_D May11_AM May11_PM May11_D2 May11_MD Sept11_D Sept11_AM Sept11_PM Sept11_D2 Sept11_MD May12_D May12_AM May12_PM May12_D2 May12_MD Sept12_D Sept12_AM Sept12_PM Sept12_D2 Sept12_MD May13_D May13_AM May13_PM May13_D2 May13_MD Sept13_D Sept13_AM Sept13_PM Sept13_D2 Sept13_MD May14_D May14_AM May14_PM May14_D2 May14_MD Sept14_D Sept14_AM Sept14_PM Sept14_D2 Sept14_MD May15_D May15_AM May15_PM May15_D2 May15_MD Sept15_D Sept15_AM Sept15_PM Sept15_D2 Sept15_MD May16_D May16_AM May16_PM May16_D2 May16_MD Sept16_D Sept16_AM Sept16_PM Sept16_D2 Sept16_MD May17_D May17_AM May17_PM May17_D3 May17_MD Sept17_D Sept17_AM Sept17_PM Sept17_D2 Sept17_MD
-73.90459 40.87920 1 1 Bronx Broadway West 231st Street Naples Terrace N 5/10/2007 1189 4094 5/5/2007 2508 9/25/2007 734 2646 9/29/2007 2939 5/1/2008 802 4015 5/17/2008 2631 9/23/2008 1125 4310 10/4/2008 3420 5/5/2009 1001 3475 5/2/2009 2832 9/10/2009 991 4262 9/26/2009 2469 5/6/2010 1010 3609 5/1/2010 3128 9/8/2010 863 4119 9/11/2010 2217 5/10/2011 997 4440 4/30/2011 2687 9/8/2011 1328 3820 9/10/2011 2428 5/3/2012 1288 3328 6/2/2012 3365 9/12/2012 1268 4315 9/15/2012 2276 5/7/2013 1210 4710 5/18/2013 3825 10/1/2013 1206 4590 10/5/2013 3008 5/8/2014 1220 4384 5/17/2014 2641 9/4/2014 1450 4646 9/6/2014 2996 5/13/2015 1788 4980 5/16/2015 3033 9/17/2015 1204 4520 9/12/2015 2999 05/05/16 1246 4531 05/07/16 2686 09/27/16 1309 3642 09/10/16 2830 05/09/17 1916 5893 05/06/17 2776 09/12/17 1111 4044 09/23/17 2731
-73.92188 40.82663 2 2 Bronx East 161st Street Grand Concourse Sheridan Avenue Y 5/10/2007 1511 3184 5/5/2007 1971 9/25/2007 1855 3754 9/29/2007 1183 5/6/2008 1136 2638 5/17/2008 1522 9/23/2008 1939 3283 10/4/2008 1383 5/5/2009 1351 3111 5/2/2009 1304 9/10/2009 1227 3137 9/26/2009 2762 5/6/2010 2077 3283 5/1/2010 1409 9/14/2010 1007 3069 9/11/2010 1477 5/5/2011 1734 3333 4/30/2011 1772 9/14/2011 2051 3525 9/10/2011 1752 5/3/2012 1233 1875 6/2/2012 1912 9/12/2012 2113 4099 9/22/2012 1970 5/1/2013 2278 4215 5/4/2013 2288 9/18/2013 2071 3890 9/21/2013 1832 5/8/2014 2206 4363 5/17/2014 2315 9/11/2014 1949 4435 9/13/2014 2388 5/19/2015 2318 4589 5/16/2015 2483 9/17/2015 2005 4790 9/26/2015 2512 05/10/16 2053 4721 05/14/16 2311 09/20/16 2109 5485 09/10/16 2548 05/09/17 1848 4920 05/06/17 2143 09/16/17 2389 5952 09/23/17 2832
-73.89536 40.86215 3 3 Bronx East Fordham Road Valentine Avenue Tiebout Avenue Y 5/10/2007 1832 12311 5/5/2007 14391 9/25/2007 1829 9215 9/29/2007 15065 5/1/2008 1061 8013 5/17/2008 7927 9/23/2008 2497 10010 10/4/2008 8013 5/5/2009 1689 8374 5/2/2009 7472 9/10/2009 2387 10655 9/26/2009 6614 5/6/2010 2285 10943 5/1/2010 10424 9/8/2010 2807 12658 9/11/2010 7714 5/10/2011 2289 15890 4/30/2011 9167 9/8/2011 2679 13548 9/10/2011 8140 5/3/2012 2125 7635 6/9/2012 7022 9/11/2012 2704 10486 9/15/2012 7421 5/1/2013 2710 12723 5/4/2013 8872 10/1/2013 2651 10613 10/5/2013 8241 5/8/2014 2699 12490 5/17/2014 6339 9/4/2014 2752 10408 9/6/2014 8266 5/13/2015 2382 11216 5/16/2015 7574 9/17/2015 2733 10586 9/12/2015 9072 05/05/16 2540 11272 05/07/16 8422 09/27/16 2422 11145 09/10/16 9775 05/09/17 2557 12125 05/06/17 6499 09/12/17 2783 12388 09/23/17 7076

We are ready now to deal with a much thornier issue: several different values about dates and types of measurement are stored as column names, which makes it harder to deal with the data. These columns are named with the first letters of the month (May or Sept), an underscore, and either D or D2, where D is used for weekday measurements and D2 for weekend. We’ll use the gather() function to create a different row for each observation and keep in a new variable the values that are currently as column names.

pedcount <- pedcount %>%
  gather(key = "date_coding", value = "date", contains("_D")) %>%
  select(lon, lat, Loc, OBJECTID, Borough, Street_Nam, From_Stree, To_Street, Index, date_coding, date, everything()) %>%
    gather(key = "measurement_coding", value = "ped_count", 12:77)
Let’s take a look again:
lon lat Loc OBJECTID Borough Street_Nam From_Stree To_Street Index date_coding date measurement_coding ped_count
-73.90459 40.87920 1 1 Bronx Broadway West 231st Street Naples Terrace N May07_D 5/10/2007 May07_AM 1189
-73.92188 40.82663 2 2 Bronx East 161st Street Grand Concourse Sheridan Avenue Y May07_D 5/10/2007 May07_AM 1511
-73.89536 40.86215 3 3 Bronx East Fordham Road Valentine Avenue Tiebout Avenue Y May07_D 5/10/2007 May07_AM 1832

These changes help, but it’s still pretty confusing… How do I make good use of my new date_coding and measurement_coding variables? Also, I now have a ton of duplicates, because I really only need one combination of Location, date and pedcount, but how do I know which one?

If I look at the table, I can identify some extra rows that I know are redundant by looking at contradictions in the date information that date_coding and the actual date for that row. So, to get rid of these I need to be able to identify matches and filter by them.

For example, I should only keep rows where the the first six characters of date_coding match the first six characters of measurement_coding.

pedcount <- pedcount %>%
  filter(str_sub(date_coding, 1, 6) == str_sub(measurement_coding, 1, 6))
Now our table looks like this:
lon lat Loc OBJECTID Borough Street_Nam From_Stree To_Street Index date_coding date measurement_coding ped_count
-73.90459 40.87920 1 1 Bronx Broadway West 231st Street Naples Terrace N May07_D 5/10/2007 May07_AM 1189
-73.92188 40.82663 2 2 Bronx East 161st Street Grand Concourse Sheridan Avenue Y May07_D 5/10/2007 May07_AM 1511
-73.89536 40.86215 3 3 Bronx East Fordham Road Valentine Avenue Tiebout Avenue Y May07_D 5/10/2007 May07_AM 1832

YAY, we got rid of most of the duplicates. However, each value is still in two different rows, only one of which has properly matching date and measurement coding. We shouldn’t have two ‘May07_AM’ rows for the same location with he same value; the reason we do is that the dates are different, and one is a weekday measurement and the other a weekend measurement. How do we know which is the one to keep? Well, weekdays have one AM and one PM measurement, and weekends have only one measurement, coded MD. So, in this case, we need to keep the row where date_coding is _D. In general, we need to keep all rows where _D in date_coding meets _AM OR _PM in measurement_coding and all rows where _D2 in date_coding matches _MD in measurement coding.

pedcount <- pedcount %>%
  filter((str_sub(date_coding, -1, -1) == "2" & str_sub(measurement_coding, -1, -1) == "D") | (str_sub(date_coding, -1, -1) == "D" & str_sub(measurement_coding, -1, -1) == "M"))

Take a look:

lon lat Loc OBJECTID Borough Street_Nam From_Stree To_Street Index date_coding date measurement_coding ped_count
-73.90459 40.87920 1 1 Bronx Broadway West 231st Street Naples Terrace N May07_D 5/10/2007 May07_AM 1189
-73.92188 40.82663 2 2 Bronx East 161st Street Grand Concourse Sheridan Avenue Y May07_D 5/10/2007 May07_AM 1511
-73.89536 40.86215 3 3 Bronx East Fordham Road Valentine Avenue Tiebout Avenue Y May07_D 5/10/2007 May07_AM 1832

Done! Now we need to extract from date_coding and measurement_coding the information that is relevant and not redundant. We’ll do this by recoding the variables to a human-readable, plain English ‘translation’.

pedcount <- pedcount %>%
  mutate(
   type_of_measurement = case_when(
     str_sub(measurement_coding, -1, -1) == "D" ~ "Weekend Noon",
     str_sub(measurement_coding, -2, -1) == "AM" ~ "Weekday Morning",
     str_sub(measurement_coding, -2, -1) == "PM" ~ "Weekday Evening"),
   season = case_when(
     str_sub(date_coding, 1, 3) == "May" ~ "Spring",
     str_sub(date_coding, 1, 3) == "Sep" ~ "Fall"))

We’re getting really close now!

lon lat Loc OBJECTID Borough Street_Nam From_Stree To_Street Index date_coding date measurement_coding ped_count type_of_measurement season
-73.90459 40.87920 1 1 Bronx Broadway West 231st Street Naples Terrace N May07_D 5/10/2007 May07_AM 1189 Weekday Morning Spring
-73.92188 40.82663 2 2 Bronx East 161st Street Grand Concourse Sheridan Avenue Y May07_D 5/10/2007 May07_AM 1511 Weekday Morning Spring
-73.89536 40.86215 3 3 Bronx East Fordham Road Valentine Avenue Tiebout Avenue Y May07_D 5/10/2007 May07_AM 1832 Weekday Morning Spring

However, if we take a closer look we’ll notice that both the pedcount and the date variables need some clean-up:

class(pedcount$ped_count)
## [1] "character"
pedcount$ped_count <- as.numeric(pedcount$ped_count)
class(pedcount$ped_count)
## [1] "numeric"
class(pedcount$date)
## [1] "character"
pedcount$date <- mdy(pedcount$date)
class(pedcount$date)
## [1] "Date"

Finally, we drop the redundant ‘coding’ columns and we are all set.

pedcount <- select(pedcount, -c('measurement_coding', 'date_coding'))

There we have it! Our final tidy version of the original dataset:

lon lat Loc OBJECTID Borough Street_Nam From_Stree To_Street Index date ped_count type_of_measurement season
-73.90459 40.87920 1 1 Bronx Broadway West 231st Street Naples Terrace N 2007-05-10 1189 Weekday Morning Spring
-73.92188 40.82663 2 2 Bronx East 161st Street Grand Concourse Sheridan Avenue Y 2007-05-10 1511 Weekday Morning Spring
-73.89536 40.86215 3 3 Bronx East Fordham Road Valentine Avenue Tiebout Avenue Y 2007-05-10 1832 Weekday Morning Spring

We are ready now to use this data to map and explore pedestrian counts in New York City over time. But that’s a topic for an entirely different post. Stay tuned!

© Alejandra Gerosa 2019

Powered by Hugo & Kiss.