A report describing our process for this project can be found here.

Data Overview

The data downloaded from the TLC contained over a million observations for each service dataset, so we decided to limit the amount of data in our analysis. First, we filtered the data to include only taxis and for-hire vehicles going into Manhattan. Our reasoning was that these trips would reflect the general habits of New Yorkers, since most of the main attractions of New York City are in Manhattan.

We also sampled the data to limit the number of observations for each type of service. We used a ten percent sample for both yellow taxis and for-hire vehicles, and a twenty percent sample for green taxis, since less green taxi data was available. We then merged these samples into one dataset representing all of Valentine’s Day.
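The filter, sample, and merge steps described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the project: the field name `dropoff_borough`, the service labels `yellow`, `fhv`, and `green`, and the function name `build_sample` are all assumptions made for the example. Only the sampling fractions come from the text.

```python
import random

# Sampling fractions from the text; the service keys are assumed labels.
SAMPLE_FRACTIONS = {"yellow": 0.10, "fhv": 0.10, "green": 0.20}


def build_sample(trips_by_service, seed=0):
    """Filter each service to Manhattan drop-offs, sample it, and merge."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    merged = []
    for service, trips in trips_by_service.items():
        # Keep only rides ending in Manhattan ("dropoff_borough" is assumed).
        into_manhattan = [t for t in trips if t["dropoff_borough"] == "Manhattan"]
        # Sample without replacement at this service's fraction.
        k = round(len(into_manhattan) * SAMPLE_FRACTIONS[service])
        for trip in rng.sample(into_manhattan, k):
            # Tag each ride with its service before merging.
            merged.append({**trip, "service": service})
    return merged
```

Fixing the seed keeps the sample reproducible, so later plots and counts can be regenerated from the same subset of rides.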

In an exploratory analysis of this dataset, we found that few rides came from Staten Island into Manhattan. We therefore focused our analysis on the other four boroughs, since there was not enough Staten Island data to support meaningful interpretations.
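A check like the one described above might look like the following sketch, which counts rides by pickup borough and then excludes a borough with too few rides. The field name `pickup_borough` and both function names are hypothetical, chosen only for illustration.

```python
from collections import Counter


def rides_by_pickup_borough(trips):
    """Count rides per pickup borough ("pickup_borough" is an assumed field)."""
    return Counter(t["pickup_borough"] for t in trips)


def drop_borough(trips, borough="Staten Island"):
    """Return the trips whose pickup borough is not the excluded one."""
    return [t for t in trips if t["pickup_borough"] != borough]
```

Looking at the counts first makes the decision to exclude a borough explicit, rather than dropping it silently during cleaning.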

With the cleaned data, we made graphical representations, such as heatmaps and time distribution graphs, and drew inferences about congestion, to help illustrate where people are going and how far they are willing to travel for a night out on Valentine’s Day.
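As one example of the data behind a time distribution graph, the sketch below tallies rides by pickup hour. It assumes each ride carries a `pickup_datetime` field holding an ISO-style timestamp string; that field name and the function name `rides_per_hour` are hypothetical, and the plotting step itself is left out.

```python
from collections import Counter
from datetime import datetime


def rides_per_hour(trips):
    """Tally rides by pickup hour (0-23); "pickup_datetime" is an assumed field."""
    hours = Counter(
        datetime.fromisoformat(t["pickup_datetime"]).hour for t in trips
    )
    # Return a fixed-length list so hours with no rides still appear as zero.
    return [hours.get(h, 0) for h in range(24)]
```

The resulting 24-element list can be passed directly to a bar or line chart with the hour of day on the horizontal axis.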