Motivation

We wanted to identify which variables were significant predictors of fare, with the final goal of building a taxi fare estimator that takes values for these key variables and produces an estimated fare for users, together with a 95% prediction interval.

Exploratory graphs

First, we did some exploratory analysis to give a brief overview of the outcome variable, fare, and how it was distributed across boroughs, neighborhoods, taxi types, and times of day.

We can see that, on Valentine’s Day, total fares were highest during the morning rush (6am-9am), the evening rush (4pm-6pm), and dinner time (6pm-9pm), then tapered off a little at night (after 9pm). These high aggregate fares suggest that people either traveled in high volumes to drop-off locations in Manhattan or took longer trips from other boroughs into Manhattan during these hours. Furthermore, yellow taxis accounted for most rides, which suggests that these trips took place mostly below East 96th Street and West 110th Street.

You might also be interested in the neighborhoods in Manhattan with the highest average taxi fares, which suggests they are popular (or maybe they’re just far from downtown!). If so, you can check them out in the Shiny app!

The distribution of the outcome variable, fare amount, can be found below.

Since the data looked heavily right-skewed, we decided to drop fares above $60, on the assumption that most of them were negotiated fares.

Qualitatively, the variables that might reasonably be associated with fare amount include trip duration, trip distance, time of day, tolls amount, taxi type, pick-up borough, and extra fees. A regression model with these variables as predictors therefore serves as our original expanded model. However, it is also worth looking at the correlation plot between fare and the other continuous variables.
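As a minimal sketch of this filtering step, the snippet below applies the $60 cutoff to a handful of toy fares. The values and the names `fares` and `FARE_CAP` are made up for illustration and are not taken from the dataset or the original analysis code.

```python
# Toy fare amounts in dollars (illustrative only, not from the dataset).
fares = [7.5, 12.0, 52.0, 60.0, 75.5, 120.0]

FARE_CAP = 60.0  # cutoff stated in the text

# The text drops fares "above $60", so a fare of exactly $60 is kept.
kept = [f for f in fares if f <= FARE_CAP]
dropped = [f for f in fares if f > FARE_CAP]

print("kept:", kept)
print("dropped:", dropped)
```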

Model building

Looking at the correlation plot, we saw that the outcome variable, fare, was highly correlated with trip distance, tolls amount, and duration, so we included these as predictors for our second model. On qualitative grounds, we also added time of day to this model.
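For readers who want to see what a single cell of a correlation plot measures, here is a small sketch of the Pearson correlation coefficient. The helper `pearson_r` and the toy distance/fare pairs are hypothetical; the original analysis presumably used a plotting package for this.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (made up): trip distance in miles and fare in dollars.
distance = [1.0, 2.0, 3.0, 4.0, 5.0]
fare = [7.0, 10.5, 13.0, 16.5, 19.0]

r = pearson_r(distance, fare)
print(f"correlation between distance and fare: {r:.3f}")
```

A value close to 1 indicates a strong positive linear association, which is the pattern the text reports for fare and trip distance.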

We also used stepwise regression with AIC as the criterion to look for a more parsimonious model. Stepwise regression did not suggest dropping any variables, so it kept the original expanded model. However, we wanted to see whether a very parsimonious model (with only trip distance and duration as predictors) would perform better.
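For reference, the criterion that stepwise selection tries to minimize is the standard AIC, shown here in its general form and in the form it takes for a linear model with normal errors (up to an additive constant). The symbols \(k\), \(n\), \(\hat{L}\), and \(\mathrm{RSS}\) are introduced here and do not appear in the original write-up.

\[ \mathrm{AIC} = 2k - 2\ln(\hat{L}) \]

\[ \mathrm{AIC}_{\text{linear}} = n \ln\left(\frac{\mathrm{RSS}}{n}\right) + 2k + \text{const} \]

Here \(k\) is the number of estimated parameters, \(\hat{L}\) is the maximized likelihood, \(n\) is the number of observations, and \(\mathrm{RSS}\) is the residual sum of squares. A variable is dropped only if removing it lowers the AIC, so "no variables dropped" means every predictor reduced \(\mathrm{RSS}\) enough to offset its penalty of 2.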

Next, we fitted the expanded model, as the stepwise regression result suggested. Model diagnostics flagged observations 123 and 16214 as highly influential points (they exceeded the Cook’s distance cut-off value), so we removed them.

We refitted the model, and below is the regression summary output for this first model, the one retained by stepwise regression.
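As a reminder of what this diagnostic measures, the usual definition of Cook’s distance for observation \(i\) is given below. The notation is introduced here, and the write-up does not state which cut-off was used; common rules of thumb are \(D_i > 1\) or \(D_i > 4/n\).

\[ D_i = \frac{e_i^2}{p \, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2} \]

Here \(e_i\) is the residual of observation \(i\), \(h_{ii}\) is its leverage, \(p\) is the number of regression coefficients (including the intercept), and \(s^2\) is the mean squared error of the fitted model. An observation scores high when it combines a large residual with high leverage.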

term                                   estimate  std.error   statistic  p.value
(Intercept)                           3.8684367  0.3042991   12.712614  0.0000000
trip_distance                         2.9118704  0.0105459  276.113652  0.0000000
duration                              0.0026267  0.0002564   10.244701  0.0000000
as.factor(time_of_day)early morning  -1.0207754  0.1566061   -6.518105  0.0000000
as.factor(time_of_day)morning rush    0.8995413  0.0653084   13.773748  0.0000000
as.factor(time_of_day)others          1.6313873  0.0613872   26.575351  0.0000000
as.factor(time_of_day)lunch           1.6366897  0.0755285   21.669832  0.0000000
as.factor(time_of_day)evening rush    1.3087173  0.0704371   18.579935  0.0000000
as.factor(time_of_day)dinner time     0.4881317  0.0648335    7.529000  0.0000000
extra                                 0.0418498  0.0143383    2.918741  0.0035176
tolls_amount                          0.3184096  0.0346663    9.184997  0.0000000
as.factor(type)yellow                 0.1630431  0.0763189    2.136341  0.0326613
as.factor(pu_boro)Brooklyn            0.8848642  0.3345617    2.644846  0.0081779
as.factor(pu_boro)Manhattan          -0.4008831  0.2996303   -1.337926  0.1809328
as.factor(pu_boro)Queens             -2.6402839  0.3276419   -8.058443  0.0000000

r.squared  adj.r.squared
0.8806238  0.8805572

Fitting the most parsimonious model (with duration and distance as predictors) gave the regression outputs below:

term            estimate  std.error  statistic  p.value
(Intercept)    4.6918487  0.0234144  200.38332  0
trip_distance  2.8695304  0.0067880  422.73319  0
duration       0.0028756  0.0002572   11.18014  0

r.squared  adj.r.squared
0.8748377  0.8748281

And below is the “moderate” model (with duration, distance, and time of day as predictors) and its regression outputs:

term                                   estimate  std.error   statistic  p.value
(Intercept)                           3.7237533  0.0532879   69.879943  0
trip_distance                         2.8929383  0.0068470  422.511520  0
duration                              0.0026141  0.0002574   10.157020  0
as.factor(time_of_day)early morning  -1.0635439  0.1572182   -6.764762  0
as.factor(time_of_day)morning rush    0.8860394  0.0650083   13.629643  0
as.factor(time_of_day)others          1.6079726  0.0613598   26.205621  0
as.factor(time_of_day)lunch           1.6095612  0.0754878   21.322144  0
as.factor(time_of_day)evening rush    1.3060180  0.0701718   18.611727  0
as.factor(time_of_day)dinner time     0.4814640  0.0650064    7.406410  0

r.squared  adj.r.squared
0.8795327  0.8794943

Cross-validation

We now have three models to cross-validate and compare on cross-validated prediction error (RMSE).

The plot above suggests that although the moderate model performs only marginally better than the stepwise and parsimonious models, it offers the best balance between parsimony and predictive ability. This model also has an R-squared of 88%.
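To make the comparison criterion concrete, the sketch below computes a cross-validated RMSE for a one-predictor linear model on synthetic data. The write-up does not say which resampling scheme was used, so k-fold splitting is an assumption here, and `fit_line`, `rmse`, `kfold_rmse`, and the toy data are all hypothetical.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    sq = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sq / len(actual))

def kfold_rmse(xs, ys, k=5, seed=1):
    """Return one held-out RMSE per fold (k-fold is an assumed scheme)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [b0 + b1 * xs[i] for i in held_out]
        scores.append(rmse([ys[i] for i in held_out], preds))
    return scores

# Synthetic trips (made up): fare grows roughly linearly with distance.
rng = random.Random(0)
miles = [rng.uniform(0.5, 10.0) for _ in range(100)]
fares = [4.7 + 2.9 * m + rng.gauss(0.0, 1.5) for m in miles]

scores = kfold_rmse(miles, fares)
mean_rmse = sum(scores) / len(scores)
print(f"mean cross-validated RMSE: {mean_rmse:.2f}")
```

Running the same procedure for each candidate model and comparing the resulting RMSE distributions is what a plot of cross-validated error summarizes.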

Check for multicollinearity

                        GVIF      Df  GVIF^(1/(2*Df))
trip_distance           1.019654  1   1.009779
duration                1.017731  1   1.008827
as.factor(time_of_day)  1.003549  6   1.000295

Since the GVIF values for all predictors are below 5, we don’t need to worry about multicollinearity.
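The last column of the table rescales each GVIF by its degrees of freedom so that a multi-level factor such as time of day is comparable with single-coefficient predictors. The sketch below recomputes that column from the reported GVIF and Df values; the helper name `adjusted_gvif` is made up for illustration.

```python
# GVIF and Df values copied from the table above.
gvif_table = {
    "trip_distance": (1.019654, 1),
    "duration": (1.017731, 1),
    "as.factor(time_of_day)": (1.003549, 6),
}

def adjusted_gvif(gvif, df):
    """Return GVIF^(1/(2*Df)), the Df-adjusted value."""
    return gvif ** (1.0 / (2 * df))

adjusted = {term: adjusted_gvif(g, df) for term, (g, df) in gvif_table.items()}

for term, value in adjusted.items():
    print(f"{term}: {value:.6f}")
```

For predictors with one degree of freedom, the adjusted value is simply the square root of the ordinary VIF.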

In the end, we decided to go with the model below for fare prediction:

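\(\widehat{\text{Fare}} = \hat{\beta}_0 + \hat{\beta}_1 \times \text{Duration} + \hat{\beta}_2 \times \text{Distance} + \hat{\beta}_3 \times I(\text{time of day} = \text{early morning}) + \hat{\beta}_4 \times I(\text{time of day} = \text{morning rush}) +\) \(\hat{\beta}_5 \times I(\text{time of day} = \text{others}) + \hat{\beta}_6 \times I(\text{time of day} = \text{lunch}) + \hat{\beta}_7 \times I(\text{time of day} = \text{evening rush}) + \hat{\beta}_8 \times I(\text{time of day} = \text{dinner time})\)

Night has no indicator term because, judging by the regression output for the moderate model, it is the reference level.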
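As a worked example, the sketch below plugs the coefficient estimates from the “moderate” model table into a point prediction. The function name `predict_fare` and the example trip are hypothetical; treating “night” as the reference level is an inference from its absence in the coefficient table, and duration must be in whatever unit the model was fitted on (the write-up describes it in minutes). The 95% prediction interval is not reproduced, since it needs the residual variance and design matrix, which are not shown.

```python
# Estimates copied from the "moderate" model output.
COEF = {
    "(Intercept)": 3.7237533,
    "trip_distance": 2.8929383,
    "duration": 0.0026141,
}

# Time-of-day effects relative to the reference level.
TIME_OF_DAY_EFFECT = {
    "night": 0.0,  # assumed reference level (not listed in the table)
    "early morning": -1.0635439,
    "morning rush": 0.8860394,
    "others": 1.6079726,
    "lunch": 1.6095612,
    "evening rush": 1.3060180,
    "dinner time": 0.4814640,
}

def predict_fare(trip_distance, duration, time_of_day):
    """Point estimate of the fare in dollars from the fitted coefficients."""
    if time_of_day not in TIME_OF_DAY_EFFECT:
        raise ValueError(f"unknown time of day: {time_of_day!r}")
    return (
        COEF["(Intercept)"]
        + COEF["trip_distance"] * trip_distance
        + COEF["duration"] * duration
        + TIME_OF_DAY_EFFECT[time_of_day]
    )

# Hypothetical trip: 3 miles, duration value of 15, at lunch time.
estimate = predict_fare(trip_distance=3.0, duration=15.0, time_of_day="lunch")
print(f"estimated fare: ${estimate:.2f}")
```

Because fares above $60 were dropped before fitting, estimates near or beyond that value should be read with caution.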

Fare prediction

The data only has fare and duration information for taxis, so we only looked at yellow and green taxi observations (with fares of at most $60, and excluding trips in Staten Island).

We used the model obtained above to create a Shiny app that helps predict taxi fare based on the three predictors in the final selected model:

  1. Distance (in miles) – variable “trip_distance” in the dataset

  2. Duration (in minutes) – new variable created by taking the time difference between “pick-up time” and “drop-off time”

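  3. Time of day – we categorized this continuous variable into a factor with more intuitive levels. Six of them are listed below; the regression output also includes an “early morning” level, which is not listed and presumably covers the remaining hours (2am-6am):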

  • 6am-9am: morning rush

  • 11am-1pm: lunch

  • 4pm-6pm: evening rush

  • 6pm-9pm: dinner time

  • 9pm-12am and 12am-2am: night

  • 9am-11am and 1pm-4pm: others
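The two derived predictors above can be sketched as follows. The function names are hypothetical, intervals are treated as half-open (for example, 9:00 counts as “others”, not “morning rush”), and mapping 2am-6am to “early morning” is an assumption based on the level that appears in the regression output. The year in the example timestamps is arbitrary.

```python
from datetime import datetime

def time_of_day(hour):
    """Map a pick-up hour (0-23) to the time-of-day level used in the model."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if 6 <= hour < 9:
        return "morning rush"
    if 11 <= hour < 13:
        return "lunch"
    if 16 <= hour < 18:
        return "evening rush"
    if 18 <= hour < 21:
        return "dinner time"
    if hour >= 21 or hour < 2:
        return "night"
    if 2 <= hour < 6:
        return "early morning"  # assumed range, not listed in the text
    return "others"  # 9am-11am and 1pm-4pm

def duration_minutes(pickup, dropoff):
    """Trip duration as the difference between drop-off and pick-up times."""
    return (dropoff - pickup).total_seconds() / 60.0

# Hypothetical trip on Valentine's Day (year chosen arbitrarily).
pickup = datetime(2000, 2, 14, 18, 5)
dropoff = datetime(2000, 2, 14, 18, 25)

level = time_of_day(pickup.hour)
minutes = duration_minutes(pickup, dropoff)
print(level, minutes)
```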

Please feel free to use the app here to see how much it might cost you to travel from your current neighborhood to your desired neighborhood!

(Note that these predictions might only be valid for Valentine’s Day and for fares below $60.)