New York Taxi

A machine learning project predicting NYC taxi fares using Apache Spark for distributed data processing, achieving an R² of 0.81 with SVM regression.

The Challenge

NYC taxi fares vary significantly based on distance, time, and location. Passengers lack reliable fare estimates before trips, and drivers have limited insight into demand patterns across the city.

The Solution

The project processed June 2019 taxi trip data with Apache Spark, engineered temporal and spatial features, and compared multiple regression models. Support vector regression performed best, with an R² of 0.81 and an RMSE of 6.47.
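
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how R² and RMSE are computed for a regression model. The function names and the sample fares are made up for illustration and are not taken from the project's notebook.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical prediction error in the target's units."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(y_true, y_pred):
    """Coefficient of determination: share of variance explained by the model."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


# Illustrative values only, not project data.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(f"R^2  = {r_squared(actual, predicted):.3f}")
print(f"RMSE = {rmse(actual, predicted):.3f}")
```

An R² of 0.81 therefore means the model explains about 81% of the variance in fares, which is a different quantity from classification accuracy.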

Key Features

Distributed data processing with Apache Spark for large-scale trip data
Feature engineering: hour of day, day of week, zone name conversions
Model comparison: SVM, Ridge Regression, Random Forest, Gradient Boosting
Cyclic transformations for temporal features
Trip distance identified as strongest fare predictor
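
The cyclic transformation listed above can be sketched as follows. This is a generic sine/cosine encoding, assuming the hour of day runs from 0 to 23; the helper names are hypothetical and the project's own implementation may differ.

```python
import math


def cyclic_encode(value, period):
    """Map a periodic value (e.g. hour 0-23) onto the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def cyclic_distance(a, b, period):
    """Euclidean distance between two encoded values."""
    sin_a, cos_a = cyclic_encode(a, period)
    sin_b, cos_b = cyclic_encode(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# As raw integers, 23:00 and 00:00 look 23 units apart.
# After encoding they are as close as any two adjacent hours.
late_to_midnight = cyclic_distance(23, 0, period=24)
midnight_to_one = cyclic_distance(0, 1, period=24)
midnight_to_noon = cyclic_distance(0, 12, period=24)

print(late_to_midnight, midnight_to_one, midnight_to_noon)
```

The same encoding applies to day of week with a period of 7. Each temporal feature becomes two columns (sine and cosine), which lets distance-based models such as SVM treat the end and start of a cycle as neighbors.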

Tech Stack

Python

Jupyter Notebook environment for iterative model development and visualization

Apache Spark

Distributed processing of Kaggle NYC taxi dataset via spark.read.csv()
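
As a rough sketch of the loading step, the function below reads a trip CSV with spark.read.csv() and derives hour-of-day and day-of-week columns. The column name tpep_pickup_datetime, the application name, and the derived column names are assumptions for illustration; the actual Kaggle file may use a different schema.

```python
# Assumed column name; adjust to match the actual dataset schema.
PICKUP_COL = "tpep_pickup_datetime"


def load_trips(csv_path):
    """Read a trip CSV into a Spark DataFrame and add simple temporal features."""
    # Imported here so the module can be loaded without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("nyc-taxi-fares").getOrCreate()

    # header=True uses the first row as column names;
    # inferSchema=True asks Spark to detect column types from the data.
    trips = spark.read.csv(csv_path, header=True, inferSchema=True)

    pickup = F.to_timestamp(F.col(PICKUP_COL))
    return (
        trips
        .withColumn("pickup_hour", F.hour(pickup))       # 0-23
        .withColumn("pickup_dow", F.dayofweek(pickup))   # 1 = Sunday ... 7 = Saturday
    )
```

Calling load_trips with the path to the downloaded CSV returns a DataFrame that can be filtered and sampled in Spark before being handed to a single-machine library for model training.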

Scikit-learn

Model training with SVM achieving best results (R²=0.81, RMSE=6.47)
