
A machine learning project that predicts NYC taxi fares, using Apache Spark for distributed data processing. The best model, SVM regression, reached an R² of 0.81.
The Challenge
NYC taxi fares vary significantly based on distance, time, and location. Passengers lack reliable fare estimates before trips, and drivers have limited insight into demand patterns across the city.
The Solution
Processed June 2019 taxi trip data using Apache Spark, engineered temporal and spatial features, and compared multiple regression models. Support Vector Machines achieved the best performance with R² of 0.81 and RMSE of 6.47.
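For readers unfamiliar with the two reported metrics, the sketch below shows how R² and RMSE are conventionally computed. It is an illustration only: the function names and the three sample fares are made up for the example and are not taken from the project's data or code. R² is the share of fare variance the model explains, which is not the same thing as classification accuracy, and RMSE is expressed in the same units as the fare.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical fares, chosen only to make the arithmetic easy to follow.
actual_fares = [10.0, 20.0, 30.0]
predicted_fares = [12.0, 18.0, 31.0]

example_rmse = rmse(actual_fares, predicted_fares)    # sqrt(9 / 3)
example_r2 = r2_score(actual_fares, predicted_fares)  # 1 - 9 / 200
```

Here the squared errors sum to 9 and the total sum of squares is 200, so R² is 1 - 9/200 = 0.955 and RMSE is the square root of 3.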
Key Features
- Distributed data processing with Apache Spark for large-scale trip data
- Feature engineering: hour of day, day of week, zone name conversions
- Model comparison: SVM, Ridge Regression, Random Forest, Gradient Boosting
- Cyclic transformations for temporal features
- Trip distance identified as strongest fare predictor
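The temporal feature engineering and cyclic transformation listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names `cyclic_encode` and `temporal_features`, the output keys, and the `YYYY-MM-DD HH:MM:SS` timestamp format are all hypothetical. The idea is to map a periodic value onto the unit circle with sine and cosine, so that hour 23 and hour 0 end up close together instead of 23 units apart.

```python
import math
from datetime import datetime

def cyclic_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def temporal_features(pickup_timestamp):
    """Derive hour-of-day and day-of-week features from a timestamp string.

    The timestamp format is an assumption made for this sketch.
    """
    ts = datetime.strptime(pickup_timestamp, "%Y-%m-%d %H:%M:%S")
    hour_sin, hour_cos = cyclic_encode(ts.hour, 24)
    # datetime.weekday() returns Monday = 0 through Sunday = 6.
    dow_sin, dow_cos = cyclic_encode(ts.weekday(), 7)
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),
        "hour_sin": hour_sin,
        "hour_cos": hour_cos,
        "dow_sin": dow_sin,
        "dow_cos": dow_cos,
    }

features = temporal_features("2019-06-01 00:00:00")

# After encoding, the 23:00 -> 00:00 step is as short as 00:00 -> 01:00.
wrap_gap = math.dist(cyclic_encode(23, 24), cyclic_encode(0, 24))
adjacent_gap = math.dist(cyclic_encode(0, 24), cyclic_encode(1, 24))
```

Using both sine and cosine matters: either one alone maps two different hours to the same value, while the pair identifies each hour uniquely.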
Tech Stack
- Jupyter Notebook environment for iterative model development and visualization
- Distributed processing of Kaggle NYC taxi dataset via spark.read.csv()
- Model training with SVM achieving best results (R²=0.81, RMSE=6.47)