New York Taxi

A machine learning project predicting NYC taxi fares using Apache Spark for distributed data processing, achieving an R² of 0.81 with SVM regression.

NYC taxi fares vary significantly based on distance, time, and location. Passengers lack reliable fare estimates before trips, and drivers have limited insight into demand patterns across the city.

Processed June 2019 taxi trip data using Apache Spark, engineered temporal and spatial features, and compared multiple regression models. Support Vector Machines achieved the best performance with R² of 0.81 and RMSE of 6.47.
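To make the two reported metrics concrete, here is a minimal sketch of how R² and RMSE are computed from true and predicted fares. The function names and the three sample fares are illustrative assumptions, not values from the project.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up fares, used only to exercise the formulas.
true_fares = [10.0, 20.0, 30.0]
predicted_fares = [12.0, 18.0, 33.0]
print(r2_score(true_fares, predicted_fares))
print(rmse(true_fares, predicted_fares))
```

On this toy data R² is 0.915 and RMSE is about 2.38. An R² of 0.81 means the model explains roughly 81% of the variance in fares, which is not the same thing as 81% classification-style accuracy.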

  • Distributed data processing with Apache Spark for large-scale trip data
  • Feature engineering: hour of day, day of week, zone name conversions
  • Model comparison: SVM, Ridge Regression, Random Forest, Gradient Boosting
  • Cyclic transformations for temporal features
  • Trip distance identified as strongest fare predictor
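
The cyclic transformations listed above can be sketched as follows. This is one common way to encode periodic features such as hour of day, assuming a sine/cosine mapping onto the unit circle; the helper names are hypothetical and the project's exact encoding may differ.

```python
import math

def cyclic_encode(value, period):
    """Map a periodic value (e.g. hour 0-23) to a (sin, cos) pair."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

def encoded_distance(a, b, period):
    """Euclidean distance between two values after cyclic encoding."""
    sin_a, cos_a = cyclic_encode(a, period)
    sin_b, cos_b = cyclic_encode(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)

# Hour 23 and hour 0 are one hour apart, just like hour 0 and hour 1.
print(encoded_distance(23, 0, 24))
print(encoded_distance(0, 1, 24))

# Day of week works the same way with a period of 7.
print(cyclic_encode(3, 7))
```

With a raw integer feature, 23:00 and 00:00 appear 23 units apart. After encoding, adjacent hours are equally close everywhere on the clock, which matters for distance-sensitive models such as SVM.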

Python

Jupyter Notebook environment for iterative model development and visualization

Apache Spark

Distributed processing of Kaggle NYC taxi dataset via spark.read.csv()
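
A minimal sketch of the loading step, assuming the dataset is a CSV file with a header row. The application name, the file name in the usage note, and the choice to infer column types are assumptions; only the use of spark.read.csv() comes from the project description.

```python
def make_session(app_name="nyc-taxi-fares"):
    """Create or reuse a SparkSession (requires pyspark to be installed)."""
    # Imported lazily so load_trips can be used with any existing session.
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).getOrCreate()

def load_trips(spark, path):
    """Read a trip CSV, treating the first row as column names and
    letting Spark infer column types from the data."""
    return spark.read.csv(path, header=True, inferSchema=True)
```

Usage would look like `trips = load_trips(make_session(), "trips_2019_06.csv")`, where the file name is hypothetical. Note that inferSchema=True costs an extra pass over the data; for large files an explicit schema is often preferable.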

Scikit-learn

Model training and evaluation, with SVM achieving the best results (R²=0.81, RMSE=6.47)
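
The model comparison could be organized along the following lines. This is a sketch under stated assumptions: default hyperparameters, standard scaling before the SVM and Ridge models, and a held-out test split. None of these details are confirmed by the project description, and `default_models` and `compare_models` are hypothetical helper names.

```python
import math

def default_models():
    """Build the four model families named above (requires scikit-learn)."""
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVR
    return {
        # SVR and Ridge are scale-sensitive, so both get a scaler (assumption).
        "SVM": make_pipeline(StandardScaler(), SVR()),
        "Ridge Regression": make_pipeline(StandardScaler(), Ridge()),
        "Random Forest": RandomForestRegressor(random_state=0),
        "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    }

def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit each model and return (name, r2, rmse) rows, best R² first."""
    results = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        squared_errors = [(float(p) - float(t)) ** 2
                          for p, t in zip(predictions, y_test)]
        rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
        r2 = model.score(X_test, y_test)  # R² for scikit-learn regressors
        results.append((name, r2, rmse))
    return sorted(results, key=lambda row: row[1], reverse=True)
```

Calling `compare_models(default_models(), X_train, y_train, X_test, y_test)` on the engineered features returns a ranked table from which the best model can be read off. Because SVR training scales poorly with the number of rows, fitting on a sample of the Spark output is a likely practical necessity.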