An In-depth Analysis of NYC Taxi Data using SAS Studio

In this demonstration, we’ll explore the process of data preparation, cleaning, and visualization using SAS Viya by analyzing a real-world dataset of New York City Yellow Taxi Trips. This dataset, sourced from Kaggle, contains trip records from 2024, including details on fares, passenger counts, and pickup and drop-off locations. We will begin by loading two separate aggregated CSV files into our SAS environment and then perform crucial data manipulation tasks to ensure the data is in the correct format for analysis.

The initial raw data is unwieldy and requires several preprocessing steps before we can extract meaningful insights. We will demonstrate how to address formatting issues, such as converting character dates to a proper SAS date format, rounding numerical values for clarity, and mapping numerical payment codes to their descriptive text labels. Furthermore, we will show how to create a subset of the data focusing on the month of January to reduce the data set size, making our analysis more efficient. This pre-analysis preparation is vital for creating accurate and reliable visualizations and is a foundational step in any data-driven project.

About the Data

The data used for this demonstration will be from Kaggle using the NYC Yellow Taxi Trips of aggregated data for 2024. The NYC Yellow Taxi dataset is a publicly accessible compilation of yellow trip records from New York City's five boroughs. Trip distances, itemized rates, passenger counts, payment methods, and pickup and drop-off times and locations are among the facts it records for every journey. Due to the monthly publication of the trip data, the files are big and dispersed over the course of a year.

The data set consist of 2024 raw monthly files that were combined and aggregated into this dataset to make it reasonable in size. The trip logs were arranged according to the pickup and drop-off boroughs, date, time of ride, number of passengers, and mode of payment.

Loading the Data

The first step of this demonstration is to load in our data into SAS Viya environment, in this case we will be loading in two separate datasets and making some minor alterations to the data as well.

/* Creating the Libname statement*/;

libname RC "/innovationlab-export/innovationlab/homes/Dee.McKoy@sas.com/demo/output3";


/* Loading the data for the Aggregated Taxi Trips */;
proc import
    datafile="/innovationlab-export/innovationlab/homes/Dee.McKoy@sas.com/demo/data/aggregated_nyc_yellow_taxi_2024.csv"
    out=RC.taxi_trips_Agg dbms=csv;
run;

/* Trip Count by Zone for the Aggregated Taxi Trips*/;
proc import
    datafile="/innovationlab-export/innovationlab/homes/Dee.McKoy@sas.com/demo/data/tripCount_byZone_nyc_yellow_taxi_2024.csv"
    out=RC.taxi_trips_Counts dbms=csv;
run;

From the above snippet of code, we have created a libname statement to store our output data. We named the data library as “RC”. Following the libname statement we load both the NYC Yellow Taxi aggregate and Trip Count by Zone dataset using the PROC IMPORT to load in our datafiles, we specify the type of file and the output location of the file once it has been loaded. Next, we will visualize the RC.taxi_trips_Aggs dataset to get a better understanding of what the data looks like.

Figure 1. RC.taxi_trips_Aggs

The figure displays the first 10 observations of the NYC Aggregated dataset. We convert the "date" column to the SAS date9 format and reduce the "duration_sum" decimal place by two.

Figure 2. RC.taxi_trips_Counts

We want to change the date column to date9. format and the “duration_sum” column we want to round down the decimal place to make the column more readable. Now that we have looked at our data we can start working on the pre-processing stage. We will need to format the payment types as they are listed as numbers from 1 to 6. The next part of the preprocessing stage will consist of us handling the date format mentioned earlier. Let’s move forward and clean up some of these columns before moving to feature engineering of the datasets.

proc format;
value payment_type_format 0='Flex Fare Trip' 1='Credit Card' 2='Cash' 3=
'No Charge' 4='Dispute' 5='Unknown' 6='Voided Trip/ Cancelled Trip';
run;

data RC.taxi_trips_Counts;
set RC.taxi_trips_counts;
drop date;
/*** ADD FORMAT FOR COUNT FOR MONTH COLUMN*/
run;

data RC.taxi_trips_Aggs;
set RC.taxi_trips_Agg;
format date date9.;
duration_sum=round(duration_sum,0.01);
format payment_type payment_type_format.;
run;

proc print data=RC.taxi_trips_Aggs (obs=5);
run;

proc print data=RC.taxi_trips_counts (obs=5);
run;

In the above figure, we can see the changes were made to the “Date” column changing the format to SAS date9 format. For the NYC TAX Aggregated dataset we need to round the decimal place down to 2 places for conventional reasons. We can also see the changes that were made to the PAYMENT_TYPE column where we formatted the column to remove the integers and replace their character values that correspond to the method of payment for the trip.

In the next section, we want to look within the NYC Taxi Aggregated dataset to have better understanding of monthly revenue, and a lot of revenue was generated for that month. After we select the month we want to investigate, we then will create a few different visualizations showing the distribution of payment types, number of passengers per ride, the top 10 pick-up locations and the top 10 drop-off locations.

Investigation of 1-Month of Taxicab Transportation

data RC.Jan_aggs_trips;
set RC.taxi_trips_Aggs;
where date between '01JAN2024'd and '31JAN2024'd;
run;

In the above snippet of code, we create a DATA step to create a table that only shows the month of January. We use a WHERE statement and the BETWEEN function to isolate all that data points for the month of January.

In the above figure, we display the DATE column of our January dates only. This also helps reduce the number of observations by creating a new dataset to gain some insight from. Now, let see the number observations using PROC CONTENT command.

In the above figure, we have some initial information regarding the RC>JAN_AGG_TRIPS dataset. At the beginning of the demonstration we started off with 921,373 observations and 18 variables; after isolating the month of January we now are only looking at 71,126 observations. The PROC CONTENT shows the contents of a CAS table and prints the directory of the caslib, allowing for the user to know when the dataset was created, last time it was modified, data representation, and the encoding of the dataset as well.

data RC.jan_aggs_trips;
set RC.jan_aggs_trips;
if passenger_count >= 5 then delete;
run;

Now, we need to run another DATA step to eliminate passenger count more than 5. In New York City, taxi cabs are prohibited from transporting more than 5 passengers including the driver. In the above DATA step, we call January dataset and write an IF THEN statement to delete any observations that have more than 5 passengers for the year of 2024.

proc contents data=RC.jan_aggs_trips; run;

After removing the passenger count over 5, we notice that we had another drop in the number of observations down to 66,869 data points. Now, we can move forward with creating visualizations for the month of January.

Using SAS Macro to Create Pie Chart Visualizations

To build our pie chart faster, we will use a SAS macro to store our RC.jan_aggs_trips dataset. This will allow us to call the dataset using the SAS macro faster and focus more on what type of visualization we want to display based on the variables within the dataset.

%let data_set=RC.jan_aggs_trips;
%let data_trips = RC.taxi_trips_Counts; 
proc sgplot data=&data_set; 
title'Payment Type Distribution per Ride'; 
vbar payment_type / datalabel datalabelattrs=all; 
run;

From the above figure, we show the type of payment that was made per ride. We use the PROC SGPLOT statement to produce the bar chart. Majority of riders paid using a credit card with payment transaction of around 33,000 riders, and the second largest form of payment is the “cash” with around 22,000 rider using this type of payment method.

proc sgplot data=&data_set;
title'Number of Passenger Count Distribution';vbar passenger_count / datalabel datalabelattrs=all;run;

In the above figure, we see the number of passengers count distribution. Majority of the taxi rides were with 1 person 25,952 out of 66,869 observations. The next largest set of passengers is 2 riders, followed by 3 riders per ride.

Top 10 Borough Pick-Up Location

proc freq data=&data_set noprint;
tables PU_Borough / out=pu_borough_counts(keep=PU_Borough count percent);
run;

/*************Sort by count to get the top boroughs***********/;
proc sort data=pu_borough_counts;
by descending count;
run;

proc sgpie data=pu_borough_counts(obs=10); /** Display only the top 10 **/;
title "TOP 10 Pick-Up Boroughs";
pie PU_Borough / response=count datalabeldisplay=all datalabelattrs=outside
datalabelattrs=(size=8);
run;

The next visualization we want to show is the top 10 pick-up boroughs in New York City. First, we use the PROC FREQ statement to create pick-up table labeled as “pu_borough_count” that will provide the count and percentage. In the code we want to make sure we sort the dataset. We perform the PROC SORT statement and order the data in descending order. We now can plot the table that was created in the PROC FREQ statement.

In the above figure, we see the pick-up borough with the top number of pick-ups is 45% with 30,071 riders beginning picked up in Manhattan. The next highest pick-up location is Queens with around 36% (24,086 riders ) of pick-ups in that borough.

Top 10 Borough Drop-Off Location

proc freq data=&data_set noprint;
tables DO_Borough / out=do_borough_counts(keep=DO_Borough count percent);
run;

/********* Sort by count to get the top drop-off boroughs************/;
proc sort data=do_borough_counts;
by descending count;
run;

proc sgpie data=do_borough_counts(obs=10); ;
title "TOP 10 Drop-Off Boroughs";
pie DO_Borough / response=count datalabeldisplay=all datalabelattrs=outside
datalabelattrs=(size=8);
run;

The next visualization we want to show is the top 10 pick-up boroughs in New York City. First, we use the PROC FREQ statement to create drop-off table labeled as “do_borough_count” that will provide the count and percentage. In the code we want to make sure we sort the dataset. We perform the PROC SORT statement and order the data in descending order. We now can plot the table that was created in the PROC FREQ statement

In the above figure, the top drop-off borough is Manhattan with about 35 % of riders being dropped off in this borough. The next two boroughs with similar numbers of drop off are Brooklyn and Queens. With Brooklyn having about 19.5% ( 13,031) of riders and in Queens 23% (15,406) of riders were dropped off at this location.

Conclusion

Through this process, we successfully transformed raw, complex data into a clean, insightful dataset ready for analysis. We demonstrated the importance of data preprocessing in SAS, including formatting dates, rounding numeric variables, and applying custom formats to categorical data. By isolating a single month and filtering out invalid trip records, we were able to create a more manageable dataset for targeted analysis. Our visualizations, including pie charts for payment types, passenger counts, and top pickup/drop-off locations, revealed key patterns in the taxi trip data. This analysis highlights how fundamental data cleaning and visualization techniques in SAS can provide clear, actionable insights from even the most challenging real-world datasets.

For information:

Find more articles from SAS Global Enablement and Learning here.