Hello All,
Background
I am new to SAS programming and general statistics and am working on a personal project where I want to analyze the used cars listing on Craigslist. After some research, I found Austin Reese's Used Cars dataset on Kaggle to be beneficial to my project, as shown here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data. I am using the latest dataset available on Kaggle as of June 16, 2021.
Problem
The problem with the dataset is that there are tens and hundreds of thousands of missing values for the string variables such as condition, cylinders, drive, paint_color, etc. I was able to trim the latest dataset available on Kaggle to 349,188 from 426,880 and clear all missing values for the quantitative variables such as the price, odometer, year, and posting date. I do NOT want to delete any rows that have missing values since that would induce bias into my analysis and reduce the overall sample size. Can anyone please offer me guidance on how I can replace all the missing values in the string variables in SAS so I can start performing my analysis? If needed, I can attach my current code file and any relevant SAS code output to this question. Any assistance is much appreciated.
if condition^='Salvage' then do;
if year >= 2021 then condition = 'New';
else if year >= 2019 then condition = 'Like New';
else condition = 'Used';
end;
I think it would be wiser to not store Salvage in the same variable as New, Like New or Used, it seems like you might want separate variables here.
I have attached the first 20 observations after dropping the extreme observations from the price and odometer variables along with filtering the year variable. I have also attached my SAS code for reference.
There are likely some attributes you can infer with expert knowledge, and don't forget you could correct some obvious mistakes as well.
For example all Ford F-150 models could be labelled type - pickups and size - full-size.
You can't expertly infer some attributes like condition or color though and you could easily introduce bias by populating missing values incorrectly such as @Kurt_Bremser 's color pink joke.
I suggest you try analysing after intelligently inferring the missing values that make sense first, before trying to use other techniques.
if year >= 2021 then condition = 'New ';
else if year >= 2019 then condition = 'Like New';
else condition = 'Used';
Thank you! I have a follow-up question. How would I preserve the used vehicles that condition is listed as salvage? There are some used vehicles where their title status is clean, salvage, rebuilt, and parts only and I want to keep the condition as salvage (i.e., the condition does not change for these cars). Is it possible to NOT modify the condition of these used vehicles and leave them alone after or before running the if-else statement as shown below?
if condition^='Salvage' then do;
if year >= 2021 then condition = 'New';
else if year >= 2019 then condition = 'Like New';
else condition = 'Used';
end;
I think it would be wiser to not store Salvage in the same variable as New, Like New or Used, it seems like you might want separate variables here.
What would you want to replace missings with?
Most SAS procedures deal just fine with missings - you might want to consult the docu for the Stat procedure you want to use to understand how missings are treated (i.e. if there is a MISSING keyword to include or exclude missings from analysis).
Hello Patrick,
The missing observations that I am concerned with are the following string/char variables:
I do not know how I would replace hundreds of thousands of these observations, especially in SAS. I do not want to drop any rows that have missing observations since I would introduce bias and lose a lot of valuable information in my analysis. How would you go about to replace the missing values in this situation?
For variable drive: With what string would you want to replace the missings with? And how would this make any analysis better?
If you look into the documentation for SAS Procedures then you'll see that many of these deal with missings (like: include or exclude missing values as a valid category).
So which color would you want to have if it is missing? Pink (just joking)?
The whole problem here is that there is no right answer, and "Pink" in some sense is just as good (or bad) as "Other" or "Null". And so, the decision goes back to you, not us, to determine what to do here.
Color may be a critical piece of determining the price of a used car, but you are missing this critical piece, and so any effort to determine the price of a used car will be less successful because of this missing data.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.