BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Vicente95
Obsidian | Level 7

Hello All,

 

Background

I am new to SAS programming and general statistics and am working on a personal project where I want to analyze the used cars listing on Craigslist. After some research, I found Austin Reese's Used Cars dataset on Kaggle to be beneficial to my project, as shown here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data. I am using the latest dataset available on Kaggle as of June 16, 2021. 

 

Problem

The problem with the dataset is that there are tens and hundreds of thousands of missing values for the string variables such as condition, cylinders, drive, paint_color, etc. I was able to trim the latest dataset available on Kaggle to 349,188 from 426,880 and clear all missing values for the quantitative variables such as the price, odometer, year, and posting date. I do NOT want to delete any rows that have missing values since that would induce bias into my analysis and reduce the overall sample size. Can anyone please offer me guidance on how I can replace all the missing values in the string variables in SAS so I can start performing my analysis? If needed, I can attach my current code file and any relevant SAS code output to this question. Any assistance is much appreciated. 

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26
if condition^='Salvage' then do;
    if year >= 2021 then condition = 'New';
    else if year >= 2019 then condition = 'Like New';
    else condition = 'Used';
end;

I think it would be wiser to not store Salvage in the same variable as New, Like New or Used, it seems like you might want separate variables here.

--
Paige Miller

View solution in original post

13 REPLIES 13
Vicente95
Obsidian | Level 7

I have attached the first 20 observations after dropping the extreme observations from the price and odometer variables along with filtering the year variable. I have also attached my SAS code for reference. 

 

top_20_obs.PNG

SASKiwi
PROC Star

There are likely some attributes you can infer with expert knowledge, and don't forget you could correct some obvious mistakes as well.

 

For example all Ford F-150 models could be labelled type - pickups and size - full-size. 

 

You can't expertly infer some attributes like condition or color though and you could easily introduce bias by populating missing values incorrectly such as @Kurt_Bremser 's color pink joke.

 

I suggest you try analysing after intelligently inferring the missing values that make sense first, before trying to use other techniques.

 

 

Vicente95
Obsidian | Level 7
Hmm OK. How would I go about classifying the following for the condition variable in SAS code:

If the year of the car (year variable) is between 2021 and 2022, the condition is new.

Else if, the year of the car is between 2019 to 2020, the condition is like new.

Else, the condition is used for all cars made from 2018 and earlier.

I want to create a new column for this if-else condition.
SASKiwi
PROC Star
if year >= 2021 then condition = 'New     ';
else if year >= 2019 then condition = 'Like New';
else condition = 'Used';
Vicente95
Obsidian | Level 7

Thank you! I have a follow-up question. How would I preserve the used vehicles that condition is listed as salvage? There are some used vehicles where their title status is clean, salvage, rebuilt, and parts only and I want to keep the condition as salvage (i.e., the condition does not change for these cars). Is it possible to NOT modify the condition of these used vehicles and leave them alone after or before running the if-else statement as shown below? 

PaigeMiller
Diamond | Level 26
if condition^='Salvage' then do;
    if year >= 2021 then condition = 'New';
    else if year >= 2019 then condition = 'Like New';
    else condition = 'Used';
end;

I think it would be wiser to not store Salvage in the same variable as New, Like New or Used, it seems like you might want separate variables here.

--
Paige Miller
Patrick
Opal | Level 21

What would you want to replace missings with?

Most SAS procedures deal just fine with missings - you might want to consult the docu for the Stat procedure you want to use to understand how missings are treated (i.e. if there is a MISSING keyword to include or exclude missings from analysis).

Vicente95
Obsidian | Level 7

Hello Patrick,

 

The missing observations that I am concerned with are the following string/char variables:

  • condition 
  • cylinders
  • drive
  • fuel
  • paint_color
  • size
  • title_status
  • transmission
  • type

I do not know how I would replace hundreds of thousands of these observations, especially in SAS. I do not want to drop any rows that have missing observations since I would introduce bias and lose a lot of valuable information in my analysis. How would you go about to replace the missing values in this situation? 

Patrick
Opal | Level 21

For variable drive: With what string would you want to replace the missings with? And how would this make any analysis better?
Patrick_0-1624534202908.png

If you look into the documentation for SAS Procedures then you'll see that many of these deal with missings (like: include or exclude missing values as a valid category).

 

 

Vicente95
Obsidian | Level 7
Would it be safe to classify them as "other?" I think it would make sense to include the missing observations in my analysis since I don't want to discard so many "valid" observations.
Vicente95
Obsidian | Level 7
Would it be safe to categorize them as "other" or "null"? I don't know how I would impute the missing values for the color variable. Color is a critical piece to determining the price of a used car.
PaigeMiller
Diamond | Level 26

The whole problem here is that there is no right answer, and "Pink" in some sense is just as good (or bad) as "Other" or "Null". And so, the decision goes back to you, not us, to determine what to do here.

 

Color may be a critical piece of determining the price of a used car, but you are missing this critical piece, and so any effort to determine the price of a used car will be less successful because of this missing data.

--
Paige Miller

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 1075 views
  • 8 likes
  • 5 in conversation