Hello Team,
I have a used cars dataset where there are hundreds of thousands of missing values. The dataset looks something like this (I filtered the manufacturer to only show toyota for brevity's sake)
region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | drive | size | type | paint_color | state | posting_date |
ventura county | $ 7,800 | 2007 | toyota | gas | 10000000 | clean | automatic | ca | 2021-04-30T11:48:47-0700 | |||||||
south florida | $ 14,498 | 2007 | toyota | fj cruiser 4x4 | diesel | 7777777 | clean | automatic | fl | 2021-05-01T07:38:28-0400 | ||||||
flagstaff / sedona | $ 3,900 | 1998 | toyota | 4-runner sr5 | good | 6 cylinders | gas | 349000 | clean | automatic | rwd | mid-size | SUV | green | az | 2021-04-05T09:56:10-0700 |
phoenix | $ 3,500 | 1998 | toyota | 4runner | good | 6 cylinders | gas | 347000 | clean | automatic | rwd | SUV | silver | az | 2021-05-03T11:27:32-0700 | |
colorado springs | $ 6,500 | 1998 | toyota | good | 8 cylinders | gas | 345000 | clean | automatic | 4wd | full-size | SUV | grey | co | 2021-04-28T10:05:44-0600 | |
seattle-tacoma | $ 1,150 | 2009 | toyota | prius | fair | hybrid | 345000 | clean | automatic | hatchback | blue | wa | 2021-05-03T10:06:08-0700 | |||
knoxville | $ 2,000 | 1998 | toyota | camry ce | good | 4 cylinders | gas | 344200 | clean | automatic | sedan | tn | 2021-05-03T22:07:49-0400 | |||
knoxville | $ 2,000 | 1998 | toyota | camry ce | good | 4 cylinders | gas | 344200 | clean | automatic | sedan | tn | 2021-04-30T21:54:27-0400 | |||
knoxville | $ 2,000 | 1998 | toyota | camry ce | good | 4 cylinders | gas | 344200 | clean | automatic | sedan | tn | 2021-04-26T13:20:34-0400 | |||
knoxville | $ 2,200 | 1998 | toyota | camry ce | good | 4 cylinders | gas | 344200 | clean | automatic | sedan | tn | 2021-04-23T14:43:32-0400 | |||
SF bay area | $ 6,900 | 2001 | toyota | 4runner | 6 cylinders | gas | 342200 | clean | automatic | ca | 2021-05-03T18:18:25-0700 | |||||
champaign urbana | $ 1,000 | 1997 | toyota | camry | fair | 4 cylinders | gas | 342000 | clean | automatic | fwd | full-size | sedan | red | il | 2021-04-08T16:24:29-0500 |
kansas city, MO | $ 2,996 | 2000 | toyota | tundra 2wd truck | excellent | 6 cylinders | gas | 342000 | clean | automatic | rwd | pickup | white | ks | 2021-04-26T16:13:17-0500 | |
fort collins / north CO | $ 2,000 | 1999 | toyota | 4runner | fair | gas | 340500 | clean | automatic | 4wd | SUV | white | co | 2021-04-07T10:55:29-0600 | ||
orange county | $ 8,600 | 2003 | toyota | tundra | 8 cylinders | gas | 340000 | clean | automatic | ca | 2021-04-28T15:54:43-0700 |
I need to keep the existing rows that have no missing values intact and untouched, especially the model and manufacturer columns while replacing all the missing values for the engine using the make/model column. Note, all values/observations are already lower-case so there is no need to perform any sort of upper and lower casing.
My goal is to use an if-else statement in conjunction with a wildcard and list to fill in the missing values in the cylinders column using the make/model columns. The idea is to fill in missing engine cylinder values based on the car model. There are some missing values for the car model which I predict will cause problems.
Here is a sample logic that I want to perform for Toyota. If the SAS code works for Toyota, I can apply this to other car makers such as Honda, BMW, Lexus, Mercedes-Benz, etc. I can also use the same logic to work with the transmission, drive train, size, and type columns.
4 cylinders
6 cylinders
8 cylinders
I imagine this if-else statement will be very long since there are 38 car makers/manufacturers in the dataset. Any guidance is appreciated.
OK, here's one piece of the puzzle to get you started. It identifies cases where there are no missing values.
data want;
set have;
nmiss(of _numeric_) = 0 and cmiss(of _character_) = 0 then return;
* add your IF THEN statements here;
run;
Because of the RETURN statement, any observations with no missing values will bypass any IF THEN statements.
Thank you. Is there an optimal way where I can shorten the following SAS code? The idea is to fill in the missing make values using the description column by matching the automaker keyword all while keeping the non-missing model values intact.
Based on the results, it seems to work.
DATA WORK.usedcars6;
SET WORK.usedcars5;
IF make='' AND FIND(description,'ford') THEN make='ford';
ELSE IF make='' AND FIND(description,'chevy') or FIND(description,'chevrolet') THEN make='chevrolet';
ELSE IF make='' AND FIND(description,'toyota') THEN make='toyota';
ELSE IF make='' AND FIND(description,'honda') THEN make='honda';
ELSE IF make='' AND FIND(description,'nissan') THEN make='nissan';
ELSE IF make='' AND FIND(description,'jeep') THEN make='jeep';
ELSE IF make='' AND FIND(description,'ram') THEN make='ram';
ELSE IF make='' AND FIND(description,'gmc') THEN make='gmc';
ELSE IF make='' AND FIND(description,'bmw') THEN make='bmw';
ELSE IF make='' AND FIND(description,'dodge') THEN make='dodge';
ELSE IF make='' AND FIND(description,'mercedes-benz') OR FIND(description,'mercedes') OR FIND(description,'mercedes benz') OR FIND(description,'benz') THEN make='mercedes-benz';
ELSE IF make='' AND FIND(description,'hyundai') THEN make='hyundai';
ELSE IF make='' AND FIND(description,'subaru') THEN make='subaru';
ELSE IF make='' AND FIND(description,'lexus') THEN make='lexus';
ELSE IF make='' AND FIND(description,'kia') THEN make='kia';
ELSE IF make='' AND FIND(description,'audi') THEN make='audi';
ELSE IF make='' AND FIND(description,'cadillac') THEN make='cadillac';
ELSE IF make='' AND FIND(description,'acura') THEN make='acura';
ELSE IF make='' AND FIND(description,'chrysler') THEN make='chrysler';
ELSE IF make='' AND FIND(description,'mazda') THEN make='mazda';
ELSE IF make='' AND FIND(description,'buick') THEN make='buick';
ELSE IF make='' AND FIND(description,'infiniti') THEN make='infiniti';
ELSE IF make='' AND FIND(description,'lincoln') THEN make='lincoln';
ELSE IF make='' AND FIND(description,'volvo') THEN make='volvo';
ELSE IF make='' AND FIND(description,'mitsubishi') THEN make='mitsubishi';
ELSE IF make='' AND FIND(description,'mini') THEN make='mini';
ELSE IF make='' AND FIND(description,'jaguar') THEN make='jaguar';
ELSE IF make='' AND FIND(description,'pontiac') THEN make='pontiac';
ELSE IF make='' AND FIND(description,'porsche') THEN make='porsche';
ELSE IF make='' AND FIND(description,'saturn') THEN make='saturn';
ELSE IF make='' AND FIND(description,'mercury') THEN make='mercury';
ELSE IF make='' AND FIND(description,'alfa-romeo') OR FIND(description,'alfa romeo') THEN make='alfa-romeo';
ELSE IF make='' AND FIND(description,'tesla') THEN make='tesla';
ELSE IF make='' AND FIND(description,'fiat') THEN make='fiat';
RUN;
Mind that fuel could change the nb of cylinders.
I'd do this using a SQL left join. Maybe something like:
select base.MAKE
, base MODEL
, coalesce(base.CYLINDERS,cyl.CYLINDERS) as CYLINDER
from TABLE base
left join
(select unique MAKE, MODEL, CYLINDER where CYLINDERS ne ' ') cyl
after ensuring the right table only provides one value per make/model
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.