BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Vicente95
Obsidian | Level 7

Hello Team,

 

I have a used cars dataset where there are hundreds of thousands of missing values. The dataset looks something like this (I filtered the manufacturer to only show toyota for brevity's sake)

 

regionpriceyearmanufacturermodelconditioncylindersfuelodometertitle_statustransmissiondrivesizetypepaint_colorstateposting_date
ventura county $              7,8002007toyota   gas10000000cleanautomatic    ca2021-04-30T11:48:47-0700
south florida $                14,4982007toyotafj cruiser 4x4  diesel7777777cleanautomatic    fl2021-05-01T07:38:28-0400
flagstaff / sedona $                   3,9001998toyota4-runner sr5good6 cylindersgas349000cleanautomaticrwdmid-sizeSUVgreenaz2021-04-05T09:56:10-0700
phoenix $                   3,5001998toyota4runnergood6 cylindersgas347000cleanautomaticrwd SUVsilveraz2021-05-03T11:27:32-0700
colorado springs $                   6,5001998toyota good8 cylindersgas345000cleanautomatic4wdfull-sizeSUVgreyco2021-04-28T10:05:44-0600
seattle-tacoma $                   1,1502009toyotapriusfair hybrid345000cleanautomatic  hatchbackbluewa2021-05-03T10:06:08-0700
knoxville $                   2,0001998toyotacamry cegood4 cylindersgas344200cleanautomatic  sedan tn2021-05-03T22:07:49-0400
knoxville $                   2,0001998toyotacamry cegood4 cylindersgas344200cleanautomatic  sedan tn2021-04-30T21:54:27-0400
knoxville $                   2,0001998toyotacamry cegood4 cylindersgas344200cleanautomatic  sedan tn2021-04-26T13:20:34-0400
knoxville $                   2,2001998toyotacamry cegood4 cylindersgas344200cleanautomatic  sedan tn2021-04-23T14:43:32-0400
SF bay area $                   6,9002001toyota4runner 6 cylindersgas342200cleanautomatic    ca2021-05-03T18:18:25-0700
champaign urbana $                   1,0001997toyotacamryfair4 cylindersgas342000cleanautomaticfwdfull-sizesedanredil2021-04-08T16:24:29-0500
kansas city, MO $                   2,9962000toyotatundra 2wd truckexcellent6 cylindersgas342000cleanautomaticrwd pickupwhiteks2021-04-26T16:13:17-0500
fort collins / north CO $                   2,0001999toyota4runnerfair gas340500cleanautomatic4wd SUVwhiteco2021-04-07T10:55:29-0600
orange county $                   8,6002003toyotatundra 8 cylindersgas340000cleanautomatic    ca2021-04-28T15:54:43-0700

 

I need to keep the existing rows that have no missing values intact and untouched, especially the model and manufacturer columns while replacing all the missing values for the engine using the make/model column. Note, all values/observations are already lower-case so there is no need to perform any sort of upper and lower casing.

 

My goal is to use an if-else statement in conjunction with a wildcard and list to fill in the missing values in the cylinders column using the make/model columns. The idea is to fill in missing engine cylinder values based on the car model. There are some missing values for the car model which I predict will cause problems. 

 

Here is a sample logic that I want to perform for Toyota. If the SAS code works for Toyota, I can apply this to other car makers such as Honda, BMW, Lexus, Mercedes-Benz, etc. I can also use the same logic to work with the transmission, drive train, size, and type columns. 

 

4 cylinders

  • If the model starts with (is in or like) "camry", "corolla", "rav4", "prius", "yaris", "matrix", "echo", "previa"; and
  • The cylinders is missing/blank; then 
  • Cylinders equal "4 cylinders"

6 cylinders

  • Else if the model starts with "avalon", "highlander", "4runner", "sienna", "tacoma", "highlander", "fj", "venza", "cruiser"; and
  • The cylinders is missing/blank; then
  • Cylinders equal "6 cylinders"

8 cylinders

  • Else if the model starts with "tundra", "sequoia"; and
  • The cylinders is missing/blank; then
  • Cylinders equal "8 cylinders"

I imagine this if-else statement will be very long since there are 38 car makers/manufacturers in the dataset. Any guidance is appreciated. 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star
Slightly shorter, and easier to read:

if missing (make) then do;
if find(description,......
(Many times)
end;

View solution in original post

4 REPLIES 4
Astounding
PROC Star

OK, here's one piece of the puzzle to get you started.  It identifies cases where there are no missing values.

 

data want;
   set have;
   nmiss(of _numeric_) = 0 and cmiss(of _character_) = 0 then return;
   * add your IF THEN statements here;
run;

Because of the RETURN statement, any observations with no missing values will bypass any IF THEN statements.

Vicente95
Obsidian | Level 7

Thank you. Is there an optimal way where I can shorten the following SAS code? The idea is to fill in the missing make values using the description column by matching the automaker keyword all while keeping the non-missing model values intact. 

 

Based on the results, it seems to work. 

 

DATA WORK.usedcars6;
	SET WORK.usedcars5;
IF make='' AND FIND(description,'ford') THEN make='ford';
ELSE IF make='' AND FIND(description,'chevy') or FIND(description,'chevrolet') THEN make='chevrolet';
ELSE IF make='' AND FIND(description,'toyota') THEN make='toyota';
ELSE IF make='' AND FIND(description,'honda') THEN make='honda';
ELSE IF make='' AND FIND(description,'nissan') THEN make='nissan';
ELSE IF make='' AND FIND(description,'jeep') THEN make='jeep';
ELSE IF make='' AND FIND(description,'ram') THEN make='ram';
ELSE IF make='' AND FIND(description,'gmc') THEN make='gmc';
ELSE IF make='' AND FIND(description,'bmw') THEN make='bmw';
ELSE IF make='' AND FIND(description,'dodge') THEN make='dodge';
ELSE IF make='' AND FIND(description,'mercedes-benz') OR FIND(description,'mercedes') OR FIND(description,'mercedes benz') OR FIND(description,'benz') THEN make='mercedes-benz';
ELSE IF make='' AND FIND(description,'hyundai') THEN make='hyundai';
ELSE IF make='' AND FIND(description,'subaru') THEN make='subaru';
ELSE IF make='' AND FIND(description,'lexus') THEN make='lexus';
ELSE IF make='' AND FIND(description,'kia') THEN make='kia';
ELSE IF make='' AND FIND(description,'audi') THEN make='audi';
ELSE IF make='' AND FIND(description,'cadillac') THEN make='cadillac';
ELSE IF make='' AND FIND(description,'acura') THEN make='acura';
ELSE IF make='' AND FIND(description,'chrysler') THEN make='chrysler';
ELSE IF make='' AND FIND(description,'mazda') THEN make='mazda';
ELSE IF make='' AND FIND(description,'buick') THEN make='buick';
ELSE IF make='' AND FIND(description,'infiniti') THEN make='infiniti';
ELSE IF make='' AND FIND(description,'lincoln') THEN make='lincoln';
ELSE IF make='' AND FIND(description,'volvo') THEN make='volvo';
ELSE IF make='' AND FIND(description,'mitsubishi') THEN make='mitsubishi';
ELSE IF make='' AND FIND(description,'mini') THEN make='mini';
ELSE IF make='' AND FIND(description,'jaguar') THEN make='jaguar';
ELSE IF make='' AND FIND(description,'pontiac') THEN make='pontiac';
ELSE IF make='' AND FIND(description,'porsche') THEN make='porsche';
ELSE IF make='' AND FIND(description,'saturn') THEN make='saturn';
ELSE IF make='' AND FIND(description,'mercury') THEN make='mercury';
ELSE IF make='' AND FIND(description,'alfa-romeo') OR FIND(description,'alfa romeo') THEN make='alfa-romeo';
ELSE IF make='' AND FIND(description,'tesla') THEN make='tesla';
ELSE IF make='' AND FIND(description,'fiat') THEN make='fiat'; RUN;

 

Astounding
PROC Star
Slightly shorter, and easier to read:

if missing (make) then do;
if find(description,......
(Many times)
end;
ChrisNZ
Tourmaline | Level 20

Mind that fuel could change the nb of cylinders.

I'd do this using a SQL left join. Maybe something like:

select base.MAKE

     , base MODEL

     , coalesce(base.CYLINDERS,cyl.CYLINDERS) as CYLINDER

from TABLE base

     left join

     (select unique MAKE, MODEL, CYLINDER where CYLINDERS ne ' ') cyl

after ensuring the right table only provides one value per make/model

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 634 views
  • 3 likes
  • 3 in conversation