BookmarkSubscribeRSS Feed
SAS666
Calcite | Level 5

I'm trying to predict the value of a variable using either SAS EG or Enterprise Miner. I have some missing data for the StackHeight Variable shown below. Using the Sku and Supplier I can predict what the stack height might be. How would I do this?

 

data test;
input sku $ 1-3 Supplier $ 5-12 StackHeight 13 ;
datalines;
123 Sony 5
123 Sony 5
123 Sony
124 LG 4
124 LG 3
124 LG
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba
;
run;

9 REPLIES 9
Shmuel
Garnet | Level 18

Do you mean:

data test;
input  StackHeight 13  @;
if  StackHeight not(missing) then 
input sku $ 1-3 Supplier $ 5-12 StackHeight 13 ;
else delete;  /* ignore lines when missing */
datalines;
123 Sony 5
123 Sony 5
123 Sony
124 LG 4
124 LG 3
124 LG
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba
;
run;
Shmuel
Garnet | Level 18

You will get same result with:
data test;
input ... ; /* your code */
if missing(StackHeight ) then delete;

datalines;

  .... your data ...
run;

SAS666
Calcite | Level 5

No, I want to keep the rows with the missing stack height values.  What I want to do is predict what they might be using the values that are populated for the other rows. 

PaigeMiller
Diamond | Level 26

@SAS666 wrote:

I'm trying to predict the value of a variable using either SAS EG or Enterprise Miner. I have some missing data for the StackHeight Variable shown below. Using the Sku and Supplier I can predict what the stack height might be. How would I do this?

 

data test;
input sku $ 1-3 Supplier $ 5-12 StackHeight 13 ;
datalines;
123 Sony 5
123 Sony 5
123 Sony
124 LG 4
124 LG 3
124 LG
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba
;
run;


I don't think this is a SAS issue at all. I think the problem is there no way to "predict" the value of a missing in this situation if all you have is the SKU and Supplier (at least as far as I understand the meaning of the word "predict"). I mean, forget SAS for a second, let's just talk concepts, how would you, @SAS666, do such a prediction if all you had was pencil and paper???\

 

I would also like to state that it is easy to "fill in" a number when StackHeight is missing, but that isn't the same as a "prediction". So what do you mean?

--
Paige Miller
SAS666
Calcite | Level 5

Well, in this case Suppliers usually use the same stack height across their products.  There are actually a few more variables I would use to make the prediction. In this case it is for TV size.  So if a TV has a sku of 123 and supplier of LG and is 50-60 inches tall, I'd look to see what LG usually stacks for TVs in that size range and then input that value. That's how i've been doing it manually.  I'm trying to figure out how to do it in SAS so I can do it on a larger scale. 

 

So in my example data I'd say the missing Sony value should be 5 and the missing LG value should be 3.75 and Toshiba would be 5. 

 

Does that make sense?

PaigeMiller
Diamond | Level 26

So in my example data I'd say the missing Sony value should be 5 and the missing LG value should be 3.75 and Toshiba would be 5. 

 

It would have been very helpful to both of us if you stated this in your first post. If you say LG Value should be 3 2/3 (3.667), then I understand what you are doing and the following code ought to work. If you really meant LG value is 3.75 then I don't know how you got that.

 

UNTESTED CODE

 

proc glm data=test;
    class sku supplier;  
    model stackheight=sku*supplier;
    output out=pred predicted=predicted;
run;
quit;

 

--
Paige Miller
Reeza
Super User

@SAS666 wrote:

Well, in this case Suppliers usually use the same stack height across their products.  There are actually a few more variables I would use to make the prediction. In this case it is for TV size.  So if a TV has a sku of 123 and supplier of LG and is 50-60 inches tall,


 

There's nothing in your sample data about height....

 

Given what you've shown, you could consider PROC STDIZE to replace the missing with the mean for the group, or some other metric (median/max).  See an example here and check the documentation for the other options.

https://gist.github.com/statgeek/cbd6bd253b4ab4828dd71d0ce3d17199

PGStats
Opal | Level 21

For numerical values, fill in with the most popular value, i.e. the mode

 

data test;
input sku $ Supplier $ StackHeight;
datalines;
123 Sony 5
123 Sony 5
123 Sony .
124 LG 4
124 LG 3
124 LG .
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba .
;

proc means data=test noprint;
where StackHeight is not missing;
by sku supplier notsorted;
var StackHeight;
output out=modes mode=mode;
run;

proc sql;
update test
set StackHeight = 
    (select mode 
     from modes 
     where sku=test.sku and supplier=test.supplier)
where StackHeight is missing;
quit; 
PG
PGStats
Opal | Level 21

For categorical values, you have to find the most popular value yourself:

 

data test;
input sku $ Supplier $ StackHeight $;
datalines;
123 Sony 5
123 Sony 5
123 Sony .
124 LG 4
124 LG 3
124 LG .
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba .
;

proc sort data=test; by sku supplier StackHeight; run;

proc sql;
create table freqs as
select sku, supplier, StackHeight, count(*) as n
from test where StackHeight is not missing
group by sku, supplier, StackHeight
order by sku, supplier, n desc;
quit;

data modes;
set freqs; by sku supplier;
if first.supplier;
drop n;
run;

proc sql;
update test
set StackHeight = 
    (select StackHeight 
     from modes 
     where sku=test.sku and supplier=test.supplier)
where StackHeight is missing;
quit; 

PG

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2415 views
  • 0 likes
  • 5 in conversation