I'm trying to predict the value of a variable using either SAS EG or Enterprise Miner. I have some missing data for the StackHeight Variable shown below. Using the Sku and Supplier I can predict what the stack height might be. How would I do this?
data test;
input sku $ 1-3 Supplier $ 5-12 StackHeight 13 ;
datalines;
123 Sony 5
123 Sony 5
123 Sony
124 LG 4
124 LG 3
124 LG
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba
;
run;
Do you mean:
data test;
input  StackHeight 13  @;
if  StackHeight not(missing) then 
input sku $ 1-3 Supplier $ 5-12 StackHeight 13 ;
else delete;  /* ignore lines when missing */
datalines;
123 Sony 5
123 Sony 5
123 Sony
124 LG 4
124 LG 3
124 LG
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba
;
run;You will get same result with:
data test;
 input ... ; /* your code */
 if missing(StackHeight ) then delete;
datalines;
  .... your data ...
run;
No, I want to keep the rows with the missing stack height values. What I want to do is predict what they might be using the values that are populated for the other rows.
@SAS666 wrote:
I'm trying to predict the value of a variable using either SAS EG or Enterprise Miner. I have some missing data for the StackHeight Variable shown below. Using the Sku and Supplier I can predict what the stack height might be. How would I do this?
data test;
input sku $ 1-3 Supplier $ 5-12 StackHeight 13 ;
datalines;
123 Sony 5
123 Sony 5
123 Sony
124 LG 4
124 LG 3
124 LG
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba
;
run;
I don't think this is a SAS issue at all. I think the problem is there no way to "predict" the value of a missing in this situation if all you have is the SKU and Supplier (at least as far as I understand the meaning of the word "predict"). I mean, forget SAS for a second, let's just talk concepts, how would you, @SAS666, do such a prediction if all you had was pencil and paper???\
I would also like to state that it is easy to "fill in" a number when StackHeight is missing, but that isn't the same as a "prediction". So what do you mean?
Well, in this case Suppliers usually use the same stack height across their products. There are actually a few more variables I would use to make the prediction. In this case it is for TV size. So if a TV has a sku of 123 and supplier of LG and is 50-60 inches tall, I'd look to see what LG usually stacks for TVs in that size range and then input that value. That's how i've been doing it manually. I'm trying to figure out how to do it in SAS so I can do it on a larger scale.
So in my example data I'd say the missing Sony value should be 5 and the missing LG value should be 3.75 and Toshiba would be 5.
Does that make sense?
So in my example data I'd say the missing Sony value should be 5 and the missing LG value should be 3.75 and Toshiba would be 5.
It would have been very helpful to both of us if you stated this in your first post. If you say LG Value should be 3 2/3 (3.667), then I understand what you are doing and the following code ought to work. If you really meant LG value is 3.75 then I don't know how you got that.
UNTESTED CODE
proc glm data=test;
    class sku supplier;  
    model stackheight=sku*supplier;
    output out=pred predicted=predicted;
run;
quit;
@SAS666 wrote:
Well, in this case Suppliers usually use the same stack height across their products. There are actually a few more variables I would use to make the prediction. In this case it is for TV size. So if a TV has a sku of 123 and supplier of LG and is 50-60 inches tall,
There's nothing in your sample data about height....
Given what you've shown, you could consider PROC STDIZE to replace the missing with the mean for the group, or some other metric (median/max). See an example here and check the documentation for the other options.
https://gist.github.com/statgeek/cbd6bd253b4ab4828dd71d0ce3d17199
For numerical values, fill in with the most popular value, i.e. the mode
data test;
input sku $ Supplier $ StackHeight;
datalines;
123 Sony 5
123 Sony 5
123 Sony .
124 LG 4
124 LG 3
124 LG .
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba .
;
proc means data=test noprint;
where StackHeight is not missing;
by sku supplier notsorted;
var StackHeight;
output out=modes mode=mode;
run;
proc sql;
update test
set StackHeight = 
    (select mode 
     from modes 
     where sku=test.sku and supplier=test.supplier)
where StackHeight is missing;
quit; 
For categorical values, you have to find the most popular value yourself:
data test;
input sku $ Supplier $ StackHeight $;
datalines;
123 Sony 5
123 Sony 5
123 Sony .
124 LG 4
124 LG 3
124 LG .
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba .
;
proc sort data=test; by sku supplier StackHeight; run;
proc sql;
create table freqs as
select sku, supplier, StackHeight, count(*) as n
from test where StackHeight is not missing
group by sku, supplier, StackHeight
order by sku, supplier, n desc;
quit;
data modes;
set freqs; by sku supplier;
if first.supplier;
drop n;
run;
proc sql;
update test
set StackHeight = 
    (select StackHeight 
     from modes 
     where sku=test.sku and supplier=test.supplier)
where StackHeight is missing;
quit; 
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
