I'm trying to predict the value of a variable using either SAS EG or Enterprise Miner. I have some missing data for the StackHeight Variable shown below. Using the Sku and Supplier I can predict what the stack height might be. How would I do this?
data test;
input sku $ 1-3 Supplier $ 5-12 StackHeight 13 ;
datalines;
123 Sony 5
123 Sony 5
123 Sony
124 LG 4
124 LG 3
124 LG
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba
;
run;
Do you mean:
data test;
input StackHeight 13 @;
if StackHeight not(missing) then
input sku $ 1-3 Supplier $ 5-12 StackHeight 13 ;
else delete; /* ignore lines when missing */
datalines;
123 Sony 5
123 Sony 5
123 Sony
124 LG 4
124 LG 3
124 LG
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba
;
run;
You will get same result with:
data test;
input ... ; /* your code */
if missing(StackHeight ) then delete;
datalines;
.... your data ...
run;
No, I want to keep the rows with the missing stack height values. What I want to do is predict what they might be using the values that are populated for the other rows.
@SAS666 wrote:
I'm trying to predict the value of a variable using either SAS EG or Enterprise Miner. I have some missing data for the StackHeight Variable shown below. Using the Sku and Supplier I can predict what the stack height might be. How would I do this?
data test;
input sku $ 1-3 Supplier $ 5-12 StackHeight 13 ;
datalines;
123 Sony 5
123 Sony 5
123 Sony
124 LG 4
124 LG 3
124 LG
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba
;
run;
I don't think this is a SAS issue at all. I think the problem is there no way to "predict" the value of a missing in this situation if all you have is the SKU and Supplier (at least as far as I understand the meaning of the word "predict"). I mean, forget SAS for a second, let's just talk concepts, how would you, @SAS666, do such a prediction if all you had was pencil and paper???\
I would also like to state that it is easy to "fill in" a number when StackHeight is missing, but that isn't the same as a "prediction". So what do you mean?
Well, in this case Suppliers usually use the same stack height across their products. There are actually a few more variables I would use to make the prediction. In this case it is for TV size. So if a TV has a sku of 123 and supplier of LG and is 50-60 inches tall, I'd look to see what LG usually stacks for TVs in that size range and then input that value. That's how i've been doing it manually. I'm trying to figure out how to do it in SAS so I can do it on a larger scale.
So in my example data I'd say the missing Sony value should be 5 and the missing LG value should be 3.75 and Toshiba would be 5.
Does that make sense?
So in my example data I'd say the missing Sony value should be 5 and the missing LG value should be 3.75 and Toshiba would be 5.
It would have been very helpful to both of us if you stated this in your first post. If you say LG Value should be 3 2/3 (3.667), then I understand what you are doing and the following code ought to work. If you really meant LG value is 3.75 then I don't know how you got that.
UNTESTED CODE
proc glm data=test; class sku supplier; model stackheight=sku*supplier; output out=pred predicted=predicted; run; quit;
@SAS666 wrote:
Well, in this case Suppliers usually use the same stack height across their products. There are actually a few more variables I would use to make the prediction. In this case it is for TV size. So if a TV has a sku of 123 and supplier of LG and is 50-60 inches tall,
There's nothing in your sample data about height....
Given what you've shown, you could consider PROC STDIZE to replace the missing with the mean for the group, or some other metric (median/max). See an example here and check the documentation for the other options.
https://gist.github.com/statgeek/cbd6bd253b4ab4828dd71d0ce3d17199
For numerical values, fill in with the most popular value, i.e. the mode
data test;
input sku $ Supplier $ StackHeight;
datalines;
123 Sony 5
123 Sony 5
123 Sony .
124 LG 4
124 LG 3
124 LG .
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba .
;
proc means data=test noprint;
where StackHeight is not missing;
by sku supplier notsorted;
var StackHeight;
output out=modes mode=mode;
run;
proc sql;
update test
set StackHeight =
(select mode
from modes
where sku=test.sku and supplier=test.supplier)
where StackHeight is missing;
quit;
For categorical values, you have to find the most popular value yourself:
data test;
input sku $ Supplier $ StackHeight $;
datalines;
123 Sony 5
123 Sony 5
123 Sony .
124 LG 4
124 LG 3
124 LG .
124 LG 4
125 Toshiba 5
125 Toshiba 5
125 Toshiba .
;
proc sort data=test; by sku supplier StackHeight; run;
proc sql;
create table freqs as
select sku, supplier, StackHeight, count(*) as n
from test where StackHeight is not missing
group by sku, supplier, StackHeight
order by sku, supplier, n desc;
quit;
data modes;
set freqs; by sku supplier;
if first.supplier;
drop n;
run;
proc sql;
update test
set StackHeight =
(select StackHeight
from modes
where sku=test.sku and supplier=test.supplier)
where StackHeight is missing;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.