Hi, Is there a way I can detect the % of the number of observations that have a valid (not missing or >0) value in a series of variables and using a By grouping.. Here's what I mean..
HAVE
Store Product Var1 Var2 Var3 Var4
A A1 5 . 6 3
A A1 7 6 0 9
A A1 6 . 4 .
A A2 . 5 3 3
A A2 5 4 7 5
B B1 3 . 0 8
B B1 . 5 6 7
B B2 . 5 8 9
Now if we look at the above data and see that for each Sore and Product, I want to see how many
. and 0's there are for only the Var1 to Var3 in total.. So what I'd like to see is:
Store Product %-missing
A A1 33.33% <- 3/9 - highlighted in red
A A2 16.66% <- 1/6 - highlighted in blue
B B1 50% <- 3/6 - highlighted in green
B B2 33.33% <- 1/3 - highlighted in purple
Thanks
This is not a very elegant solution as I had to physically set 0's to missing and hardcode the number of variables (3), but if quick and dirty will do. . .
data have (drop = i);
array v (*) v1-v4;
input store $ 1 product $ 3-4 v1-v4;
do i = 1 to dim(v); if v(i) = 0 then v(i) = .; end;
datalines;
A A1 5 . 6 3
A A1 7 6 0 9
A A1 6 . 4 .
A A2 . 5 3 3
A A2 5 4 7 5
B B1 3 . 0 8
B B1 . 5 6 7
B B2 . 5 8 9
;;;;
run;
proc sql;
create table want as
select store,product,(nmiss(v1)+nmiss(v2)+nmiss(v3)) / (count(*)*3) as result
from have
group by store,product
;
quit;
Oh, and result is a proportion not a percent so you'd need to format it how you'd like to see it.
Karl
It would take more than one step because you're looking at multiple variables and by groups.
Why do the calculations use Var1, Var2, and Var3, but ignore Var4?
It's not hard, I just want to know the rules first.
Var4 is just a variable that doesn't have any data that's meaningful, or at least that needs to be considered in my analysis..
OK, using only var1-var3 then:
data want;
set have;
by store product;
if first.product then do;
n_missing=0;
n_nonmissing=0;
end;
array keyvars {3} var1-var3;
do _n_=1 to 3;
if keyvars{_n_} in (0, .) then n_missing + 1;
else n_nonmissing + 1;
end;
if last.product;
pct_missing = 100 * n_missing / (n_missing + n_nonmissing);
keep store product pct_missing /* possibly keep n_missing and n_nonmissing as well */;
run;
Good luck.
This is not a very elegant solution as I had to physically set 0's to missing and hardcode the number of variables (3), but if quick and dirty will do. . .
data have (drop = i);
array v (*) v1-v4;
input store $ 1 product $ 3-4 v1-v4;
do i = 1 to dim(v); if v(i) = 0 then v(i) = .; end;
datalines;
A A1 5 . 6 3
A A1 7 6 0 9
A A1 6 . 4 .
A A2 . 5 3 3
A A2 5 4 7 5
B B1 3 . 0 8
B B1 . 5 6 7
B B2 . 5 8 9
;;;;
run;
proc sql;
create table want as
select store,product,(nmiss(v1)+nmiss(v2)+nmiss(v3)) / (count(*)*3) as result
from have
group by store,product
;
quit;
Oh, and result is a proportion not a percent so you'd need to format it how you'd like to see it.
Karl
Thanks Karl.. great results..
Karl, I'm getting some weird answers when I'm trying only one variable (such as Price)...
select Store,Product,(nmiss(Price)) / (count(*)*1) as result
as I'd like to see how much of that field is being populated throughout the Stores/Products..
Hmm, I ran the code on your existing v1-v3 from the original post as
select store,product,nmiss(v1) / count(*) as result
from have
group by store,product;
and got the correct answers. (I removed the extraneous parens and the "*1", but it shouldn't make any difference.)
Did you remember to recode 0's as missing on your Price variable before running the SQL query? If not, maybe you could post some sample data that shows the problem and I'll take another look.
Karl
I ran the same code and got the correct results on the test data, but seem sshort on the main data.. I did change the 0's to . ..
I'll try again..
Just to make sure that I'm doing this correctly.. I have 13 varaibles that I need to convert 0's to missing.. so I used this:
array v (*) Price--NHIP;
do i = 1 to dim(v);if v(i) = 0 then v(i) = .;end;
and then;
select Store, Product, (nmiss(Price)) / Count(*)*1) as result
Does this look correct ?
Yes, it looks right (assuming the "group by" is in your actual query). As I said, you can drop the *1 if there's only one variable, but I tried it with and without just to convince myself.
If the results for Price don't "smell" right to you, you might try doing a proc freq on Price, before and after the recode to missing and see if there's something funky going on. Can happen if price is calculated by division and displays formatted as zero but actually is equal to a very small but non-zero number.
I'm stumped.. I treid everything.. it all makes sense, all the other variables by themselves are fine, except this Price..
could it be that Format = COMMA9. and Informat = BEST32. ?
Gosh, I don't think so. I'm late for a meeting, but I'll reflect on it during the boring parts. In the meantime, what makes you think the results you're getting from the query are wrong?
I eyeballed the data and each of the observations has a Price listed, so it would be 100% and another product would be perhaps 90% , but am getting 0% for both..
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.