Solved: Re: detect % of observations missing using BY

podarum · Posted 03-21-2012 12:14 PM

Hi, Is there a way I can detect the % of the number of observations that have a valid (not missing or >0) value in a series of variables and using a By grouping.. Here's what I mean..

HAVE

Store Product Var1 Var2 Var3 Var4

A A1 5 . 6 3

A A1 7 6 0 9

A A1 6 . 4 .

A A2 . 5 3 3

A A2 5 4 7 5

B B1 3 . 0 8

B B1 . 5 6 7

B B2 . 5 8 9

Now if we look at the above data and see that for each Sore and Product, I want to see how many

. and 0's there are for only the Var1 to Var3 in total.. So what I'd like to see is:

Store Product %-missing

A A1 33.33% <- 3/9 - highlighted in red

A A2 16.66% <- 1/6 - highlighted in blue

B B1 50% <- 3/6 - highlighted in green

B B2 33.33% <- 1/3 - highlighted in purple

Thanks

KarlK · Posted 03-21-2012 01:36 PM

This is not a very elegant solution as I had to physically set 0's to missing and hardcode the number of variables (3), but if quick and dirty will do. . .

data have (drop = i);

array v (*) v1-v4;

input store $ 1 product $ 3-4 v1-v4;

do i = 1 to dim(v); if v(i) = 0 then v(i) = .; end;

datalines;

A A1 5 . 6 3

A A1 7 6 0 9

A A1 6 . 4 .

A A2 . 5 3 3

A A2 5 4 7 5

B B1 3 . 0 8

B B1 . 5 6 7

B B2 . 5 8 9

;;;;

run;

proc sql;

create table want as

select store,product,(nmiss(v1)+nmiss(v2)+nmiss(v3)) / (count(*)*3) as result

from have

group by store,product

;

quit;

Oh, and result is a proportion not a percent so you'd need to format it how you'd like to see it.

Karl

View solution in original post

Reeza · Posted 03-21-2012 12:51 PM

It would take more than one step because you're looking at multiple variables and by groups.

Astounding · Posted 03-21-2012 01:16 PM

Why do the calculations use Var1, Var2, and Var3, but ignore Var4?

It's not hard, I just want to know the rules first.

podarum · Posted 03-21-2012 01:39 PM

Var4 is just a variable that doesn't have any data that's meaningful, or at least that needs to be considered in my analysis..

Astounding · Posted 03-21-2012 01:55 PM

OK, using only var1-var3 then:

data want;

set have;

by store product;

if first.product then do;

n_missing=0;

n_nonmissing=0;

end;

array keyvars {3} var1-var3;

do _n_=1 to 3;

if keyvars{_n_} in (0, .) then n_missing + 1;

else n_nonmissing + 1;

end;

if last.product;

pct_missing = 100 * n_missing / (n_missing + n_nonmissing);

keep store product pct_missing /* possibly keep n_missing and n_nonmissing as well */;

run;

Good luck.

KarlK · Posted 03-21-2012 01:36 PM

This is not a very elegant solution as I had to physically set 0's to missing and hardcode the number of variables (3), but if quick and dirty will do. . .

data have (drop = i);

array v (*) v1-v4;

input store $ 1 product $ 3-4 v1-v4;

do i = 1 to dim(v); if v(i) = 0 then v(i) = .; end;

datalines;

A A1 5 . 6 3

A A1 7 6 0 9

A A1 6 . 4 .

A A2 . 5 3 3

A A2 5 4 7 5

B B1 3 . 0 8

B B1 . 5 6 7

B B2 . 5 8 9

;;;;

run;

proc sql;

create table want as

select store,product,(nmiss(v1)+nmiss(v2)+nmiss(v3)) / (count(*)*3) as result

from have

group by store,product

;

quit;

Oh, and result is a proportion not a percent so you'd need to format it how you'd like to see it.

Karl

podarum · Posted 03-21-2012 02:17 PM

Thanks Karl.. great results..

podarum · Posted 03-21-2012 02:25 PM

Karl, I'm getting some weird answers when I'm trying only one variable (such as Price)...

select Store,Product,(nmiss(Price)) / (count(*)*1) as result

as I'd like to see how much of that field is being populated throughout the Stores/Products..

KarlK · Posted 03-21-2012 02:36 PM

Hmm, I ran the code on your existing v1-v3 from the original post as

select store,product,nmiss(v1) / count(*) as result

from have

group by store,product;

and got the correct answers. (I removed the extraneous parens and the "*1", but it shouldn't make any difference.)

Did you remember to recode 0's as missing on your Price variable before running the SQL query? If not, maybe you could post some sample data that shows the problem and I'll take another look.

Karl

podarum · Posted 03-21-2012 02:43 PM

I ran the same code and got the correct results on the test data, but seem sshort on the main data.. I did change the 0's to . ..

I'll try again..

podarum · Posted 03-21-2012 02:53 PM

Just to make sure that I'm doing this correctly.. I have 13 varaibles that I need to convert 0's to missing.. so I used this:

array v (*) Price--NHIP;

do i = 1 to dim(v);if v(i) = 0 then v(i) = .;end;

and then;

select Store, Product, (nmiss(Price)) / Count(*)*1) as result

Does this look correct ?

KarlK · Posted 03-21-2012 03:01 PM

Yes, it looks right (assuming the "group by" is in your actual query). As I said, you can drop the *1 if there's only one variable, but I tried it with and without just to convince myself.

If the results for Price don't "smell" right to you, you might try doing a proc freq on Price, before and after the recode to missing and see if there's something funky going on. Can happen if price is calculated by division and displays formatted as zero but actually is equal to a very small but non-zero number.

podarum · Posted 03-21-2012 03:13 PM

I'm stumped.. I treid everything.. it all makes sense, all the other variables by themselves are fine, except this Price..

could it be that Format = COMMA9. and Informat = BEST32. ?

KarlK · Posted 03-21-2012 03:16 PM

Gosh, I don't think so. I'm late for a meeting, but I'll reflect on it during the boring parts. In the meantime, what makes you think the results you're getting from the query are wrong?

podarum · Posted 03-21-2012 03:19 PM

I eyeballed the data and each of the observations has a Price listed, so it would be 100% and another product would be perhaps 90% , but am getting 0% for both..

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away