Hi,
I have a dataset like the following:
Company | VAR1 | VAR2 | VAR3 | VAR4 | VAR5 |
---|---|---|---|---|---|
Company1 | 0 | 28 | 0 | 0 | 0 |
Company2 | 4 | 0 | 0 | 12 | 19 |
Company3 | 7 | 0 | 6 | 55 | 25 |
Company4 | 8 | 0 | 23 | 0 | 7 |
Company5 | 10 | 15 | 10 | 34 | 52 |
The dataset has a total of 160 variables, of which I want to find summary statistics for each variable. However, I first want to determine the number of non-zero observations for each variable. To do this with a small number of variables I would probably do the following:
Proc sql;
create table want as
select *,
sum(case when VAR1 ne 0 then 1 else 0 end) as count1,
sum(case when VAR2 ne 0 then 1 else 0 end) as count2
from have;
Quit;
However, this is code is not as useful when I have 160 variables I want to get counts for. Is there an easier way this can be done? I assume this can be done with proc freq or proc summary, but am not very familiar with these procs. Keep in mind I also want to keep the values for each variable as I want to eventually gather summary statistics for each variable. Eventually I would want my final dataset to look something like this:
Variable | Number of non-zero observations for each variable | Min | Max | Mean | Median |
---|---|---|---|---|---|
Var1 | 4 | ||||
Var2 | 2 | ||||
Var3 | 3 | ||||
Var4 | 3 | ||||
Var5 | 4 | ||||
Var6 | ...etc. | ||||
Var7 | ...etc. |
I can probably handle the transposing and re-arranging the data from a proc means output, so I was mostly wondering what the best way to get the count of non-zero observations for each variable.
Thank you.
OK, in that case ...
There is little to change about the PROC MEANS. It can run on your original data set, but does not require the N statistic as part of the output.
PROC FREQ can generate the counts, but it would probably be easier to use a DATA step:
data want;
set have end=done;
array in {160} var1-var160;
array out {160} count1-count160;
retain count1-count160 0;
do _i_=1 to 160;
if in{_i_} not in (., 0) then out{_i_} + 1;
end;
if done;
keep count1-count160;
run;
So already the code gets a bit complex. Then there is the requirement to combine both sets of results, which uses perhaps more complex tools. How comfortable are you with complex code? Would you prefer to drop the requirement about just counting the non-zeros? Either path is possible.
There are ways to do this, but first you have to clarify what you want. Are the Min, Max, Mean, and Median supposed to be based on the non-zero values only? That's the easiest case of all. Just change all zeros to missing values:
data want;
set have;
array vars {160} var1-var160;
do _i_=1 to 160;
if vars{_i_}=0 then vars{_i_}=.;
end;
run;
proc means data=want n min max mean median;
var var1-var160;
run;
If you want statistics to include all the zeros, that will take a little more work, but is still fairly easy. So what are you looking for?
Good luck.
Thank you for the response. I would like to include any zero or negative values in my summary statistics, but have a count of how many observations were not equal to zero for each variable.
OK, in that case ...
There is little to change about the PROC MEANS. It can run on your original data set, but does not require the N statistic as part of the output.
PROC FREQ can generate the counts, but it would probably be easier to use a DATA step:
data want;
set have end=done;
array in {160} var1-var160;
array out {160} count1-count160;
retain count1-count160 0;
do _i_=1 to 160;
if in{_i_} not in (., 0) then out{_i_} + 1;
end;
if done;
keep count1-count160;
run;
So already the code gets a bit complex. Then there is the requirement to combine both sets of results, which uses perhaps more complex tools. How comfortable are you with complex code? Would you prefer to drop the requirement about just counting the non-zeros? Either path is possible.
Thank you. This should work. I can probably get my data into the final desired format from the code provided by you and ballerdw + Jag.
I believe Astounding has provided you the right answer. The below will generate the output similar to ballardw but only nonzero obs count. Hope it helps
Please try
data have;
input Company$ VAR1 VAR2 VAR3 VAR4 VAR5;
cards;
Company1 0 28 0 0 0
Company2 4 0 0 12 19
Company3 7 0 6 55 25
Company4 8 0 23 0 7
Company5 10 15 10 34 52
;
proc transpose data=have;
run;
data want;
set data1;
array cnt col1-col99;
array cnts cols1-cols99;
do over cnt;
if cnt in (0,.) then cnts=1;
else cnts=.;
allcnts=nmiss(of cols1-cols99);
end;
drop col: cols:;
run;
Thanks,
Jag
And by non- zero do you have an negative values to consider?
If you want a count one way is a custom format and proc freq/report/tabulate
proc format;
value nonzero
0='Zero'
0<-high = 'Nonzero' /* assumes all the non-zero values of interest are positive*/
;
run;
proc freq data= have;
tables _numeric_ ;
format _numeric_ nonzero.;
run;
Or any list of variables will work.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.