SAS Community-
I would like to know how I can perform preliminary analyses (e.g. frequencies, mean) within groups in my dataset. For example, I have the below following dataset.
data have;
input ID $ Group $ Status;
datalines;
1A A 1
2A A .
3A A .
1B B 0
2B B .
1C C 0
1D D 1
2D D 1
3D D .
1E E .
2E E 0
;
run;
Assume the following points.
1. Status is a binary variable (1=yes, 0=no).
2. Assume that where status = ., that group takes on the status value of whatever value is listed (i.e. status - group A =1, status - group E = 0).
3. Even if there are two or more status values per group (i.e. Group D), the value for the entire group is what is listed assuming no discordant status values within groups (i.e. status - group D = 1).
I would like to answer two questions.
Q1) How many groups are present with a count ge 2 ?
Q2) Of those groups with a count ge 2, what is the frequency of the status values?
For this example dataset, the answers to those two questions would be:
A1) 4 groups (A, B, D, E)
A2) Status 0 = 2 (50%), Status 1 = 2 (50%);
I hope this makes sense! Thank you in advance for your help! This one has been causing me some headaches!
How about some proc SQL:
data have;
input ID $ Group $ Status;
datalines;
1A A 1
2A A .
3A A .
1B B 0
2B B .
1C C 0
1D D 1
2D D 1
3D D .
1E E .
2E E 0
;
run;
proc sql;
select count(distinct group) as ctg, sum(status=1)/count(status) as Percent_1 format=percent10.2,
sum(status=0)/count(status) as Percent_0 format=percent10.2
from have
where group in (
select group from have
group by group
having count(id)>=2)
;
quit;
Haikuo
You need to look at proc freq for #1.
Proc freq data=have;
table group/out=more_than_2 (where=(count>=2));
run;
For 2 you first need to fill in the missing values for the status (use a retain statement or lag value).
Then run a proc freq such as:
proc freq data=have;
table group*status/out=more_than_2 (where=(count>=2));
run;
How about some proc SQL:
data have;
input ID $ Group $ Status;
datalines;
1A A 1
2A A .
3A A .
1B B 0
2B B .
1C C 0
1D D 1
2D D 1
3D D .
1E E .
2E E 0
;
run;
proc sql;
select count(distinct group) as ctg, sum(status=1)/count(status) as Percent_1 format=percent10.2,
sum(status=0)/count(status) as Percent_0 format=percent10.2
from have
where group in (
select group from have
group by group
having count(id)>=2)
;
quit;
Haikuo
Thanks both methods are great. The proc sql step is super!
Hi Haikuo,
in your code, count(status)=5 because the two '1's in group D. I believe count(status) should=4 .
Thanks - Linlin
Hello Sophia, if I understood your question correctly, Haikuo's query doesn't quite give the right answer. It counts the proportion of statuses that are either 0 or one, it doesn't count the proportion of groups with either statuses. In fact, it doesn't provide the expected answer to question 2. I propose an alternative query :
data have;
input ID $ Group $ Status;
datalines;
1A A 1
2A A .
3A A .
1B B 0
2B B .
1C C 0
1D D 1
2D D 1
3D D .
1E E .
2E E 0
;
proc sql;
title "PG's query";
select count(group) as nbGroups,
sum(groupStatus=0)/count(groupStatus) as pct_No format=percent7.1,
sum(groupStatus=1)/count(groupStatus) as pct_Yes format=percent7.1
from (select group, max(status) as groupStatus, count(ID) as n from have group by group)
where n >= 2;
title "Haikuo's query";
select count(distinct group) as ctg,
sum(status=0)/count(status) as Percent_0 format=percent10.2,
sum(status=1)/count(status) as Percent_1 format=percent10.2
from have
where group in (
select group from have
group by group
having count(id)>=2);
quit;
PG
PG,
Most of the times I 'd agree with you. However, according to my understanding of OP's intention, there will be 3 x'1's and 2 '0's before computing the percentage.
While I don't understand why you applied max(), what if there are '0' and '1' within the same group?
Well, ultimately it is up to OP to judge.
my 2 cents,
Haikuo
You're right Haikuo, we perceive Sophia's request differently. Your answer was totally appropriate. What made me post an alternative was the fact that your results didn't match Sophia's expectations (A2). I wondered what was going on when Sophia wrote that your SQL step was super, without explaining the discrepancy.
Cheers.
PG
PG and Hai.kuo,
PG is correct in interpreting my question 2 as it was to count the status values within groups - not count the individual values per se. So the proportion I want use only includes the 2 - 1's and 2 - 0's. Based on this, PG's code does work.
However, I was able to successful apply Hai.kuo's code by simply adding a proc sort - nodupkey command before the proc sql. (I am working with two datasets, 1 with the status values included and the other with just the remaining group members. So the proc sort-no dupkey eliminated instances where there were two values listed per group.) I probably should have made that more clear in my OP.
One other question though - in the last line of either codes "(having count(id)>=2)" - you indicate that the variable is ID. Why not use the group variable? I am looking for groups that have a count ge 2 and not IDs?
Thank you so much for all over your help. This was really a crucial part of my code and I couldn't seem to figure out how to solve it! Go SAS community!
Sophia
Hi Sophia,
'Group' instances is addressed by its' contents, 'id' seems to be good candidate as being non-missing. You can actually count the instances of certain 'group' by using structure like: sum(group='A'), however then you can't use 'by' to automate the counts across different group.
Regards,
Haikuo
My code is long but it is easy to understand: Updated follow PG's comments, Thank you PG
data have;
input ID $ Group $ Status;
datalines;
1A A 1
2A A .
3A A .
1B B 0
2B B .
1C C 0
1D D 1
2D D 1
3D D .
1E E .
2E E 0
;
proc sql;
create table temp as
select group,max(status) as status,count(group) as n
from have
group by group
having n>1;
select count(group) as groups,
sum(Status=0)/calculated groups as pct_No format=percent7.1,
sum(Status=1)/calculated groups as pct_Yes format=percent7.1
from temp;
drop table temp;
quit;
Linlin' guery
groups pct_No pct_Yes
4 50.0% 50.0%
Message was edited by: Linlin
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.