Solved: Re: count within groups

sophia_SAS · Posted 05-18-2012 03:01 PM

SAS Community-

I would like to know how I can perform preliminary analyses (e.g. frequencies, mean) within groups in my dataset. For example, I have the below following dataset.

data have;

input ID $ Group $ Status;

datalines;

1A A 1

2A A .

3A A .

1B B 0

2B B .

1C C 0

1D D 1

2D D 1

3D D .

1E E .

2E E 0

;

run;

Assume the following points.

1. Status is a binary variable (1=yes, 0=no).

2. Assume that where status = ., that group takes on the status value of whatever value is listed (i.e. status - group A =1, status - group E = 0).

3. Even if there are two or more status values per group (i.e. Group D), the value for the entire group is what is listed assuming no discordant status values within groups (i.e. status - group D = 1).

I would like to answer two questions.

Q1) How many groups are present with a count ge 2 ?

Q2) Of those groups with a count ge 2, what is the frequency of the status values?

For this example dataset, the answers to those two questions would be:

A1) 4 groups (A, B, D, E)

A2) Status 0 = 2 (50%), Status 1 = 2 (50%);

I hope this makes sense! Thank you in advance for your help! This one has been causing me some headaches!

Haikuo · Posted 05-18-2012 03:55 PM

How about some proc SQL:

data have;

input ID $ Group $ Status;

datalines;

1A A 1

2A A .

3A A .

1B B 0

2B B .

1C C 0

1D D 1

2D D 1

3D D .

1E E .

2E E 0

;

run;

proc sql;

select count(distinct group) as ctg, sum(status=1)/count(status) as Percent_1 format=percent10.2,

sum(status=0)/count(status) as Percent_0 format=percent10.2

from have

where group in (

select group from have

group by group

having count(id)>=2)

;

quit;

Haikuo

View solution in original post

Reeza · Posted 05-18-2012 03:37 PM

You need to look at proc freq for #1.

Proc freq data=have;

table group/out=more_than_2 (where=(count>=2));

run;

For 2 you first need to fill in the missing values for the status (use a retain statement or lag value).

Then run a proc freq such as:

proc freq data=have;

table group*status/out=more_than_2 (where=(count>=2));

run;

Haikuo · Posted 05-18-2012 03:55 PM

How about some proc SQL:

data have;

input ID $ Group $ Status;

datalines;

1A A 1

2A A .

3A A .

1B B 0

2B B .

1C C 0

1D D 1

2D D 1

3D D .

1E E .

2E E 0

;

run;

proc sql;

select count(distinct group) as ctg, sum(status=1)/count(status) as Percent_1 format=percent10.2,

sum(status=0)/count(status) as Percent_0 format=percent10.2

from have

where group in (

select group from have

group by group

having count(id)>=2)

;

quit;

Haikuo

sophia_SAS · Posted 05-18-2012 04:21 PM

Thanks both methods are great. The proc sql step is super!

Linlin · Posted 05-19-2012 09:06 PM

Hi Haikuo,

in your code, count(status)=5 because the two '1's in group D. I believe count(status) should=4 .

Thanks - Linlin

PGStats · Posted 05-18-2012 11:33 PM

Hello Sophia, if I understood your question correctly, Haikuo's query doesn't quite give the right answer. It counts the proportion of statuses that are either 0 or one, it doesn't count the proportion of groups with either statuses. In fact, it doesn't provide the expected answer to question 2. I propose an alternative query :

data have;
input ID $ Group $ Status;
datalines;
1A A 1
2A A .
3A A .
1B B 0
2B B .
1C C 0
1D D 1
2D D 1
3D D .
1E E .
2E E 0
;

proc sql;

title "PG's query";
select count(group) as nbGroups,

sum(groupStatus=0)/count(groupStatus) as pct_No format=percent7.1,
sum(groupStatus=1)/count(groupStatus) as pct_Yes format=percent7.1
from (select group, max(status) as groupStatus, count(ID) as n from have group by group)
where n >= 2;

title "Haikuo's query";
select count(distinct group) as ctg,
sum(status=0)/count(status) as Percent_0 format=percent10.2,
sum(status=1)/count(status) as Percent_1 format=percent10.2
from have
where group in (
select group from have
group by group
having count(id)>=2);

quit;

PG

Haikuo · Posted 05-19-2012 09:49 AM

PG,

Most of the times I 'd agree with you. However, according to my understanding of OP's intention, there will be 3 x'1's and 2 '0's before computing the percentage.

While I don't understand why you applied max(), what if there are '0' and '1' within the same group?

Well, ultimately it is up to OP to judge.

my 2 cents,

Haikuo

PGStats · Posted 05-19-2012 07:15 PM

You're right Haikuo, we perceive Sophia's request differently. Your answer was totally appropriate. What made me post an alternative was the fact that your results didn't match Sophia's expectations (A2). I wondered what was going on when Sophia wrote that your SQL step was super, without explaining the discrepancy.

Cheers.

PG

sophia_SAS · Posted 05-21-2012 07:33 AM

PG and Hai.kuo,

PG is correct in interpreting my question 2 as it was to count the status values within groups - not count the individual values per se. So the proportion I want use only includes the 2 - 1's and 2 - 0's. Based on this, PG's code does work.

However, I was able to successful apply Hai.kuo's code by simply adding a proc sort - nodupkey command before the proc sql. (I am working with two datasets, 1 with the status values included and the other with just the remaining group members. So the proc sort-no dupkey eliminated instances where there were two values listed per group.) I probably should have made that more clear in my OP.

One other question though - in the last line of either codes "(having count(id)>=2)" - you indicate that the variable is ID. Why not use the group variable? I am looking for groups that have a count ge 2 and not IDs?

Thank you so much for all over your help. This was really a crucial part of my code and I couldn't seem to figure out how to solve it! Go SAS community!

Sophia

Haikuo · Posted 05-21-2012 08:14 AM

Hi Sophia,

'Group' instances is addressed by its' contents, 'id' seems to be good candidate as being non-missing. You can actually count the instances of certain 'group' by using structure like: sum(group='A'), however then you can't use 'by' to automate the counts across different group.

Regards,

Haikuo

Linlin · Posted 05-19-2012 08:44 PM

My code is long but it is easy to understand: Updated follow PG's comments, Thank you PG

data have;

input ID $ Group $ Status;

datalines;

1A A 1

2A A .

3A A .

1B B 0

2B B .

1C C 0

1D D 1

2D D 1

3D D .

1E E .

2E E 0

;

proc sql;

create table temp as

select group,max(status) as status,count(group) as n

from have

group by group

having n>1;

select count(group) as groups,

sum(Status=0)/calculated groups as pct_No format=percent7.1,

sum(Status=1)/calculated groups as pct_Yes format=percent7.1

from temp;

drop table temp;

quit;

Linlin' guery

groups pct_No pct_Yes

4 50.0% 50.0%

Message was edited by: Linlin

Registration is open

SAS Training: Just a Click Away