Help using Base SAS procedures

count within groups

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 104
Accepted Solution

count within groups

SAS Community-


I would like to know how I can perform preliminary analyses (e.g. frequencies, mean) within groups in my dataset.   For example, I have the below following dataset.


data have;

input   ID $        Group $           Status;

datalines;

          1A               A                    1

          2A               A                    .

          3A               A                    .

          1B               B                    0

          2B               B                    .

          1C               C                    0

          1D               D                    1

          2D               D                    1

          3D               D                    .

          1E               E                    .

          2E               E                    0    

;

run;


Assume the following points.

1. Status is a binary variable (1=yes, 0=no).

2. Assume that  where status = ., that group takes on the status value of whatever value is listed (i.e. status - group A =1, status - group E = 0). 

3. Even if there are two or more status values per group (i.e. Group D), the value for the entire group is what is listed assuming no discordant status values within groups (i.e. status - group D = 1).


I would like to answer two questions.

Q1) How many groups are present with a count ge 2 ? 

Q2) Of those groups with a count ge 2, what is the frequency of the status values?


For this example dataset, the answers to those two questions would be:

A1) 4 groups (A, B, D, E)

A2) Status 0 = 2 (50%), Status 1 = 2 (50%);


I hope this makes sense!  Thank you in advance for your help!  This one has been causing me some headaches!








Accepted Solutions
Solution
‎05-18-2012 03:55 PM
Respected Advisor
Posts: 3,124

Re: count within groups

How about some proc SQL:

data have;

input ID $ Group $ Status;

datalines;

  1A A 1

  2A A .

  3A A .

  1B B 0

  2B B .

  1C C 0

  1D D 1

  2D D 1

  3D D .

  1E E .

  2E E 0 

;

run;

proc sql;

select count(distinct group) as ctg, sum(status=1)/count(status) as Percent_1 format=percent10.2, 

  sum(status=0)/count(status) as Percent_0 format=percent10.2

from have

where group in (

select group from have

group by group

having count(id)>=2)

;

quit;

Haikuo

View solution in original post


All Replies
Super User
Posts: 17,813

Re: count within groups

You need to look at proc freq for #1.

Proc freq data=have;

table group/out=more_than_2 (where=(count>=2));

run;

For 2 you first need to fill in the missing values for the status (use a retain statement or lag value).

Then run a proc freq such as:

proc freq data=have;

table group*status/out=more_than_2 (where=(count>=2));

run;

Solution
‎05-18-2012 03:55 PM
Respected Advisor
Posts: 3,124

Re: count within groups

How about some proc SQL:

data have;

input ID $ Group $ Status;

datalines;

  1A A 1

  2A A .

  3A A .

  1B B 0

  2B B .

  1C C 0

  1D D 1

  2D D 1

  3D D .

  1E E .

  2E E 0 

;

run;

proc sql;

select count(distinct group) as ctg, sum(status=1)/count(status) as Percent_1 format=percent10.2, 

  sum(status=0)/count(status) as Percent_0 format=percent10.2

from have

where group in (

select group from have

group by group

having count(id)>=2)

;

quit;

Haikuo

Frequent Contributor
Posts: 104

Re: count within groups

Thanks both methods are great.  The proc sql step is super!

Super Contributor
Posts: 1,636

Re: count within groups

Hi Haikuo,

in your code, count(status)=5 because the two '1's in group D. I believe count(status) should=4 .

Thanks - Linlin

Respected Advisor
Posts: 4,644

Re: count within groups

Hello Sophia, if I understood your question correctly, Haikuo's query doesn't quite give the right answer. It counts the proportion of statuses that are either 0 or one, it doesn't count the proportion of groups with either statuses. In fact, it doesn't provide the expected answer to question 2. I propose an alternative query :

data have;
input ID $ Group $ Status;
datalines;
  1A A 1
  2A A .
  3A A .
  1B B 0
  2B B .
  1C C 0
  1D D 1
  2D D 1
  3D D .
  1E E .
  2E E 0 
;

proc sql;

title "PG's query";
select count(group) as nbGroups,

  sum(groupStatus=0)/count(groupStatus) as pct_No format=percent7.1,
  sum(groupStatus=1)/count(groupStatus) as pct_Yes format=percent7.1
from (select group, max(status) as groupStatus, count(ID) as n from have group by group)
where n >= 2;

title "Haikuo's query";
select count(distinct group) as ctg,
  sum(status=0)/count(status) as Percent_0 format=percent10.2,
  sum(status=1)/count(status) as Percent_1 format=percent10.2 
from have
where group in (
select group from have
group by group
having count(id)>=2);

quit;

PG

PG
Respected Advisor
Posts: 3,124

Re: count within groups

PG,

Most of the times I 'd agree with you. However, according to my understanding of OP's intention, there will be 3 x'1's and 2 '0's before computing the percentage.

While I don't understand why you applied max(), what if there are '0' and '1' within the same group?

Well, ultimately it is up to OP to judge.

my 2 cents,

Haikuo

Respected Advisor
Posts: 4,644

Re: count within groups

You're right Haikuo, we perceive Sophia's request differently. Your answer was totally appropriate. What made me post an alternative was the fact that your results didn't match Sophia's expectations (A2). I wondered what was going on when Sophia wrote that your SQL step was super, without explaining the discrepancy.

Cheers.

PG

PG
Frequent Contributor
Posts: 104

Re: count within groups

PG and Hai.kuo,

PG is correct in interpreting my question 2 as it was to count the status values within groups - not count the individual values per se.  So the proportion I want use only includes the 2 - 1's and 2 - 0's.  Based on this, PG's code does work.

However, I was able to successful apply Hai.kuo's code by simply adding a proc sort - nodupkey command before the proc sql.  (I am working with two datasets, 1 with the status values included and the other with just the remaining group members. So the proc sort-no dupkey eliminated instances where there were two values listed per group.)  I probably should have made that more clear in my OP.

One other question though - in the last line of either codes "(having count(id)>=2)" - you indicate that the variable is ID.  Why not use the group variable?  I am looking for groups that have a count ge 2 and not IDs?

Thank you so much for all over your help.  This was really a crucial part of my code and I couldn't seem to figure out how to solve it!  Go SAS community!

Sophia

Respected Advisor
Posts: 3,124

Re: count within groups

Hi Sophia,

'Group' instances is addressed by its' contents, 'id' seems to be good candidate as being non-missing. You can actually count the instances of certain 'group' by using structure like: sum(group='A'), however then you can't use 'by'  to automate the counts across different group.

Regards,

Haikuo

Super Contributor
Posts: 1,636

Re: count within groups

My code is long but it is easy to understand: Updated follow PG's comments, Thank you PG

data have;

input ID $ Group $ Status;

datalines;

  1A A 1

  2A A .

  3A A .

  1B B 0

  2B B .

  1C C 0

  1D D 1

  2D D 1

  3D D .

  1E E .

  2E E 0

;

proc sql;

create table temp as

   select group,max(status) as status,count(group) as n

     from have

       group by group

         having n>1;

  select count(group) as groups,

  sum(Status=0)/calculated groups as pct_No format=percent7.1,

  sum(Status=1)/calculated groups as pct_Yes format=percent7.1

    from temp;

  drop table temp;

quit;

                                     Linlin' guery

                           groups   pct_No  pct_Yes

                                     4   50.0%    50.0%

Message was edited by: Linlin

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 527 views
  • 7 likes
  • 5 in conversation