BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sophia_SAS
Obsidian | Level 7

SAS Community-


I would like to know how I can perform preliminary analyses (e.g. frequencies, mean) within groups in my dataset.   For example, I have the below following dataset.


data have;

input   ID $        Group $           Status;

datalines;

          1A               A                    1

          2A               A                    .

          3A               A                    .

          1B               B                    0

          2B               B                    .

          1C               C                    0

          1D               D                    1

          2D               D                    1

          3D               D                    .

          1E               E                    .

          2E               E                    0    

;

run;


Assume the following points.

1. Status is a binary variable (1=yes, 0=no).

2. Assume that  where status = ., that group takes on the status value of whatever value is listed (i.e. status - group A =1, status - group E = 0). 

3. Even if there are two or more status values per group (i.e. Group D), the value for the entire group is what is listed assuming no discordant status values within groups (i.e. status - group D = 1).


I would like to answer two questions.

Q1) How many groups are present with a count ge 2 ? 

Q2) Of those groups with a count ge 2, what is the frequency of the status values?


For this example dataset, the answers to those two questions would be:

A1) 4 groups (A, B, D, E)

A2) Status 0 = 2 (50%), Status 1 = 2 (50%);


I hope this makes sense!  Thank you in advance for your help!  This one has been causing me some headaches!







1 ACCEPTED SOLUTION

Accepted Solutions
Haikuo
Onyx | Level 15

How about some proc SQL:

data have;

input ID $ Group $ Status;

datalines;

  1A A 1

  2A A .

  3A A .

  1B B 0

  2B B .

  1C C 0

  1D D 1

  2D D 1

  3D D .

  1E E .

  2E E 0 

;

run;

proc sql;

select count(distinct group) as ctg, sum(status=1)/count(status) as Percent_1 format=percent10.2, 

  sum(status=0)/count(status) as Percent_0 format=percent10.2

from have

where group in (

select group from have

group by group

having count(id)>=2)

;

quit;

Haikuo

View solution in original post

10 REPLIES 10
Reeza
Super User

You need to look at proc freq for #1.

Proc freq data=have;

table group/out=more_than_2 (where=(count>=2));

run;

For 2 you first need to fill in the missing values for the status (use a retain statement or lag value).

Then run a proc freq such as:

proc freq data=have;

table group*status/out=more_than_2 (where=(count>=2));

run;

Haikuo
Onyx | Level 15

How about some proc SQL:

data have;

input ID $ Group $ Status;

datalines;

  1A A 1

  2A A .

  3A A .

  1B B 0

  2B B .

  1C C 0

  1D D 1

  2D D 1

  3D D .

  1E E .

  2E E 0 

;

run;

proc sql;

select count(distinct group) as ctg, sum(status=1)/count(status) as Percent_1 format=percent10.2, 

  sum(status=0)/count(status) as Percent_0 format=percent10.2

from have

where group in (

select group from have

group by group

having count(id)>=2)

;

quit;

Haikuo

sophia_SAS
Obsidian | Level 7

Thanks both methods are great.  The proc sql step is super!

Linlin
Lapis Lazuli | Level 10

Hi Haikuo,

in your code, count(status)=5 because the two '1's in group D. I believe count(status) should=4 .

Thanks - Linlin

PGStats
Opal | Level 21

Hello Sophia, if I understood your question correctly, Haikuo's query doesn't quite give the right answer. It counts the proportion of statuses that are either 0 or one, it doesn't count the proportion of groups with either statuses. In fact, it doesn't provide the expected answer to question 2. I propose an alternative query :

data have;
input ID $ Group $ Status;
datalines;
  1A A 1
  2A A .
  3A A .
  1B B 0
  2B B .
  1C C 0
  1D D 1
  2D D 1
  3D D .
  1E E .
  2E E 0 
;

proc sql;

title "PG's query";
select count(group) as nbGroups,

  sum(groupStatus=0)/count(groupStatus) as pct_No format=percent7.1,
  sum(groupStatus=1)/count(groupStatus) as pct_Yes format=percent7.1
from (select group, max(status) as groupStatus, count(ID) as n from have group by group)
where n >= 2;

title "Haikuo's query";
select count(distinct group) as ctg,
  sum(status=0)/count(status) as Percent_0 format=percent10.2,
  sum(status=1)/count(status) as Percent_1 format=percent10.2 
from have
where group in (
select group from have
group by group
having count(id)>=2);

quit;

PG

PG
Haikuo
Onyx | Level 15

PG,

Most of the times I 'd agree with you. However, according to my understanding of OP's intention, there will be 3 x'1's and 2 '0's before computing the percentage.

While I don't understand why you applied max(), what if there are '0' and '1' within the same group?

Well, ultimately it is up to OP to judge.

my 2 cents,

Haikuo

PGStats
Opal | Level 21

You're right Haikuo, we perceive Sophia's request differently. Your answer was totally appropriate. What made me post an alternative was the fact that your results didn't match Sophia's expectations (A2). I wondered what was going on when Sophia wrote that your SQL step was super, without explaining the discrepancy.

Cheers.

PG

PG
sophia_SAS
Obsidian | Level 7

PG and Hai.kuo,

PG is correct in interpreting my question 2 as it was to count the status values within groups - not count the individual values per se.  So the proportion I want use only includes the 2 - 1's and 2 - 0's.  Based on this, PG's code does work.

However, I was able to successful apply Hai.kuo's code by simply adding a proc sort - nodupkey command before the proc sql.  (I am working with two datasets, 1 with the status values included and the other with just the remaining group members. So the proc sort-no dupkey eliminated instances where there were two values listed per group.)  I probably should have made that more clear in my OP.

One other question though - in the last line of either codes "(having count(id)>=2)" - you indicate that the variable is ID.  Why not use the group variable?  I am looking for groups that have a count ge 2 and not IDs?

Thank you so much for all over your help.  This was really a crucial part of my code and I couldn't seem to figure out how to solve it!  Go SAS community!

Sophia

Haikuo
Onyx | Level 15

Hi Sophia,

'Group' instances is addressed by its' contents, 'id' seems to be good candidate as being non-missing. You can actually count the instances of certain 'group' by using structure like: sum(group='A'), however then you can't use 'by'  to automate the counts across different group.

Regards,

Haikuo

Linlin
Lapis Lazuli | Level 10

My code is long but it is easy to understand: Updated follow PG's comments, Thank you PG

data have;

input ID $ Group $ Status;

datalines;

  1A A 1

  2A A .

  3A A .

  1B B 0

  2B B .

  1C C 0

  1D D 1

  2D D 1

  3D D .

  1E E .

  2E E 0

;

proc sql;

create table temp as

   select group,max(status) as status,count(group) as n

     from have

       group by group

         having n>1;

  select count(group) as groups,

  sum(Status=0)/calculated groups as pct_No format=percent7.1,

  sum(Status=1)/calculated groups as pct_Yes format=percent7.1

    from temp;

  drop table temp;

quit;

                                     Linlin' guery

                           groups   pct_No  pct_Yes

                                     4   50.0%    50.0%

Message was edited by: Linlin

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 3505 views
  • 7 likes
  • 5 in conversation