BookmarkSubscribeRSS Feed
Calcite | Level 5

I'm pretty new to proc sql, but I was trying to count the number of people. Each unique person can have multiple rows (multiple application submissions). An example of a table below is:  


I want to count the number of people in each Category*Ref_type, so it'll be



I'm not at home right now so I can't check but I remember that I tried doing

proc sql;
create table want as
select Category
,count(unique Person_ID)
from have 
group by Category, Ref_type

which gave me not exactly what I wanted (note that this was for a large dataset, probably 2 million rows). When I removed the "unique" keyword, I got what I wanted.

Can someone explain to me what the code with the "unique" when I'm using it with the count function, and a group by statement, and also what the code without the "unique" function does when I'm counting a specific variable that's not in the group by statement?

Super User Tom
Super User

How many people do you think you have in your fist listing?


Person_ID Category Ref_type Total_amt
100 Green 2 350
100 Blue 2 300
100 Red 3 100
200 Green 1 20
200 Black 3 500
300 Blue 2 200


I see 6 observations for 3 people.


COUNT(*) will count how many observations.

COUNT(PERSON_ID) will count how many observations have a non-missing value of PERSON_ID.

COUNT(DISTINCT PERSON_ID) will count how many people.



Calcite | Level 5
That makes sense - so if PERSON_ID is always not missing then count(*) = count(person_id)? Also, how does that work with the group by statement? I see @JosvanderVelden 's reply, but the post they linked only says there's a caveat with the aggregate function, but doesn't say what the caveat is.
Super User Tom
Super User

GROUP BY separates the observations into groups based on the values of the variables listed.  When you have GROUP BY the aggregate functions, like COUNT(),MAX(),SUM(), etc operate over each group separately.


Also when you have GROUP BY you need to take care what columns (variables) you list in the SELECT clause.

data have;
  input Person_ID $ Category $ Ref_type $ Total_amt;
100 Green 2 350
100 Blue 2 300
100 Red 3 100
200 Green 1 20
200 Black 3 500
300 Blue 2 200

proc sql ;
 select category,ref_type
      , count(*) as nobs
      , count(person_id) as n_with_ids
      , count(person_id) as n_people
      , sum(total_amt) as sum_total_amt
   from have
   group by category,ref_type


Category  Ref_type      nobs  n_with_ids  n_people           amt
Black     3                1           1         1           500
Blue      2                2           2         2           500
Green     1                1           1         1            20
Green     2                1           1         1           350
Red       3                1           1         1           100

But if you include a variable that is neither one of the grouping variables nor an aggregate result then SAS will re-merge the computed aggregate values back with all of the observations selected in that group.  It will tell you this in the LOG.

1178  proc sql ;
1179   select category,ref_type,total_amt
1180        , count(*) as nobs
1181        , count(person_id) as n_with_ids
1182        , count(person_id) as n_people
1183        , sum(total_amt) as sum_total_amt
1184     from have
1185     group by category,ref_type
1186   ;
NOTE: The query requires remerging summary statistics back with the original data.
1187  quit;


Category  Ref_type  Total_amt      nobs  n_with_ids  n_people           amt
Black     3               500         1           1         1           500
Blue      2               200         2           2         2           500
Blue      2               300         2           2         2           500
Green     1                20         1           1         1            20
Green     2               350         1           1         1           350
Red       3               100         1           1         1           100
Calcite | Level 5
Thanks! That's really helpful. I think now I'm just not understanding what the difference between using
count(*), count(unique person_id) and count(person_id) is when using that "group by category, ref_type". It seems like it will always give the same result (but I know that count(unique person_id) and count(person_id) would give different results in general, I just can't see why, since we are using the group by statement).
Calcite | Level 5
I say that will give different results in general, because I remember taking away the "unique" keyword and it gave different results. Also, did you mean to type count(person_id) twice in the proc sql statements?
Super User Tom
Super User

No. Second one should have the DISTINCT keyword.


They will yield the same count when the three variable combination of PERSON_ID, CATEGORY, and REF_TYPE uniquely identify the observation.  



Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Get the $99 certification deal.jpg



Back in the Classroom!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 3 in conversation