BookmarkSubscribeRSS Feed
newbatprocsql1
Calcite | Level 5

I'm pretty new to proc sql, but I was trying to count the number of people. Each unique person can have multiple rows (multiple application submissions). An example of a table below is:  

Person_IDCategoryRef_typeTotal_amt
100Green2350
100Blue2300
100Red3100
200Green120
200Black3500
300Blue2200


I want to count the number of people in each Category*Ref_type, so it'll be

CategoryRef_typeNo_ppl
Green11
Green22
Blue22
Red31
Black31

 

I'm not at home right now so I can't check but I remember that I tried doing

proc sql;
create table want as
select Category
,Ref_type
,count(unique Person_ID)
from have 
group by Category, Ref_type
;quit;

which gave me not exactly what I wanted (note that this was for a large dataset, probably 2 million rows). When I removed the "unique" keyword, I got what I wanted.

Can someone explain to me what the code with the "unique" when I'm using it with the count function, and a group by statement, and also what the code without the "unique" function does when I'm counting a specific variable that's not in the group by statement?

7 REPLIES 7
Tom
Super User Tom
Super User

How many people do you think you have in your fist listing?

 

Person_ID Category Ref_type Total_amt
100 Green 2 350
100 Blue 2 300
100 Red 3 100
200 Green 1 20
200 Black 3 500
300 Blue 2 200

 

I see 6 observations for 3 people.

 

COUNT(*) will count how many observations.

COUNT(PERSON_ID) will count how many observations have a non-missing value of PERSON_ID.

COUNT(DISTINCT PERSON_ID) will count how many people.

 

 

newbatprocsql1
Calcite | Level 5
That makes sense - so if PERSON_ID is always not missing then count(*) = count(person_id)? Also, how does that work with the group by statement? I see @JosvanderVelden 's reply, but the post they linked only says there's a caveat with the aggregate function, but doesn't say what the caveat is.
Tom
Super User Tom
Super User

GROUP BY separates the observations into groups based on the values of the variables listed.  When you have GROUP BY the aggregate functions, like COUNT(),MAX(),SUM(), etc operate over each group separately.

 

Also when you have GROUP BY you need to take care what columns (variables) you list in the SELECT clause.

data have;
  input Person_ID $ Category $ Ref_type $ Total_amt;
cards;
100 Green 2 350
100 Blue 2 300
100 Red 3 100
200 Green 1 20
200 Black 3 500
300 Blue 2 200
;

proc sql ;
 select category,ref_type
      , count(*) as nobs
      , count(person_id) as n_with_ids
      , count(person_id) as n_people
      , sum(total_amt) as sum_total_amt
   from have
   group by category,ref_type
 ;
quit;

Result

                                                      sum_total_
Category  Ref_type      nobs  n_with_ids  n_people           amt
----------------------------------------------------------------
Black     3                1           1         1           500
Blue      2                2           2         2           500
Green     1                1           1         1            20
Green     2                1           1         1           350
Red       3                1           1         1           100

But if you include a variable that is neither one of the grouping variables nor an aggregate result then SAS will re-merge the computed aggregate values back with all of the observations selected in that group.  It will tell you this in the LOG.

1178  proc sql ;
1179   select category,ref_type,total_amt
1180        , count(*) as nobs
1181        , count(person_id) as n_with_ids
1182        , count(person_id) as n_people
1183        , sum(total_amt) as sum_total_amt
1184     from have
1185     group by category,ref_type
1186   ;
NOTE: The query requires remerging summary statistics back with the original data.
1187  quit;

Result

                                                                 sum_total_
Category  Ref_type  Total_amt      nobs  n_with_ids  n_people           amt
---------------------------------------------------------------------------
Black     3               500         1           1         1           500
Blue      2               200         2           2         2           500
Blue      2               300         2           2         2           500
Green     1                20         1           1         1            20
Green     2               350         1           1         1           350
Red       3               100         1           1         1           100
newbatprocsql1
Calcite | Level 5
Thanks! That's really helpful. I think now I'm just not understanding what the difference between using
count(*), count(unique person_id) and count(person_id) is when using that "group by category, ref_type". It seems like it will always give the same result (but I know that count(unique person_id) and count(person_id) would give different results in general, I just can't see why, since we are using the group by statement).
newbatprocsql1
Calcite | Level 5
I say that will give different results in general, because I remember taking away the "unique" keyword and it gave different results. Also, did you mean to type count(person_id) twice in the proc sql statements?
Tom
Super User Tom
Super User

No. Second one should have the DISTINCT keyword.

 

They will yield the same count when the three variable combination of PERSON_ID, CATEGORY, and REF_TYPE uniquely identify the observation.  

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 857 views
  • 0 likes
  • 3 in conversation