I'm pretty new to proc sql, but I was trying to count the number of people. Each unique person can have multiple rows (multiple application submissions). An example of a table below is:
Person_ID | Category | Ref_type | Total_amt |
100 | Green | 2 | 350 |
100 | Blue | 2 | 300 |
100 | Red | 3 | 100 |
200 | Green | 1 | 20 |
200 | Black | 3 | 500 |
300 | Blue | 2 | 200 |
I want to count the number of people in each Category*Ref_type, so it'll be
Category | Ref_type | No_ppl |
Green | 1 | 1 |
Green | 2 | 2 |
Blue | 2 | 2 |
Red | 3 | 1 |
Black | 3 | 1 |
I'm not at home right now so I can't check but I remember that I tried doing
proc sql;
create table want as
select Category
,Ref_type
,count(unique Person_ID)
from have
group by Category, Ref_type
;quit;
which gave me not exactly what I wanted (note that this was for a large dataset, probably 2 million rows). When I removed the "unique" keyword, I got what I wanted.
Can someone explain to me what the code with the "unique" when I'm using it with the count function, and a group by statement, and also what the code without the "unique" function does when I'm counting a specific variable that's not in the group by statement?
How many people do you think you have in your fist listing?
Person_ID Category Ref_type Total_amt 100 Green 2 350 100 Blue 2 300 100 Red 3 100 200 Green 1 20 200 Black 3 500 300 Blue 2 200
I see 6 observations for 3 people.
COUNT(*) will count how many observations.
COUNT(PERSON_ID) will count how many observations have a non-missing value of PERSON_ID.
COUNT(DISTINCT PERSON_ID) will count how many people.
GROUP BY separates the observations into groups based on the values of the variables listed. When you have GROUP BY the aggregate functions, like COUNT(),MAX(),SUM(), etc operate over each group separately.
Also when you have GROUP BY you need to take care what columns (variables) you list in the SELECT clause.
data have;
input Person_ID $ Category $ Ref_type $ Total_amt;
cards;
100 Green 2 350
100 Blue 2 300
100 Red 3 100
200 Green 1 20
200 Black 3 500
300 Blue 2 200
;
proc sql ;
select category,ref_type
, count(*) as nobs
, count(person_id) as n_with_ids
, count(person_id) as n_people
, sum(total_amt) as sum_total_amt
from have
group by category,ref_type
;
quit;
Result
sum_total_ Category Ref_type nobs n_with_ids n_people amt ---------------------------------------------------------------- Black 3 1 1 1 500 Blue 2 2 2 2 500 Green 1 1 1 1 20 Green 2 1 1 1 350 Red 3 1 1 1 100
But if you include a variable that is neither one of the grouping variables nor an aggregate result then SAS will re-merge the computed aggregate values back with all of the observations selected in that group. It will tell you this in the LOG.
1178 proc sql ; 1179 select category,ref_type,total_amt 1180 , count(*) as nobs 1181 , count(person_id) as n_with_ids 1182 , count(person_id) as n_people 1183 , sum(total_amt) as sum_total_amt 1184 from have 1185 group by category,ref_type 1186 ; NOTE: The query requires remerging summary statistics back with the original data. 1187 quit;
Result
sum_total_ Category Ref_type Total_amt nobs n_with_ids n_people amt --------------------------------------------------------------------------- Black 3 500 1 1 1 500 Blue 2 200 2 2 2 500 Blue 2 300 2 2 2 500 Green 1 20 1 1 1 20 Green 2 350 1 1 1 350 Red 3 100 1 1 1 100
No. Second one should have the DISTINCT keyword.
They will yield the same count when the three variable combination of PERSON_ID, CATEGORY, and REF_TYPE uniquely identify the observation.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.