Re: Distinct count and group by multiple variables

colabear · Posted 09-05-2019 02:02 PM

I am trying to generate a summary table, but I can't seem to think of how to execute it. What I need is to add a column with distinct count of ID by DeptID (or name) and by month of service. The 'hours' and 'amount' columns would need to be summed accordingly. I currently have a table but I would like it to contain distinct count of ID.

proc means data = services_fortab noprint nway sum;
var hours paid;
class vendorid vendorname svc_month Fiscal_Year SvcLocDesc ;
output out = services_fortab2(drop = _TYPE_ _FREQ_) sum=;
run;

data welcome;
infile datalines dlm=",";
input ID $ DeptID $ Name :$20. svcdate :yymmdd10. Fiscal_Year Hours Amount ;
format svcdate yymmddd10.;
datalines;
1234, 7, East, 2019-01-01, 2019, 1.50, 109.7
1234, 7, East, 2019-01-01, 2019, 7.50, 509.7
6543, 7, East, 2019-01-1, 2019, 2.75, 340.82
7847, 8, West, 2019-02-01, 2019, 5, 7000
5818, 10, North, 2019-03-01, 2019, 6, 1032
0844, 3, South, 2019-04-01, 2019, 40, 855
4499, 5, West, 2018-09-01, 2019, 10, 790
1234, 7, East, 2019-09-01, 2019, 6.50, 1900.7
3000, 2, North, 2018-12-01, 2019, 11.75, 874.79
3785, 27, West, 2018-11-01, 2019, 20.3, 450.64

;
run;

I'd like the output to look something like this:

DeptID	Name	svcdate	Fiscal Year	Hours	Amount	Count
7	East	1/1/2019	2019	10	5324	50
7	East	3/1/2019	2019	6	756	3
8	West	2/1/2019	2019	67	2366	34
8	West	4/1/2019	2019	98	7899	467
10	North	2/1/2019	2019	33	956	64
10	North	3/1/2019	2019	100	5899	223
3	South	9/1/2018	2019	93	578	11
3	South	2/1/2019	2019	25	235	176

I know the solution is likely obvious, but I can't figure it out. Thank you!

Jagadishkatam · Posted 09-05-2019 02:33 PM

You can try below code

data welcome;
infile datalines dlm=",";
input ID $ DeptID $ Name :$20. svcdate :yymmdd10. Fiscal_Year Hours Amount ;
format svcdate yymmddd10.;
datalines;
1234, 7, East, 2019-01-01, 2019, 1.50, 109.7
1234, 7, East, 2019-01-01, 2019, 7.50, 509.7
6543, 7, East, 2019-01-1, 2019, 2.75, 340.82
7847, 8, West, 2019-02-01, 2019, 5, 7000
5818, 10, North, 2019-03-01, 2019, 6, 1032
0844, 3, South, 2019-04-01, 2019, 40, 855
4499, 5, West, 2018-09-01, 2019, 10, 790
1234, 7, East, 2019-09-01, 2019, 6.50, 1900.7
3000, 2, North, 2018-12-01, 2019, 11.75, 874.79
3785, 27, West, 2018-11-01, 2019, 20.3, 450.64

;
run;

proc sql;
create table want as select DeptID, Name, svcdate, Fiscal_Year , sum(Hours) as hours, sum(amount) as amount, count(id) as count 
from welcome group by DeptID, Name, svcdate, Fiscal_Year ;
quit;

Thanks,
Jag

colabear · Posted 09-05-2019 02:46 PM

Hello Jag,

Thank you for the quick response! The count I am getting is not a sum of distinct ID. I want to show how many individuals were seen at the department X during month XX-YYYY.

Thank you!

Reeza · Posted 09-05-2019 02:52 PM

Sadly it isn't. There isn't a default function to do distinct counts in SAS so you need to calculate it externally and then merge it in with your other table.

There are two methods to do this that I recommend, PROC SQL or double PROC FREQ. Examples for both are below. To scale it for multiple variables add you extra variables to the GROUP BY or TABLE statement.

/*This demonstrates how to count the number of unique occurences of a variable
across groups. It uses the SASHELP.CARS dataset which is available with any SAS installation.
The objective is to determine the number of unique car makers by origin/

Note: The SQL solution can be off if you have a large data set and these are not the only two ways to calculate distinct counts.
If you're dealing with a large data set other methods may be appropriate.*/

*Count distinct IDs;
proc sql;
create table distinct_sql as
select origin, count(distinct make) as n_make
from sashelp.cars
group by origin;
quit;

*Double PROC FREQ;
proc freq data=sashelp.cars noprint;
table origin * make / out=origin_make;
run;

proc freq data=origin_make noprint;
table origin / out= distinct_freq;
run;

title 'PROC FREQ';
proc print data=distinct_freq;
run;
title 'PROC SQL';
proc print data=distinct_sql;
run;

@colabear wrote:

I am trying to generate a summary table, but I can't seem to think of how to execute it. What I need is to add a column with distinct count of ID by DeptID (or name) and by month of service. The 'hours' and 'amount' columns would need to be summed accordingly. I currently have a table but I would like it to contain distinct count of ID.

proc means data = services_fortab noprint nway sum;
var hours paid;
class vendorid vendorname svc_month Fiscal_Year SvcLocDesc ;
output out = services_fortab2(drop = _TYPE_ _FREQ_) sum=;
run;

data welcome;
infile datalines dlm=",";
input ID $ DeptID $ Name :$20. svcdate :yymmdd10. Fiscal_Year Hours Amount ;
format svcdate yymmddd10.;
datalines;
1234, 7, East, 2019-01-01, 2019, 1.50, 109.7
1234, 7, East, 2019-01-01, 2019, 7.50, 509.7
6543, 7, East, 2019-01-1, 2019, 2.75, 340.82
7847, 8, West, 2019-02-01, 2019, 5, 7000
5818, 10, North, 2019-03-01, 2019, 6, 1032
0844, 3, South, 2019-04-01, 2019, 40, 855
4499, 5, West, 2018-09-01, 2019, 10, 790
1234, 7, East, 2019-09-01, 2019, 6.50, 1900.7
3000, 2, North, 2018-12-01, 2019, 11.75, 874.79
3785, 27, West, 2018-11-01, 2019, 20.3, 450.64

;
run;

I'd like the output to look something like this:

DeptID Name svcdate Fiscal Year Hours Amount Count

7 East 1/1/2019 2019 10 5324 50

7 East 3/1/2019 2019 6 756 3

8 West 2/1/2019 2019 67 2366 34

8 West 4/1/2019 2019 98 7899 467

10 North 2/1/2019 2019 33 956 64

10 North 3/1/2019 2019 100 5899 223

3 South 9/1/2018 2019 93 578 11

3 South 2/1/2019 2019 25 235 176

I know the solution is likely obvious, but I can't figure it out. Thank you!

Distinct count and group by multiple variables