Solved: Calculating percent of the by group total in proc sql

Whitlea · Posted 10-10-2024 12:47 PM

I want to know what percent of clients had more than one stay for each client. I am trying to calculate percent of the group total where the group total is the distinct count of clients in the group (clinic). The code below is calculating percent of the grand total for each row. How can I fix this?

Thank you!

data have;
input client_id$	Span_Begin :DATE9.	Span_End :DATE9. clinic$ stay$ cost R$ year$ clinic_type$;
format Span_Begin MMDDYY10. Span_End MMDDYY10. ;
datalines;
A	17Nov2022	04Jan2023	8	1	0      1  2023  1
B	01Jul2022	29Jul2022	12	1	2746   1  2023  3
B	21Nov2022	04Apr2023	12	2	1373   1  2023  3
C	01Jul2022	07Jul2022	8	1	2403   1  2023  3
C   12Jul2022	25Jul2022   6   2   343    1  2023  3
C   26Aug2022	15May2023   10  3   687    1  2023  3
D	01Jul2022	12Jul2022	10	1	5479   0  2023  2
E	01Jul2022	23Mar2023	9	1	159000 1  2023  3 
F	22Nov2022	19Mar2023	8	1	6522   1  2023  3
G	26Aug2022	15May2023	12	1	687    1  2023  3
H	01Oct2022	10Apr2023	6	1	5479   1  2023  3
I	30Dec2022	19Jan2023	5	1	8675   1  2023  3
J	01Jul2022	01Nov2022	3	1	4566   1  2023  3
K	11Nov2022	10Jan2023	3	2	10044  1  2023  3
L	01Jul2022	15Jul2022	3	1	9588   1  2023  3
L	26Jul2022	17Aug2022	12	2	2060   0  2023  3
L	23Sep2022	19Feb2023	12	3	5149   1  2023  3
O	20Sep2022	20Feb2023	5	1	5022   1  2023  3
P	01Jul2022	02Aug2022	5	1	913    1  2023  1
Q	06Jul2022	01Sep2022	8	1	343    1  2023  3
R	01Jul2022	01Sep2022	1	1	457    1  2023  3
S	01Jul2022	24Oct2022	2	1	10957  1  2023  3
T	01Jul2022	04Dec2022	2	1	1826   1  2023  3
U	28Aug2022	18Sep2022	8	1	6179   1  2023  3
V	19Dec2022	31May2023	7	2	10501  1  2023  3
W	08Jun2023	18Jun2023	7	3	5022   1  2023  3
X	01Jul2022	10Jul2022	8	1	3433   0  2023  3
Y	04Aug2022	17Nov2022	10	2	7305   1  2023  3
Z	13Oct2022	07Nov2022	8	1	2403   1  2023  3
Z	09Feb2023	30Jun2023	12	2	5149   1  2023  3
AA	08Sep2022	04Oct2022	8	1	2746   1  2023  3
BB	01Jul2022	26Apr2023	2	1	10501  1  2023  3
CC	09Aug2022	19Jun2023	12	1	1373   1  2023  3
DD	01Jul2022	21Aug2022	10	1	9588   1  2023  3
DD	17Oct2022	13Jun2023	11	2	5935   1  2023  3
EE	01Jul2022	15Jul2022	5	1	3196   1  2023  3
EE	26Jul2022	17Aug2022	5	2	7305   1  2023  3
EE	23Sep2022	19Feb2023	4	3	13697  1  2023  3
;;;
run;
*** distinct client count per clinic****;
proc sql;
select clinic, count(distinct client_id) as pct_of from have
where clinic_type eq '3' and  year eq '2023' and R eq '1' and cost>0 
group by clinic;
run;

Title1 “Percent of clients with >1 stay”;
proc sql ;
select distinct clinic, count( distinct client_id) as Clients, calculated clients / (select count(distinct client_id) from have 
where clinic_type eq '3' and  year eq '2023' and R eq '1' and cost>0) as pct format=percent8.1
from have
where clinic_type eq '3' and  year eq '2023' and R eq '1' and cost>0 and Stay ne '1'
group clinic
union all
select "Total",  count( distinct client_id) as Total from have
where clinic_type eq '3' and  year eq '2023' and R eq '1' and cost>0 and Stay ne '1';
quit;

I'm getting this:

I want this:

Tom · Posted 10-19-2024 04:21 PM

You seem to want to

1) Filter by some selection criteria

2) Count how many stays (across any clinic) per client

3) Count number of clients per clinic and how many of those clients had multiple stays.

So you want something like:

data subset;
  set have;
  where clinic_type=3 and year=2023 and r=1 and cost>0 ;
run;

proc sql;
create table stays_per_client as
  select *,client_id,count(*) as n_stays
  from subset
  group by client_id
;
create table want as 
  select clinic
       , count(*) as client_count
       , sum(n_stays>1) as clients
       , mean(n_stays>1) as pct format=percent.
  from stays_per_client
  group by clinic
;
quit;

Which results in this report:

So you could collapse that into a single SQL query by nesting it.

proc sql;
create table want as 
  select clinic
       , count(*) as client_count
       , sum(n_stays>1) as clients
       , mean(n_stays>1) as pct format=percent.
  from (
        select clinic,client_id,count(*) as n_stays
        from have
        where clinic_type=3 and year=2023
              and r=1 and cost>0 
        group by client_id
       )
  group by clinic
;
quit;

Note that this query is taking advantage of the fact that SAS will remerge summary statistics back onto the detail observations when you do something like:

select id,var2,mean(x) as mean_x
from have
group by id

If you want to run this SQL in some other system's implementation of SQL you will have to explicitly to the merge (join in SQL speak) yourself. Make sure to apply the same WHERE clause to both sub queries. And perhaps replace the boolean expressions with CASE clauses.

proc sql;
create table want as 
  select a.clinic
       , count(*) as client_count
       , sum(case when n_stays>1 then 1 else 0 end) as clients
       , mean(case when n_stays>1 then 1 else 0 end) as pct format=percent.
  from (select clinic,client_id 
        from have
        where clinic_type=3 and year=2023
              and r=1 and cost>0 
       ) a
  inner join 
       (select client_id,count(*) as n_stays
        from have
        where clinic_type=3 and year=2023
              and r=1 and cost>0 
        group by client_id
       ) b
  on a.client_id=b.client_id
  group by a.clinic
;
quit;

View solution in original post

PaigeMiller · Posted 10-10-2024 01:21 PM

In my opinion, SQL is a poor choice for this. PROC FREQ (or PROC SUMMARY) would do the job easily.

I don't understand the math that results in Clinic 10 having 2 clients which is 66.7%, I can't seem to figure out how you get that from the raw data.

--
Paige Miller

Whitlea · Posted 10-18-2024 05:28 PM

I used proc sql because it is the only way I know how to get distinct count. The first proc SQL gives you the distinct count of total clients per clinic. When looking at distinct clients, 2 of the 3 distinct clients in clinic 10 had more than 1 stay in a clinic (66.7% of the distinct clients in clinic 10).

PaigeMiller · Posted 10-19-2024 05:45 AM

I see 4 distinct clients (C D Y and DD) in clinic 10 in your raw data.

--
Paige Miller

Whitlea · Posted 10-19-2024 02:58 PM

Yes, but only 3 are clinic type 3, which is the only clinic type I want to count.

where clinic_type eq '3' and  year eq '2023' and R eq '1' and cost>0)

PaigeMiller · Posted 10-19-2024 03:49 PM

@Whitlea wrote:

Yes, but only 3 are clinic type 3, which is the only clinic type I want to count.
where clinic_type eq '3' and  year eq '2023' and R eq '1' and cost>0)

Thanks. As far as I can tell, this was not stated in the text of the original problem statement. If the text of the problem statement doesn't mention it but your sample code does mention it, this is a contradiction between the problem statement and sample code (something you should avoid) and normally I stop and ask questions. Better to do that than to go ahead and solve a problem I don't understand.

I have switched the variable STAY in your data set from character to numeric. It is always better to make counts (and other numeric things) a numeric variable. If I don't do this, then I can't use the WEIGHT statement in PROC FREQ.

proc sort data=have(where=(clinic_type='3')) out=have1;
    by clinic;
run;
proc freq data=have1; /* Do the counting */
	by clinic;
	tables client_id/noprint out=_counts_;
	weight stay;
run;
data want; /* Find percent more than one stay */
    set _counts_(drop=percent);
    by clinic;
    if first.clinic then do; 
    	more_than_once=0; n=0;
	end;
    n+1;
    if count>1 then more_than_once+1; /* Count times client stayed more than once */
    if last.clinic then do;
    	more_than_once_percent=more_than_once/n;
    	output;
	end;
run;

--
Paige Miller

Tom · Posted 10-18-2024 08:09 PM

Count the number of stays per client first. Then since you will already have distinct clients counting will not require the DISTINCT keyword.

Tom · Posted 10-19-2024 04:21 PM

You seem to want to

1) Filter by some selection criteria

2) Count how many stays (across any clinic) per client

3) Count number of clients per clinic and how many of those clients had multiple stays.

So you want something like:

data subset;
  set have;
  where clinic_type=3 and year=2023 and r=1 and cost>0 ;
run;

proc sql;
create table stays_per_client as
  select *,client_id,count(*) as n_stays
  from subset
  group by client_id
;
create table want as 
  select clinic
       , count(*) as client_count
       , sum(n_stays>1) as clients
       , mean(n_stays>1) as pct format=percent.
  from stays_per_client
  group by clinic
;
quit;

Which results in this report:

So you could collapse that into a single SQL query by nesting it.

proc sql;
create table want as 
  select clinic
       , count(*) as client_count
       , sum(n_stays>1) as clients
       , mean(n_stays>1) as pct format=percent.
  from (
        select clinic,client_id,count(*) as n_stays
        from have
        where clinic_type=3 and year=2023
              and r=1 and cost>0 
        group by client_id
       )
  group by clinic
;
quit;

Note that this query is taking advantage of the fact that SAS will remerge summary statistics back onto the detail observations when you do something like:

select id,var2,mean(x) as mean_x
from have
group by id

If you want to run this SQL in some other system's implementation of SQL you will have to explicitly to the merge (join in SQL speak) yourself. Make sure to apply the same WHERE clause to both sub queries. And perhaps replace the boolean expressions with CASE clauses.

proc sql;
create table want as 
  select a.clinic
       , count(*) as client_count
       , sum(case when n_stays>1 then 1 else 0 end) as clients
       , mean(case when n_stays>1 then 1 else 0 end) as pct format=percent.
  from (select clinic,client_id 
        from have
        where clinic_type=3 and year=2023
              and r=1 and cost>0 
       ) a
  inner join 
       (select client_id,count(*) as n_stays
        from have
        where clinic_type=3 and year=2023
              and r=1 and cost>0 
        group by client_id
       ) b
  on a.client_id=b.client_id
  group by a.clinic
;
quit;

Calculating percent of the by group total in proc sql

Re: Calculating percent of the by group total in proc sql

Re: Calculating percent of the by group total in proc sql

Re: Calculating percent of the by group total in proc sql

Re: Calculating percent of the by group total in proc sql

Re: Calculating percent of the by group total in proc sql

Re: Calculating percent of the by group total in proc sql

Re: Calculating percent of the by group total in proc sql

Re: Calculating percent of the by group total in proc sql

Registration is open

SAS Training: Just a Click Away