topic Re: Summary functions conditioning on another column, proc sql in SAS Programming

Summary functions conditioning on another column, proc sql

penguins0101 — Tue, 26 Jul 2022 19:27:57 GMT

I have a dataset

data have;
input ID Type $ DIS $ Timepoint Answer_1 Answer_2 Payment;
datalines;
1001 0to5 A 0 0 1 100
1001 0to5 A 1 1 0 210
1001 0to5 A 3 1 . 340
1002 0to5 A 0 1 0 290
1002 0to5 A 1 . 1 180
1003 0to5 B 0 1 1 200
1003 0to5 B 2 . 0 150
1004 6to9 A 0 . . 300
1004 6to9 A 1 1 1 100
1004 6to9 A 2 1 . 340
1005 6to9 A 0 1 1 900
1005 6to9 A 1 0 1 100
1005 6to9 A 3 . 0 940
1006 6to9 A 0 1 . 400
1006 6to9 A 1 1 1 450
;

I would like the output table:

Type	DIS	MedianPayment_Ans1	MedianPayment_Ans2	Count
0to5	A	250	195	2
0to5	B	200	175	1
6to9	A	370	450	3

The output is: grouped by Type and DIS, with three new columns.

Two of the columns are the median payments for each respective answer field, only considering the values which have the respective answer not being missing.

The third column has the count of the number of unique ID's which contributed to the median.

Constraints are:
- There are many ID's, Type's and DIS's
- Data set is ordered on ID and Timepoint

- For any distinct ID, the first Timepoint is equal to 0, and Timepoint is a nonnegative integer

- All answer fields are either 0, 1 or missing (.) .

- Payment is a positive integer, possibly missing too.

- There are at least 50 answer fields.

Could someone help me with this please? Preferably using proc sql. I thought of using some indicator function but it didn't work:

proc sql;
create table want as
select Type
,DIS
,median(Payment*(Answer1 ne .)) as medianPayment_Ans1
,median(Payment*(Answer2 ne .)) as medianPayment_Ans2
from have
group by Type, DIS;
quit;

Re: Summary functions conditioning on another column, proc sql

Reeza — Tue, 26 Jul 2022 21:52:39 GMT

The third column has the count of the number of unique ID's which contributed to the median.

Do missing values contribute here in any way?

Re: Summary functions conditioning on another column, proc sql

ballardw — Tue, 26 Jul 2022 21:59:27 GMT

(Payment*(Answer1 ne .))

Is not going to do what you want because SAS returns 0 for false for the Answer1 ne . So Payment*0=0. Making it a valid value for the Median function to consider and getting included in a "count".

What is your "count" supposed to be counting? Since you have two variables that might be missing at random it is not clear what it might be counting.

The medians by not missing:

proc sql; 
   create table sortofwant as
   select a.type,a.dis,a.MedianPayment_Ans1,b.MedianPayment_Ans2
   from (
         select type,dis, median(payment) as MedianPayment_Ans1
         from (select * from have where not missing(answer_1))
         group by type,dis
        ) as a
        left join
        (
         select type,dis, median(payment) as MedianPayment_Ans2
         from (select * from have where not missing(answer_2))
         group by type,dis
        ) as b
        on a.type=b.type
           and a.dis=b.dis
   ;
quit;

Caution: small example =>small code and if you have a combination of answer_1 that are all missing for a type*dis combination that are not missing for answer_2 then you need something other than a left join to get both groups of summaries.

Note that this is likely going to be a bit of a headache if you try to extend it to many Answer periods.

Re: Summary functions conditioning on another column, proc sql

Reeza — Tue, 26 Jul 2022 22:04:11 GMT

This should scale for you now:

data have;
input ID Type $ DIS $ Timepoint Answer_1 Answer_2 Payment;
datalines;
1001 0to5 A 0 0 1 100
1001 0to5 A 1 1 0 210
1001 0to5 A 3 1 . 340
1002 0to5 A 0 1 0 290
1002 0to5 A 1 . 1 180
1003 0to5 B 0 1 1 200
1003 0to5 B 2 . 0 150
1004 6to9 A 0 . . 300
1004 6to9 A 1 1 1 100
1004 6to9 A 2 1 . 340
1005 6to9 A 0 1 1 900
1005 6to9 A 1 0 1 100
1005 6to9 A 3 . 0 940
1006 6to9 A 0 1 . 400
1006 6to9 A 1 1 1 450
;

proc sort data=have;
by TYPE DIS ID;
run;


data have_expanded;
set have;
by type DIS ID;

first_id = first.id;

array pmt_ans(2);
array answer(2) answer_1-answer_2;

do i=1 to dim(answer);
if nmiss(answer(i), payment) = 0 then pmt_ans(i) = payment;
end;

run;

proc means data=have_expanded nway noprint;
by type dis;
var pmt_ans: ;
output out=want sum(first_id) = Num_IDs median =   / autoname;
run;