I have a dataset
data have;
input ID Type $ DIS $ Timepoint Answer_1 Answer_2 Payment;
datalines;
1001 0to5 A 0 0 1 100
1001 0to5 A 1 1 0 210
1001 0to5 A 3 1 . 340
1002 0to5 A 0 1 0 290
1002 0to5 A 1 . 1 180
1003 0to5 B 0 1 1 200
1003 0to5 B 2 . 0 150
1004 6to9 A 0 . . 300
1004 6to9 A 1 1 1 100
1004 6to9 A 2 1 . 340
1005 6to9 A 0 1 1 900
1005 6to9 A 1 0 1 100
1005 6to9 A 3 . 0 940
1006 6to9 A 0 1 . 400
1006 6to9 A 1 1 1 450
;
I would like the output table:
Type | DIS | MedianPayment_Ans1 | MedianPayment_Ans2 | Count |
0to5 | A | 250 | 195 | 2 |
0to5 | B | 200 | 175 | 1 |
6to9 | A | 370 | 450 | 3 |
The output is: grouped by Type and DIS, with three new columns.
Two of the columns are the median payments for each respective answer field, only considering the values which have the respective answer not being missing.
The third column has the count of the number of unique ID's which contributed to the median.
Constraints are:
- There are many ID's, Type's and DIS's
- Data set is ordered on ID and Timepoint
- For any distinct ID, the first Timepoint is equal to 0, and Timepoint is a nonnegative integer
- All answer fields are either 0, 1 or missing (.) .
- Payment is a positive integer, possibly missing too.
- There are at least 50 answer fields.
Could someone help me with this please? Preferably using proc sql. I thought of using some indicator function but it didn't work:
proc sql;
create table want as
select Type
,DIS
,median(Payment*(Answer1 ne .)) as medianPayment_Ans1
,median(Payment*(Answer2 ne .)) as medianPayment_Ans2
from have
group by Type, DIS;
quit;
The third column has the count of the number of unique ID's which contributed to the median.
Do missing values contribute here in any way?
(Payment*(Answer1 ne .))
Is not going to do what you want because SAS returns 0 for false for the Answer1 ne . So Payment*0=0. Making it a valid value for the Median function to consider and getting included in a "count".
What is your "count" supposed to be counting? Since you have two variables that might be missing at random it is not clear what it might be counting.
The medians by not missing:
proc sql; create table sortofwant as select a.type,a.dis,a.MedianPayment_Ans1,b.MedianPayment_Ans2 from ( select type,dis, median(payment) as MedianPayment_Ans1 from (select * from have where not missing(answer_1)) group by type,dis ) as a left join ( select type,dis, median(payment) as MedianPayment_Ans2 from (select * from have where not missing(answer_2)) group by type,dis ) as b on a.type=b.type and a.dis=b.dis ; quit;
Caution: small example =>small code and if you have a combination of answer_1 that are all missing for a type*dis combination that are not missing for answer_2 then you need something other than a left join to get both groups of summaries.
Note that this is likely going to be a bit of a headache if you try to extend it to many Answer periods.
This should scale for you now:
data have;
input ID Type $ DIS $ Timepoint Answer_1 Answer_2 Payment;
datalines;
1001 0to5 A 0 0 1 100
1001 0to5 A 1 1 0 210
1001 0to5 A 3 1 . 340
1002 0to5 A 0 1 0 290
1002 0to5 A 1 . 1 180
1003 0to5 B 0 1 1 200
1003 0to5 B 2 . 0 150
1004 6to9 A 0 . . 300
1004 6to9 A 1 1 1 100
1004 6to9 A 2 1 . 340
1005 6to9 A 0 1 1 900
1005 6to9 A 1 0 1 100
1005 6to9 A 3 . 0 940
1006 6to9 A 0 1 . 400
1006 6to9 A 1 1 1 450
;
proc sort data=have;
by TYPE DIS ID;
run;
data have_expanded;
set have;
by type DIS ID;
first_id = first.id;
array pmt_ans(2);
array answer(2) answer_1-answer_2;
do i=1 to dim(answer);
if nmiss(answer(i), payment) = 0 then pmt_ans(i) = payment;
end;
run;
proc means data=have_expanded nway noprint;
by type dis;
var pmt_ans: ;
output out=want sum(first_id) = Num_IDs median = / autoname;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.