I have a messy dataset where I have up to 16 iterations of a variable Result1-Result16. Some people in the dataset only have a few iterations, some have 16 and everywhere in between. I'm trying to create a result summary variable that takes into account each of these versions. The Result variable is in number form and I have a format for it that interprets what the numbers mean, I'm not sure if that matters for the coding. When interpreting the result options into the SummaryResult variable I'd like to create, some only match up with one interpretation but others should be grouped together since they mean essentially the same thing for the SummaryResult variable. For example:
A result of 100 means UNKNOWN
A result of 200 means NO
But then a result of 300, 400, 500, etc means YES
Many are also missing with . as the result, so the ResultSummary should be . also
To complicate matters, I can't just look at the last/highest iteration of Result because the correct/best ResultSummary may not be there, it could be the first or middle or last. The ResultSummary should be determined by a hierarchy:
If there are any results that mean YES, the ResultSummary variable should be YES.
If there are no YES results, but there is at least one NO, the ResultSummary variable should be NO.
If there are no YES or NO results, but there is at least one UNKNOWN, the ResultSummary variable should be UNKNOWN.
If . is the only result for all iterations of Result for that person, the ResultSummary should be . (missing).
Here is a fake example:
Person | Result1 | Result2 | Result3 | Result4 | Result5 | SummaryResult |
111 | . | . | 100 | . | 100 | UNKNOWN |
112 | 200 | 200 | . | . | 500 | YES |
113 | . | . | 100 | 200 | 200 | NO |
114 | . | . | . | . | . | . |
115 | . | 300 | 100 | 100 | 400 | YES |
I know only basics about assays from a class about 6 years ago and I just can't wrap my head around how I could get it to check each iteration while not keeping the last version instead of the best summary based on the hierarchy.
Thank you for any help!!
Something like:
length summaryresult $7;
array res [*] result1-result5;
select (max(of res[*]));
when (300,400,500) summaryresult = "YES";
when (200) summaryresult = "NO";
when (100) summaryresult = "UNKNOWN";
otherwise summaryresult = ".";
end;
Alternatively, you could store the exact value of max(of result1-result16) in variable SummaryResult (and thus have more information available) and define a format to display 100 as "UNKNOWN", 200 as "NO", etc.
proc format;
value sumres
.='MISSING'
100='UNKNOWN'
200='NO'
300-high='YES'
other='STRANGE'
;
run;
data want;
set have;
if n(of result:) then SummaryResult=max(of result:);
format SummaryResult sumres.;
run;
This formatted variable could be used in CLASS statements (e.g. in PROC MEANS), as a group variable in PROC REPORT, for creating frequency tables (with PROC FREQ), etc. However, it would be slightly less convenient than a character variable if you had to retrieve the formatted values very often (as in if vvalue(SummaryResult)='YES' then ...).
data have;
infile cards truncover expandtabs;
input Person Result1 Result2 Result3 Result4 Result5 ;
cards;
111 . . 100 . 100
112 200 200 . . 500
113 . . 100 200 200
114 . . . . .
115 . 300 100 100 400
;
data want;
set have;
array x{*} Result:;
length ResultSummary $ 40;
if (300 in x) or (400 in x) or (500 in x) then ResultSummary ='YES ';
else if 200 in x then ResultSummary ='NO ';
else if 100 in x then ResultSummary ='UNKNOWN';
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.