Hello,
I have this code I'm using to assign a flag to the worst case scenario (High or within range). (R is from a previous data step to group visits where subject is on treatment)
proc sql;
create table flg as
select distinct usubjid,
paramcd,
adt,
visit,
a1ind,
"Y" as anl02fl
from advs
where r in (1,2) and a1ind in('High','Within range') and ~missing(trtsdt)
group by usubjid, paramcd
having adt eq min(adt);
quit;
This is the spec for reference.
Set to "Y" for the worst case A1IND occurring post-baseline (including unscheduled and early termination visits) for each parameter for each subject. If multiple visits exist with the worst case, flag only the earliest occurring visit. |
My question is how can I order the worst case variable A1IND so that if "High" does exist that gets higher priority over within range as at the moment I have subjects that have both values but because the visit date that has the value of "Within range" occurs before "High" that is getting incorrectly assigned the flag when it should be High as that is the worst case scenario.
Hello @smackerz1988,
In your PROC SQL step you could define a (numeric or character) sort key, say x, whose sort order corresponds to your selection priorities. Then select the observation(s) with the minimum of x in each USUBJID-PARAMCD group (and finally drop x).
Example:
proc sql; create table flg(drop=x) as select distinct usubjid, paramcd, adt, visit, a1ind, "Y" as anl02fl, 1e7*(a1ind eqt 'W')+1e6*missing(adt)+coalesce(adt,0) as x /* temporary sort key */ from advs where r in (1,2) and a1ind in('High','Within range') and ~missing(trtsdt) group by usubjid, paramcd having x=min(x); quit;
The term 1e7*(a1ind eqt 'W') penalizes observations with A1IND values starting with "W" in such a way that a single observation with A1IND="High" would "beat" them (by getting a smaller x value). Next, missing visit dates (assuming SAS date values in variable ADT!) are penalized by 1e6*missing(adt) so that non-missing ADT values, if any, are preferred. Finally, among non-missing visit dates the smallest ADT value (=earliest visit date) will be selected. Or do you need a different handling of missing ADT values? Are there any at all? If not, the definition of x could be simplified to 1e7*(a1ind eqt 'W')+adt.
Show us (a portion of) the actual data that illustrates the problem. Provide the data as working SAS data step code, which you can type in yourself, or follow these instructions. Do not provide the data via other methods.
Also please show the desired output from this data.
I don't follow this. Is it simply a question of ordering the output data so that the records with a1ind = 'High' comes before records with a1ind = 'Within range'?
Yes and no. So some subjects might only have "Within Range" for their analysis values so that would be their worst case scenario so to speak. But if a subject has an analysis value of both "High" and "Within Range" then the code I'm using will flag the earliest date (adt) regardless if the analysis value is "High" or "Within Range". I need to prioritise the value of "High" if it is present for a subject so that the code select the earliest date that contains that value and then if it isn't present I can just use the earliest occurring "Within range" value
SQL seems a bad choice for this type of question, is SQL absolutely required?
Hello @smackerz1988,
In your PROC SQL step you could define a (numeric or character) sort key, say x, whose sort order corresponds to your selection priorities. Then select the observation(s) with the minimum of x in each USUBJID-PARAMCD group (and finally drop x).
Example:
proc sql; create table flg(drop=x) as select distinct usubjid, paramcd, adt, visit, a1ind, "Y" as anl02fl, 1e7*(a1ind eqt 'W')+1e6*missing(adt)+coalesce(adt,0) as x /* temporary sort key */ from advs where r in (1,2) and a1ind in('High','Within range') and ~missing(trtsdt) group by usubjid, paramcd having x=min(x); quit;
The term 1e7*(a1ind eqt 'W') penalizes observations with A1IND values starting with "W" in such a way that a single observation with A1IND="High" would "beat" them (by getting a smaller x value). Next, missing visit dates (assuming SAS date values in variable ADT!) are penalized by 1e6*missing(adt) so that non-missing ADT values, if any, are preferred. Finally, among non-missing visit dates the smallest ADT value (=earliest visit date) will be selected. Or do you need a different handling of missing ADT values? Are there any at all? If not, the definition of x could be simplified to 1e7*(a1ind eqt 'W')+adt.
Wow really interesting and cool solution! Thank you!. Would you have any additional links so I could learn more about that?
@smackerz1988 wrote:
Wow really interesting and cool solution! Thank you!. Would you have any additional links so I could learn more about that?
You're welcome. Other than the usual documentation links for the EQT and related comparison operators, the MISSING function and the COALESCE SQL function (not identical with the COALESCE function) I don't know of any references specifically on this technique. But I assume that similar sort keys have been used in other posts in this forum, in SAS conference papers or possibly books.
I had been thinking about this problem -- selecting records in PROC SQL based on a hierarchy of criteria -- earlier this year (without a concrete application), had the idea of constructing such sort keys and added this idea to my SAS notes (started in 1999). Your question seemed to be an opportunity to apply this idea to a real-world problem. So thank you for asking. :-)
@FreelanceReinh Can I ask if this can be expanded further?. I have a similar analysis flag specification with three values ("High", "Low" ,"Within range") but the wrinkle is , as highlighted in red, if a subject has both "High" and "Low" I have to flag the earliest occurrence for each?. Is this approach even viable here?
I don't think the earlier solution can be expanded easily to the new task because checking (in PROC SQL) whether a USUBJID-PARAMCD group has only observations with ANRIND="Within range" would typically require a GROUP BY USUBJID, PARAMCD clause. But for the earliest-date flags you would rather need a GROUP BY USUBJID, PARAMCD, ANRIND clause. So I'd switch to a different technique such as BY-group processing in a DATA step where those nested BY groups are readily available: You could sort your data (including only "High", "Low" and "Within range" cases) into a new dataset HAVE by usubjid paramcd anrind adt. Then a DATA step as shown below would create a flag for the first observation in each USUBJID-PARAMCD-ANRIND group except for those where ANRIND="Within range" occurs together with ANRIND="High" or ANRIND="Low" in the same USUBJID-PARAMCD group.
data want;
set have;
by usubjid paramcd anrind adt;
if first.anrind & (anrind ne: 'W' | first.paramcd) then flag='Y';
run;
(Note that this step uses the alphabetical order of "High", "Low" coming before "Within range".) Unlike your PROC SQL step, this results in the complete dataset with the flag variable added.
Note that missing or tied minimum ADT values are handled differently in the DATA step than in PROC SQL:
Without further changes
@FreelanceReinh That is brilliant thank you so much for going into so much depth it's very much appreciated. My only drawback and I do understand the benefit of having simulated data more and more to illustrate the crux of the coding issue but this flag is for post baseline records only and where ANL02FL ='Y' and it currently is flagging for all screening records. I apologies it was more cumbersome that what it appeared to be
USUBJID | PARAMCD | ANRIND | ABLFL | ANL02FL | FLAG |
1001 | ALT | Within range | Y | ||
1001 | ALT | Within range | Y | Y | |
1001 | ALT | Within range | Y | ||
1001 | ALT | Within range | Y | ||
1001 | ALT | Within range | Y | ||
1001 | ALT | Within range | Y | ||
1001 | AST | Within range | Y | ||
1001 | AST | Within range | Y | Y | |
1001 | AST | Within range | Y | ||
1001 | AST | Within range | Y | ||
1001 | AST | Within range | Y | ||
1001 | AST | Within range | Y | ||
1001 | BASO | Within range | Y | ||
1001 | BASO | Within range | Y | ||
1001 | BASO | Within range | |||
1001 | BASO | Within range | |||
1001 | BASO | Within range | |||
1001 | BASO | Within range | |||
1001 | BASOLE | Within range | Y | ||
1001 | BASOLE | Within range | Y | ||
1001 | BASOLE | Within range | |||
1001 | BASOLE | Within range | |||
1001 | BASOLE | Within range | |||
1001 | BASOLE | Within range | |||
1001 | BILIC | Within range | Y | ||
1001 | BILIC | Within range | Y | Y | |
1001 | BILIC | Within range | Y | ||
1001 | BILIC | Within range | Y | ||
1001 | BILIC | Within range | Y | ||
1001 | BILIC | Within range | Y | ||
1001 | BILIU | Within range | Y | ||
1001 | BILIU | Within range | Y | Y | |
1001 | BILIU | Within range | Y | ||
1001 | BILIU | Within range | Y | ||
1001 | BILIU | Within range | Y | ||
1001 | BILIU | Within range | Y | ||
1001 | CREAT | Within range | Y | ||
1001 | CREAT | Within range | Y | Y | |
1001 | CREAT | Within range | Y | ||
1001 | CREAT | Within range | Y | ||
1001 | CREAT | Within range | Y | ||
1001 | CREAT | Within range | Y | ||
1001 | CRP | Within range | Y | ||
1001 | CRP | Within range | Y | Y | |
1001 | CRP | Within range | Y | ||
1001 | CRP | Within range | Y | ||
1001 | CRP | Within range | Y | ||
1002 | MCHC | Low | Y | Y | |
1002 | MCHC | Within range | |||
1002 | MCHC | Within range | Y | Y | |
1002 | MCHC | High | Y | ||
1002 | MCHC | Low | Y | ||
1002 | MCHC | Within range | Y | ||
1002 | MCV | Within range | Y | ||
1002 | MCV | Within range | Y | Y | |
1002 | MCV | Within range | Y | ||
1002 | MCV | Within range | Y | ||
1002 | MCV | Within range | Y | ||
1002 | MCV | Within range | Y |
A simple solution would be to select the relevant observations to be flagged and then merge back with the original dataset. If there's no suitable merge key, you could create one.
Example:
data _tmp / view=_tmp;
set have;
_seqno+1;
run;
data _flg(keep=_seqno);
set _tmp;
where /* insert applicable conditions incl. ANL02FL='Y' */;
by usubjid paramcd anrind adt;
if first.anrind & (anrind ne: 'W' | first.paramcd);
run;
data want(drop=_seqno);
merge _tmp _flg(in=_f);
by _seqno;
if _f then flag='Y';
run;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.