Solved: Re: Ordering worst case A1IND in PROC SQL

smackerz1988 · Posted 09-12-2022 02:18 PM

Hello,

I have this code I'm using to assign a flag to the worst case scenario (High or within range). (R is from a previous data step to group visits where subject is on treatment)

proc sql;
 create table flg as
   select distinct usubjid,
                   paramcd,
                   adt,
                   visit,
                   a1ind,						 
                   "Y" as anl02fl
						 
   from advs
   where r in (1,2) and a1ind in('High','Within range') and ~missing(trtsdt)
   group by usubjid, paramcd  
   having adt eq min(adt);	
quit;

This is the spec for reference.

Set to "Y" for the worst case A1IND occurring post-baseline (including unscheduled and early termination visits) for each parameter for each subject. If multiple visits exist with the worst case, flag only the earliest occurring visit.

My question is how can I order the worst case variable A1IND so that if "High" does exist that gets higher priority over within range as at the moment I have subjects that have both values but because the visit date that has the value of "Within range" occurs before "High" that is getting incorrectly assigned the flag when it should be High as that is the worst case scenario.

FreelanceReinh · Posted 09-12-2022 03:58 PM

Hello @smackerz1988,

In your PROC SQL step you could define a (numeric or character) sort key, say x, whose sort order corresponds to your selection priorities. Then select the observation(s) with the minimum of x in each USUBJID-PARAMCD group (and finally drop x).

Example:

proc sql;
 create table flg(drop=x) as
   select distinct usubjid,
                   paramcd,
                   adt,
                   visit,
                   a1ind,
                   "Y" as anl02fl,
                   1e7*(a1ind eqt 'W')+1e6*missing(adt)+coalesce(adt,0) as x  /* temporary sort key */

   from advs
   where r in (1,2) and a1ind in('High','Within range') and ~missing(trtsdt)
   group by usubjid, paramcd
   having x=min(x);
quit;

The term 1e7*(a1ind eqt 'W') penalizes observations with A1IND values starting with "W" in such a way that a single observation with A1IND="High" would "beat" them (by getting a smaller x value). Next, missing visit dates (assuming SAS date values in variable ADT!) are penalized by 1e6*missing(adt) so that non-missing ADT values, if any, are preferred. Finally, among non-missing visit dates the smallest ADT value (=earliest visit date) will be selected. Or do you need a different handling of missing ADT values? Are there any at all? If not, the definition of x could be simplified to 1e7*(a1ind eqt 'W')+adt.

View solution in original post

PaigeMiller · Posted 09-12-2022 02:24 PM

Show us (a portion of) the actual data that illustrates the problem. Provide the data as working SAS data step code, which you can type in yourself, or follow these instructions. Do not provide the data via other methods.

Also please show the desired output from this data.

--
Paige Miller

PeterClemmensen · Posted 09-12-2022 02:32 PM

I don't follow this. Is it simply a question of ordering the output data so that the records with a1ind = 'High' comes before records with a1ind = 'Within range'?

The DATA to DATA Step Macro
Blog: SASnrd

smackerz1988 · Posted 09-12-2022 03:02 PM

Yes and no. So some subjects might only have "Within Range" for their analysis values so that would be their worst case scenario so to speak. But if a subject has an analysis value of both "High" and "Within Range" then the code I'm using will flag the earliest date (adt) regardless if the analysis value is "High" or "Within Range". I need to prioritise the value of "High" if it is present for a subject so that the code select the earliest date that contains that value and then if it isn't present I can just use the earliest occurring "Within range" value

smackerz1988 · Posted 09-12-2022 03:18 PM

That Data to Data Step Macro looks great btw!

Reeza · Posted 09-12-2022 03:01 PM

SQL seems a bad choice for this type of question, is SQL absolutely required?

smackerz1988 · Posted 09-12-2022 03:03 PM

No, if there is a better way I'm all ears it's just because I currently am using PROC SQL code

FreelanceReinh · Posted 09-12-2022 03:58 PM

Hello @smackerz1988,

In your PROC SQL step you could define a (numeric or character) sort key, say x, whose sort order corresponds to your selection priorities. Then select the observation(s) with the minimum of x in each USUBJID-PARAMCD group (and finally drop x).

Example:

proc sql;
 create table flg(drop=x) as
   select distinct usubjid,
                   paramcd,
                   adt,
                   visit,
                   a1ind,
                   "Y" as anl02fl,
                   1e7*(a1ind eqt 'W')+1e6*missing(adt)+coalesce(adt,0) as x  /* temporary sort key */

   from advs
   where r in (1,2) and a1ind in('High','Within range') and ~missing(trtsdt)
   group by usubjid, paramcd
   having x=min(x);
quit;

The term 1e7*(a1ind eqt 'W') penalizes observations with A1IND values starting with "W" in such a way that a single observation with A1IND="High" would "beat" them (by getting a smaller x value). Next, missing visit dates (assuming SAS date values in variable ADT!) are penalized by 1e6*missing(adt) so that non-missing ADT values, if any, are preferred. Finally, among non-missing visit dates the smallest ADT value (=earliest visit date) will be selected. Or do you need a different handling of missing ADT values? Are there any at all? If not, the definition of x could be simplified to 1e7*(a1ind eqt 'W')+adt.

smackerz1988 · Posted 09-12-2022 04:38 PM

Wow really interesting and cool solution! Thank you!. Would you have any additional links so I could learn more about that?

FreelanceReinh · Posted 09-12-2022 05:14 PM

@smackerz1988 wrote:

Wow really interesting and cool solution! Thank you!. Would you have any additional links so I could learn more about that?

You're welcome. Other than the usual documentation links for the EQT and related comparison operators, the MISSING function and the COALESCE SQL function (not identical with the COALESCE function) I don't know of any references specifically on this technique. But I assume that similar sort keys have been used in other posts in this forum, in SAS conference papers or possibly books.

I had been thinking about this problem -- selecting records in PROC SQL based on a hierarchy of criteria -- earlier this year (without a concrete application), had the idea of constructing such sort keys and added this idea to my SAS notes (started in 1999). Your question seemed to be an opportunity to apply this idea to a real-world problem. So thank you for asking. :-)

smackerz1988 · Posted 09-12-2022 05:36 PM

You're very welcome!

smackerz1988 · Posted 09-13-2022 10:45 AM

@FreelanceReinh Can I ask if this can be expanded further?. I have a similar analysis flag specification with three values ("High", "Low" ,"Within range") but the wrinkle is , as highlighted in red, if a subject has both "High" and "Low" I have to flag the earliest occurrence for each?. Is this approach even viable here?

FreelanceReinh · Posted 09-13-2022 12:40 PM

I don't think the earlier solution can be expanded easily to the new task because checking (in PROC SQL) whether a USUBJID-PARAMCD group has only observations with ANRIND="Within range" would typically require a GROUP BY USUBJID, PARAMCD clause. But for the earliest-date flags you would rather need a GROUP BY USUBJID, PARAMCD, ANRIND clause. So I'd switch to a different technique such as BY-group processing in a DATA step where those nested BY groups are readily available: You could sort your data (including only "High", "Low" and "Within range" cases) into a new dataset HAVE by usubjid paramcd anrind adt. Then a DATA step as shown below would create a flag for the first observation in each USUBJID-PARAMCD-ANRIND group except for those where ANRIND="Within range" occurs together with ANRIND="High" or ANRIND="Low" in the same USUBJID-PARAMCD group.

data want;
set have;
by usubjid paramcd anrind adt;
if first.anrind & (anrind ne: 'W' | first.paramcd) then flag='Y';
run;

(Note that this step uses the alphabetical order of "High", "Low" coming before "Within range".) Unlike your PROC SQL step, this results in the complete dataset with the flag variable added.

Note that missing or tied minimum ADT values are handled differently in the DATA step than in PROC SQL:

Without further changes

a missing ADT would qualify as the "earliest" visit date because a numeric missing value is less than any non-missing value
only the first of several tied minimum ADT values would be flagged (if any).

smackerz1988 · Posted 09-13-2022 02:07 PM

@FreelanceReinh That is brilliant thank you so much for going into so much depth it's very much appreciated. My only drawback and I do understand the benefit of having simulated data more and more to illustrate the crux of the coding issue but this flag is for post baseline records only and where ANL02FL ='Y' and it currently is flagging for all screening records. I apologies it was more cumbersome that what it appeared to be

USUBJID	PARAMCD	ANRIND	ABLFL	ANL02FL	FLAG
1001	ALT	Within range			Y
1001	ALT	Within range	Y	Y
1001	ALT	Within range		Y
1001	ALT	Within range		Y
1001	ALT	Within range		Y
1001	ALT	Within range		Y
1001	AST	Within range			Y
1001	AST	Within range	Y	Y
1001	AST	Within range		Y
1001	AST	Within range		Y
1001	AST	Within range		Y
1001	AST	Within range		Y
1001	BASO	Within range			Y
1001	BASO	Within range	Y
1001	BASO	Within range
1001	BASO	Within range
1001	BASO	Within range
1001	BASO	Within range
1001	BASOLE	Within range			Y
1001	BASOLE	Within range	Y
1001	BASOLE	Within range
1001	BASOLE	Within range
1001	BASOLE	Within range
1001	BASOLE	Within range
1001	BILIC	Within range			Y
1001	BILIC	Within range	Y	Y
1001	BILIC	Within range		Y
1001	BILIC	Within range		Y
1001	BILIC	Within range		Y
1001	BILIC	Within range		Y
1001	BILIU	Within range			Y
1001	BILIU	Within range	Y	Y
1001	BILIU	Within range		Y
1001	BILIU	Within range		Y
1001	BILIU	Within range		Y
1001	BILIU	Within range		Y
1001	CREAT	Within range			Y
1001	CREAT	Within range	Y	Y
1001	CREAT	Within range		Y
1001	CREAT	Within range		Y
1001	CREAT	Within range		Y
1001	CREAT	Within range		Y
1001	CRP	Within range			Y
1001	CRP	Within range	Y	Y
1001	CRP	Within range		Y
1001	CRP	Within range		Y
1001	CRP	Within range		Y
1002	MCHC	Low		Y	Y
1002	MCHC	Within range
1002	MCHC	Within range	Y	Y
1002	MCHC	High		Y
1002	MCHC	Low		Y
1002	MCHC	Within range		Y
1002	MCV	Within range			Y
1002	MCV	Within range	Y	Y
1002	MCV	Within range		Y
1002	MCV	Within range		Y
1002	MCV	Within range		Y
1002	MCV	Within range		Y

FreelanceReinh · Posted 09-13-2022 03:20 PM

A simple solution would be to select the relevant observations to be flagged and then merge back with the original dataset. If there's no suitable merge key, you could create one.

Example:

data _tmp / view=_tmp;
set have;
_seqno+1;
run;

data _flg(keep=_seqno);
set _tmp;
where /* insert applicable conditions incl. ANL02FL='Y' */;
by usubjid paramcd anrind adt;
if first.anrind & (anrind ne: 'W' | first.paramcd);
run;

data want(drop=_seqno);
merge _tmp _flg(in=_f);
by _seqno;
if _f then flag='Y';
run;

Classroom Training Available!