Re: Alternate to sql using data step

RAVI2000 · Posted 08-09-2022 12:56 PM

Hello everyone,
I would like to have an alternate data step code for the below sql code. Any help would be appreciated.

proc sql;
	create table crs(where=(anl02fl='Y')) as
	select a.*, b.anlfl as anl02fl, b.astdt, b.aendt, b.aedecod
	from adbmk as a
	left join adae(where=crsfl='y')) as b
	on a.usubjid=b.usubjid and (b.astdt-1<=a.adt<=b.aendt2+ 3)
	;
quit;

proc sort data=crs nodupkey;
	by usubjid adt adtm;
run;

PaigeMiller · Posted 08-09-2022 01:37 PM

on a.usubjid=b.usubjid and (b.astdt-1<=a.adt<=b.aendt2+ 3)

As far as I know, this ON clause where you are using inequalities and boolean conditions isn't possible in a DATA step merge. A DATA step merge requires exact match.

--
Paige Miller

RAVI2000 · Posted 08-09-2022 01:44 PM

Exactly that where my issue comes. Can it be done in multiple data steps?

PaigeMiller · Posted 08-09-2022 01:49 PM

@RAVI2000 wrote:
Can it be done in multiple data steps?

Why not use the tool that does exactly what you want in one step?

--
Paige Miller

RAVI2000 · Posted 08-09-2022 01:50 PM

data crs(where=(anl04fl='Y'));
	merge adbmk(in=bmk) adae(in=ae where=(crsfl='Y') rename=(anlfl=anl04fl));
	by usubjid;
	if bmk;
	if astdt-1 <= adt <= aeendt2 +3;
run;

proc sort data=crs nodupkey;
	by usubjid adt adtm;
run;

Is this not equivalent?

PaigeMiller · Posted 08-09-2022 01:52 PM

Seems as if you have answered your own question, @RAVI2000 .

As far as "is it equivalent?" I leave that to you to test on your data.

--
Paige Miller

RAVI2000 · Posted 08-09-2022 01:54 PM

Well they don't match. That's what I am confused about. Aren't they both meet the same condition?

Tom · Posted 08-09-2022 02:01 PM

@RAVI2000 wrote:
Well they don't match. That's what I am confused about. Aren't they both meet the same condition?

Not when there are multiple observations for the same value of USUBJID in either or the two inputs.

RAVI2000 · Posted 08-09-2022 02:10 PM

Good catch, both of the datasets have multiple observations for the same value of USUBJID. What can be done here?

Tom · Posted 08-09-2022 02:27 PM

@RAVI2000 wrote:
Good catch, both of the datasets have multiple observations for the same value of USUBJID. What can be done here?

So USUBJID is NOT a UNIQUE key for either dataset. What are the combination variables that uniquely identify the observations in each of the datasets?

In general to do a MANY to MANY join you want to use SQL.

But depending on what you are doing then perhaps you can use a simple INTERLEAVE

data want;
  set tablea(in=inA) tableb(in=inB) ;
  by usubjid date ;
....

to get what you want, but would need to know more about the dataset and the purpose of the join.

RAVI2000 · Posted 08-09-2022 03:16 PM

Except for usubjid there are no common variables.

PaigeMiller · Posted 08-09-2022 02:01 PM

@RAVI2000 wrote:
Well they don't match.

So you have information that you are not sharing with us. Please share it with us. Give an example of several records that don't match doing it via SQL and doing it via DATA step.

That's what I am confused about. Aren't they both meet the same condition?

Running the code in SAS and looking at the results is much more authoritative than any answer I (or anyone else here in this forum) can give.

--
Paige Miller

RAVI2000 · Posted 08-09-2022 03:14 PM

data adbmk;
input usubjid $ adt $ paramcd $;
cards;
100001 20MAY19 E2SRAL
100001 20MAY19 EF	 
100001 20MAY19 ETLK5 
100003 18JUL19 ETLK8 
100003 29JUL19 ETLK8 
100014 08JUL19 E2SRAL
100014 08JUL19 EF	 
100014 09JUL19 ETLK5 
100014 09JUL19 ETLK8 
100014 11JUL19 E2SRAL
100014 11JUL19 EF	 
100014 11JUL19 ETLK5 
100014 11JUL19 ETLK8 
100014 12JUL19 E2SRAL
100014 12JUL19 EF
100004 11JUL19 E12P70	
100004 11JUL19 E2SRAL
100006 14AUG19 ENG	
100006 14AUG19 ETLK13	
100006 14AUG19 ETLK17
100007 30AUG19 E12P70	
100007 30AUG19 E2SRAL
100015 24JAN20 E12P70	
100015 24JAN20 E2SRAL
100021 07APR20 EF	
100021 07APR20 ENG	
100021 07APR20 ETLK10	
100023 28MAY20 ETLK6	
100029 06AUG20 ETLK2	
100029 06AUG20 ETLK4
100031 12OCT20 E12P70	
100031 12OCT20 E2SRAL
100036 19OCT20 ETLK1B	
100036 19OCT20 ETLK2
100038 03NOV20 ETLK6
100040 05NOV20 E2SRAL
100042 27OCT20 EF	
100042 27OCT20 ENG	
100045 17NOV20 ETLK8	
100045 20NOV20 E12P70	
100048 14DEC20 ETLK4	
100048 14DEC20 ETLK5
100059 13MAR21 ETLK13		 
;

run;

data adae;
input usubjid $ astdt $ trt01a $ crsfl $ aendt $;
cards;
100014 03FEB20 3  Y	      03FEB20	
100018 10FEB20 3  Y	      12FEB20	
100018 13FEB20 3  Y	      13FEB20	
100029 10AUG20 5  Y	      10AUG20	
100029 17AUG20 5  Y	      17AUG20	
100031 06AUG20 5  Y	      07AUG20	
100031 07AUG20 5  Y	      21AUG20	
100032 11SEP20 5  Y	      12SEP20	
100040 24OCT20 5  Y	      28OCT20	
100047 21NOV20 11 Y	      21NOV20	
100047 25NOV20 11 Y	      25NOV20	
100047 28NOV20 11 Y	      29NOV20	
100047 04DEC20 11 Y	      05DEC20	
100047 15DEC20 11 Y	      15DEC20	
100047 22DEC20 11 Y	      22DEC20	
100048 15DEC20 11 Y	      16DEC20	
100048 18DEC20 11 Y	      19DEC20	
100048 22DEC20 11 Y	      23DEC20	
100048 29DEC20 11 Y	      29DEC20	
100048 05JAN21 11 Y	      06JAN21	
100048 12JAN21 11 Y	      12JAN21	
100049 15DEC20 11 Y	      16DEC20	
100049 28JAN21 11 Y	      29JAN21	
100056 30JAN21 12 Y	      01FEB21	
100056 01FEB21 12 Y	      02FEB21	
100059 02FEB21 12 Y	      02FEB21	
100059 05FEB21 12 Y	      05FEB21	
100059 25FEB21 12 Y	      26FEB21	
100061 19FEB21 11 Y	      20FEB21	
;
run;

Here is the data from both the datasets.

I want to create a flag ANL02FL flag.

ANL02FL flag records within ADBMK if a subject experience an CRS.

The following logic should be applied to determine if a reading falls within the given windiow.

IF ADAE.ASTDT-1 <= ADBMK.ADT <= ADAE.AENDT+3;

mkeintz · Posted 08-10-2022 09:47 AM

Your SQL code does not work with your sample data. If you could supply SQL code that works properly with the data (and data which has numeric variables where the SQL expects them), then it will be much easier to determine the equivalent DATA step code.

The problem you have identified, doing a many-to-many in a DATA step takes a bit of coding. But use of a hash object can make it reliable and useful.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PGStats · Posted 08-09-2022 02:06 PM

You don't say what's wrong with either result.

Maybe...

data crs(where=(anl04fl='Y'));
	merge adbmk(in=bmk) adae(in=ae where=(crsfl='Y') rename=(anlfl=anl04fl));
	by usubjid;
	if bmk;
	if not ae or (astdt-1 <= adt <= aeendt2 +3) then output;
run;

This will keep the usubjid that are absent from table adae.

PG

Classroom Training Available!