Re: Evaluate row but date on different row

Q1983 · Posted 10-09-2019 08:55 PM

data test;

input ln $ step $ date1 date9.;

format date1 date9.;

datalines;

0122 B 01jun2019

0122 B 07jun2019

0122 C 06jun2019

;

run;

data test2;

set test;

if step='B' then b_step=date1;

if step='C' then c_step=date1;

format b_step c_step mmddyy10.;

run;

ln	step	date1	b_step	c_step
0122	B	01Jun2019	06/01/2019
0122	B	07Jun2019	06/07/2019
0122	C	06Jun2019		06/06/2019

The above query produces this output. The issue is I need to evaluate both the date in b_step and c_step however as you can they are on different rows because there was no C step on the other dates. Since there is not way to evaluate rows unless they are on the same row, is there a way to program the code to get the c_step date on the same row as the b_step. So 6/6/2019 in c_step would move to the top row. If not then is there a way to evaluate b_step and c_step in its current condition (ie if b_step and c_step >0 then F= 'Yes' ;else F='No';

Shmuel · Posted 10-09-2019 09:04 PM

Check next code:

data test;
input ln $ step $ date1 date9.;
format date1 date9.;
datalines;
0122 B 01jun2019
0122 B 07jun2019
0122 C 06jun2019
;
run;

data test2;
 set test;
  by ln ;
      retain _all_;
     if step='B' then b_step=date1;
     if step='C' then c_step=date1;

     if last.ln then do; if b_step and c_step >0 then F= 'Yes' ;else F='No'; output; end;
     
     keep ln b_step c_step F;
     format b_step c_step mmddyy10.;
run;

Q1983 · Posted 10-09-2019 09:48 PM

Your solution puts the c_step on the top row however this is the actual output desired

Original output
ln	step	date1	b_step	c_step
0122	B	01Jun2019	06/01/2019
0122	B	07Jun2019	06/07/2019
0122	C	06Jun2019		06/06/2019


Desired Output
ln	step	date1	b_step	c_step
0122	B	01Jun2019	06/01/2019	06/06/2019
0122	B	07Jun2019	06/07/2019
0122	C	06Jun2019

Idea is to show c_step on same row as the B_step so I can evaluate borizontally. By nature the steps will have different rows due to different step names "B" and "C"

mkeintz · Posted 10-09-2019 09:29 PM

I don't fully understand your description of what you want. Please post what the resulting data would actually look like.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Q1983 · Posted 10-09-2019 09:49 PM

Original output
ln	step	date1	b_step	c_step
0122	B	01Jun2019	06/01/2019
0122	B	07Jun2019	06/07/2019
0122	C	06Jun2019		06/06/2019


Desired Output
ln	step	date1	b_step	c_step
0122	B	01Jun2019	06/01/2019	06/06/2019
0122	B	07Jun2019	06/07/2019
0122	C	06Jun2019

mkeintz · Posted 10-09-2019 10:48 PM

Make a minor modification to the way you make dataset TEST. Then it is straightforward:

data test;
  input ln $ step $ date1 date9.;
  if step='B' then b_step=date1; else
  if step='C' then c_step=date1;
  format date1 b_step c_step date9.;
datalines;
0122 B 01jun2019
0122 B 07jun2019
0122 C 06jun2019
run;

data test2;
 merge test (where=(step='C') in=inc)
       test;
 by ln ;
 output;
 call missing(of _all_);
run;

The MERGE statement merges the single step='C' record with all three records. Ordinarily that means the variable C_STEP would be present in all 3 resulting records (because it's a one-to-many merge).

But because it uses an explicit OUTPUT statement followed by a CALL MISSING(of _ALL_), the C_STEP inserted into record 1 was set to missing and was not restored until it was re-read in the 3rd record.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ScottBass · Posted 10-09-2019 11:59 PM

Much like in life, the SAS data step is good at "knowing the past" but not as good at "knowing the future" (it's just an analogy)

Thus, SAS has the LAG function, but not a LEAD function.

I generally avoid LAG when a RETAIN works just as well, but that's just my coding style.

However, it's easy to use PROC SQL's Cartesian product plus a surrogate key to implement both LEAD and LAG:

data test;
input ln $ step $ date1 date9.;
format date1 date9.;
datalines;
0122 B 01jun2019
0122 B 07jun2019
0122 C 06jun2019
;
run;

data have/view=have;
   sk+1;
   set test;
run;

proc sql;
   create table want as
   select a.ln
         ,a.step
         ,a.date1
         ,b.date1 as date1_lag
         ,c.date1 as date1_lead
   from have a
   full join have b
   on a.ln=b.ln and a.sk=b.sk+1
   full join have c
   on a.ln=c.ln and a.sk+1=c.sk
   where a.ln is not missing
   ;
quit;

There are other approaches:

double set statement with obs= ds options (and a lot of fiddling if you have to account for grouping columns)
indexes + surrogate key (but requires the sk to be physically in the file)
hash objects (does not require the sk to be in the file, you can use a view as I have done)

but SQL + Cartesian + SK is my go-to approach as long as data volumes don't cause performance issues due to the Cartesian product.

HTH...

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

mkeintz · Posted 10-10-2019 12:44 AM

@ScottBass

I would argue that there really isn't a true LAG function in SAS either. This misleadingly-named function should really be called something like UPDFIFOQ or UFQ (for "update fifo queue").

I bet using such queue-oriented nomenclature would not only reduce the frequent lament at the absence of "LEAD" function, but would also enhance understanding when, how, and whether to use the function in an if statement. And using the notion of updating queues would be a better fit for your observation that the "data step is good at knowing the past".

And of course, the data step can actually look ahead about as easily as it can look back:

data past_present_future;
    merge have 
          have (firstobs=2 keep=x rename=(x=next_x));
    if _n_>1 then set have (keep=x rename=(x=past_x));
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ScottBass · Posted 10-10-2019 01:31 AM

I would argue that there really isn't a true LAG function in SAS either.

I dunno...it's listed in the doc as a function 😉 IIRC with a number of caveats - it's been a while since I read the doc on LAG and am too lazy to look it up right now.

This misleadingly-named function should really be called something like UPDFIFOQ or UFQ (for "update fifo queue").

SASWare Ballot? 😉

I just never really use LAG (LAG1 that is):

data class;
   length lag_name $10;
   retain lag_name;
   lag_name=name;
   set sashelp.class;
run;

Or perhaps:

proc sort data=sashelp.class out=class (keep=name age);
   by age name;
run;

data want;
   length lag_name $10;
   retain lag_name;
   lag_name=name;
   set class;
   by age name;
   if first.age then call missing(lag_name);
run;

But it gets more cumbersome the more grouping columns you have.

Things get trickier using this approach if I need LAG2, LAG3, etc, but I rarely have a need for that.

And of course, the data step can actually look ahead about as easily as it can look back:
data past_present_future;
    merge have 
          have (firstobs=2 keep=x rename=(x=next_x));
    if _n_>1 then set have (keep=x rename=(x=past_x));
run;

Define "easily" lol. I'll define it as "intuitively obvious to someone with less than 3 years SAS experience" (or easily found as an example in the doc).

But you've shown a cool technique mixing merge and double set statement. I've done something similar in the past using double (or triple) set statements, and the premature eof (firstobs=2) was problematic. I can't remember what approach I used? point=? do until eof? Anyway, now I have your approach, so thanks!

Standalone example:

data past_present_future;
   format past_name name next_name;
   if _n_>1 then set sashelp.class (keep=name rename=(name=past_name));
   merge sashelp.class
         sashelp.class (firstobs=2 keep=name rename=(name=next_name));
run;

However, how would you do the below using your approach if you needed to group (without lots of first. and last. processing + call missings)?

proc sort data=sashelp.class out=class (keep=name age);
   by age name;
run;

data have/view=have;
   sk+1;
   set class;
run;

proc sql;
   create table want as
   select b.name as past_name
         ,a.name
         ,c.name as next_name
         ,a.age
   from have a
   full join have b
   on a.age=b.age and a.sk=b.sk+1
   full join have c
   on a.age=c.age and a.sk+1=c.sk
   where a.name is not missing
   ;
quit;

Or worse:

proc sort data=sashelp.cars out=cars (keep=make type model drivetrain);
   by make type drivetrain;
run;

data have/view=have;
   sk+1;
   set cars;
run;

proc sql;
   create table want as
   select b.model as past_model
         ,a.model
         ,c.model as next_model
         ,a.make
         ,a.type
         ,a.drivetrain
   from have a
   full join have b
   on a.make=b.make and a.type=b.type and a.drivetrain=b.drivetrain and a.sk=b.sk+1
   full join have c
   on a.make=b.make and a.type=b.type and a.drivetrain=b.drivetrain and a.sk+1=c.sk
   where a.model is not missing
   ;
quit;

(I haven't done any in depth checking on this last example to see if the results make sense - they looked ok on a quick glance).

Edit: I have used your approach, or something similar, when performance was an issue. Say 50M records, and I needed LEAD within say two grouping columns. 50M * 50M * 50M Cartesian, even when filtered out with the join criteria, can be a bit of an "issue"! But it required a bit of first. and last. processing plus call missing statements (and a lot of testing) to get the desired results.

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

mkeintz · Posted 10-10-2019 05:39 PM

@ScottBass wrote:

However, how would you do the below using your approach if you needed to group (without lots of first. and last. processing + call missings)?
proc sort data=sashelp.class out=class (keep=name age);
   by age name;
run;

data have/view=have;
   sk+1;
   set class;
run;

proc sql;
   create table want as
   select b.name as past_name
         ,a.name
         ,c.name as next_name
         ,a.age
   from have a
   full join have b
   on a.age=b.age and a.sk=b.sk+1
   full join have c
   on a.age=c.age and a.sk+1=c.sk
   where a.name is not missing
   ;
quit;
Or worse:
proc sort data=sashelp.cars out=cars (keep=make type model drivetrain);
   by make type drivetrain;
run;

data have/view=have;
   sk+1;
   set cars;
run;

proc sql;
   create table want as
   select b.model as past_model
         ,a.model
         ,c.model as next_model
         ,a.make
         ,a.type
         ,a.drivetrain
   from have a
   full join have b
   on a.make=b.make and a.type=b.type and a.drivetrain=b.drivetrain and a.sk=b.sk+1
   full join have c
   on a.make=b.make and a.type=b.type and a.drivetrain=b.drivetrain and a.sk+1=c.sk
   where a.model is not missing
   ;
quit;
(I haven't done any in depth checking on this last example to see if the results make sense - they looked ok on a quick glance).

Edit: I have used your approach, or something similar, when performance was an issue. Say 50M records, and I needed LEAD within say two grouping columns. 50M * 50M * 50M Cartesian, even when filtered out with the join criteria, can be a bit of an "issue"! But it required a bit of first. and last. processing plus call missing statements (and a lot of testing) to get the desired results.

@ScottBass

Yes, by-groups require the use of a BY statement. But that BY statement should be associated not with the MERGE or SET statements in my prior example, but to an additional SET that reads in only the BY vars, as in:

proc sort data=sashelp.cars out=cars (keep=make type model drivetrain);
   by make type drivetrain;
run;

data past_present_future;
   if _n_>1 then set cars (keep=model rename=(model=past_model));
   merge cars
         cars (firstobs=2 keep=model rename=(model=next_model));

   set cars (keep=make type drivetrain);
   by make type drivetrain;
   if first.drivetrain then call missing(of past_:);
   if last.drivetrain then call missing(of next_:);
run;

This is easily expanded to carry multiple PAST_ and NEXT_ variables, requiring only the addition of those variables (and associated RENAMEs) in the "IF ... THEN SET" statement and the 2nd argument of the MERGE statement. Just make sure to use PAST_xxx and NEXT_xxx for the renamed variables, so that the CALL MISSING can be left unchanged.

I'm not going to argue over "easily", but I do regard this code as relatively compact and straightforward. Its fundamental requirement is a level of comfort with understanding how the DATA step builds the program-data-vector. If only SAS Institute had written more meaningful documentation on SET and MERGE.

But my main point is that the DATA step need not be limited to "looking back".

regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Kurt_Bremser · Posted 10-10-2019 02:04 AM

Please use the proper window ("little running man" button) for posting code. The main window mangles code, so it needs all kinds of cosmetic operations before it runs and looks good.

See this code:

data test;
input ln $ step $ date1 date9.;
format date1 date9.;
datalines;
0122 B 01jun2019
0122 B 07jun2019
0122 C 06jun2019
;

data test2;
set test;
if step='B' then b_step=date1;
if step='C' then c_step=date1;
format b_step c_step mmddyy10.;
run;

data want;
if 0 then set test2;
call missing (of _all_);
merge
  test2 (drop=c_step)
  test2 (drop=step date1 b_step where=(c_step ne .));
;
by ln;
run;

proc print data=want noobs;
run;

Result:

 ln     step        date1        b_step        c_step

0122     B      01JUN2019    06/01/2019    06/06/2019
0122     B      07JUN2019    06/07/2019             .
0122     C      06JUN2019             .             .

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SAS Innovate 2025: Call for Content

Classroom Training Available!