BookmarkSubscribeRSS Feed
Q1983
Lapis Lazuli | Level 10

data test;

input ln $ step $ date1 date9.;

format date1 date9.;

datalines;

 

0122 B 01jun2019

0122 B 07jun2019

0122 C 06jun2019

;

run;

data test2;

set test;

if step='B' then b_step=date1;

if step='C' then c_step=date1;

format b_step c_step mmddyy10.;

run;

ln step date1 b_step c_step
0122 B 01Jun2019 06/01/2019  
0122 B 07Jun2019 06/07/2019  
0122 C 06Jun2019   06/06/2019

The above query produces this output.  The issue is I need to evaluate both the date in b_step and c_step however as you can they are on different rows because there was no C step on the other dates.  Since there is not way to evaluate rows unless they are on the same row, is there a way to program the code to get the c_step date on the same row as the b_step.  So 6/6/2019 in c_step would move to the top row.  If not then is there a way to evaluate b_step and c_step in its current condition (ie if b_step and c_step >0 then F= 'Yes' ;else F='No';

10 REPLIES 10
Shmuel
Garnet | Level 18

Check next code:

data test;
input ln $ step $ date1 date9.;
format date1 date9.;
datalines;
0122 B 01jun2019
0122 B 07jun2019
0122 C 06jun2019
;
run;

data test2;
 set test;
  by ln ;
      retain _all_;
     if step='B' then b_step=date1;
     if step='C' then c_step=date1;

     if last.ln then do; if b_step and c_step >0 then F= 'Yes' ;else F='No'; output; end;
keep ln b_step c_step F; format b_step c_step mmddyy10.; run;
Q1983
Lapis Lazuli | Level 10

Your solution puts the c_step on the top row however this is the actual output desired

Original output    
ln step date1 b_step c_step
0122 B 01Jun2019 06/01/2019  
0122 B 07Jun2019 06/07/2019  
0122 C 06Jun2019   06/06/2019
         
         
Desired Output    
ln step date1 b_step c_step
0122 B 01Jun2019 06/01/2019 06/06/2019
0122 B 07Jun2019 06/07/2019  
0122 C 06Jun2019    

Idea is to show c_step on same row as the B_step so I can evaluate borizontally.  By nature the steps will have different rows due to different step names "B" and "C"

mkeintz
PROC Star

I don't fully understand your description of what you want.  Please post what the resulting data would actually look like.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Q1983
Lapis Lazuli | Level 10
Original output    
ln step date1 b_step c_step
0122 B 01Jun2019 06/01/2019  
0122 B 07Jun2019 06/07/2019  
0122 C 06Jun2019   06/06/2019
         
         
Desired Output    
ln step date1 b_step c_step
0122 B 01Jun2019 06/01/2019 06/06/2019
0122 B 07Jun2019 06/07/2019  
0122 C 06Jun2019    
mkeintz
PROC Star

Make a minor modification to the way you make dataset TEST.  Then it is straightforward:

 

data test;
  input ln $ step $ date1 date9.;
  if step='B' then b_step=date1; else
  if step='C' then c_step=date1;
  format date1 b_step c_step date9.;
datalines;
0122 B 01jun2019
0122 B 07jun2019
0122 C 06jun2019
run;

data test2;
 merge test (where=(step='C') in=inc)
       test;
 by ln ;
 output;
 call missing(of _all_);
run;

The MERGE statement merges the single step='C' record with all three records.   Ordinarily that means the variable C_STEP would be present in all 3 resulting records (because it's a one-to-many merge).

 

But because it uses an explicit OUTPUT statement followed by a CALL MISSING(of _ALL_), the C_STEP inserted into record 1 was set to missing and was not restored until it was re-read in the 3rd record.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ScottBass
Rhodochrosite | Level 12

Much like in life, the SAS data step is good at "knowing the past" but not as good at "knowing the future" (it's just an analogy)

 

Thus, SAS has the LAG function, but not a LEAD function. 

 

I generally avoid LAG when a RETAIN works just as well, but that's just my coding style.

 

However, it's easy to use PROC SQL's Cartesian product plus a surrogate key to implement both LEAD and LAG:

 

data test;
input ln $ step $ date1 date9.;
format date1 date9.;
datalines;
0122 B 01jun2019
0122 B 07jun2019
0122 C 06jun2019
;
run;

data have/view=have;
   sk+1;
   set test;
run;

proc sql;
   create table want as
   select a.ln
         ,a.step
         ,a.date1
         ,b.date1 as date1_lag
         ,c.date1 as date1_lead
   from have a
   full join have b
   on a.ln=b.ln and a.sk=b.sk+1
   full join have c
   on a.ln=c.ln and a.sk+1=c.sk
   where a.ln is not missing
   ;
quit;

There are other approaches:

 

  • double set statement with obs= ds options (and a lot of fiddling if you have to account for grouping columns)
  • indexes + surrogate key (but requires the sk to be physically in the file)
  • hash objects (does not require the sk to be in the file, you can use a view as I have done)

but SQL + Cartesian + SK is my go-to approach as long as data volumes don't cause performance issues due to the Cartesian product.

 

HTH...


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
mkeintz
PROC Star

@ScottBass 

 

I would argue that there really isn't a true LAG function in SAS either.  This misleadingly-named function should really be called something like UPDFIFOQ  or UFQ  (for "update fifo queue").

 

I bet using such queue-oriented nomenclature would not only reduce the frequent lament at the absence of "LEAD" function, but would also enhance understanding when, how, and whether to use the function in an if statement.  And using the notion of updating queues would be a better fit for your observation that the "data step is good at knowing the past".

 

And of course, the data step can actually  look ahead about as easily as it can look back:

 

data past_present_future;
    merge have 
          have (firstobs=2 keep=x rename=(x=next_x));
    if _n_>1 then set have (keep=x rename=(x=past_x));
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ScottBass
Rhodochrosite | Level 12

I would argue that there really isn't a true LAG function in SAS either.

 

I dunno...it's listed in the doc as a function 😉  IIRC with a number of caveats - it's been a while since I read the doc on LAG and am too lazy to look it up right now.

 

This misleadingly-named function should really be called something like UPDFIFOQ  or UFQ  (for "update fifo queue").

 

SASWare Ballot? 😉  

 

I just never really use LAG (LAG1 that is):

 

data class;
   length lag_name $10;
   retain lag_name;
   lag_name=name;
   set sashelp.class;
run;

Or perhaps:

 

proc sort data=sashelp.class out=class (keep=name age);
   by age name;
run;

data want;
   length lag_name $10;
   retain lag_name;
   lag_name=name;
   set class;
   by age name;
   if first.age then call missing(lag_name);
run;

But it gets more cumbersome the more grouping columns you have.

 

Things get trickier using this approach if I need LAG2, LAG3, etc, but I rarely have a need for that.

 


And of course, the data step can actually  look ahead about as easily as it can look back:

 

data past_present_future;
    merge have 
          have (firstobs=2 keep=x rename=(x=next_x));
    if _n_>1 then set have (keep=x rename=(x=past_x));
run;

 

 

Define "easily" lol.  I'll define it as "intuitively obvious to someone with less than 3 years SAS experience" (or easily found as an example in the doc).

 

But you've shown a cool technique mixing merge and double set statement.  I've done something similar in the past using double (or triple) set statements, and the premature eof (firstobs=2) was problematic.  I can't remember what approach I used?  point=? do until eof?  Anyway, now I have your approach, so thanks!

 

Standalone example:

 

data past_present_future;
   format past_name name next_name;
   if _n_>1 then set sashelp.class (keep=name rename=(name=past_name));
   merge sashelp.class
         sashelp.class (firstobs=2 keep=name rename=(name=next_name));
run;

However, how would you do the below using your approach if you needed to group (without lots of first. and last. processing + call missings)?

 

proc sort data=sashelp.class out=class (keep=name age);
   by age name;
run;

data have/view=have;
   sk+1;
   set class;
run;

proc sql;
   create table want as
   select b.name as past_name
         ,a.name
         ,c.name as next_name
         ,a.age
   from have a
   full join have b
   on a.age=b.age and a.sk=b.sk+1
   full join have c
   on a.age=c.age and a.sk+1=c.sk
   where a.name is not missing
   ;
quit;

Or worse:

 

proc sort data=sashelp.cars out=cars (keep=make type model drivetrain);
   by make type drivetrain;
run;

data have/view=have;
   sk+1;
   set cars;
run;

proc sql;
   create table want as
   select b.model as past_model
         ,a.model
         ,c.model as next_model
         ,a.make
         ,a.type
         ,a.drivetrain
   from have a
   full join have b
   on a.make=b.make and a.type=b.type and a.drivetrain=b.drivetrain and a.sk=b.sk+1
   full join have c
   on a.make=b.make and a.type=b.type and a.drivetrain=b.drivetrain and a.sk+1=c.sk
   where a.model is not missing
   ;
quit;

(I haven't done any in depth checking on this last example to see if the results make sense - they looked ok on a quick glance).

 

Edit:  I have used your approach, or something similar, when performance was an issue.  Say 50M records, and I needed LEAD within say two grouping columns.  50M * 50M * 50M Cartesian, even when filtered out with the join criteria, can be a bit of an "issue"!  But it required a bit of first. and last. processing plus call missing statements (and a lot of testing) to get the desired results.


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
mkeintz
PROC Star

@ScottBass wrote:

However, how would you do the below using your approach if you needed to group (without lots of first. and last. processing + call missings)?

 

proc sort data=sashelp.class out=class (keep=name age);
   by age name;
run;

data have/view=have;
   sk+1;
   set class;
run;

proc sql;
   create table want as
   select b.name as past_name
         ,a.name
         ,c.name as next_name
         ,a.age
   from have a
   full join have b
   on a.age=b.age and a.sk=b.sk+1
   full join have c
   on a.age=c.age and a.sk+1=c.sk
   where a.name is not missing
   ;
quit;

Or worse:

 

proc sort data=sashelp.cars out=cars (keep=make type model drivetrain);
   by make type drivetrain;
run;

data have/view=have;
   sk+1;
   set cars;
run;

proc sql;
   create table want as
   select b.model as past_model
         ,a.model
         ,c.model as next_model
         ,a.make
         ,a.type
         ,a.drivetrain
   from have a
   full join have b
   on a.make=b.make and a.type=b.type and a.drivetrain=b.drivetrain and a.sk=b.sk+1
   full join have c
   on a.make=b.make and a.type=b.type and a.drivetrain=b.drivetrain and a.sk+1=c.sk
   where a.model is not missing
   ;
quit;

(I haven't done any in depth checking on this last example to see if the results make sense - they looked ok on a quick glance).

 

Edit:  I have used your approach, or something similar, when performance was an issue.  Say 50M records, and I needed LEAD within say two grouping columns.  50M * 50M * 50M Cartesian, even when filtered out with the join criteria, can be a bit of an "issue"!  But it required a bit of first. and last. processing plus call missing statements (and a lot of testing) to get the desired results.


 

@ScottBass 

 

Yes, by-groups require the use of a BY statement.   But that BY statement should be associated not with the MERGE or SET statements in my prior example, but  to an additional SET that reads in only the BY vars, as in:

 

proc sort data=sashelp.cars out=cars (keep=make type model drivetrain);
   by make type drivetrain;
run;

data past_present_future;
   if _n_>1 then set cars (keep=model rename=(model=past_model));
   merge cars
         cars (firstobs=2 keep=model rename=(model=next_model));

   set cars (keep=make type drivetrain);
   by make type drivetrain;
   if first.drivetrain then call missing(of past_:);
   if last.drivetrain then call missing(of next_:);
run;

This is easily expanded to carry multiple PAST_ and NEXT_ variables, requiring only the addition of those variables (and associated RENAMEs) in the "IF ... THEN SET" statement and the 2nd argument of the MERGE statement.  Just make sure to use PAST_xxx and NEXT_xxx for the renamed variables, so that the CALL MISSING can be left unchanged.

 

I'm not going to argue over "easily", but I do regard this code as relatively compact and straightforward.  Its fundamental requirement is a level of comfort with understanding how the DATA step builds the program-data-vector.  If only  SAS Institute had written more meaningful documentation on SET and MERGE.

 

But my main point is that the DATA step need not be limited to "looking back".

 

regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Kurt_Bremser
Super User

Please use the proper window ("little running man" button) for posting code. The main window mangles code, so it needs all kinds of cosmetic operations before it runs and looks good.

 

See this code:

data test;
input ln $ step $ date1 date9.;
format date1 date9.;
datalines;
0122 B 01jun2019
0122 B 07jun2019
0122 C 06jun2019
;

data test2;
set test;
if step='B' then b_step=date1;
if step='C' then c_step=date1;
format b_step c_step mmddyy10.;
run;

data want;
if 0 then set test2;
call missing (of _all_);
merge
  test2 (drop=c_step)
  test2 (drop=step date1 b_step where=(c_step ne .));
;
by ln;
run;

proc print data=want noobs;
run;

Result:

 ln     step        date1        b_step        c_step

0122     B      01JUN2019    06/01/2019    06/06/2019
0122     B      07JUN2019    06/07/2019             .
0122     C      06JUN2019             .             .

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 1843 views
  • 2 likes
  • 5 in conversation