Data step set, merge or update question involving lags or retains

PhilC · Posted 06-25-2019 09:28 AM

Is there a way to use merge, modify, update or set to do the following

data Left (sortedby=ID date);
  infile datalines;
  input ID $4.  RDATE  MMDDYY12. date MMDDYY10.;
    format RDATE date mmddyy10.;
datalines;
A998 06/01/2018 06/01/2018  
B027 05/21/2018 05/21/2018  
C001 03/31/2018 03/31/2018  
D221 11/01/2017 11/01/2017  
E379 10/07/2017 10/07/2017  
F693 06/19/2018 06/19/2018  
;
run;

data Right (sortedby=ID date);
  infile datalines;
  input ID $4.	INTVDT MMDDYY12.	STA  $1.	date MMDDYY11.;
    format INTVDT date mmddyy10.;
datalines;
A998 09/30/2012 Z 09/30/2012  
A998 03/11/2013 Y 03/11/2013  
A998 07/31/2017 X 07/31/2017  
A998 02/21/2018 W 02/21/2018  
B027 05/17/2011 V 05/17/2011  
E379 10/04/2017 U 10/04/2017  
G032 03/01/2017 T 03/01/2017   
;
run;

data whole;
  set left Right;
  by ID date;
run;

This is the desired dataset. Where the bolded data are lagged, retained or it is as if the SET statement forgot to clear the data vector each record.

ID   RDATE    date     INTVDT   STA
A998 .        09/30/12 09/30/12 Z
A998 .        03/11/13 03/11/13 Y
A998 .        07/31/17 07/31/17 X
A998 .        02/21/18 02/21/18 W
A998 06/01/18 06/01/18 02/21/18 W
B027 06/01/18 05/17/11 05/17/11 V
B027 05/21/18 05/21/18 05/17/11 V
C001 03/31/18 03/31/18 05/17/11 V
D221 11/01/17 11/01/17 05/17/11 V
E379 11/01/17 10/04/17 10/04/17 U
E379 10/07/17 10/07/17 10/04/17 U
F693 06/19/18 06/19/18 10/04/17 U
G032 06/19/18 03/01/17 03/01/17 T

Instead of this, as coded:

ID   RDATE    date     INTVDT   STA
A998 .        09/30/12 09/30/12 Z
A998 .        03/11/13 03/11/13 Y
A998 .        07/31/17 07/31/17 X
A998 .        02/21/18 02/21/18 W
A998 06/01/18 06/01/18 . 
B027 .        05/17/11 05/17/11 V
B027 05/21/18 05/21/18 . 
C001 03/31/18 03/31/18 . 
D221 11/01/17 11/01/17 . 
E379 .        10/04/17 10/04/17 U
E379 10/07/17 10/07/17 . 
F693 06/19/18 06/19/18 . 
G032 .        03/01/17 03/01/17 T

Astounding · Posted 06-25-2019 09:40 AM

This looks about right:

data whole;
   set left right;
   by id date;
   prior_rdate = lag(rdate);
   prior_intvdt = lag(intvdt);
   prior_sta = lag(sta);
   if rdate = . then rdate = prior_rdate;
   if intvdt = . then intvdt = prior_intvdt;
   if sta = ' ' then sta = prior_sta;
   drop prior_: ;
run;

Give it a test and see if it does what you need.

PhilC · Posted 06-25-2019 10:04 AM

Pro: it works.

Kuddos: drop prior_:

Con: To scale this up, we'd need a lag() function, if statement and prior_ variable for each column not in the by statement.

I am looking for something else.

Astounding · Posted 06-25-2019 10:14 AM

There are several ways to go. You could use macro language to generate as many "prior" variables as needed. But there are issues that require knowing something about your data. These would be helpful questions:

Do you know how many numeric variables you have?
Do you know how many character variables you have?
Will the BY variables always be the same?
For character variables, do you know the maximum length across all character variables?
What is the maximum length for variable names (so that we know there would be room to prefix them with "prior_")?

The answers can guide the proper approach. Approximate answers are probably OK.

PhilC · Posted 06-25-2019 10:22 AM

This is a common thing that I want to do. Each time I encounter it, I feel like I have to re-engineer the solution. Last time I asked this question I was educated about set statement interleaving, this blew my mind since I never encountered this even after seven years of SAS experience. I'm wondering if there is something else to learn... but if not, then maybe this is a simple enough thing to ask for a new option in the set statement.

Edit: ok, that was you that showed me the set statement interleave. ha ha

PhilC · Posted 06-25-2019 10:07 AM

I always do some of my best thinking after I've asked smart people for help. I came up with this after asking. Adding indexes to the two input files one can do this.

data whole2;
  set Left  (keep=ID date)
      Right (keep=ID date);
  by ID date;
  set Left  key=IDdate;
  set Right key=IDdate;
run;

Con: extra SET KEY statement for each file.

Con: possible file errors

Annoyances: Keep statements and indexes

Can someone one-up me here?

Astounding · Posted 06-25-2019 10:41 AM

Does this solution actually work when you cross the boundary from one ID to the next?

PhilC · Posted 06-25-2019 10:42 AM

A Do Whitlock loop and a call missing function would handle that.

Astounding · Posted 06-25-2019 10:56 AM

Here's something to consider. It assumes there are no more than 999 numeric variables, no more than 999 character variables, and a maximum length for any character variable of $1000.

data want;
if 5=4 then do;
   set left (drop=id year) right (drop=id year);
   array nums {*} _numeric_;
   array chars {*} _character_;
   array priorn {999} _temporary_;
   array prilorc {999} $1000 _temporary_;
end;
set left right;
by id date;
do _n_=1 to dim(nums);
   priorn{_n_} = lag(nums{_n_});
   if nums{_n_} = . then nums{_n_} = priorn{_n_};
end;
do _n_=1 to dim(chars);
   priorc{_n_} = lag(chars{_n_});
   if chars{_n_} = " " then chars{_n_} = priorc{_n_};
end;
run;

It's long, it's somewhat clumsy, and it's untested. Perhaps it's useful nevertheless.

PhilC · Posted 06-25-2019 11:02 AM

That's so great, to me, -- my coworkers, OTOH, would hate me if I coded like that.

Honestly, I want this:

data whole;
  set left Right /noPDVreset;
  by ID date;
run;

Thanks for your contribution, also thanks for teaching me about SET statement, way back, too.

FreelanceReinh · Posted 06-25-2019 01:56 PM

Hi @PhilC,

Maybe the "UPDATE trick" (which I've learned in this forum) can help a bit:

data tmp / view=tmp;
set left Right;
by ID date;
_i=.;
run;

data whole(drop=_i);
update tmp(obs=0) tmp;
by _i;
output;
run;

PhilC · Posted 06-25-2019 02:28 PM

Pro: it works

Pro: This works if you want to respect the ID boundaries.

data tmp0 tmp1 ;
  do until (last.ID); 
    set left Right; by ID date; output tmp1;
  end;
  output tmp0;
data whole3;
  update tmp0 tmp1; by ID; output; 
run;

Annoyances: opening a file twice, a little too many lines...

Is this just the "Update trick" it needs another name. Thanks,

edit: The "KMW-update trick"? See SAS-Programming/How-can-I-combine-multiple-rows-to-a-single-row, @kmw

Data step set, merge or update question involving lags or retains

Re: Data step set or merge question involving lags

Re: Data step set or merge question involving lags

Re: Data step set or merge question involving lags

Re: Data step set or merge question involving lags

Re: Data step set or merge question involving lags

Re: Data step set or merge question involving lags

Re: Data step set or merge question involving lags

Re: Data step set or merge question involving lags

Re: Data step set or merge question involving lags

Re: Data step set or merge question involving lags

Re: Data step set or merge question involving lags

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!