BookmarkSubscribeRSS Feed
TRCWard
Calcite | Level 5

Hello - I have a data set like the table below.  My goal is, for a given ID, to find a matching ID with the conditions that:

  • the ID characters match; and
  • Var1(ID) = Var2(laggedID).

The matching ID always precedes the given ID.

My data set looks like this:

 

DATA HAVE:

IDPeriodVar1Var2Start_LagEnd_Lag
a11500300..
a21450350..
a3235030012
a4230020023
a533005012
a6320010023
b11525175..
b21451321..
b31226216..
b4232110113
b5221610624
b6217410535

I'm trying to generate a table like this that links the ID variables.  The reason I'm doing this is in order, eventually, to transpose the table into time series data, according to the linked records.  So my Data WANT looks like this:

DATA WANT:

IDLINK
a1.
a2.
a3a2
a4a1
a5a3
a6a4
b1.
b2.
b3.
b4b2
b5b3
b6b1

LINK indicates which ID has the matching Var2.

I've tried the following approach...

% Macro TEST;

     Data WANT (Keep= ID LINK);

          Set HAVE;

               Threshold = 1;

/*I have a threshold for Linking because I need to account for rounding error between periods - e.g. b6 --> b1*/

%Do i = Start_Lag %to End_Lag;

     Difference = VAR1 - Lag&i(VAR2);

     If Difference < Threshold then do;

          Link = Lag&i(D);

          Threshold = Difference;

     end;

%end;

run;

%mend;

%TEST

I originally started with a Data Step approach but switched to a Macro because otherwise I can't seem to get Lag function to adjust dynamically to the different values specified in my Start_Lag and End_Lag columns.  Inside of the macro the %Do loop needs macro variables, though, and I haven't been able to work this step out.  I have tried using symput like this:

    

     call symput('Begin', Start_Lag);

     call symput('Stop', End_Lag);

     %Do i = &Begin %to &Stop;

          ...

but it doesn't seem to be running properly since LINK ends up all over the place...

Any ideas?

10 REPLIES 10
Haikuo
Onyx | Level 15

rules need to be more specific. Both a4 and a5 (var1) can to linked to a1 and a3(var2).

TRCWard
Calcite | Level 5

ID vars with identical period values cannot be matched.  So a4 and a3 cannot be matched. 

ID vars can only be matched to ID vars in the immedietely preceding period.  So a5 and a1 cannot be matched.

This is also shown in the variables meant to govern the lags.  For example, a5(Var1) would never be compared to a1(Var2) because the maximum permitted lag (End_Lag) for a5 is 2 - which would take you back to a3.  Similarly, a4(Var1) would never be compared to a3(Var2) because the minimum permitted lag (Start_Lag) for a4 is 2 - so the search would begin in a2.  This is what the %Do loop is supposed to be governing.

ScottBass
Rhodochrosite | Level 12

Why do so many people think macro is the cure for everything 😉

Try using either an SQL self-join or a hash object join.

Untested:

proc sql;

   create table want as

      select

         a.id

        ,b.id as link

        ,a.var1

        ,b.var2

     from

        have a

     left joinb

       have b

     on

       a.var1=b.var2

   ;

quit;

Try that, examine the cartesian product.  If you have too many records in it, see what where clause you can add to get the desired results.  Drop the two var* lines once you get your desired results - they're just in there for debugging.

I haven't coded a hash object join, would give you a bit more control over the join (like what you're doing with start_lag and end_lag???).  But there are plenty of examples in this forum; just search on "hash".

Hope this helps,

Scott


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
Ksharp
Super User

It will be easy. If you only want to check it at last period for the same group.

data x;
input ID $     Period     Var1     Var2 ;
if char(id,1) ne char(lag(id),1) then group+1;
cards;
a1     1     500     300
a2     1     450     350
a3     2     350     300
a4     2     300     200
a5     3     300     50
a6     3     200     100
b1     1     525     174
b2     1     451     321
b3     1     226     216
b4     2     321     101
b5     2     216     106
b6     2     174     105
;
run;
data want(keep=id link);
 if _n_ eq 1 then do;
  if 0 then set x;
  declare hash ha(dataset:'x(rename=(id=link))');
   ha.definekey('group','period','var2');
   ha.definedata('link');
   ha.definedone();
 end;
set x;
link=id;period=period-1;var2=var1;
if ha.find() ne 0 then call missing(link);
run;



KSharp

Haikuo
Onyx | Level 15

To apply all of your restrictions, such as start_lag, end_lag, based on Ksharp's code:

data have;

input ID$    Period    Var1    Var2    Start_Lag    End_Lag    ;

k=_n_;

group=first(id);

cards;

a1    1    500    300    .    .

a2    1    450    350    .    .

a3    2    350    300    1    2

a4    2    300    200    2    3

a5    3    300    50    1    2

a6    3    200    100    2    3

b1    1    525    175    .    .

b2    1    451    321    .    .

b3    1    226    216    .    .

b4    2    321    101    1    3

b5    2    216    106    2    4

b6    2    174    105    3    5

;

data want (keep=id link);

if _n_ eq 1 then do;

  if 0 then set have(rename=(id=link k=k1));

  declare hash ha(dataset:'have(rename=(id=link k=k1))', multidata:'y');

   ha.definekey('group','period','var2');

   ha.definedata('link','k1');

   ha.definedone();

end;

set have;

link=id;period=period-1;var2=var1;

if ha.find() ne 0 then call missing(link);

else if not (start_lag <= k-k1 <= end_lag) then call missing(link); 

run;

proc print;run;

TRCWard
Calcite | Level 5

Thanks for the help!  I don't really know anything about hash yet (I'm quite new to SAS) so I guess I've got some studying up to do.  Will try this out and post back if I have any more questions!

Tom

Linlin
Lapis Lazuli | Level 10

Hi Ksharp and Haikuo,

I am learning hash. Is "link=id" in "link=id;period=period-1;var2=var1;" necessary?

Thanks - Linlin

Haikuo
Onyx | Level 15

Great question, LinLin.

'link=id' is not really necessary for my code, as I have already initiated here : "  if 0 then set have(rename=(id=link k=k1));" , while ksharp has to do it if it is character variable.

As for "period=period-1;var2=var1", yes, they are necessary in both codes. That is to construct the 'key' for search in the hash.

Regards,

Haikuo

Linlin
Lapis Lazuli | Level 10

Thank you Haikuo!  - Linlin

Linlin
Lapis Lazuli | Level 10

Hi Ksharp and Haikuo,

Your hash solutions are really nice. I used datastep and merge to help me to understand how hash works.

Thank you!  - Linlin

data have;

input ID$    Period    Var1   Var2    Start_Lag    End_Lag   ;

k=_n_;

group=first(id);

cards;

a1    1   500    300    .   .

a2    1   450    350    .   .

a3    2   350    300    1   2

a4    2   300    200    2   3

a5    3   300    50    1   2

a6    3   200    100    2   3

b1    1   525    175    .   .

b2    1   451    321    .   .

b3    1   226    216    .   .

b4    2   321    101    1   3

b5    2   216    106    2   4

b6    2   175    105    3   5

;

run;

data temp1;

  set have(rename=(id=link k=k1));

     drop Start_Lag End_Lag var1;

  proc sort;

  by group period var2;

run;

data temp2;

  set have;

    period=period-1;var2=var1;

proc sort;

  by group period var2;

run;

data want(keep=id link);

  merge  temp2(in=a)temp1(in=b);

  by group period var2;

  if a;

  if not b then link='';

  if not (start_lag <= k-k1 <= end_lag) then call missing(link);

proc sort;

   by id;

run;

proc print;run;

                                  obs    ID   link

                                   1    a1

                                   2    a2

                                   3    a3    a2

                                   4    a4    a1

                                   5    a5    a3

                                   6    a6    a4

                                   7    b1

                                   8    b2

                                   9    b3

                                  10    b4    b2

                                  11    b5    b3

                                  12    b6    b1

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 3698 views
  • 8 likes
  • 5 in conversation