Hello - I have a data set like the table below. My goal is, for a given ID, to find a matching ID with the conditions that:
The matching ID always precedes the given ID.
My data set looks like this:
DATA HAVE:
ID | Period | Var1 | Var2 | Start_Lag | End_Lag |
---|---|---|---|---|---|
a1 | 1 | 500 | 300 | . | . |
a2 | 1 | 450 | 350 | . | . |
a3 | 2 | 350 | 300 | 1 | 2 |
a4 | 2 | 300 | 200 | 2 | 3 |
a5 | 3 | 300 | 50 | 1 | 2 |
a6 | 3 | 200 | 100 | 2 | 3 |
b1 | 1 | 525 | 175 | . | . |
b2 | 1 | 451 | 321 | . | . |
b3 | 1 | 226 | 216 | . | . |
b4 | 2 | 321 | 101 | 1 | 3 |
b5 | 2 | 216 | 106 | 2 | 4 |
b6 | 2 | 174 | 105 | 3 | 5 |
I'm trying to generate a table like this that links the ID variables. The reason I'm doing this is in order, eventually, to transpose the table into time series data, according to the linked records. So my Data WANT looks like this:
DATA WANT:
ID | LINK |
---|---|
a1 | . |
a2 | . |
a3 | a2 |
a4 | a1 |
a5 | a3 |
a6 | a4 |
b1 | . |
b2 | . |
b3 | . |
b4 | b2 |
b5 | b3 |
b6 | b1 |
LINK indicates which ID has the matching Var2.
I've tried the following approach...
% Macro TEST;
Data WANT (Keep= ID LINK);
Set HAVE;
Threshold = 1;
/*I have a threshold for Linking because I need to account for rounding error between periods - e.g. b6 --> b1*/
%Do i = Start_Lag %to End_Lag;
Difference = VAR1 - Lag&i(VAR2);
If Difference < Threshold then do;
Link = Lag&i(D);
Threshold = Difference;
end;
%end;
run;
%mend;
%TEST
I originally started with a Data Step approach but switched to a Macro because otherwise I can't seem to get Lag function to adjust dynamically to the different values specified in my Start_Lag and End_Lag columns. Inside of the macro the %Do loop needs macro variables, though, and I haven't been able to work this step out. I have tried using symput like this:
call symput('Begin', Start_Lag);
call symput('Stop', End_Lag);
%Do i = &Begin %to &Stop;
...
but it doesn't seem to be running properly since LINK ends up all over the place...
Any ideas?
rules need to be more specific. Both a4 and a5 (var1) can to linked to a1 and a3(var2).
ID vars with identical period values cannot be matched. So a4 and a3 cannot be matched.
ID vars can only be matched to ID vars in the immedietely preceding period. So a5 and a1 cannot be matched.
This is also shown in the variables meant to govern the lags. For example, a5(Var1) would never be compared to a1(Var2) because the maximum permitted lag (End_Lag) for a5 is 2 - which would take you back to a3. Similarly, a4(Var1) would never be compared to a3(Var2) because the minimum permitted lag (Start_Lag) for a4 is 2 - so the search would begin in a2. This is what the %Do loop is supposed to be governing.
Why do so many people think macro is the cure for everything 😉
Try using either an SQL self-join or a hash object join.
Untested:
proc sql;
create table want as
select
a.id
,b.id as link
,a.var1
,b.var2
from
have a
left joinb
have b
on
a.var1=b.var2
;
quit;
Try that, examine the cartesian product. If you have too many records in it, see what where clause you can add to get the desired results. Drop the two var* lines once you get your desired results - they're just in there for debugging.
I haven't coded a hash object join, would give you a bit more control over the join (like what you're doing with start_lag and end_lag???). But there are plenty of examples in this forum; just search on "hash".
Hope this helps,
Scott
It will be easy. If you only want to check it at last period for the same group.
data x; input ID $ Period Var1 Var2 ; if char(id,1) ne char(lag(id),1) then group+1; cards; a1 1 500 300 a2 1 450 350 a3 2 350 300 a4 2 300 200 a5 3 300 50 a6 3 200 100 b1 1 525 174 b2 1 451 321 b3 1 226 216 b4 2 321 101 b5 2 216 106 b6 2 174 105 ; run; data want(keep=id link); if _n_ eq 1 then do; if 0 then set x; declare hash ha(dataset:'x(rename=(id=link))'); ha.definekey('group','period','var2'); ha.definedata('link'); ha.definedone(); end; set x; link=id;period=period-1;var2=var1; if ha.find() ne 0 then call missing(link); run;
KSharp
To apply all of your restrictions, such as start_lag, end_lag, based on Ksharp's code:
data have;
input ID$ Period Var1 Var2 Start_Lag End_Lag ;
k=_n_;
group=first(id);
cards;
a1 1 500 300 . .
a2 1 450 350 . .
a3 2 350 300 1 2
a4 2 300 200 2 3
a5 3 300 50 1 2
a6 3 200 100 2 3
b1 1 525 175 . .
b2 1 451 321 . .
b3 1 226 216 . .
b4 2 321 101 1 3
b5 2 216 106 2 4
b6 2 174 105 3 5
;
data want (keep=id link);
if _n_ eq 1 then do;
if 0 then set have(rename=(id=link k=k1));
declare hash ha(dataset:'have(rename=(id=link k=k1))', multidata:'y');
ha.definekey('group','period','var2');
ha.definedata('link','k1');
ha.definedone();
end;
set have;
link=id;period=period-1;var2=var1;
if ha.find() ne 0 then call missing(link);
else if not (start_lag <= k-k1 <= end_lag) then call missing(link);
run;
proc print;run;
Thanks for the help! I don't really know anything about hash yet (I'm quite new to SAS) so I guess I've got some studying up to do. Will try this out and post back if I have any more questions!
Tom
Hi Ksharp and Haikuo,
I am learning hash. Is "link=id" in "link=id;period=period-1;var2=var1;" necessary?
Thanks - Linlin
Great question, LinLin.
'link=id' is not really necessary for my code, as I have already initiated here : " if 0 then set have(rename=(id=link k=k1));" , while ksharp has to do it if it is character variable.
As for "period=period-1;var2=var1", yes, they are necessary in both codes. That is to construct the 'key' for search in the hash.
Regards,
Haikuo
Thank you Haikuo! - Linlin
Hi Ksharp and Haikuo,
Your hash solutions are really nice. I used datastep and merge to help me to understand how hash works.
Thank you! - Linlin
data have;
input ID$ Period Var1 Var2 Start_Lag End_Lag ;
k=_n_;
group=first(id);
cards;
a1 1 500 300 . .
a2 1 450 350 . .
a3 2 350 300 1 2
a4 2 300 200 2 3
a5 3 300 50 1 2
a6 3 200 100 2 3
b1 1 525 175 . .
b2 1 451 321 . .
b3 1 226 216 . .
b4 2 321 101 1 3
b5 2 216 106 2 4
b6 2 175 105 3 5
;
run;
data temp1;
set have(rename=(id=link k=k1));
drop Start_Lag End_Lag var1;
proc sort;
by group period var2;
run;
data temp2;
set have;
period=period-1;var2=var1;
proc sort;
by group period var2;
run;
data want(keep=id link);
merge temp2(in=a)temp1(in=b);
by group period var2;
if a;
if not b then link='';
if not (start_lag <= k-k1 <= end_lag) then call missing(link);
proc sort;
by id;
run;
proc print;run;
obs ID link
1 a1
2 a2
3 a3 a2
4 a4 a1
5 a5 a3
6 a6 a4
7 b1
8 b2
9 b3
10 b4 b2
11 b5 b3
12 b6 b1
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.