Hello all,
I have an interesting (to me anyway) problem that I've never come across before.
I have two datasets that are in the 10s-100s of millions of rows that I need to do some match merging on. However, the KEY from dataset 1 is contained within a longer string in a variable on dataset 2. I need to look for that key in dataset 2 by a substr/index/find type function to indicate a match. After a match has been found in that manner.. I need to check a second match criteria.
Some test data can be created with these data steps. In the data BASE.KEY must be found inside of LOOK.KEY2.. If match is found, check that BASE.AMT = LOOK.AMT2 and return LOOK.REF on the output dataset.
data base;
informat key :$10. amt :8.;
input key $ amt ;
datalines;
z911023712 5132
z082182497 1123
9182739499 1157
car1291011 1777
;;;
data look;
informat key2 :$20. amt2 :8. ref :$10. ;
input key2 $ amt2 ref $;
datalines;
askjz911023712aslk 5132 metadata1
nnsslkjdz082182497 1500 2metadata
car12910111298108z 1777 datameta3
;;;
I don't think PROC SQL is feasible as I can only see a non optimized cartesian join being required using (i.e.)
proc sql;
create table together as
select t1.* , t2.*
from base t1 , look t2
where index(t1.key,t2.key2)>0
quit;
The other option I've thought about would be a data step multiple set/point to loop through all observations in the LOOK dataset until finding a match to the key on the BASE dataset .. then checking the second condition. But this seems very computationally expensive as well.
Any thoughts or observations you could share from experience?
Thanks
With the numbers of records involved I would consider what my definition of "key" means. Perhaps one of the keys originally was created by combining elements and possibly could be reversed back.
But perhaps turning the problem around might work: match on the amount first such as a left join and then use the where to subset that result.
proc sql;
create table together as
select t1.* , t2.*
from base t1 left join look t2
on t1.amt=t2.amt2
where index(t1.key,t2.key2)>0
;
quit;
With the numbers of records involved I would consider what my definition of "key" means. Perhaps one of the keys originally was created by combining elements and possibly could be reversed back.
But perhaps turning the problem around might work: match on the amount first such as a left join and then use the where to subset that result.
proc sql;
create table together as
select t1.* , t2.*
from base t1 left join look t2
on t1.amt=t2.amt2
where index(t1.key,t2.key2)>0
;
quit;
The keys here were made up and aren't 'keys' by any means.. Was only using this to illustrate the task at hand as I can't post the actual data involved.
I hadn't thought about switching the order.. I can give that a shot and see how it performs.
Thanks.
Another approach might be to create two datasets that are only the distinct key values and use the logic to join them to create an intermediate look up table. Then you join with that in subquery.
something like:
proc sql;
create table lookup as
select a.key,b.key2
from (select distinct key from base) as a,
(select distinct key2 from look) as b
where index(a.key,b.key2)>0
;
create table want as
select t1.*, t2.*
from base as t1 left join
(
select *
from LookUp left join Look on
LookUp.Key2=Look.Key2
) as t2 on
Base.key =t2.key
;
quit;
Performance may still be an issue depending on how many unique values for the keys are.
I do say that I would still be concerned about the potential of Key having a value like 123456 on one record and 1234567 on another. Both would then match a Key2 that the second Key value matched.
You might want to try using a view that generates all the possible key substrings. It could be efficient, depending on how SAS optimizes the joins.
data base;
length key $10 amt 8;
input key amt ;
datalines;
z911023712 5132
z082182497 1123
9182739499 1157
car1291011 1777
;
data look;
length key2 $20 amt2 8 ref $10;
input key2 amt2 ref;
datalines;
askjz911023712aslk 5132 metadata1
nnsslkjdz082182497 1500 2metadata
car12910111298108z 1777 datameta3
;
data looksub / view=looksub;
set look;
do i = 1 to 11;
key = substr(key2, i, 10);
output;
end;
drop i key2;
run;
proc sql;
create table want as
select a.key, b.ref
from base as a inner join looksub as b on a.key=b.key;
quit;
Like this:
data base;
informat key $10. ;
input key $ amt ;
datalines;
z911023712 5132
z082182497 1123
9182739499 1157
car1291011 1777
;;;
data look;
informat key2 $20. amt2 ref $10. ;
input key2 $ amt2 ref $;
datalines;
askjz911023712aslk 5132 metadata1
nnsslkjdz082182497 1500 2metadata
car12910111298108z 1777 datameta3
;;;
proc sql;
select b.*,a.*
from base as a,look as b
where b.key2 contains strip(a.key);
quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.