BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JoshB
Quartz | Level 8

Hello all,

 

I have an interesting (to me anyway) problem that I've never come across before.

 

I have two datasets that are in the 10s-100s of millions of rows that I need to do some match merging on. However, the KEY from dataset 1 is contained within a longer string in a variable on dataset 2. I need to look for that key in dataset 2 by a substr/index/find type function to indicate a match. After a match has been found in that manner.. I need to check a second match criteria.

 

Some test data can be created with these data steps. In the data BASE.KEY must be found inside of LOOK.KEY2.. If match is found, check that BASE.AMT = LOOK.AMT2 and return LOOK.REF on the output dataset.

 

data base;
  informat key :$10. amt :8.;
  input key $ amt ;
  datalines;
z911023712 5132
z082182497 1123
9182739499 1157
car1291011 1777
;;;

data look;
  informat key2 :$20. amt2 :8. ref :$10. ;
  input key2 $ amt2 ref $;
  datalines;
askjz911023712aslk 5132 metadata1
nnsslkjdz082182497 1500 2metadata
car12910111298108z 1777 datameta3
;;;

 

I don't think PROC SQL is feasible as I can only see a non optimized cartesian join being required using (i.e.)

 

proc sql;
  create table together as
   select t1.* , t2.*
   from base t1 , look t2
   where index(t1.key,t2.key2)>0
quit;

 

The other option I've thought about would be a data step multiple set/point to loop through all observations in the LOOK dataset until finding a match to the key on the BASE dataset .. then checking the second condition. But this seems very computationally expensive as well.

 

Any thoughts or observations you could share from experience?

 

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

With the numbers of records involved I would consider what my definition of "key" means. Perhaps one of the keys originally was created by combining elements and possibly could be reversed back.

 

But perhaps turning the problem around might work: match on the amount first such as a left join and then use the where to subset that result.

 

proc sql;
  create table together as
   select t1.* , t2.*
   from base t1 left join look t2
   on t1.amt=t2.amt2
   where index(t1.key,t2.key2)>0
   ;
quit;

View solution in original post

5 REPLIES 5
ballardw
Super User

With the numbers of records involved I would consider what my definition of "key" means. Perhaps one of the keys originally was created by combining elements and possibly could be reversed back.

 

But perhaps turning the problem around might work: match on the amount first such as a left join and then use the where to subset that result.

 

proc sql;
  create table together as
   select t1.* , t2.*
   from base t1 left join look t2
   on t1.amt=t2.amt2
   where index(t1.key,t2.key2)>0
   ;
quit;
JoshB
Quartz | Level 8

The keys here were made up and aren't 'keys' by any means.. Was only using this to illustrate the task at hand as I can't post the actual data involved.

 

I hadn't thought about switching the order.. I can give that a shot and see how it performs.

 

Thanks.

ballardw
Super User

Another approach might be to create two datasets that are only the distinct key values and use the logic to join them to create an intermediate look up table. Then you join with that in subquery.

something like:

 

proc sql;
   create table lookup as
   select a.key,b.key2
   from (select distinct key  from base) as a,
        (select distinct key2 from look) as b
   where index(a.key,b.key2)>0
   ;
   create table want as 
   select t1.*, t2.* 
   from base as t1 left join
     (
      select * 
      from LookUp left join Look on
      LookUp.Key2=Look.Key2
     ) as t2 on
     Base.key =t2.key
   ;
quit;

Performance may still be an issue depending on how many unique values for the keys are.

 

 

I do say that I would still be concerned about the potential of Key having a value like 123456 on one record and 1234567 on another. Both would then match a Key2 that the second Key value matched.

PGStats
Opal | Level 21

You might want to try using a view that generates all the possible key substrings. It could be efficient, depending on how SAS optimizes the joins.

 

data base;
  length key $10 amt 8;
  input key amt ;
  datalines;
z911023712 5132
z082182497 1123
9182739499 1157
car1291011 1777
;

data look;
  length key2 $20 amt2 8 ref $10;
  input key2 amt2 ref;
  datalines;
askjz911023712aslk 5132 metadata1
nnsslkjdz082182497 1500 2metadata
car12910111298108z 1777 datameta3
;

data looksub / view=looksub;
set look;
do i = 1 to 11;
    key = substr(key2, i, 10);
    output;
    end;
drop i key2;
run;

proc sql;
create table want as
select a.key, b.ref
from base as a inner join looksub as b on a.key=b.key;
quit; 
PG
Ksharp
Super User

Like this:

 

data base;
  informat key $10. ;
  input key $ amt ;
  datalines;
z911023712 5132
z082182497 1123
9182739499 1157
car1291011 1777
;;;

data look;
  informat key2 $20. amt2 ref $10. ;
  input key2 $ amt2 ref $;
  datalines;
askjz911023712aslk 5132 metadata1
nnsslkjdz082182497 1500 2metadata
car12910111298108z 1777 datameta3
;;;
proc sql;
select b.*,a.*
 from base as a,look as b
  where b.key2 contains strip(a.key);
quit;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 3889 views
  • 0 likes
  • 4 in conversation