Solved: Re: SAS matching on substring

JoshB · Posted 10-06-2016 12:59 PM

Hello all,

I have an interesting (to me anyway) problem that I've never come across before.

I have two datasets that are in the 10s-100s of millions of rows that I need to do some match merging on. However, the KEY from dataset 1 is contained within a longer string in a variable on dataset 2. I need to look for that key in dataset 2 by a substr/index/find type function to indicate a match. After a match has been found in that manner.. I need to check a second match criteria.

Some test data can be created with these data steps. In the data BASE.KEY must be found inside of LOOK.KEY2.. If match is found, check that BASE.AMT = LOOK.AMT2 and return LOOK.REF on the output dataset.

data base;
  informat key :$10. amt :8.;
  input key $ amt ;
  datalines;
z911023712 5132
z082182497 1123
9182739499 1157
car1291011 1777
;;;

data look;
  informat key2 :$20. amt2 :8. ref :$10. ;
  input key2 $ amt2 ref $;
  datalines;
askjz911023712aslk 5132 metadata1
nnsslkjdz082182497 1500 2metadata
car12910111298108z 1777 datameta3
;;;

I don't think PROC SQL is feasible as I can only see a non optimized cartesian join being required using (i.e.)

proc sql;
  create table together as
   select t1.* , t2.*
   from base t1 , look t2
   where index(t1.key,t2.key2)>0
quit;

The other option I've thought about would be a data step multiple set/point to loop through all observations in the LOOK dataset until finding a match to the key on the BASE dataset .. then checking the second condition. But this seems very computationally expensive as well.

Any thoughts or observations you could share from experience?

Thanks

ballardw · Posted 10-06-2016 01:15 PM

With the numbers of records involved I would consider what my definition of "key" means. Perhaps one of the keys originally was created by combining elements and possibly could be reversed back.

But perhaps turning the problem around might work: match on the amount first such as a left join and then use the where to subset that result.

proc sql;
  create table together as
   select t1.* , t2.*
   from base t1 left join look t2
   on t1.amt=t2.amt2
   where index(t1.key,t2.key2)>0
   ;
quit;

View solution in original post

ballardw · Posted 10-06-2016 01:15 PM

With the numbers of records involved I would consider what my definition of "key" means. Perhaps one of the keys originally was created by combining elements and possibly could be reversed back.

But perhaps turning the problem around might work: match on the amount first such as a left join and then use the where to subset that result.

proc sql;
  create table together as
   select t1.* , t2.*
   from base t1 left join look t2
   on t1.amt=t2.amt2
   where index(t1.key,t2.key2)>0
   ;
quit;

JoshB · Posted 10-06-2016 02:52 PM

The keys here were made up and aren't 'keys' by any means.. Was only using this to illustrate the task at hand as I can't post the actual data involved.

I hadn't thought about switching the order.. I can give that a shot and see how it performs.

Thanks.

ballardw · Posted 10-06-2016 04:43 PM

Another approach might be to create two datasets that are only the distinct key values and use the logic to join them to create an intermediate look up table. Then you join with that in subquery.

something like:

proc sql;
   create table lookup as
   select a.key,b.key2
   from (select distinct key  from base) as a,
        (select distinct key2 from look) as b
   where index(a.key,b.key2)>0
   ;
   create table want as 
   select t1.*, t2.* 
   from base as t1 left join
     (
      select * 
      from LookUp left join Look on
      LookUp.Key2=Look.Key2
     ) as t2 on
     Base.key =t2.key
   ;
quit;

Performance may still be an issue depending on how many unique values for the keys are.

I do say that I would still be concerned about the potential of Key having a value like 123456 on one record and 1234567 on another. Both would then match a Key2 that the second Key value matched.

PGStats · Posted 10-06-2016 04:42 PM

You might want to try using a view that generates all the possible key substrings. It could be efficient, depending on how SAS optimizes the joins.

data base;
  length key $10 amt 8;
  input key amt ;
  datalines;
z911023712 5132
z082182497 1123
9182739499 1157
car1291011 1777
;

data look;
  length key2 $20 amt2 8 ref $10;
  input key2 amt2 ref;
  datalines;
askjz911023712aslk 5132 metadata1
nnsslkjdz082182497 1500 2metadata
car12910111298108z 1777 datameta3
;

data looksub / view=looksub;
set look;
do i = 1 to 11;
    key = substr(key2, i, 10);
    output;
    end;
drop i key2;
run;

proc sql;
create table want as
select a.key, b.ref
from base as a inner join looksub as b on a.key=b.key;
quit;

PG

Ksharp · Posted 10-06-2016 11:33 PM

Like this:

data base;
  informat key $10. ;
  input key $ amt ;
  datalines;
z911023712 5132
z082182497 1123
9182739499 1157
car1291011 1777
;;;

data look;
  informat key2 $20. amt2 ref $10. ;
  input key2 $ amt2 ref $;
  datalines;
askjz911023712aslk 5132 metadata1
nnsslkjdz082182497 1500 2metadata
car12910111298108z 1777 datameta3
;;;
proc sql;
select b.*,a.*
 from base as a,look as b
  where b.key2 contains strip(a.key);
quit;

SAS matching on substring

Re: SAS matching on substring

Re: SAS matching on substring

Re: SAS matching on substring

Re: SAS matching on substring

Re: SAS matching on substring

Re: SAS matching on substring

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away