Re: Merge by different fields on a priority

podarum · Posted 08-02-2013 03:50 PM

Hi, I'm hoping someone can help showing me how to merge 2 tables baed on different fields on a priority.. Here's what I mean. I havea tabel with a large dataset (Table1) and 2 fields Call them City1 and City2.. and another reference table that has City_A which could fall within either City1 or City2 or netiher in Table1..

Have (Table1):

City1 City2
Tor Tor

Mon Sch

Ott Bra

Tor Sca

Tor Ajx

Tor Sel

Van Win

ROC Tor

ROC Chr

ROC ROC

Reference Table:

City_A

Tor

Mon

Bra

Sel

Chr

Ott

Want a table where it takes the City2 priority if it's matched in Reference Table, OR if not, take City1 if it's matched, and if not leave ROC:

Looks like this:

City1 City2 City_A

Tor Tor Tor

Mon Sch Mon

Ott Bra Bra

Tor Sca Tor

Tor Ajx Tor

Tor Sel Sel

Van Win ROC

ROC Tor Tor

ROC Chr Chr

ROC ROC ROC

Thanks much....

Astounding · Posted 08-02-2013 06:28 PM

One approach would create a format based on the reference table:

data ref;

set reference_table;

fmtname='$city';

label='found';

start=City_A;

run;

proc format cntlin=ref;

run;

Then a DATA step could use the format to look up matches:

data want;

set have;

if put(city2, $city.)='found' then City_A = city2;

else if put(city1, $city.)='found' then City_A=city1;

else city_A='ROC';

run;

Good luck.

Fugue · Posted 08-02-2013 07:12 PM

Astounding's approach is simple and effective.

You could also do something like:

proc sql;
select distinct city_a into :refer separated by ' '
from reference;
quit ;

data want ;
set have ;
refer_list = "&refer";
if indexw ( refer_list , city2 ) > 0 then city_a=city2;
else if indexw ( refer_list , city1 ) > 0 then city_a=city1;
else city_a = "ROC";
run;

Vince28_Statcan · Posted 08-03-2013 12:04 AM

If your reference table doesn't have millions of records, in can all be done in a single sql query as follow

proc sql;

create table want as

select a.city1, a.city2, (case

when a.city2 in (select distinct city_a from reftable) then city2

when (a.city2 not in (select distinct city_a from reftable) and a.city1 in (select distinct city_a from reftable)) then city1

else "ROC"

end) as city_A

from have;

quit;

Above would be the code with the best readability but if the (select distinct city_a from reftable) returns hundred of thousands of records, then the "in" operator lags significantly behind other processing strategies.

Typically, when I want to do some funky merge between a large DS and a smaller one, as you've described above, I tend to use hash tables. Here's an alternative solution, again in a single step with significantly faster processing but using the less natural hash object syntax

data want(rename=(city_c=city_a));

if _N_=1 then do;

length city_a $40.; /* placeholder length, preferably use the same as the length of your city_a variable in your reference table */

declare hash h(datasets: 'reftable');

h.defineKey('city_a');

h.defineData(); /* only necessary if you ought to retain additional columns from reference table in the merge */

h.defineDone(); /* I have a blank this may or may not require the () for appropriate syntax */

end;

set have;

if h.find(key: city2)=0 then city_c=city2;

else if h.find(key:city1)=0 then city_c=city1;

else city_c="ROC";

run;

Basically, hash table is a lookup table. RC is the return code of the find method on the hash table. It is 0 if the key was found, some code NE 0 otherwise. Hash table is all in memory which means great gains of efficiency for medium-large datasets. However, it is capped the RAM allocation of your computer to SAS. With a single key variable and no data you should have no problem get a hash table of 1M records though.

Vince

Patrick · Posted 08-03-2013 09:37 PM

Just as a variation to Vince's code:

data want;
if _N_=1 then
    do;
      if 0 then set reftable(keep=city_a);
      declare hash h(dataset: 'reftable');
      h.defineKey('city_a');
      h.defineData('city_a');
      h.defineDone();
    end;

set have;

if h.find(key: city2) ne 0 then if h.find(key:city1) ne 0 then city_a="ROC";
run;

podarum · Posted 08-06-2013 10:20 AM

Thank you all.. Great help..

Merge by different fields on a priority