Re: Merge criteria

podarum · Posted 01-12-2012 10:13 AM

Hi, I'm trying to merge 2 data sets but need a criteria that if one of the merges don't match (or it doesn't have a match) then merge another field.. this is what I mean :

Have:

SET1

AREA Prov

Tor Ont

Osh Ont

Mon Que

Van BC

Win Man

Ont Ont

Que Que

Man Man

BC BC

SET2

AREA VAL

Ont 2.0

Tor 3.5

Mon 4.4

Van 2.1

BC 3.2

Man 2.8

Que 1.9

Ont 2.5

Want: (merge SET1 and SET2)

SET3

AREA Prov VAL

Tor Ont 3.5

Osh Ont 2.5 <- here becasue there's no match for Osh in SET2 it uses the value for the PROV

Mon Que 4.4

Van BC 2.1

Win Man 2.8 <- here becasue there's no match for Win in SET2 it uses the value for the PROV

Ont Ont 2.5

Que Que 1.9

Man Man 2.8

BC BC 3.2

Thanks

art297 · Posted 01-12-2012 10:17 AM

Ontario has two records in set 2. Was that an error or is that the data you actually have? If it is the latter, what do you want to do in such situations?

art297 · Posted 01-12-2012 11:03 AM

You haven't responded to my questions, thus I'm answering based on the situation where the duplicate entry was an error. I think that the following accomplishes that which you want to do:

data SET1;

input (AREA Prov) ($);

cards;

Tor Ont

Osh Ont

Mon Que

Van BC

Win Man

Ont Ont

Que Que

Man Man

BC BC

;

data SET2;

input AREA $ VAL;

cards;

Tor 3.5

Mon 4.4

Van 2.1

BC 3.2

Man 2.8

Que 1.9

Ont 2.5

;

proc sort data=SET1;

by area;

run;

proc sort data=SET2;

by area;

run;

data found

notfound (drop=val

rename=(area=prov prov=area));

merge set1 (in=ina) set2 (in=inb);

by area;

if ina and inb then output found;

else if ina then output notfound;

run;

proc sort data=notfound;

by area;

run;

data need (rename=(area=prov prov=area));

merge notfound (in=ina) set2 (in=inb);

by area;

if ina and inb;

run;

data want;

set found need;

run;

podarum · Posted 01-12-2012 01:53 PM

Hi art.. is rename=(area=prov prov=area)); neccessary ?, since I have another field that I must match by (Date) as well. Thanks

art297 · Posted 01-12-2012 01:57 PM

If you use Tom's proc sql approach you avoid that need. There may be a way around it with a datastep approach, but I couldn't think of it. Using a datastep approach one has to do something to merge on the switched variables. My approach was simply to rename the variables.

As far as adding another variable, like date, you would have to again provide example data.

podarum · Posted 01-12-2012 11:57 AM

Sorry for the delay. One was an error....

art297 · Posted 01-12-2012 12:18 PM

Then either Tom's or my code should do what you want.

Tom · Posted 01-12-2012 11:03 AM

Perhaps we should let Art answer this one as those values look like Canadian provinces.

SQL is useful for doing complex joins.

create table SET3 as 
  select distinct a.area,a.prov,coalesce(b.val,c.val) as val
  from set1 a
   left join set2 b
   on a.area = b.area
   left join set2 c
   on a.prov = c.area
;

Message was edited by: Tom Abernathy to fix typo in last ON condition

art297 · Posted 01-12-2012 11:11 AM

Tom,

They are definitely Canadian cities and provinces. Your proc sql code would need something more, though (probably a case statement), as it doesn't get the values for the two records where the city is missing from the original match.

I think my convoluted datastep approach provides the right result, but I'd be interested in seeing the proc sql equivalent that achieves the same result.

Art

Tom · Posted 01-12-2012 11:22 AM

I helps if I join on the right condition.

2much2learn · Posted 01-12-2012 04:21 PM

I see you got some nice suggestions. It seems to me hash() could be one of possible approaches, since it does not require sorting and is capable of searching multiple variables at the same time.

data SET1;

infile cards;

input AREA $ Prov $;

cards;

Tor Ont

Osh Ont

Mon Que

Van BC

Win Man

Ont Ont

Que Que

Man Man

BC BC

;

data SET2;

infile cards;

input AREA $ val;

cards;

Tor 3.5

Mon 4.4

Van 2.1

BC 3.2

Man 2.8

Que 1.9

Ont 2.5

;

data merged;

if _n_=1 then do;

if 0 then set set2;

declare hash merg (dataset: "set2");

merg.definekey("area");

merg.definedata("val");

merg.definedone();

end;

set set1;

if merg.find(key:area)=0 then output;

else if merg.find(key:prov)=0 then output;

else delete;

run;

proc print;run;

Kindly Regards,

Haikuo

podarum · Posted 01-12-2012 04:37 PM

Thanks 2m2l, what if I also have different dates that I need to match for each of the obs in Set1 and Set2..

eg.

SET1

AREA Prov Date

Tor Ont 2009.01

Tor Ont 2009.02

Tor Ont 2009.03

Osh Ont 2009.01

Osh Ont 2009.02

Osh Ont 2009.03

Mon Que 2009.01

Mon Que 2009.02

Mon Que 20009.03

SET2

AREA VAL Date

Ont 2.0 2009.01

Ont 2.1 2009.02

Ont 2.5 2009.03

Tor 3.5 2009.01

Tor 3.6 2009.02

Tor 3.5 2009.03

Mon 4.4 2009.01

Mon 4.2 2009.02

Mon 4.3 2009.03

Que 1.9 2009.01

Que 0.6 2009.02

Que 0.9 2009.03

Ont 2.5 2009.01

Ont 2.8 2009.02

Ont 3.4 2009.03

Thanks

art297 · Posted 01-12-2012 05:39 PM

Using Tom's proc sql method, which I happen to prefer of the three methods that have been offered, I think that the following accomplishes what you need:

proc sql;

create table SET3 as

select distinct a.area,a.prov,a.date,

coalesce(b.val,c.val) as val

from set1 a

left join set2 b

on a.area = b.area and a.date = b.date

left join set2 c

on a.prov = c.area and a.date = c.date

;

quit;

Ksharp · Posted 01-13-2012 02:57 AM

Here is Hash method. if you like it.

data SET1 ;
input AREA  $  Prov  $ Date $ ;
datalines;
Tor         Ont      2009.01
Tor         Ont      2009.02
Tor         Ont      2009.03
Osh       Ont       2009.01
Osh       Ont       2009.02
Osh       Ont       2009.03
Mon       Que      2009.01
Mon       Que      2009.02
Mon       Que      2009.03
;
run;

data SET2;
input AREA $   VAL     Date $ ;
datalines;
Ont         2.0      2009.01
Ont         2.1      2009.02
Ont         2.5      2009.03
Tor          3.5     2009.01
Tor          3.6      2009.02
Tor          3.5      2009.03
Mon        4.4      2009.01
Mon        4.2      2009.02
Mon        4.3      2009.03
Que        1.9      2009.01
Que        0.6      2009.02
Que        0.9      2009.03
Ont         2.5      2009.01
Ont         2.8      2009.02
Ont         3.4      2009.03
;
run;
data want(drop=rc rx); 
 if _n_ eq 1 then do;
  if 0 then set set2;
  declare hash ha(dataset:'set2',hashexp:10);
   ha.definekey('area','date');
   ha.definedata('val');
   ha.definedone();
 end;
set set1;
call missing(val);               
rc=ha.find();
if rc ne 0 then rx=ha.find(key:prov,key:date);
run;

Ksharp

f_rederik · Posted 01-16-2012 03:58 PM

I can see that you have got a few replies already.

I've been quite happy using PROC SQL with a full outer join and the COALESCE() function on the common fields.

Suspect performance is less than ideal, but others will probably comment on that...

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away