Hi
I have two datasets like below
DATA T1;
INPUT ssid gender $ grade;
DATALINES ;
100 M 08
101 M 08
200 F 08
300 F 08
400 M 08
;
RUN;
DATA T2;
INPUT old new tmp $;
DATALINES ;
100 101 abc
200 201 bcd
400 401 def
;
RUN;
I want to merge these two data sets to get the below output
ssid gender grade
101 M 08
201 F 08
300 F 08
401 M 08
appreciate your help on this code, without "WARNING: Variable already exists on file" in the log as my data is large
thanks
@cm3:
If I understand your specs correctly, you want to update SSID in T1 to NEW from T2 if SSID=OLD but don't want duplicate records in the output. If so, the hash object is your friend:
data t1 ;
input ssid gender $ grade ;
cards ;
100 m 08
101 m 08
200 f 08
300 f 08
400 m 08
;
run ;
data t2 ;
input old new tmp $ ;
cards ;
100 101 abc
200 201 bcd
400 401 def
;
run ;
data want (keep = ssid gender grade) ;
if _n_ = 1 then do ;
if 0 then set t2 ;
dcl hash h (dataset:"t2") ;
h.definekey ("old") ;
h.definedata ("new") ;
h.definedone () ;
dcl hash x () ;
x.definekey ("ssid", "gender", "grade") ;
x.definedone () ;
end ;
set t1 ;
if h.find (key:ssid) = 0 then ssid = new ;
if x.check() ne 0 ;
x.add() ;
run ;
The hash table H is used to effectuate the updates; the hash table X is used to exclude duplicates.
Kind regards
Paul D.
Two methods: (1) using sql (2) using merge datastep:
/*1*/
proc sql;
create table want as select *
from t1 where
ssid not in (select old from t2);
quit;
/*2 - assuming data are sorted */
data want;
merge t1(in=in1)
t2(in=in2 rename=(ols=ssid));
by ssid;
if in1 and not in2;
run;
Appreciate your quick response.
I can't use subquery option as I have another variable, old to new update which I did not mention in my example.
And with the merge logic , new 201 and 401 records are missing in output data set.
thank you
Its not clear what table T2 represent and how to deal with.
Where from do you get the 201, 401 records' attributes if they are not exist in T1 tabel ?
@cm3:
If I understand your specs correctly, you want to update SSID in T1 to NEW from T2 if SSID=OLD but don't want duplicate records in the output. If so, the hash object is your friend:
data t1 ;
input ssid gender $ grade ;
cards ;
100 m 08
101 m 08
200 f 08
300 f 08
400 m 08
;
run ;
data t2 ;
input old new tmp $ ;
cards ;
100 101 abc
200 201 bcd
400 401 def
;
run ;
data want (keep = ssid gender grade) ;
if _n_ = 1 then do ;
if 0 then set t2 ;
dcl hash h (dataset:"t2") ;
h.definekey ("old") ;
h.definedata ("new") ;
h.definedone () ;
dcl hash x () ;
x.definekey ("ssid", "gender", "grade") ;
x.definedone () ;
end ;
set t1 ;
if h.find (key:ssid) = 0 then ssid = new ;
if x.check() ne 0 ;
x.add() ;
run ;
The hash table H is used to effectuate the updates; the hash table X is used to exclude duplicates.
Kind regards
Paul D.
Thanks Paul D, It worked.
Sorry @Shmuel, my specifications were not clear
I figured out the below code which worked as well.
proc sql;
create table want as
select A.*, B.*
from T1 A left outer join T2 B
on A.SSID = B.old
or A.SSID = B.new;
quit;
data want (drop= old new tmp);
set want;
if SSID=new then delete;
if old ^= . then do;
SSID=new;
end;
run;
Regards,
cm3
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.