Hello all,
I have a question about merging two datasets of different lengths. Datasets look like this:
data dat1; input x; datalines; 1 1 1 1 2 2 2 2 3 3 3 3 ; run; data dat2; input y; datalines; a b c d ; run;
And I want this:
data want; input x y; datalines; 1 a 1 b 1 c 1 d 2 a 2 b 2 c 2 d 3 a 3 b 3 c 3 d ; run;
Since I don't have a unique identifier, I'm not sure how to merge this and allow the variable y to repeat down.
I have tried:
data attempt; merge dat1 dat2; run;
but, this only repeats dat2 once, throuhg the "1"s of dat1. How do I allow dat2 to repeat all the way down the values of dat1?
Thank you!
One SQL way. this does cartesian join and duplicates are removed.
proc sql;
create table attempt as
select distinct x,y from dat1,dat2;
quit;
The proc sql is much more compact programming, but might take longer the a data step for large files. After all, it first does a cartesian crossing only to delete the majority of cross-records (3/4 in your case). Of course, that could be solved by having only unique X values in DAT1.
But here, as requested is a data step solution, assuming that DAT1 is sorted by X
data want (drop=_:);
set dat1 ;
by x;
if _n_=1 then do;
if 0 then set dat2 (obs=0) nobs=ndat2;
declare hash d2 (dataset:'need2', ordered:'Y');
d2.definekey('_key');
d2.definedata(all:'Y');
d2.definedone();
end;
_key+1;
if first.x then _key=1;
_rc=d2.find();
run;
This will work as expected if DAT1 has exactly NDAT2 observation per X value. (NDAT2 is numeber of obs in DAT2). It won't cause any "erroneous" values if a given X value has fewer than NDAT2 records. But if an X group has more than NDAT2, then note that the last value of Y in NDAT2 will prevail in all the "extra" dat1 records - i.e. you'll get a duplicate X/Y combination. If you want to protect against that, then precede the "_rc=d2.find()" statement with a "call missing(y)", protecting against duplicate non-missing Y's.
So what is your key for making the merge? Why not add one?
data dat1_fixed;
set dat1 ;
by x ;
rep+1;
if first.x then rep=1;
run;
data dat2_fixed;
set dat2 ;
rep+1;
run;
proc sql ;
create table want as
select a.*,b.y
from dat1_fixed a
inner join dat2_fixed b
on a.rep = b.rep
order by 1,2
;
quit;
Obs x rep y 1 1 1 a 2 1 2 b 3 1 3 c 4 1 4 d 5 2 1 a 6 2 2 b 7 2 3 c 8 2 4 d 9 3 1 a 10 3 2 b 11 3 3 c 12 3 4 d
@pamplemouse22 Hiter the hash can help here I think. Assuming your dat1 is sorted as shown in your example-
data dat1;
input x;
datalines;
1
1
1
1
2
2
2
2
3
3
3
3
;
run;
data dat2;
input y $;
datalines;
a
b
c
d
;
run;
data want;
if _N_ = 1 then do;
if 0 then set dat2;
declare hash h(dataset:"dat2", ordered: 'yes');
declare hiter iter('h');
h.defineKey('y');
h.defineData('y');
h.defineDone();
end;
set dat1;
by x;
if first.x then iter.first();
else iter.next();
run;
Regards,
Naveen Srinivasan
data want; do _N_=1 by 1 until(last.x); set dat1; by x; if _N_<=n2 then set dat2 point=_N_ nobs=n2; else y=.; output; end; run;
This solution puts Y values on as many X-s as there are, and if it runs out of Y values, it sets Y missing.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.