Hello all,
I have a question about merging two datasets of different lengths. Datasets look like this:
data dat1; input x; datalines; 1 1 1 1 2 2 2 2 3 3 3 3 ; run; data dat2; input y; datalines; a b c d ; run;
And I want this:
data want; input x y; datalines; 1 a 1 b 1 c 1 d 2 a 2 b 2 c 2 d 3 a 3 b 3 c 3 d ; run;
Since I don't have a unique identifier, I'm not sure how to merge this and allow the variable y to repeat down.
I have tried:
data attempt; merge dat1 dat2; run;
but, this only repeats dat2 once, throuhg the "1"s of dat1. How do I allow dat2 to repeat all the way down the values of dat1?
Thank you!
One SQL way. this does cartesian join and duplicates are removed.
proc sql;
create table attempt as
select distinct x,y from dat1,dat2;
quit;
The proc sql is much more compact programming, but might take longer the a data step for large files. After all, it first does a cartesian crossing only to delete the majority of cross-records (3/4 in your case). Of course, that could be solved by having only unique X values in DAT1.
But here, as requested is a data step solution, assuming that DAT1 is sorted by X
data want (drop=_:);
set dat1 ;
by x;
if _n_=1 then do;
if 0 then set dat2 (obs=0) nobs=ndat2;
declare hash d2 (dataset:'need2', ordered:'Y');
d2.definekey('_key');
d2.definedata(all:'Y');
d2.definedone();
end;
_key+1;
if first.x then _key=1;
_rc=d2.find();
run;
This will work as expected if DAT1 has exactly NDAT2 observation per X value. (NDAT2 is numeber of obs in DAT2). It won't cause any "erroneous" values if a given X value has fewer than NDAT2 records. But if an X group has more than NDAT2, then note that the last value of Y in NDAT2 will prevail in all the "extra" dat1 records - i.e. you'll get a duplicate X/Y combination. If you want to protect against that, then precede the "_rc=d2.find()" statement with a "call missing(y)", protecting against duplicate non-missing Y's.
So what is your key for making the merge? Why not add one?
data dat1_fixed;
set dat1 ;
by x ;
rep+1;
if first.x then rep=1;
run;
data dat2_fixed;
set dat2 ;
rep+1;
run;
proc sql ;
create table want as
select a.*,b.y
from dat1_fixed a
inner join dat2_fixed b
on a.rep = b.rep
order by 1,2
;
quit;
Obs x rep y 1 1 1 a 2 1 2 b 3 1 3 c 4 1 4 d 5 2 1 a 6 2 2 b 7 2 3 c 8 2 4 d 9 3 1 a 10 3 2 b 11 3 3 c 12 3 4 d
@pamplemouse22 Hiter the hash can help here I think. Assuming your dat1 is sorted as shown in your example-
data dat1;
input x;
datalines;
1
1
1
1
2
2
2
2
3
3
3
3
;
run;
data dat2;
input y $;
datalines;
a
b
c
d
;
run;
data want;
if _N_ = 1 then do;
if 0 then set dat2;
declare hash h(dataset:"dat2", ordered: 'yes');
declare hiter iter('h');
h.defineKey('y');
h.defineData('y');
h.defineDone();
end;
set dat1;
by x;
if first.x then iter.first();
else iter.next();
run;
Regards,
Naveen Srinivasan
data want; do _N_=1 by 1 until(last.x); set dat1; by x; if _N_<=n2 then set dat2 point=_N_ nobs=n2; else y=.; output; end; run;
This solution puts Y values on as many X-s as there are, and if it runs out of Y values, it sets Y missing.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.