09-27-2017 10:54 AM - edited 09-27-2017 11:14 AM
I have a question about merging two datasets of different lengths. Datasets look like this:
data dat1; input x; datalines; 1 1 1 1 2 2 2 2 3 3 3 3 ; run; data dat2; input y; datalines; a b c d ; run;
And I want this:
data want; input x y; datalines; 1 a 1 b 1 c 1 d 2 a 2 b 2 c 2 d 3 a 3 b 3 c 3 d ; run;
Since I don't have a unique identifier, I'm not sure how to merge this and allow the variable y to repeat down.
I have tried:
data attempt; merge dat1 dat2; run;
but, this only repeats dat2 once, throuhg the "1"s of dat1. How do I allow dat2 to repeat all the way down the values of dat1?
09-27-2017 12:12 PM
The proc sql is much more compact programming, but might take longer the a data step for large files. After all, it first does a cartesian crossing only to delete the majority of cross-records (3/4 in your case). Of course, that could be solved by having only unique X values in DAT1.
But here, as requested is a data step solution, assuming that DAT1 is sorted by X
data want (drop=_:); set dat1 ; by x; if _n_=1 then do; if 0 then set dat2 (obs=0) nobs=ndat2; declare hash d2 (dataset:'need2', ordered:'Y'); d2.definekey('_key'); d2.definedata(all:'Y'); d2.definedone(); end; _key+1; if first.x then _key=1; _rc=d2.find(); run;
This will work as expected if DAT1 has exactly NDAT2 observation per X value. (NDAT2 is numeber of obs in DAT2). It won't cause any "erroneous" values if a given X value has fewer than NDAT2 records. But if an X group has more than NDAT2, then note that the last value of Y in NDAT2 will prevail in all the "extra" dat1 records - i.e. you'll get a duplicate X/Y combination. If you want to protect against that, then precede the "_rc=d2.find()" statement with a "call missing(y)", protecting against duplicate non-missing Y's.
09-27-2017 12:24 PM
So what is your key for making the merge? Why not add one?
data dat1_fixed; set dat1 ; by x ; rep+1; if first.x then rep=1; run; data dat2_fixed; set dat2 ; rep+1; run; proc sql ; create table want as select a.*,b.y from dat1_fixed a inner join dat2_fixed b on a.rep = b.rep order by 1,2 ; quit;
Obs x rep y 1 1 1 a 2 1 2 b 3 1 3 c 4 1 4 d 5 2 1 a 6 2 2 b 7 2 3 c 8 2 4 d 9 3 1 a 10 3 2 b 11 3 3 c 12 3 4 d
09-27-2017 03:07 PM
@pamplemouse22 Hiter the hash can help here I think. Assuming your dat1 is sorted as shown in your example-
input y $;
if _N_ = 1 then do;
if 0 then set dat2;
declare hash h(dataset:"dat2", ordered: 'yes');
declare hiter iter('h');
if first.x then iter.first();
09-28-2017 06:05 AM
data want; do _N_=1 by 1 until(last.x); set dat1; by x; if _N_<=n2 then set dat2 point=_N_ nobs=n2; else y=.; output; end; run;
This solution puts Y values on as many X-s as there are, and if it runs out of Y values, it sets Y missing.