Re: Merge Data

pamplemouse22 · Posted 09-27-2017 10:54 AM

Hello all,

I have a question about merging two datasets of different lengths. Datasets look like this:

data dat1; 
input x; 
datalines; 
1
1
1
1
2
2
2
2
3
3
3
3
;
run; 

data dat2;
input y; 
datalines; 
a
b
c
d
;
run;

And I want this:

data want; 
input x y; 
datalines; 
1 a
1 b
1 c
1 d
2 a
2 b
2 c
2 d
3 a
3 b
3 c
3 d
;
run;

Since I don't have a unique identifier, I'm not sure how to merge this and allow the variable y to repeat down.

I have tried:

data attempt; 
merge dat1 dat2; 
run;

but, this only repeats dat2 once, throuhg the "1"s of dat1. How do I allow dat2 to repeat all the way down the values of dat1?

Thank you!

LinusH · Posted 09-27-2017 11:18 AM

I'm puzzled. What is the real life scenario here?
Merging without common key is usually hazardous.

Data never sleeps

PBsas · Posted 09-27-2017 11:21 AM

One SQL way. this does cartesian join and duplicates are removed.

proc sql;
create table attempt as
select distinct x,y from dat1,dat2;
quit;

mkeintz · Posted 09-27-2017 12:12 PM

The proc sql is much more compact programming, but might take longer the a data step for large files. After all, it first does a cartesian crossing only to delete the majority of cross-records (3/4 in your case). Of course, that could be solved by having only unique X values in DAT1.

But here, as requested is a data step solution, assuming that DAT1 is sorted by X

data want (drop=_:);
  set dat1 ;
  by x;

  if _n_=1 then do;
    if 0 then set dat2 (obs=0) nobs=ndat2;
    declare hash d2 (dataset:'need2', ordered:'Y');
      d2.definekey('_key');
      d2.definedata(all:'Y');
      d2.definedone();
  end;

  _key+1;
  if first.x then _key=1;
  _rc=d2.find();
run;

This will work as expected if DAT1 has exactly NDAT2 observation per X value. (NDAT2 is numeber of obs in DAT2). It won't cause any "erroneous" values if a given X value has fewer than NDAT2 records. But if an X group has more than NDAT2, then note that the last value of Y in NDAT2 will prevail in all the "extra" dat1 records - i.e. you'll get a duplicate X/Y combination. If you want to protect against that, then precede the "_rc=d2.find()" statement with a "call missing(y)", protecting against duplicate non-missing Y's.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 09-27-2017 12:24 PM

So what is your key for making the merge? Why not add one?

data dat1_fixed;
  set dat1 ;
  by x ;
  rep+1;
  if first.x then rep=1;
run;
data dat2_fixed;
  set dat2 ;
  rep+1;
run;
proc sql ;
  create table want as
    select a.*,b.y
    from dat1_fixed a
    inner join dat2_fixed b
   on a.rep = b.rep
   order by 1,2
 ;
quit;

Obs    x    rep    y

  1    1     1     a
  2    1     2     b
  3    1     3     c
  4    1     4     d
  5    2     1     a
  6    2     2     b
  7    2     3     c
  8    2     4     d
  9    3     1     a
 10    3     2     b
 11    3     3     c
 12    3     4     d

novinosrin · Posted 09-27-2017 03:07 PM

@pamplemouse22 Hiter the hash can help here I think. Assuming your dat1 is sorted as shown in your example-

data dat1;

input x;

datalines;

1

2

3

;

run;

data dat2;

input y $;

datalines;

a

b

c

d

;

run;

data want;

if _N_ = 1 then do;

if 0 then set dat2;

declare hash h(dataset:"dat2", ordered: 'yes');

declare hiter iter('h');

h.defineKey('y');

h.defineData('y');

h.defineDone();

end;

set dat1;

by x;

if first.x then iter.first();

else iter.next();

run;

Regards,

Naveen Srinivasan

s_lassen · Posted 09-28-2017 06:05 AM

data want;
  do _N_=1 by 1 until(last.x);
    set dat1;
    by x;
    if _N_<=n2 then
      set dat2 point=_N_ nobs=n2;
    else y=.;
    output;
    end;
run;

This solution puts Y values on as many X-s as there are, and if it runs out of Y values, it sets Y missing.

Classroom Training Available!