Solved: One to many merge error

lydiawawa · Posted 05-21-2019 06:39 PM

I have a very simple merge, but no matter what I do I cannot get to the desired outcome.

Dataset A:

session grp device 
1     2  mac
2     4  desk

Dataset B

session grp
1     2  
1     2 
2     4  
2     4  
2     4

Try to get:

session grp device 
1     2  mac
1     2  mac
2     4  desk
2     4  desk
2     4  desk

I have tried left merge on B with A by proc sql and datastep, and the normal merge.

 proc sql;
  create table want as
  select L.*, R.*
  from  datasetB as L
  left join  datasetA as R
   on L.session = R.session
   and L.grp = R.grp;
quit;


data want;
 length device_type $10.;
 merge datasetB (in=a) datasetA (in=b);
 by session grp;
run;


data want;
 length device_type $10.;
 merge datasetB (in=a) datasetA (in=b);
 by session grp;
 if a;
run;

I always get the following result (device does not repeat):

session grp device 
1     2  mac
1     2  
2     4  desk
2     4  
2     4

There are other variables in the dataset B, I'm just using the 3 major ones used in merge as an example. Eventually the final dataset will have more than 3 variables.

Thank you!

Tom · Posted 05-21-2019 07:04 PM

I always get the following result (device does not repeat):

There are two ways for that to happen.

Most likely the dataset that is not supposed have the variable DEVICE already does have the variable DEVICE. So when the second record for the group is read it overwrites the value of DEVICE pulled from the lookup dataset. To fix that drop the variable. Either in an earlier step or using DROP= dataset option in the merge step.

data want;
   merge datasetB (in=a drop=device) datasetA (in=b);
   by session grp;
run;

The second possible cause is that your "one" dataset has multiple observations per group. When you do a many to many merge the variables contributed by the last observation in the dataset with fewer observations for the group keep the same values onto the rest of result set for that group. So if for some really strange reason you had a second observation per SESSION*GRP with missing value of DEVICE then that missing value would be carried forward to all of the rest of the output observations for that group.

A third possibility is that your data step is not like what you showed and that the variable you are saying has missing values is NOT the one pulled from the "one" dataset but a variable that you are creating in the merge step. Variables that are pulled from datasets are retained (actually not reset to missing at the start of each iteration of the data step). But variables that are created in the current step are reset unless you use a RETAIN statement. So you might just need to add a RETAIN statement.

data want;
   merge datasetB (in=a ) datasetA (in=b);
   by session grp;
   length new_device $10;
   if first.grp then new_device=' ';
   if B then new_device=device;
   retain new_device;
run;

View solution in original post

Patrick · Posted 05-21-2019 06:48 PM

Below should return what you're after.

data A;
  input session grp device $;
  datalines;
1 2 mac
2 4 desk
;

data b;
  input session grp;
  datalines;
1 2 
1 2 
2 4 
2 4 
2 4
;

proc sql;
  select a.*
    from a left join b
      on a.session=b.session and a.grp=b.grp
  ;
quit;

You probably need to investigate how you treat same named variables from both sources. A SAS SQL will throw a warning in such a situation if you just use * in the select clause.

A data step merge statement will define the variable in the PDV using the source definition from the first table in the merge statement BUT it will use the value from the last table in the merge statement (the last value just overwrites whatever else came from earlier tables).

lydiawawa · Posted 05-21-2019 07:10 PM

for some reason this worked but it also removed all rows with missing session and grp values.

Reeza · Posted 05-21-2019 06:55 PM

Merge works fine for me. So either your data are characters that don't merge for some reason, or if you check your log you get a partial data set because you didn't sort or some other error in the log. As shown, the code is correct. I double checked and IN doesn't change the results.

@lydiawawa wrote:

I have a very simple merge, but no matter what I do I cannot get to the desired outcome.

Dataset A:
session grp device 
1     2  mac
2     4  desk
Dataset B
session grp
1     2  
1     2 
2     4  
2     4  
2     4 
Try to get:
session grp device 
1     2  mac
1     2  mac
2     4  desk
2     4  desk
2     4  desk
I have tried left merge on B with A by proc sql and datastep, and the normal merge.
 proc sql;
  create table want as
  select L.*, R.*
  from  datasetB as L
  left join  datasetA as R
   on L.session = R.session
   and L.grp = R.grp;
quit;


data want;
 length device_type $10.;
 merge datasetB (in=a) datasetA (in=b);
 by session grp;
run;


data want;
 length device_type $10.;
 merge datasetB (in=a) datasetA (in=b);
 by session grp;
 if a;
run;
I always get the following result (device does not repeat):
session grp device 
1     2  mac
1     2  
2     4  desk
2     4  
2     4  
There are other variables in the dataset B, I'm just using the 3 major ones used in merge as an example. Eventually the final dataset will have more than 3 variables.

Thank you!

Tom · Posted 05-21-2019 07:04 PM

I always get the following result (device does not repeat):

There are two ways for that to happen.

Most likely the dataset that is not supposed have the variable DEVICE already does have the variable DEVICE. So when the second record for the group is read it overwrites the value of DEVICE pulled from the lookup dataset. To fix that drop the variable. Either in an earlier step or using DROP= dataset option in the merge step.

data want;
   merge datasetB (in=a drop=device) datasetA (in=b);
   by session grp;
run;

The second possible cause is that your "one" dataset has multiple observations per group. When you do a many to many merge the variables contributed by the last observation in the dataset with fewer observations for the group keep the same values onto the rest of result set for that group. So if for some really strange reason you had a second observation per SESSION*GRP with missing value of DEVICE then that missing value would be carried forward to all of the rest of the output observations for that group.

A third possibility is that your data step is not like what you showed and that the variable you are saying has missing values is NOT the one pulled from the "one" dataset but a variable that you are creating in the merge step. Variables that are pulled from datasets are retained (actually not reset to missing at the start of each iteration of the data step). But variables that are created in the current step are reset unless you use a RETAIN statement. So you might just need to add a RETAIN statement.

data want;
   merge datasetB (in=a ) datasetA (in=b);
   by session grp;
   length new_device $10;
   if first.grp then new_device=' ';
   if B then new_device=device;
   retain new_device;
run;

One to many merge error

Re: One to many merge error

Re: One to many merge error

Re: One to many merge error

Re: One to many merge error

Re: One to many merge error

Registration is open

Call for Content EXTENDED

One to many merge error

Re: One to many merge error

Re: One to many merge error

Re: One to many merge error

Re: One to many merge error

Re: One to many merge error

Registration is open

Call for Content EXTENDED

SAS Training: Just a Click Away