BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lydiawawa
Lapis Lazuli | Level 10

I have a very simple merge, but no matter what I do I cannot get to the desired outcome.

 

Dataset A:

 

session grp device 
1     2  mac
2 4 desk

Dataset B

 

session grp
1     2  
1     2 
2 4
2 4
2 4

 

 

Try to get:

 

session grp device 
1     2  mac
1     2  mac
2 4 desk
2 4 desk
2 4 desk

 

I have tried left merge on B with A by proc sql and datastep, and the normal merge.

 proc sql;
  create table want as
  select L.*, R.*
  from  datasetB as L
  left join  datasetA as R
   on L.session = R.session
   and L.grp = R.grp;
quit;


data want;
 length device_type $10.;
 merge datasetB (in=a) datasetA (in=b);
 by session grp;
run;


data want;
 length device_type $10.;
 merge datasetB (in=a) datasetA (in=b);
 by session grp;
 if a;
run;

I always get the following result (device does not repeat):

 

session grp device 
1     2  mac
1     2  
2 4 desk
2 4
2 4

 

There are other variables in the dataset B, I'm just using the 3 major ones used in merge as an example. Eventually the final dataset will have more than 3 variables.

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

I always get the following result (device does not repeat):

There are two ways for that to happen.

Most likely the dataset that is not supposed have the variable DEVICE already does have the variable DEVICE. So when the second record for the group is read it overwrites the value of DEVICE pulled from the lookup dataset.  To fix that drop the variable. Either in an earlier step or using DROP= dataset option in the merge step.

data want;
   merge datasetB (in=a drop=device) datasetA (in=b);
   by session grp;
run;

The second possible cause is that your "one" dataset has multiple observations per group.  When you do a many to many merge the variables contributed by the last observation in the dataset with fewer observations for the group keep the same values onto the rest of result set for that group.  So if for some really strange reason you had a second observation per SESSION*GRP with missing value of DEVICE then that missing value would be carried forward to all of the rest of the output observations for that group.

 

A third possibility is that your data step is not like what you showed and that the variable you are saying has missing values is NOT the one pulled from the "one" dataset but a variable that you are creating in the merge step.  Variables that are pulled from datasets are retained (actually not reset to missing at the start of each iteration of the data step).  But variables that are created in the current step are reset unless you use a RETAIN statement.  So you might just need to add a RETAIN statement.

data want;
   merge datasetB (in=a ) datasetA (in=b);
   by session grp;
   length new_device $10;
   if first.grp then new_device=' ';
   if B then new_device=device;
   retain new_device;
run;

 

View solution in original post

4 REPLIES 4
Patrick
Opal | Level 21

Below should return what you're after.

data A;
  input session grp device $;
  datalines;
1 2 mac
2 4 desk
;

data b;
  input session grp;
  datalines;
1 2 
1 2 
2 4 
2 4 
2 4
;

proc sql;
  select a.*
    from a left join b
      on a.session=b.session and a.grp=b.grp
  ;
quit;

 

You probably need to investigate how you treat same named variables from both sources. A SAS SQL will throw a warning in such a situation if you just use * in the select clause.

A data step merge statement will define the variable in the PDV using the source definition from the first table in the merge statement BUT it will use the value from the last table in the merge statement (the last value just overwrites whatever else came from earlier tables).

 

lydiawawa
Lapis Lazuli | Level 10
for some reason this worked but it also removed all rows with missing session and grp values.
Reeza
Super User

Merge works fine for me. So either your data are characters that don't merge for some reason, or if you check your log you get a partial data set because you didn't sort or some other error in the log. As shown, the code is correct. I double checked and IN doesn't change the results. 

 

delete_sas_merge.PNG

 


@lydiawawa wrote:

I have a very simple merge, but no matter what I do I cannot get to the desired outcome.

 

Dataset A:

 

session grp device 
1     2  mac
2 4 desk

Dataset B

 

session grp
1     2  
1     2 
2 4
2 4
2 4

 

 

Try to get:

 

session grp device 
1     2  mac
1     2  mac
2 4 desk
2 4 desk
2 4 desk

 

I have tried left merge on B with A by proc sql and datastep, and the normal merge.

 proc sql;
  create table want as
  select L.*, R.*
  from  datasetB as L
  left join  datasetA as R
   on L.session = R.session
   and L.grp = R.grp;
quit;


data want;
 length device_type $10.;
 merge datasetB (in=a) datasetA (in=b);
 by session grp;
run;


data want;
 length device_type $10.;
 merge datasetB (in=a) datasetA (in=b);
 by session grp;
 if a;
run;

I always get the following result (device does not repeat):

 

session grp device 
1     2  mac
1     2  
2 4 desk
2 4
2 4

 

There are other variables in the dataset B, I'm just using the 3 major ones used in merge as an example. Eventually the final dataset will have more than 3 variables.

 

Thank you!


 

Tom
Super User Tom
Super User

I always get the following result (device does not repeat):

There are two ways for that to happen.

Most likely the dataset that is not supposed have the variable DEVICE already does have the variable DEVICE. So when the second record for the group is read it overwrites the value of DEVICE pulled from the lookup dataset.  To fix that drop the variable. Either in an earlier step or using DROP= dataset option in the merge step.

data want;
   merge datasetB (in=a drop=device) datasetA (in=b);
   by session grp;
run;

The second possible cause is that your "one" dataset has multiple observations per group.  When you do a many to many merge the variables contributed by the last observation in the dataset with fewer observations for the group keep the same values onto the rest of result set for that group.  So if for some really strange reason you had a second observation per SESSION*GRP with missing value of DEVICE then that missing value would be carried forward to all of the rest of the output observations for that group.

 

A third possibility is that your data step is not like what you showed and that the variable you are saying has missing values is NOT the one pulled from the "one" dataset but a variable that you are creating in the merge step.  Variables that are pulled from datasets are retained (actually not reset to missing at the start of each iteration of the data step).  But variables that are created in the current step are reset unless you use a RETAIN statement.  So you might just need to add a RETAIN statement.

data want;
   merge datasetB (in=a ) datasetA (in=b);
   by session grp;
   length new_device $10;
   if first.grp then new_device=' ';
   if B then new_device=device;
   retain new_device;
run;

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 3801 views
  • 4 likes
  • 4 in conversation