Solved: Re: Merge statement with a (where= ) option

dmosack · Posted 11-15-2023 01:16 PM

Hello!

I am working on someone else's code where two datasets are merged using an ID variable. They are a more advanced SAS programmer than I, and I'm trying to learn their methods, intent, etc. At a certain point they merge two datasets, and use a "where" option that I would expect to eliminate observations where certain values are missing, but I was very surprised to see that it does not.

The example below demonstrates this surprising behavior. What does this (where= ) option do, if anything? Why does it not eliminate observations with missing y values? If these questions are addressed in the documentation, apologies! I was unable to find them.

Thank you for your help!

data a;

input x z;

datalines;

101 23

102 21

103 10

;

run;

data b;

input x y zz;

datalines;

101 2 2

102 1 1

103 3 1

104 1 2

;

run;

data c;

merge a (in=ina) b (keep=x y where=(y^=.));

by x;

if ina;

run;

edit 1: I kept calling it a (keep= ) option by mistake.

edit 2: I had to fix a copy/paste error in my b dataset. Sorry, this is my first post.

Tom · Posted 11-15-2023 01:42 PM

It mainly has no impact because your data step for B is wrong:

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

If we fix it so there actually is an observation in B with a missing value of Y for the WHERE= options to exclude.

data b;
  input x y zz;
datalines;
101 . 2
102 1 1
103 3 1
104 1 2
;

Then the WHERE= will skip that first observation so the MERGE will use only 3 observations from B instead of all 4. But the X value for that observations existed in A so the result is the same as if the missing value of Y was read from B.

To see the difference add an IN= dataset option to B also and check the values of that variable with and without the WHERE= option.

It has not impact on the result because that value of the BY variable is being included in the output because it exists in A. The only impact it could have (given that A does not contain the extra variable Y that B is contributing) would be if there was a value of the BY variable that appeared in A and also appeared in B with MORE observations in A then in B. Then if some of those observations had missing values of Y eliminating them would reduce the number of observations written to C.

Run it with and without the WHERE= option using this version of B

data b;
  input x y zz;
datalines;
101 . 5
101 . 6
102 1 1
103 3 1
104 1 2
;

View solution in original post

PaigeMiller · Posted 11-15-2023 01:19 PM

The missing value of Y in the output is because when you do the merge with the WHERE option, x=101 from data set A has no record in data set B to merge with, so it will have no Y value.

The KEEP= option tells SAS only to use the variables shown. Thus one variable in data set B is not included in the merge or in the output data set.

--
Paige Miller

Tom · Posted 11-15-2023 01:27 PM

The WHERE= dataset option in the INPUT dataset B will have no impact on whether or not something is written to the OUTPUT dataset C.

dmosack · Posted 11-15-2023 01:28 PM

Hi Tom, thanks for your reply! If it doesn't impact the output dataset, what does it do here? It seems to me like it could change nothing

PaigeMiller · Posted 11-15-2023 01:35 PM

@dmosack wrote:

Hi Tom, thanks for your reply! If it doesn't impact the output dataset, what does it do here? It seems to me like it could change nothing

In many data sets, the WHERE= makes a difference to the merge. You simply happen to have a data set where it doesn't make a difference to the merge.

What does it do? It eliminates from the merge any observation in data set B that doesn't meet the WHERE criterion.

--
Paige Miller

Tom · Posted 11-15-2023 01:42 PM

It mainly has no impact because your data step for B is wrong:

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

If we fix it so there actually is an observation in B with a missing value of Y for the WHERE= options to exclude.

data b;
  input x y zz;
datalines;
101 . 2
102 1 1
103 3 1
104 1 2
;

Then the WHERE= will skip that first observation so the MERGE will use only 3 observations from B instead of all 4. But the X value for that observations existed in A so the result is the same as if the missing value of Y was read from B.

To see the difference add an IN= dataset option to B also and check the values of that variable with and without the WHERE= option.

It has not impact on the result because that value of the BY variable is being included in the output because it exists in A. The only impact it could have (given that A does not contain the extra variable Y that B is contributing) would be if there was a value of the BY variable that appeared in A and also appeared in B with MORE observations in A then in B. Then if some of those observations had missing values of Y eliminating them would reduce the number of observations written to C.

Run it with and without the WHERE= option using this version of B

data b;
  input x y zz;
datalines;
101 . 5
101 . 6
102 1 1
103 3 1
104 1 2
;

dmosack · Posted 11-15-2023 01:56 PM

Ah, my apologies, I made a typo when I copied the data step for b. Nonetheless, this is very informative. Based on your response, it appears then that the original intent is likely to prevent duplicates being written to the output dataset. The real data represented by dataset b has several million records, and duplicates are likely, so this would make sense.

I definitely did not know this about merging -- thank you!

Quentin · Posted 11-15-2023 01:55 PM

Your example data is a bit confusing, in part because the step that makes WORK.B has missing data that is likely causing for the data to be read in ways that you're not expecting. You should see this NOTE in the log, which should be considered an ERROR:

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

Here is an example similar to yours, which shows a case where you get different results with the use of a WHERE statement to subset the data being read in. I avoid the above problem by using a . to represent the missing value being read in:

data a;
input x z;
datalines;
101 23
102 21
103 10
;
run;
 
data b;
input x y zz;
datalines;
101 2 .
102 1 1
103 3 1
104 1 2
;
run;


data c1;
  merge a (in=ina)  b (keep=x y zz where=(zz^=.));
  by x;
  if ina;
run;

data c2;
  merge a (in=ina)  b (keep=x y zz);
  by x;
  if ina;
run;

proc print data=c1 ;
run ;
proc print data=c2 ;
run ;

The Boston Area SAS Users Group is hosting free webinars!

Register now at https://www.basug.org/events.

Merge statement with a (where= ) option

Re: Merge statement with a (where= ) option

Re: Merge statement with a (keep= ) option

Re: Merge statement with a (where= ) option

Re: Merge statement with a (where= ) option

Re: Merge statement with a (where= ) option

Re: Merge statement with a (where= ) option

Re: Merge statement with a (where= ) option

Re: Merge statement with a (where= ) option

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away