BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
dmosack
Fluorite | Level 6

Hello!

 

I am working on someone else's code where two datasets are merged using an ID variable. They are a more advanced SAS programmer than I, and I'm trying to learn their methods, intent, etc. At a certain point they merge two datasets, and use a "where" option that I would expect to eliminate observations where certain values are missing, but I was very surprised to see that it does not.

 

The example below demonstrates this surprising behavior. What does this (where= ) option do, if anything? Why does it not eliminate observations with missing y values? If these questions are addressed in the documentation, apologies! I was unable to find them. 

 

Thank you for your help!

 

data a;
input x z;
datalines;
101 23
102 21
103 10
;
run;
 
data b;
input x y zz;
datalines;
101 2 2
102 1 1
103 3 1
104 1 2
;
run;
 
data c;
merge a (in=ina)  b (keep=x y  where=(y^=.));
  by x;
  if ina;
run;
 
dmosack_0-1700071735029.png

 

 

edit 1: I kept calling it a (keep= ) option by mistake.

edit 2: I had to fix a copy/paste error in my b dataset. Sorry, this is my first post. 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

It mainly has no impact because your data step for B is wrong:

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

If we fix it so there actually  is an observation in B with a missing value of Y for the WHERE= options to exclude.

data b;
  input x y zz;
datalines;
101 . 2
102 1 1
103 3 1
104 1 2
;

Then the WHERE= will skip that first observation  so the MERGE will use only 3 observations from B instead of all 4.  But the X value for that observations existed in A so the result is the same as if the missing value of Y was read from B.

 

To see the difference add an IN= dataset option to B also and check the values of that variable with and without the WHERE= option.

 

It has not impact on the result because that value of the BY variable is being included in the output because it exists in A.  The only impact it could have (given that A does not contain the extra variable Y that B is contributing) would be if there was a value of the BY variable that appeared in A and also appeared in B with MORE observations in A then in B.  Then if some of those observations had missing values of Y eliminating them would reduce the number of observations written to C. 

 

Run it with and without the WHERE= option using this version of B

data b;
  input x y zz;
datalines;
101 . 5
101 . 6
102 1 1
103 3 1
104 1 2
;

 

View solution in original post

7 REPLIES 7
PaigeMiller
Diamond | Level 26

The missing value of Y in the output is because when you do the merge with the WHERE option, x=101 from data set A has no record in data set B to merge with, so it will have no Y value.

 

The KEEP= option tells SAS only to use the variables shown. Thus one variable in data set B is not included in the merge or in the output data set.

--
Paige Miller
Tom
Super User Tom
Super User

The WHERE= dataset option in the INPUT dataset B will have no impact on whether or not something is written to the OUTPUT dataset C.

dmosack
Fluorite | Level 6

Hi Tom, thanks for your reply! If it doesn't impact the output dataset, what does it do here? It seems to me like it could change nothing

PaigeMiller
Diamond | Level 26

@dmosack wrote:

Hi Tom, thanks for your reply! If it doesn't impact the output dataset, what does it do here? It seems to me like it could change nothing


In many data sets, the WHERE= makes a difference to the merge. You simply happen to have a data set where it doesn't make a difference to the merge.

 

What does it do? It eliminates from the merge any observation in data set B that doesn't meet the WHERE criterion.

--
Paige Miller
Tom
Super User Tom
Super User

It mainly has no impact because your data step for B is wrong:

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

If we fix it so there actually  is an observation in B with a missing value of Y for the WHERE= options to exclude.

data b;
  input x y zz;
datalines;
101 . 2
102 1 1
103 3 1
104 1 2
;

Then the WHERE= will skip that first observation  so the MERGE will use only 3 observations from B instead of all 4.  But the X value for that observations existed in A so the result is the same as if the missing value of Y was read from B.

 

To see the difference add an IN= dataset option to B also and check the values of that variable with and without the WHERE= option.

 

It has not impact on the result because that value of the BY variable is being included in the output because it exists in A.  The only impact it could have (given that A does not contain the extra variable Y that B is contributing) would be if there was a value of the BY variable that appeared in A and also appeared in B with MORE observations in A then in B.  Then if some of those observations had missing values of Y eliminating them would reduce the number of observations written to C. 

 

Run it with and without the WHERE= option using this version of B

data b;
  input x y zz;
datalines;
101 . 5
101 . 6
102 1 1
103 3 1
104 1 2
;

 

dmosack
Fluorite | Level 6

Ah, my apologies, I made a typo when I copied the data step for b. Nonetheless, this is very informative. Based on your response, it appears then that the original intent is likely to prevent duplicates being written to the output dataset. The real data represented by dataset b has several million records, and duplicates are likely, so this would make sense. 

 

I definitely did not know this about merging -- thank you!

Quentin
Super User

Your example data is a bit confusing, in part because the step that makes WORK.B has missing data that is likely causing for the data to be read in ways that you're not expecting.  You should see this NOTE in the log, which should be considered an ERROR:

 

NOTE: SAS went to a new line when INPUT statement reached past the end of a line.

Here is an example similar to yours, which shows a case where you get different results with the use of a WHERE statement to subset the data being read in. I avoid the above problem by using a . to represent the missing value being read in:

 

data a;
input x z;
datalines;
101 23
102 21
103 10
;
run;
 
data b;
input x y zz;
datalines;
101 2 .
102 1 1
103 3 1
104 1 2
;
run;


data c1;
  merge a (in=ina)  b (keep=x y zz where=(zz^=.));
  by x;
  if ina;
run;

data c2;
  merge a (in=ina)  b (keep=x y zz);
  by x;
  if ina;
run;

proc print data=c1 ;
run ;
proc print data=c2 ;
run ;
BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1020 views
  • 5 likes
  • 4 in conversation