BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Obsidian | Level 7

## Merge statement giving wrong result

I have two tables-

I want to remove all the year+enrolids combinations which are common in both tables. Table1 has more observations than table2.

code-

``````data merge;
merge table1( in=a) table2(in=b);
by year enrolid;
if a and not b then output;
run;

``````

Now the merge dataset should have observation count equal to count of table1 - count of table2. but that isnt the case.

Is there anything wrong with the code?

1 ACCEPTED SOLUTION

Accepted Solutions
Jade | Level 19

## Re: Merge statement giving wrong result

Hi @riyaaora275,

Let's look at one BY group, i.e., one YEAR-ENROLID combination which may occur n times in TABLE1 and k times in TABLE2 (n, k >= 0). Using your selection criterion (which looks reasonable), the number of observations with this particular YEAR-ENROLID combination in the output dataset will be either

• n (if k = 0) or
• 0 (if k > 0).

Now you see that for this generic BY group alone your expected number of observations, n − k, will equal the actual number only if k = 0 or n = k. Given that the total number of observations in the output dataset is simply the sum of the numbers in all BY groups, it's clear that this observation count can differ from "count of table1 count of table2" in both directions. Kurt Bremser has shown an example of the case n = 0, k = 1 and Astounding has mentioned the cases 1 = n < k and n > k = 1 ("many-to-one match"). Cynthia's examples cover all possible cases with n, k <= 1.

13 REPLIES 13
Super User

## Re: Merge statement giving wrong result

You most probably have a combination in table2 that is not present in table1, so that will up-end your calculation:

``````data table1;
input year enrolid;
cards;
2017 1
2018 1
2019 1
2017 2
2017 3
2018 3
;
run;

data table2;
input year enrolid;
cards;
2019 1
2018 4
;
run;

proc sort data=table1;
by year enrolid;
run;

proc sort data=table2;
by year enrolid;
run;

data merged;
merge table1( in=a) table2(in=b);
by year enrolid;
if a and not b then output;
run;
``````

since "2018 4" is not present in table one, only one of the observations in table2 results in a delete.

PROC Star

## Re: Merge statement giving wrong result

This statement is incorrect if you want to remove the matches:

``if a and not b then output;``

Instead, use:

``if a and b then delete;``

Super User

## Re: Merge statement giving wrong result

@Astounding wrote:

This statement is incorrect if you want to remove the matches:

``if a and not b then output;``

Instead, use:

``if a and b then delete;``

But that would also keep all observations where only b is true.

If you want to use the delete statement, you'd need

``````if a;
if b then delete;``````
PROC Star

## Re: Merge statement giving wrong result

I stand by my original post.  I know you know this, so perhaps it's time to take a coffee break?

Super User

## Re: Merge statement giving wrong result

I see. Your suggestion satisfies

"remove all the year+enrolids combinations which are common in both tables"

But it also does not satisfy

"observation count equal to count of table1 - count of table2"

when you apply it to my example data.

(both quotes from the original post)

PROC Star

## Re: Merge statement giving wrong result

If only I could write a program that would revise someone's expectations!  Alas, I'm limited to writing programs that get the correct result, and then educating to influence the expectations.  So far, the issue of mismatches has come up, but there is also the possibility of a many-to-one match.

Super User

## Re: Merge statement giving wrong result

As always, it boils down to the issue of no available example data against which code can be tested to get the expected results.

SAS Super FREQ

## Re: Merge statement giving wrong result

Hi:

For the benefit of the original poster. With 2 datasets going into the MERGE and understanding that SAS does not really do "MANY-to-MANY merges, you have 5 possible output datasets, as shown below:

Hope this helps put the MERGE in perspective and gives you some ideas about how to fix your MERGE. Without data from you, and a clear explanation of the rules for what you want, it is very hard for others to help you.

Cynthia

Obsidian | Level 7

## Re: Merge statement giving wrong result

You mean I should give some sample data ? Because I cannot give the datafiles.
Super User

## Re: Merge statement giving wrong result

Creating example data for testing with a data step is an essential SAS skill.

PROC Star

## Re: Merge statement giving wrong result

@Astounding wrote:

This statement is incorrect if you want to remove the matches:

``if a and not b then output;``

Instead, use:

``if a and b then delete;``

Or just use a subsetting IF

``  if a^=b;``

which can only be true when the observation originates from a single data set.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Obsidian | Level 7

## Re: Merge statement giving wrong result

Both of these give the same number of observations.

Jade | Level 19

## Re: Merge statement giving wrong result

Hi @riyaaora275,

Let's look at one BY group, i.e., one YEAR-ENROLID combination which may occur n times in TABLE1 and k times in TABLE2 (n, k >= 0). Using your selection criterion (which looks reasonable), the number of observations with this particular YEAR-ENROLID combination in the output dataset will be either

• n (if k = 0) or
• 0 (if k > 0).

Now you see that for this generic BY group alone your expected number of observations, n − k, will equal the actual number only if k = 0 or n = k. Given that the total number of observations in the output dataset is simply the sum of the numbers in all BY groups, it's clear that this observation count can differ from "count of table1 count of table2" in both directions. Kurt Bremser has shown an example of the case n = 0, k = 1 and Astounding has mentioned the cases 1 = n < k and n > k = 1 ("many-to-one match"). Cynthia's examples cover all possible cases with n, k <= 1.

Discussion stats
• 13 replies
• 2020 views
• 0 likes
• 6 in conversation