Solved: Re: Merge statement giving wrong result

riyaaora275 · Posted 01-16-2019 02:46 AM

I have two tables-

I want to remove all the year+enrolids combinations which are common in both tables. Table1 has more observations than table2.

code-

data merge;
merge table1( in=a) table2(in=b);
by year enrolid;
if a and not b then output;
run;

Now the merge dataset should have observation count equal to count of table1 - count of table2. but that isnt the case.

Is there anything wrong with the code?

FreelanceReinh · Posted 01-16-2019 10:20 AM

Hi @riyaaora275,

Let's look at one BY group, i.e., one YEAR-ENROLID combination which may occur n times in TABLE1 and k times in TABLE2 (n, k >= 0). Using your selection criterion (which looks reasonable), the number of observations with this particular YEAR-ENROLID combination in the output dataset will be either

n (if k = 0) or
0 (if k > 0).

Now you see that for this generic BY group alone your expected number of observations, n − k, will equal the actual number only if k = 0 or n = k. Given that the total number of observations in the output dataset is simply the sum of the numbers in all BY groups, it's clear that this observation count can differ from "count of table1 − count of table2" in both directions. Kurt Bremser has shown an example of the case n = 0, k = 1 and Astounding has mentioned the cases 1 = n < k and n > k = 1 ("many-to-one match"). Cynthia's examples cover all possible cases with n, k <= 1.

View solution in original post

Kurt_Bremser · Posted 01-16-2019 03:02 AM

You most probably have a combination in table2 that is not present in table1, so that will up-end your calculation:

data table1;
input year enrolid;
cards;
2017 1
2018 1
2019 1
2017 2
2017 3
2018 3
;
run;

data table2;
input year enrolid;
cards;
2019 1
2018 4
;
run;

proc sort data=table1;
by year enrolid;
run;

proc sort data=table2;
by year enrolid;
run;

data merged;
merge table1( in=a) table2(in=b);
by year enrolid;
if a and not b then output;
run;

since "2018 4" is not present in table one, only one of the observations in table2 results in a delete.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Astounding · Posted 01-16-2019 07:56 AM

This statement is incorrect if you want to remove the matches:

if a and not b then output;

Instead, use:

if a and b then delete;

Kurt_Bremser · Posted 01-16-2019 08:01 AM

@Astounding wrote:

This statement is incorrect if you want to remove the matches:
if a and not b then output;
Instead, use:
if a and b then delete;

But that would also keep all observations where only b is true.

If you want to use the delete statement, you'd need

if a;
if b then delete;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Astounding · Posted 01-16-2019 08:04 AM

@Kurt_Bremser,

I stand by my original post. I know you know this, so perhaps it's time to take a coffee break?

Kurt_Bremser · Posted 01-16-2019 08:36 AM

I see. Your suggestion satisfies

"remove all the year+enrolids combinations which are common in both tables"

But it also does not satisfy

"observation count equal to count of table1 - count of table2"

when you apply it to my example data.

(both quotes from the original post)

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Astounding · Posted 01-16-2019 08:46 AM

If only I could write a program that would revise someone's expectations! Alas, I'm limited to writing programs that get the correct result, and then educating to influence the expectations. So far, the issue of mismatches has come up, but there is also the possibility of a many-to-one match.

Kurt_Bremser · Posted 01-16-2019 08:50 AM

As always, it boils down to the issue of no available example data against which code can be tested to get the expected results.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Cynthia_sas · Posted 01-16-2019 09:07 AM

Hi:

For the benefit of the original poster. With 2 datasets going into the MERGE and understanding that SAS does not really do "MANY-to-MANY merges, you have 5 possible output datasets, as shown below:

Hope this helps put the MERGE in perspective and gives you some ideas about how to fix your MERGE. Without data from you, and a clear explanation of the rules for what you want, it is very hard for others to help you.

Cynthia

riyaaora275 · Posted 01-18-2019 12:21 AM

You mean I should give some sample data ? Because I cannot give the datafiles.

Kurt_Bremser · Posted 01-18-2019 02:11 AM

Creating example data for testing with a data step is an essential SAS skill.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

mkeintz · Posted 01-16-2019 09:42 AM

@Astounding wrote:

This statement is incorrect if you want to remove the matches:
if a and not b then output;
Instead, use:
if a and b then delete;

Or just use a subsetting IF

  if a^=b;

which can only be true when the observation originates from a single data set.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

riyaaora275 · Posted 01-18-2019 12:20 AM

Both of these give the same number of observations.

FreelanceReinh · Posted 01-16-2019 10:20 AM

Hi @riyaaora275,

Let's look at one BY group, i.e., one YEAR-ENROLID combination which may occur n times in TABLE1 and k times in TABLE2 (n, k >= 0). Using your selection criterion (which looks reasonable), the number of observations with this particular YEAR-ENROLID combination in the output dataset will be either

n (if k = 0) or
0 (if k > 0).

Now you see that for this generic BY group alone your expected number of observations, n − k, will equal the actual number only if k = 0 or n = k. Given that the total number of observations in the output dataset is simply the sum of the numbers in all BY groups, it's clear that this observation count can differ from "count of table1 − count of table2" in both directions. Kurt Bremser has shown an example of the case n = 0, k = 1 and Astounding has mentioned the cases 1 = n < k and n > k = 1 ("many-to-one match"). Cynthia's examples cover all possible cases with n, k <= 1.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away