Solved: Match merge understanding

David_Billa · Posted 06-16-2022 06:59 AM

Appreciate if someone of you help me understand this match merge. I want to know in the final result MFGCOMBINE3, why the variables MATRL_NBR and PLANNED_ZNL values are missing in second record and why the value of REASDES got updated in first record.

data MFGCOMBINE1;
	MATRL_NBR=.;
	FISC_YR=2019;
	FISC_PD='012/2019';
	FISC_WK='049/2019';
	PLANT=337;
	MATERIAL2=3010030074;
	PLANNED_ZNL=.;
	REASDES='DO NOT USE Line Performance(LP)';
run;

data MFGCOMBINE2;
	MATRL_NBR=.;
	FISC_YR=2019;
	FISC_PD='012/2019';
	FISC_WK='049/2019';
	PLANT=337;
	MATERIAL2=3010030074;
	PLANNED_ZNL=.;
	REASDES='DO NOT USE Line Performance(LP)';
run;

data MFGCOMBINE;
	set MFGCOMBINE1 MFGCOMBINE2;
run;

data PRIORACTUALS2019A;
	MATRL_NBR=3010030074;
	FISC_YR=2019;
	FISC_PD='012/2019';
	FISC_WK='049/2019';
	PLANT=337;
	MATERIAL2=3010030074;
	PLANNED_ZNL=68478;
	REASDES='Line Performance';
run;

DATA MFGCOMBINE3;
	MERGE WORK.MFGCOMBINE
		WORK.PRIORACTUALS2019A;
	BY FISC_YR FISC_PD FISC_WK PLANT MATERIAL2;
RUN;

mkeintz · Posted 06-16-2022 09:27 AM

Like the SET statement, the MERGE statement is, at its core, based on sequential access of the named datasets, where the sequential processing occurs within matching BY groups of the merged datasets.

So, here's what happens from MERGE mfgcombine prioractuals2019a; BY ....; :

SAS reads the first matching record from mfgcombine.
SAS reads the first matching record (i.e. same BY values) from prioractuals2019a. Any other common variables (MATRL_NBR, PLANNED_ZNL, and REASDES in your case) will have the values read from mfgcombine overwritten by values from prioractuals2019a.
The first record is output. This produced the results you were presumably expecting. All the variables read in from mfgcombine and prioractuals2019a are not reset to missing after output. They are retained until and unless they are replaced in step 4 or 5 below.
SAS reads the next matching record (if any) from mfgcombine. If there isn't such a record, the variables named in mfgcombine are not (yet) overwritten, nor set to missing. But in your case there is a second mfgcombine record, which replaces the retained values for MATRL_NBR, PLANNED_ZNL, and REASDES. That is, it replaces the values that were retrieved in step1 and then step 2 (from prioractuals2019a) above.
SAS finds no second record in prioractuals2019a, so no values were modified.
SAS outputs the second record.

A couple of notes: Your example contains no variables that belong to just one of the data sets. If it did, then values of those variables would be retained, like all others, from step 3 to step 4. But in your case there is no second record from prioractuals2019a. So any variable present only in the single prioractuals2019a record would be unchanged in all subsequent output records.

Also the retain behavior is overwritten when there is a change in any BY variable.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

David_Billa · Posted 06-16-2022 07:18 AM

So SAS match merge is like update statement on by fields and it's neither a full join nor inner/left join?

Tom · Posted 06-16-2022 08:16 AM

@David_Billa wrote:

So SAS match merge is like update statement on by fields and it's neither a full join nor inner/left join?

Ignore the SQL concepts of joins. Just process the observations one by one following the rules.

Consider each dataset like a face up deck of cards so you can see the value of the next card before you pick it up.

The MERGE is going to combine the decks so that the ones with the same BY variable values match.

The general process is you look at the top cards in each input and find the "smallest" one. If that starts a new group then erase everything. Take the first card for this new group (scanning the decks from left to right) and read the values from the card into the corresponding variables. If there are top cards from the other decks for this group then read them and update the variables. So if the same variables are coming from two decks (dataset) then the value of the last one read "wins". Once you have scanned across all decks copy the current values to a card and stick into the output deck. Now you are ready to start the next card. Are there any more cards visible on the input decks for this by group? If so again read one card for this group from left to write as before and then write out the result. When there are no more visible cards for this group then clear the variables and find the next lowest group and repeat.

So if there is a variable that is only in the second deck and that deck only had one card for this BY group then its value is never changed when the second card for this by group is read from the first deck. That is how a MERGE statement can work add lookup values into a dataset in a many to 1 situation.

tarheel13 · Posted 06-16-2022 07:47 AM

reading the documentation may help

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lepg/p1phdzzlrc1wi8n1bpigstwt851o.htm

David_Billa · Posted 06-16-2022 07:53 AM

There are multiple examples in the provided link. May I know which example
should I refer for the program which I shown in my post?

Kurt_Bremser · Posted 06-16-2022 07:52 AM

You have a two-to-one merge, with two variables in common that are not part of the BY.

In the first match, the values from the first dataset are read, then overwritten with the values from the second; for the second match, the values from the second dataset are retained (as no further matching observations are found in the second dataset), and then overwritten with the values from the first dataset.

You will do best by not thinking of a data step MERGE as some kind of join, as it isn't (although it can be used for such an operation, with proper diligence). It's also not an update, the data step provides the UPDATE statement for this.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 06-16-2022 07:56 AM

I specifically point you to this example in @tarheel13 's link:

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lepg/p1phdzzlrc1wi8n1bpigstwt851o.htm#p1gxq53...

which explicitly covers the mechanics of a one-to-many merge with common variables.

Edit: fixed typo.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

tarheel13 · Posted 06-16-2022 07:57 AM

Thank you @Kurt_Bremser . I referred to the documentation because the illustrated example explains what happens better than I can 🙂

Tom · Posted 06-16-2022 08:22 AM

What are you trying to do?

When you MERGE two datasets it is normally to add additional VARIABLES from the second dataset to the first dataset.

In your example the two dataset have the exact same variables.

So what is it that you are trying to accomplish by merging them?

One situation where it makes sense to have both datasets have the same set of variables (or the second dataset to have a subset of the variables in the first) is when the intent is to apply transactions to the first dataset. If that is what you are trying to do then look at the UPDATE statement. That will ignore the missing values in the transaction dataset, leaving the values in the original dataset for that by group unchanged.

mkeintz · Posted 06-16-2022 09:27 AM

Like the SET statement, the MERGE statement is, at its core, based on sequential access of the named datasets, where the sequential processing occurs within matching BY groups of the merged datasets.

So, here's what happens from MERGE mfgcombine prioractuals2019a; BY ....; :

SAS reads the first matching record from mfgcombine.
SAS reads the first matching record (i.e. same BY values) from prioractuals2019a. Any other common variables (MATRL_NBR, PLANNED_ZNL, and REASDES in your case) will have the values read from mfgcombine overwritten by values from prioractuals2019a.
The first record is output. This produced the results you were presumably expecting. All the variables read in from mfgcombine and prioractuals2019a are not reset to missing after output. They are retained until and unless they are replaced in step 4 or 5 below.
SAS reads the next matching record (if any) from mfgcombine. If there isn't such a record, the variables named in mfgcombine are not (yet) overwritten, nor set to missing. But in your case there is a second mfgcombine record, which replaces the retained values for MATRL_NBR, PLANNED_ZNL, and REASDES. That is, it replaces the values that were retrieved in step1 and then step 2 (from prioractuals2019a) above.
SAS finds no second record in prioractuals2019a, so no values were modified.
SAS outputs the second record.

A couple of notes: Your example contains no variables that belong to just one of the data sets. If it did, then values of those variables would be retained, like all others, from step 3 to step 4. But in your case there is no second record from prioractuals2019a. So any variable present only in the single prioractuals2019a record would be unchanged in all subsequent output records.

Also the retain behavior is overwritten when there is a change in any BY variable.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

yabwon · Posted 06-16-2022 11:21 AM

Hi,

I would go with this paper: "How MERGE Really Works" by Bob Virgile

https://stats.oarc.ucla.edu/wp-content/uploads/2016/02/ad155.pdf

It's old but not obsolete, and does what it suppose to do, i.e. explains how MERGE really woks 🙂

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Match merge understanding

Re: Match merge understanding

Re: Match merge understanding

Re: Match merge understanding

Re: Match merge understanding

Re: Match merge understanding

Re: Match merge understanding

Re: Match merge understanding

Re: Match merge understanding

Re: Match merge understanding

Re: Match merge understanding

Re: Match merge understanding

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away