Merge with by

David_Billa · Posted 05-13-2022 02:29 PM

We're in process of migrating SAS code to R. I'd like to know how to explain the following step to non SAS users. I believe it's similar to full join. But when R coders tried to convert this step in R language I see that R results are not matching with SAS.

DATA WORK.HAVE;	
	MERGE WORK.HAVE1  WORK.HAVE2;
	BY ID;
RUN;

Rick_SAS · Posted 05-13-2022 02:46 PM

This is a match merge. It combines observations from two data sets based on the common values of the ID variable (the variable on the BY statement).

The SAS documentation contains a section that describes the details of the match-merge operation,
as well as an example with each step explained.

Tom · Posted 05-13-2022 02:50 PM

It is very similar to a full join, but only when there are no by groups that have multiple observations in both input datasets. So it will match if you are doing 1-1 or 1-many matching. But not when doing many-many matching.

mkeintz · Posted 05-13-2022 03:01 PM

I believe this is called a "match merge".

This is not the equivalent to a "full join" as I understand the term. In particular, when there is a many-to-many match. In such a case, the MERGE matches the first record for a given ID in HAVE1 with the first matching ID in HAVE2, then the second record in HAVE1 with the 2nd in HAVE2.

If one of the datasets (say HAVE1) has more records for the given ID, then the "excess" HAVE1 records are matched with the last record with the same ID in HAVE2. This implies what happens in a one-to-many (or zero-to-many) match would replicate a full join.

I don't know how you would replicate this behavior in SQL, especially since, unlike MERGE, no SQL (known to me) explicitly relies on the physical order of matching records.

So you need a way for R to honor the physical order of records in the two datasets, to do a one-record-to-one-record match until the less-populated dataset is exhausted and then propagate that last record of the less-populous into all additionally required links.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

David_Billa · Posted 05-13-2022 03:47 PM

@mkeintz @Tom In R language it is not required to sort the by variables before we do merge. Are you saying that we still order the input tables with BY variables before we do match merge in R?

Tom · Posted 05-13-2022 03:53 PM

@David_Billa wrote:

@mkeintz @Tom In R language it is not required to sort the by variables before we do merge. Are you saying that we still order the input tables with BY variables before we do match merge in R?

No. That is not at all the take away you should get from this discussion.

Use whatever method in R works best for your overall goal.

Do not get too caught up in trying to translate a SAS program step by step into an R program.

Consider the entire task and code to perform that task in R.

Understanding how the SAS MERGE statement works will help you get a better understanding of how the SAS code is working. But the real goal is to understand the larger problem and then coding a R program that can achieve the same objective.

Kurt_Bremser · Posted 05-13-2022 03:31 PM

For ETL and data preparation, R is not nearly as useful and powerful as SAS, especially the data step language.

Anway, you are much better off re-implementing the design instead of migrating the code.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

David_Billa · Posted 05-13-2022 03:43 PM

@Kurt_Bremser I know it's not the right forum to ask questions about R programming. But I'd like to seek your help to understand how will you redesign this step in migration to other language like R.

Kurt_Bremser · Posted 05-13-2022 04:00 PM

@David_Billa wrote:

@Kurt_Bremser I know it's not the right forum to ask questions about R programming. But I'd like to seek your help to understand how will you redesign this step in migration to other language like R.

That is exactly what you should not do. You need to take the "big picture" and re-implement the process in the new language. Meddling with single SAS steps in R will not get you far. The only thing in SAS which can be translated to R is IML code, IMO.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Sajid01 · Posted 05-14-2022 08:16 AM

A simple explanation is that in this code two datasets have1 and have2 are joined to produce the third dataset have based on the common variable ID. A close equivalent is SQL join.

Reeza · Posted 05-14-2022 12:15 PM

Isolate cases where it doesn't match and determine why and then we can provide alternative approaches.

I typically use tidyverse joins, which are more similar to SQL unless you have a many to many merge in which case you'd have a difficult time replicating the merge. In fact, I'd question if the original code wasn't incorrect if it was a many to many merge within a data step by merge.

As others have indicated, code conversion line by line/proc to step isn't efficient in the long run. You aren't programming optimally then in R and your gains from going to R are solely costs, no efficiency from using RAM or anything else. It also makes the code more difficult to maintain and update. Instead, you take a SAS program, convert it to a set of requirements in terms of business needs (ie not find last record in data set, but identify last occurrence of event for recent report) and allow your R programmers to efficiently code in R.

library(tidyverse)

ID = c(10, 20, 30)
A = c(1, 3, 5)
B = c(2, 4, 6)

dset1 = as.tibble(cbind(ID, A, B))

ID=c(10, 30, 40)
C = c(0, 1, 1)

dset2 = as.tibble(cbind(ID, C))

want <- dset1 %>%
  full_join(dset2, by="ID")

view(want)

Merge with by

Re: Merge with by

Re: Merge with by

Re: Merge with by

Re: Merge with by

Re: Merge with by

Re: Merge with by

Re: Merge with by

Re: Merge with by

Re: Merge with by

Re: Merge with by

Registration is open

Call for Content EXTENDED

Registration is open

Call for Content EXTENDED

SAS Training: Just a Click Away