BookmarkSubscribeRSS Feed
David_Billa
Rhodochrosite | Level 12

We're in process of migrating SAS code to R. I'd like to know how to explain the following step to non SAS users. I believe it's similar to full join. But when R coders tried to convert this step in R language I see that R results are not matching with SAS. 

 

DATA WORK.HAVE;	
	MERGE WORK.HAVE1  WORK.HAVE2;
	BY ID;
RUN;	
10 REPLIES 10
Rick_SAS
SAS Super FREQ

This is a match merge. It combines observations from two data sets based on the common values of the ID variable (the variable on the BY statement).  

 

The SAS documentation contains a section that describes the details of the match-merge operation,
as well as an example with each step explained.

Tom
Super User Tom
Super User

It is very similar to a full join, but only when there are no by groups that have multiple observations in both input datasets.  So it will match if you are doing 1-1 or 1-many matching.  But not when doing many-many matching.

mkeintz
PROC Star

I believe this is called a "match merge".

 

This is not the equivalent to a "full join" as I understand the term.  In particular, when there is a many-to-many match.  In such a case, the MERGE matches the first record for a given ID in HAVE1 with the first matching ID in HAVE2, then the second record in HAVE1 with the 2nd in HAVE2.

 

If one of the datasets (say HAVE1) has more records for the given ID, then the "excess" HAVE1 records are matched with the last record with the same ID in HAVE2.  This implies what happens in a one-to-many (or zero-to-many) match would replicate a full join.

 

I don't know how you would replicate this behavior in SQL, especially since, unlike MERGE, no SQL (known to me) explicitly relies on the physical order of matching records.

 

So you need a way for R to honor the physical order of records in the two datasets, to do a one-record-to-one-record match until the less-populated dataset is exhausted and then propagate that last record of the less-populous into all additionally required links.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
David_Billa
Rhodochrosite | Level 12

@mkeintz @Tom In R language it is not required to sort the by variables before we do merge. Are you saying that we still order the input tables with BY variables before we do match merge in R?  

Tom
Super User Tom
Super User

@David_Billa wrote:

@mkeintz @Tom In R language it is not required to sort the by variables before we do merge. Are you saying that we still order the input tables with BY variables before we do match merge in R?  


No. That is not at all the take away you should get from this discussion.

Use whatever method in R works best for your overall goal.

Do not get too caught up in trying to translate a SAS program step by step into an R program.

Consider the entire task and code to perform that task in R.

Understanding how the SAS MERGE statement works will help you get a better understanding of how the SAS code is working.  But the real goal is to understand the larger problem and then coding a R program that can achieve the same objective.

Kurt_Bremser
Super User

For ETL and data preparation, R is not nearly as useful and powerful as SAS, especially the data step language.

 

Anway, you are much better off re-implementing the design instead of migrating the code.

David_Billa
Rhodochrosite | Level 12

@Kurt_Bremser I know it's not the right forum to ask questions about R programming. But I'd like to seek your help to understand how will you redesign this step in migration to other language like R.

Kurt_Bremser
Super User

@David_Billa wrote:

@Kurt_Bremser I know it's not the right forum to ask questions about R programming. But I'd like to seek your help to understand how will you redesign this step in migration to other language like R.


That is exactly what you should not do. You need to take the "big picture" and re-implement the process in the new language. Meddling with single SAS steps in R will not get you far. The only thing in SAS which can be translated to R is IML code, IMO.

Sajid01
Meteorite | Level 14

A simple explanation is that in this code two datasets have1 and have2 are joined to produce the third dataset have based on the common variable ID. A close equivalent is SQL join.

 

Reeza
Super User

Isolate cases where it doesn't match and determine why and then we can provide alternative approaches. 

 

I typically use tidyverse joins, which are more similar to SQL unless you have a many to many merge in which case you'd have a difficult time replicating the merge. In fact, I'd question if the original code wasn't incorrect if it was a many to many merge within a data step by merge. 

 

As others have indicated, code conversion line by line/proc to step isn't efficient in the long run. You aren't programming optimally then in R and your gains from going to R are solely costs, no efficiency from using RAM or anything else. It also makes the code more difficult to maintain and update. Instead, you take a SAS program, convert it to a set of requirements in terms of business needs (ie not find last record in data set, but identify last occurrence of event for recent report) and allow your R programmers to efficiently code in R. 

 

library(tidyverse)

ID = c(10, 20, 30)
A = c(1, 3, 5)
B = c(2, 4, 6)

dset1 = as.tibble(cbind(ID, A, B))

ID=c(10, 30, 40)
C = c(0, 1, 1)

dset2 = as.tibble(cbind(ID, C))

want <- dset1 %>%
  full_join(dset2, by="ID")

view(want)

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 1906 views
  • 10 likes
  • 7 in conversation