data GroupA; input ID$ Product$; datalines; A Apple B Grapes B Orange C Banana E Apple run; data GroupB; input ID$ Product$ Size$; datalines; B Shirt S C Pants M D Jacket L E Tie M E Wallet L run; can someone please explain elaborately why does b grapes c banana e apple doesnt come in the output table data merged; merge GroupA GroupB; by ID; run; @ballardw @Ksharp
Following is how a data step merge works.
1. Build a pdv of all common variables (imagine an array with one slot per each unique variable across the datasets)
2. SAS looks at the set or merge statement . If set statement and there are multiple datasets it will process all rows from each of the datasets in teh order in which it is specified .
3. if there is a merge statement with a by and the datasets are sorted Then it will read a row from first data set and read a row from second dataset checks if order wise the by variables align. If by variables align then it will read all variables from first dataset and populates the pdv. Then it will read all the values from the second dataset. If the overlapping values are non missing then it will overwrite the values in the pdv. Then the pointer is moved to the next record on the first dataset and the same happens on the second dataset . The pdv is not flushed yet. Once it is determined that the by variables are different in both the datasets then pdv is flushed and the process repeats.
In your case or atleast in your example you do not have many to many joins. Your case is a simple overwrite as explained above.
Eg of 1 -1 join: there is only unique entry for each variable combination on the by statement in two datasets
1 - many join there is only unique entry for each variable combination on the by statement in one of the two datasets and there is a duplicate entry in the other.
many-many join: there are duplicate entries for each variable combination on the by statement in two datasets
@veda8 wrote:
data GroupA; input ID$ Product$; datalines; A Apple B Grapes B Orange C Banana E Apple run; data GroupB; input ID$ Product$ Size$; datalines; B Shirt S C Pants M D Jacket L E Tie M E Wallet L run; can someone please explain elaborately why does b grapes c banana e apple doesnt come in the output table data merged; merge GroupA GroupB; by ID; run; @ballardw @Ksharp
@Tom @Paige @Kurt_Bremser @AnnaBrown
Please post the code in a code box opened with either the </> or "running man" icon. Looking at code like that makes my head hurt.
Better, post the LOG of what you get when you run the code.
You can only have one variable of a given name in a data set. When you have the same variable other than the the BY variable(s) in both sets, only one of the values ends up in the output, typically the one from the right most data set.
Change the name of the Product variable in one of the sets and you can see what goes on a bit better.
Use the proper subwindows for posting code and logs.
Following is how a data step merge works.
1. Build a pdv of all common variables (imagine an array with one slot per each unique variable across the datasets)
2. SAS looks at the set or merge statement . If set statement and there are multiple datasets it will process all rows from each of the datasets in teh order in which it is specified .
3. if there is a merge statement with a by and the datasets are sorted Then it will read a row from first data set and read a row from second dataset checks if order wise the by variables align. If by variables align then it will read all variables from first dataset and populates the pdv. Then it will read all the values from the second dataset. If the overlapping values are non missing then it will overwrite the values in the pdv. Then the pointer is moved to the next record on the first dataset and the same happens on the second dataset . The pdv is not flushed yet. Once it is determined that the by variables are different in both the datasets then pdv is flushed and the process repeats.
In your case or atleast in your example you do not have many to many joins. Your case is a simple overwrite as explained above.
Eg of 1 -1 join: there is only unique entry for each variable combination on the by statement in two datasets
1 - many join there is only unique entry for each variable combination on the by statement in one of the two datasets and there is a duplicate entry in the other.
many-many join: there are duplicate entries for each variable combination on the by statement in two datasets
The attached deck shows how merge works
This video tutorial on merging data sets in SAS using SQL may be something to consider as well:
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.