Solved: Re: is this merge one to many,many to one or many to many? and how doe...

veda8 · Posted 07-21-2020 12:41 PM

data GroupA; input ID$ Product$; datalines; A Apple B Grapes B Orange C Banana E Apple run; data GroupB; input ID$ Product$ Size$; datalines; B Shirt S C Pants M D Jacket L E Tie M E Wallet L run; can someone please explain elaborately why does b grapes c banana e apple doesnt come in the output table data merged; merge GroupA GroupB; by ID; run; @ballardw @Ksharp

@Tom @Paige @Kurt_Bremser @AnnaBrown

smantha · Posted 07-21-2020 03:41 PM

Following is how a data step merge works.

1. Build a pdv of all common variables (imagine an array with one slot per each unique variable across the datasets)

2. SAS looks at the set or merge statement . If set statement and there are multiple datasets it will process all rows from each of the datasets in teh order in which it is specified .

3. if there is a merge statement with a by and the datasets are sorted Then it will read a row from first data set and read a row from second dataset checks if order wise the by variables align. If by variables align then it will read all variables from first dataset and populates the pdv. Then it will read all the values from the second dataset. If the overlapping values are non missing then it will overwrite the values in the pdv. Then the pointer is moved to the next record on the first dataset and the same happens on the second dataset . The pdv is not flushed yet. Once it is determined that the by variables are different in both the datasets then pdv is flushed and the process repeats.

In your case or atleast in your example you do not have many to many joins. Your case is a simple overwrite as explained above.

Eg of 1 -1 join: there is only unique entry for each variable combination on the by statement in two datasets

1 - many join there is only unique entry for each variable combination on the by statement in one of the two datasets and there is a duplicate entry in the other.

many-many join: there are duplicate entries for each variable combination on the by statement in two datasets

View solution in original post

veda8 · Posted 07-21-2020 01:18 PM

@Jagadishkatam

ballardw · Posted 07-21-2020 01:53 PM

@veda8 wrote:

data GroupA; input ID$ Product$; datalines; A Apple B Grapes B Orange C Banana E Apple run; data GroupB; input ID$ Product$ Size$; datalines; B Shirt S C Pants M D Jacket L E Tie M E Wallet L run; can someone please explain elaborately why does b grapes c banana e apple doesnt come in the output table data merged; merge GroupA GroupB; by ID; run; @ballardw @Ksharp

@Tom @Paige @Kurt_Bremser @AnnaBrown

Please post the code in a code box opened with either the </> or "running man" icon. Looking at code like that makes my head hurt.

Better, post the LOG of what you get when you run the code.

You can only have one variable of a given name in a data set. When you have the same variable other than the the BY variable(s) in both sets, only one of the values ends up in the output, typically the one from the right most data set.

Change the name of the Product variable in one of the sets and you can see what goes on a bit better.

Kurt_Bremser · Posted 07-21-2020 02:10 PM

Use the proper subwindows for posting code and logs.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

smantha · Posted 07-21-2020 03:41 PM

Following is how a data step merge works.

1. Build a pdv of all common variables (imagine an array with one slot per each unique variable across the datasets)

2. SAS looks at the set or merge statement . If set statement and there are multiple datasets it will process all rows from each of the datasets in teh order in which it is specified .

3. if there is a merge statement with a by and the datasets are sorted Then it will read a row from first data set and read a row from second dataset checks if order wise the by variables align. If by variables align then it will read all variables from first dataset and populates the pdv. Then it will read all the values from the second dataset. If the overlapping values are non missing then it will overwrite the values in the pdv. Then the pointer is moved to the next record on the first dataset and the same happens on the second dataset . The pdv is not flushed yet. Once it is determined that the by variables are different in both the datasets then pdv is flushed and the process repeats.

In your case or atleast in your example you do not have many to many joins. Your case is a simple overwrite as explained above.

Eg of 1 -1 join: there is only unique entry for each variable combination on the by statement in two datasets

1 - many join there is only unique entry for each variable combination on the by statement in one of the two datasets and there is a duplicate entry in the other.

many-many join: there are duplicate entries for each variable combination on the by statement in two datasets