BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
veda8
Fluorite | Level 6

data GroupA; input ID$ Product$; datalines; A Apple B Grapes B Orange C Banana E Apple run; data GroupB; input ID$ Product$ Size$; datalines; B Shirt S C Pants M D Jacket L E Tie M E Wallet L run; can someone please explain elaborately why does b grapes c banana e apple doesnt come in the output table data merged; merge GroupA GroupB; by ID; run; @ballardw @Ksharp

@Tom @Paige @Kurt_Bremser @AnnaBrown 

1 ACCEPTED SOLUTION

Accepted Solutions
smantha
Lapis Lazuli | Level 10

Following is how a data step merge works.

 

1. Build a pdv of all common variables (imagine an array with one slot per each unique variable across the datasets)

2. SAS looks at the set or merge statement . If set statement and there are multiple datasets it will process all rows from each of the datasets in teh order in which it is specified .

3. if there is a merge statement with a by and the datasets are sorted Then it will   read a row from first data set and read a row from second dataset checks if order wise the by variables align. If by variables align then it will read all variables from first dataset and populates the pdv. Then it will read all the values from the second dataset. If the overlapping values are non missing then it will overwrite the values in the pdv. Then the pointer is moved to the next record on the first dataset and the same happens on the second dataset . The pdv is not flushed yet. Once it is determined that the by variables are different in both the datasets then pdv is flushed and the process repeats.

 

In your case or atleast in your example you do not have many to many joins. Your case is a simple overwrite as explained above. 

Eg of 1 -1 join: there is only unique entry for each variable combination on the by statement in two datasets

          1 - many join  there is only unique entry for each variable combination on the by statement in one of the two datasets and there is a duplicate entry in the other.

many-many join:  there are duplicate entries for each variable combination on the by statement in two datasets

 

View solution in original post

6 REPLIES 6
ballardw
Super User

@veda8 wrote:

data GroupA; input ID$ Product$; datalines; A Apple B Grapes B Orange C Banana E Apple run; data GroupB; input ID$ Product$ Size$; datalines; B Shirt S C Pants M D Jacket L E Tie M E Wallet L run; can someone please explain elaborately why does b grapes c banana e apple doesnt come in the output table data merged; merge GroupA GroupB; by ID; run; @ballardw @Ksharp

@Tom @Paige @Kurt_Bremser @AnnaBrown 


Please post the code in a code box opened with either the </> or "running man" icon. Looking at code like that makes my head hurt.

Better, post the LOG of what you get when you run the code.

 

You can only have one variable of a given name in a data set. When you have  the same variable other than the the BY variable(s) in both sets, only one of the values ends up in the output, typically the one from the right most data set. 

Change the name of the Product variable in one of the sets and you can see what goes on a bit better.

smantha
Lapis Lazuli | Level 10

Following is how a data step merge works.

 

1. Build a pdv of all common variables (imagine an array with one slot per each unique variable across the datasets)

2. SAS looks at the set or merge statement . If set statement and there are multiple datasets it will process all rows from each of the datasets in teh order in which it is specified .

3. if there is a merge statement with a by and the datasets are sorted Then it will   read a row from first data set and read a row from second dataset checks if order wise the by variables align. If by variables align then it will read all variables from first dataset and populates the pdv. Then it will read all the values from the second dataset. If the overlapping values are non missing then it will overwrite the values in the pdv. Then the pointer is moved to the next record on the first dataset and the same happens on the second dataset . The pdv is not flushed yet. Once it is determined that the by variables are different in both the datasets then pdv is flushed and the process repeats.

 

In your case or atleast in your example you do not have many to many joins. Your case is a simple overwrite as explained above. 

Eg of 1 -1 join: there is only unique entry for each variable combination on the by statement in two datasets

          1 - many join  there is only unique entry for each variable combination on the by statement in one of the two datasets and there is a duplicate entry in the other.

many-many join:  there are duplicate entries for each variable combination on the by statement in two datasets

 

smantha
Lapis Lazuli | Level 10

The attached deck shows how merge works

AnnaBrown
Community Manager

This video tutorial on merging data sets in SAS using SQL may be something to consider as well:

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 2787 views
  • 6 likes
  • 5 in conversation