BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
prasanna_sk
Calcite | Level 5

Hello

I have a basic question on how SAS Merge works when there are duplicates and Spaces and common fields between 2 files that I am merging. Here is an e.g.

Please note the occurrence of STATE=PA is different in INPUT0 & INPUT1.

The Output of MERGE1 hsd only STATE=PA in one of the obs but output on MERGE2 has STATE=PA in both Observation. Why? Is the occurrence of PA on OBS=1 - a deciding factor on how STATE is populated in the output?

Input0/1 - sorted on ID & NUM

INPUT2 - NUM

                INPUT0                    

    Obs    ID      NUM       STATE        

     1     4A    12345678     PA          

     2     83    12345678  

              

                INPUT1                    

  Obs     ID        NUM       STATE       

   1     00083    12345678              

   2     00401    12345678     PA

            

                INPUT2                    

       Obs      NUM       STATE           

        1     12345678     PA      


       MERGE1 OF INPUT0 & INPUT2 (BY NUM)          

    Obs    ID      NUM       STATE        

     1     4A    12345678     PA          

     2     83    12345678           

    

       MERGE2 OF INPUT1 & INPUT2 (BY NUM)          

  Obs     ID        NUM       STATE       

   1     00083    12345678     PA         

   2     00401    12345678     PA           

Thanks in Advance for your help.

1 ACCEPTED SOLUTION

Accepted Solutions
Haikuo
Onyx | Level 15

OK, "missing" is a form of value.  In your data context, the first row will ALWAYS be override by the value from table2, because table2 can provide the corresponding value for the specific variable. And the second row will NEVER be override,because there is only one row in table2.

HTH,

Haikuo

View solution in original post

11 REPLIES 11
Haikuo
Onyx | Level 15

Show your code, please.

prasanna_sk
Calcite | Level 5

Here you go.

  DATA INPUT0;                                                      

   INPUT ID $1-2 NUM $3-11 STATE $13-14;                            

  CARDS;                                                            

83 12345678                                                         

4A 12345678 PA                                                      

  ;                                                                 

  RUN;                                                              

  PROC SORT DATA=INPUT0; BY ID NUM;                                 

  PROC PRINT DATA=INPUT0; TITLE 'INPUT0';                           

                                                                    

  DATA INPUT1;                                                      

   INPUT ID $1-5 NUM $7-14 STATE $16-17;                            

  CARDS;                                                            

00083 12345678                                                      

00401 12345678 PA                                                   

  ;                                                                 

  RUN;                                                              

  PROC SORT DATA=INPUT1; BY ID NUM;                                 

  PROC PRINT DATA=INPUT1; TITLE 'INPUT1';                           

                                                                    

  DATA INPUT2;                                                      

   INPUT NUM $1-8 STATE $10-11;                                     

  CARDS;                                                            

12345678 PA                                                         

  ;                                                                 

  PROC SORT DATA=INPUT1; BY NUM;                                    

  PROC PRINT DATA = INPUT2; TITLE 'INPUT2';                         

                                                                    

  DATA ALL (KEEP=ID NUM STATE);                                     

   MERGE INPUT0 (IN=A) INPUT2 (IN=B); BY NUM;                       

    PROC PRINT DATA=ALL; TITLE 'MERGE OF INPUT0 & INPUT2';            

                                                                    

  DATA ALL1 (KEEP=ID NUM STATE);

   MERGE INPUT1 (IN=A) INPUT2 (IN=B); BY NUM;                 

                                                              

  PROC PRINT DATA=ALL1; TITLE 'MERGE OF INPUT1 & INPUT2';     

Haikuo
Onyx | Level 15

OK, if you rename STATE variable in INPUT2, then it would be a lot easier for you to observer the whole process.

DATA INPUT0;                                                     

   INPUT ID $1-2 NUM $3-11 STATE $13-14;                           

  CARDS;                                                           

83 12345678                                                        

4A 12345678 PA                                                     

  ;                                                                

  RUN;                                                             

  PROC SORT DATA=INPUT0; BY ID NUM;                                

  PROC PRINT DATA=INPUT0; TITLE 'INPUT0';                          

                                                                   

  DATA INPUT1;                                                     

   INPUT ID $1-5 NUM $7-14 STATE $16-17;                           

  CARDS;                                                           

00083 12345678                                                     

00401 12345678 PA                                                  

  ;                                                                

  RUN;                                                             

  PROC SORT DATA=INPUT1; BY ID NUM;                                

  PROC PRINT DATA=INPUT1; TITLE 'INPUT1';                          

                                                                   

  DATA INPUT2;                                                     

   INPUT NUM $1-8 STATE_1 $10-11;                                    

  CARDS;                                                           

12345678 PA                                                        

  ;                                                                

  PROC SORT DATA=INPUT1; BY NUM;                                   

  PROC PRINT DATA = INPUT2; TITLE 'INPUT2';  run;                      

                                                                   

  DATA ALL;                                    

   MERGE INPUT0 (IN=A) INPUT2 (IN=B); BY NUM;                      

    PROC PRINT DATA=ALL; TITLE 'MERGE OF INPUT0 & INPUT2'; run;          

                                                                   

  DATA ALL1 ;

   MERGE INPUT1 (IN=A) INPUT2 (IN=B); BY NUM;                

                                                             

  PROC PRINT DATA=ALL1; TITLE 'MERGE OF INPUT1 & INPUT2';  run;  

So you can see, in your original code, the first row of merge, the STATE will take the value from the first table initially, then override by the value in the second table; the second row of merge, STATE will do the same thing, except there is no value from second table to override, the value of STATE from the first table retained.

Haikuo

prasanna_sk
Calcite | Level 5

Thanks for the reply.

If that is the case, then in Merge2 - the first row does not have a value for STATE - yet it is still overridden by the STATE from second table. Why is that?

Haikuo
Onyx | Level 15

OK, "missing" is a form of value.  In your data context, the first row will ALWAYS be override by the value from table2, because table2 can provide the corresponding value for the specific variable. And the second row will NEVER be override,because there is only one row in table2.

HTH,

Haikuo

prasanna_sk
Calcite | Level 5

Thanks.

So - the first row will always be overridden with the value from Table2 for the common fields. The rest of the rows will not be overridden (if there are more than 1 or 2 rows). Is there a way to stop the first row from always getting overridden?

If not - its not logical for a code to override just the first row's missing value and ignore the rest - right?

Haikuo
Onyx | Level 15

Say the first table has N rows within an ID, and the second table has M rows within the same ID, the number of rows being override is M.

If you don't want any overriding at all, rename your common variables; if you just don't want override table1, and you don't care too much of table2, then you can choose to reverse the merge order.

Haikuo

prasanna_sk
Calcite | Level 5

Thanks a million. Smiley Happy

prasanna_sk
Calcite | Level 5

Here is an example for the same code.

   

                 INPUT1                      

   Obs     ID        NUM       STATE         

    1     00043    12345678                  

    2     00083    12345678                  

    3     00401    12345678     PA    

      

                 INPUT2                      

        Obs      NUM       STATE             

         1     12345678     PA               

      

        MERGE OF INPUT1 & INPUT2             

   Obs     ID        NUM       STATE         

    1     00043    12345678     PA   (override from INPUT2's STATE)        

    2     00083    12345678          (Spaces - no override from INPUT2's STATE)       

    3     00401    12345678     PA           


For a code to be consistent - it should either override the common field's missing values in all the rows - or should not override at all.


I am surprised how SAS just overrides the first row alone (irrespective of whether there is a missing value or not).


Thanks for your time and help - it helped me to understand how Merge works in this bizarre situations.

Astounding
PROC Star

A good resource for this and similar questions:

http://www.ats.ucla.edu/stat/sas/library/nesug99/ad155.pdf

It's old but still relevant.

Good luck.

Haikuo
Onyx | Level 15

"Consistency" is in the eyes of the beholder. Depending on your purpose, Merge is very flexible to do what you want. In the case of many to many merge, you many want to consider Proc SQL or Hash table. In this case, it could be tricky to maintain your "consistency", say if non-override is an less ambiguous option, all-override could be disastrous, which row you want to choose to Override from in the second table? the first, the last, or somewhere in the middle?

However, if just doing many to one, Merge is capable to provide both options:

For non-override,  reversing the merge order will do; for all-override, then you need rename the STATE in the second table, and add a simple assign statement:

STATE=STATE_1;

of course you can choose  dropping STATE_1 in the end.

Haikuo

Update: if you only want to Override those missing values, check UPDATE statement.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 2501 views
  • 8 likes
  • 3 in conversation