Hello
I have a basic question on how SAS Merge works when there are duplicates and Spaces and common fields between 2 files that I am merging. Here is an e.g.
Please note the occurrence of STATE=PA is different in INPUT0 & INPUT1.
The Output of MERGE1 hsd only STATE=PA in one of the obs but output on MERGE2 has STATE=PA in both Observation. Why? Is the occurrence of PA on OBS=1 - a deciding factor on how STATE is populated in the output?
Input0/1 - sorted on ID & NUM
INPUT2 - NUM
INPUT0
Obs ID NUM STATE
1 4A 12345678 PA
2 83 12345678
INPUT1
Obs ID NUM STATE
1 00083 12345678
2 00401 12345678 PA
INPUT2
Obs NUM STATE
1 12345678 PA
MERGE1 OF INPUT0 & INPUT2 (BY NUM)
Obs ID NUM STATE
1 4A 12345678 PA
2 83 12345678
MERGE2 OF INPUT1 & INPUT2 (BY NUM)
Obs ID NUM STATE
1 00083 12345678 PA
2 00401 12345678 PA
Thanks in Advance for your help.
OK, "missing" is a form of value. In your data context, the first row will ALWAYS be override by the value from table2, because table2 can provide the corresponding value for the specific variable. And the second row will NEVER be override,because there is only one row in table2.
HTH,
Haikuo
Show your code, please.
Here you go.
DATA INPUT0;
INPUT ID $1-2 NUM $3-11 STATE $13-14;
CARDS;
83 12345678
4A 12345678 PA
;
RUN;
PROC SORT DATA=INPUT0; BY ID NUM;
PROC PRINT DATA=INPUT0; TITLE 'INPUT0';
DATA INPUT1;
INPUT ID $1-5 NUM $7-14 STATE $16-17;
CARDS;
00083 12345678
00401 12345678 PA
;
RUN;
PROC SORT DATA=INPUT1; BY ID NUM;
PROC PRINT DATA=INPUT1; TITLE 'INPUT1';
DATA INPUT2;
INPUT NUM $1-8 STATE $10-11;
CARDS;
12345678 PA
;
PROC SORT DATA=INPUT1; BY NUM;
PROC PRINT DATA = INPUT2; TITLE 'INPUT2';
DATA ALL (KEEP=ID NUM STATE);
MERGE INPUT0 (IN=A) INPUT2 (IN=B); BY NUM;
PROC PRINT DATA=ALL; TITLE 'MERGE OF INPUT0 & INPUT2';
DATA ALL1 (KEEP=ID NUM STATE);
MERGE INPUT1 (IN=A) INPUT2 (IN=B); BY NUM;
PROC PRINT DATA=ALL1; TITLE 'MERGE OF INPUT1 & INPUT2';
OK, if you rename STATE variable in INPUT2, then it would be a lot easier for you to observer the whole process.
DATA INPUT0;
INPUT ID $1-2 NUM $3-11 STATE $13-14;
CARDS;
83 12345678
4A 12345678 PA
;
RUN;
PROC SORT DATA=INPUT0; BY ID NUM;
PROC PRINT DATA=INPUT0; TITLE 'INPUT0';
DATA INPUT1;
INPUT ID $1-5 NUM $7-14 STATE $16-17;
CARDS;
00083 12345678
00401 12345678 PA
;
RUN;
PROC SORT DATA=INPUT1; BY ID NUM;
PROC PRINT DATA=INPUT1; TITLE 'INPUT1';
DATA INPUT2;
INPUT NUM $1-8 STATE_1 $10-11;
CARDS;
12345678 PA
;
PROC SORT DATA=INPUT1; BY NUM;
PROC PRINT DATA = INPUT2; TITLE 'INPUT2'; run;
DATA ALL;
MERGE INPUT0 (IN=A) INPUT2 (IN=B); BY NUM;
PROC PRINT DATA=ALL; TITLE 'MERGE OF INPUT0 & INPUT2'; run;
DATA ALL1 ;
MERGE INPUT1 (IN=A) INPUT2 (IN=B); BY NUM;
PROC PRINT DATA=ALL1; TITLE 'MERGE OF INPUT1 & INPUT2'; run;
So you can see, in your original code, the first row of merge, the STATE will take the value from the first table initially, then override by the value in the second table; the second row of merge, STATE will do the same thing, except there is no value from second table to override, the value of STATE from the first table retained.
Haikuo
Thanks for the reply.
If that is the case, then in Merge2 - the first row does not have a value for STATE - yet it is still overridden by the STATE from second table. Why is that?
OK, "missing" is a form of value. In your data context, the first row will ALWAYS be override by the value from table2, because table2 can provide the corresponding value for the specific variable. And the second row will NEVER be override,because there is only one row in table2.
HTH,
Haikuo
Thanks.
So - the first row will always be overridden with the value from Table2 for the common fields. The rest of the rows will not be overridden (if there are more than 1 or 2 rows). Is there a way to stop the first row from always getting overridden?
If not - its not logical for a code to override just the first row's missing value and ignore the rest - right?
Say the first table has N rows within an ID, and the second table has M rows within the same ID, the number of rows being override is M.
If you don't want any overriding at all, rename your common variables; if you just don't want override table1, and you don't care too much of table2, then you can choose to reverse the merge order.
Haikuo
Thanks a million.
Here is an example for the same code.
INPUT1
Obs ID NUM STATE
1 00043 12345678
2 00083 12345678
3 00401 12345678 PA
INPUT2
Obs NUM STATE
1 12345678 PA
MERGE OF INPUT1 & INPUT2
Obs ID NUM STATE
1 00043 12345678 PA (override from INPUT2's STATE)
2 00083 12345678 (Spaces - no override from INPUT2's STATE)
3 00401 12345678 PA
For a code to be consistent - it should either override the common field's missing values in all the rows - or should not override at all.
I am surprised how SAS just overrides the first row alone (irrespective of whether there is a missing value or not).
Thanks for your time and help - it helped me to understand how Merge works in this bizarre situations.
A good resource for this and similar questions:
http://www.ats.ucla.edu/stat/sas/library/nesug99/ad155.pdf
It's old but still relevant.
Good luck.
"Consistency" is in the eyes of the beholder. Depending on your purpose, Merge is very flexible to do what you want. In the case of many to many merge, you many want to consider Proc SQL or Hash table. In this case, it could be tricky to maintain your "consistency", say if non-override is an less ambiguous option, all-override could be disastrous, which row you want to choose to Override from in the second table? the first, the last, or somewhere in the middle?
However, if just doing many to one, Merge is capable to provide both options:
For non-override, reversing the merge order will do; for all-override, then you need rename the STATE in the second table, and add a simple assign statement:
STATE=STATE_1;
of course you can choose dropping STATE_1 in the end.
Haikuo
Update: if you only want to Override those missing values, check UPDATE statement.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.