Solved: Re: Question on how SAS Merge works with Duplicates/spaces

prasanna_sk · Posted 07-23-2013 11:02 AM

Hello

I have a basic question on how SAS Merge works when there are duplicates and Spaces and common fields between 2 files that I am merging. Here is an e.g.

Please note the occurrence of STATE=PA is different in INPUT0 & INPUT1.

The Output of MERGE1 hsd only STATE=PA in one of the obs but output on MERGE2 has STATE=PA in both Observation. Why? Is the occurrence of PA on OBS=1 - a deciding factor on how STATE is populated in the output?

Input0/1 - sorted on ID & NUM

INPUT2 - NUM

INPUT0

Obs ID NUM STATE

1 4A 12345678 PA

2 83 12345678

INPUT1

Obs ID NUM STATE

1 00083 12345678

2 00401 12345678 PA

INPUT2

Obs NUM STATE

1 12345678 PA

MERGE1 OF INPUT0 & INPUT2 (BY NUM)

Obs ID NUM STATE

1 4A 12345678 PA

2 83 12345678

MERGE2 OF INPUT1 & INPUT2 (BY NUM)

Obs ID NUM STATE

1 00083 12345678 PA

2 00401 12345678 PA

Thanks in Advance for your help.

Haikuo · Posted 07-23-2013 11:36 AM

OK, "missing" is a form of value. In your data context, the first row will ALWAYS be override by the value from table2, because table2 can provide the corresponding value for the specific variable. And the second row will NEVER be override,because there is only one row in table2.

HTH,

Haikuo

View solution in original post

Haikuo · Posted 07-23-2013 11:08 AM

Show your code, please.

prasanna_sk · Posted 07-23-2013 11:13 AM

Here you go.

DATA INPUT0;

INPUT ID $1-2 NUM $3-11 STATE $13-14;

CARDS;

83 12345678

4A 12345678 PA

;

RUN;

PROC SORT DATA=INPUT0; BY ID NUM;

PROC PRINT DATA=INPUT0; TITLE 'INPUT0';

DATA INPUT1;

INPUT ID $1-5 NUM $7-14 STATE $16-17;

CARDS;

00083 12345678

00401 12345678 PA

;

RUN;

PROC SORT DATA=INPUT1; BY ID NUM;

PROC PRINT DATA=INPUT1; TITLE 'INPUT1';

DATA INPUT2;

INPUT NUM $1-8 STATE $10-11;

CARDS;

12345678 PA

;

PROC SORT DATA=INPUT1; BY NUM;

PROC PRINT DATA = INPUT2; TITLE 'INPUT2';

DATA ALL (KEEP=ID NUM STATE);

MERGE INPUT0 (IN=A) INPUT2 (IN=B); BY NUM;

PROC PRINT DATA=ALL; TITLE 'MERGE OF INPUT0 & INPUT2';

DATA ALL1 (KEEP=ID NUM STATE);

MERGE INPUT1 (IN=A) INPUT2 (IN=B); BY NUM;

PROC PRINT DATA=ALL1; TITLE 'MERGE OF INPUT1 & INPUT2';

Haikuo · Posted 07-23-2013 11:25 AM

OK, if you rename STATE variable in INPUT2, then it would be a lot easier for you to observer the whole process.

DATA INPUT0;

INPUT ID $1-2 NUM $3-11 STATE $13-14;

CARDS;

83 12345678

4A 12345678 PA

;

RUN;

PROC SORT DATA=INPUT0; BY ID NUM;

PROC PRINT DATA=INPUT0; TITLE 'INPUT0';

DATA INPUT1;

INPUT ID $1-5 NUM $7-14 STATE $16-17;

CARDS;

00083 12345678

00401 12345678 PA

;

RUN;

PROC SORT DATA=INPUT1; BY ID NUM;

PROC PRINT DATA=INPUT1; TITLE 'INPUT1';

DATA INPUT2;

INPUT NUM $1-8 STATE_1 $10-11;

CARDS;

12345678 PA

;

PROC SORT DATA=INPUT1; BY NUM;

PROC PRINT DATA = INPUT2; TITLE 'INPUT2'; run;

DATA ALL;

MERGE INPUT0 (IN=A) INPUT2 (IN=B); BY NUM;

PROC PRINT DATA=ALL; TITLE 'MERGE OF INPUT0 & INPUT2'; run;

DATA ALL1 ;

MERGE INPUT1 (IN=A) INPUT2 (IN=B); BY NUM;

PROC PRINT DATA=ALL1; TITLE 'MERGE OF INPUT1 & INPUT2'; run;

So you can see, in your original code, the first row of merge, the STATE will take the value from the first table initially, then override by the value in the second table; the second row of merge, STATE will do the same thing, except there is no value from second table to override, the value of STATE from the first table retained.

Haikuo

prasanna_sk · Posted 07-23-2013 11:30 AM

Thanks for the reply.

If that is the case, then in Merge2 - the first row does not have a value for STATE - yet it is still overridden by the STATE from second table. Why is that?

Haikuo · Posted 07-23-2013 11:36 AM

OK, "missing" is a form of value. In your data context, the first row will ALWAYS be override by the value from table2, because table2 can provide the corresponding value for the specific variable. And the second row will NEVER be override,because there is only one row in table2.

HTH,

Haikuo

prasanna_sk · Posted 07-23-2013 11:44 AM

Thanks.

So - the first row will always be overridden with the value from Table2 for the common fields. The rest of the rows will not be overridden (if there are more than 1 or 2 rows). Is there a way to stop the first row from always getting overridden?

If not - its not logical for a code to override just the first row's missing value and ignore the rest - right?

Haikuo · Posted 07-23-2013 11:49 AM

Say the first table has N rows within an ID, and the second table has M rows within the same ID, the number of rows being override is M.

If you don't want any overriding at all, rename your common variables; if you just don't want override table1, and you don't care too much of table2, then you can choose to reverse the merge order.

Haikuo

prasanna_sk · Posted 07-23-2013 11:51 AM

Thanks a million.

prasanna_sk · Posted 07-23-2013 11:48 AM

Here is an example for the same code.

INPUT1

Obs ID NUM STATE

1 00043 12345678

2 00083 12345678

3 00401 12345678 PA

INPUT2

Obs NUM STATE

1 12345678 PA

MERGE OF INPUT1 & INPUT2

Obs ID NUM STATE

1 00043 12345678 PA (override from INPUT2's STATE)

2 00083 12345678 (Spaces - no override from INPUT2's STATE)

3 00401 12345678 PA

For a code to be consistent - it should either override the common field's missing values in all the rows - or should not override at all.

I am surprised how SAS just overrides the first row alone (irrespective of whether there is a missing value or not).

Thanks for your time and help - it helped me to understand how Merge works in this bizarre situations.

Astounding · Posted 07-23-2013 11:51 AM

A good resource for this and similar questions:

http://www.ats.ucla.edu/stat/sas/library/nesug99/ad155.pdf

It's old but still relevant.

Good luck.

Haikuo · Posted 07-23-2013 12:10 PM

"Consistency" is in the eyes of the beholder. Depending on your purpose, Merge is very flexible to do what you want. In the case of many to many merge, you many want to consider Proc SQL or Hash table. In this case, it could be tricky to maintain your "consistency", say if non-override is an less ambiguous option, all-override could be disastrous, which row you want to choose to Override from in the second table? the first, the last, or somewhere in the middle?

However, if just doing many to one, Merge is capable to provide both options:

For non-override, reversing the merge order will do; for all-override, then you need rename the STATE in the second table, and add a simple assign statement:

STATE=STATE_1;

of course you can choose dropping STATE_1 in the end.

Haikuo

Update: if you only want to Override those missing values, check UPDATE statement.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away