strange merge issue

Occasional Contributor LHV
Posts: 15

Hi,

I have come across a strange SAS merge issue.

I have two SAS datasets : ds1 and ds2.

ds1 has a number of variables, including CUST_ID (numeric) and CATEGORY (character).

ds2 has only these two variables, with the same length and type. ds1 has more CUST_IDs than ds2.

Both data sets are sorted by CUST_ID.

In ds1, a CUST_ID may appear in more than one observation, with different values for CATEGORY.

e.g. CUST_ID CATEGORY

       2043       NEW

       2043       OLD

       2043       NEW

In ds2, CUST_ID is unique, i.e. a CUST_ID appears only once with a single value for CATEGORY.

e.g. CUST_ID  CATEGORY

          2043      OLD

When ds1 and ds2 are merged by CUST_ID using the DATA step below, I get a strange result.

data ds3;
     merge ds1(in=in1) ds2(in=in2);
     by CUST_ID;
     if in1;
run;

In the resulting data set (ds3), many of the matched CUST_IDs are populated with more than one CATEGORY value.

I only get the expected result if I modify the step as below.

data ds3;
     merge ds1(in=in1) ds2(in=in2 rename=(CATEGORY=CATEGORY_1));
     by CUST_ID;
     if in1;
     if CATEGORY_1 ne ' ' then CATEGORY = CATEGORY_1;
run;

Can anyone explain what prevents me from getting a unique CATEGORY value for a given CUST_ID without adding an extra statement like (if CATEGORY_1 ne ' ' then CATEGORY = CATEGORY_1;)?

Many thanks in advance for your help.



Accepted Solutions
Solution
‎08-14-2012 12:21 PM
Super User
Posts: 5,082

Re: strange merge issue

The process is a little more complex than that. Each data value gets read only once. For the first match on CUST_ID, the CATEGORY from ds2 overwrites the CATEGORY from ds1. For any additional records for the same CUST_ID, the value from ds1 is read in and replaces the previous value. Since there is no additional matching observation to read from ds2, you are left with the value from ds1.

All of the tools you have seen to program around this are valid tools.  You just have to pick the one that matches what you need.

Good luck.
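
The sequence described above can be watched directly in the log. The following is only a minimal sketch: the data set names (one, two, merged) and the values A1, A2, A3 and B are made up for illustration and are not from the original post. The PUT statement writes the contents of the PDV at the end of each iteration.

```sas
/* Hypothetical demo data: one CUST_ID with three rows in the "many" data set */
data one;
  input cust_id category $;
  datalines;
2043 A1
2043 A2
2043 A3
;
run;

/* Lookup data set: one row per CUST_ID */
data two;
  input cust_id category $;
  datalines;
2043 B
;
run;

data merged;
  merge one(in=in1) two(in=in2);
  by cust_id;
  if in1;
  /* Show what is in the PDV when each observation is written */
  put _n_= cust_id= category= in1= in2=;
run;
```

If the explanation above holds, the first row should show category=B (the value from two, read last, overwrites A1), while the second and third rows should show A2 and A3, because nothing more is read from two for that CUST_ID. Note that in2 stays at 1 for the whole BY group even though the CATEGORY value from two is only read once.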


All Replies
PROC Star
Posts: 7,363

Re: strange merge issue

The code, without the additional assignment, doesn't provide SAS with any indication of which category value to keep.  An easier method, I think (given your fix), would simply be to include a drop option when you merge the files.  i.e.:

merge ds1(in=in1 drop=CATEGORY) ds2(in=in2);


Occasional Contributor LHV
Posts: 15

Re: strange merge issue

Well, my understanding is that the PDV holds only one variable named CATEGORY, and that the value of CATEGORY from ds2 replaces the value of CATEGORY from ds1 whenever there is a match by CUST_ID.

If that is the default mechanism for a SAS match-merge, why do we need to indicate which one to use?

PROC Star
Posts: 7,363

Re: strange merge issue

Timing! Yes, only one value can be in the PDV at a time, but which one?

You can see by running the following two merges:

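/* "Many" data set: four rows per id, category always 1 */
data ds1;
  input id category;
  cards;
1 1
1 1
1 1
1 1
2 1
2 1
2 1
2 1
;
run;

/* Lookup data set: one row per id, category always 2 */
data ds2;
  input id category;
  cards;
1 2
2 2
;
run;

/* Both data sets contribute category */
data want1;
  merge ds1(in=in1) ds2(in=in2);
  by id;
run;

/* Only ds2 contributes category */
data want2;
  merge ds1(in=in1 drop=category) ds2(in=in2);
  by id;
run;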

Occasional Contributor LHV
Posts: 15

Re: strange merge issue

Many thanks Arthur,

I understood the issue very well, even with my original data.

As for the timing, I was under the impression that the order of data sets in the MERGE statement decides it, i.e. which data values are written to the PDV first.

In fact, in my case ds1 is the main data set and has lots of additional CUST_IDs that are not covered by ds2.

Here I want to do a left join and don't want to lose anything from ds1; I only want to replace CATEGORY when there is a match in ds2.

PROC Star
Posts: 7,363

Re: strange merge issue

Then your original approach (with a slight modification) is one possibility, the other is to control it directly using proc sql.

In your post your code reflected that you wanted to set non-matching entries as missing.  i.e.:

if CATEGORY_1 ne ' ' then CATEGORY= CATEGORY_1;


I think, if you use a DATA step, that you would want to replace that with something like:


if in1 and in2 then CATEGORY= CATEGORY_1;
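
For reference, here is a rough sketch of both options mentioned above: the DATA step with the rename, and a PROC SQL left join. It is not tested against the original data. The sample rows, the extra variable AMOUNT, the helper name CATEGORY_OLD and the output name ds3_sql are all assumptions made up for the illustration.

```sas
/* Hypothetical sample data; AMOUNT stands in for the other ds1 variables */
data ds1;
  input CUST_ID CATEGORY $ AMOUNT;
  datalines;
2043 NEW 10
2043 OLD 20
2043 NEW 30
3001 NEW 40
;
run;

data ds2;
  input CUST_ID CATEGORY $;
  datalines;
2043 OLD
;
run;

/* DATA step version: rename the ds2 value so the two never collide */
data ds3;
  merge ds1(in=in1) ds2(in=in2 rename=(CATEGORY=CATEGORY_1));
  by CUST_ID;
  if in1;                              /* keep every row of ds1 */
  if in2 then CATEGORY = CATEGORY_1;   /* matched: take the ds2 value */
  drop CATEGORY_1;
run;

/* PROC SQL version: left join, taking CATEGORY from ds2 only on a match */
proc sql;
  create table ds3_sql(drop=CATEGORY_OLD) as
  select a.*,
         case when b.CUST_ID is not null then b.CATEGORY
              else a.CATEGORY_OLD
         end as CATEGORY
  from ds1(rename=(CATEGORY=CATEGORY_OLD)) as a
       left join ds2 as b
       on a.CUST_ID = b.CUST_ID;
quit;
```

With this sample data, every row for CUST_ID 2043 should end up with CATEGORY OLD, and the unmatched CUST_ID 3001 should keep NEW. The DATA step works for all rows of a BY group because the renamed CATEGORY_1 is retained until the BY value changes. The SQL version does not guarantee row order unless you add an ORDER BY clause, and CATEGORY becomes the last column.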


PROC Star
Posts: 1,231

Re: strange merge issue

I agree with Astounding's response.

To understand what happens when you have a variable collision in a merge, you need to think in terms of the PDV: when records are read into the PDV, when the PDV is initialized, etc.

I think the bottom line is that variable collisions in a merge are BAD, and there is no rule that variables from the second dataset overwrite the first.

I turn msglevel=i on to notify me of collisions, which I treat as an error (to be avoided with drop/keep/rename etc). I find the resulting INFO message in the log to be one of the most misleading messages that SAS produces, as it no doubt reinforces the belief that variables from the second dataset will overwrite the first, when in fact you can end up with a mix of the two.

Using Art's data with msglevel=i turned on:

52 options msglevel=i;

53 data want1;

54 merge ds1(in=in1) ds2(in=in2);

55 by id;

56 run;

INFO: The variable category on data set WORK.DS1 will be overwritten by data set WORK.DS2.

NOTE: There were 8 observations read from the data set WORK.DS1.

NOTE: There were 2 observations read from the data set WORK.DS2.

NOTE: The data set WORK.WANT1 has 8 observations and 2 variables.

Occasional Contributor LHV
Posts: 15

Re: strange merge issue

Many thanks, everyone, for your input.

When I learned SAS, the SAS Certification Training Core Concepts V8 document said: "DATA step match-merging overwrites values of the like-named variable in the first data set in which it appears with values of the like-name variable in subsequent data sets."

This wording was the main source of my confusion.

