LHV
Calcite | Level 5

Hi,

I have come across a strange SAS merge issue.

I have two SAS datasets : ds1 and ds2.

ds1 has a number of variables, including CUST_ID and CATEGORY. CUST_ID is numeric and CATEGORY is character.

ds2 has only these two variables, with the same length and type. ds1 has more CUST_IDs than ds2.

Both data sets are sorted by CUST_ID.

In ds1, a CUST_ID may appear in more than one observation, with different values for CATEGORY.

e.g.

CUST_ID  CATEGORY
2043     NEW
2043     OLD
2043     NEW

In ds2, CUST_ID is unique, i.e. a CUST_ID appears only once with a single value for CATEGORY.

e.g.

CUST_ID  CATEGORY
2043     OLD

When ds1 and ds2 are merged by CUST_ID using the DATA step below, I get a strange result.

data ds3;
     merge ds1(in=in1) ds2(in=in2);
     by CUST_ID;
     if in1;
run;

In the resulting data set (ds3) there are many matched CUST_IDs that still carry more than one CATEGORY value.

I only get the expected result if I modify the step as below.

data ds3;
     merge ds1(in=in1) ds2(in=in2 rename=(CATEGORY=CATEGORY_1));
     by CUST_ID;
     if in1;
     if CATEGORY_1 ne ' ' then CATEGORY = CATEGORY_1;
run;

Can anyone explain what might prevent a single CATEGORY value from being assigned to a given CUST_ID without adding an extra statement like (if CATEGORY_1 ne ' ' then CATEGORY= CATEGORY_1;)?

Many thanks in advance for your help.


Accepted Solution
Astounding
PROC Star

The process is a little more complex than that.  Each data value gets read only once.  For the first match on CUST_ID, the CATEGORY from ds2 overwrites the CATEGORY from ds1.  For any additional records for the same CUST_ID, the value from ds1 is read in and replaces the previous value.  Since there is no additional matching observation to read from ds2, you are left with the value from ds1.
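
The sequence described above can be traced with a small sketch built from the sample values in the question. The data set names follow the original post; the PUT statement is added here only to write each iteration's values to the log and is not part of the original code.

```sas
/* Sample data from the question: three ds1 rows for one customer */
data ds1;
   input CUST_ID CATEGORY $;
   datalines;
2043 NEW
2043 OLD
2043 NEW
;
run;

/* ds2 holds a single row per customer */
data ds2;
   input CUST_ID CATEGORY $;
   datalines;
2043 OLD
;
run;

data ds3;
   merge ds1(in=in1) ds2(in=in2);
   by CUST_ID;
   if in1;
   /* Trace the CATEGORY value that is about to be written out */
   put _n_= CUST_ID= CATEGORY=;
run;
```

On the first iteration the ds2 value (OLD) overwrites the ds1 value (NEW). On the second and third iterations only ds1 is read, so CATEGORY takes the ds1 values OLD and NEW in turn. The customer therefore ends up with both OLD and NEW in ds3, even though ds2 says OLD.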

All of the tools you have seen to program around this are valid tools.  You just have to pick the one that matches what you need.

Good luck.

art297
Opal | Level 21

The code, without the additional assignment, doesn't provide SAS with any indication of which category value to keep.  An easier method, I think (given your fix), would simply be to include a drop option when you merge the files.  i.e.:

merge ds1(in=in1 drop=CATEGORY) ds2(in=in2);


LHV
Calcite | Level 5

Well, my understanding is that the PDV holds only one variable named CATEGORY, and that the value of CATEGORY from ds2 replaces the value of CATEGORY from ds1 whenever there is a match by CUST_ID.

If that is the default mechanism for a SAS match-merge, why do we need to indicate which one to use?

art297
Opal | Level 21

Timing!  Yes, only one value can be in the PDV at a time, but which one?

You can see by running the following two merges:

data ds1;
  input id category;
  cards;
1 1
1 1
1 1
1 1
2 1
2 1
2 1
2 1
;

data ds2;
  input id category;
  cards;
1 2
2 2
;

data want1;
  merge ds1(in=in1) ds2(in=in2);
  by id;
run;

data want2;
  merge ds1(in=in1 drop=category) ds2(in=in2);
  by id;
run;

LHV
Calcite | Level 5

Many thanks Arthur,

I understood the issue very well, even with my original data.

As for the timing, I was under the impression that the order of the data sets in the MERGE statement decides it, i.e. which data values are written to the PDV first.

In fact, in my case ds1 is the main data set and has lots of additional CUST_IDs that are not covered in ds2.

Here, I want to do a left join and don't want to lose anything from ds1; I only want to replace CATEGORY when there is a match in ds2.

art297
Opal | Level 21

Then your original approach (with a slight modification) is one possibility, the other is to control it directly using proc sql.

In your post your code reflected that you wanted to set non-matching entries as missing.  i.e.:

if CATEGORY_1 ne ' ' then CATEGORY= CATEGORY_1;


I think, if you use a DATA step, you would want to replace that with something like:


if in1 and in2 then CATEGORY= CATEGORY_1;
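
Put together, the two approaches mentioned in this reply might look like the sketch below. It assumes ds1 and ds2 as described in the original question, both sorted by CUST_ID. The names CATEGORY_1, CATEGORY_DS1 and ds3_sql are illustrative only.

```sas
/* DATA step version: keep every ds1 row, replace CATEGORY only on a match */
data ds3;
   merge ds1(in=in1)
         ds2(in=in2 rename=(CATEGORY=CATEGORY_1));
   by CUST_ID;
   if in1;
   /* CATEGORY_1 is retained for the whole BY group, so every ds1 row
      for a matched CUST_ID receives the ds2 value */
   if in1 and in2 then CATEGORY = CATEGORY_1;
   drop CATEGORY_1;
run;

/* PROC SQL version: left join, preferring the ds2 value when it exists */
proc sql;
   create table ds3_sql(drop=CATEGORY_DS1) as
   select a.*,
          coalesce(b.CATEGORY, a.CATEGORY_DS1) as CATEGORY
   from ds1(rename=(CATEGORY=CATEGORY_DS1)) as a
        left join ds2 as b
        on a.CUST_ID = b.CUST_ID
   order by a.CUST_ID;
quit;
```

The two versions differ in one edge case. If a matching ds2 row has a blank CATEGORY, the DATA step version copies the blank, while COALESCE keeps the ds1 value. Which behavior is right depends on the data.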


Quentin
Super User

Agree with Astounding's response.

To understand what happens when you have a variable collision in a merge, you need to think in terms of the PDV: when records are read into it, when it is initialized, and so on.

I think the bottom line is that variable collisions in a merge are BAD, and there is no rule that variables from the second dataset overwrite the first.

I turn msglevel=i on to notify me of collisions, which I treat as an error (to be avoided with drop/keep/rename etc).  I find the note to the log to be one of the most misleading messages that SAS produces, as it no doubt reinforces the belief that variables from the second dataset will overwrite the first, when in fact you can end up with a mix of the two.

Using Art's data with msglevel=i turned on:

52 options msglevel=i;
53 data want1;
54 merge ds1(in=in1) ds2(in=in2);
55 by id;
56 run;

INFO: The variable category on data set WORK.DS1 will be overwritten by data set WORK.DS2.
NOTE: There were 8 observations read from the data set WORK.DS1.
NOTE: There were 2 observations read from the data set WORK.DS2.
NOTE: The data set WORK.WANT1 has 8 observations and 2 variables.

LHV
Calcite | Level 5

Many thanks, everyone, for your input.

When I learned SAS, the SAS Certification Training Core Concepts V8 document said, "DATA step match-merging overwrites values of the like-named variable in the first data set in which it appears with values of the like-name variable in subsequent data sets."

This wording is the main source of my confusion.

