Hi,
I came across a strange SAS merge issue.
I have two SAS data sets: ds1 and ds2.
ds1 has a number of variables, including CUST_ID and CATEGORY, which are numeric and character variables respectively.
ds2 has only these two variables, with the same length and type. ds1 has more CUST_IDs than ds2.
Both data sets are sorted by CUST_ID.
In ds1, a CUST_ID may appear in more than one observation, with different values for CATEGORY.
e.g. CUST_ID CATEGORY
2043 NEW
2043 OLD
2043 NEW
In ds2, CUST_ID is unique, i.e. a CUST_ID appears only once with a single value for CATEGORY.
e.g. CUST_ID CATEGORY
2043 OLD
When ds1 and ds2 are merged by CUST_ID using the DATA step below, I get a strange result.
data ds3;
merge ds1(in=in1) ds2(in=in2);
by CUST_ID;
if in1;
run;
In the resulting data set (ds3), many of the matched CUST_IDs are populated with more than one CATEGORY value.
I only get the expected result if I modify the step as below.
data ds3;
merge ds1(in=in1) ds2(in=in2 rename=(CATEGORY=CATEGORY_1));
by CUST_ID;
if in1;
if CATEGORY_1 ne ' ' then CATEGORY= CATEGORY_1;
run;
Can anyone explain what prevents me from obtaining a unique CATEGORY value for a given CUST_ID without adding
an additional statement like (if CATEGORY_1 ne ' ' then CATEGORY= CATEGORY_1;)?
Many thanks in advance for your help.
The process is a little more complex than that. Each data value gets read only once. For the first match on CUST_ID, the CATEGORY from ds2 overwrites the CATEGORY from ds1. For any additional records for the same CUST_ID, the value from ds1 is read in and replaces the previous value. Since there is no additional matching observation to read from ds2, you are left with the value from ds1.
All of the tools you have seen to program around this are valid tools. You just have to pick the one that matches what you need.
Good luck.
The code, without the additional assignment, doesn't provide SAS with any indication of which category value to keep. An easier method, I think (given your fix), would simply be to include a drop option when you merge the files. i.e.:
merge ds1(in=in1 drop=CATEGORY) ds2(in=in2);
Well, my understanding is that in the PDV there will be only one variable named CATEGORY, and that the values of CATEGORY from ds2 replace the values of CATEGORY from ds1 whenever there is a match by CUST_ID.
If that is the default mechanism for a SAS match-merge, why do we need to indicate which one to use?
Timing! Yes, only one value can be in the PDV at a time, but which one?
You can see by running the following two merges:
data ds1;
input id category;
cards;
1 1
1 1
1 1
1 1
2 1
2 1
2 1
2 1
;
data ds2;
input id category;
cards;
1 2
2 2
;
data want1;
merge ds1(in=in1) ds2(in=in2);
by id;
run;
data want2;
merge ds1(in=in1 drop=category) ds2(in=in2);
by id;
run;
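To see the difference, the two results can be printed side by side. This is only a sketch that assumes ds1, ds2, want1 and want2 were created exactly as in the steps above; the PROC PRINT steps and titles are additions for illustration.

```sas
/* Illustration only: print both merge results from the example above. */
title "want1: category present in both ds1 and ds2";
proc print data=want1 noobs;
run;
/* Expected for each id: category=2 on the first row of the BY group,
   then category=1 on the remaining three rows. ds1 is read again on
   every iteration, while the single ds2 row is read only once. */

title "want2: category dropped from ds1";
proc print data=want2 noobs;
run;
/* Expected: category=2 on all 8 rows. The ds2 value is retained in the
   PDV for the whole BY group and nothing from ds1 overwrites it. */

title;  /* clear the title */
```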
Many thanks Arthur,
I understand the issue very well now, even with my original data.
As for the timing, I was under the impression that the order of the data sets in the MERGE statement decides the timing, i.e. which data values are written to the PDV first.
In fact, in my case ds1 is the main data set and has lots of additional CUST_IDs which are not covered in ds2.
Here, I want to do a left join and don't want to lose anything from ds1, except for replacing CATEGORY when there is a match in ds2.
Then your original approach (with a slight modification) is one possibility, the other is to control it directly using proc sql.
In your post, your code relied on a blank CATEGORY_1 to identify the non-matching entries, i.e.:
if CATEGORY_1 ne ' ' then CATEGORY= CATEGORY_1;
I think, if you use a DATA step, that you would want to replace that with something like:
if in1 and in2 then CATEGORY= CATEGORY_1;
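Putting the two options together, a rough sketch of the left join might look like the following. It assumes ds1 and ds2 are as described in the original post and sorted by CUST_ID; the output names ds3_merge and ds3_sql and the temporary names CATEGORY_1 and CATEGORY_OLD are made up for illustration.

```sas
/* Option 1: DATA step. Rename the ds2 variable so the two CATEGORY
   values never collide in the PDV. */
data ds3_merge;
  merge ds1(in=in1) ds2(in=in2 rename=(CATEGORY=CATEGORY_1));
  by CUST_ID;
  if in1;                               /* keep every ds1 observation */
  if in2 then CATEGORY = CATEGORY_1;    /* overwrite only on a match  */
  drop CATEGORY_1;
run;

/* Option 2: PROC SQL left join. The ds1 value is renamed on input so
   that a.* does not clash with the computed CATEGORY column. */
proc sql;
  create table ds3_sql(drop=CATEGORY_OLD) as
  select a.*,
         case when b.CUST_ID is not null then b.CATEGORY
              else a.CATEGORY_OLD
         end as CATEGORY
  from ds1(rename=(CATEGORY=CATEGORY_OLD)) as a
       left join ds2 as b
       on a.CUST_ID = b.CUST_ID
  order by a.CUST_ID;
quit;
```

In both versions CATEGORY is taken from ds2 whenever the CUST_ID is found there, even if the ds2 value happens to be blank. Note that PROC SQL does not promise to keep the original row order within a CUST_ID, and that CATEGORY ends up as the last column in ds3_sql.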
Agree with Astounding's response.
To understand what happens when you have a variable collision in a merge, you need to think in terms of the PDV: when records are read into the PDV, when the PDV is initialized, etc.
I think the bottom line is that variable collisions in a merge are BAD, and there is no rule that variables from the second data set overwrite the first.
I turn msglevel=i on to notify me of collisions, which I treat as an error (to be avoided with drop/keep/rename etc). I find the resulting message in the log to be one of the most misleading messages that SAS produces, as it no doubt reinforces the belief that variables from the second data set will overwrite the first, when in fact you can end up with a mix of the two.
Using Art's data with msglevel=i turned on:
52 options msglevel=i;
53 data want1;
54 merge ds1(in=in1) ds2(in=in2);
55 by id;
56 run;
INFO: The variable category on data set WORK.DS1 will be overwritten by data set WORK.DS2.
NOTE: There were 8 observations read from the data set WORK.DS1.
NOTE: There were 2 observations read from the data set WORK.DS2.
NOTE: The data set WORK.WANT1 has 8 observations and 2 variables.
Many thanks everyone for your input.
When I learned SAS, the SAS Certification Training Core Concepts V8 document said, "DATA step match-merging overwrites values of the like-named variable in the first data set in which it appears with values of the like-name variable in subsequent data sets."
This wording is the main source of my confusion.