Hi all,
I am having some trouble understanding the behavior of a one-to-many merge situation I am working through on SAS 9.4.
Imagine that our business previously had separate sales targets for each employee within each of our divisions. In the ACCT department, Sally had to make 1 sale per week of item1 and Steve had to make 2 sales per week of item1.
Now we want to assign new standard targets for all items we sell, for all employees in all departments. In the ACCT department, Sally and Steve will now have to make 3 sales per week of item1, their new item1_target.
In the following example1, I notice that when I merge my new_targets onto my old_targets within a department, only the first record containing an old_target gets overwritten with the new_target, instead of all old_targets. By dropping the old_targets in example2, all employee records correctly receive the new target:
data old_targets;
input department $ name $ item1_target;
datalines;
ACCT Sally 1
ACCT Steve 2
MGMT Tommy 1
MGMT Terri 1
;
run;
data new_targets;
input department $ item1_target;
datalines;
ACCT 3
MGMT 3
;
run;
proc sort data=old_targets; by department; run;
proc sort data=new_targets; by department; run;
data example1;
merge old_targets (in=a)
new_targets (in=b);
by department;
if a;
run;
proc print data=example1; title "example1"; run;
proc sort data=old_targets(drop=item1_target); by department; run;
proc sort data=new_targets; by department; run;
data example2;
merge old_targets (in=a)
new_targets (in=b);
by department;
if a;
run;
proc print data=example2; title "example2"; run;
Can someone help me understand why the merge operation stops after the first record for each department in example1, and whether there is any way to avoid this without having to drop/keep/rename any variables? In the above examples, I have only included a few variables. In reality, we have dozens of variables containing information about each employee and hundreds of sales targets for various items, so it is quite time-consuming (and error-prone) to have to either drop every variable we want to update/overwrite or keep only the variables that we don't want updated/overwritten.
Is there any way to achieve this kind of one-to-many merge such that it updates all records for the given BY variable, rather than just updating the first record in each BY group, without dropping/keeping/renaming variables?
Thank you for your help!
Because MERGE reads each record in both datasets only once. When both datasets contribute to the first observation in a BY group, the one listed last on the MERGE statement "wins". But for the second, third, etc. observations in the group, only new observations from the datasets that are still contributing are read, so their values overwrite the values read before.
DROP or RENAME one of the variables and you can see what is happening. If you want to keep the old value when there is no match, use the COALESCE() function.
data example1;
merge
old_targets (in=a rename=(item1_target=old_item1_target))
new_targets (in=b rename=(item1_target=new_item1_target))
;
by department;
if a;
item1_target = coalesce(new_item1_target,old_item1_target);
run;
So there is no way to force SAS to apply the ONE record for as many observations as exist in the MANY dataset, without dropping/keeping/renaming variables?
It's a lot of work to rename all of those hundreds of items that we have, just to make SAS perform the simple behavior of matching this one record per department to all of the observations in the MANY dataset. Is there really no other way to do this?
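Edit: one workaround I'm experimenting with is generating the RENAME= list and the overwrite statements from DICTIONARY.COLUMNS, so nothing has to be typed by hand. This is only a sketch; it assumes the new targets sit in WORK.NEW_TARGETS, that DEPARTMENT is the only key variable, that the target variables are numeric (COALESCEC() would be needed for character ones), and that no variable name comes within 4 characters of the 32-character limit:
proc sql noprint;
/* build e.g. item1_target=new_item1_target for every non-key variable */
select catx('=', name, cats('new_', name))
into :renames separated by ' '
from dictionary.columns
where libname = 'WORK' and memname = 'NEW_TARGETS'
and upcase(name) ne 'DEPARTMENT';
/* build e.g. item1_target=coalesce(new_item1_target,item1_target); */
select cats(name, '=coalesce(new_', name, ',', name, ');')
into :assigns separated by ' '
from dictionary.columns
where libname = 'WORK' and memname = 'NEW_TARGETS'
and upcase(name) ne 'DEPARTMENT';
quit;
data example1b;
merge old_targets (in=a)
new_targets (rename=(&renames));
by department;
if a;
&assigns
drop new_:;
run;
Because the renamed new_ variables are retained for the whole BY group, every employee record in the department sees the new value, and COALESCE() keeps the old value where no new target exists.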
An approach I'd suggest (I'm sure somebody can code it for you):
Read your new targets into a hash table. Check for matches, bring in all variables from the hash table when you have a match.
An easy way is to put an INDEX on the transaction file.
data old_targets;
input department $ name $ item1_target;
datalines;
ACCT Sally 1
ACCT Steve 2
MGMT Tommy 1
MGMT Terri 1
UNKN New 2
;
data new_targets ( index=(department) );
input department $ item1_target;
datalines;
ACCT 3
MGMT 3
UNUSED 4
;
data example3;
set old_targets;
set new_targets key=department / unique;
if _iorc_ ne 0 then _error_ = 0; /* no match: keep the old value and suppress the error flag */
run;
proc print; run;
This can be done much more easily with PROC SQL. Here's an example:
PROC SQL;
SELECT new_targets.department, name, new_targets.item1_target
FROM old_targets, new_targets
WHERE old_targets.department = new_targets.department
ORDER BY new_targets.department;
QUIT;
Or using aliasing to shorten the code considerably:
PROC SQL;
SELECT n.department, name, n.item1_target
FROM old_targets AS o, new_targets AS n
WHERE o.department = n.department
ORDER BY n.department;
QUIT;
You don't need to sort anything beforehand and the syntax is fairly simple. Joins like this are an area where SQL really shines above the equivalent DATA step code.
The trouble with that SQL solution is that it will yield different results from the merge if there are no "new_targets" for some departments: the inner join drops those departments entirely.
Also, SQL requires that you type the individual variable names, and the request was to avoid having to do that.
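For what it's worth, a LEFT JOIN with COALESCE() would at least keep the unmatched departments and fall back to the old value for them, though it still requires typing each variable name. A sketch:
proc sql;
create table example_sql as
select o.department,
o.name,
coalesce(n.item1_target, o.item1_target) as item1_target /* new value if matched, else old */
from old_targets as o
left join new_targets as n
on o.department = n.department
order by o.department;
quit;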
Thanks for pointing out this issue with different results, Tom, and indeed I was hoping to avoid typing out each name. I'll have to do some experimentation with each method.
@GinaRepole Ma'am, I can agree to a certain extent that "Joins like this are an area where SQL really shines above the equivalent DATA step code", but doesn't SQL still need to build a Cartesian product and then apply whatever join we specify? I think that's an overhead.
Just my 2 cents,
Naveen Srinivasan
Awesome question that made me feel so dumb at 11:00 PM in Chennai. I will have to read through the documentation thoroughly, as I recalled that something like a many-to-one merge should happen in your case, which it did not. I hope Tom and @Astounding may spare a little more time explaining MERGE, if they don't mind.
Anyway, here is your hash solution if you like. I am so used to using hashes every day, I totally forgot the basics of MERGE :(
data want;
if (_n_ = 1) then do;
if 0 then do; /* never executes, but defines the hash host variables in the PDV */
set old_targets;
set new_targets;
end;
declare hash myhash(dataset: "new_targets"); /* load the new targets, keyed by department */
myhash.definekey('department');
myhash.definedata('item1_target');
myhash.definedone();
end;
set old_targets;
if myhash.find() ne 0 then call missing(item1_target); /* no match: set target to missing */
run;
Regards,
Naveen Srinivasan
Naveen, thanks for coding this. Just a couple of notes ...
If there are dozens of new targets instead of just 1, how would you code the definedata line?
When finding a match, the objective is to replace the old values with the new values. When there is no match, the old values should remain in place. I'm not sure that call missing() does that, but there should be an easy fix.
Regarding MERGE, Tom mentioned the crucial feature in one of his posts: each observation gets read once. So the process for each value of the BY variable is (a) see if there is another observation to read from the first data set (if so read it in), then (b) see if there is another observation to read from the second data set (if so, read it in). That's what produces the original, strange-if-you-haven't-seen-it-before, result. My recollection is that the documentation skirts around the issue (in some places), saying something like, "The value you end up with is the last value read."
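Tracing example1 by hand makes that sequence concrete (my reading of the steps above; whatever is in the program data vector when the observation is output is what you see):
ACCT group:
obs 1: read Sally (item1_target=1) from old_targets,
then read ACCT (item1_target=3) from new_targets -> output 3
obs 2: new_targets has no more ACCT rows, so only Steve
(item1_target=2) is read from old_targets -> output 2
MGMT group:
obs 1: Tommy (1), then MGMT (3) from new_targets -> output 3
obs 2: Terri (1) only -> output 1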
@Astounding Thank you, Sir, for the notes:
2. -->When finding a match, the objective is to replace the old values with the new values. When there is no match, the old values should remain in place. I'm not sure that call missing() does that, but there should be an easy fix.
Answer:
data want;
if (_n_ = 1) then do;
if 0 then do;
set old_targets;
set new_targets;
end;
declare hash myhash(dataset: "new_targets");
myhash.definekey('department');
myhash.definedata('item1_target');
myhash.definedone();
end;
set old_targets;
_item1_target=item1_target; /*simple reset*/
if myhash.find() ne 0 then item1_target=_item1_target; /*simple reset*/
drop _:;
run;
1. --> If there are dozens of new targets instead of just 1, how would you code the definedata line?
Answer (if I understand you correctly):
declare hash myhash(dataset: "new_targets"); /* the dataset argument tag accepts dataset options such as keep= or drop= inside the literal */
myhash.definekey('department');
myhash.definedata(all: 'yes'); /* register every variable in the dataset as data */
myhash.definedone();
Regards,
Naveen Srinivasan