05-10-2017 10:09 PM - edited 05-10-2017 10:10 PM
I saw that example at the SAS User Group conference in Quebec City last week. They were adamant that this was normal behaviour and could be explained by how the program data vector works.
Look at the last row of TEST3. "Joe Demoted" was a professional in 2016, yet his name has been removed.
Workarounds include doing 2 different data steps (one for merging and one for masking names) or just using PROC SQL.
This is super dangerous behaviour and I'll just stick to PROC SQL for merges as I can't trust myself to remember this trap, but how could this behaviour be desirable?
data jobs; emp_id = 1; year=2015; cat_emp="Professionnal"; output; emp_id = 1; year=2016; cat_emp="Manager"; output; emp_id = 2; year=2015; cat_emp="Professionnal"; output; emp_id = 2; year=2016; cat_emp="Professionnal"; output; emp_id = 3; year=2015; cat_emp="Manager"; output; emp_id = 3; year=2016; cat_emp="Manager"; output; emp_id = 4; year=2015; cat_emp="Manager"; output; emp_id = 4; year=2016; cat_emp="Professionnal"; output; run; data employees; length name $20.; emp_id = 1; name="Joe Promoted"; output; emp_id = 2; name="Joe Professionnal"; output; emp_id = 3; name="Joe Manager"; output; emp_id = 4; name="Joe Demoted"; output; run; data test3; merge jobs(in=a) employees; by emp_id; if cat_emp="Manager" then name=""; nn = _n_; run;
05-10-2017 10:44 PM
Well, it is normal behavior. Some of the principles involved:
The first bullet point is the key one here. In a many-to-one merge, the "one" observation gets read only once. The variables are retained as more observations are read from the "many" data set. But if you change a value that came from the "one" data set, it stays changed. The changed value then exists in the PDV and is retained. The original value is never re-read.
There are simpler solutions than what was presented to you. For example:
merge jobs (in=a) employees;
if cat_emp ne "Manager" then newname = name;
nn = _n_;
It's too late at night for me to think of an example where this behavior is useful, but I might come up with one.
05-11-2017 10:59 AM - edited 05-11-2017 11:30 AM
This is a great use case for the DATA step debugger in SAS Enterprise Guide. Stepping through that last DATA step, you can see that the "name" variable value is retained for the last two records, and the logic clears the value when the first one is processed.
One workaround (I think) that keeps it to one step is don't replace that name value, but instead designate a different "output name" value.
data test3; merge jobs(in=a) employees; length out_name $ 20; by emp_id; if cat_emp="Manager" then out_name=""; else out_name=name; nn = _n_; drop name; run;
And @morglum, as to your question "why is it this way?" -- think of it this way. You have SAS variables that serve two purposes: one for driving program logic (conditionals and so on) and one for building the records you want in output. In the original program, you're mixing the two roles and the inline modification of the "name" variable affects its utility as an output. So the solution is to separate those roles, and keep one variable for logic, and one purely for output.
05-11-2017 12:45 PM
Thanks Chris for this answer. I'll use that explanation along the two others and will use different variables for input and output.
I also wasnt aware of the data step debugger, which is interesting.
05-11-2017 11:23 AM
Thanks a lot Astouding for the detailed explanation of the mechanics. It is now a lot clearer for me on "how" this is happening. What I still need help on is the "why" you would want this behaviour to occur. I've been using SAS for a decade and it's the first time I've heard of it. I 'm pretty sure I have tons of invalid merges in my programs...
Your solution was also presented, maybe I should also have mentionned it. I don't like it that much, as coming up with temporary names can be an hassle in a case where there are more than 1 variable that you want to work one.
05-11-2017 04:15 AM
Simple rule of thumb: never change a variable that is read from the "1" dataset in a 1:n merge; instead create a new one conditionally and have that replace the original one.
BTW, that is what happens when you use a "case/when/end as" in a SQL select. SQL select never modifies an input variable, it creates new ones (although those may be assigned the same name)
05-11-2017 11:24 AM
Rule of thumb seems simple enough, thanks.
I think I'll just stick to PROC SQL. There might be other unexpected quirks in the datastep that I never heard about.
05-11-2017 11:30 AM
That is why it is so important to read the documentation. The Data Step Concepts are the most important piece to really internalize for any aspiring SAS programmer. Just like the fact that C needs a break statement in a switch.