DATA Step, Macro, Functions and more

Bizarre data step behaviour (merge followed by If) -- in what universe is this behaviour desirable?

Reply
Frequent Contributor
Posts: 96

Bizarre data step behaviour (merge followed by If) -- in what universe is this behaviour desirable?

[ Edited ]

Hi everyone,

 

I saw that example at the SAS User Group conference in Quebec City last week.  They were adamant that this was normal behaviour and could be explained by how the program data vector works.

 

Look at the last row of TEST3.  "Joe Demoted" was a professional in 2016, yet his name has been removed.

 

Workarounds include doing 2 different data steps (one for merging and one for masking names) or just using PROC SQL.

 

This is super dangerous behaviour and  I'll just stick to PROC SQL for merges as I can't trust myself to remember this trap, but how could this behaviour be desirable?  

 

Thanks

 

 

data jobs;
     emp_id = 1;     year=2015; cat_emp="Professionnal"; output;
     emp_id = 1;     year=2016; cat_emp="Manager"; output;
     emp_id = 2;     year=2015; cat_emp="Professionnal"; output;
     emp_id = 2;     year=2016; cat_emp="Professionnal"; output;
     emp_id = 3;     year=2015; cat_emp="Manager"; output;
     emp_id = 3;     year=2016; cat_emp="Manager"; output;
     emp_id = 4;     year=2015; cat_emp="Manager"; output;
     emp_id = 4;     year=2016; cat_emp="Professionnal"; output;
run;
 
data employees;
     length name  $20.;
     emp_id = 1; name="Joe Promoted"; output;
     emp_id = 2; name="Joe Professionnal"; output;
     emp_id = 3; name="Joe Manager"; output;
     emp_id = 4; name="Joe Demoted"; output;
run;
 
data test3;
     merge jobs(in=a) employees;
     by emp_id;
     if cat_emp="Manager" then name="";
     nn = _n_;
run;

 

test3.png

jobs.png

emp.png

 

Super User
Posts: 5,085

Re: Bizarre data step behaviour (merge followed by If) -- in what universe is this behaviour desirab

Well, it is normal behavior.  Some of the principles involved:

 

  • Each incoming observation gets read exactly once.
  • "Gets read" means copied from the incoming data set to the PDV (sort of, there are intermediate buffers along the way)
  • Calculations in the DATA step (such as IF/THEN are carried on values in the PDV)
  • Outputting an observation means copying the values from the PDV to the output data set (again, there are intermediate buffers along the way)

The first bullet point is the key one here.  In a many-to-one merge, the "one" observation gets read only once.  The variables are retained as more observations are read from the "many" data set.  But if you change a value that came from the "one" data set, it stays changed.  The changed value then exists in the PDV and is retained.  The original value is never re-read.

 

There are simpler solutions than what was presented to you.  For example:

 

data test3;

merge jobs (in=a) employees;

by emp_id;

if cat_emp ne "Manager" then newname = name;

nn = _n_;

drop name;

rename newname=name;

run;

 

It's too late at night for me to think of an example where this behavior is useful, but I might come up with one.

Community Manager
Posts: 2,764

Re: Bizarre data step behaviour (merge followed by If) -- in what universe is this behaviour desira

[ Edited ]

This is a great use case for the DATA step debugger in SAS Enterprise Guide.  Stepping through that last DATA step, you can see that the "name" variable value is retained for the last two records, and the logic clears the value when the first one is processed.

 

dsd.png

 

One workaround (I think) that keeps it to one step is don't replace that name value, but instead designate a different "output name" value.

 

data test3;
     merge jobs(in=a) employees;
	 length out_name $ 20;
     by emp_id;
     if cat_emp="Manager" then out_name=""; else out_name=name;
     nn = _n_;
     drop name;
run;

 

And @morglum, as to your question "why is it this way?" -- think of it this way.  You have SAS variables that serve two purposes: one for driving program logic (conditionals and so on) and one for building the records you want in output.  In the original program, you're mixing the two roles and the inline modification of the "name" variable affects its utility as an output.  So the solution is to separate those roles, and keep one variable for logic, and one purely for output. 

 

Chris

Frequent Contributor
Posts: 96

Re: Bizarre data step behaviour (merge followed by If) -- in what universe is this behaviour desira

Thanks Chris for this answer.   I'll use that explanation along the two others and will use different variables for input and output.

 

I also wasnt aware of the data step debugger, which is interesting.

 

Best

 

Frequent Contributor
Posts: 96

Re: Bizarre data step behaviour (merge followed by If) -- in what universe is this behaviour desirab

Thanks a lot Astouding for the detailed explanation of the mechanics.  It is now a lot clearer for me on "how" this is happening.  What I still need help on is the "why" you would want this behaviour to occur.  I've been using SAS for a decade and it's the first time I've heard of it. I 'm pretty sure I have tons of invalid merges in my programs...

 

Your solution was also presented, maybe I should also have mentionned it.  I don't like it that much, as coming up with temporary names can be an hassle in a case where there are more than 1 variable that you want to work one.

 

Super User
Posts: 6,946

Re: Bizarre data step behaviour (merge followed by If) -- in what universe is this behaviour desirab

Simple rule of thumb: never change a variable that is read from the "1" dataset in a 1:n merge; instead create a new one conditionally and have that replace the original one.

BTW, that is what happens when you use a "case/when/end as" in a SQL select. SQL select never modifies an input variable, it creates new ones (although those may be assigned the same name)

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Frequent Contributor
Posts: 96

Re: Bizarre data step behaviour (merge followed by If) -- in what universe is this behaviour desirab

Hi Kurt,

 

Rule of thumb seems simple enough, thanks.

 

I think I'll just stick to PROC SQL.  There might be other unexpected quirks in the datastep that I never heard about.

 

 

Super User
Posts: 6,946

Re: Bizarre data step behaviour (merge followed by If) -- in what universe is this behaviour desirab

That is why it is so important to read the documentation. The Data Step Concepts are the most important piece to really internalize for any aspiring SAS programmer. Just like the fact that C needs a break statement in a switch.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Ask a Question
Discussion stats
  • 7 replies
  • 149 views
  • 1 like
  • 4 in conversation