BookmarkSubscribeRSS Feed
morglum
Quartz | Level 8

Hi everyone,

 

I saw that example at the SAS User Group conference in Quebec City last week.  They were adamant that this was normal behaviour and could be explained by how the program data vector works.

 

Look at the last row of TEST3.  "Joe Demoted" was a professional in 2016, yet his name has been removed.

 

Workarounds include doing 2 different data steps (one for merging and one for masking names) or just using PROC SQL.

 

This is super dangerous behaviour and  I'll just stick to PROC SQL for merges as I can't trust myself to remember this trap, but how could this behaviour be desirable?  

 

Thanks

 

 

data jobs;
     emp_id = 1;     year=2015; cat_emp="Professionnal"; output;
     emp_id = 1;     year=2016; cat_emp="Manager"; output;
     emp_id = 2;     year=2015; cat_emp="Professionnal"; output;
     emp_id = 2;     year=2016; cat_emp="Professionnal"; output;
     emp_id = 3;     year=2015; cat_emp="Manager"; output;
     emp_id = 3;     year=2016; cat_emp="Manager"; output;
     emp_id = 4;     year=2015; cat_emp="Manager"; output;
     emp_id = 4;     year=2016; cat_emp="Professionnal"; output;
run;
 
data employees;
     length name  $20.;
     emp_id = 1; name="Joe Promoted"; output;
     emp_id = 2; name="Joe Professionnal"; output;
     emp_id = 3; name="Joe Manager"; output;
     emp_id = 4; name="Joe Demoted"; output;
run;
 
data test3;
     merge jobs(in=a) employees;
     by emp_id;
     if cat_emp="Manager" then name="";
     nn = _n_;
run;

 

test3.png

jobs.png

emp.png

 

7 REPLIES 7
Astounding
PROC Star

Well, it is normal behavior.  Some of the principles involved:

 

  • Each incoming observation gets read exactly once.
  • "Gets read" means copied from the incoming data set to the PDV (sort of, there are intermediate buffers along the way)
  • Calculations in the DATA step (such as IF/THEN are carried on values in the PDV)
  • Outputting an observation means copying the values from the PDV to the output data set (again, there are intermediate buffers along the way)

The first bullet point is the key one here.  In a many-to-one merge, the "one" observation gets read only once.  The variables are retained as more observations are read from the "many" data set.  But if you change a value that came from the "one" data set, it stays changed.  The changed value then exists in the PDV and is retained.  The original value is never re-read.

 

There are simpler solutions than what was presented to you.  For example:

 

data test3;

merge jobs (in=a) employees;

by emp_id;

if cat_emp ne "Manager" then newname = name;

nn = _n_;

drop name;

rename newname=name;

run;

 

It's too late at night for me to think of an example where this behavior is useful, but I might come up with one.

ChrisHemedinger
Community Manager

This is a great use case for the DATA step debugger in SAS Enterprise Guide.  Stepping through that last DATA step, you can see that the "name" variable value is retained for the last two records, and the logic clears the value when the first one is processed.

 

dsd.png

 

One workaround (I think) that keeps it to one step is don't replace that name value, but instead designate a different "output name" value.

 

data test3;
     merge jobs(in=a) employees;
	 length out_name $ 20;
     by emp_id;
     if cat_emp="Manager" then out_name=""; else out_name=name;
     nn = _n_;
     drop name;
run;

 

And @morglum, as to your question "why is it this way?" -- think of it this way.  You have SAS variables that serve two purposes: one for driving program logic (conditionals and so on) and one for building the records you want in output.  In the original program, you're mixing the two roles and the inline modification of the "name" variable affects its utility as an output.  So the solution is to separate those roles, and keep one variable for logic, and one purely for output. 

 

Chris

It's time to register for SAS Innovate! Join your SAS user peers in Las Vegas on April 16-19 2024.
morglum
Quartz | Level 8

Thanks Chris for this answer.   I'll use that explanation along the two others and will use different variables for input and output.

 

I also wasnt aware of the data step debugger, which is interesting.

 

Best

 

morglum
Quartz | Level 8

Thanks a lot Astouding for the detailed explanation of the mechanics.  It is now a lot clearer for me on "how" this is happening.  What I still need help on is the "why" you would want this behaviour to occur.  I've been using SAS for a decade and it's the first time I've heard of it. I 'm pretty sure I have tons of invalid merges in my programs...

 

Your solution was also presented, maybe I should also have mentionned it.  I don't like it that much, as coming up with temporary names can be an hassle in a case where there are more than 1 variable that you want to work one.

 

Kurt_Bremser
Super User

Simple rule of thumb: never change a variable that is read from the "1" dataset in a 1:n merge; instead create a new one conditionally and have that replace the original one.

BTW, that is what happens when you use a "case/when/end as" in a SQL select. SQL select never modifies an input variable, it creates new ones (although those may be assigned the same name)

morglum
Quartz | Level 8

Hi Kurt,

 

Rule of thumb seems simple enough, thanks.

 

I think I'll just stick to PROC SQL.  There might be other unexpected quirks in the datastep that I never heard about.

 

 

Kurt_Bremser
Super User

That is why it is so important to read the documentation. The Data Step Concepts are the most important piece to really internalize for any aspiring SAS programmer. Just like the fact that C needs a break statement in a switch.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1195 views
  • 1 like
  • 4 in conversation