Quick Question,
I am working with code someone else wrote. The programs have multiple data steps, and the previous coder used the following coding scheme. My question is, why did this person use "DATA &copy; SET &copy;", essentially setting the new dataset the same as the old each time?
Standard convention, as I was taught, is to use a new name for each data step. So I don't know if this is just a more advanced way to code, or if there are potential benefits, drawbacks, etc. to this methodology. I can think of some problems that could occur...
The file is for cleaning data, so it is broken up into dozens of data steps with procs for testing in between - if that helps.
* Create copy of dataset ;
DATA data_copy; SET IN.data; RUN;

* Set new SAS dataset name as macro variable ;
%LET copy = work.data_copy;

DATA &copy;
  SET &copy;
  <code here>;
RUN;

DATA &copy;
  SET &copy;
  <code here>;
RUN;
My question is, why did this person use "DATA &copy; SET &copy;", essentially setting the new dataset the same as the old each time?
I think this is poor practice, something beginners do because they are unfamiliar with the ramifications of doing so. I certainly would not advise this.
But the whole thing, as written, is poorly done. Here is the code as written
DATA data_copy; SET IN.data; RUN;

* Set new SAS dataset name as macro variable ;
%LET copy = work.data_copy;

DATA &copy;
  SET &copy;
  <code here>;
RUN;

DATA &copy;
  SET &copy;
  <code here>;
RUN;
and here is a better way, if you ask me
* Set new SAS dataset name as macro variable ;
%LET copy = work.data_copy;
DATA &copy;
  SET in.data;
  <code here>;
  <code here>;
RUN;
Where <code here> appears twice, I have taken the original code with <code here> in two different places and put all that code into one data step. Maybe you don't even need the macro variable, but maybe you do.
So the original code passes through the data 3 times; my re-written code passes through the data 1 time.
They did not want multiple copies of the dataset.
They wanted to treat the series of steps as a single logical block. DATA_COPY is created from IN.DATA by a series of steps that we can just consider a single "step" in our data flow.
Whether that is good or bad depends on a ton of other factors you have not disclosed. How large is the data? How well tested is this code? Has it been fully debugged and tested? Is there any value in examining any of the intermediate steps later? Etc., etc.
This isn't a finished, cleaned, stock program for routine processing. I suspect this was done because the programmer was testing the code/variables in pieces. The raw dataset is 20K records by 300 or so variables. As each variable or set of variables was cleaned and tested, they created a new data step to clean and test the next set. There are about a dozen data steps like this, followed by proc freqs/means/prints, working through a handful of variables at a time. Overall, the program is about 2K lines of code. There is value in re-visiting the blocks of code as I'm still cleaning and analyzing variables. There are still dozens of variables to go. Of course, when I find a mistake in one of the data steps and need to compare between the previous and current dataset, I have to reset the data from the beginning and start over (rare).
I'm mostly just trying to determine if I need to take the time to restructure the code. I understand why it was done in blocks. It's far easier to read, troubleshoot, and debug small sets of code/variables instead of running one giant data step and then having to go back and forth between the data step and the output code. But I've had other programmers tell me that block coding like this is poor practice and that I should restructure it or break it out into separate programs.
I don't know. I'm just trying to make sure I don't continue something that is poor practice or could cause a problem; however, if there is a good reason for it, I don't want to spend time fussing with it.
Do you have any idea why they felt the need to use a macro variable to store the dataset name? The reason for that might explain why they used the same name over and over. If they weren't sure when they were going to be done adding steps they might have just found it easier to keep generating the dataset with the target name so that they didn't have to then add a final step to copy it once more (or rename it) just to make sure it used the correct final name.
If you are going to repeat that type of data cleaning (and the data is small like you say) then using different names for each step might help you in debugging. If you are running interactively, or even just the pseudo interactivity of SAS/Studio or Enterprise Guide, then you can adjust the code for individual cleaning steps and test it without having to re-run the preceding steps.
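As a rough sketch of that numbered-name approach (the step names and variable groupings here are hypothetical, not from the original program):

```
* Sketch: one numbered dataset per cleaning step, so any single step ;
* can be edited and re-run against its predecessor's output without ;
* re-running everything from IN.DATA ;
DATA clean1;
  SET in.data;
  /* clean the first group of variables */
RUN;

DATA clean2;
  SET clean1;
  /* clean the next group of variables */
RUN;
```

With 20K records this costs almost nothing in WORK space, and when a PROC FREQ turns up a mistake in step 2, you still have CLEAN1 sitting there to compare against.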
There is no "one size fits all" dataset naming convention.
If you want to see all intermediate results in a program, then create a new dataset in each DATA step.
If you want to conserve disk space in your SAS libraries and you don't need to see intermediate results, then keeping the current name is fine.
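You can also get the best of both: keep numbered intermediates while you debug, then reclaim the space once the final dataset is verified. A hedged sketch (the dataset names are placeholders):

```
* Sketch: delete intermediate WORK datasets after the final result ;
* has been checked, keeping only the finished dataset ;
PROC DATASETS LIBRARY=work NOLIST;
  DELETE clean1 clean2 clean3;
QUIT;
```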
The code fragment "DATA &copy; SET &copy;" is common in SAS programming practice.
After all, programming is both an art and a science. Programmers develop their own style based on their experiences and needs.
There is one thing that I would like to say: for a programmer, timely delivery of working code is far more important than writing the "perfect code" that takes a few more days to complete. Those few more days cost the company more than a few extra milliseconds of execution time.
Therefore we often see code with scope for improvement.
Hi,
I would say this is bad/ugly/scary code, for the reasons you mention. Of course it's not unusual to inherit such code.
When you inherit such code, you need to balance the risks of leaving it as is, the benefits of updating it, and the costs (time) it will take to update it. Sometimes you're lucky and time will allow you to address such 'technical debt' immediately. Other times if the code 'works' you have to leave it as is, and hope to find time down the road to refactor the code.
If this is as simple as you showed, I would probably refactor it right away. Because of the way I work, the idea of reading in and overwriting the same dataset multiple times in series would make it VERY hard for me to debug a program. If you ran this program interactively, there wouldn't be a good way to know which version of work.data_copy you have at any time. So just for my own sanity, the cost of adding a numeric suffix to each dataset seems low.
But also, if you're working in a group with other programmers, you might want to show the code around, to see if other people have an explanation, and see if they agree that time spent refactoring this code would be well-spent.
--Q.