I have a SAS program (multiple macros deep) that has:
1. Several input datasets with several thousands of fields, all being read in.
2. Few output flat files with hundreds of fields
The issue is that the intermediate datasets are so large it is causing space issues and the performance is really bad. I want to identify only those fields on the input side that are essential, and use a keep statement.
Besides manual tracing, is there any tool or method that I should use? We have sas (r) 9.1.3 on aix 5.3
I appreciate any guidance
Biplob
The below code creates TABLE_A with the records that produces rows that are part of the first query only.
PROC SQL;
CREATE TABLE TABLE_A AS
SELECT * FROM TABLE1
EXCEPT
SELECT * FROM TABLE 2;
QUIT;
What happens between initial input and final output. Obviously, you have to keep the fields that are on the final output's keep statement, but what other processes/datasteps/sql code/procs are used during the process?
Such tasks can either be trivial or extremely difficult. Depends upon your code.
The below code creates TABLE_A with the records that produces rows that are part of the first query only.
PROC SQL;
CREATE TABLE TABLE_A AS
SELECT * FROM TABLE1
EXCEPT
SELECT * FROM TABLE 2;
QUIT;
Biplob,
The first thing to consider is whether you want to improve disk space usage, performance, or both. I suspect it's "both", since macros tend to be used repeatedly.
The disk space issue can probably be solved by creating views instead of data sets. Both data steps and SQL can create views, and it is possible to use a view as input and another view as output of the same step. Without seeing the code (and I'm not really asking for that), it's hard to be more specific.
Performance will improve by adding KEEP=, but there is no shortcut. You have to manually work through the code to figure out what variables are needed when. You could start with the variables in the final output, but other variables might be needed as the program begins. For example, variables might be used to subset observations, or to make calculations, but might not be needed after that point. The only tool that can figure this out is the human brain.
One style that I like is to add to the outermost macro a set of %LET statements:
%let keeplist1 = a long list of variables;
%let keeplist2 = a different list;
Then refer to &KEEPLIST1 and &KEEPLIST2 in later code. This makes the programming easier to read, update, and debug.
Note that there is a difference between KEEP= on the SET statement and KEEP= on the DATA statement. The first limits what you read in, and the second limits what you save. You might find steps that use KEEP= on both, with different sets of variables.
To test your results, I suggest you run on either small data sets or on just one input data set using your current set of macros. Save the result (preferably as a SAS data set). Then after modifying the code, use PROC COMPARE to see if the new output differs from the old.
Good luck.
Thanks to everyone who responded. I now know what to do.
This is the first time I used a forum for help and am very impressed by the depth and promptness of the answers.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
