Help using Base SAS procedures

Comparing Input/Output Datasets

Accepted Solution Solved
Reply
New Contributor
Posts: 2
Accepted Solution

Comparing Input/Output Datasets

I have a SAS program (multiple macros deep) that has:

1. Several input datasets with several thousands of fields, all being read in.

2. Few output flat files with hundreds of fields

The issue is that the intermediate datasets are so large it is causing space issues and the performance is really bad. I want to identify only those fields on the input side that are essential, and use a keep statement.

Besides manual tracing, is there any tool or method that I should use? We have sas (r) 9.1.3 on aix 5.3

I appreciate any guidance

Biplob


Accepted Solutions
Solution
‎02-02-2012 10:30 AM
Regular Contributor
Posts: 233

Comparing Input/Output Datasets

The below code creates TABLE_A with the records that produces rows that are part of the first query only.

PROC SQL;

CREATE TABLE TABLE_A AS

SELECT * FROM TABLE1

EXCEPT

SELECT * FROM TABLE 2;

QUIT;

View solution in original post


All Replies
PROC Star
Posts: 7,468

Comparing Input/Output Datasets

What happens between initial input and final output.  Obviously, you have to keep the fields that are on the final output's keep statement, but what other processes/datasteps/sql code/procs are used during the process?

Such tasks can either be trivial or extremely difficult.  Depends upon your code.

Solution
‎02-02-2012 10:30 AM
Regular Contributor
Posts: 233

Comparing Input/Output Datasets

The below code creates TABLE_A with the records that produces rows that are part of the first query only.

PROC SQL;

CREATE TABLE TABLE_A AS

SELECT * FROM TABLE1

EXCEPT

SELECT * FROM TABLE 2;

QUIT;

Super User
Posts: 5,498

Comparing Input/Output Datasets

Biplob,

The first thing to consider is whether you want to improve disk space usage, performance, or both.  I suspect it's "both", since macros tend to be used repeatedly.

The disk space issue can probably be solved by creating views instead of data sets.  Both data steps and SQL can create views, and it is possible to use a view as input and another view as output of the same step.  Without seeing the code (and I'm not really asking for that), it's hard to be more specific.

Performance will improve by adding KEEP=, but there is no shortcut.  You have to manually work through the code to figure out what variables are needed when.  You could start with the variables in the final output, but other variables might be needed as the program begins.  For example, variables might be used to subset observations, or to make calculations, but might not be needed after that point.  The only tool that can figure this out is the human brain.

One style that I like is to add to the outermost macro a set of %LET statements:

%let keeplist1 = a long list of variables;

%let keeplist2 = a different list;

Then refer to &KEEPLIST1 and &KEEPLIST2 in later code.  This makes the programming easier to read, update, and debug.

Note that there is a difference between KEEP= on the SET statement and KEEP= on the DATA statement.  The first limits what you read in, and the second limits what you save.  You might find steps that use KEEP= on both, with different sets of variables.

To test your results, I suggest you run on either small data sets or on just one input data set using your current set of macros.  Save the result (preferably as a SAS data set).  Then after modifying the code, use PROC COMPARE to see if the new output differs from the old.

Good luck.

New Contributor
Posts: 2

Comparing Input/Output Datasets

Thanks to everyone who responded. I now know what to do.

This is the first time I used a forum for help and am very impressed by the depth and promptness of the answers.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 161 views
  • 6 likes
  • 4 in conversation