04-11-2016 01:34 PM
I have a general question about how to structure my SAS program. It goes like this:
First, I import the original data and do some data cleaning and transformation to get data A, then continue with some analysis to get data B. Before going further, I realized I need data X, which is generated from data A. So I write code to generate data X, which I can then combine with data B to continue...
So the process of generating X is like a branch. I am wondering whether it is better to put this branch in a separate SAS file and call it when needed, or just leave it in the main code. I am afraid that if I need to generate too many branches, they would interfere with the main code.
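To make the flow concrete, here is a minimal sketch of the main line and the branch described above. It assumes `sashelp.class` as a stand-in for the raw data, and the dataset and variable names (`a`, `b`, `x`, `ratio`, `mean_height`) are placeholders for illustration only.

```sas
/* Main line: import and clean to get A (sashelp.class stands in for the raw data) */
data a;
  set sashelp.class;
  where age >= 12;
run;

/* Main line: analysis on A to get B (placeholder derivation) */
data b;
  set a;
  ratio = weight / height;
run;

/* Branch: X is derived from A (here, mean height per sex) */
proc means data=a noprint nway;
  class sex;
  var height;
  output out=x(drop=_type_ _freq_) mean=mean_height;
run;

/* Back on the main line: combine B with X */
proc sort data=b;
  by sex;
run;

data b_with_x;
  merge b(in=in_b) x;
  by sex;
  if in_b;   /* keep only rows that exist in B */
run;
```

The step that builds `x` is the only part that belongs to the branch, so it is the piece that could move to its own file and be pulled in with `%include` if the main program gets crowded.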
04-11-2016 02:10 PM
Details, details, details.
Some of this is style choice. I tend to separate the read, clean and initial recoding steps into separate program files, partly to keep the amount of code readable but also because re-reading large files is inefficient. I also prefer to have analysis steps segregated, possibly by type of analysis (simple summaries together, regressions together, other statistical tests together). But the scope of the project may not require all of that.
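One way to avoid re-reading a large file is to have the read/clean program store its result in a permanent library, so later programs start from the stored data set. The sketch below is only an illustration: the libref `proj` and the data set name `clean` are hypothetical, and the library points at the WORK folder purely so the code runs as written.

```sas
/* Hypothetical location: replace with a real project folder.
   The WORK path is used here only so the sketch runs as written. */
libname proj "%sysfunc(pathname(work))";

/* Program 1: read and clean once, then store the result */
data proj.clean;
  set sashelp.class;          /* stand-in for a slow import of a large file */
  if not missing(height);     /* placeholder cleaning rule */
run;

/* Program 2 (would live in a separate file and repeat the LIBNAME statement):
   start from the stored data, with no re-read of the raw file */
proc means data=proj.clean n mean;
  var height weight;
run;
```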
I also try to minimize the number of source datasets. I can't tell whether you are creating separate data files A and X where X may be A plus some variables. If that is the case, I tend to go back to the earlier code that generated a data set such as A and add the additional recoded variables or whatever is needed, then rerun the previous analysis to ensure I haven't changed things inadvertently before proceeding to the steps that needed the additional variables. I find this tends to minimize the "oops, I used the wrong data set" mistakes.
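A rough sketch of that approach, assuming X really is A plus a derived variable: add the variable in the step that creates A, then confirm the original variables are unchanged. `PROC COMPARE` is one possible check, not necessarily the one meant above; `a_before` and `ratio` are hypothetical names.

```sas
/* Snapshot of A as it was before the change (hypothetical earlier version) */
data a_before;
  set sashelp.class;
run;

/* Revised step that creates A, with the extra variable added at the source */
data a;
  set sashelp.class;
  ratio = weight / height;
run;

/* Check that the variables the earlier analysis relied on are unchanged */
proc compare base=a_before compare=a(drop=ratio);
run;
```

If the comparison reports no unequal values, the earlier analysis can be rerun against the revised `a` with some confidence that only the new variable differs.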
If you have lots of dependencies, then 1) document, 2) take care naming things, 3) test everything and 4) document (at this point you'll likely need to add to the previous documentation).
04-11-2016 02:32 PM
A SAS program is always a series of DATA and PROC steps. But the steps need not be saved as one huge program. You could create a series of programs, such as:
You would just run them in the order indicated by the names.
Whatever works, and makes it clear for you to understand where to find the code.
If you need to re-use some of the code for many different incoming data sets, that becomes a different question with a different answer.
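For that re-use case, one common option is a macro that takes the incoming data set as a parameter. This is only a sketch under assumed names: the macro `summarize_var` and its parameters `in=`, `out=` and `var=` are invented for illustration.

```sas
/* Hypothetical reusable step: summarize one variable from any input data set */
%macro summarize_var(in=, out=, var=);
  proc means data=&in noprint;
    var &var;
    output out=&out(drop=_type_ _freq_) n=n mean=mean;
  run;
%mend summarize_var;

/* The same code applied to two different incoming data sets */
%summarize_var(in=sashelp.class, out=class_summary, var=height);
%summarize_var(in=sashelp.cars,  out=cars_summary,  var=mpg_city);
```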
04-11-2016 04:20 PM - edited 04-11-2016 04:20 PM
An interesting question.
I find that as your SAS programs get larger, it becomes quite natural and logical to split them based on common functionality. I also use a number at the start of each program name to indicate the running order, as already suggested by @Astounding.
With the larger SAS applications I work with, I also use the concept of levels as used in data warehousing. For example, all 100-level programs relate to reading and sourcing data, 200-level programs primarily transform and derive data, and 300-level programs primarily prepare data for analysis and reporting.
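A numbered layout like this pairs naturally with a short driver program that runs each file in order using `%include`. The sketch below is an assumption-laden illustration: the file names are hypothetical, and the three tiny programs are written into the WORK folder only so that the example runs as written. In a real project the files would already exist in the project folder.

```sas
/* Folder holding the programs. WORK is used only so the sketch runs as written. */
%let root = %sysfunc(pathname(work));

/* Stand-ins for three real program files (names are hypothetical) */
data _null_;
  file "&root/100_read.sas";
  put 'data raw; set sashelp.class; run;';
run;

data _null_;
  file "&root/200_derive.sas";
  put 'data derived; set raw; ratio = weight / height; run;';
run;

data _null_;
  file "&root/300_report.sas";
  put 'proc means data=derived n mean; var ratio; run;';
run;

/* Driver: run the levels in numeric order */
%include "&root/100_read.sas";
%include "&root/200_derive.sas";
%include "&root/300_report.sas";
```

Only the three `%include` lines are the driver itself. Leaving gaps in the numbering (100, 200, 300) makes it easy to slot in a later program, such as a hypothetical `150_...`, without renaming the rest.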