I'm using a SAS macro to analyze my simulation data.
I generate 1000 random samples for my simulation. Right now the macro can analyze the samples one by one. It is time consuming. I'm thinking of putting all 1000 random samples together in one data set, indexed by a variable r. I hope I can modify the macro so that it analyzes all the samples together, running through them by the sorted variable r. This is exactly what the BY statement does in many SAS procedures.
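A sketch of that stacked-data-set idea (the data set and variable names here are placeholders, not from the original macro, and PROC MEANS stands in for whatever the real analysis is):

```sas
/* Generate all 1000 samples into one data set, indexed by r */
data all_samples;
   call streaminit(12345);
   do r = 1 to 1000;          /* sample index */
      do i = 1 to 50;         /* 50 observations per sample (assumed size) */
         x = rand('normal');
         output;
      end;
   end;
run;

/* One procedure call processes every sample via BY r */
proc means data=all_samples noprint;
   by r;
   var x;
   output out=results mean=xbar;
run;
```

Because the samples are generated in r order, no extra PROC SORT is needed before the BY statement.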
Before you go too far down this road, some clarifications.
You said: "Right now the macro can analyze the samples one by one. It is time consuming. "
The SAS Macro facility is NOT doing any analyzing. It is generating code -- essentially doing your typing for you. The code that is generated by the macro facility has had all the macro references resolved -- what goes to the SAS compiler to be executed does not have ANY macro references (& or %) at all.
Possibly the reason your macro process is time consuming is that you did not understand this fundamental behavior of the Macro facility when you wrote the first macro program. Or, it could be that your analysis just takes a long time and the macro facility has nothing to do with the time needed -- or maybe a fractional amount of overhead is due to the macro facility being involved in the process of generating code for your analysis.
You say you want to modify the macro so it uses pseudo-BY-group processing. But remember that it is NOT the macro that is doing the processing or the analysis. The macro program will ONLY generate code for you. If the end result, the generated code, is going to be the same with the new process as with your current process, then there's nothing to be gained from pursuing this course of action. What is the underlying process -- a DATA step, a SAS/STAT procedure? How is THAT code going to be different with your new approach?
Your idea may be one possible solution; but first you have to start with a single working SAS program that does NOT use macro variables or macro programs at all. Benchmark that program and see how long it takes on its own. Then slowly introduce macro variables, or put the program inside a macro program, and benchmark again -- on a limited number of groups, not the whole 1000 samples. Careful benchmarking should reveal whether it is the underlying sample generation and analysis or the addition of the macro processing that is adding the extra time.
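One simple way to do that benchmarking, sketched here with assumed data set and variable names, is to turn on FULLSTIMER and restrict the run to a handful of groups with a WHERE= data set option:

```sas
options fullstimer;   /* write real time, CPU time, and memory for every step to the log */

/* Time the plain, non-macro program on a few groups only */
proc means data=all_samples(where=(r <= 10));
   by r;
   var x;
run;
```

Compare the log timings from this run with the same run wrapped in your macro; the difference is the macro overhead.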
I think my real question is how to handle repeated analysis for grouped data in a big data set. That is I want to do the same analysis for each group in the big data set. What I can think of is to subset the first group and analyze it, then subset the second group and analyze it, and so on. The problem with this solution is that subsetting a big data set takes time. It may take more time than the analysis itself.
In contrast, a BY statement in a SAS procedure such as PROC GLM doesn't take this route. It appears to me that once the data are sorted, the subsetting step is no longer necessary in a procedure with a BY statement.
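The two routes look something like this (a sketch with made-up names big, r, y, and x, and PROC GLM as the example procedure):

```sas
/* Subsetting approach: one pass through the big data set for every group */
data group1;
   set big;
   where r = 1;
run;

proc glm data=group1;
   model y = x;
run; quit;
/* ...and so on, 999 more times */

/* BY-statement approach: sort once, then a single procedure call */
proc sort data=big;
   by r;
run;

proc glm data=big;
   by r;
   model y = x;
run; quit;
```

The BY version reads the big data set once per procedure call instead of once per group, which is why the subsetting cost disappears.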
Hi: I think this is one of those situations where more information is needed. Many procedures that use BY group processing treat each BY group as a separate entity. Proc Tabulate does that. And you might think this is a good thing.
Well, sometimes it is a good thing. But if you do use BY group processing with Proc Tabulate, you sometimes get undesired side effects from each BY group being treated as a separate entity. For example, each BY group is 100% of itself, so there's no way to calculate a BY group as a percent of ALL the BY groups. Not a big deal if you're NOT doing percents. But when you are doing percents and you need each BY group NOT to be treated as a separate entity, then you have to use PAGE processing with PROC TABULATE to get a percent of the whole and a percent of each group.
So it comes down to how does YOUR procedure of choice handle BY groups??? Does your use of the BY statement affect the analysis in a desirable way, an undesirable way or with no effect? Do you get the exact same results with a BY statement in your procedure as you do with the subsetting approach? Nobody can answer this question for you. You have the data. You know what procedure you're using. You can make up a test.
Make a test dataset with 2 by groups. Run your subsetting code and your analysis on this test dataset. Print these results. Then make a second program without the subsetting approach, but with a BY statement added to the analytic procedure and run your analysis with BY group processing (on the same test dataset). Print these results. Carefully and methodically compare the results of the 2 different methods. You have the SAME data going into two different versions of the same procedure's analysis. Comparing the results of these 2 versions of the program should be instructive in helping you make your decision.
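As a sketch of that two-group comparison, with PROC GLM standing in for the analytic procedure and made-up names test, grp, y, and x:

```sas
/* Method 1: subset each group and analyze it separately */
proc glm data=test(where=(grp = 1));
   model y = x;
run; quit;

proc glm data=test(where=(grp = 2));
   model y = x;
run; quit;

/* Method 2: one procedure call with a BY statement */
proc sort data=test;
   by grp;
run;

proc glm data=test;
   by grp;
   model y = x;
run; quit;
```

If the printed results from the two methods match for both groups, the BY statement is a safe replacement for the subsetting approach with this procedure.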
You can do all of this without involving SAS Macro facility -- and probably should avoid Macro questions/issues -- in order to make your decision. Using BY group processing or a BY statement really doesn't have anything to do with the SAS Macro facility. The reasons that some folks DO use the Macro facility or a macro program to subset and run each BY group separately are:
1) they want to be able to stop and start or repeat the analysis for a group without rerunning the whole group
2) there are some title/footnote and output file creation reasons for using Macro, as for example when you want the name of the BY group to be used for the name of the analytical report output file.
3) they want to alter some parameters used by the analytical procedure, as when they know they need some procedure option with one by group, but not with another.
4) they're not just running one analytic procedure as part of the analysis -- they are also doing some data manipulation, perhaps some data cleanup and/or some restructuring of the data or some merging -- and it's cleaner to do that processing on every group separately, before it goes into the procedure of choice, so these cleanup and data manipulation steps are repeated for every subset.
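For illustration, a minimal macro loop along those lines -- reasons 1 and 2 are marked in comments; the data set, model, and file names are all assumptions:

```sas
%macro analyze_groups(data=, groups=);
   %do g = 1 %to &groups;
      /* reason 2: per-group title and per-group output file */
      title "Analysis for sample &g";
      ods pdf file="report_sample_&g..pdf";

      /* reason 1: any single group can be rerun on its own
         by calling the macro with groups adjusted */
      proc glm data=&data(where=(r = &g));
         model y = x;
      run; quit;

      ods pdf close;
   %end;
%mend analyze_groups;

%analyze_groups(data=all_samples, groups=1000);
```

Note that the macro still only generates 1000 repetitions of the same PROC GLM code; each iteration rereads the full data set through the WHERE= option.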
It's really hard to comment on your question without knowing how BIG is big and what your procedure of choice for the analysis is. If you have a LOT of code or the code leans heavily toward the ins and outs of a particular statistical procedure, then you might consider contacting Tech Support with the question about whether BY group processing would have an adverse effect on the analysis from your procedure of choice.
As you have read already, analysing a population of data as a series of smaller sub groups usually means you lose the population level analysis. Whether you subset the data, or use by group processing, it is likely you will lose the population analysis. You need to consider whether or not that is a problem.
If I had a table of student results and wanted to analyse them for each class, then I'd sort and index the table by the class identifier. Assuming that my stat procedure allowed by group processing, I'd now be in a position to use a by statement, or a macro to write the code for each analysis accepting a parameter which was the class identifier. Then I would resolve the parameter in a "where" option on the input data set.
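A sketch of both versions of that approach, with made-up names (results, class_id, score) and PROC MEANS standing in for the stat procedure:

```sas
proc sort data=results;
   by class_id;
run;

/* BY-statement version: one call covers every class */
proc means data=results;
   by class_id;
   var score;
run;

/* Macro version: the class identifier parameter is
   resolved in a WHERE= option on the input data set */
%macro class_report(class=);
   title "Results for class &class";
   proc means data=results(where=(class_id = "&class"));
      var score;
   run;
%mend class_report;

%class_report(class=3A);
```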
I do think that subsetting the data first is often a bad idea. If you have an identifier, then you should use it and keep the data together. "Where" and "By" statements are powerful allies in getting segmented analyses.