OK. So far, it seems the problem has been solved. Thanks to @Rick_SAS, @Reeza, @FreelanceReinh, and @ballardw for your help and contributions to this post.
Let me summarize how it was solved. For simplicity, I will just use J and I rather than the macro variable names in my code.
My goal was to create J=1000 * I=1000 samples of a given size, e.g., 50, and run them through the analysis you can see in my code. Following @Rick_SAS's book, I created the entire data set of 1000 * 1000 samples and analyzed it with a BY statement. It ran for a week without finishing (showing "TTEST running"), so I terminated it.
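Roughly, the BY-group setup looked like the following sketch (the names SampleSize, NumSamples, SampleID, grp and the two-group t-test are simplified placeholders, not my actual macro variables or analysis):

%let SampleSize = 50;      /* observations per sample */
%let NumSamples = 1000;    /* samples in one block    */

data Sim;
   call streaminit(123);
   do SampleID = 1 to &NumSamples;
      do i = 1 to &SampleSize;
         grp = (i > &SampleSize/2);            /* two groups          */
         x = rand("Normal", 0, 1) + grp*0.5;   /* shifted group means */
         output;
      end;
   end;
run;

/* one procedure call analyzes all samples via BY processing */
ods graphics off;  ods exclude all;  ods noresults;
proc ttest data=Sim;
   by SampleID;
   class grp;
   var x;
   ods output Statistics=Stats;   /* keep only the summary table */
run;
ods exclude none;  ods results;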
At that point I suspected the data set was too large to be processed as a whole, so I cut it down into 10 data sets of J=100 * I=1000 each (1/10 of the original). Still, even 1/10 could not be processed in days.
So I posted this question about TTEST. At the same time, I looked for another procedure to replace TTEST (because I only need the mean difference, not the test result) and finally settled on PROC SQL. I further cut the data set down to J=10 * I=1000 (1/100 of the original). That ran, but, as I described earlier, it got increasingly slower and finally filled up the disk.
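The SQL replacement was essentially of this form (same placeholder names as in the sketch above; only the per-sample mean difference is computed):

proc sql;
   create table MeanDiff as
   select SampleID,
          mean(case when grp=1 then x end)
        - mean(case when grp=0 then x end) as diff
   from Sim
   group by SampleID;
quit;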
Then I tried cleaning the Work library after each procedure, which helped little.
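The cleanup step was just something like this (member names are placeholders):

proc datasets library=work nolist;
   delete Sim Stats MeanDiff;   /* drop intermediate tables that are no longer needed */
quit;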
Until @Reeza suggested that I look for huge files on my disk, I did not realize that even 1/100 of the original data set could produce gigantic output data sets. When I looked, I found 15 data sets of over 7 GB each (products of the parallel macros, kept for later use) and one of 14 GB (data set a from PROC LOGISTIC). I deleted them and the disk space came back.
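For anyone hitting the same problem: in recent SAS releases you can also spot oversized data sets from within SAS by querying DICTIONARY.TABLES (FILESIZE is in bytes); WORK is just an example library here:

proc sql;
   select libname, memname, filesize format=comma20.
   from dictionary.tables
   where libname = 'WORK'
   order by filesize desc;
quit;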
So, the sizes of the data sets could be the reason why TTEST was blocked: TTEST here was run on data=a, which was 14 GB, so there was no way it could finish. SQL got through, but later procedures were blocked because of the fifteen 7 GB data sets. That is when I realized even 1/100 was still too large. Inspired by @FreelanceReinh's finding that doubling the data set size may increase the processing time by a multiple, I did the following trial runs (I=1000 for all):
computer 1: J=5, time: > 2 hours
computer 2: J=2, time: 7 minutes
computer 3: J=1, time: 1.5 minutes
So, finally, I started over with all the analysis procedures called on each J separately (so only 1000 samples in each analyzed data set), and things have run well so far. We can tentatively conclude that the BY statement does not cope with very large data sets. However, @Rick_SAS is correct in encouraging the use of BY, because even with only one J at a time this is almost 500 times faster than calling the analysis for each individual sample.
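Schematically, the final setup looks like this hybrid sketch (placeholder names and a simplified analysis; my real program calls more procedures per block):

%macro RunBlocks(J=1000, NumSamples=1000, SampleSize=50);
   %do j = 1 %to &J;
      /* 1. simulate one block of &NumSamples samples */
      data Sim;
         call streaminit(&j);
         do SampleID = 1 to &NumSamples;
            do i = 1 to &SampleSize;
               grp = (i > &SampleSize/2);
               x = rand("Normal", 0, 1) + grp*0.5;
               output;
            end;
         end;
      run;

      /* 2. analyze the whole block with one BY-group (here: SQL) step */
      proc sql;
         create table MeanDiff as
         select &j as j, SampleID,
                mean(case when grp=1 then x end)
              - mean(case when grp=0 then x end) as diff
         from Sim
         group by SampleID;
      quit;

      /* 3. keep only the small result table, then clean up the block */
      proc append base=AllDiffs data=MeanDiff;
      run;
      proc datasets library=work nolist;
         delete Sim MeanDiff;
      quit;
   %end;
%mend RunBlocks;

%RunBlocks(J=1000);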
Glad you resolved the issues! Good example of listening to lots of advice and then performing some detective work.
Since you have a copy of my book Simulating Data with SAS, you might want to read Section 6.4.5, "When macros are useful" (p. 101). In that section, I discuss the issue of extremely large data sets. I say "For huge simulation studies ...it makes sense to run the simulation as a series of smaller sub-studies." Also see section 4.6.7, "Tips for Shortening Simulation Times."
In the blog post, "Simulation in SAS: The slow way or the BY way," there are several comments that address huge data sets. See the comments and my responses to "Geoffrey", "Doc", and "Kat." The size of the "block of data" on which to have SAS operate depends on the parameters of your computer system and the analysis that you are performing. Your conclusions mimic the "hybrid method" proposed by Novikov and Oberman (2007), which combines the macro and BY-group approaches when doing massive simulations.
Thank you so much for following up. Believe it or not, I actually read this chapter when the problem occurred. However, the point is that I never thought my simulation was a HUGE project :). Novikov and Oberman (2007) mentioned millions of samples. I have exactly 1 million, so I thought it would not be big for SAS, until I saw how large the intermediate data sets hiding on my computer were, given that they resulted from only 1/100 of the whole data. I also didn't think processing large blocks could be so much "worse than running K analyses" that the analysis could never finish. And they suggest K range from 20 to 50 ---- I was using 100. In fact, K=1000 is working well. Now I know my simulation really is huge :)
Today I have reported the performance issues to SAS Technical Support and will update this post once they respond.
Subject: Performance issues with PROC TTEST when processing many BY groups
Summary:
Update 2018-09-11: Tech Support and I are still working on the problem. My most important finding so far is that the problem does not occur when the program is run in batch mode. (@wang267: This might help in your case, too.) The ever-growing SAS Item Store sastmp-000000002.sas7bitm (which I suspect has to do with ODS) does not even exist there.
Update 2018-09-13: Problem solved! To avoid the existence of sastmp-000000002.sas7bitm and hence its disastrous consequences for PROC TTEST run times (and possibly disk space), the checkbox "Use ODS Graphics", found under Tools --> Options --> Preferences... --> Results, must already be deselected when the SAS session starts. In other words, this setting should be stored in the SAS registry. Deselecting the checkbox while the session is open doesn't help, because sastmp-000000002.sas7bitm continues to exist.
(@wang267: Can you check this setting in your Preferences?)
More details to follow tomorrow (CEST time zone) in a separate post.
To follow up the announcement in my previous post:
The solution to all four issues in my case was either to run the program in batch mode or to start the SAS session with the "Use ODS Graphics" checkbox deselected (i.e., with that setting stored in the SAS registry), as described in my previous post.
Interestingly, the SAS session reverts to the "bad behavior" (long run time, growing .sas7bitm file) once "Use ODS Graphics" is selected and confirmed in the Preferences window, whereas deselecting "Use ODS Graphics" (in the current session) does not stop that behavior.
Even worse, the "bad behavior" is also triggered by submitting "ods graphics on;", regardless of a subsequent "ods graphics off;"! That means the test program as a whole (which includes "ods graphics on;" at the end) should not be run twice.
An important insight (revealed by a hex editor after disabling the WORKTERM system option) was that the grown .sas7bitm file contained text snippets such as "The TTEST Procedure", "TEMPLATE", "TITLE" and BY variable names and values. This was strong evidence that some kind of ODS output had still been produced in spite of the ODS statements.
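For reference, the ODS statements I am referring to follow the usual suppression pattern for simulation loops, roughly (shown schematically, not the exact test program):

ods graphics off;   /* no ODS graphics                           */
ods exclude all;    /* do not render any output objects          */
ods noresults;      /* do not track output in the Results window */

/* ... BY-group procedure calls, with ODS OUTPUT capturing only the needed tables ... */

ods exclude none;
ods results;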
SAS Technical Support (Heidelberg, Germany) was nice and responsive, but not very helpful. In their sixth message -- the response to my solution -- they described how differently their SAS 9.4M3 on Windows 10 behaves:
They have "Use ODS Graphics" always activated. On each run of the test program the existing small .sas7bitm file (96 KB) is replaced by a new one (e.g. sastmp-000000042.sas7bitm by sastmp-000000043.sas7bitm), whose size is again 96 KB and which doesn't grow (run time 3 - 4 s).
The reasons for these differences remain unclear.
Thanks for the update on the Tech Support response. Yes, the ODS option was selected at the time of my runs. However, I wonder what the difference is between deselecting it and using the "ods graphics off" statement.
I mentioned in my previous post that the C: drive got filled even after the TTEST procedure was replaced with SQL. So, it may not be a problem for TTEST only but for all/many procedures in general.
@wang267 wrote:
I wonder what the difference is between deselecting it and using the "ods graphics off" statement.
It seems that in our SAS installations -- for some strange reason -- switching ODS graphics off doesn't work properly when it's done in an open SAS session. Both methods (deselecting the checkbox in the Preferences and the "ods graphics off" statement) failed to stop the .sas7bitm file from being cluttered up with unwanted ODS data, unless ODS graphics were switched off from the very beginning (i.e. via SAS registry setting) and never switched on in the current session.
@wang267 wrote:
I mentioned in my previous post that the C: drive got filled even after the TTEST procedure was replaced with SQL. So, it may not be a problem for TTEST only but for all/many procedures in general.
I think these disk space issues had other causes (possibly the large number of big data sets, etc.).
The SAS Tech Support representative wrote that she didn't understand why only PROC TTEST should be so slow on my computer. Indeed, I assume that other procedures which produce ODS graphics (e.g. PROC REG) are likely affected by the same issue, but I haven't tested this yet.* Luckily, a solution has been found and I hope it will work on your system as well.
* Edit: The results of a quick check suggest that PROC REG is not affected.
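A sketch of such a check (placeholder names; not the exact program): run a BY-group PROC REG with the same ODS suppression as in the TTEST tests and watch whether a .sas7bitm file in WORK grows.

ods exclude all;  ods noresults;
proc reg data=Sim;
   by SampleID;
   model x = grp;
   ods output ParameterEstimates=PE;   /* keep only the estimates table */
run; quit;
ods exclude none;  ods results;

%put Inspect this folder for a growing .sas7bitm file: %sysfunc(pathname(work));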
Finally, the most recent suggestion of SAS Technical Support (17 September) has brought the best solution in my case:
Earlier today I updated my SAS installation from TS1M2 to TS1M5, and the problem no longer occurs.
The Tech Support rep used TS1M3, which worked for her as well.
@wang267: I guess it was that sastmp-000000002.sas7bitm file which kept growing with every PROC TTEST call (as it did in my repeated test runs) until the WORK library and hence the C: drive was full. (This is also what the error message from your third post had suggested.) Can you see this file?
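To locate it, you can print the WORK directory path from within SAS and then look for the sastmp-*.sas7bitm file there:

%put WORK directory: %sysfunc(pathname(work));
proc datasets library=work memtype=all;   /* item stores are listed along with the data sets */
quit;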
Another test run of Rick's program (only the TTEST step), now with NumSamples=2000 (i.e. doubled), took 13:14 minutes on my workstation (after reboot). So, doubling the number of BY groups increased run-time by a factor of 5.3, similar to the factor of 4.7 observed between NumSamples=1000 vs. 500. Now, run-times of several hours or even days with larger input files and more BY groups seem more and more plausible to me. This could be a case for Tech Support.
(I have also checked the Windows Event Log -- no findings -- and temporarily disabled anti-virus software -- no improvement.)
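(For reference, run times like these can also be captured programmatically instead of reading them from the log; sketched here with the same placeholder TTEST setup as above, not Rick's actual program:)

%let t0 = %sysfunc(datetime());

proc ttest data=Sim;
   by SampleID;
   class grp;
   var x;
   ods output Statistics=Stats;
run;

%put NOTE: TTEST step took %sysevalf(%sysfunc(datetime()) - &t0) seconds.;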