wang267
Obsidian | Level 7

Hi everyone,

Have you ever found that PROC TTEST becomes very slow on a large data set (e.g. 1000 ID1 * 1000 ID2 observations) with a BY statement? It ran for a week and still did not finish. So I cut the data set into smaller ones (i.e. 10 ID1 * 1000 ID2, and 100 ID1 * 100 ID2). However, PROC TTEST took at least 3 hours to run on each smaller data set. Meanwhile, other procedures (e.g. PROC LOGISTIC, PROC SORT, ...) on the same data sets all finish in a second. Does anyone have an idea why TTEST is so slow?

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Press Windows + F to bring up the Windows Search.

In the search, type in Size:Gigantic and see what pops up. Sadly their sizes are pretty small. 

 

https://www.dummies.com/computers/pcs/how-to-locate-large-files-on-a-hard-drive-using-windows-7/


39 REPLIES
Reeza
Super User

Seems too slow. 

Please post your code and the log from a 3 hour run. 

 

 

wang267
Obsidian | Level 7

Thanks for replying. The program is running right now, so I will post some code tomorrow when it finishes its current TTEST. The TTEST is only one of the procedures enclosed in a macro. The other procedures all run fine. It looks like 99.99% of the time is spent on "proc TTest running".

FreelanceReinh
Jade | Level 19

Hi @wang267,

 

Yes, I had the experience that the TTEST procedure was (relatively) slow even on a tiny dataset such as sashelp.class. PROC REG was even worse, whereas other procedures (e.g. PROC LOGISTIC) ran as quickly as could be expected (milliseconds).

 

It turned out that (at least on my workstation) it was the creation of (default) graphs that slowed down these procedures dramatically.

 

With SAS 9.4 (Windows 7) default settings a simple PROC TTEST call creates two graphs: "Summary Panel" and "Q-Q Plots". (PROC REG creates three and they are fairly complicated.) In contrast, PROC LOGISTIC creates none. When a BY statement is used, those graphs are produced for each BY group.

 

So, if you're only interested in text output, you can disable ODS Graphics processing generally:

ods graphics off;

(and switch it on again with ods graphics on; only when you need it -- not necessary for PROC SGPLOT etc., see documentation). This is also the default if you run SAS in batch mode.

 

Or you can stop the procedure from creating those plots on a call-by-call basis:

proc ttest data=... plots=none;

(same option for PROC REG).

 

A third option is to instruct SAS to create only selected outputs in selected ODS destinations, e.g., only the ODS output datasets from PROC TTEST, but no printed output and no graphs:

ods select none;
ods output ttests=ttt conflimits=cl statistics=stats equality=eqty;
proc ttest data=...;
...
run;
ods select all;

 

Example: 100 copies of sashelp.class (i.e. merely 1900 observations in total)

data test;
do j=1 to 100;
  do i=1 to 19;
    set sashelp.class point=i;
    output;
  end;
end;
stop;
run;

options fullstimer;
ods graphics on; /* This is the default! */

proc ttest data=test;
by j;
class sex;
var height;
run;

The resulting SAS log is more than 1800 (!) lines long because it ridiculously repeats the message

NOTE: Multiple concurrent threads will be used to summarize data.

16 times for each of the 100 BY groups. The relevant part is here:

NOTE: PROCEDURE TTEST used (Total process time):
      real time           5:24.41
      user cpu time       2:05.12
      system cpu time     59.10 seconds
      memory              92036.09k
      OS Memory           112820.00k

Compare this to the measurements with ods graphics off; (which also avoids the "Multiple concurrent threads ..." notes!):

NOTE: PROCEDURE TTEST used (Total process time):
      real time           0.23 seconds
      user cpu time       0.14 seconds
      system cpu time     0.09 seconds
      memory              1331.81k
      OS Memory           35564.00k

Thus, ODS graphics had increased the run-time by a factor of ~1400! (For PROC REG I've observed factors >10,000 -- which is insane!).

 

 

 

wang267
Obsidian | Level 7

Thank you so much for the response. Yes, for repeated analyses we try to minimize any output that is not used. I had already turned off all the graphics, notes, and output, like this:

ods graphics off;
ods exclude all;
ods noresults;
options nonotes;

However, the hours of running TTEST occur after all of these were turned off! I also noticed that PROC REG is slow, but at least it runs. TTEST does not seem to move at all, and the running time doubled with each pass through the loop. At one point, before I noticed, it gave me errors like this:

7969     %stage_analysis_srs;
ERROR: No directory space left in ITEM STORE WORK.SASTMP-000000002
ERROR: No directory space left in ITEM STORE WORK.SASTMP-000000002
ERROR: No directory space left in ITEM STORE WORK.SASTMP-000000002
FreelanceReinh
Jade | Level 19

I'm not sure I understand what your data and code look like. You mentioned both BY-group processing (with ID1 and ID2 as BY variables?) and a "loop" and apparently a macro is involved, too.

 

Can you please provide a small sample dataset (fake data is fine) and an example of your PROC TTEST step(s) and then explain the "dimensions" of your real data and code, e.g.

  • size of the analysis dataset(s)
  • (approximate) number and size of BY groups
  • number of analysis variables
  • how many times PROC TTEST is called
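
The BY-group figures asked for above can be pulled straight from the data. This is only a minimal sketch: the data set name `mydata` and the BY variables `id1` and `id2` are placeholders (assumptions based on the ID1/ID2 description in the question), not names from the actual program.

```sas
/* Hypothetical names: mydata, id1, id2 (replace with the real ones) */
proc sql;
  /* One row per BY group with its number of observations */
  create table by_sizes as
  select id1, id2, count(*) as n_obs
  from mydata
  group by id1, id2;

  /* Number of BY groups and the range of group sizes */
  select count(*)   as n_by_groups,
         min(n_obs) as min_size,
         max(n_obs) as max_size
  from by_sizes;
quit;
```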

 

 

 

 

wang267
Obsidian | Level 7

Hi 

 

 

Reeza
Super User

I simulated a 1 million record data set and ran a t-test. I had results back in less than a second with ODS Graphics Off. 

Something else is going on here. 

It would really help if you just showed the code and log at this point. 

 

 

data demo;
do i=1 to 1000000;
before = rand('normal', 0, 6);
after = rand('normal', 1, 5);
output;
end;
run;

NOTE: The data set WORK.DEMO has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time 0.23 seconds
      cpu time 0.15 seconds

ods graphics off;
proc ttest data=demo;
paired before*after;
run;

NOTE: PROCEDURE TTEST used (Total process time):
      real time 0.24 seconds
      cpu time 0.18 seconds

ods graphics on;

wang267
Obsidian | Level 7

@Reeza

 

Did you use BY statement?  That might be a problem. 

Reeza
Super User

Nope, 5 million processed in under a second still with a BY statement. 

Please show us your code and log from the PROC TTEST.

 

And my computer is really basic: 4 GB of RAM, a 300 GB HD, and 4 years old, so the CPU is slow as hell....
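
The BY-group run mentioned above was not posted, so the following is only a sketch of what such a test could look like, not the code that was actually run. The data set name `demo_by`, the BY variable `grp`, the output data set `tt_by`, and the choice of 100,000 groups of 50 observations (5 million rows) are all assumptions made for illustration.

```sas
/* Simulate 100,000 BY groups with 50 paired observations each (5 million rows) */
data demo_by;
  call streaminit(12345);            /* fixed seed so the run is reproducible */
  do grp = 1 to 100000;
    do i = 1 to 50;
      before = rand('normal', 0, 6);
      after  = rand('normal', 1, 5);
      output;
    end;
  end;
run;

options fullstimer;
ods graphics off;                    /* no plots per BY group */
ods exclude all;                     /* no printed tables */
ods output ttests=tt_by;             /* keep only the t-test results as a data set */

proc ttest data=demo_by;
  by grp;                            /* data are already in grp order */
  paired before*after;
run;

ods exclude none;
```

Comparing the real time and CPU time in the log with and without the three ODS statements should show how much of the run is spent on output rather than on the test itself.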

wang267
Obsidian | Level 7

@FreelanceReinh

 

To provide some more information so that you know how to help:

  • the full data set has 1 samples with 50 obs each and 6 variables, but I cut it down to 10000 samples and 6 variables
  • each ID1 * ID2 group (a sample) has size 50, so there are 50 * 10000 obs in the data set
  • only one variable is analyzed
  • TTEST is called 12 times in all, with other procedures in between, not in a loop.
Reeza
Super User

@wang267 wrote:

@FreelanceReinh

 

To provide some more information so that you know how to help:

  • the full data set has 1 samples with 50 obs each and 6 variables, but I cut it down to 10000 samples and 6 variables
  • each ID1 * ID2 group (a sample) has size 50, so there are 50 * 10000 obs in the data set
  • only one variable is analyzed
  • TTEST is called 12 times in all, with other procedures in between, not in a loop.

I really think you need to show some code now. Specifically the PROC TTEST at this point at minimum. Otherwise we're really just guessing.

wang267
Obsidian | Level 7

OK. I just terminated the procedures, so I am now able to post some code so that you can help solve the problem.

%macro samplingconti_srs; 
%do CI = &ncistart %to &nci;
%do sample = 1 %to &nsample;
proc surveyselect data=logireg method=srs seed=&sample.&ci n=&size out=sp_&sample;
run;
data sp_&sample; set sp_&sample; n_sample = &sample; run;
%end;
data cov&covmatrix.size&size.CI&CI;
set sp_1 - sp_&nsample;
n_CI = &CI;
run;
%end;
data disser.conti_cov&covmatrix.size&size._srs&ncistart._&nci;
set cov&covmatrix.size&size.CI&ncistart - cov&covmatrix.size&size.CI&nci;
run;
proc sort data=disser.conti_cov&covmatrix.size&size._srs&ncistart._&nci; by n_CI n_sample; run;
%mend samplingconti_srs;

%macro modelmerge;
data modelfit;
set mdfitx1x2x3x4 mdfitx1x2x3 mdfitx1x2x4 mdfitx1x3x4 mdfitx2x3x4
mdfitx1x2 mdfitx1x3 mdfitx1x4 mdfitx2x3 mdfitx2x4 mdfitx3x4 mdfitx1 mdfitx2 mdfitx3 mdfitx4 mdfit;
run;
%mend modelmerge;



%macro DAModelFit_srs (p1, p2, p3, p4);
/*add sample ID for sample data in order to locate mistake, if any*/
/*need 1000 samplings * sampling 1000 times for each sample to construct one percentile CI*/
proc logistic data=disser.conti_cov&covmatrix.size&size._srs&ncistart._&nci;
by n_CI n_sample;
model y(event="1")= &p1 &p2 &p3 &p4;
output out=a pred=yhat; /*for Tjur coefficient*/
ods output FitStatistics = LogLFits; /*create logl0 and loglm for McFadden*/
run;
proc ttest data=a;
class y;
var yhat;
by n_CI n_sample;
ods output conflimits=meand1; /*for Tjur coefficient*/
run; /*for Tjur coefficient*/
data modelfitMcF (keep = n_CI n_sample x1 x2 x3 x4 subset McFadden);
set Loglfits;
if Criterion = '-2 Log L';
subset = "&p1&p2&p3&p4";
%do i =1 %to 4;
if "&&p&i"="x&i" then x&i=1; else x&i=0;
%end;
** to obtain Likelihoods from -2 log Likelihoods;
LogL0 = InterceptOnly/-2;
LogLM = InterceptAndCovariates/-2;
L0 = exp(InterceptOnly/-2);
LM = exp(InterceptAndCovariates/-2);
McFadden = 1- (LogLM/LogL0);
run;
data modelfitTjur (keep = n_CI n_sample x1 x2 x3 x4 subset Tjur);
set meand1;
if method = "Pooled";
Tjur = abs(Mean);
*subset = "&&modelRhs&i";
subset = "&p1&p2&p3&p4";
%do i =1 %to 4;
if "&&p&i"="x&i" then x&i=1; else x&i=0;
%end;
run;
data mdfit&p1&p2&p3&p4;
merge modelfitMcF modelfitTjur;
by subset;
run;
%mend DAModelFit_srs;

%macro GDmeasure_sp(spmethod);
data disser.conti_GDmscov&covmatrix.size&size.__&spmethod&ncistart._&nci;
merge GD_x1 GD_x2 GD_x3 GD_x4;
by n_CI n_sample variable;
D12 = GD_x1 - GD_x2;
D13 = GD_x1 - GD_x3;
D14 = GD_x1 - GD_x4;
D23 = GD_x2 - GD_x3;
D24 = GD_x2 - GD_x4;
D34 = GD_x3 - GD_x4;
R12 = GD_x1 / GD_x2;
R13 = GD_x1 / GD_x3;
R14 = GD_x1 / GD_x4;
R23 = GD_x2 / GD_x3;
R24 = GD_x2 / GD_x4;
R34 = GD_x3 / GD_x4;
run;
%mend GDmeasure_sp;

%macro predictorGD_sp(predictor=, rest=);
data full&predictor (keep=n_CI n_sample subset &predictor &rest McFfull_&predictor Tjurfull_&predictor);
set modelfit (rename=(McFadden=McFfull_&predictor Tjur=Tjurfull_&predictor));
where &predictor=1;
run;
data sub&predictor (keep=n_CI n_sample subset &predictor &rest McFsub_&predictor Tjursub_&predictor);
set modelfit (rename=(McFadden=McFsub_&predictor Tjur=Tjursub_&predictor));
where &predictor=0;
run;
/* proc sort data=full&predictor; by descending x2 descending x3 descending x4; run;*/
/* proc sort data=sub&predictor; by descending x2 descending x3 descending x4; run;*/
proc sort data=full&predictor; by &rest; run;
proc sort data=sub&predictor; by &rest; run;
data add_&predictor;
merge full&predictor sub&predictor;
by &rest;
McFGD = McFfull_&predictor - McFsub_&predictor;
TjurGD = Tjurfull_&predictor - Tjursub_&predictor;
keep n_CI n_sample &rest McFfull_&predictor McFsub_&predictor Tjurfull_&predictor Tjursub_&predictor McFGD TjurGD;
run;
proc sql; /*obtaining variable means by sql*/
create table GD_&predictor as
select n_CI, n_sample, "McFGD" as Variable, avg(McFGD) as GD_&predictor format 8.7
from add_&predictor
group by n_CI, n_sample
union
select n_CI, n_sample, "TjurGD" as Variable, avg(TjurGD) as GD_&predictor format 8.7
from add_&predictor
group by n_CI, n_sample;
quit;
%mend predictorGD_sp;

%macro stage_analysis_srs;
%do ncistart = &start %to &end %by &by;
%let nci = %eval(%eval(&ncistart) + %eval(&by)-1);
%samplingconti_srs;
proc datasets lib=work memtype=data nolist;
delete a: cov: ful: gd: m: parent: sp: sub:;
quit;
%DAModelFit_srs (x1, x2, x3, x4);
%DAModelFit_srs (x1, x2, x3, );
%DAModelFit_srs (x1, x2, , x4);
%DAModelFit_srs (x1, , x3, x4);
%DAModelFit_srs ( , x2, x3, x4);
%DAModelFit_srs (x1, x2, , );
%DAModelFit_srs (x1, , x3, );
%DAModelFit_srs (x1, , , x4);
%DAModelFit_srs ( , x2, x3, );
%DAModelFit_srs ( , x2, , x4);
%DAModelFit_srs ( , , x3, x4);
%DAModelFit_srs (x1, , , );
%DAModelFit_srs ( , x2, , );
%DAModelFit_srs ( , , x3, );
%DAModelFit_srs ( , , , x4);
%DAModelFit_srs ( , , , );
%modelmerge;
proc datasets lib=work memtype=data nolist;
delete mdfit:;
quit;
%predictorGD_sp(predictor=x1, rest=x2 x3 x4);
%predictorGD_sp(predictor=x2, rest=x1 x3 x4);
%predictorGD_sp(predictor=x3, rest=x1 x2 x4);
%predictorGD_sp(predictor=x4, rest=x1 x2 x3);
/*to obtain the dominance statistics to be analyzed*/
%GDmeasure_sp(srs);
%end;
%mend stage_analysis_srs;

I have posted all the macros, including the final macro that wraps the others and runs them. (To run it, you need to simulate a population with x1-x4 and y, and define some macro variables in the final macro.) PROC TTEST appears in only one of the macros, where I underlined it.

 

However, I now think the problem may not be any particular procedure but disk space or memory (whatever you want to call it). After I replaced the TTEST, the procedures still ran slowly and got completely blocked at

%predictorGD_sp(predictor=x1, rest=x2 x3 x4);

A window popped up saying something like "no space" (I cannot remember the exact wording) and offered several options: let the procedure know, cancel the submitted statements, terminate ... I chose cancel.

 

The log said:

 

712 %predictorGD_sp(predictor=x1, rest=x2 x3 x4);
ERROR: Insufficient space in file WORK.FULLX1.DATA.
ERROR: File WORK.FULLX1.DATA is damaged. I/O processing did not complete.

ERROR: Sorted run creation failure.
ERROR: Failure encountered while creating initial set of sorted runs.
ERROR: Failure encountered during external sort.
ERROR: User asked for termination.

 

I am really confused about why this happened.
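
Since the errors point at the WORK library running out of space, one thing worth checking is where WORK lives and which data sets in it are largest. The sketch below is illustrative only; it assumes the session can query `dictionary.tables` and makes no claim about what the actual cause is.

```sas
/* Print the physical path of the WORK library to the log */
%put NOTE: WORK is located at %sysfunc(pathname(work));

/* List WORK data sets by file size, largest first */
proc sql;
  select memname, nobs, filesize format=comma20.
  from dictionary.tables
  where libname = 'WORK' and memtype = 'DATA'
  order by filesize desc;
quit;
```

Running this right before the failing step shows whether intermediate data sets such as FULLX1 are much larger than expected, and the path tells you which drive needs free space.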

 

Rick_SAS
SAS Super FREQ

I believe that FreelanceReinhard

FreelanceReinh
Jade | Level 19

@wang267 wrote:
  • TTEST is called 12 times in all, with other procedures in between, not in a loop.

Doesn't your code suggest that the number of PROC TTEST calls is rather a multiple of 16?

 

PROC TTEST is called once in macro DAModelFit_srs, which in turn is called 16 times in each iteration of a loop
%do ncistart = &start %to &end %by &by;
in macro 
stage_analysis_srs.
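
In the posted macro, PROC TTEST is used only to get the pooled mean difference of yhat between the two levels of y, whose absolute value is stored as Tjur. As a hedged alternative, that difference could be computed directly in one PROC SQL step. This is a sketch under assumptions, not the poster's method: it assumes y is numeric and coded 0/1, and the output name `modelfitTjur_alt` is made up for illustration.

```sas
/* Assumes data set a (from PROC LOGISTIC's OUTPUT statement) has yhat, */
/* a numeric 0/1 response y, and the BY variables n_CI and n_sample.   */
proc sql;
  create table modelfitTjur_alt as
  select n_CI,
         n_sample,
         /* |mean of yhat where y=1 minus mean of yhat where y=0| */
         abs( mean(case when y = 1 then yhat else . end)
            - mean(case when y = 0 then yhat else . end) ) as Tjur
  from a
  group by n_CI, n_sample;
quit;
```

This produces one row per n_CI and n_sample combination and creates no ODS output at all, so it sidesteps any overhead from tables or graphs. The `subset` and x1-x4 indicator columns from the original DATA step would still need to be added afterwards.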
