Hi everyone,
Have you ever had the experience that the TTEST procedure becomes very slow on a large data set (e.g. 1000 ID1 * 1000 ID2 observations) with a BY statement? It ran for a week but still did not finish. So I cut the data set into smaller ones (i.e. 10 ID1 * 1000 ID2, and 100 ID1 * 100 ID2). However, the TTEST procedure took at least 3 hours to run for each smaller data set. Meanwhile, other procedures (e.g. PROC LOGISTIC, PROC SORT ...) on the same data sets could all finish in a second. Anyone have an idea why TTEST is so slow?
Press Windows + F to bring up the Windows Search.
In the search, type in Size:Gigantic and see what pops up. Sadly their sizes are pretty small.
https://www.dummies.com/computers/pcs/how-to-locate-large-files-on-a-hard-drive-using-windows-7/
Seems too slow.
Please post your code and the log from a 3 hour run.
Thanks for replying. The program is running right now, so I will post some code tomorrow when it finishes its current TTEST. The TTEST is only one of the procedures enclosed in a macro. The other procedures all run fine. It looks like 99.99% of the time is spent with "PROC TTEST running".
Hi @wang267,
Yes, I had the experience that the TTEST procedure was (relatively) slow even on a tiny dataset such as sashelp.class. PROC REG was even worse, whereas other procedures (e.g. PROC LOGISTIC) ran as quickly as could be expected (milliseconds).
It turned out that (at least on my workstation) it was the creation of (default) graphs that slowed down these procedures dramatically.
With SAS 9.4 (Windows 7) default settings a simple PROC TTEST call creates two graphs: "Summary Panel" and "Q-Q Plots". (PROC REG creates three and they are fairly complicated.) In contrast, PROC LOGISTIC creates none. When a BY statement is used, those graphs are produced for each BY group.
So, if you're only interested in text output, you can disable ODS Graphics processing generally:
ods graphics off;
(and switch it on again with ods graphics on; only when you need it -- not necessary for PROC SGPLOT etc., see documentation). This is also the default if you run SAS in batch mode.
Or you can stop the procedure from creating those plots on a call-by-call basis:
proc ttest data=... plots=none;
(same option for PROC REG).
A third option is to instruct SAS to create only selected outputs in selected ODS destinations, e.g., only the ODS output datasets from PROC TTEST, but no printed output and no graphs:
ods select none;
ods output ttests=ttt conflimits=cl statistics=stats equality=eqty;
proc ttest data=...;
...
run;
ods select all;
Example: 100 copies of sashelp.class (i.e. merely 1900 observations in total)
data test;
  do j=1 to 100;
    do i=1 to 19;
      set sashelp.class point=i;
      output;
    end;
  end;
  stop;
run;
options fullstimer;
ods graphics on; /* This is the default! */
proc ttest data=test;
by j;
class sex;
var height;
run;
The resulting SAS log is more than 1800 (!) lines long because it ridiculously repeats the message
NOTE: Multiple concurrent threads will be used to summarize data.
16 times for each of the 100 BY groups. The relevant part is here:
NOTE: PROCEDURE TTEST used (Total process time):
      real time           5:24.41
      user cpu time       2:05.12
      system cpu time     59.10 seconds
      memory              92036.09k
      OS Memory           112820.00k
Compare this to the measurements with ods graphics off; (which also avoids the "Multiple concurrent threads ..." notes!):
NOTE: PROCEDURE TTEST used (Total process time):
      real time           0.23 seconds
      user cpu time       0.14 seconds
      system cpu time     0.09 seconds
      memory              1331.81k
      OS Memory           35564.00k
Thus, ODS graphics had increased the run-time by a factor of ~1400! (For PROC REG I've observed factors >10,000 -- which is insane!).
Thank you so much for the response. Yes, for repeated analyses we tried to minimize output that is not used. I had already turned off all the graphics, notes, and output, as such:
ods graphics off;
ods exclude all;
ods noresults;
options nonotes;
However, the hours-long TTEST runs occur even after all of these were turned off!! I also noticed that the REG procedure is slow, but at least it runs. The TTEST seems not to move at all, and the running time doubled with each successive loop iteration. And at one point, before I noticed, it gave me errors like this:
7969 %stage_analysis_srs;
ERROR: No directory space left in ITEM STORE WORK.SASTMP-000000002
ERROR: No directory space left in ITEM STORE WORK.SASTMP-000000002
ERROR: No directory space left in ITEM STORE WORK.SASTMP-000000002
I'm not sure I understand what your data and code look like. You mentioned both BY-group processing (with ID1 and ID2 as BY variables?) and a "loop" and apparently a macro is involved, too.
Can you please provide a small sample dataset (fake data is fine) and an example of your PROC TTEST step(s) and then explain the "dimensions" of your real data and code, e.g.
Yes, the loop is to sample from a simulated population. Ideally there should be 1000 ID1 * 1000 ID2 samples, which is 1 million samples. I first obtained all the samples and combined them into a single data set, then used a BY ID1 ID2 statement in the following analysis (in which TTEST is one step). I hope that makes the situation clear.
All the procedures worked fine except for TTEST. With the 1 million-sample data, it ran for a week and was still showing "TTEST running". So I tried smaller data sets, like 10 ID1 * 1000 ID2; TTEST still took 3 hours to run on each smaller data set, and became increasingly slower on subsequent data sets.
I also spent some time observing the log notes. It seems the running notice and the log do not always correspond, so it is also possible that another procedure was blocking. I am now trying to replace the TTEST, since what is needed is merely the mean difference. I am still trying and will know whether the new step runs faster.
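If only the per-group mean difference is needed, one lightweight alternative to PROC TTEST could be a single PROC SQL query. This is just a sketch under assumptions: the dataset HAVE, the BY variables ID1/ID2, the binary group variable GRP, and the analysis variable X are all hypothetical placeholder names, not taken from the posted code.

```sas
/* Sketch only: mean difference between two groups per ID1*ID2 cell,   */
/* without PROC TTEST. HAVE, ID1, ID2, GRP, and X are placeholders.    */
proc sql;
  create table meandiff as
  select ID1, ID2,
         avg(case when grp = 1 then x end)
       - avg(case when grp = 0 then x end) as mean_diff
  from have
  group by ID1, ID2;
quit;
```

Note this gives only the point estimate of the mean difference; if the confidence limits from the ConfLimits table are also needed, PROC TTEST (with graphics suppressed) is still the simpler route.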
I simulated a 1 million record data set and ran a t-test. I had results back in less than a second with ODS Graphics Off.
Something else is going on here.
It would really help if you just showed the code and log at this point.
478 data demo;
479 do i=1 to 1000000;
480 before = rand('normal', 0, 6);
481 after = rand('normal', 1, 5);
482 output;
483 end;
484 run;
NOTE: The data set WORK.DEMO has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.23 seconds
cpu time 0.15 seconds
485
486 ods graphics off;
487 proc ttest data=demo;
488 paired before*after;
489 run;
NOTE: PROCEDURE TTEST used (Total process time):
real time 0.24 seconds
cpu time 0.18 seconds
490 ods graphics on;
Nope, 5 million processed in under a second still with a BY statement.
Please show us your code and log from the PROC TTEST.
And my computer is really basic: 4 GB of RAM, a 300 GB HD, and 4 years old, so the CPU is slow as hell....
@wang267 wrote:
To provide some more information so that you know how to help:
- the full data set has 1 samples with 50 obs each and 6 variables; but I cut it down into 10000 samples and 6 variables
- each ID1 * ID2 group (a sample) size is 50, so, 50 * 10000 obs in the data set
- only one variable is analyzed
- TTEST is called 12 times in all, with other procedures in between, not in a loop.
I really think you need to show some code now. Specifically the PROC TTEST at this point at minimum. Otherwise we're really just guessing.
OK. I just terminated the procedures and now am able to post some code so that you can help solve the problem.
%macro samplingconti_srs;
%do CI = &ncistart %to &nci;
%do sample = 1 %to &nsample;
proc surveyselect data=logireg method=srs seed=&sample.&ci n=&size out=sp_&sample;
run;
data sp_&sample; set sp_&sample; n_sample = &sample; run;
%end;
data cov&covmatrix.size&size.CI&CI;
set sp_1 - sp_&nsample;
n_CI = &CI;
run;
%end;
data disser.conti_cov&covmatrix.size&size._srs&ncistart._&nci;
set cov&covmatrix.size&size.CI&ncistart - cov&covmatrix.size&size.CI&nci;
run;
proc sort data=disser.conti_cov&covmatrix.size&size._srs&ncistart._&nci; by n_CI n_sample; run;
%mend samplingconti_srs;
%macro modelmerge;
data modelfit;
set mdfitx1x2x3x4 mdfitx1x2x3 mdfitx1x2x4 mdfitx1x3x4 mdfitx2x3x4
mdfitx1x2 mdfitx1x3 mdfitx1x4 mdfitx2x3 mdfitx2x4 mdfitx3x4 mdfitx1 mdfitx2 mdfitx3 mdfitx4 mdfit;
run;
%mend modelmerge;
%macro DAModelFit_srs (p1, p2, p3, p4); /*add sample ID for sample data in order to locate mistake, if any*/
/*need 1000 samplings * sampling 1000 times for each sample to construct one percentile CI*/
proc logistic data=disser.conti_cov&covmatrix.size&size._srs&ncistart._&nci;
by n_CI n_sample;
model y(event="1")= &p1 &p2 &p3 &p4;
output out=a pred=yhat; /*for Tjur coefficient*/
ods output FitStatistics = LogLFits;/*create logl0 and loglm for McFadden*/
run;
proc ttest data=a; class y; var yhat;
by n_CI n_sample;
ods output conflimits=meand1; /*for Tjur coefficient*/
run; /*for Tjur coefficient*/
data modelfitMcF (keep = n_CI n_sample x1 x2 x3 x4 subset McFadden);
set Loglfits;
if Criterion = '-2 Log L';
subset = "&p1&p2&p3&p4";
%do i =1 %to 4;
if "&&p&i"="x&i" then x&i=1; else x&i=0;
%end;
** to obtain Likelihoods from -2 log Likelihoods;
LogL0 = InterceptOnly/-2;
LogLM = InterceptAndCovariates/-2;
L0 = exp(InterceptOnly/-2);
LM = exp(InterceptAndCovariates/-2);
McFadden = 1- (LogLM/LogL0);
run;
data modelfitTjur (keep = n_CI n_sample x1 x2 x3 x4 subset Tjur);
set meand1;
if method = "Pooled";
Tjur = abs(Mean);
*subset = "&&modelRhs&i";
subset = "&p1&p2&p3&p4";
%do i =1 %to 4;
if "&&p&i"="x&i" then x&i=1; else x&i=0;
%end;
run;
data mdfit&p1&p2&p3&p4;
merge modelfitMcF modelfitTjur;
by subset;
run;
%mend DAModelFit_srs;
%macro GDmeasure_sp(spmethod);
data disser.conti_GDmscov&covmatrix.size&size.__&spmethod&ncistart._&nci;
merge GD_x1 GD_x2 GD_x3 GD_x4;
by n_CI n_sample variable;
D12 = GD_x1 - GD_x2;
D13 = GD_x1 - GD_x3;
D14 = GD_x1 - GD_x4;
D23 = GD_x2 - GD_x3;
D24 = GD_x2 - GD_x4;
D34 = GD_x3 - GD_x4;
R12 = GD_x1 / GD_x2;
R13 = GD_x1 / GD_x3;
R14 = GD_x1 / GD_x4;
R23 = GD_x2 / GD_x3;
R24 = GD_x2 / GD_x4;
R34 = GD_x3 / GD_x4;
run;
%mend GDmeasure_sp;
%macro predictorGD_sp(predictor=, rest=);
data full&predictor (keep=n_CI n_sample subset &predictor &rest McFfull_&predictor Tjurfull_&predictor);
set modelfit (rename=(McFadden=McFfull_&predictor Tjur=Tjurfull_&predictor));
where &predictor=1;
run;
data sub&predictor (keep=n_CI n_sample subset &predictor &rest McFsub_&predictor Tjursub_&predictor);
set modelfit (rename=(McFadden=McFsub_&predictor Tjur=Tjursub_&predictor));
where &predictor=0;
run;
/* proc sort data=full&predictor; by descending x2 descending x3 descending x4; run;*/
/* proc sort data=sub&predictor; by descending x2 descending x3 descending x4; run;*/
proc sort data=full&predictor; by &rest; run;
proc sort data=sub&predictor; by &rest; run;
data add_&predictor;
merge full&predictor sub&predictor;
by &rest;
McFGD = McFfull_&predictor - McFsub_&predictor;
TjurGD = Tjurfull_&predictor - Tjursub_&predictor;
keep n_CI n_sample &rest McFfull_&predictor McFsub_&predictor Tjurfull_&predictor Tjursub_&predictor McFGD TjurGD;
run;
proc sql; /*obtaining variable means by sql*/
create table GD_&predictor as
select n_CI, n_sample, "McFGD" as Variable, avg(McFGD) as GD_&predictor format 8.7
from add_&predictor
group by n_CI, n_sample
union
select n_CI, n_sample, "TjurGD" as Variable, avg(TjurGD) as GD_&predictor format 8.7
from add_&predictor
group by n_CI, n_sample;
quit;
%mend predictorGD_sp;
%macro stage_analysis_srs;
%do ncistart = &start %to &end %by &by;
%let nci = %eval(%eval(&ncistart) + %eval(&by)-1);
%samplingconti_srs;
proc datasets lib=work memtype=data nolist; delete a: cov: ful: gd: m: parent: sp: sub:; quit;
%DAModelFit_srs (x1, x2, x3, x4);
%DAModelFit_srs (x1, x2, x3, );
%DAModelFit_srs (x1, x2, , x4);
%DAModelFit_srs (x1, , x3, x4);
%DAModelFit_srs ( , x2, x3, x4);
%DAModelFit_srs (x1, x2, , );
%DAModelFit_srs (x1, , x3, );
%DAModelFit_srs (x1, , , x4);
%DAModelFit_srs ( , x2, x3, );
%DAModelFit_srs ( , x2, , x4);
%DAModelFit_srs ( , , x3, x4);
%DAModelFit_srs (x1, , , );
%DAModelFit_srs ( , x2, , );
%DAModelFit_srs ( , , x3, );
%DAModelFit_srs ( , , , x4);
%DAModelFit_srs ( , , , );
%modelmerge;
proc datasets lib=work memtype=data nolist; delete mdfit:; quit;
%predictorGD_sp(predictor=x1, rest=x2 x3 x4);
%predictorGD_sp(predictor=x2, rest=x1 x3 x4);
%predictorGD_sp(predictor=x3, rest=x1 x2 x4);
%predictorGD_sp(predictor=x4, rest=x1 x2 x3);
/*to obtain the dominance statistics to be analyzed*/
%GDmeasure_sp(srs);
%end;
%mend stage_analysis_srs;
I have put up all the macros, plus the final macro wrapping them all up to run. (To run it, you need to simulate a population with x1-x4 and y, and define some macro variables in the final macro.) The TTEST procedure appears in only one of the macros, DAModelFit_srs.
However, now I find that it might not be a problem with any particular procedure but with space/memory (whatever you want to call it). After I replaced the TTEST, the procedures still ran slowly, and got completely blocked at
%predictorGD_sp(predictor=x1, rest=x2 x3 x4);
A window popped up saying something like "no space" (I cannot remember the exact wording) and provided several options: let the procedure know, cancel the submitted statements, terminate... And I canceled.
the log said:
712 %predictorGD_sp(predictor=x1, rest=x2 x3 x4);
ERROR: Insufficient space in file WORK.FULLX1.DATA.
ERROR: File WORK.FULLX1.DATA is damaged. I/O processing did not complete.
ERROR: Sorted run creation failure.
ERROR: Failure encountered while creating initial set of sorted runs.
ERROR: Failure encountered during external sort.
ERROR: User asked for termination.
I am really confused about why this happened.
I believe that FreelanceReinhard, Reeza, and I have already addressed the problems in your simulation. The main problem is that you need to replace those macro loops with BY groups to obtain performance. The "window is filled" message occurs because you need to suppress the ODS output, graphics, and notes when you run the simulations. Carefully read and study the advice and links that have been offered up to now and check back if there is anything you don't understand.
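For reference, the BY-group pattern recommended above might look like the following sketch, reusing the dataset and variable names from the posted DAModelFit_srs macro (the output dataset a from PROC LOGISTIC, BY variables n_CI and n_sample, class variable y, analysis variable yhat):

```sas
/* Sketch: one PROC TTEST over all groups via BY, with graphics and   */
/* printed output suppressed; only the ODS output dataset is kept.    */
ods graphics off;
ods select none;
proc ttest data=a plots=none;
  by n_CI n_sample;
  class y;
  var yhat;
  ods output conflimits=meand1;
run;
ods select all;
```

The point is that a single procedure call with a BY statement replaces repeated macro-driven calls, and suppressing the graphs and printed output avoids the per-BY-group ODS overhead measured earlier in this thread.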
@wang267 wrote:
- TTEst is called 12 times in all, with other procedures in the between, not in loop .
Doesn't your code suggest that the number of PROC TTEST calls is rather a multiple of 16?
PROC TTEST is called once in macro DAModelFit_srs, which in turn is called 16 times in each iteration of a loop
%do ncistart = &start %to &end %by &by;
in macro stage_analysis_srs.