lupp
Calcite | Level 5

NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M2)

NOTE: This session is executing on the Linux 3.10.0-957.27.2.el7.x86_64 (LIN X64) platform.

 

I'm trying to impute students' test scores across 3 different tests at 6 points in time. The missing data rate is no smaller than 2/3. I have 455,959 observations, and 18 variables. Here is the code I'm using:

 

proc mi data=thedata out=theoutput ;
    var 
<list of variables>
        ;
     fcs     regpmm;

When I run this I get a 0-length log file, 47,234 lines of output in the lst file, and no messages at the console. The output is produced during the first minute the program runs and then stops in the middle of an output table. It has been like this for the last 51 hours. Meanwhile, the program continues to consume CPU time: so far the SAS process has used 1560 minutes. Since there is no log file and there are no messages at the console, I have no idea how to even start figuring out what the problem is. I know it's not a log-redirection problem, because when I run other SAS jobs the log is produced as expected.

Can anyone give me some help?

TIA

ballardw
Super User

The first thing I would try is a subset of your data with fewer records, maybe 1000.

 

proc mi data=thedata (obs=1000) out=theoutput ;
    var 

and if that works increase the Obs value and see if the time increases.
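One way to automate that check is a small macro that reruns the step at several OBS= values and writes the elapsed time to the log. This is only a sketch: the macro name time_mi, the list of sizes, the output data set _mi_test, and the placeholder variables x1-x3 are all made up for illustration and should be replaced with the real names.

```sas
/* Sketch: time PROC MI at increasing sample sizes.                 */
/* time_mi, _mi_test, and x1-x3 are hypothetical placeholder names. */
%macro time_mi(sizes=1000 2000 4000 8000);
   %local i n t0 t1;
   %let i = 1;
   %let n = %scan(&sizes, &i);
   %do %while(%length(&n) > 0);
      %let t0 = %sysfunc(datetime());      /* start time in seconds */
      proc mi data=thedata(obs=&n) nimpute=1 seed=1 out=_mi_test noprint;
         var x1 x2 x3;                     /* replace with the real VAR list */
         fcs regpmm;
      run;
      %let t1 = %sysfunc(datetime());      /* end time in seconds */
      %put NOTE: OBS=&n took %sysevalf(&t1 - &t0) seconds.;
      %let i = %eval(&i + 1);
      %let n = %scan(&sizes, &i);
   %end;
%mend time_mi;

%time_mi()
```

NIMPUTE=1 and NOPRINT keep each run as small as possible, so the numbers in the log reflect the imputation itself rather than table output.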

Are you running this in Batch or interactively?

 

 

Rick_SAS
SAS Super FREQ

You have 18 variables on the VAR statement? For N=456,000, that is a very big imputation problem. It will take a long time with a big data set.

 

I cannot figure out why your lst file contains 47,234 lines. Can you post the EXACT PROC MI step that you are running, including all options? For the default tables, only the "Missing Data Patterns" table can be large, and only if you have a huge number of combinations of missing values among your variables.

 

A possible work-around to your problem:

The REGPMM method is a nearest-neighbor computation, which will be very expensive when used with a large data set. I suggest that you use a more efficient method such as FCS REG. Using REG instead of REGPMM should be many times faster.
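A rough cost model shows why a nearest-neighbor step can get expensive. It assumes a naive search in which every missing value is compared with every observed value; that is an assumption made for illustration, not a statement about how PROC MI is implemented internally. Here n is the number of observations, p is the fraction of values missing for the variable being imputed, and c is a machine-dependent constant.

```latex
\[
  C(n) \;\approx\; c \cdot \underbrace{p\,n}_{\text{missing}} \cdot \underbrace{(1-p)\,n}_{\text{observed}}
       \;=\; c\,p\,(1-p)\,n^{2},
  \qquad
  \frac{C(2n)}{C(n)} \approx 4 .
\]
```

Under that assumption, doubling the number of observations roughly quadruples the matching work for each REGPMM variable.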

 

The current default number of imputations is 25, but I don't remember what the default was for SAS 9.4M2. If you want to know how long ONE imputation will take, set

NIMPUTE=1

on the PROC MI statement. As a rule of thumb, if you perform k imputations, it will take k times as long as a single imputation. With a large data set, you might want to limit the number of imputations to NIMPUTE=5.
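Put together, a single-imputation trial run with FCS REG might look like the sketch below. The variable names x1-x3 are placeholders for the real VAR list, and the SEED= value is arbitrary.

```sas
/* Sketch: one imputation with FCS REG to gauge the run time.     */
/* x1-x3 are placeholder names standing in for the real VAR list. */
proc mi data=thedata nimpute=1 seed=12345 out=theoutput;
   var x1 x2 x3;
   fcs reg;
run;
```

If this finishes in reasonable time, the rule of thumb above gives a first estimate for a larger NIMPUTE= value.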

 

 

whs278
Quartz | Level 8

Hi, 

 

I'm having a similar issue with a dataset with 750,000 observations.  I'm only imputing 6 variables, and I'm using REG for four of those.  However, I wanted to use REGPMM for two attendance rate (percentage) variables because using REG results in numbers over 100.  I've tried using the MAX option before but that didn't work. 

 

Is there any way to increase the efficiency of this operation? If not, is there a way to estimate how much time is needed? I've tried gradually increasing the number of observations, but the effect on computation time doesn't seem to be linear.

 

Thanks,

Bill

Rick_SAS
SAS Super FREQ

Please post your PROC MI statements.

whs278
Quartz | Level 8
PROC MI DATA = HSPSTUID(OBS = 50000) NIMPUTE =  1 OUT = HSPSTU_MI1;
	CLASS ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
	FCS NBITER = 1 REGPMM ( ATTPCTROLG08MI);
	FCS NBITER = 1 REG (ELASSCZG08MI);
	FCS NBITER = 1 REG (MTHSSCZG07MI);
	FCS NBITER = 1 REGPMM ( ATTPCTROLG07MI);
	FCS NBITER = 1 REG (ELASSCZG07MI);
	FCS NBITER = 1 REG (MTHSSCZG08MI);
	VAR ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P ATTPCTROLG08MI ELASSCZG08MI MTHSSCZG07MI ATTPCTROLG07MI ELASSCZG07MI MTHSSCZG08MI;
	WHERE IN_MI = 1;

RUN;
Rick_SAS
SAS Super FREQ

How many levels are in the categorical variables ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P?

Run 

PROC freq DATA = HSPSTUID(OBS = 50000);
tables ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
run;

or run PROC GLM with those explanatory variables and look at the ClassLevels table
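For the PROC GLM route, a sketch might look like the following. Using ELASSCZG08MI as the response is an arbitrary choice made only so that the MODEL statement is valid; any numeric variable would do, because only the class-level table is requested.

```sas
/* Sketch: display only the ClassLevels table from PROC GLM.       */
/* The response ELASSCZG08MI is an arbitrary choice for this demo. */
proc glm data=HSPSTUID(obs=50000);
   class ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
   model ELASSCZG08MI = ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
   ods select ClassLevels;
run; quit;
```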

whs278
Quartz | Level 8

ETHCAT has 5 levels and the other three variables have 2 each, so there are a total of 40 groups, I suppose (5 x 2 x 2 x 2 = 40).

 

 

The FREQ Procedure

ETHCAT                      Frequency  Percent  Cumulative Frequency  Cumulative Percent
2: ASIAN/PACIFIC ISLANDER        8496    16.99                  8496               16.99
3: HISPANIC                     19228    38.46                 27724               55.45
4: BLACK, NON-HISPANIC          14359    28.72                 42083               84.17
5: WHITE, NON-HISPANIC           7300    14.60                 49383               98.77
8: OTHER RACE/ETH                 617     1.23                 50000              100.00

GENCAT                      Frequency  Percent  Cumulative Frequency  Cumulative Percent
1: FEMALE                       25793    51.59                 25793               51.59
2: MALE                         24207    48.41                 50000              100.00

OVERAGEG09                  Frequency  Percent  Cumulative Frequency  Cumulative Percent
0                               39900    79.80                 39900               79.80
100                             10100    20.20                 50000              100.00

REGMTHP65G08P               Frequency  Percent  Cumulative Frequency  Cumulative Percent
0                               39062    78.12                 39062               78.12
100                             10938    21.88                 50000              100.00
Rick_SAS
SAS Super FREQ

It does take a long time. For some simulated data, I predict (on my PC) that the 50k observations will take about 107 seconds for a single imputation (NIMPUTE=1). Here is the simulated data I used:

/* simulate the data */
data HSPSTUID;
call streaminit(123);
do i = 1 to 50000;
   ETHCAT = rand("Table", 0.17, 0.38, 0.29, 0.14, 0.02);
   GENCAT = rand("Table", 0.51, 0.49);
   OVERAGEG09 = 100*rand("BERN", 0.2);
   REGMTHP65G08P = 100*rand("BERN", 0.22);
   
   ATTPCTROLG08MI = rand("Normal");
   ELASSCZG08MI   = rand("Normal");
   MTHSSCZG07MI   = rand("Normal");
   ATTPCTROLG07MI = rand("Normal");
   ELASSCZG07MI   = rand("Normal");
   MTHSSCZG08MI   = rand("Normal");

   NBITER = ETHCAT + GENCAT - OVERAGEG09/10 - REGMTHP65G08P/10
            + ATTPCTROLG08MI - ELASSCZG08MI + MTHSSCZG07MI - ATTPCTROLG07MI
            + ELASSCZG07MI - MTHSSCZG08MI;
   /* now introduce missing at random */
   if rand("bern",0.1)  then ATTPCTROLG08MI = .;
   if rand("bern",0.05) then ELASSCZG08MI   = .;
   if rand("bern",0.05) then MTHSSCZG07MI   = .;
   if rand("bern",0.07) then ATTPCTROLG07MI = .;
   if rand("bern",0.01) then ELASSCZG07MI   = .;
   if rand("bern",0.2)  then MTHSSCZG08MI   = .;
   output;
end;
run;

I then ran PROC MI for OBS=5000, OBS=10000, etc, until I got tired of waiting. The code I ran was

%let N = 5000;   /* change this number: 5000, 10000, 15000, ... */
PROC MI DATA = HSPSTUID(OBS = &N) NIMPUTE =  1 OUT = HSPSTU_MI1
     displaypattern=nomeans;
	CLASS ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
	FCS NBITER = 1 REGPMM ( ATTPCTROLG08MI);
	FCS NBITER = 1 REG (ELASSCZG08MI);
	FCS NBITER = 1 REG (MTHSSCZG07MI);
	FCS NBITER = 1 REGPMM ( ATTPCTROLG07MI);
	FCS NBITER = 1 REG (ELASSCZG07MI);
	FCS NBITER = 1 REG (MTHSSCZG08MI);
	VAR ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P ATTPCTROLG08MI ELASSCZG08MI MTHSSCZG07MI ATTPCTROLG07MI ELASSCZG07MI MTHSSCZG08MI;
RUN;

Here are the timings from my PC. The total time appears to be quadratic in the sample size:

/* results of running PROC MI with different sample sizes */
data Timing;
input size Time;
datalines;
500  0.17
5000 1.33
8000 2.91
10000 4.52
15000 9.38
20000 16.69
25000 27.2
30000 38.23
40000 .
50000 .
;

proc glm data=Timing;
   model Time = size size*size;
   output out=GLMOut P=pred;
run; quit;

proc print data=GLMOut;run;
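The same trick (rows with a missing Time are scored but do not affect the fit) can be used to get predictions for other sample sizes. The sketch below assumes the Timing data set above already exists; the data set names Extra, TimingAll, and GLMAll are made up for illustration. The sizes 455959 and 750000 are the observation counts mentioned earlier in this thread. Treat those predictions with caution: they extrapolate far beyond the largest timed run (30,000), and the timings come from simulated data on one PC.

```sas
/* Sketch: score extra sample sizes with the quadratic model.  */
/* Extra, TimingAll, and GLMAll are hypothetical names.        */
data Extra;
   input size;
   Time = .;             /* missing response: scored, not fitted */
   datalines;
100000
455959
750000
;

data TimingAll;
   set Timing Extra;     /* original timings plus the new sizes */
run;

proc glm data=TimingAll;
   model Time = size size*size;
   output out=GLMAll P=pred;
run; quit;

/* show only the rows that were predicted rather than measured */
proc print data=GLMAll;
   where Time = .;
run;
```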
whs278
Quartz | Level 8
Thanks, this was really helpful. I always appreciate your help on SAS communities as well as your blog posts.
