Re: proc mi produces no log

lupp · Posted 10-24-2019 02:20 PM

NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software 9.4 (TS1M2)

NOTE: This session is executing on the Linux 3.10.0-957.27.2.el7.x86_64 (LIN X64) platform.

I'm trying to impute students' test scores across 3 different tests at 6 points in time. The missing-data rate is at least 2/3. I have 455,959 observations and 18 variables. Here is the code I'm using:

proc mi data=thedata out=theoutput ;
    var 
<list of variables>
        ;
     fcs     regpmm;

When I run this I get a 0-length log file, 47,234 lines of output in the lst file, and no messages at the console. The output is produced during the first minute the program runs and then stops in the middle of an output table. It has been like this for the last 51 hours. Meanwhile, the program continues to eat up CPU time: so far the SAS process has used 1560 minutes of CPU time. Since there is no log file and there are no messages at the console, I have no idea how to even start figuring out what the problem is. I know it's not a log-redirection problem because when I run other SAS jobs, the log is produced as expected.

Can anyone give me some help?

TIA

ballardw · Posted 10-24-2019 06:19 PM

The first thing I would try is a subset of your data with fewer records, maybe 1000:

proc mi data=thedata (obs=1000) out=theoutput ;
    var <list of variables>;
    fcs regpmm;
run;

If that works, increase the OBS= value and see how the run time grows.
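One way to follow this suggestion is to time the step at a few subset sizes. The macro below is only a sketch: the macro name time_mi, the output data set _mi_out, and the variable list score1-score18 are placeholders that are not from the original post, and NIMPUTE=1 and NOPRINT are added here just to keep each trial short.

```sas
/* Sketch: time PROC MI at several OBS= values.                  */
/* score1-score18 is a hypothetical stand-in for the real list.  */
%macro time_mi(sizes=1000 2000 4000);
   %local i n t0 elapsed;
   %let i = 1;
   %do %while(%length(%scan(&sizes, &i, %str( ))) > 0);
      %let n  = %scan(&sizes, &i, %str( ));
      %let t0 = %sysfunc(datetime());          /* start time, in seconds */
      proc mi data=thedata(obs=&n) out=_mi_out nimpute=1 noprint;
         var score1-score18;
         fcs regpmm;
      run;
      %let elapsed = %sysevalf(%sysfunc(datetime()) - &t0);
      %put NOTE: OBS=&n took &elapsed seconds.;
      %let i = %eval(&i + 1);
   %end;
%mend time_mi;

%time_mi(sizes=1000 2000 4000 8000);
```

If the elapsed times written to the log grow much faster than the subset size, the full 455,959-row run may simply be very slow rather than hung.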

Are you running this in Batch or interactively?

Rick_SAS · Posted 10-25-2019 09:50 AM

You have 18 variables on the VAR statement? For N=456,000, that is a very big imputation problem. It will take a long time with a big data set.

I cannot figure out why your lst file contains 47,234 lines. Can you post the EXACT PROC MI step that you are running, including all options? For the default tables, only the "Missing Data Patterns" table can be large, and only if you have a huge number of combinations of missing values among your variables.

A possible work-around to your problem:

The REGPMM method is a nearest-neighbor computation, which will be very expensive when used with a large data set. I suggest that you use a more efficient method such as FCS REG. Using REG instead of REGPMM should be many times faster.

The current default number of imputations is 25, but I don't remember what the default was for SAS 9.4M2. If you want to know how long ONE imputation will take, set

NIMPUTE=1

on the PROC MI statement. As a rule of thumb, if you perform k imputations, it will take k times as long as a single imputation. With a large data set, you might want to limit the number of imputations to NIMPUTE=5.
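Putting the two suggestions together, a trial run might look like the sketch below. The variable list score1-score18 and the SEED= value are assumptions for illustration only; the data set names are the ones from the original post.

```sas
/* Sketch: one imputation with the regression method.            */
/* score1-score18 is a hypothetical stand-in for the real list.  */
proc mi data=thedata out=theoutput nimpute=1 seed=12345;
   var score1-score18;
   fcs reg;          /* regression method instead of REGPMM */
run;
```

If a single imputation finishes in a reasonable time, the rule of thumb above gives a rough estimate for a larger NIMPUTE= value.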

whs278 · Posted 07-13-2021 06:58 PM

Hi,

I'm having a similar issue with a dataset with 750,000 observations. I'm only imputing 6 variables, and I'm using REG for four of those. However, I wanted to use REGPMM for two attendance rate (percentage) variables because using REG results in numbers over 100. I've tried using the MAX option before but that didn't work.

Is there any way to increase the efficiency of this operation? If not, is there a way to estimate the length of time that is needed? I've tried gradually increasing the number of observations, but the effect on computation time doesn't seem linear.

Thanks,

Bill

Rick_SAS · Posted 07-14-2021 06:22 AM

Please post your PROC MI statements.

whs278 · Posted 07-14-2021 09:10 AM

PROC MI DATA = HSPSTUID(OBS = 50000) NIMPUTE =  1 OUT = HSPSTU_MI1;
	CLASS ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
	FCS NBITER = 1 REGPMM ( ATTPCTROLG08MI);
	FCS NBITER = 1 REG (ELASSCZG08MI);
	FCS NBITER = 1 REG (MTHSSCZG07MI);
	FCS NBITER = 1 REGPMM ( ATTPCTROLG07MI);
	FCS NBITER = 1 REG (ELASSCZG07MI);
	FCS NBITER = 1 REG (MTHSSCZG08MI);
	VAR ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P ATTPCTROLG08MI ELASSCZG08MI MTHSSCZG07MI ATTPCTROLG07MI ELASSCZG07MI MTHSSCZG08MI;
	WHERE IN_MI = 1;

RUN;

Rick_SAS · Posted 07-14-2021 11:29 AM

How many levels are in the categorical variables ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P?

Run

PROC freq DATA = HSPSTUID(OBS = 50000);
tables ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
run;

Or run PROC GLM with those explanatory variables and look at the ClassLevels table.
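A minimal sketch of the PROC GLM route is shown here. The choice of ELASSCZG08MI as the response is arbitrary and is an assumption of this example; any numeric variable in the data set would serve, because only the class-level table is displayed.

```sas
/* Sketch: show only the class-level table from PROC GLM.        */
/* The response variable is an arbitrary choice for this example. */
proc glm data=HSPSTUID(obs=50000);
   class ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
   model ELASSCZG08MI = ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
   ods select ClassLevels;    /* suppress all other tables */
run; quit;
```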

whs278 · Posted 07-14-2021 01:20 PM

ETHCAT has 5 levels and the other three variables have 2 each, so there are a total of 40 groups, I suppose (5 x 2 x 2 x 2 = 40).

The FREQ Procedure

ETHCAT	Frequency	Percent	Cumulative Frequency	Cumulative Percent
2: ASIAN/PACIFIC ISLANDER	8496	16.99	8496	16.99
3: HISPANIC	19228	38.46	27724	55.45
4: BLACK, NON-HISPANIC	14359	28.72	42083	84.17
5: WHITE, NON-HISPANIC	7300	14.60	49383	98.77
8: OTHER RACE/ETH	617	1.23	50000	100.00

GENCAT	Frequency	Percent	Cumulative Frequency	Cumulative Percent
1: FEMALE	25793	51.59	25793	51.59
2: MALE	24207	48.41	50000	100.00

OVERAGEG09	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	39900	79.80	39900	79.80
100	10100	20.20	50000	100.00

REGMTHP65G08P	Frequency	Percent	Cumulative Frequency	Cumulative Percent
0	39062	78.12	39062	78.12
100	10938	21.88	50000	100.00

Rick_SAS · Posted 07-14-2021 03:57 PM

It does take a long time. For some simulated data, I predict (on my PC) that the 50k observations will take 107 seconds per missing value imputation. Here is the simulated data I used:

/* simulate the data */
data HSPSTUID;
call streaminit(123);
do i = 1 to 50000;
   ETHCAT = rand("Table", 0.17, 0.38, 0.29, 0.14, 0.02);
   GENCAT = rand("Table", 0.51, 0.49);
   OVERAGEG09 = 100*rand("BERN", 0.2);
   REGMTHP65G08P = 100*rand("BERN", 0.22);
   
   ATTPCTROLG08MI = rand("Normal");
   ELASSCZG08MI   = rand("Normal");
   MTHSSCZG07MI   = rand("Normal");
   ATTPCTROLG07MI = rand("Normal");
   ELASSCZG07MI   = rand("Normal");
   MTHSSCZG08MI   = rand("Normal");

   NBITER = ETHCAT + GENCAT - OVERAGEG09/10 - REGMTHP65G08P/10
            + ATTPCTROLG08MI - ELASSCZG08MI + MTHSSCZG07MI - ATTPCTROLG07MI
            + ELASSCZG07MI - MTHSSCZG08MI;
   /* now introduce missing at random */
   if rand("bern",0.1)  then ATTPCTROLG08MI = .;
   if rand("bern",0.05) then ELASSCZG08MI   = .;
   if rand("bern",0.05) then MTHSSCZG07MI   = .;
   if rand("bern",0.07) then ATTPCTROLG07MI = .;
   if rand("bern",0.01) then ELASSCZG07MI   = .;
   if rand("bern",0.2)  then MTHSSCZG08MI   = .;
   output;
end;
run;

I then ran PROC MI for OBS=5000, OBS=10000, etc, until I got tired of waiting. The code I ran was

%let N = 5000;   /* change this number: 5000, 10000, 15000, ... */
PROC MI DATA = HSPSTUID(OBS = &N) NIMPUTE =  1 OUT = HSPSTU_MI1
     displaypattern=nomeans;
	CLASS ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
	FCS NBITER = 1 REGPMM ( ATTPCTROLG08MI);
	FCS NBITER = 1 REG (ELASSCZG08MI);
	FCS NBITER = 1 REG (MTHSSCZG07MI);
	FCS NBITER = 1 REGPMM ( ATTPCTROLG07MI);
	FCS NBITER = 1 REG (ELASSCZG07MI);
	FCS NBITER = 1 REG (MTHSSCZG08MI);
	VAR ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P ATTPCTROLG08MI ELASSCZG08MI MTHSSCZG07MI ATTPCTROLG07MI ELASSCZG07MI MTHSSCZG08MI;
RUN;

Here are the timings from my PC. The total time appears to be quadratic in the sample size:

/* results of running PROC MI with different sample sizes */
data Timing;
input size Time;
datalines;
500  0.17
5000 1.33
8000 2.91
10000 4.52
15000 9.38
20000 16.69
25000 27.2
30000 38.23
40000 .
50000 .
;

proc glm data=Timing;
   model Time = size size*size;
   output out=GLMOut P=pred;
run; quit;

proc print data=GLMOut;run;
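As a rough cross-check of the quadratic claim, one could also regress log(Time) on log(size): a slope near 2 would be consistent with quadratic growth. This is only a sketch. The data set name LogTiming and the cutoff size >= 5000 (used to drop the smallest run, which is likely dominated by fixed overhead) are assumptions of this example, not part of the original analysis.

```sas
/* Sketch: estimate the growth exponent from the Timing data.    */
data LogTiming;
   set Timing;
   where Time is not missing and size >= 5000;   /* assumed cutoff */
   logSize = log(size);
   logTime = log(Time);
run;

proc reg data=LogTiming;
   model logTime = logSize;   /* slope estimates p in Time ~ c*size**p */
run; quit;
```

Extrapolating either fit to several hundred thousand observations goes far beyond the measured range, so any such prediction should be treated as a loose guide only.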

whs278 · Posted 07-20-2021 11:16 AM

Thanks, this was really helpful. I always appreciate your help on SAS communities as well as your blog posts.