proc mi produces no log

Posted 10-24-2019 02:20 PM
(1500 views)

NOTE: Copyright (c) 2002-2012 by SAS Institute Inc., Cary, NC, USA.

NOTE: SAS (r) Proprietary Software 9.4 (TS1M2)

NOTE: This session is executing on the Linux 3.10.0-957.27.2.el7.x86_64 (LIN X64) platform.

I'm trying to impute students' test scores across 3 different tests at 6 points in time. The missing-data rate is at least 2/3. I have 455,959 observations and 18 variables. Here is the code I'm using:

```
proc mi data=thedata out=theoutput ;
var
<list of variables>
;
fcs regpmm;
```

When I run this I get a 0-length log file, 47234 lines of output in the lst file, and no messages at the console. The output is produced during the first minute the program runs and then stops in the middle of an output table. It has been like this for the last 51 hours. Meanwhile, the program continues to consume CPU time: so far the SAS process has used 1560 minutes of CPU time. Since there is no log file and no messages at the console, I have no idea how to even start figuring out what the problem is. I know it's not a log-redirection problem because when I run other SAS jobs, the log is produced as expected.

Can anyone give me some help?

TIA

9 REPLIES


The first thing I would try is a subset of your data with fewer records, maybe 1,000:

proc mi data=thedata (obs=1000) out=theoutput ; var

If that works, increase the OBS= value and see how the run time increases.

Are you running this in batch mode or interactively?


You have 18 variables on the VAR statement? For N=456,000, that is a very big imputation problem. It will take a long time with a big data set.

I cannot figure out why your lst file contains 47,234 lines. Can you post the EXACT PROC MI step that you are running, including all options? For the default tables, only the "Missing Data Patterns" table can be large, and only if you have a huge number of combinations of missing values among your variables.

A possible work-around to your problem:

The REGPMM method is a nearest-neighbor computation, which is very expensive on a large data set. I suggest that you use a more efficient method such as FCS REG. Using REG instead of REGPMM should be many times faster.

The current default number of imputations is 25, but I don't remember what the default was for SAS 9.4M2. If you want to know how long ONE imputation will take, set

NIMPUTE=1

on the PROC MI statement. As a rule of thumb, if you perform k imputations, it will take k times as long as a single imputation. With a large data set, you might want to limit the number of imputations to NIMPUTE=5.
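A minimal sketch of the suggestions above (one imputation, REG instead of REGPMM, and a small subset first) might look like the following. The variable names `score1-score18` are hypothetical stand-ins for the 18 variables, which were not posted, and the seed value is arbitrary.

```sas
/* Sketch only: score1-score18 are hypothetical names for the 18 variables. */
options fullstimer;          /* write detailed real and CPU time to the log */

proc mi data=thedata(obs=1000) nimpute=1 seed=12345 out=theoutput;
   var score1-score18;       /* replace with the actual variable list */
   fcs reg;                  /* regression method instead of REGPMM */
run;
```

If this finishes quickly, increasing OBS= in steps and comparing the times reported in the log gives a rough idea of how the full run will scale.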


Hi,

I'm having a similar issue with a dataset with 750,000 observations. I'm only imputing 6 variables, and I'm using REG for four of those. However, I wanted to use REGPMM for two attendance rate (percentage) variables because using REG results in numbers over 100. I've tried using the MAX option before but that didn't work.

Is there any way to increase the efficiency of this operation? If not, is there a way to estimate the length of time that is needed? I've tried gradually increasing the number of observations, but the effect on computation time doesn't seem linear.

Thanks,

Bill


Please post your PROC MI statements.


```
PROC MI DATA = HSPSTUID(OBS = 50000) NIMPUTE = 1 OUT = HSPSTU_MI1;
CLASS ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
FCS NBITER = 1 REGPMM ( ATTPCTROLG08MI);
FCS NBITER = 1 REG (ELASSCZG08MI);
FCS NBITER = 1 REG (MTHSSCZG07MI);
FCS NBITER = 1 REGPMM ( ATTPCTROLG07MI);
FCS NBITER = 1 REG (ELASSCZG07MI);
FCS NBITER = 1 REG (MTHSSCZG08MI);
VAR ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P ATTPCTROLG08MI ELASSCZG08MI MTHSSCZG07MI ATTPCTROLG07MI ELASSCZG07MI MTHSSCZG08MI;
WHERE IN_MI = 1;
RUN;
```


How many levels are in the categorical variables ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P?

Run

```
PROC freq DATA = HSPSTUID(OBS = 50000);
tables ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
run;
```

or run PROC GLM with those explanatory variables and look at the ClassLevels table
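For the PROC GLM route, a sketch along these lines should print only the class-level table. Using ELASSCZG08MI as the response is an arbitrary choice for illustration; any numeric response variable in the data set would do.

```sas
/* Sketch: show only the "Class Level Information" table from PROC GLM. */
/* The response variable ELASSCZG08MI is an arbitrary choice for illustration. */
ods select ClassLevels;
proc glm data=HSPSTUID(obs=50000);
   class ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
   model ELASSCZG08MI = ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
run; quit;
```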


ETHCAT has 5 levels and the other three variables have 2 each, so I suppose there are 40 groups in total (5 x 2 x 2 x 2 = 40).

The FREQ Procedure

ETHCAT | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
2: ASIAN/PACIFIC ISLANDER | 8496 | 16.99 | 8496 | 16.99 |
3: HISPANIC | 19228 | 38.46 | 27724 | 55.45 |
4: BLACK, NON-HISPANIC | 14359 | 28.72 | 42083 | 84.17 |
5: WHITE, NON-HISPANIC | 7300 | 14.60 | 49383 | 98.77 |
8: OTHER RACE/ETH | 617 | 1.23 | 50000 | 100.00 |

GENCAT | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
1: FEMALE | 25793 | 51.59 | 25793 | 51.59 |
2: MALE | 24207 | 48.41 | 50000 | 100.00 |

OVERAGEG09 | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 39900 | 79.80 | 39900 | 79.80 |
100 | 10100 | 20.20 | 50000 | 100.00 |

REGMTHP65G08P | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 39062 | 78.12 | 39062 | 78.12 |
100 | 10938 | 21.88 | 50000 | 100.00 |


It does take a long time. For some simulated data, I predict (on my PC) that 50k observations will take about 107 seconds per imputation. Here is the simulated data I used:

```
/* simulate the data */
data HSPSTUID;
call streaminit(123);
do i = 1 to 50000;
ETHCAT = rand("Table", 0.17, 0.38, 0.29, 0.14, 0.02);
GENCAT = rand("Table", 0.51, 0.49);
OVERAGEG09 = 100*rand("BERN", 0.2);
REGMTHP65G08P = 100*rand("BERN", 0.22);
ATTPCTROLG08MI = rand("Normal");
ELASSCZG08MI = rand("Normal");
MTHSSCZG07MI = rand("Normal");
ATTPCTROLG07MI = rand("Normal");
ELASSCZG07MI = rand("Normal");
MTHSSCZG08MI = rand("Normal");
NBITER = ETHCAT + GENCAT - OVERAGEG09/10 - REGMTHP65G08P/10
+ ATTPCTROLG08MI - ELASSCZG08MI + MTHSSCZG07MI - ATTPCTROLG07MI
+ ELASSCZG07MI - MTHSSCZG08MI;
/* now introduce missing at random */
if rand("bern",0.1) then ATTPCTROLG08MI = .;
if rand("bern",0.05) then ELASSCZG08MI = .;
if rand("bern",0.05) then MTHSSCZG07MI = .;
if rand("bern",0.07) then ATTPCTROLG07MI = .;
if rand("bern",0.01) then ELASSCZG07MI = .;
if rand("bern",0.2) then MTHSSCZG08MI = .;
output;
end;
run;
```

I then ran PROC MI for OBS=5000, OBS=10000, etc, until I got tired of waiting. The code I ran was

```
%let N = 5000; /* change this number: 5000, 10000, 15000, ... */
PROC MI DATA = HSPSTUID(OBS = &N) NIMPUTE = 1 OUT = HSPSTU_MI1
displaypattern=nomeans;
CLASS ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
FCS NBITER = 1 REGPMM ( ATTPCTROLG08MI);
FCS NBITER = 1 REG (ELASSCZG08MI);
FCS NBITER = 1 REG (MTHSSCZG07MI);
FCS NBITER = 1 REGPMM ( ATTPCTROLG07MI);
FCS NBITER = 1 REG (ELASSCZG07MI);
FCS NBITER = 1 REG (MTHSSCZG08MI);
VAR ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P ATTPCTROLG08MI ELASSCZG08MI MTHSSCZG07MI ATTPCTROLG07MI ELASSCZG07MI MTHSSCZG08MI;
RUN;
```
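Rather than editing `&N` by hand, the same step could be wrapped in a macro loop that records the elapsed time for each size. This is only a sketch: the macro name `TimeMI` and the output data set `_mi_tmp` are hypothetical, and NOPRINT is used in place of DISPLAYPATTERN=NOMEANS to suppress the displayed output, so the timings may differ slightly from a run that prints tables.

```sas
/* Sketch: time the PROC MI step at several sample sizes.        */
/* TimeMI and _mi_tmp are hypothetical names used for this demo. */
%macro TimeMI(sizes=5000 10000 15000);
   %local i n t0 elapsed;
   %let i = 1;
   %let n = %scan(&sizes, &i, %str( ));
   %do %while (%length(&n) > 0);
      %let t0 = %sysfunc(datetime());   /* wall-clock time before the step */
      PROC MI DATA = HSPSTUID(OBS = &n) NIMPUTE = 1 OUT = _mi_tmp NOPRINT;
         CLASS ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P;
         FCS NBITER = 1 REGPMM (ATTPCTROLG08MI);
         FCS NBITER = 1 REG (ELASSCZG08MI);
         FCS NBITER = 1 REG (MTHSSCZG07MI);
         FCS NBITER = 1 REGPMM (ATTPCTROLG07MI);
         FCS NBITER = 1 REG (ELASSCZG07MI);
         FCS NBITER = 1 REG (MTHSSCZG08MI);
         VAR ETHCAT GENCAT OVERAGEG09 REGMTHP65G08P ATTPCTROLG08MI ELASSCZG08MI
             MTHSSCZG07MI ATTPCTROLG07MI ELASSCZG07MI MTHSSCZG08MI;
      RUN;
      /* the step has finished by the time the macro reaches this line */
      %let elapsed = %sysevalf(%sysfunc(datetime()) - &t0);
      %put NOTE: TimeMI size=&n elapsed=&elapsed seconds;
      %let i = %eval(&i + 1);
      %let n = %scan(&sizes, &i, %str( ));
   %end;
%mend TimeMI;

%TimeMI(sizes=5000 10000 15000)
```

The elapsed times are written to the log as NOTE lines and can be copied into a data set like the one used for the timings.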

Here are the timings from my PC. The total time appears to be quadratic in the sample size:

```
/* results of running PROC MI with different sample sizes */
data Timing;
input size Time;
datalines;
500 0.17
5000 1.33
8000 2.91
10000 4.52
15000 9.38
20000 16.69
25000 27.2
30000 38.23
40000 .
50000 .
;
proc glm data=Timing;
model Time = size size*size;
output out=GLMOut P=pred;
run; quit;
proc print data=GLMOut;run;
```
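As a quick cross-check on the quadratic model, the largest measured timing can be scaled by the squared ratio of sample sizes. This is a back-of-the-envelope sketch that assumes pure n**2 growth and ignores the linear and constant terms that the PROC GLM fit includes.

```sas
/* Sketch: scale the 30,000-observation timing by (n/30000)**2. */
/* Assumes pure quadratic growth; lower-order terms are ignored. */
data _null_;
   t30k = 38.23;                     /* seconds at 30,000 obs, from the timings above */
   do n = 40000, 50000;
      est = t30k * (n / 30000)**2;   /* estimated seconds at n observations */
      put n= est= 8.1;
   end;
run;
```

For 50,000 observations this should give roughly 106 seconds, in line with the prediction from the fitted model.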


Thanks, this was really helpful. I always appreciate your help on SAS communities as well as your blog posts.
