BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
yuxinni
Calcite | Level 5

Hi!

I am running an exact-matching using the greedy algorithm. I ran the macro code that I found it online and there was the error "insufficient memory to data step program. The SAS system stopped processing this step because of insufficient memory."  I doubled my sas memory to 4g but the error persisted. 

 

I stumbled upon a post detailing a similar issue at this link:

https://communities.sas.com/t5/SAS-Programming/Hash-object-with-Insufficient-memory-to-execute-data-...

 As a SAS beginner, I find myself unable to modify my code effectively to reduce the memory demands of the hash object. Any suggestions or coding help would be highly appreciated!

 

Here is the code:

%MACRO MATCH(CASE=,CONTROL=,IDCA=,IDCO=,MVARS=,WTS=,DMAXK=,DMAX=,
             NCONTLS=1, TIME=,
             METHOD=,SEEDCA=,SEEDCO=,MAXITER=100000,PRINT=y,
             OUT=__OUT,OUTNMCA=__NMCA,OUTNMCO=__NMCO,MINCONT=,MAXCONT=);
 
   %LET BAD=0;
   %IF %LENGTH(&CASE)=0 %THEN %DO;
      %PUT ERROR: NO CASE DATASET SUPPLIED;
      %LET BAD=1;
   %END;
   %IF %LENGTH(&CONTROL)=0 %THEN %DO;
      %PUT ERROR: NO CONTROL DATASET SUPPLIED;
      %LET BAD=1;
   %END;
   %IF %LENGTH(&IDCA)=0 %THEN %DO;
      %PUT ERROR: NO IDCA VARIABLE SUPPLIED;
      %LET BAD=1;
   %END;
   %IF %LENGTH(&IDCO)=0 %THEN %DO;
      %PUT ERROR: NO IDCO VARIABLE SUPPLIED;
      %LET BAD=1;
   %END;
   %IF %LENGTH(&MVARS)=0 %THEN %DO;
      %PUT ERROR: NO MATCHING VARIABLES SUPPLIED;
      %LET BAD=1;
   %END;
   %IF %LENGTH(&WTS)=0 %THEN %DO;
      %PUT ERROR: NO WEIGHTS SUPPLIED;
      %LET BAD=1;
   %END;
   %IF %UPCASE(&METHOD)=GREEDY %THEN %DO;
      %IF %LENGTH(&SEEDCA)=0 %THEN %DO;
         %PUT ERROR: NO SEEDCA VALUE SUPPLIED;
         %LET BAD=1;
      %END;
      %IF %LENGTH(&SEEDCO)=0 %THEN %DO;
         %PUT ERROR: NO SEEDCO VALUE SUPPLIED;
         %LET BAD=1;
      %END;
   %END;
   %IF %LENGTH(&OUT)=0 %THEN %DO;
      %PUT ERROR: NO OUTPUT DATASET SUPPLIED;
      %LET BAD=1;
   %END;
   %IF %UPCASE(&METHOD)^=GREEDY & %UPCASE(&METHOD)^=OPTIMAL %THEN %DO;
      %PUT ERROR: METHOD MUST BE GREEDY OR OPTIMAL;
      %LET BAD=1;
   %END;
   %IF (&MINCONT=  AND &MAXCONT^= ) OR (&MINCONT^=  AND &MAXCONT= )
   %THEN %DO;
      %PUT ERROR: MINCONT AND MAXCONT MUST BOTH BE SPECIFIED;
      %LET BAD=1;
   %END;
   %LET NVAR=0;
   %DO %UNTIL(%SCAN(&MVARS,&NVAR+1,' ')= );
      %LET NVAR=%EVAL(&NVAR+1);
   %END;
   %LET NWTS=0;
   %DO %UNTIL(%QSCAN(&WTS,&NWTS+1,' ')= );
      %LET NWTS=%EVAL(&NWTS+1);
   %END;
   %IF &NVAR^= &NWTS %THEN %DO;
      %PUT ERROR: #VARS MUST EQUAL #WTS;
      %LET BAD=1;
   %END;
   %LET NK=0;
   %IF %QUOTE(&DMAXK)^=  %THEN %DO %UNTIL(%QSCAN(&DMAXK,&NK+1,' ')= );
      %LET NK=%EVAL(&NK+1);
   %END;
   %IF &NK>&NVAR %THEN %LET NK=&NVAR;
   %DO I=1 %TO &NVAR;
      %LET V&I=%SCAN(&MVARS,&I,' ');
   %END;
   %IF &NWTS>0 %THEN %DO;
        DATA _NULL_;
        %DO I=1 %TO &NWTS;
             %LET W&I=%SCAN(&WTS,&I,' ');
             IF &&W&I<0 THEN DO;
                  PUT 'ERROR: WEIGHTS MUST BE NON-NEGATIVE';
                  CALL SYMPUT('BAD','1');
             END;
        %END;
        RUN;
   %END;
   %IF &NK>0 %THEN %DO;
        DATA _NULL_;
        %DO I=1 %TO &NK;
             %LET K&I=%SCAN(&DMAXK,&I,' ');
             IF &&K&I<0 THEN DO;
                  PUT 'ERROR: DMAXK VALUES MUST BE NON-NEGATIVE';
                  CALL SYMPUT('BAD','1');
             END;
        %END;
        RUN;
   %END;
   %MACRO DIJ;
      %DO I=1 %TO &NVAR-1;
         &&W&I*ABS(__CA&I-__CO&I) +
      %END;
      &&W&NVAR*ABS(__CA&NVAR-__CO&NVAR);
   %MEND DIJ;
   %MACRO MAX1;
      %IF &DMAX^= %THEN %DO;
         & __D<=&DMAX
      %END;
      %DO I=1 %TO &NK;
         & ABS(__CA&I-__CO&I)<=&&K&I
      %END;
   %MEND MAX1;
   %MACRO MAX2;
      %IF &DMAX= & &NK=0 %THEN %DO;
         %IF &time^= %then %do;
            if __cotime>__catime then
         %end;
         output;
      %end;
      %IF &DMAX^= & &NK=0 %THEN %DO;
         IF _COST_<=&DMAX
         %if &time^= %then %do;
            & __cotime>__catime
         %end;
         THEN OUTPUT;
      %END;
      %IF &DMAX= & &NK>0 %THEN %DO;
         IF ABS(__CA1-__CO1)<=&K1
         %DO I=2 %TO &NK;
            & ABS(__CA&I-__CO&I)<=&&K&I
         %END;
         %if &time^= %then %do;
            & __cotime>__catime
         %end;
         THEN OUTPUT;
      %END;
      %IF &DMAX^= & &NK>0 %THEN %DO;
         IF _COST_<=&DMAX
         %DO I=1 %TO &NK;
            & ABS(__CA&I-__CO&I)<=&&K&I
         %END;
         %if &time^= %then %do;
            & __cotime>__catime
         %end;
         THEN OUTPUT;
      %END;
   %MEND MAX2;
   %MACRO LBLS;
      %DO I=1 %TO &NVAR;
         __CA&I="&&V&I/CASE"
         __CO&I="&&V&I/CONTROL"
         __DIF&I="&&V&I/ABS. DIFF "
         __WT&I="&&V&I/WEIGHT"
      %END;
   %MEND LBLS;
   %MACRO VBLES;
      %DO I=1 %TO &NVAR;
         __DIF&I
      %END;
      %DO I=1 %TO &NVAR;
         __CA&I __CO&I
      %END;
   %MEND VBLES;
   %MACRO GREEDY;
    %GLOBAL BAD2;
      DATA __CASE; SET &CASE;
           %DO I=1 %TO &NVAR;
                %LET MISSTEST=%SCAN(&MVARS,&I,' ');
                IF &MISSTEST=. THEN DELETE;
           %END;
           %IF &TIME^= %THEN %DO;
                IF &TIME=. THEN DELETE;
           %END;
      DATA __CASE; SET __CASE END=EOF;
       KEEP __IDCA __CA1-__CA&NVAR __R &mvars
       %if &time^= %then %do;
             __catime
          %end;
          ;
         __IDCA=&IDCA;
         %if &time^= %then %do;
            __catime=&time;
         %end;
         %DO I=1 %TO &NVAR;
            __CA&I=&&V&I;
         %END;
         SEED=&SEEDCA;
         __R=RANUNI( SEED  );
         IF EOF THEN CALL SYMPUT('NCA',_N_);
      PROC SORT; BY __R __IDCA;
      DATA __CONT; SET &CONTROL;
           %DO I=1 %TO &NVAR;
                %LET MISSTEST=%SCAN(&MVARS,&I,' ');
                IF &MISSTEST=. THEN DELETE;
           %END;
           %IF &TIME^= %THEN %DO;
                IF &TIME=. THEN DELETE;
           %END;
      DATA __CONT; SET __CONT END=EOF;
       KEEP __IDCO __CO1-__CO&NVAR __R &mvars
        %if &time^= %then %do;
           __cotime
        %end;
        ;
         __IDCO=&IDCO;
         %if &time^= %then %do;
            __cotime=&time;
         %end;
         %DO I=1 %TO &NVAR;
            __CO&I=&&V&I;
         %END;
         SEED=&SEEDCO;
         __R=RANUNI( SEED  );
         IF EOF THEN CALL SYMPUT('NCO',_N_);
      RUN;
      %LET BAD2=0;
      %IF &NCO < %EVAL(&NCA*&NCONTLS) %THEN %DO;
         %PUT ERROR: NOT ENOUGH CONTROLS TO MAKE REQUESTED MATCHES;
         %LET BAD2=1;
      %END;
      %IF &BAD2=0 %THEN %DO;
         PROC SORT; BY __R __IDCO;
         DATA __MATCH;
          KEEP __IDCA __CA1-__CA&NVAR __DIJ __MATCH __CONT_N
          %if &time^= %then %do;
             __catime __cotime
          %end;
          ;
          ARRAY __USED(&NCO) $ 1 _TEMPORARY_;
            DO __I=1 TO &NCO;
               __USED(__I)='0';
            END;
            DO __I=1 TO &NCONTLS;
               DO __J=1 TO &NCA;
                  SET __CASE POINT=__J;
                  __SMALL=.;
                  __MATCH=.;
                  DO __K=1 TO &NCO;
                     IF __USED(__K)='0' THEN DO;
                        SET __CONT POINT=__K;
                        __D=%DIJ
                        IF __d^=. & (__SMALL=. | __D<__SMALL) %MAX1
                        %if &time^= %then %do;
                           & __cotime > __catime
                        %end;
                        THEN DO;
                           __SMALL=__D;
                           __MATCH=__K;
                           __DIJ=__D;
                           __CONT_N=__I;
                        END;
                     END;
                  END;
                  IF __MATCH^=. THEN DO;
                     __USED(__MATCH)='1';
                     OUTPUT;
                  END;
               END;
            END;
            STOP;
         DATA &OUT;
          SET __MATCH;
          SET __CONT POINT=__MATCH;
          KEEP __IDCA __IDCO __CONT_N __DIJ __CA1-__CA&NVAR
               __CO1-__CO&NVAR __DIF1-__DIF&NVAR __WT1-__WT&NVAR
               %if &time^= %then %do;
                  __catime __cotime
               %end;
        ;
          LABEL __IDCA="&IDCA/CASE"
                __IDCO="&IDCO/CONTROL"
                %if &time^= %then %do;
                   __catime="&time/CASE"
                   __cotime="&time/CONTROL"
                %end;
                __CONT_N='CONTROL/NUMBER'
                __DIJ='DISTANCE/D_IJ'
                %LBLS;
             %DO I=1 %TO &NVAR;
                __DIF&I=abs(__CA&I-__CO&I);
                __WT&I=&&W&I;
             %END;
      %END;
   %MEND GREEDY;
   %IF &BAD=0 %THEN %DO;
      %IF %UPCASE(&METHOD)=GREEDY %THEN %DO;
         %GREEDY
      %END;
      %ELSE %DO;
         %OPTIMAL
      %END;
      %IF &BAD2=0 %THEN %DO;
         PROC SORT DATA=&OUT; BY __IDCA __CONT_N;
         proc sort data=__case; by __IDCA;
         data &outnmca; merge __case
              &out(in=__inout where=(__cont_n=1)); by __idca;
              if __inout=0; **non-matches;
 
         proc sort data=__cont; by __IDCO;
         proc sort data=&out; by __IDCO;
         data &outnmco; merge __cont
              &out(in=__inout); by __idco;
              if __inout=0; **non-matched controls;
         proc sort data=&out; by __IDCA; **re-sort by case id;
 
         %if %upcase(&print)=Y %then %do;
         PROC PRINT data=&out LABEL SPLIT='/';
          VAR __IDCA __IDCO __CONT_N
          %if &time^= %then %do;
             __catime __cotime
          %end;
          __DIJ %VBLES;
          sum __dij;
         title9'Data listing for matched cases and controls';
         footnote
    "match macro: case=&case control=&control idca=&idca idco=&idco";
         footnote2
"   mvars=&mvars  wts=&wts dmaxk=&dmaxk dmax=&dmax ncontls=&ncontls";
         %if &time^= %then %do;
  footnote3"time=&time  method=&method  seedca=&seedca  seedco=&seedco";
         %end;
         %else %do;
           footnote3"   method=&method  seedca=&seedca  seedco=&seedco";
         %end;
         footnote4"   out=&out   outnmca=&outnmca  outnmco=&outnmco";
         run;
         title9'Summary data for matched cases and controls';
         proc means data=&out n mean sum min max; class __cont_n;
          var __dij
           %if &nvar >=2 %then %do; __dif1-__dif&nvar  __ca1-__ca&nvar
                             %if &time^= %then %do;
                                __catime
                             %end;
                             __co1-__co&nvar
                             %if &time^= %then %do;
                                __cotime
                             %end;
                             ;
           %end;
           %else %do;
                             __dif1 __ca1
                             %if &time^= %then %do;
                                __catime
                             %end;
                             __co1
                             %if &time^= %then %do;
                                __cotime
                             %end;
                             ;
           %end;
         run;
         proc means data=&outnmca n mean sum min max; var &mvars;
         title9'Summary data for NON-matched cases';
         run;
         proc means data=&outnmco n mean sum min max; var &mvars;
         title9'Summary data for NON-matched controls';
         run;
         %end;
      %END;
   %END;
    title9; footnote;
    run;
%MEND MATCH;
 

  
%match(case=match.trans20_ccaea_case, control=match.ccae20_controls02, idca=MATCHID, idco=ENROLID, ncontls=3,
		mvars=AGE, maxiter=1, wts=1, dmaxk=0, out=match, method=greedy,seedca=2,seedco=2);

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

@yuxinni wrote:
Yes, the control pool has ~41 million. I am trying to do the exact matching (1:3) based on the cases' age.

If AGE is the only criteria then there is need for all of the complex code.

Add a random variable to the CONTROL group and sort by AGE and this random variable.  

Do something similar to the CASE group, only replicate the cases 3 times so you can match.  In this case sort by the AGE, the repetition number, and the random number.

Now match them up.

proc sql ;
create view _controls as
  select control_id,age, rand('uniform') as random1, control_id
  from controls
  order by age,random1
;
quit;
data _cases;
  set cases;
  do rep=1 to 3;
     random2=rand('uniform');
     output;
  end;
  keep case_id age rep random2;
run;
proc sort data=_cases;
  by age rep random2;
run;

data want;
  merge _cases(in=in1) _controls(in=in2);
  by age ;
  if in1 and in2 then output;
  call missing(in1,in2);
run;

View solution in original post

18 REPLIES 18
mkeintz
PROC Star

There is no call to use explicit memory-resident hash objects in your macro code (i.e. the word "hash" is not in your code), so it is not likely that your problem is similar to a problem entitled "Hash object with Insufficient memory to execute data step program".

 

But to help resolve your problem, please

  1. Provide the explicit error message you received.
  2. Help us determine how far your multi-step program advanced.  To do this
    1. make sure you have set
         options mprint;
      before runnig the code.  This will print statements generated by the macro to your sas log
    2. show us the log.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
yuxinni
Calcite | Level 5

Hi! Thanks for the quick reply! My error msg shows "FATAL: Insufficient memory to execute DATA step program. Aborted during the COMPILATION phase. ERROR: The SAS System stopped processing this step because of insufficient memory." Here is my log:

NOTE: PROCEDURE PRINTTO used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      


NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      


NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      


NOTE: There were 1334 observations read from the data set MATCH.TRANS08_CCAEA_CASE.
NOTE: The data set WORK.__CASE has 1334 observations and 91 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
      


NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
      12:94   
NOTE: There were 1334 observations read from the data set WORK.__CASE.
NOTE: The data set WORK.__CASE has 1334 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
      


NOTE: There were 1334 observations read from the data set WORK.__CASE.
NOTE: The data set WORK.__CASE has 1334 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
      


NOTE: There were 40743335 observations read from the data set MATCH.CCAE08_CONTROLS02.
NOTE: The data set WORK.__CONT has 40743335 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           7.97 seconds
      cpu time            4.50 seconds
      


NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
      23:94   
NOTE: There were 40743335 observations read from the data set WORK.__CONT.
NOTE: The data set WORK.__CONT has 40743335 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           3.29 seconds
      cpu time            2.86 seconds
      


NOTE: There were 40743335 observations read from the data set WORK.__CONT.
NOTE: The data set WORK.__CONT has 40743335 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           17.63 seconds
      cpu time            39.92 seconds
      

FATAL: Insufficient memory to execute DATA step program. Aborted during the COMPILATION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      


ERROR: File WORK.__MATCH.DATA does not exist.

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set MATCH.MATCH may be incomplete.  When this step was stopped there were 0 
         observations and 8 variables.
WARNING: Data set MATCH.MATCH was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           0.11 seconds
      cpu time            0.01 seconds
      



NOTE: Input data set is empty.
NOTE: The data set MATCH.MATCH has 0 observations and 8 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.03 seconds
      cpu time            0.03 seconds
      


NOTE: There were 1334 observations read from the data set WORK.__CASE.
NOTE: The data set WORK.__CASE has 1334 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds
      


NOTE: There were 1334 observations read from the data set WORK.__CASE.
NOTE: There were 0 observations read from the data set MATCH.MATCH.
      WHERE __cont_n=1;
NOTE: The data set WORK.__NMCA has 1334 observations and 10 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
      

NOTE: There were 40743335 observations read from the data set WORK.__CONT.
NOTE: The data set WORK.__CONT has 40743335 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           18.51 seconds
      cpu time            44.03 seconds
      

NOTE: Input data set is empty.
NOTE: The data set MATCH.MATCH has 0 observations and 8 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds
      


NOTE: There were 40743335 observations read from the data set WORK.__CONT.
NOTE: There were 0 observations read from the data set MATCH.MATCH.
NOTE: The data set WORK.__NMCO has 40743335 observations and 10 variables.
NOTE: DATA statement used (Total process time):
      real time           6.46 seconds
      cpu time            6.48 seconds
      


NOTE: Input data set is empty.
NOTE: The data set MATCH.MATCH has 0 observations and 8 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.04 seconds
      cpu time            0.00 seconds
      


NOTE: No observations in data set MATCH.MATCH.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      


NOTE: No observations in data set MATCH.MATCH.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds
      


NOTE: There were 1334 observations read from the data set WORK.__NMCA.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           0.03 seconds
      cpu time            0.03 seconds
      


NOTE: There were 40743335 observations read from the data set WORK.__NMCO.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           1.85 seconds
      cpu time            7.04 seconds
      
Tom
Super User Tom
Super User

Without the MPRINT option we cannot tell what the data step that is generating the error is actually trying to do.

yuxinni
Calcite | Level 5

I see... Here is the log with print:

NOTE: The macro generated output from MPRINT will also be written to external file
      D:\DukeProject\match\&logfile..log while OPTIONS MPRINT and MFILE are set.
MPRINT(MATCH):   IF 1<0 THEN DO;
MPRINT(MATCH):   PUT 'ERROR: WEIGHTS MUST BE NON-NEGATIVE';
MPRINT(MATCH):   CALL SYMPUT('BAD','1');
MPRINT(MATCH):   END;
MPRINT(MATCH):   RUN;

NOTE: DATA statement used (Total process time):
      real time           0.05 seconds
      cpu time            0.01 seconds


MPRINT(MATCH):   DATA _NULL_;
MPRINT(MATCH):   IF 0<0 THEN DO;
MPRINT(MATCH):   PUT 'ERROR: DMAXK VALUES MUST BE NON-NEGATIVE';
MPRINT(MATCH):   CALL SYMPUT('BAD','1');
MPRINT(MATCH):   END;
MPRINT(MATCH):   RUN;

NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds


MPRINT(GREEDY):   DATA __CASE;
MPRINT(GREEDY):   SET match.trans08_ccaea_case;
MPRINT(GREEDY):   IF AGE=. THEN DELETE;

NOTE: There were 1334 observations read from the data set MATCH.TRANS08_CCAEA_CASE.
NOTE: The data set WORK.__CASE has 1334 observations and 91 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds


MPRINT(GREEDY):   DATA __CASE;
MPRINT(GREEDY):   SET __CASE END=EOF;
MPRINT(GREEDY):   KEEP __IDCA __CA1-__CA1 __R AGE ;
MPRINT(GREEDY):   __IDCA=MATCHID;
MPRINT(GREEDY):   __CA1=AGE;
MPRINT(GREEDY):   SEED=2;
MPRINT(GREEDY):   __R=RANUNI( SEED );
MPRINT(GREEDY):   IF EOF THEN CALL SYMPUT('NCA',_N_);

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
      12:94
NOTE: There were 1334 observations read from the data set WORK.__CASE.
NOTE: The data set WORK.__CASE has 1334 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.35 seconds
      cpu time            0.07 seconds


MPRINT(GREEDY):   PROC SORT;
MPRINT(GREEDY):   BY __R __IDCA;

NOTE: There were 1334 observations read from the data set WORK.__CASE.
NOTE: The data set WORK.__CASE has 1334 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


MPRINT(GREEDY):   DATA __CONT;
MPRINT(GREEDY):   SET match.ccae08_controls02;
MPRINT(GREEDY):   IF AGE=. THEN DELETE;

NOTE: There were 40743335 observations read from the data set MATCH.CCAE08_CONTROLS02.
NOTE: The data set WORK.__CONT has 40743335 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           7.66 seconds
      cpu time            4.65 seconds


MPRINT(GREEDY):   DATA __CONT;
MPRINT(GREEDY):   SET __CONT END=EOF;
MPRINT(GREEDY):   KEEP __IDCO __CO1-__CO1 __R AGE ;
MPRINT(GREEDY):   __IDCO=ENROLID;
MPRINT(GREEDY):   __CO1=AGE;
MPRINT(GREEDY):   SEED=2;
MPRINT(GREEDY):   __R=RANUNI( SEED );
MPRINT(GREEDY):   IF EOF THEN CALL SYMPUT('NCO',_N_);
MPRINT(GREEDY):   RUN;

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
      23:94
NOTE: There were 40743335 observations read from the data set WORK.__CONT.
NOTE: The data set WORK.__CONT has 40743335 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           3.42 seconds
      cpu time            2.89 seconds


MPRINT(GREEDY):   PROC SORT;
MPRINT(GREEDY):   BY __R __IDCO;

NOTE: There were 40743335 observations read from the data set WORK.__CONT.
NOTE: The data set WORK.__CONT has 40743335 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           16.80 seconds
      cpu time            43.64 seconds


MPRINT(GREEDY):   DATA __MATCH;
MPRINT(GREEDY):   KEEP __IDCA __CA1-__CA1 __DIJ __MATCH __CONT_N ;
MPRINT(GREEDY):   ARRAY __USED( 40743335) $ 1 _TEMPORARY_;
FATAL: Insufficient memory to execute DATA step program. Aborted during the COMPILATION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.03 seconds

MPRINT(GREEDY):   DO __I=1 TO 40743335;
MPRINT(GREEDY):   __USED(__I)='0';
MPRINT(GREEDY):   END;
MPRINT(GREEDY):   DO __I=1 TO 3;
MPRINT(GREEDY):   DO __J=1 TO 1334;
MPRINT(GREEDY):   SET __CASE POINT=__J;
MPRINT(GREEDY):   __SMALL=.;
MPRINT(GREEDY):   __MATCH=.;
MPRINT(GREEDY):   DO __K=1 TO 40743335;
MPRINT(GREEDY):   IF __USED(__K)='0' THEN DO;
MPRINT(GREEDY):   SET __CONT POINT=__K;
MPRINT(GREEDY):   __D=
MPRINT(DIJ):  1*ABS(__CA1-__CO1);
MPRINT(GREEDY):   IF __d^=. & (__SMALL=. | __D<__SMALL)
MPRINT(MAX1):   & ABS(__CA1-__CO1)<=0
MPRINT(GREEDY):   THEN DO;
MPRINT(GREEDY):   __SMALL=__D;
MPRINT(GREEDY):   __MATCH=__K;
MPRINT(GREEDY):   __DIJ=__D;
MPRINT(GREEDY):   __CONT_N=__I;
MPRINT(GREEDY):   END;
MPRINT(GREEDY):   END;
MPRINT(GREEDY):   END;
MPRINT(GREEDY):   IF __MATCH^=. THEN DO;
MPRINT(GREEDY):   __USED(__MATCH)='1';
MPRINT(GREEDY):   OUTPUT;
MPRINT(GREEDY):   END;
MPRINT(GREEDY):   END;
MPRINT(GREEDY):   END;
MPRINT(GREEDY):   STOP;


MPRINT(GREEDY):   DATA match.match;
MPRINT(GREEDY):   SET __MATCH;
ERROR: File WORK.__MATCH.DATA does not exist.
MPRINT(GREEDY):   SET __CONT POINT=__MATCH;
MPRINT(GREEDY):   KEEP __IDCA __IDCO __CONT_N __DIJ __CA1-__CA1 __CO1-__CO1 __DIF1-__DIF1 __WT1-__WT1
;
MPRINT(GREEDY):   LABEL __IDCA= "MATCHID/CASE" __IDCO= "ENROLID/CONTROL" __CONT_N= 'CONTROL/NUMBER'
__DIJ=
MPRINT(LBLS):   'DISTANCE/D_IJ' __CA1= "AGE/CASE" __CO1= "AGE/CONTROL" __DIF1= "AGE/ABS. DIFF " __WT1=
MPRINT(GREEDY):   "AGE/WEIGHT";
MPRINT(GREEDY):   __DIF1=abs(__CA1-__CO1);
MPRINT(GREEDY):   __WT1=1;

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set MATCH.MATCH may be incomplete.  When this step was stopped there were 0
         observations and 8 variables.
WARNING: Data set MATCH.MATCH was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           0.08 seconds
      cpu time            0.04 seconds



MPRINT(MATCH):   PROC SORT DATA=match.match;
MPRINT(MATCH):   BY __IDCA __CONT_N;

NOTE: Input data set is empty.
NOTE: The data set MATCH.MATCH has 0 observations and 8 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


MPRINT(MATCH):   proc sort data=__case;
MPRINT(MATCH):   by __IDCA;

NOTE: There were 1334 observations read from the data set WORK.__CASE.
NOTE: The data set WORK.__CASE has 1334 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


MPRINT(MATCH):   data __NMCA;
MPRINT(MATCH):   merge __case match.match(in=__inout where=(__cont_n=1));
MPRINT(MATCH):   by __idca;
MPRINT(MATCH):   if __inout=0;
MPRINT(MATCH):   **non-matches;

NOTE: There were 1334 observations read from the data set WORK.__CASE.
NOTE: There were 0 observations read from the data set MATCH.MATCH.
      WHERE __cont_n=1;
NOTE: The data set WORK.__NMCA has 1334 observations and 10 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.03 seconds


MPRINT(MATCH):   proc sort data=__cont;
MPRINT(MATCH):   by __IDCO;
NOTE: There were 40743335 observations read from the data set WORK.__CONT.
NOTE: The data set WORK.__CONT has 40743335 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           17.19 seconds
      cpu time            45.45 seconds


MPRINT(MATCH):   proc sort data=match.match;
MPRINT(MATCH):   by __IDCO;
NOTE: Input data set is empty.
NOTE: The data set MATCH.MATCH has 0 observations and 8 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.08 seconds
      cpu time            0.07 seconds


MPRINT(MATCH):   data __NMCO;
MPRINT(MATCH):   merge __cont match.match(in=__inout);
MPRINT(MATCH):   by __idco;
MPRINT(MATCH):   if __inout=0;
MPRINT(MATCH):   **non-matched controls;

NOTE: There were 40743335 observations read from the data set WORK.__CONT.
NOTE: There were 0 observations read from the data set MATCH.MATCH.
NOTE: The data set WORK.__NMCO has 40743335 observations and 10 variables.
NOTE: DATA statement used (Total process time):
      real time           6.44 seconds
      cpu time            6.14 seconds


MPRINT(MATCH):   proc sort data=match.match;
MPRINT(MATCH):   by __IDCA;
MPRINT(MATCH):   **re-sort by case id;

NOTE: Input data set is empty.
NOTE: The data set MATCH.MATCH has 0 observations and 8 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds


NOTE: Writing HTML Body file: sashtml.htm
MPRINT(MATCH):   PROC PRINT data=match.match LABEL SPLIT='/';
MPRINT(MATCH):   VAR __IDCA __IDCO __CONT_N __DIJ
MPRINT(VBLES):   __DIF1 __CA1 __CO1
MPRINT(MATCH):  ;
MPRINT(MATCH):   sum __dij;
MPRINT(MATCH):   title9'Data listing for matched cases and controls';
MPRINT(MATCH):   footnote "match macro: case=match.trans08_ccaea_case control=match.ccae08_controls02
idca=MATCHID idco=ENROLID";
MPRINT(MATCH):   footnote2 "   mvars=AGE  wts=1 dmaxk=0 dmax= ncontls=3";
MPRINT(MATCH):   footnote3"   method=greedy  seedca=2  seedco=2";
MPRINT(MATCH):   footnote4"   out=match.match   outnmca=__NMCA  outnmco=__NMCO";
MPRINT(MATCH):   run;

NOTE: No observations in data set MATCH.MATCH.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.54 seconds
      cpu time            0.32 seconds


MPRINT(MATCH):   title9'Summary data for matched cases and controls';
MPRINT(MATCH):   proc means data=match.match n mean sum min max;
MPRINT(MATCH):   class __cont_n;
MPRINT(MATCH):   var __dij __dif1 __ca1 __co1 ;
MPRINT(MATCH):   run;

NOTE: No observations in data set MATCH.MATCH.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           0.03 seconds
      cpu time            0.04 seconds


MPRINT(MATCH):   proc means data=__NMCA n mean sum min max;
MPRINT(MATCH):   var AGE;
MPRINT(MATCH):   title9'Summary data for NON-matched cases';
MPRINT(MATCH):   run;

NOTE: There were 1334 observations read from the data set WORK.__NMCA.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds


MPRINT(MATCH):   proc means data=__NMCO n mean sum min max;
MPRINT(MATCH):   var AGE;
MPRINT(MATCH):   title9'Summary data for NON-matched controls';
MPRINT(MATCH):   run;

NOTE: There were 40743335 observations read from the data set WORK.__NMCO.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           1.84 seconds
      cpu time            6.78 seconds


MPRINT(MATCH):   title9;
MPRINT(MATCH):   footnote;
MPRINT(MATCH):   run;

Tom
Super User Tom
Super User

Your algorithm is not compatible with such a large dataset.

 

It is attempting to define almost 41 million variables.

MPRINT(GREEDY):   DATA __MATCH;
MPRINT(GREEDY):   KEEP __IDCA __CA1-__CA1 __DIJ __MATCH __CONT_N ;
MPRINT(GREEDY):   ARRAY __USED( 40743335) $ 1 _TEMPORARY_;
FATAL: Insufficient memory to execute DATA step program. Aborted during the COMPILATION phase.
ERROR: The SAS System stopped processing this step because of insufficient memory.
NOTE: DATA statement used (Total process time):
      real time           0.06 seconds
      cpu time            0.03 seconds
yuxinni
Calcite | Level 5
Ohhhh Thanks! I'm curious if there's a possibility to modify the code and make it compatible, or do you think it would be better to explore alternative methods for matching? Thank you.
Tom
Super User Tom
Super User

41 million is probably too many to put into a HASH object also.

 

So you really have 41 million controls?

 

What are you matching on?  Perhaps if you require an exact match on some attributes (like GENDER or AGE GROUP) then you could break the problem into smaller pieces and then you might not run into the memory limit.

 

yuxinni
Calcite | Level 5
Yes, the control pool has ~41 million. I am trying to do the exact matching (1:3) based on the cases' age.
Tom
Super User Tom
Super User

@yuxinni wrote:
Yes, the control pool has ~41 million. I am trying to do the exact matching (1:3) based on the cases' age.

If AGE is the only criteria then there is need for all of the complex code.

Add a random variable to the CONTROL group and sort by AGE and this random variable.  

Do something similar to the CASE group, only replicate the cases 3 times so you can match.  In this case sort by the AGE, the repetition number, and the random number.

Now match them up.

proc sql ;
create view _controls as
  select control_id,age, rand('uniform') as random1, control_id
  from controls
  order by age,random1
;
quit;
data _cases;
  set cases;
  do rep=1 to 3;
     random2=rand('uniform');
     output;
  end;
  keep case_id age rep random2;
run;
proc sort data=_cases;
  by age rep random2;
run;

data want;
  merge _cases(in=in1) _controls(in=in2);
  by age ;
  if in1 and in2 then output;
  call missing(in1,in2);
run;
yuxinni
Calcite | Level 5
Thanks for helping out and for the codes! It works well and it's fast! Appreciated!
Kurt_Bremser
Super User

Your control dataset has 40 million observations. The DATA step creating __MATCH tries to create an array with the same number of elements.

What is your MEMSIZE setting?

yuxinni
Calcite | Level 5
Mine is MEMSIZE=4294967296.
ballardw
Super User

For what it may be worth, MEMSIZE alone is not the limit.

I have about half the Memsize setting of the OP an a brief program on my system doesn't choke with that error:

7    data junk;
8       set sashelp.class;
9       array t (41000000) $1 _temporary_;
10   run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.JUNK has 19 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.42 seconds
      cpu time            0.40 seconds


11
12   proc options option=memsize;run;

    SAS (r) Proprietary Software Release 9.4  TS1M4

 MEMSIZE=2147483648
                   Specifies the limit on the amount of virtual memory
                   that can be used during a SAS session.
NOTE: PROCEDURE OPTIONS used (Total process time):

One starts suspecting other elements in the OP entire SAS environment. I am running a local version of SAS.

Perhaps OP is running in a networked environment that has limits from SAS Admin?

Patrick
Opal | Level 21

@ballardw What you are sharing is really interesting and also baffles me. I thought that a temporary array will get loaded into memory and though your sample code should require more than 2GB of memory.

I've tested the code you shared under SAS9.4M7 both with a local SAS under Windows and a server instance under RHEL. In both cases more than 2GB of memory is required for not getting an error during compilation time.

 

Another interesting observation:

With a 2GB memsize limit I can run below code without error.

options fullstimer;
data _null_;
  array t (36000000) $1 _temporary_ ;
run;

But if I populate the array vars then I get again the compiler error with the insufficient memory message.

options fullstimer;
data _null_;
  array t (36000000) $1 _temporary_ (36000000*'a');
run;

When I run above two version on my rhel server environment with a higher memsize limit then the SAS log tells me that the 2nd version consumes more memory (memory 2285623.59k).

 

Really curious why things appear to work for you. What are your other memory settings? Below are mine on the local Windows environment.

Patrick_0-1690750180823.png

 

 

 

 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 18 replies
  • 2104 views
  • 0 likes
  • 6 in conversation