Statistical Procedures

Programming the statistical procedures from SAS
BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Haris
Lapis Lazuli | Level 10

Thanks to Danny for this wonderful resource on how to score a fixed effects logistic Bayesian model:
https://communities.sas.com/t5/SAS-Communities-Library/Bayesian-Analysis-and-PROC-IML-What-s-the-Sco....

 

I need to take this one step further and add random intercepts to the process.  My IML skills need serious work so I had to take the output of the fixed effect scoring out of PROC IML, transpose them to create one tall file and then merge it with the Iteration by RandomIntercept matrix in the OutPost file from MCMC/BGLIMMIX.  I use vValueX() function to match each Level 2 ID to the corresponding column in the Random Intercept dataset. 

 

This workflow is straightforward with relatively light code.  Maybe someone can use it as is.  For my purposes it's too SLOW!  I am working with over 1,000 Level 2 units containing over 500 Level 1 Units each and 5,000 iterations seems like a minimum for Bayesian output.  Marginal Linear Predictor (mLP) takes seconds to compute even for this large dataset in PROC IML.  Adding Random intercepts to each mLP estimate to obtain Conditional Linear Predictors takes hours.  I am in an evironment where I need to do this on a regular basis for dozens of models so the time adds up very quickly.

I am hoping some good soul can translate what I have done with transpose/merge into PROC IML and that matrix addition can shave serious time off the process.  Thanks in advance.   

 

Simulated source data files and the workflow to add ConditionalLP = mLP + RI_L2 are in the code below.  L12 dataset is what comes out from the scoring effort in the blog reference above (minus the L2 unit IDs).

%let nL2   = 10; * Level 2 units					;
%let nL1   = 20; * L1 units per L2 unit on average 	;
%let nIter = 50; * MCMC iterations					;
data L12;
  do L2ID=1 to &nL2;
    do L1ID=1 to ceil(RanUni(1)*&nL1*2); * simulate variable number of L1 units per L2 unit ;
  	array mLR_Iter[&nIter]; do i=1 to &nIter; mLR_Iter[i] = RanNor(1); end; 
	output;
	end;
  end;
	drop i;
data OutPost; * Random Intercepts: one row per MCMC Iteration, one column per L2 unit ;
  do Iteration=1 to &nIter;
	array RI_L2[&nL2]; do i=1 to &nL2; RI_L2[i] = RanNor(1);end;
	output;
  end;
	drop i;
run;

*** 2. TRANSPOSE PT DATA to TALL AND MERGE WITH RANDOM INTERCEPTS						;
proc transpose data=L12 out=L12_Tall;
 by L2ID L1ID;
 var mLR_Iter:;
proc sql;
 create table L12_Tall as
 select input(substr(_Name_,9),best.) as Iteration, /* extract Iteration ID that starts with character 9	*/
  L2ID, L1ID, Col1 as mLP 							/* marginal linear predictor							*/
 from L12_Tall
 order by Iteration, L2ID, L1ID
;
quit;

data WantM;
  merge L12_Tall
		OutPost;
 by Iteration;

 ConditionalLP = mLP + vValueX(cats('RI_L2',L2ID));

 drop RI_L2:;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
webart999ARM
Quartz | Level 8

Hi,

 

I tried this with slightly higher Level 2, Level 1 and Iteration numbers. (400, 200, 700).
It works quite fast when I compared its cpu time, real-time, memory, etc. to the same numbers in the log of your provided version.

 

%let nL2   = 400;   /* Number of level-2 units */
%let nL1   = 200;   /* Maximum level-1 units per level-2 */
%let nIter = 700;   /* Number of simulation iterations */

/* Simulate hierarchical data structure with random intercepts */
data L12;
  do L2ID=1 to &nL2;                     
    do L1ID=1 to ceil(RanUni(1)*&nL1*2);  
      array mLR_Iter[&nIter];            /* Array for storing marginal predictors */
      do i=1 to &nIter; mLR_Iter[i] = RanNor(1); end; 
      output;
    end;
  end;
  drop i;
run;

/* Simulate posterior distribution of random intercepts */
data OutPost;                           
  do Iteration=1 to &nIter;             
    array RI_L2[&nL2];                  /* Array for random intercept values */
    do i=1 to &nL2; RI_L2[i] = RanNor(1); end;
    output;
  end;
  drop i;
run;

proc iml;
/* 1. Import random intercept matrix */
use OutPost;
  read all var _NUM_ into RI_Matrix[colname=varNames];
close OutPost;
RI_Matrix = RI_Matrix[, 2:ncol(RI_Matrix)];   /* Remove iteration counter column */

/* 2. Import observed data and predictors */
use L12;
  read all var {'L2ID', 'L1ID'} into IDs;     /* Group/unit identifiers */
  read all var _NUM_ into L12_Data;           /* Full dataset with predictors */
close L12;

/* 3. Initialize dimensions and ID vectors */
nIter = nrow(RI_Matrix);                      /* Total MCMC iterations */
nObs = nrow(IDs);                             /* Total observations in dataset */
L2ID = IDs[,1];                               /* Level 2 group membership */
L1ID = IDs[,2];                               /* Level 1 unit identifiers */

/* 4. Initialize output dataset */
create WantIML var {Iteration L2ID L1ID mLP ConditionalLP};

/* 5. Stream processing by iteration */
do i = 1 to nIter;                            /* Memory-efficient iteration loop */
  mLP = L12_Data[, 2+i];                      /* Extract predictors for current iteration */
  RI_i = RI_Matrix[i, ];                      /* Current iteration's random effects */
  
  ConditionalLP = mLP + RI_i[L2ID];           /* Calculate conditional linear predictor */
  Iteration = j(nObs, 1, i);                  /* Create iteration identifier vector */
  
  append var {Iteration L2ID L1ID mLP ConditionalLP}; /* Write current iteration data */
end;

close WantIML;
quit;


However I tried to execute the code with the numbers you mentioned initially (over 1,000 Level 2 units containing over 500 Level 1 Units each and 5,000), and it failed, because of the lack of memory (I have tested this code in SAS OnDemand)

 

 
 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 68         
 69         %let nL2   = 1000; * Level 2 units;
 70         %let nL1   = 500; * L1 units per L2 unit on average ;
 71         %let nIter = 5000; * MCMC iterations;
 72         data L12;
 73           do L2ID=1 to &nL2;
 74             do L1ID=1 to ceil(RanUni(1)*&nL1*2); * simulate variable number of L1 units per L2 unit ;
 75           array mLR_Iter[&nIter]; do i=1 to &nIter; mLR_Iter[i] = RanNor(1); end;
 76         output;
 77         end;
 78           end;
 79         drop i;
 
 ERROR: Insufficient space in file WORK.L12.DATA.
 ERROR: File WORK.L12.DATA is damaged. I/O processing did not complete.
 NOTE: The SAS System stopped processing this step because of errors.
 WARNING: The data set WORK.L12 may be incomplete.  When this step was stopped there were 270315 observations and 5002 variables.
 ERROR: Insufficient space in file WORK.L12.DATA.
 NOTE: DATA statement used (Total process time):
       real time           1:15.32
       user cpu time       1:01.92
       system cpu time     4.56 seconds
       memory              5080.40k
       OS Memory           25552.00k
       Timestamp           02/02/2025 02:58:51 AM
       Step Count                        31  Switch Count  42
       Page Faults                       0
       Page Reclaims                     1113
       Page Swaps                        0
       Voluntary Context Switches        24592
       Involuntary Context Switches      1937
       Block Input Operations            0
       Block Output Operations           23066656
       
 
 
 80         data OutPost; * Random Intercepts: one row per MCMC Iteration, one column per L2 unit ;
 81           do Iteration=1 to &nIter;
 82         array RI_L2[&nL2]; do i=1 to &nL2; RI_L2[i] = RanNor(1);end;
 83         output;
 84           end;
 85         drop i;
 86         run;
 
 ERROR: An I/O error has occurred on file WORK.OUTPOST.DATA.
 ERROR: UNIX errno = 122.
 NOTE: The SAS System stopped processing this step because of errors.
 WARNING: The data set WORK.OUTPOST was only partially opened and will not be saved.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              1299.15k
       OS Memory           22944.00k
       Timestamp           02/02/2025 02:58:51 AM
       Step Count                        32  Switch Count  0
       Page Faults                       0
       Page Reclaims                     29
       Page Swaps                        0
       Voluntary Context Switches        0
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           0
       
 
 87         
 88         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 98         

 

Hope  I could be helpful.

View solution in original post

3 REPLIES 3
webart999ARM
Quartz | Level 8

Hi,

 

I tried this with slightly higher Level 2, Level 1 and Iteration numbers. (400, 200, 700).
It works quite fast when I compared its cpu time, real-time, memory, etc. to the same numbers in the log of your provided version.

 

%let nL2   = 400;   /* Number of level-2 units */
%let nL1   = 200;   /* Maximum level-1 units per level-2 */
%let nIter = 700;   /* Number of simulation iterations */

/* Simulate hierarchical data structure with random intercepts */
data L12;
  do L2ID=1 to &nL2;                     
    do L1ID=1 to ceil(RanUni(1)*&nL1*2);  
      array mLR_Iter[&nIter];            /* Array for storing marginal predictors */
      do i=1 to &nIter; mLR_Iter[i] = RanNor(1); end; 
      output;
    end;
  end;
  drop i;
run;

/* Simulate posterior distribution of random intercepts */
data OutPost;                           
  do Iteration=1 to &nIter;             
    array RI_L2[&nL2];                  /* Array for random intercept values */
    do i=1 to &nL2; RI_L2[i] = RanNor(1); end;
    output;
  end;
  drop i;
run;

proc iml;
/* 1. Import random intercept matrix */
use OutPost;
  read all var _NUM_ into RI_Matrix[colname=varNames];
close OutPost;
RI_Matrix = RI_Matrix[, 2:ncol(RI_Matrix)];   /* Remove iteration counter column */

/* 2. Import observed data and predictors */
use L12;
  read all var {'L2ID', 'L1ID'} into IDs;     /* Group/unit identifiers */
  read all var _NUM_ into L12_Data;           /* Full dataset with predictors */
close L12;

/* 3. Initialize dimensions and ID vectors */
nIter = nrow(RI_Matrix);                      /* Total MCMC iterations */
nObs = nrow(IDs);                             /* Total observations in dataset */
L2ID = IDs[,1];                               /* Level 2 group membership */
L1ID = IDs[,2];                               /* Level 1 unit identifiers */

/* 4. Initialize output dataset */
create WantIML var {Iteration L2ID L1ID mLP ConditionalLP};

/* 5. Stream processing by iteration */
do i = 1 to nIter;                            /* Memory-efficient iteration loop */
  mLP = L12_Data[, 2+i];                      /* Extract predictors for current iteration */
  RI_i = RI_Matrix[i, ];                      /* Current iteration's random effects */
  
  ConditionalLP = mLP + RI_i[L2ID];           /* Calculate conditional linear predictor */
  Iteration = j(nObs, 1, i);                  /* Create iteration identifier vector */
  
  append var {Iteration L2ID L1ID mLP ConditionalLP}; /* Write current iteration data */
end;

close WantIML;
quit;


However I tried to execute the code with the numbers you mentioned initially (over 1,000 Level 2 units containing over 500 Level 1 Units each and 5,000), and it failed, because of the lack of memory (I have tested this code in SAS OnDemand)

 

 
 1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 68         
 69         %let nL2   = 1000; * Level 2 units;
 70         %let nL1   = 500; * L1 units per L2 unit on average ;
 71         %let nIter = 5000; * MCMC iterations;
 72         data L12;
 73           do L2ID=1 to &nL2;
 74             do L1ID=1 to ceil(RanUni(1)*&nL1*2); * simulate variable number of L1 units per L2 unit ;
 75           array mLR_Iter[&nIter]; do i=1 to &nIter; mLR_Iter[i] = RanNor(1); end;
 76         output;
 77         end;
 78           end;
 79         drop i;
 
 ERROR: Insufficient space in file WORK.L12.DATA.
 ERROR: File WORK.L12.DATA is damaged. I/O processing did not complete.
 NOTE: The SAS System stopped processing this step because of errors.
 WARNING: The data set WORK.L12 may be incomplete.  When this step was stopped there were 270315 observations and 5002 variables.
 ERROR: Insufficient space in file WORK.L12.DATA.
 NOTE: DATA statement used (Total process time):
       real time           1:15.32
       user cpu time       1:01.92
       system cpu time     4.56 seconds
       memory              5080.40k
       OS Memory           25552.00k
       Timestamp           02/02/2025 02:58:51 AM
       Step Count                        31  Switch Count  42
       Page Faults                       0
       Page Reclaims                     1113
       Page Swaps                        0
       Voluntary Context Switches        24592
       Involuntary Context Switches      1937
       Block Input Operations            0
       Block Output Operations           23066656
       
 
 
 80         data OutPost; * Random Intercepts: one row per MCMC Iteration, one column per L2 unit ;
 81           do Iteration=1 to &nIter;
 82         array RI_L2[&nL2]; do i=1 to &nL2; RI_L2[i] = RanNor(1);end;
 83         output;
 84           end;
 85         drop i;
 86         run;
 
 ERROR: An I/O error has occurred on file WORK.OUTPOST.DATA.
 ERROR: UNIX errno = 122.
 NOTE: The SAS System stopped processing this step because of errors.
 WARNING: The data set WORK.OUTPOST was only partially opened and will not be saved.
 NOTE: DATA statement used (Total process time):
       real time           0.00 seconds
       user cpu time       0.00 seconds
       system cpu time     0.00 seconds
       memory              1299.15k
       OS Memory           22944.00k
       Timestamp           02/02/2025 02:58:51 AM
       Step Count                        32  Switch Count  0
       Page Faults                       0
       Page Reclaims                     29
       Page Swaps                        0
       Voluntary Context Switches        0
       Involuntary Context Switches      0
       Block Input Operations            0
       Block Output Operations           0
       
 
 87         
 88         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 98         

 

Hope  I could be helpful.

Haris
Lapis Lazuli | Level 10

Brilliant!  Thanks so much.  Will test the performance of this code.

In the meantime, I was able to shave off more than 30% of execution time by (1) removing the use of vValueX() lookup and (2) eliminating interim PROC SORTs.  I replaced the former with TRANSPOSE step for Random Intercepts and the latter with a PROC SQL step for merging instead of DATA step.  A lot less code too:

 

*** 2.a TRANSPOSE PT DATA to TALL AND MERGE WITH FLAT RANDOM INTERCEPTS					;
proc transpose data=Pt out=Pt_T( rename=(Col1=mLP) );	by ParticID PtID;	var mLR_Iter:;
*** 2.b TRANSPOSE RI DATA to TALL AND MERGE WITH TALL PATIENTS							;
proc transpose data=RI out=RI_T( rename=(Col1=RI) );	by Iteration;		var RI:;

*** 3. NO-SORT SQL MEGE																	;
proc sql; create table WantM3 as
	select p.ParticID, PtID, Iteration, mLP, RI, mLP+RI as cLP
	from 		Pt_T p
	  left join RI_T r  ON input(substr(p._NAME_,9),best.) = r.Iteration
					   AND p.ParticID = input(substr(r._NAME_,7),best.)
	order by p.ParticID, PtID, Iteration
;
quit;

  

Haris
Lapis Lazuli | Level 10

Benchmarking results are in:

1. My original code ran just a tad over two hours

2. My new and improved code without vValueX() or sorting improved significantly to 1.5 hours.

3. PROC IML method recommended above blew them both out of the water completing the task in 24.5 minutes!

 

Thanks for playing!

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 1175 views
  • 1 like
  • 2 in conversation