Solved: Help with Restricted Cubic Splines : Code Optimization and Graphical O...

tina_ting · Posted 06-04-2025 07:32 AM

Hello, SAS experts

I am working on Restricted Cubic Splines analysies and have written SAS code for both the continuous outcome and the binary outcome. I have included detailed comments in the code and have accounted for covarites. I have also attached the resulting plots and output tables, as well as a sample of the original dataset for reference.

Could you please review the the code, along with the generated graphs and results, and let me know if they are correct? I would greatly appreciate any suggestions for optimization.

Thank you very much for your help!

/*A. Logistic RCS – Individual-level binary Y (poly) applied directly to raw data without prior aggregation*/
/* STEP 1: Rename MO and poly in RCS.poly to X and Y, and keep necessary covariates */
data poly;
   set RCS.poly;
   X = MO;       /* Month 1–12 */
   Y = poly;     /* Binary 0/1 */
   keep ID X Y CITY COUNT_gr;
run;

/* STEP 2: Sort by X (1–12) (needed for BY X if used for plotting) */
proc sort data=poly;
   by X;
run;

/* STEP 3: Fit logistic RCS on individual-level binary Y, adjusting for CITY and COUNT_gr */
title;
ods select ParameterEstimates SplineKnots;
proc logistic data=poly;
   /* Apply natural cubic spline on month X; knot locations can be adjusted */
   effect spl = spline(X /
                       details naturalcubic 
                       basis=tpf(noint)
                       knotmethod=percentilelist(5 27.5 50 72.5 95)  /* Example with 5 knots */
                      );
   /* Y(event='1') = spline(X) + covariates */
   model Y(event='1') = spl CITY COUNT_gr;
   /* Save each individual's predicted probability to PredProb */
   output out=LogitPred (keep=ID X Y CITY COUNT_gr PredProb) p=PredProb;
   title "Logistic RCS Model (Individual-level Binary Y)";
run;

/* STEP 4: Compute monthly average predicted probability by averaging individual-level PredProb by X */
proc means data=LogitPred noprint;
   by X;
   var PredProb;
   output out=AvgPred (drop=_TYPE_ _FREQ_) mean=MeanProb;
run;

/* STEP 5: Plot the RCS curve using 12 monthly average PredProb */
title;  
ods graphics / antialiasmax=500000;
title "Individual-level Logistic RCS: Monthly Avg Predicted Probability of poly";
proc sgplot data=AvgPred noautolegend;
   series  x=X        y=MeanProb  / lineattrs=(thickness=2 color=blue);
   scatter x=X        y=MeanProb  / markerattrs=(symbol=circlefilled size=4pt color=red);
   xaxis values=(1 to 12 by 1) label="Month (MO)";
   yaxis label="Average Predicted Probability of poly=1";
run;

Dataset：poly

Dataset：LogitPred

Dataset：AvgPred

/* B. Individual level: continuous sum*/
/* sum: Cost_opd */
/*STEP 1: Rename MO and cost in RCS.cost to X and Y, and keep covariates */
data cost_ind;
   set RCS.cost;
   X = MO;           /* Month 1–12 */
   Y = cost;         /* Continuous dependent variable */
   /* If cost has missing values and you want to delete them first, uncomment the next line */
   /* if missing(Y) then delete; */
   keep ID X Y CITY COUNT_gr;
run;

proc sort data=cost_ind;
   by X;
run;

/* STEP 2: Perform RCS on individual-level data (continuous Y = cost), adjusting for CITY and COUNT_gr */
title;  
ods select ParameterEstimates SplineKnots;
proc glmselect data=cost_ind;
   /* Define a natural cubic spline on X; example with 5 knots: roughly at X=1, 4, 6, 9, 12 */
   effect spl = spline(
                 X /
                 details naturalcubic
                 basis=tpf(noint)
                 knotmethod=percentilelist(5 27.5 50 72.5 95)
               );
   /* Include spline(X), CITY, and COUNT_gr in the model */
   model Y = spl CITY COUNT_gr / selection=none;
   /* Save each individual’s predicted value to FitCost in cost_pred */
   output out=cost_pred predicted=FitCost;
run;

/*********** STEP 3: Check whether cost_pred contains FitCost ***********/
proc print data=cost_pred(obs=20);
   var ID X Y CITY COUNT_gr FitCost;
   title "Check cost_pred (first 20 observations)";
run;

proc means data=cost_pred n nmiss min max mean;
   var FitCost Y;
   title "Check FitCost (predicted) and Y (actual)";
run;

/*********** STEP 4A: Aggregate individual-level predicted values by month X using SUM ***********/
proc means data=cost_pred noprint;
   by X;  
   var FitCost;
   output out=MonthlyFit (drop=_TYPE_ _FREQ_) 
          sum=SumFitCost;   /* Use sum instead of mean */
run;

proc print data=MonthlyFit;
   title "Sum of fitted cost for each month (X)";
   var X SumFitCost;
run;

/*********** STEP 4B: Plot – Use 12 months of summed predicted values to draw RCS curve ***********/
title;  
ods graphics / antialiasmax=500000;
title "Individual-Level RCS: Monthly Sum of Fitted Cost by MO";
proc sgplot data=MonthlyFit noautolegend;
   series  x=X         y=SumFitCost / lineattrs=(thickness=2 color=blue);
   scatter x=X         y=SumFitCost / markerattrs=(symbol=circlefilled size=4pt color=red);
   xaxis values=(1 to 12 by 1) label="Month (MO)";
   yaxis label="Sum of Fitted Cost";
run;
ods graphics off;

Dataset：cost_ind

Dataset：cost_pred

Dataset：MonthlyFit

SteveDenham · Posted 06-04-2025 12:37 PM

I don't know about "correct" but this certainly looks to be "fit for purpose." When you say optimize, I have to assume you mean "reduce how long it takes to generate results." That is going to depend on the amount of input data you have. As far as programming, I am not a good resource. The one thing I might suggest is using high performance versions of LOGISTIC and GLMSELECT (=HPLOGISTIC and =HPGENSELECT) to see if multi-threading speeds things up.

SteveDenham

View solution in original post

SteveDenham · Posted 06-04-2025 12:37 PM

I don't know about "correct" but this certainly looks to be "fit for purpose." When you say optimize, I have to assume you mean "reduce how long it takes to generate results." That is going to depend on the amount of input data you have. As far as programming, I am not a good resource. The one thing I might suggest is using high performance versions of LOGISTIC and GLMSELECT (=HPLOGISTIC and =HPGENSELECT) to see if multi-threading speeds things up.

SteveDenham

Help with Restricted Cubic Splines : Code Optimization and Graphical Output

Re: Help with Restricted Cubic Splines : Code Optimization and Graphical Output

Re: Help with Restricted Cubic Splines : Code Optimization and Graphical Output

Help with Restricted Cubic Splines : Code Optimization and Graphical Output

Re: Help with Restricted Cubic Splines : Code Optimization and Graphical Output

Re: Help with Restricted Cubic Splines : Code Optimization and Graphical Output

Registration is open