BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
tina_ting
Fluorite | Level 6

Hello, SAS experts

I am working on Restricted Cubic Splines  analysies and have written SAS code for both the continuous outcome and the binary outcome. I have included detailed comments in the code and have accounted for covarites. I have also attached the resulting plots and output tables, as well as a sample of the original dataset for reference.

 

Could you please review the the code, along with the generated graphs and results, and let me know if they are correct? I would greatly appreciate any suggestions for optimization.

 

Thank you very much for your help!

/*A. Logistic RCS – Individual-level binary Y (poly) applied directly to raw data without prior aggregation*/
/* STEP 1: Rename MO and poly in RCS.poly to X and Y, and keep necessary covariates */
data poly;
   set RCS.poly;
   X = MO;       /* Month 1–12 */
   Y = poly;     /* Binary 0/1 */
   keep ID X Y CITY COUNT_gr;
run;

/* STEP 2: Sort by X (1–12) (needed for BY X if used for plotting) */
proc sort data=poly;
   by X;
run;

/* STEP 3: Fit logistic RCS on individual-level binary Y, adjusting for CITY and COUNT_gr */
title;
ods select ParameterEstimates SplineKnots;
proc logistic data=poly;
   /* Apply natural cubic spline on month X; knot locations can be adjusted */
   effect spl = spline(X /
                       details naturalcubic 
                       basis=tpf(noint)
                       knotmethod=percentilelist(5 27.5 50 72.5 95)  /* Example with 5 knots */
                      );
   /* Y(event='1') = spline(X) + covariates */
   model Y(event='1') = spl CITY COUNT_gr;
   /* Save each individual's predicted probability to PredProb */
   output out=LogitPred (keep=ID X Y CITY COUNT_gr PredProb) p=PredProb;
   title "Logistic RCS Model (Individual-level Binary Y)";
run;

/* STEP 4: Compute monthly average predicted probability by averaging individual-level PredProb by X */
proc means data=LogitPred noprint;
   by X;
   var PredProb;
   output out=AvgPred (drop=_TYPE_ _FREQ_) mean=MeanProb;
run;

/* STEP 5: Plot the RCS curve using 12 monthly average PredProb */
title;  
ods graphics / antialiasmax=500000;
title "Individual-level Logistic RCS: Monthly Avg Predicted Probability of poly";
proc sgplot data=AvgPred noautolegend;
   series  x=X        y=MeanProb  / lineattrs=(thickness=2 color=blue);
   scatter x=X        y=MeanProb  / markerattrs=(symbol=circlefilled size=4pt color=red);
   xaxis values=(1 to 12 by 1) label="Month (MO)";
   yaxis label="Average Predicted Probability of poly=1";
run;

Datasetpoly

tina_ting_0-1749036424357.png

DatasetLogitPred

tina_ting_1-1749036440256.png

DatasetAvgPred

tina_ting_2-1749036450093.png

tina_ting_3-1749036463086.pngtina_ting_4-1749036465911.pngtina_ting_5-1749036469468.png

/* B. Individual level: continuous sum*/
/* sum: Cost_opd */
/*STEP 1: Rename MO and cost in RCS.cost to X and Y, and keep covariates */
data cost_ind;
   set RCS.cost;
   X = MO;           /* Month 1–12 */
   Y = cost;         /* Continuous dependent variable */
   /* If cost has missing values and you want to delete them first, uncomment the next line */
   /* if missing(Y) then delete; */
   keep ID X Y CITY COUNT_gr;
run;

proc sort data=cost_ind;
   by X;
run;

/* STEP 2: Perform RCS on individual-level data (continuous Y = cost), adjusting for CITY and COUNT_gr */
title;  
ods select ParameterEstimates SplineKnots;
proc glmselect data=cost_ind;
   /* Define a natural cubic spline on X; example with 5 knots: roughly at X=1, 4, 6, 9, 12 */
   effect spl = spline(
                 X /
                 details naturalcubic
                 basis=tpf(noint)
                 knotmethod=percentilelist(5 27.5 50 72.5 95)
               );
   /* Include spline(X), CITY, and COUNT_gr in the model */
   model Y = spl CITY COUNT_gr / selection=none;
   /* Save each individual’s predicted value to FitCost in cost_pred */
   output out=cost_pred predicted=FitCost;
run;

/*********** STEP 3: Check whether cost_pred contains FitCost ***********/
proc print data=cost_pred(obs=20);
   var ID X Y CITY COUNT_gr FitCost;
   title "Check cost_pred (first 20 observations)";
run;

proc means data=cost_pred n nmiss min max mean;
   var FitCost Y;
   title "Check FitCost (predicted) and Y (actual)";
run;

/*********** STEP 4A: Aggregate individual-level predicted values by month X using SUM ***********/
proc means data=cost_pred noprint;
   by X;  
   var FitCost;
   output out=MonthlyFit (drop=_TYPE_ _FREQ_) 
          sum=SumFitCost;   /* Use sum instead of mean */
run;

proc print data=MonthlyFit;
   title "Sum of fitted cost for each month (X)";
   var X SumFitCost;
run;

/*********** STEP 4B: Plot – Use 12 months of summed predicted values to draw RCS curve ***********/
title;  
ods graphics / antialiasmax=500000;
title "Individual-Level RCS: Monthly Sum of Fitted Cost by MO";
proc sgplot data=MonthlyFit noautolegend;
   series  x=X         y=SumFitCost / lineattrs=(thickness=2 color=blue);
   scatter x=X         y=SumFitCost / markerattrs=(symbol=circlefilled size=4pt color=red);
   xaxis values=(1 to 12 by 1) label="Month (MO)";
   yaxis label="Sum of Fitted Cost";
run;
ods graphics off;

Dataset:cost_ind

tina_ting_6-1749036525366.png

Datasetcost_pred

tina_ting_7-1749036535181.png

DatasetMonthlyFit

tina_ting_8-1749036546854.png

tina_ting_9-1749036551891.png

tina_ting_10-1749036556050.png

tina_ting_11-1749036560589.pngtina_ting_12-1749036564183.png

tina_ting_13-1749036567913.png

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
SteveDenham
Jade | Level 19

I don't know about "correct" but this certainly looks to be "fit for purpose." When you say optimize, I have to assume you mean "reduce how long it takes to generate results." That is going to depend on the amount of input data you have. As far as programming, I am not a good resource. The one thing I might suggest is using high performance versions of LOGISTIC and GLMSELECT (=HPLOGISTIC and =HPGENSELECT) to see if multi-threading speeds things up.

 

SteveDenham

View solution in original post

1 REPLY 1
SteveDenham
Jade | Level 19

I don't know about "correct" but this certainly looks to be "fit for purpose." When you say optimize, I have to assume you mean "reduce how long it takes to generate results." That is going to depend on the amount of input data you have. As far as programming, I am not a good resource. The one thing I might suggest is using high performance versions of LOGISTIC and GLMSELECT (=HPLOGISTIC and =HPGENSELECT) to see if multi-threading speeds things up.

 

SteveDenham

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 1 reply
  • 501 views
  • 1 like
  • 2 in conversation