Hello all,
I'm using SAS EG and trying to create Kaplan-Meier survival plots with SGPLOT based on output from LIFETEST. My dataset has about 650K observations, which causes memory errors when I try to generate certain plots and makes other plots take a long time (15 to 30 minutes). The problem seems to be exacerbated by adding options to the SGPLOT step (e.g., AXISTABLE, quartile markers and labels).
I am not able to get more memory allocated, but even if I could, I would like to figure out if it is possible to reduce the size of the LIFETEST table so that SGPLOT isn't trying to plot so many data points. I've read about "pre-summarizing" tables in this type of situation, but I'm not sure if that is possible for my specific use case. Any suggestions are appreciated! I'm new to survival analysis and don't know how to proceed.
I have attached a text file with code to generate a reduced version of my dataset (first 1k and last 1k observations).
ODS graphics on;
ODS exclude all;
proc lifetest data=Survival plots=survival(atrisk=0 to 20 by 2);
time FollowY*Recid(0);
strata OffType;
ods output Survivalplot=KMSurvivalOffType;
ods output Quartiles=KMQuartilesOffTypeTemp;
run;
ODS exclude none;
data KMQuartilesOffType;
set KMQuartilesOffTypeTemp
(Keep=OffType Stratum Percent Estimate);
rename Estimate=Time;
rename Stratum=StratumNum;
drop Percent;
run;
proc sort;
by StratumNum Time;
run;
data KMSurvivalPlotOffType;
merge KMSurvivalOffType KMQuartilesOffType;
by StratumNum Time;
Length PositionVar $12;
if Quartiles~=.
then do;
if StratumNum=1 and Quartiles~=.
then PositionVar="TopRight";
if StratumNum=2 and Quartiles~=.
then PositionVar="BottomLeft";
if StratumNum=3 and Quartiles~=.
then PositionVar="Right";
end;
run;
*GRAPH - OffType;
ods listing gpath='\\Figures\' image_DPI=300;
ods graphics / reset
height=6.5in imagename="Figure100&Z" imagefmt=PNG ATTRPRIORITY=NONE ANTIALIASMAX=378300;
Footnote justify=left height=1.2 font="Times New Roman" "Figure X.XX. Kaplan-Meier Survival Curves";
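proc sgplot data=KMSurvivalPlotOffType noborder;
where Time<=20;
format AtRisk comma.;
series x=Time y=Survival /
lineattrs=(pattern=solid) group=Stratum name='S';
scatter x=Time y=Percentile /
group=Quartiles
markerattrs=(symbol=circlefilled size=7)
name='Q';
text x=Time y=Percentile text=Years /
group=Stratum textattrs=(size=10) position=PositionVar;
xaxistable atrisk /
x=tatrisk class=stratum colorgroup=stratum
valueattrs=(weight=bold) location=inside;
keylegend 'S' /
title="Offender Type"
exclude=(" ")
linelength=20 titleattrs=(size=10) valueattrs=(size=10)
location=outside position=bottom;
keylegend 'Q' /
title="Survival Probability Quartiles"
titleattrs=(size=10 weight=bold)
exclude=("." "25")
down=3 location=inside position=TopRight
valueattrs=(size=10) noborder;
yaxis label="Survival Probability"
grid values=(0 to 1 by 0.10);
xaxis label="Years"
grid values=(0 to 20 by 2);
run;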
Could you post the log from running this, including the error messages?
Also, could you post DATA step code to create a simulated dataset work.Survival that replicates the problem?
How many records are in KMSurvivalPlotOffType?
Thanks so much for your help, @Quentin !
There are about 367K records in KMSurvivalPlotOffType. Most of these records appear to be repeats of censored events. I'm wondering if those records can be summarized to reduce the size of the dataset but I'm not sure how that might be done.
One more thing: I am not plotting censored events because there are so many that the entire curve would be a solid line of censored tick marks.
I have posted the log below that results from running the SGPLOT code for my actual data. The plot failed after running for 45 minutes with "ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: Java heap space."
I have also included code below for a simulated work.Survival dataset. This is my first time writing such code so hopefully I've done so correctly. I couldn't figure out how to make the simulated data follow the same distribution as my original data but I don't think that matters for troubleshooting purposes. In my actual data the time variable (FollowY) is right-skewed and one group (OffType) is right-skewed more than the other.
data SurvivalTemp;
do ID = 1 to 644265;
Drug = rand('Bernoulli',0.6739);
FollowY = round(22*rand('Uniform'),0.1);
Fail = rand('Bernoulli',0.4309);
output;
end;
run;
data SurvivalSIM;
set SurvivalTemp;
if Drug=1
then OffType="Non-Drug";
else OffType="Drug";
Drop Drug;
run;
The SAS System    11:11 Saturday, March 23, 2024

;*';*";*/;quit;run;
OPTIONS PAGENO=MIN;
%LET _CLIENTTASKLABEL='22_Survival.sas';
%LET _CLIENTPROCESSFLOWNAME='Standalone Not In Project';
%LET _CLIENTPROJECTPATH='';
%LET _CLIENTPROJECTPATHHOST='';
%LET _CLIENTPROJECTNAME='';
%LET _SASPROGRAMFILE='X:\22_Survival.sas';
%LET _SASPROGRAMFILEHOST='82461-CJIS';

ODS _ALL_ CLOSE;
OPTIONS DEV=SVG;
GOPTIONS XPIXELS=0 YPIXELS=0;
%macro HTML5AccessibleGraphSupported;
%if %_SAS_VERCOMP_FV(9,4,4, 0,0,0) >= 0 %then ACCESSIBLE_GRAPH;
%mend;
FILENAME EGHTML TEMP;
ODS HTML5(ID=EGHTML) FILE=EGHTML
OPTIONS(BITMAP_MODE='INLINE')
%HTML5AccessibleGraphSupported
ENCODING='utf-8'
STYLE=HTMLBlue
GPATH=&sasworklocation
;
NOTE: Writing HTML5(EGHTML) Body file: EGHTML

ods listing gpath='\\Figures\' image_DPI=300;
ods graphics / reset
height=6.5in imagename="Figure100&Z" imagefmt=PNG ATTRPRIORITY=NONE ANTIALIASMAX=378300;
Footnote justify=left height=1.2 font="Times New Roman" "Figure X.XX. Kaplan-Meier Survival Curves";
Title;

proc sgplot data=KMSurvivalPlotOffType noborder;
where Time<=20;
format AtRisk comma.;
series x=Time y=Survival /
lineattrs=(pattern=solid)
Group=Stratum
Name='S';
scatter x=Time y=Percentile /
group=Quartiles
markerattrs=(symbol=circlefilled size=7)
name='Q';
text x=Time y=Percentile text=Years /
group=Stratum
textattrs=(size=10)
position=PositionVar;
xaxistable atrisk /
x=tatrisk
class=stratum
colorgroup=stratum
valueattrs=(weight=bold)
location=inside;
Inset ("Drug vs. Non-Drug:" = "0.93") /
border
Title="Hazard Ratio"
titleattrs=(size=10 weight=bold)
textattrs=(size=10)
Position=Top;
keylegend 'S' /
linelength=20
title="Type"
exclude=(" ")
Titleattrs=(size=10)
Valueattrs=(size=10)
location=outside
position=bottom;
keylegend 'Q' /
title="Survival Probability Quartiles"
titleattrs=(size=10 weight=bold)
exclude=("." "25")
down=3
location=inside
position=TopRight
Valueattrs=(Size=10)
noborder;
yaxis label="Survival Probability"
grid values=(0 to 1 by 0.10);
xaxis label="Years"
grid values=(0 to 20 by 2);
run;

NOTE: PROCEDURE SGPLOT used (Total process time):
real time 44:04.98
cpu time 20.87 seconds

ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: Java heap space.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 364490 observations read from the data set WORK.KMSURVIVALPLOTOFFTYPE.
WHERE Time<=20;

%LET _CLIENTTASKLABEL=;
%LET _CLIENTPROCESSFLOWNAME=;
%LET _CLIENTPROJECTPATH=;
%LET _CLIENTPROJECTPATHHOST=;
%LET _CLIENTPROJECTNAME=;
%LET _SASPROGRAMFILE=;
%LET _SASPROGRAMFILEHOST=;

;*';*";*/;quit;run;
ODS _ALL_ CLOSE;

QUIT; RUN;
Hello @samp945,
Your simulated sample data contain 99.75% redundant observations:
proc sort data=KMSurvivalPlotOffType out=want nodupkey;
by time stratumnum _all_;
run;
NOTE: There were 366456 observations read from the data set WORK.KMSURVIVALPLOTOFFTYPE.
NOTE: Duplicate BY variable(s) specified. Duplicates will be ignored.
NOTE: SAS threaded sort was used.
NOTE: 365546 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT has 910 observations and 13 variables.
It should be no problem to run PROC SGPLOT on the small dataset WANT, and removing the redundancy should have no impact on the graph.
Your code confirms exactly what I suspected! But I was not sure which variables to include on the NODUPKEY by statement. Your solution works perfectly with my real dataset and produces a graph with a proper XAXISTABLE in 10 seconds.
Thank you!!! You have no idea how relieving it is to figure this problem out!
One question though: I am not familiar with the _ALL_ variable that you used on the by statement. If I understand correctly, using _ALL_ after the first two variables (time stratumnum) has the effect of including all possible combinations of every other variable without actually having to write those all out. Is that correct?
Thanks again!
One other bit of information for anyone in the future working with very large survival datasets:
In addition to reducing the size of the dataset by removing duplicate censored events, I also reduced the size of the work.survival dataset by rounding calculated time values to one decimal place (i.e., before using LIFETEST). Initially my time values (in years) were created with 8 decimal places which is an unnecessary amount of precision. If I used that level of precision for the time values, the LIFETEST table would produce 30K observations even after removing duplicate censored values. If I round the time values to one decimal place before creating the LIFETEST table, the final dataset has 900 observations after removing duplicates.
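For anyone who wants to try the rounding step, here is a minimal sketch of the idea. The dataset names SurvivalToy and SurvivalRounded, the seed, and the toy values are assumptions for illustration only; with real data you would apply the same ROUND call to your own time variable before running LIFETEST.

```sas
/* Toy stand-in for the real input data (hypothetical values) */
data SurvivalToy;
  call streaminit(1);
  do ID = 1 to 1000;
    FollowY = 22*rand('Uniform');        /* unrounded follow-up time in years */
    Recid   = rand('Bernoulli', 0.43);   /* 1 = event, 0 = censored */
    output;
  end;
run;

/* Round follow-up time to one decimal place before PROC LIFETEST */
data SurvivalRounded;
  set SurvivalToy;
  FollowY = round(FollowY, 0.1);
run;

/* Compare the number of distinct time values before and after rounding */
proc sql;
  select count(distinct FollowY) as DistinctTimes from SurvivalToy;
  select count(distinct FollowY) as DistinctTimes from SurvivalRounded;
quit;
```

Fewer distinct time values means fewer rows in the LIFETEST output tables, which is what keeps the plot dataset small.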
@samp945 wrote:
One question though: I am not familiar with the _ALL_ variable that you used on the by statement. If I understand correctly, using _ALL_ after the first two variables (time stratumnum) has the effect of including all possible combinations of every other variable without actually having to write those all out. Is that correct?
The variable list _ALL_ stands for all variables in dataset KMSurvivalPlotOffType, ordered by variable number (see PROC CONTENTS output). So, adding more variable names in the BY statement actually creates a list with duplicate names, but the duplicates (here: time and stratumnum contained in _ALL_) are ignored, as is mentioned in the second note in the log.
Putting time stratumnum first in the BY statement ensures that dataset WANT will be sorted by time stratumnum, regardless of their position (variable number) in the dataset, which I didn't want to make assumptions about. Sorting by time is crucial for the SERIES statement of the PROC SGPLOT step. The secondary sort key stratumnum determines the order of the treatment groups in the x-axis table, in the legend and also regarding color assignment. At least this is true for the simulated sample data. You could insert descending before stratumnum to switch treatment order.
I think the sort order of the remaining variables in KMSurvivalPlotOffType has no impact on the graph, so covering them by the abbreviation _ALL_ was a convenient way to have NODUPKEY remove duplicate observations without losing any combination of values.
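A small sketch of that behavior on made-up data may help. The dataset names toy and toy_dedup and their values are hypothetical; only the BY-statement pattern comes from the solution above, with DESCENDING added to show the reversed group order.

```sas
/* Hypothetical toy data: six rows, two of which are exact duplicates */
data toy;
  input time stratumnum survival;
  datalines;
0 1 1
0 1 1
0 2 1
1 2 0.9
1 1 0.8
1 1 0.8
;
run;

/* _ALL_ expands to every variable, so NODUPKEY drops only rows that are  */
/* identical in all variables. The explicit keys listed first control the */
/* sort order: time ascending, then stratumnum descending.                */
proc sort data=toy out=toy_dedup nodupkey;
  by time descending stratumnum _all_;
run;

proc print data=toy_dedup;
run;
```

The two exact duplicate rows are removed, so toy_dedup keeps four observations, and within each time value stratum 2 now comes before stratum 1.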
Hard to tell without data, but perhaps the automatic collision-avoidance algorithm is trying to arrange all those labels. You can use
ods graphics / labelmax=0;
to turn off the collision-avoidance algorithm.
@Rick_SAS : I don't think it is a label problem because I created label variables separately and there are only a handful in the huge dataset. The only labels are for quartile values. If you have a minute to check my data, I've posted code to generate a simulated datafile above.
Try to delete the SCATTER statement and use the HEATMAP statement instead? It should be the same syntax.