samp945
Obsidian | Level 7

Hello all,

 

I'm using SAS EG to create Kaplan-Meier survival plots with SGPLOT based on output from LIFETEST. My dataset has about 650K observations; some plots fail with memory errors, and others take a long time (15 to 30 minutes) to render. Adding options to the SGPLOT step (e.g., AXISTABLE, quartile markers and labels) seems to make the problem worse.

 

I am not able to get more memory allocated, but even if I could, I would like to figure out if it is possible to reduce the size of the LIFETEST table so that SGPLOT isn't trying to plot so many data points. I've read about "pre-summarizing" tables in this type of situation, but I'm not sure if that is possible for my specific use case. Any suggestions are appreciated! I'm new to survival analysis and don't know how to proceed.

 

I have attached a text file with code to generate a reduced version of my dataset (first 1k and last 1k observations).

 

Thanks!

 

 

	ODS graphics on;
	ODS exclude all;
	proc lifetest data=Survival plots=survival(atrisk=0 to 20 by 2);
	time FollowY*Recid(0);
	strata OffType;
	ods output Survivalplot=KMSurvivalOffType;
	ods output Quartiles=KMQuartilesOffTypeTemp;
	run;
	ODS exclude none;

	data KMQuartilesOffType;
		set KMQuartilesOffTypeTemp
		(Keep=OffType Stratum Percent Estimate);
		rename Estimate=Time;
		rename Stratum=StratumNum;
		Quartiles=100-Percent;
		drop Percent;
	proc sort;
		by StratumNum Time;
	run;

	data KMSurvivalPlotOffType;
		merge KMSurvivalOffType KMQuartilesOffType;
		by StratumNum Time;
		Length PositionVar $12;
		if Quartiles~=.
			then do;
				Years=round(Time,0.01);
				Percentile=Survival;
			end;
		if StratumNum=1 and Quartiles~=.
			then PositionVar="TopRight";
		if StratumNum=2 and Quartiles~=.
			then PositionVar="BottomLeft";
		if StratumNum=3 and Quartiles~=.
			then PositionVar="Right";
	run;

		*GRAPH - OffType;
		ods listing gpath='\\Figures\' image_DPI=300;
		ods graphics / reset 
			height=6.5in imagename="Figure100&Z" imagefmt=PNG ATTRPRIORITY=NONE ANTIALIASMAX=378300;
		Footnote justify=left height=1.2 font="Times New Roman" "Figure X.XX. Kaplan-Meier Survival Curves";
		Title;

		proc sgplot data=KMSurvivalPlotOffType noborder;
			where Time<=20;
			format AtRisk comma.;
		series x=Time y=Survival /
			lineattrs=(pattern=solid)
			Group=Stratum
			/*datalabel=Years
			datalabelpos=right
			datalabelattrs=(size=10)*/
			Name='S';
		scatter x=Time y=Percentile /
			group=Quartiles
			markerattrs=(symbol=circlefilled size=7)
			name='Q';
		text x=Time y=Percentile text=Years /
			group=Stratum
			textattrs=(size=10)
			position=PositionVar;
	    xaxistable atrisk /
			x=tatrisk
			class=stratum
			colorgroup=stratum
			valueattrs=(weight=bold)
			location=inside;
		keylegend 'S' /
			linelength=20
			title="Offender Type"
			exclude=(" ")
			Titleattrs=(size=10)
			Valueattrs=(size=10)
			location=outside
			position=bottom;
		keylegend 'Q' /
			title="Survival Probability Quartiles"
			titleattrs=(size=10 weight=bold)
			exclude=("." "25")
			down=3
			location=inside
			position=TopRight
			Valueattrs=(Size=10)
			noborder;
		yaxis label="Survival Probability"
			grid values=(0 to 1 by 0.10);
		xaxis label="Years"
			grid values=(0 to 20 by 2);
		run;

 

 

Quentin
Super User

Could you post the log from running this, including the error messages?

 

Also, could you post DATA step code to create a simulated dataset work.Survival that replicates the problem?

 

How many records are in KMSurvivalPlotOffType?

 

samp945
Obsidian | Level 7

Thanks so much for your help, @Quentin !

 

There are about 367K records in KMSurvivalPlotOffType. Most of these records appear to be repeats of censored events. I'm wondering if those records can be summarized to reduce the size of the dataset but I'm not sure how that might be done.

 

One more thing: I am not plotting censored events because there are so many that the entire curve would be a solid line of censored tick marks.

 

I have posted the log below that results from running the SGPLOT code for my actual data.  The plot failed after running for 45 minutes with "ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: Java heap space."

 

I have also included code below for a simulated work.Survival dataset. This is my first time writing such code so hopefully I've done so correctly. I couldn't figure out how to make the simulated data follow the same distribution as my original data but I don't think that matters for troubleshooting purposes. In my actual data the time variable (FollowY) is right-skewed and one group (OffType) is right-skewed more than the other.

 

*SIMULATED WORK.SURVIVAL DATASET;
data SurvivalTemp;
	do ID = 1 to 644265;
		Drug = rand('Bernoulli',0.6739);
		FollowY = round(22*rand('Uniform'),0.1);
		Fail = rand('Bernoulli',0.4309);
		output;
	end;
run;

data SurvivalSIM;
	set SurvivalTemp;
		if Drug=1
			then OffType="Non-Drug";
				else OffType="Drug";
	Drop Drug;
run;

 

 

1                                                          The SAS System                             11:11 Saturday, March 23, 2024

1          ;*';*";*/;quit;run;
2          OPTIONS PAGENO=MIN;
3          %LET _CLIENTTASKLABEL='22_Survival.sas';
4          %LET _CLIENTPROCESSFLOWNAME='Standalone Not In Project';
5          %LET _CLIENTPROJECTPATH='';
6          %LET _CLIENTPROJECTPATHHOST='';
7          %LET _CLIENTPROJECTNAME='';
8          %LET _SASPROGRAMFILE='X:\22_Survival.sas';
9          %LET _SASPROGRAMFILEHOST='82461-CJIS';
10         
11         ODS _ALL_ CLOSE;
12         OPTIONS DEV=SVG;
13         GOPTIONS XPIXELS=0 YPIXELS=0;
14         %macro HTML5AccessibleGraphSupported;
15             %if %_SAS_VERCOMP_FV(9,4,4, 0,0,0) >= 0 %then ACCESSIBLE_GRAPH;
16         %mend;
17         FILENAME EGHTML TEMP;
18         ODS HTML5(ID=EGHTML) FILE=EGHTML
19             OPTIONS(BITMAP_MODE='INLINE')
20             %HTML5AccessibleGraphSupported
21             ENCODING='utf-8'
22             STYLE=HTMLBlue
23             GPATH=&sasworklocation
24         ;
NOTE: Writing HTML5(EGHTML) Body file: EGHTML
25         
26         		ods listing gpath='\\Figures\' image_DPI=300;
27         		ods graphics / reset
28         			height=6.5in imagename="Figure100&Z" imagefmt=PNG ATTRPRIORITY=NONE ANTIALIASMAX=378300;
29         		Footnote justify=left height=1.2 font="Times New Roman" "Figure X.XX. Kaplan-Meier Survival Curves";
30         		Title;
31         
32         		
32       !   proc sgplot data=KMSurvivalPlotOffType noborder;
33         			where Time<=20;
34         			format AtRisk comma.;
35         		series x=Time y=Survival /
36         			lineattrs=(pattern=solid)
37         			Group=Stratum
38         			Name='S';
39         		scatter x=Time y=Percentile /
40         			group=Quartiles
41         			markerattrs=(symbol=circlefilled size=7)
42         			name='Q';
43         		text x=Time y=Percentile text=Years /
44         			group=Stratum
45         			textattrs=(size=10)
46         			position=PositionVar;
47         	    xaxistable atrisk /
48         			x=tatrisk
49         			class=stratum
50         			colorgroup=stratum
51         			valueattrs=(weight=bold)
52         			location=inside;
53         	    Inset ("Drug vs. Non-Drug:" = "0.93") /
54         			border
55         			Title="Hazard Ratio"
56         			titleattrs=(size=10 weight=bold)
57         			textattrs=(size=10)
58         			Position=Top;
59         		keylegend 'S' /
60         			linelength=20
61         			title="Type"
62         			exclude=(" ")
63         			Titleattrs=(size=10)
64         			Valueattrs=(size=10)
65         			location=outside
66         			position=bottom;
67         		keylegend 'Q' /
68         			title="Survival Probability Quartiles"
69         			titleattrs=(size=10 weight=bold)
70         			exclude=("." "25")
71         			down=3
72         			location=inside
73         			position=TopRight
74         			Valueattrs=(Size=10)
75         			noborder;
76         		yaxis label="Survival Probability"
77         			grid values=(0 to 1 by 0.10);
78         		xaxis label="Years"
79         			grid values=(0 to 20 by 2);
80         		run;

NOTE: PROCEDURE SGPLOT used (Total process time):
      real time           44:04.98
      cpu time            20.87 seconds
      
ERROR: Java virtual machine exception. java.lang.OutOfMemoryError: Java heap space.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 364490 observations read from the data set WORK.KMSURVIVALPLOTOFFTYPE.
      WHERE Time<=20;
81         
82         
83         %LET _CLIENTTASKLABEL=;
84         %LET _CLIENTPROCESSFLOWNAME=;
85         %LET _CLIENTPROJECTPATH=;
86         %LET _CLIENTPROJECTPATHHOST=;
87         %LET _CLIENTPROJECTNAME=;
88         %LET _SASPROGRAMFILE=;
89         %LET _SASPROGRAMFILEHOST=;
90         
91         ;*';*";*/;quit;run;
92         ODS _ALL_ CLOSE;
93         
94         
95         QUIT; RUN;
96         

 

 

 

 

FreelanceReinh
Jade | Level 19

Hello @samp945,

 

Your simulated sample data contain 99.75% redundant observations:

proc sort data=KMSurvivalPlotOffType out=want nodupkey;
by time stratumnum _all_;
run;

NOTE: There were 366456 observations read from the data set WORK.KMSURVIVALPLOTOFFTYPE.
NOTE: Duplicate BY variable(s) specified. Duplicates will be ignored.
NOTE: SAS threaded sort was used.
NOTE: 365546 observations with duplicate key values were deleted.
NOTE: The data set WORK.WANT has 910 observations and 13 variables.

It should be no problem to run PROC SGPLOT on the small dataset WANT and removing the redundancy should have no impact on the graph.
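
Put together with the plotting step, this might look like the minimal sketch below. It assumes the dataset and variable names from the original post and keeps only the SERIES statement, the x-axis table, and one legend; the remaining statements of the original PROC SGPLOT step should carry over unchanged.

```sas
/* Remove fully duplicated rows. Listing time and stratumnum first */
/* guarantees the sort order that the SERIES statement relies on.  */
proc sort data=KMSurvivalPlotOffType out=want nodupkey;
    by time stratumnum _all_;
run;

/* Same plot as in the question, but reading the de-duplicated data. */
proc sgplot data=want noborder;
    where Time<=20;
    format AtRisk comma.;
    series x=Time y=Survival /
        lineattrs=(pattern=solid)
        group=Stratum
        name='S';
    xaxistable atrisk /
        x=tatrisk
        class=stratum
        colorgroup=stratum
        location=inside;
    keylegend 'S' /
        title="Offender Type"
        location=outside
        position=bottom;
    yaxis label="Survival Probability"
        grid values=(0 to 1 by 0.10);
    xaxis label="Years"
        grid values=(0 to 20 by 2);
run;
```

Comparing the observation counts in the PROC SORT log notes is a quick way to see how much redundancy was removed before plotting.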

samp945
Obsidian | Level 7

Your code confirms exactly what I suspected! But I was not sure which variables to include on the NODUPKEY by statement. Your solution works perfectly with my real dataset and produces a graph with a proper XAXISTABLE in 10 seconds.

 

Thank you!!! You have no idea how relieving it is to figure this problem out!

 

One question though: I am not familiar with the _ALL_ variable that you used on the by statement. If I understand correctly, using _ALL_ after the first two variables (time stratumnum) has the effect of including all possible combinations of every other variable without actually having to write those all out. Is that correct?

 

Thanks again!

samp945
Obsidian | Level 7

One other bit of information for anyone in the future working with very large survival datasets:

 

In addition to reducing the size of the dataset by removing duplicate censored events, I also reduced the size of the work.survival dataset by rounding calculated time values to one decimal place (i.e., before using LIFETEST). Initially my time values (in years) were created with 8 decimal places which is an unnecessary amount of precision. If I used that level of precision for the time values, the LIFETEST table would produce 30K observations even after removing duplicate censored values. If I round the time values to one decimal place before creating the LIFETEST table, the final dataset has 900 observations after removing duplicates.
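
A rough sketch of that rounding step, placed before the LIFETEST call from my original post, is shown below. The dataset name SurvivalRounded is only an illustrative choice, and the variable names (FollowY, Recid, OffType) are the ones from my original code. Note that rounding creates additional tied times, so the estimates can shift slightly compared with the unrounded data.

```sas
/* Coarsen follow-up time to one decimal place before LIFETEST. */
/* SurvivalRounded is a hypothetical name used for illustration. */
data SurvivalRounded;
    set Survival;
    FollowY = round(FollowY, 0.1);
run;

ods graphics on;
ods exclude all;
proc lifetest data=SurvivalRounded plots=survival(atrisk=0 to 20 by 2);
    time FollowY*Recid(0);
    strata OffType;
    ods output SurvivalPlot=KMSurvivalOffType;
    ods output Quartiles=KMQuartilesOffTypeTemp;
run;
ods exclude none;
```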

FreelanceReinh
Jade | Level 19

@samp945 wrote:

One question though: I am not familiar with the _ALL_ variable that you used on the by statement. If I understand correctly, using _ALL_ after the first two variables (time stratumnum) has the effect of including all possible combinations of every other variable without actually having to write those all out. Is that correct?


The variable list _ALL_ stands for all variables in dataset KMSurvivalPlotOffType, ordered by variable number (see PROC CONTENTS output). So, adding more variable names in the BY statement actually creates a list with duplicate names, but the duplicates (here: time and stratumnum contained in _ALL_) are ignored, as is mentioned in the second note in the log.

 

Putting time stratumnum first in the BY statement ensures that dataset WANT will be sorted by time stratumnum, regardless of their position (variable number) in the dataset, which I didn't want to make assumptions about. Sorting by time is crucial for the SERIES statement of the PROC SGPLOT step. The secondary sort key stratumnum determines the order of the treatment groups in the x-axis table, in the legend and also regarding color assignment. At least this is true for the simulated sample data. You could insert descending before stratumnum to switch treatment order.

 

I think the sort order of the remaining variables in KMSurvivalPlotOffType has no impact on the graph, so covering them by the abbreviation _ALL_ was a convenient way to have NODUPKEY remove duplicate observations without losing any combination of values.
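
As a small illustration of the points above, the sketch below first lists the variables in variable-number order (the order that _ALL_ expands to) and then repeats the de-duplicating sort with the treatment order reversed. The output dataset name want_desc is just a placeholder.

```sas
/* Show variables by position, i.e. the order _ALL_ stands for. */
proc contents data=KMSurvivalPlotOffType varnum;
run;

/* Same de-duplication, but with DESCENDING applied to stratumnum */
/* so that the strata appear in reverse order in the graph.       */
proc sort data=KMSurvivalPlotOffType out=want_desc nodupkey;
    by time descending stratumnum _all_;
run;
```

The log should again show the "Duplicate BY variable(s) specified" note, because time and stratumnum also occur inside _ALL_.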

Rick_SAS
SAS Super FREQ

Hard to tell without data, but perhaps the automatic collision-avoidance algorithm is trying to arrange all those labels. You can use

ods graphics / labelmax=0;

to turn off the collision-avoidance algorithm.

samp945
Obsidian | Level 7

@Rick_SAS : I don't think it is a label problem because I created label variables separately and there are only a handful in the huge dataset. The only labels are for quartile values. If you have a minute to check my data, I've posted code to generate a simulated datafile above.

Rick_SAS
SAS Super FREQ

You could try replacing the SCATTER statement with a HEATMAP statement. The syntax should be the same.
