
Tips to make the following code more time/memory efficient

Contributor
Posts: 39

Tips to make the following code more time/memory efficient

Hello all, 

 

I have the following proc sql code (see below) that does what I want on a sample of the data. However, my computer runs out of resources when I run the program on the full dataset. I have an empty 4 TB hard drive and am hoping it is possible to run it with some minor tweaks to the program or SAS settings. Any suggestions would be greatly appreciated! Thank you!

 

Details about code/data:

Dataset A is 110 GB

Dataset B is 500 MB

Once this procedure completes, I will select one observation per group, and the final dataset will go back down to ~110 GB.

 

Steps I have taken: 

-specified that the procedure create a compressed dataset (COMPRESS=YES)

-changed the WORK directory to the external hard drive

 

Other ideas I have: 

-run in batches and append (rough sketch below)

-somehow delete temp files as they are created/no longer needed?

-run one mean at a time and merge
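For the batching idea, I was picturing something like the sketch below (untested; splitting by the leading ZIP digit is just a placeholder, and only one of the means is shown):

%macro run_batches;
    %do d = 0 %to 9;
        /* One batch: rows of A whose ZIP code starts with digit &d */
        proc sql;
            create table work.batch&d (compress=YES) as
            select A.id, A.gewk_date_start, A.gewk_date_end,
                   mean(B.meanpm25) as mean_meanpm25_wk
            from work.birth03 A, work.ap06_imputed B
            where A.zipnum=B.zipnum
              and int(A.zipnum/10000)=&d
              and (A.gewk_count=. or B.date between A.gewk_date_start and A.gewk_date_end)
            group by A.id, A.gewk_date_start, A.gewk_date_end;
        quit;
        /* Append the batch, then delete it so WORK space is reclaimed */
        proc append base=work.birth_ap01 data=work.batch&d;
        run;
        proc delete data=work.batch&d;
        run;
    %end;
%mend run_batches;
%run_batches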

 

 

proc sql;
	create table work.birth_ap01 (compress=YES) as
	select A.*, B.*,
		   mean(B.meantemp) as mean_temp_wk,
		   mean(B.maxco) as mean_maxco_wk,
		   mean(B.maxeighthro3) as mean_maxeighthro3_wk,
		   mean(B.maxno2) as mean_maxno2_wk,
		   mean(B.maxo3) as mean_maxo3_wk,
		   mean(B.meanrh) as mean_meanrh_wk,
		   mean(B.meanpm25) as mean_meanpm25_wk
	from work.birth03 A, work.ap06_imputed B
	where A.zipnum=B.zipnum and (A.gewk_count=. or A.gewk_date_end ge B.date ge A.gewk_date_start)
	group by A.id, A.gewk_date_end, A.gewk_date_start
	order by A.id, A.gewk_count, B.date
	;
quit;
Respected Advisor
Posts: 4,668

Re: Tips to make the following code more time/memory efficient

Posted in reply to pamplemouse22

@pamplemouse22

You don't tell us where you run out of resources. Given the code you've posted, I assume it's disk space.

Make sure that not only WORK but also UTILLOC has sufficient disk space available. UTILLOC is used for temporary utility files (for example, sort files), and that's where your process will require the most disk space.
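You can list where both currently point in the log. UTILLOC itself can only be set at SAS start-up (config file or command line), so the path in the comment below is just a placeholder:

/* Write the current WORK and UTILLOC paths to the log */
proc options option=(work utilloc);
run;

/* Set at invocation, e.g.:  sas -utilloc "E:\sasutil"  (placeholder path) */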

 

Taking your join condition into account: what's the relationship between your two tables? Could you end up with many-to-many relationships? If so, this could explain why you run out of resources, and it would also require you to rethink whether you can get correct averages this way.

What most likely really hurts you is SELECT A.*, B.*, as this leads to a result table containing every column from both tables on every joined row; should there be many-to-many relationships, you can end up with a really huge table. Restrict your result table to the columns you really need, especially the ones from table A.
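A minimal reworking of your query along those lines, keeping only the grouping keys and the computed means (add back whichever columns from table A you truly need, ideally to the GROUP BY as well):

proc sql;
    create table work.birth_ap01 (compress=YES) as
    select A.id, A.gewk_date_start, A.gewk_date_end,
           mean(B.meantemp) as mean_temp_wk,
           mean(B.maxco) as mean_maxco_wk,
           mean(B.maxeighthro3) as mean_maxeighthro3_wk,
           mean(B.maxno2) as mean_maxno2_wk,
           mean(B.maxo3) as mean_maxo3_wk,
           mean(B.meanrh) as mean_meanrh_wk,
           mean(B.meanpm25) as mean_meanpm25_wk
    from work.birth03 A, work.ap06_imputed B
    where A.zipnum=B.zipnum
      and (A.gewk_count=. or B.date between A.gewk_date_start and A.gewk_date_end)
    group by A.id, A.gewk_date_start, A.gewk_date_end;
quit;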

 

 

You could consider using an index over ZIP code on Dataset A, but that's a secondary measure. First reduce the number of variables in your result set and ensure that your join condition doesn't allow for many-to-many cases.
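If you do try the index, a simple one on ZIPNUM could be created like this:

proc sql;
    create index zipnum on work.birth03 (zipnum);
quit;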

Super User
Posts: 3,856

Re: Tips to make the following code more time/memory efficient

Posted in reply to pamplemouse22

One thing that looks weird in your query is that you are doing summary stats (MEAN) on your B table, but your GROUP BY columns are on your A table. Is that what you meant to do? You should really keep your summary-stats query separate; otherwise SAS will end up remerging the data, which will increase the size a lot.
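A sketch of what I mean, assuming ID plus the two gewk dates uniquely identify a group in table A (only one mean shown; repeat for the others):

/* Step 1: summary query only - group keys plus the aggregates */
proc sql;
    create table work.wk_means as
    select A.id, A.gewk_date_start, A.gewk_date_end,
           mean(B.meanpm25) as mean_meanpm25_wk
    from work.birth03 A, work.ap06_imputed B
    where A.zipnum=B.zipnum
      and (A.gewk_count=. or B.date between A.gewk_date_start and A.gewk_date_end)
    group by A.id, A.gewk_date_start, A.gewk_date_end;
quit;

/* Step 2: join the summary back to the detail rows - no remerging */
proc sql;
    create table work.birth_ap01 (compress=YES) as
    select A.*, M.mean_meanpm25_wk
    from work.birth03 A
    left join work.wk_means M
      on  A.id=M.id
      and A.gewk_date_start=M.gewk_date_start
      and A.gewk_date_end=M.gewk_date_end;
quit;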

PROC Star
Posts: 2,311

Re: Tips to make the following code more time/memory efficient

Posted in reply to pamplemouse22

You don't tell us how many rows you have, but for each row of the large table you repeatedly match many rows of the smaller table.

 

A.gewk_count=. or A.gewk_date_end ge B.date ge A.gewk_date_start

How many rows from table B do you expect to match each row from table A under the date condition?

Also, if gewk_count is missing you match *all rows* of table B for the given ZIP code. That's potentially a performance killer if it's not very rare.
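A quick check of how rare that actually is (NMISS here acts as a summary function, counting missing values):

/* Count rows with a missing gewk_count versus total rows */
proc sql;
    select nmiss(gewk_count) as n_missing,
           count(*) as n_total
    from work.birth03;
quit;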

 

 

 
