11-25-2017 05:19 PM
I have the following PROC SQL code (see below) that does what I want on a sample of data. However, my computer runs out of resources when I run the program on the full dataset. I have an empty 4 TB hard drive and am hoping it is possible to run this with some minor tweaks to the program or SAS settings. Any suggestions would be greatly appreciated! Thank you!
Details about code/data:
Dataset A is 110GB
Dataset B is 500MB
Once this procedure is completed, I will select one observation per group, and the final dataset will go back down to ~110 GB.
Steps I have taken:
-specified the procedure to create a compressed dataset
-changed the WORK directory to an external hard drive
Other ideas I have:
-run in batches and append
-somehow delete temp files once they are no longer needed?
-run one mean at a time and merge
proc sql;
  create table work.birth_ap01 (compress=YES) as
  select A.*, B.*,
         mean(B.meantemp)     as mean_temp_wk,
         mean(B.maxco)        as mean_maxco_wk,
         mean(B.maxeighthro3) as mean_maxeighthro3_wk,
         mean(B.maxno2)       as mean_maxno2_wk,
         mean(B.maxo3)        as mean_maxo3_wk,
         mean(B.meanrh)       as mean_meanrh_wk,
         mean(B.meanpm25)     as mean_meanpm25_wk
  from work.birth03 A, work.ap06_imputed B
  where A.zipnum = B.zipnum
    and (A.gewk_count = . or A.gewk_date_end ge B.date ge A.gewk_date_start)
  group by A.id, A.gewk_date_end, A.gewk_date_start
  order by A.id, A.gewk_count, B.date;
quit;
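The "run in batches and append" idea could be sketched with a macro that processes one slice of ZIP codes at a time and appends the results, deleting each intermediate table to free WORK space. The macro name, the MOD-based slicing, and the reduced column list are illustrative assumptions, not tested against your data:

```sas
/* Sketch only: join and summarize in slices of ZIP codes, then append.
   Macro name, slicing scheme, and column list are assumptions. */
%macro batch_join(nslices=10);
  %do i = 0 %to %eval(&nslices - 1);
    proc sql;
      create table work.slice as
      select A.id, A.gewk_count, A.gewk_date_start, A.gewk_date_end,
             mean(B.meantemp) as mean_temp_wk
             /* ...remaining means as in the full query... */
      from work.birth03 A, work.ap06_imputed B
      where mod(A.zipnum, &nslices) = &i          /* one slice per pass */
        and A.zipnum = B.zipnum
        and (A.gewk_count = .
             or B.date between A.gewk_date_start and A.gewk_date_end)
      group by A.id, A.gewk_date_end, A.gewk_date_start;
    quit;

    proc append base=work.birth_ap01 data=work.slice force;
    run;

    proc delete data=work.slice;  /* free WORK space between slices */
    run;
  %end;
%mend;
%batch_join(nslices=10)
```

Each pass only needs WORK/UTILLOC space for one slice rather than the whole join, at the cost of scanning the large table once per slice.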
11-25-2017 06:49 PM - edited 11-25-2017 06:52 PM
You don't tell us where you run out of resources. Given the code you've posted I assume it's disk space.
Make sure not only WORK but also UTILLOC has sufficient disk space available. UTILLOC is used for temporary sort files and that's where your process will require the most disk space.
Taking your join condition into account: What's the relationship between your two tables? Could you end up with many-to-many relationships? If so, this could explain why you run out of resources; and it would also require you to rethink whether you can get correct averages this way.
What most likely really hurts you is SELECT A.*, B.*, as this pulls every column from both tables into the result; combined with a many-to-many relationship, that can produce a really huge table. Restrict your result table to the columns you really need, especially the ones from table A.
You could consider creating an index on the ZIP code column of dataset A, but that's a secondary measure. First reduce the number of variables in your result set and ensure that your join condition doesn't allow for many-to-many cases.
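A trimmed-down version along these lines might look like the sketch below. The kept columns are an assumption; keep whatever you actually need downstream:

```sas
/* Sketch: index the join key and select only needed columns.
   The retained column list is illustrative. */
proc sql;
  create index zipnum on work.birth03(zipnum);

  create table work.birth_ap01 (compress=YES) as
  select A.id, A.zipnum, A.gewk_count,
         A.gewk_date_start, A.gewk_date_end,
         mean(B.meantemp) as mean_temp_wk,
         mean(B.maxco)    as mean_maxco_wk
         /* ...remaining means as in the original query... */
  from work.birth03 A, work.ap06_imputed B
  where A.zipnum = B.zipnum
    and (A.gewk_count = .
         or B.date between A.gewk_date_start and A.gewk_date_end)
  group by A.id, A.gewk_date_end, A.gewk_date_start;
quit;
```

Dropping B's detail columns matters most here: without them, the grouped result has one row per group instead of one row per matched detail row.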
11-26-2017 01:16 AM
One thing that looks weird in your query is that you are computing summary stats (MEAN) on columns from your B table, but your GROUP BY columns are from your A table. Is that what you meant to do? You should really keep your summary stats query separate; otherwise SAS will end up remerging the summary values back onto the detail rows, which will increase the size a lot.
11-26-2017 09:28 PM
You don't tell us how many rows you have, but for each row of the large table you match many rows of the smaller table repeatedly.
A.gewk_count=. or A.gewk_date_end ge B.date ge A.gewk_date_start
How many rows do you expect from table B for each row from table A for the date condition?
Also, if gewk_count is missing you match *all rows* of table B for the given ZIP code. That's potentially a performance killer if it's not very rare.
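If the missing-gewk_count rows really are a separate case, it may help to split them into their own query so the date-window join doesn't carry the OR condition; the optimizer handles a plain equi-join plus BETWEEN better than an OR. This is a sketch, with an illustrative column list:

```sas
/* Sketch: handle rows with and without a date window separately,
   then stack the two results. Column list is illustrative. */
proc sql;
  create table work.with_window as
  select A.id, A.gewk_count, A.gewk_date_start, A.gewk_date_end,
         mean(B.meantemp) as mean_temp_wk
  from work.birth03 A, work.ap06_imputed B
  where A.gewk_count ne .
    and A.zipnum = B.zipnum
    and B.date between A.gewk_date_start and A.gewk_date_end
  group by A.id, A.gewk_date_end, A.gewk_date_start;

  create table work.no_window as
  select A.id, A.gewk_count, A.gewk_date_start, A.gewk_date_end,
         mean(B.meantemp) as mean_temp_wk
  from work.birth03 A, work.ap06_imputed B
  where A.gewk_count = .
    and A.zipnum = B.zipnum
  group by A.id, A.gewk_date_end, A.gewk_date_start;
quit;

data work.birth_ap01;
  set work.with_window work.no_window;
run;
```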