06-29-2017 10:40 PM
My dataset has 31 variables and 100,000,000 rows. I impute missing data in this dataset using PROC MI (NIMPUTE=10) and then analyze the imputed data with PROC SURVEYREG. From running a subset of the data, I estimated the full job would take well over a week to finish, and at my institute any job running over a week is terminated. What could I do to speed up PROC MI? Or is there a way to make my program continue from where it was terminated, without starting over from the beginning?
Thanks for your help,
07-03-2017 06:15 AM
I have to say that's one very big data set you have, and to be honest I wouldn't be surprised if the job failed to complete because of lack of work space anyway. I think the only way you're going to get this through is to reduce your file size, and from a programmer's perspective there are a few questions you could ask yourself:
1. Do I need all the variables in the file for my analysis? If not, then removing them will save space and probably speed the job up.
2. If I'm doing BY-group processing, can I split the file up into smaller files, run the job on each file, and then combine the results?
3. Do I need to impute for all the groups? If not could I remove some rows from that part of the process and add them back later?
4. If I'm doing suggestion 2, can I run the smaller jobs in parallel by using SAS MP Connect?
5. Will compressing the file(s) significantly reduce file size?
These are just a few ideas you could start with...
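Suggestions 1 to 3 might look something like the sketch below. All data set and variable names here are hypothetical placeholders (a source table WORK.BIG, a grouping variable REGION, analysis variables Y and X1-X5), so adapt them to your own data; the point is only to show dropping unused columns, splitting the file, and imputing each part separately.

```sas
/* Sketch only: WORK.BIG, REGION, Y, X1-X5 are hypothetical names.
   Assumes imputation is valid within each group. */

/* 1. Keep only the variables the analysis needs */
data work.slim;
    set work.big(keep=region y x1-x5);   /* drop unused columns */
run;

/* 2. Split into smaller files by group */
data work.part1 work.part2;
    set work.slim;
    if region = 1 then output work.part1;
    else if region = 2 then output work.part2;
run;

/* 3. Impute each part separately (repeat for each part),
      then combine before or after the analysis step */
proc mi data=work.part1 nimpute=10 out=work.imp1 seed=12345;
    var y x1-x5;
run;
```

Each part could then be run as its own job (or its own MP Connect remote session), which is what makes the parallel option in suggestion 4 possible.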
07-03-2017 08:48 AM
Thank you for your suggestions. I already applied suggestion 2, but each small group still took more than a week to finish. In fact, my problem isn't running out of work space but the time-limit policy imposed on the institute's cluster.
I am curious about your suggestion 5. Could you elaborate on this option?
Also, I am interested in suggestions for adding a checkpoint, so my program can restart from that checkpoint instead of from the beginning.
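One common way to get checkpoint-style restarts in SAS is to write each group's imputed output to a permanent library and, on restart, skip any group whose output data set already exists. The sketch below assumes the per-group input data sets WORK.PART1, WORK.PART2, ... and the library path are hypothetical; the key pieces are a permanent LIBNAME (so results survive a killed job) and the EXIST function to test for completed work.

```sas
/* Sketch of a restart-from-checkpoint scheme (hypothetical names).
   Each finished group is saved to a permanent library; on restart,
   groups whose output already exists are skipped. */
libname ckpt '/path/to/permanent/storage';   /* survives job termination */

%macro impute_groups(ngroups=);
    %do g = 1 %to &ngroups;
        /* EXIST returns 1 if CKPT.IMP&G is already there */
        %if %sysfunc(exist(ckpt.imp&g)) = 0 %then %do;
            proc mi data=work.part&g nimpute=10
                    out=ckpt.imp&g seed=12345;
                var y x1-x5;
            run;
        %end;
    %end;
%mend impute_groups;

%impute_groups(ngroups=10)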
07-03-2017 09:22 AM
When I talk about work space I'm not referring to a Workspace Server but to the area of the disk used for the SAS Work library and utility files. These utility files are often created "behind the scenes" by SAS Procedures and can contribute to you running out of work space.
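If it helps to check where that disk area actually lives, PROC OPTIONS can display the current WORK and utility-file locations; the paths shown below are hypothetical examples.

```sas
/* Display where the WORK library and utility files are located */
proc options option=work value; run;     /* path of the WORK library  */
proc options option=utilloc value; run;  /* path for utility files    */

/* Both are invocation-time options, e.g. (hypothetical paths):
   sas -work /bigdisk/saswork -utilloc /bigdisk/sasutil job.sas */
```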
I'm rather surprised that you estimate a file of reduced size taking so long. I would expect, for example, that if you had a job which ran for 10 hours, then splitting the file in half into File A and File B and running the same job twice (once against File A and once against File B) would give an elapsed time for each run of a little over 5 hours (there's a certain amount of overhead associated with any job).
You can find out more about SAS file compression at this link - if you can get a good compression ratio then you will reduce the number of data pages the file occupies, and hence the number of disk reads, which should help reduce the run times. My guess is that with such a large file there won't be a single "silver bullet" and you'll have to use a combination of techniques.
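Turning compression on is a one-line change; the sketch below uses hypothetical data set names. COMPRESS=BINARY often does well on mostly numeric rows, while COMPRESS=YES (character RLE) suits character-heavy data; the SAS log prints a note with the percentage of pages saved, so you can check whether it actually helped.

```sas
/* Sketch: enabling data set compression (hypothetical names) */
options compress=binary;           /* session-wide default          */

data work.big_c(compress=binary);  /* or set it per data set        */
    set work.big;
run;

proc contents data=work.big_c;     /* shows the compressed page count */
run;
```

Note that compression trades CPU for I/O: pages must be decompressed on read, so it pays off only when disk reads are the bottleneck.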