10-11-2017 10:21 AM - edited 10-11-2017 10:22 AM
I have a table called all_data read into SAS that consists of 166 variables, of which column M1 is the primary key, and 50 million records - so as you can imagine the table is quite large, taking up about 55GB of disk space.
I have a macro which selects the primary key column M1 plus one other variable (two columns in total) and reads them into a temporary table which the macro then processes. The macro will eventually run for every variable in the table, so I am looking to save time.
At the moment, I am using a Data step to single out my two columns at the start of the macro. The variable that differs each time is read in as the parameter var declared in the macro:
data table_temp;
   set all_data (keep=M1 &var.);
run;
I currently have this running for the first variable, but this data step alone takes about 40 minutes, which is not ideal because there is a lot of processing after that.
I was wondering if using a Proc SQL statement instead of a Data step would speed up this process? Something like below:
proc sql;
   create table table_temp as
   select M1, &var.
   from all_data;
quit;
Or possibly using Proc SQL, selecting all columns and applying a KEEP= data set option in the FROM clause?
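For reference, SAS data set options such as KEEP= can be attached directly to the table in the FROM clause, so that variant would look something like this (a sketch, not tested against your data):

```sas
proc sql;
   create table table_temp as
   select *
   from all_data (keep=M1 &var.);
quit;
```

Either way the engine still has to read every observation from all_data, so the KEEP= placement mostly affects how much data is carried through the step, not the number of passes over the file.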
If anyone could offer advice on the speediest way to do this, with an explanation, it would be very much appreciated - thankyou!
10-11-2017 10:26 AM - edited 10-11-2017 10:30 AM
Which is faster? I'm going to guess that the data step is faster ... but ... you could try them both and then you'd know, you wouldn't have to rely on someone's guess.
But since we don't know what you are doing in the next step after selecting the proper columns from all_data, I have a suggestion that might work (again, depending on what you are doing): do NOT do this one variable at a time, but run the next step on ALL variables at once. For example, if you are computing summary statistics via PROC MEANS or PROC SUMMARY, do it on ALL variables in a single pass:
proc summary data=all_data;
   class m1;
   var firstvarname--lastvarname; /* or some legal list of variable names */
   output out=stats mean= stddev= / autoname;
run;
I would certainly pursue this method and take it as far as possible, rather than a macro with loops.
10-11-2017 11:24 AM
I'll echo the advice: do _not_ do this the way you're suggesting. 150 or so reads from this dataset, followed by 150 writes of ~1GB data files, and then the further processing on each, will be slow, slow, slow.
What else can you do? Well, you could turn this into a vertical dataset, meaning ID + variable name/key + variable value. This will be a net-larger dataset (maybe 3 times as large, at most). But you can now use BY group processing to perform your analysis, and the increase in size may not be significant.
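A rough sketch of that restructuring, assuming the analysis variables are all numeric (the names long, varname, and value are illustrative):

```sas
/* One pass over all_data, writing one row per (M1, variable) pair. */
data long;
   set all_data;
   array vars {*} _numeric_;          /* all numeric variables */
   do _i_ = 1 to dim(vars);
      if vname(vars{_i_}) ne "M1" then do;
         varname = vname(vars{_i_});  /* variable name as key */
         value   = vars{_i_};         /* its value on this row */
         output;
      end;
   end;
   keep M1 varname value;
run;
```

After sorting (or indexing) by varname, a single BY varname statement in your analysis step replaces the entire macro loop.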
You can also use views to process your work, which at least avoids the step of writing out the 150 1GB datasets. You still do 150 reads of the dataset, which is still slow, but perhaps it's not that bad - depending on what you're doing after this it's possible you're not going to have that much of an issue here.
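As a sketch of the view approach, the DATA step view below stores only the program, not the data, so nothing large hits the disk; rows are pulled from all_data only when a later step actually uses the view (the PROC MEANS step is just a hypothetical downstream consumer):

```sas
data table_temp / view=table_temp;
   set all_data (keep=M1 &var.);
run;

proc means data=table_temp;
   var &var.;
run;
```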
Ultimately any better advice needs more details as to what you're doing next.
10-11-2017 01:40 PM
Change your process, whatever it is, to handle all your columns at once rather than two at a time. Otherwise you're reading your data far too many times.
Without more details that's the best we can offer.
You can review David Cassell's paper "Don't Be Loopy" for more details on why you should avoid macro loops for this kind of processing.