I have a question about efficiency. I am trying to select only one variable from a file. For purposes of illustration, say file1 has 100 variables and file 2 has 1. Would it considerably matter in terms of time whether I use a data statement in conjunction with PROC SQL which selects that variable from file2 or just to go ahead and use file1 anyway and just simply select my variable of choice ? In short, is there a any saving in terms of memory and time between the two:
create table xxx as
create table xxx as
> version 2 will often be quicker than version 1 as the
> data are only passed once.
> Version 1 can be simplified to
> DATA xxx;
> SET file1(KEEP=var1);
> and would then be about the same as version 2.
Thanks for replying. It just occurred to me that I might have made my problem seem simpler than it really is. The intended proc sql in version 2 should not be as trivial as I made it look. It should involve a join of either file1 or file2 with say file3 & file4 so your simplification of version 1 above will certainly not work alone except with an accompanying merge. I think my question boils down to "Is it quicker in PROC SQL to select the same 2 variables from a file with 100 variables or from one with 5 variables or is the size immaterial ?"
The difference is measurable, but may not be noticeable. In both the data step and SQL, SAS has to read the entire program data vector (PDV) in order to pick off the variables of interest (search the documentation for "PDV"). So, with a hundred observations, one will hardly notice the difference between a dataset with 5 variables and one with 100. With a million, you definitely will notice the difference. When your data start getting "large", you also need to start looking at the LENGTH of each variable; that can also make a difference.
Search for on support.sas.com for a lot of SUGI/SGF articles that examine this question more formally.