10-21-2014 09:43 AM
So I am starting to be a little familiar with sas, and I realize that a lot of operations that I learned to do with data sets can also be done using proc sql statements, including merging, creating variables, subsetting and many others.
My question is therefore : "When is it best to do what ?", is proc sql better ? always ?
Thanks in advance
10-21-2014 10:25 AM
I think it is mostly determined by your background. If you grew up as a sql coder, you would probably prefer sql...but as other's have mentioned, there are some things that are better done in data steps..some things are better in sql.
10-21-2014 10:33 AM
I often see this question raised on games programming forums, what technology is best from x or y. The answer is inevitably based on what can "you" do with the technology, and what the goal is. If your happy merging datasets and find joins and such like difficult then keep using datasteps. There are of course some considerations, such as speed of execution and memory/hdd space, however unless you are dealing with large datasets (i.e. in the millions of records) it shouldn't matter too much to you.
10-21-2014 10:35 AM
It depends. Just like a carpenter may use different tools for different tasks, a SAS programmer will either use the DATA step or PROC SQL step depending on the circumstances. The DATA step is very useful when you need to implement business logic or do heavy manipulations on your data. Meanwhile, PROC SQL is useful for mining from a variety of datasouces. Once you become familar and comfortable with both steps, you'll develop a propensity to use the most applicable step depending on the situation.
In a nutshell, DATA step is great for business logic and heavy data manipulations. PROC SQL is great and intuitive for mining and combing many data sources.
I'd check out this paper for an introduction to the differences between the DATA step and PROC SQL: http://www2.sas.com/proceedings/sugi29/269-29.pdf.
Happy Programming! :smileysilly:
10-21-2014 10:55 AM
So a answer with "it depends" does not help at all. If you do so, please at least tell on what id does depend.
Then, it sometimes happens that something is better, that is why there is research in algorithms.
So I rephrase my question : which one is the fastest, which one is the less memory consuming, in which cases.
And it happens that I have data with several millions observations.
10-21-2014 11:52 AM
What can you do with the datastep.
- process all kind of external input data (Sftp url flat files)
- combining a lot of datasources at the same time (set merge point) Hash object external fcuntions like Perl (regular strings).
- creating a lot of dataset as SAS datasets or as tables, all in one pass.
What can you do with SQL.
- Accessing external DBMS systems in their advanced native mode.
- working with sas-datasets in the same way as on a external dbms and combine those
When there is a different functionality choose the best tool for that fcuntionality.
A hammer and screw is a bad combination. This implies you have to learn to use/understand those different tools.
What to choose when there is a good overlap, is that your question?.
The datastep is fast when the sequential approach is good and the limitation is IO speed.
Proc SQL by nature is better when you can use the multithreading features. That is processing al lot in parallel.
When that processing in parallel is alo IO bound and that is sequential (SAN or that spinning thing dasd) when hitting above about 20% of the data (rule of thumb) it will lose in turn around time of sequential processing.
All other factors when speed does not matter but the ease of coding is the most major issue.
Make it as readable and comfortable as you like (personal favors). With most small datasets executing time is not the problem but coding time.