Hello,
When handling large datasets ,is it better to work with proc sql or DATA step?
Which is working more quickly?
It depends on what you intend to do, it's a case by case situation, there is no general answer or rule to follow.
For my part I'd say it is triggered by task, functionality, usage, readability and easiness of validation for third party.
- Cheers -
I prefer the DATA step as well, but there are features in PROC SQL that really aren't available in a DATA step (and vice versa), and so the answer is what @Oligolas said: "It depends".
You could time the data step versus the proc sql step using some of the code included in the link below.
https://blogs.sas.com/content/sgf/2015/01/21/sas-timer-the-key-to-writing-efficient-sas-code/
I'm not totally familiar with how the two are going to vary under the hood, but it gives you a metric to go by. Also assumes you're willing to write both steps.
What is best for large data, SQL or data step?
Depends. On many things.
First, how large is "large"? Are we talking 10 GB or 1000 GB? In the first case, it should be possible to fit all the data into the memory of your server, or even a decent-sized PC. In the second case, not so easy.
The data step used to be one of the fastest beasts around, back in the old days when even a mainframe might not have more than a few gigabytes of memory, and most processing had to be done from disc or even tape. SAS data step language is very fast for sequential processing, also when you are low on memory. It is not an interpreted language, it gets compiled to machine code on the fly, and can do a lot of processing with very little memory.
Nowadays, SQL interpreters are pretty good at optimizing data handling, and memory is not so much a bottleneck.
Second, where is your data stored? If it is on a modern database server, using pass-through native SQL will in many situations be faster than the data step. Even SAS SQL, if written correctly, can be executed as native SQL on the server.
Third, what do know best? Badly written data step code is easily outperformed by well-written SQL, and vice versa.
Fourth, what are you doing? Is it updating data with transactions, analyzing data, or transforming data for a Data Warehouse? The solution you choose should also depend on the task at hand.
Fifth, what are your resources? A large SAS server can be quite expensive (especially the software licences), and it you may be better off investing in a powerful database server, and using SAS for the tasks where it really shines. On the other hand you may have a powerful SAS server, used by analysts during the day, but available for batch processing at night. In which case you may prefer SAS data step code for you night batch.
Personally, I am completely flexible. But even when doing native SQL on a database server, SAS is a very good framework for tying things together. I have not seen a macro language on par with SAS on any other platform.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.