I have a dataset with about 6 million observations and this sas data step is insufficient so I was asked to run it as an sql query. I am not familiar how to phrase if then statements in sql within a select statement as well as format variables. The original code looks like this:
data convert_code;
format date1 date 2 mmddyy8. code $4.;
set userid.code; (this table is pulled from oracle)
if ID='X' then P='A'; else P='B';
If CAT in ('1','2','3') then CAT_group='ONE'; else CAT_GROUP='TWO';
if CAT_2 > '0' and CAT_2A >='1' then d=1; else d=0;
date1=datepart(date1);
date2=datepart(date2);
if code='3' and type_code in ('A','B','C') THEN DO;
if P_CODE in ('1','2','3') then P='1';
if P_CODE in ('4','5','6') then P='2';
end;
if code='4' and e_code in ('A') then DO;
if B_CODE in ('11','12','13') then P='3';
if B_CODE in ('14','15','16') then P='4';
end;
run;
if ID='X' then P='A'; else P='B';
Becomes
Case when ID='X' then 'A'
else 'B' end as P format=$1.,
The rest is similar though you may need to nest case statements and format where appropriate.
Your code only takes about 3 seconds to run on an old machine. Why is it being considered insufficient?
Proc SQL may not do it any quicker.
Agree with Arthur, being asked to rewritte this is very odd. SQL may not be faster at all.
You need to optimise in other ways (try compressing tables for example).
Does userid.code point to an actual Oracle table? If it does then SAS is translating your code into SQL anyway behind the scenes.
If userid.code is a SAS table extracted from Oracle first then you could combine the extraction and your code into one step and that would definitely be quicker.
So 99% of the time that data step is taking is pulling the data from Oracle to SAS. Re-coding the IF/THEN to CASE statements will not help with that.
The best thing you can do is to cut down on the amount of data you need to pull from Oracle to SAS.
Perhaps you could add a KEEP data set option to only move the variables you need?
Perhaps you could add a WHERE statement to only move the observations you need?
A mentioning Oracle and a SAS datastep. As Tom already stated you problem is not the datastep/SQL. Normally complex function with a datastep are faster and can doe more as SQL.
Your real problem is the data movement from oracle to SAS as the data is saved in the saswork the destination is unclear.
When the destination is SAS the goal is to minimize the needed data (keep and where)
When the destination is back to Oracle the whole processing could be done in Oracle using implicit or explicit SQL.
With Oracle there is no support for in-database processing and publishing SAS formats. SAS(R) 9.4 In-Database Products: User's Guide, Fourth Edition The approach for using formats with in-database is a nogo.
The only SAS fucntions being supported are of the Oracle SAS/ACCESS(R) 9.4 for Relational Databases: Reference, Sixth Edition
The datepart (date1/date2) to be eliminiated (oracle/sas datetime difference) and only focus on that new variable code.
The case and in of SAS(R) 9.4 SQL Procedure User's Guide should be able to work well for a implicit one (most easy to code).
Hi all who answered,
This is pinkyc, I no longer have access to the account due to my college discontinuing my email and can't reset my password so I apologize that I cannot tick helpful answers for this thread. I think the issue as Tom correctly pointed out is that the table I was pulling from (userid.) didn't have 6 million observations, more like 60 million which was caused by a query that wasn't specific enough. So I fixed that to actually 6 million and it ran fine. Thanks everyone else too!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.