Hi there,
I have a database which is assigned by a libname and I want to grab data from this dataset which contains 10,664,192 observations.
The following code is used to extract a subset of this dataset and the subset itself contains about 1,531,165 observations.
The time (real time) for the code to run is 17 min 37 sec, which I find a bit long. Is there a way to improve the code for it to be faster a little bit or there's no improvement possible?
%let MY = 201910;
data subset;
set myset(where=(source="XYZ"));
if put(effective_date, yymmn.) <= &MY. and put(expired_date,yymmn.) >= &MY.;
keep field1 field2 field3 field4 field5 field6 ... field14;
if state in ("BC") then tx = 0.02;
else if state in ("NT") then tx = 0.03;
else if state in ("YK") then tx = 0.02;
.... (there's 10 more else if statements)
run;I am pretty sure there's something I could do tu make this code faster. I tried with options threads cpucount=actual but I think this is not working with datastep procedure. I am using SAS EG on a Grid environment (Unix).
Thanks in advance for any help. This is appreciated.
You will need to tell ORACLE what columns you want to read it with the KEEP= dataset option. You might still need a KEEP, or more likely a DROP statement, to tell SAS what variables from the data step to write out. Some people code a KEEP= dataset option on the output dataset name, but I always find that as a confusing way to code.
You could also perhaps increase the speed a lot by doing the processing in Oracle. Especially if you want to eliminate observations based on some of your calculated variables. Or even more so if you want to just summarize the data instead of getting all of the detail lines. But then you need to know how to code in Oracle's flavor of SQL.
Depends a lot one your computing environment. Where is the source dataset being read from? Where are you writing the result.
If the source is some external database it might help a lot to change the KEEP statement into a KEEP= dataset option on the input dataset as that will limit the amount of data that needs to move from the external database into SAS.
@Tom wrote:
Depends a lot one your computing environment. Where is the source dataset being read from? Where are you writing the result.
If the source is some external database it might help a lot to change the KEEP statement into a KEEP= dataset option on the input dataset as that will limit the amount of data that needs to move from the external database into SAS.
Hi @Tom ,
The source is read from an Oracle database (on the network) and I am writing the result in the SAS WORK library.
For the keep options, since those else/if statements create 3 new fields, should I write the keep options in the input dataset with those fields or this is not necessary?
You will need to tell ORACLE what columns you want to read it with the KEEP= dataset option. You might still need a KEEP, or more likely a DROP statement, to tell SAS what variables from the data step to write out. Some people code a KEEP= dataset option on the output dataset name, but I always find that as a confusing way to code.
You could also perhaps increase the speed a lot by doing the processing in Oracle. Especially if you want to eliminate observations based on some of your calculated variables. Or even more so if you want to just summarize the data instead of getting all of the detail lines. But then you need to know how to code in Oracle's flavor of SQL.
@Tom wrote:
You will need to tell ORACLE what columns you want to read it with the KEEP= dataset option. You might still need a KEEP, or more likely a DROP statement, to tell SAS what variables from the data step to write out. Some people code a KEEP= dataset option on the output dataset name, but I always find that as a confusing way to code.
You could also perhaps increase the speed a lot by doing the processing in Oracle. Especially if you want to eliminate observations based on some of your calculated variables. Or even more so if you want to just summarize the data instead of getting all of the detail lines. But then you need to know how to code in Oracle's flavor of SQL.
I will try the "KEEP=" options in the input dataset. By the mean time, I tried to re-run this code with those options :
options threads cpucount=actual;
options msglevel=i fullstimer;And here's the ouput in the log. Is there something you see that could be a hint of improvement?
NOTE: DATA statement used (Total process time):
real time 17:02.43
user cpu time 2:16.40
system cpu time 19.93 seconds
memory 1200.62k
OS Memory 23976.00k
Timestamp 2019-11-13 10:21:20 AM
Step count 36  swtich count 132
Page faults 3
Page Reclaims 7380
Page Swaps 0
Voluntary Context Switches 2226057
Involuntary  Context Switches 2113
Block Input Operations 15944
Block Output Operations 263968One place:
if put(effective_date, yymmn.) <= &MY. and put(expired_date,yymmn.) >= &MY.;
Instead of doing 2 numeric to text and then comparing, and possibly I would even guess a further text to numeric conversion create an actual date value for &MY and you could add to the where clause
and( effective_date le &my. and expired_date ge &my.)
If you see a NOTE about character to numeric conversions then you are wasting cycles with this.
Another though very minor is this set of If/then/else statements.
if state in ("BC") then tx = 0.02;
else if state in ("NT") then tx = 0.03;
else if state in ("YK") then tx = 0.02;
An INFORMAT when you are assigning values based on a single input variable, especially if it does not change often can execute much quicker:
Proc format library=work; invalue sttx 'BC' = 0.02 'NT' = 0.03 'YK' = 0.02 other=_error_ ; run; data example; input st $; tx = input(st,sttx.); datalines; BC NT YK QQ ;
The option on the invalue of other=_error_ treats any unexpected value as an error, which your If/Then/Else approach may not catch (since you didn't show the entire block);
@ballardw wrote:
One place:
if put(effective_date, yymmn.) <= &MY. and put(expired_date,yymmn.) >= &MY.;Instead of doing 2 numeric to text and then comparing, and possibly I would even guess a further text to numeric conversion create an actual date value for &MY and you could add to the where clause
and( effective_date le &my. and expired_date ge &my.)If you see a NOTE about character to numeric conversions then you are wasting cycles with this.
Another though very minor is this set of If/then/else statements.
if state in ("BC") then tx = 0.02; else if state in ("NT") then tx = 0.03; else if state in ("YK") then tx = 0.02;An INFORMAT when you are assigning values based on a single input variable, especially if it does not change often can execute much quicker:
Proc format library=work; invalue sttx 'BC' = 0.02 'NT' = 0.03 'YK' = 0.02 other=_error_ ; run; data example; input st $; tx = input(st,sttx.); datalines; BC NT YK QQ ;The option on the invalue of other=_error_ treats any unexpected value as an error, which your If/Then/Else approach may not catch (since you didn't show the entire block);
I saw a note about conversion, you are right. Will try to use your advice. However, I would like to passe the condition on thoses dates (expired and effective) into the WHERE in the input dataset line. How could I do that if I already have one condition (which is the SOURCE =...)?
@jpprovost wrote:
I saw a note about conversion, you are right. Will try to use your advice. However, I would like to passe the condition on thoses dates (expired and effective) into the WHERE in the input dataset line. How could I do that if I already have one condition (which is the SOURCE =...)?
WHERE will allow many conditions just like any IF.
set somedataset (where = ( ( thisvar = 57 and thatvar in ( 2 3 4 ) ) or Name='Jake' ) );
@jpprovost maybe try doing the processing into more than one datastep. Fetch data => work data => put results.
combined those states where the tax is the same.
Use Proc SQL with case conditions.
@VDD wrote:
@jpprovost maybe try doing the processing into more than one datastep. Fetch data => work data => put results.
combined those states where the tax is the same.
Use Proc SQL with case conditions.
Hi @VDD ,
I tried with a PROC SQL and the CASE conditions. Wow, really fast (2 min vs 17 min). Really good improvement.
@Tom ,
I tried to combine my WHERE and my KEEP into one line and it says that one of my variable (SOURCE) is not on file?
Here's the line :
set myset(keep =  field1 field2 field3 field4 field5 field6 ... field14 where=(source="XYZ" and effective_date <= &MY. and expired_date >= &MY.));
I got the following error:
ERROR: Variable SOURCE is not on file myset.
@Tom wrote:
You told Oracle you didn't want SOURCE, so you cannot use it in the WHERE clause.
Hint: SAS will process the dataset options in alphabetical order, WHERE is the last one.
It works, my bad.
Thanks to all. Very useful as always.
If you are extracting only small part of possible states and the state is a key in the database (indexed)
then use :
   set myset(where = (state in('BC', 'NT' ....) and source = 'XYZ'));that will eliminate data transfer from oracle to sas.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
