BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jpprovost
Quartz | Level 8

Hi there,

 

I have a database which is assigned by a libname and I want to grab data from this dataset which contains 10,664,192 observations.

The following code is used to extract a subset of this dataset and the subset itself contains about 1,531,165 observations.

The time (real time) for the code to run is 17 min 37 sec, which I find a bit long. Is there a way to improve the code for it to be faster a little bit or there's no improvement possible?

 

%let MY = 201910;

data subset;
set myset(where=(source="XYZ"));
if put(effective_date, yymmn.) <= &MY. and put(expired_date,yymmn.) >= &MY.;

keep field1 field2 field3 field4 field5 field6 ... field14;
if state in ("BC") then tx = 0.02;
else if state in ("NT") then tx = 0.03;
else if state in ("YK") then tx = 0.02;
.... (there's 10 more else if statements)
run;

I am pretty sure there's something I could do tu make this code faster. I tried with options threads cpucount=actual but I think this is not working with datastep procedure. I am using SAS EG on a Grid environment (Unix).

 

Thanks in advance for any help. This is appreciated.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You will need to tell ORACLE what columns you want to read it with the KEEP= dataset option.  You might still need a KEEP, or more likely a DROP statement, to tell SAS what variables from the data step to write out.  Some people code a KEEP= dataset option on the output dataset name, but I always find that as a confusing way to code.

 

You could also perhaps increase the speed a lot by doing the processing in Oracle. Especially if you want to eliminate observations based on some of your calculated variables.  Or even more so if you want to just summarize the data instead of getting all of the detail lines.  But then you need to know how to code in Oracle's flavor of SQL.

View solution in original post

12 REPLIES 12
Tom
Super User Tom
Super User

Depends a lot one your computing environment.  Where is the source dataset being read from? Where are you writing the result.

 

If the source is some external database it might help a lot to change the KEEP statement into a KEEP= dataset option on the input dataset as that will limit the amount of data that needs to move from the external database into SAS.

jpprovost
Quartz | Level 8

@Tom wrote:

Depends a lot one your computing environment.  Where is the source dataset being read from? Where are you writing the result.

 

If the source is some external database it might help a lot to change the KEEP statement into a KEEP= dataset option on the input dataset as that will limit the amount of data that needs to move from the external database into SAS.


Hi @Tom ,

 

The source is read from an Oracle database (on the network) and I am writing the result in the SAS WORK library.

 

For the keep options, since those else/if statements create 3 new fields, should I write the keep options in the input dataset with those fields or this is not necessary? 

Tom
Super User Tom
Super User

You will need to tell ORACLE what columns you want to read it with the KEEP= dataset option.  You might still need a KEEP, or more likely a DROP statement, to tell SAS what variables from the data step to write out.  Some people code a KEEP= dataset option on the output dataset name, but I always find that as a confusing way to code.

 

You could also perhaps increase the speed a lot by doing the processing in Oracle. Especially if you want to eliminate observations based on some of your calculated variables.  Or even more so if you want to just summarize the data instead of getting all of the detail lines.  But then you need to know how to code in Oracle's flavor of SQL.

jpprovost
Quartz | Level 8

@Tom wrote:

You will need to tell ORACLE what columns you want to read it with the KEEP= dataset option.  You might still need a KEEP, or more likely a DROP statement, to tell SAS what variables from the data step to write out.  Some people code a KEEP= dataset option on the output dataset name, but I always find that as a confusing way to code.

 

You could also perhaps increase the speed a lot by doing the processing in Oracle. Especially if you want to eliminate observations based on some of your calculated variables.  Or even more so if you want to just summarize the data instead of getting all of the detail lines.  But then you need to know how to code in Oracle's flavor of SQL.


 

I will try the "KEEP=" options in the input dataset. By the mean time, I tried to re-run this code with those options :

 

options threads cpucount=actual;
options msglevel=i fullstimer;

And here's the ouput in the log. Is there something you see that could be a hint of improvement?

 

NOTE: DATA statement used (Total process time):

real time 17:02.43
user cpu time 2:16.40
system cpu time 19.93 seconds
memory 1200.62k
OS Memory 23976.00k
Timestamp 2019-11-13 10:21:20 AM
Step count 36  swtich count 132
Page faults 3
Page Reclaims 7380
Page Swaps 0
Voluntary Context Switches 2226057
Involuntary  Context Switches 2113
Block Input Operations 15944
Block Output Operations 263968
ballardw
Super User

One place:

if put(effective_date, yymmn.) <= &MY. and put(expired_date,yymmn.) >= &MY.;

Instead of doing 2 numeric to text and then comparing, and possibly I would even guess a further text to numeric conversion create an actual date value for &MY and you could add to the where clause

 

and( effective_date le &my. and expired_date ge &my.)

If you see a NOTE about character to numeric conversions then you are wasting cycles with this.

 

Another though very minor is this set of If/then/else statements.

if state in ("BC") then tx = 0.02;
else if state in ("NT") then tx = 0.03;
else if state in ("YK") then tx = 0.02;

An INFORMAT when you are assigning values based on a single input variable, especially if it does not change often can execute much quicker:

Proc format library=work;
invalue sttx
'BC' = 0.02
'NT' = 0.03
'YK' = 0.02
other=_error_
;
run;

data example;
   input st $;
   tx = input(st,sttx.);
datalines;
BC
NT
YK
QQ
;

The option on the invalue of other=_error_ treats any unexpected value as an error, which your If/Then/Else approach may not catch (since you didn't show the entire block);

 

jpprovost
Quartz | Level 8

@ballardw wrote:

One place:

if put(effective_date, yymmn.) <= &MY. and put(expired_date,yymmn.) >= &MY.;

Instead of doing 2 numeric to text and then comparing, and possibly I would even guess a further text to numeric conversion create an actual date value for &MY and you could add to the where clause

 

and( effective_date le &my. and expired_date ge &my.)

If you see a NOTE about character to numeric conversions then you are wasting cycles with this.

 

Another though very minor is this set of If/then/else statements.

if state in ("BC") then tx = 0.02;
else if state in ("NT") then tx = 0.03;
else if state in ("YK") then tx = 0.02;

An INFORMAT when you are assigning values based on a single input variable, especially if it does not change often can execute much quicker:

Proc format library=work;
invalue sttx
'BC' = 0.02
'NT' = 0.03
'YK' = 0.02
other=_error_
;
run;

data example;
   input st $;
   tx = input(st,sttx.);
datalines;
BC
NT
YK
QQ
;

The option on the invalue of other=_error_ treats any unexpected value as an error, which your If/Then/Else approach may not catch (since you didn't show the entire block);

 


 

I saw a note about conversion, you are right. Will try to use your advice. However, I would like to passe the condition on thoses dates (expired and effective) into the WHERE in the input dataset line. How could I do that if I already have one condition (which is the SOURCE =...)?

 

ballardw
Super User

@jpprovost wrote:

 

I saw a note about conversion, you are right. Will try to use your advice. However, I would like to passe the condition on thoses dates (expired and effective) into the WHERE in the input dataset line. How could I do that if I already have one condition (which is the SOURCE =...)?

 


WHERE will allow many conditions just like any IF.

 

set somedataset (where = (  ( thisvar = 57 and thatvar in ( 2 3 4 )  ) or Name='Jake' ) );

 

 

VDD
Ammonite | Level 13 VDD
Ammonite | Level 13

@jpprovost maybe try doing the processing into more than one datastep.  Fetch data => work data => put results.

 

combined those states where the tax is the same.

 

Use Proc SQL with case conditions.

 

jpprovost
Quartz | Level 8

@VDD wrote:

@jpprovost maybe try doing the processing into more than one datastep.  Fetch data => work data => put results.

 

combined those states where the tax is the same.

 

Use Proc SQL with case conditions.

 


Hi @VDD ,

 

I tried with a PROC SQL and the CASE conditions. Wow, really fast (2 min vs 17 min). Really good improvement.

 

@Tom ,

 

I tried to combine my WHERE and my KEEP into one line and it says that one of my variable (SOURCE) is not on file?

 

Here's the line :

 

set myset(keep =  field1 field2 field3 field4 field5 field6 ... field14 where=(source="XYZ" and effective_date <= &MY. and expired_date >= &MY.));

I got the following error:

 

ERROR: Variable SOURCE is not on file myset.
Tom
Super User Tom
Super User
You told Oracle you didn't want SOURCE, so you cannot use it in the WHERE clause.
Hint: SAS will process the dataset options in alphabetical order, WHERE is the last one.
jpprovost
Quartz | Level 8

@Tom wrote:
You told Oracle you didn't want SOURCE, so you cannot use it in the WHERE clause.
Hint: SAS will process the dataset options in alphabetical order, WHERE is the last one.

It works, my bad.

Thanks to all. Very useful as always.

Shmuel
Garnet | Level 18

If you are extracting only small part of possible states and the state is a key in the database (indexed)

then use :

   set myset(where = (state in('BC', 'NT' ....) and source = 'XYZ'));

that will eliminate data transfer from oracle to sas. 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 2036 views
  • 4 likes
  • 5 in conversation