Hi
i have a piece of code that takes on average 2hrs two run due to the size of the data.
below is the code sample.
is there more better way i can code this.
e.g.
data Data_set_name;
set Data_set_name;
where channel = 'PHONE'
or channel = 'BRN'
or channel = 'BRANCH'
or channel = 'FD'
or channel = 'OKS'
or channel = 'FFI'
or channel = 'FIRSTDIRECT'
or channel = 'OCC'
or channel = 'OTHERCALLCENTRE'
or channel = 'FPAS'
and contacted = "Y"
and Direct_Marketing = "Y";
keep id_no cust_no cust_cde dataset_name date_end channel description purpose prod_hse credit_campaign Direct_Marketing;
run;
Allistair,
Try this modified version of your data step. It only reads the columns you require into the PDV, and uses shorter where clause condition.
DATA Data_set_name;
SET Data_set_name(KEEP=id_no cust_no cust_cde dataset_name date_end channel description purpose prod_hse credit_campaign Direct_Marketing
WHERE=( channel IN ('PHONE','BRN','BRANCH','FD','OKS','FFI','FIRSTDIRECT','OCC','OTHERCALLCENTRE','FPAS')
AND contacted = "Y"
AND Direct_Marketing = "Y"));
RUN;
Please keep in mind, performance enhancements do not rely on code modifications only!! Sometimes additional changes will be required in order to enhance response times, such as
1. Adding an index for the channel column. (If contacted and Direct_Marketing only contain Y/N values, then no index required)
2. The Physical location of the data set.
2.1 Network Access --> Network bandwidth limitation
2.2 Database Access --> Changes into the Read Buffer size
2.3 Disk --> Change stripe widths of 64 or 128K
Check the following blog, it includes links for additional insight into resolving performance issues
Improving performance: Determine the cause
Good luck,
Ahmed
Please try
data Data_set_name;
set Data_set_name(keep = id_no cust_no cust_cde dataset_name date_end channel description purpose prod_hse credit_campaign Direct_Marketing);
where prxmatch('/PHONE|BRN|BRANCH|FD|OKS|FFI|FIRSTDIREC|FIRSTDIRECT|OCC|OTHERCALLCENTRE|FPAS/',channel)>0
and contacted='Y' and Direct_Marketing = "Y";
run;
Thanks,
Jag
Several recommendations that might allow the WHERE statement to evaluate faster.
Examine the percentage of observations that satisfy each condition in the WHERE statement. Reorder the conditions so that the WHERE statement can make a decision as early as possible. For example, if only 5% of the observations meet the condition CONTACTED="Y", place that first:
where contacted="Y" and
(channel='PHONE' or ... or channel='FPAS')
and Direct_Marketing="Y";
Next, add those parentheses around the CHANNEL values. When you don't, it gets confusing to interpret and to be confident that you are asking for the right set of observations.
Finally, within the list of CHANNEL values, order them from MOST to LEAST frequent occurrence. Again, that let's the WHERE statement find a match as early as possible so it doesn't have to check the remaining values.
In summary, AND conditions should be ordered from least likely to most likely, but OR conditions should be ordered from most likely to least likely.
Good luck.
Allistair,
I agree with both of Astounding recommendations and the other that appeared in the other two recommendations I saw (i.e., only keeping the data you need).
That last one will be most effective if you aren't keeping a lot of variables. I just ran a test case where I was dropping a 1,000 variables from the run and it, alone, caused a 75% reduction in running time.
However, you will also need to include contacted in your keep statement, but then drop it in your data statement. i.e.:
data Data_set_name4 (drop=contacted);
set Data_set_name (keep=id_no cust_no cust_cde dataset_name
date_end channel description purpose prod_hse credit_campaign
Direct_Marketing contacted);
I agree with the above, and add:
1) +1 for the suggestion to use indexes.
2) We've seen good performance gains by using a "SPDE Work" library. There are some caveats you need to be aware of (see the doc). But, it's really easy to implement:
libname spdework spde "%sysfunc(pathname(work))" temp=yes;
data spdework.Data_set_name4 (drop=contacted);
set spdework.Data_set_name (keep=id_no cust_no cust_cde dataset_name
date_end channel description purpose prod_hse credit_campaign
Direct_Marketing contacted);
I'm assuming your source data is a work dataset since it has a one-level name.
You can simplify this further by using the options user=spdework; statement. This redirects one level names to the specified library.
On top of all the good advice already given:
- Use option "fullstimer" so you can see in the log where you're actually spending the time SAS(R) 9.4 Language Reference: Concepts, Second Edition
- Have permanent datasets compressed to decrease disk I/O.
- If the source data set is stored in a data base: Code your where clause in a way that it gets fully pushed to the data base for execution. Use "OPTIONS SASTRACE=',,,d' SASTRACELOC=SASLOG NOSTSUFFIX;" to see in the log what part of processing actually gets sent to the DB.
If analysing the log using "fullstimer" tells you it's about I/O then the following will be of interest to you:
Hi,
Please check this one. this could also help.
https://communities.sas.com/message/172882#172882.
Regards
Uma Shanker Saini
I never noticed that the order in the where cleasu is affecting performance. What is, is the use of index or not. Whether is optimal to create index or not, it is its own subject area which is discussed in each each project I would say... If you decide to have indexes, and your data is in SAS, I would consider SPDE since it handles indexes much more efficient than the Base SAS engine (especially creating and updating).
Then I wonder if your where cleuse is doing what you want, AND ties "harder" than OR, so your logic contacted = "Y" and Direct_Marketing = "Y" is only evaluated for channel FPAS, is that what you want?
Otherwise, replace the whole OR block with an IN(), easier to code, and perhaps avoids simple coding errors...
LinusH,
I ran a few tests to explore results of changing order in the WHERE statement. I was able to confirm that the order of AND conditions makes a difference, but not the order of OR conditions. Here's one program I used:
data test;
y=5;
do x=1 to 10000000;
output;
end;
retain var3 - var10 0;
run;
data test2;
set test;
where y < 0 and x > 0;
run;
data test3;
set test;
where x > 0 and y < 0;
run;
In real life, I ran both DATA steps several times to confirm that the CPU time is different. In the case of OR conditions, I suspect that SAS changes them internally, converting them to an IN operator. But I don't think I found the whole story on that.
Of course, the differences will be small because the WHERE statement accounts for just a small percentage of the work in a DATA step.
Yeah, you are correct. And the difference on CPU time is about 20%, so it does have some impact.
I guess that since these variables are not indexed, and therefore have no stats of the contents in the variable, SAS can't guess in what order it should evaluate the where clause. So this is a clear differentiation to RDBMS where you usually have the possibility to calculate statistics for a table, indexed or not.
Hi! I agree with Linus! The WHERE-statements below are exactly the same:
where channel = 'PHONE' or channel = 'BRN' or channel = 'BRANCH' or channel = 'FD' or channel = 'OKS'
or channel = 'FFI' or channel = 'FIRSTDIRECT' or channel = 'OCC' or channel = 'OTHERCALLCENTRE'
or channel = 'FPAS' and contacted = "Y" and Direct_Marketing = "Y";
and
where channel = 'PHONE' or channel = 'BRN' or channel = 'BRANCH' or channel = 'FD' or channel = 'OKS'
or channel = 'FFI' or channel = 'FIRSTDIRECT' or channel = 'OCC' or channel = 'OTHERCALLCENTRE'
or (channel = 'FPAS' and contacted = "Y" and Direct_Marketing = "Y");
Then of course it all depends on "todays data" if it shows up "today", tomorrow, or some other day.To verify that the statements have EXACTLY the same meaning by performing tests, requires a test data set which "spans all the different values". NOT AT ALL easy to build!
Yes they are the same, but I really have to wonder if the OP realized that? I.e., I think that the OP has to clarify whether he/she was really trying to accomplish:
where channel = 'PHONE' or channel = 'BRN' or channel = 'BRANCH' or channel = 'FD' or channel = 'OKS'
or channel = 'FFI' or channel = 'FIRSTDIRECT' or channel = 'OCC' or channel = 'OTHERCALLCENTRE'
or channel = 'FPAS' and contacted = "Y" and Direct_Marketing = "Y";
or
where (channel = 'PHONE' or channel = 'BRN' or channel = 'BRANCH' or channel = 'FD' or channel = 'OKS'
or channel = 'FFI' or channel = 'FIRSTDIRECT' or channel = 'OCC' or channel = 'OTHERCALLCENTRE'
or channel = 'FPAS') and contacted = "Y" and Direct_Marketing = "Y";
Hi! I agree completely!
Watch out because AND is evaluated before OR in the order of operations precedence for SAS. SAS Operators : Order of Evaluation of SAS Operators
Note that SAS will show you how it has interpreted your WHERE condition in the notes.
52 data one ;
53 set test;
54 where c=1 or c=2 or c=3 and a and b;
55 run;
NOTE: There were 9 observations read from the data set WORK.TEST.
WHERE c in (1, 2) or ((c=3) and a and b);
56 data two;
57 set test;
58 where a and b or c=1 or c=2 or c=3 ;
59 run;
NOTE: There were 14 observations read from the data set WORK.TEST.
WHERE (a and b) or c in (1, 2, 3);
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.