JJP1
Pyrite | Level 9

Hi team,

Currently I have a job that starts at 02:49:17 and ends at 07:04:57. Because it runs that long, users try to access the SAS datasets while the job is still updating them, which causes failures and a risk that the datasets will not be updated with correct data.

I cannot change the schedule because it depends on other datasets as well.

So I thought of trying to improve the performance of the code so that it runs faster; this is just my idea.

My knowledge here is quite limited, so please share your thoughts, as I believe you will have better and more efficient options.

I have many PROC SORTs in the code. Can I do something about them, for example the TAGSORT option, or do you have any other advice?

ballardw
Super User

Sorting is one of the most common and time intensive activities.

Are you using indexed data sets? Using indexes instead of sorts can speed some activities. Are you rebuilding the same data every day? Appending to existing data?
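
For example, a composite index can sometimes stand in for a sort when BY-group processing is all that is needed; a minimal sketch, with placeholder data set and variable names:

proc datasets lib=work nolist;
  modify ay;                            /* placeholder data set */
  index create keyidx = (num no no1);   /* composite index on the BY variables */
run;
quit;

A later DATA step with BY num no no1; can then use the index instead of a prior PROC SORT.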

Does the data reside in an external DBMS?

 

What exactly are you doing that requires multiple sorts? Can you provide specific examples of the code using the different sorts?

Sometimes a careful sort order selection can allow reuse of the same data without resorting. And if you are doing multiple summaries of data then perhaps use of class variables in a procedure like Proc Summary could do multiple summaries and select the needed one by using a selection of the created _type_ variable.
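
A minimal sketch of that PROC SUMMARY pattern, with made-up class and analysis variable names:

proc summary data=have;
  class region product;        /* all grouping variables in one pass */
  var amount;
  output out=all_sums sum=;    /* _TYPE_ marks which CLASS combination each row summarizes */
run;

data by_region;
  set all_sums;
  where _type_ = 2;            /* with CLASS region product, _TYPE_=2 holds the region-only summary */
run;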

 

Some details may help.

Reeza
Super User

Given what you've posted, my suggestion would be to search "efficient coding strategies" or "make code faster" on lexjansen.com and read through the papers there.

 

50 Ways to Make Your SAS® Code Execute More Efficiently

https://www.google.com/url?client=internal-element-cse&cx=011240857950991443104:hsqcj3nokh0&q=http:/...

 

Twenty Ways to Run Your SAS® Program Faster and Use Less Space

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3684-2019.pdf

JJP1
Pyrite | Level 9

Thanks, I will check and try.

 

May I know if there is any other approach you can see and suggest, as I have only a basic level of knowledge on this issue, please?

Reeza
Super User

As BallardW indicates you need to provide more details. Vague questions get vague answers.

JJP1
Pyrite | Level 9
OK, but the code is very large. Also, the code already seems to be written efficiently; I just checked and compared it against the document "Twenty Ways to Run Your SAS® Program Faster and Use Less Space".
DavePrinsloo
Pyrite | Level 9
Look into hashing. The most common reason sorts are done in SAS is to merge in following steps. If you use hashing, then you don't need to sort either table. Hashing is really useful when you work with huge tables and need to look things up in other, smaller tables. The proviso is that the smaller table needs to fit comfortably in memory. In an optimization consulting project where a table of 100 million records had to be summarised after being enriched from multiple other sources, I managed to avoid all sorts, and the large table was read only once. The final output was a table with only 20,000 records and maybe 20 columns. The summary was done into a hash table and all enrichment was done in the same step. It was possible to reduce a run time of 1 hour to 1 minute.
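
A minimal sketch of that lookup pattern, with invented table and variable names:

data enriched;
  if _n_ = 1 then do;
    if 0 then set work.small_lookup;              /* defines ID and DESCR in the PDV without reading data */
    declare hash h(dataset:'work.small_lookup');  /* load the small table into memory once */
    h.definekey('id');
    h.definedata('descr');
    h.definedone();
  end;
  set work.big_table;                             /* the large table is read once, unsorted */
  if h.find() ne 0 then call missing(descr);      /* look up by ID; clear DESCR when there is no match */
run;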
But this may not fit your problem: indeed, maybe you could simply adjust the memsize and sortsize options to get a huge boost in performance, without changing any code.
Kurt_Bremser
Super User

To make suggestions on possible improvements for your code, we need to see it. Since the log contains performance-relevant information, we need to see it too.

If you can identify single steps that take up most of the time, these need to be worked on first.

 

Undertaking a large job for production use with large data while having only basic knowledge is, of course, the wrong way to go. You build your skills with small tasks first, then apply the experience you gained to the bigger issues later.

If time is of the essence, I suggest you hire a consultant and then learn from the methods used by said consultant. I had some of my important early "aha" moments like this.

JJP1
Pyrite | Level 9

Hi @Kurt_Bremser ,

The code below takes the most time (24:39.42) compared with the rest of the program; the other sections take only seconds, at most 7 minutes. Would you please suggest any options to improve the code below?

 

data 2a (drop=ty amt);
  set ay;
  by num no no1;

  if ty      = 'AA'  then ccb  = amt;
  else if ty = 'BB'  then cles = amt;
  else if ty = 'CCC' then nut  = amt;
  else if ty = 'DDD' then amp  = amt;
  else if ty = 'EE'  then ss   = amt;
  else if ty = 'FFF' then rat  = amt;
  else if ty = 'GGG' then xcv  = amt;
  else if ty = 'HHH' then wee  = amt;
  else if ty = 'II'  then ler  = amt;
  else if ty = 'JJJ' then mit  = amt;
  else if ty = 'KK' and hhhoi in ('SETL' 'SET1' 'SET2' 'SET3') then do;
    if gjhrt = '16' then ghfcv = 'SD';
    else ghfcv = 'ED';
  end;

  retain ccb amp cles nut ler mit ss rat xcv wee 0 ghfcv " ";

  if last.no1 then do;
    output;
    ccb   = 0;
    cles  = 0;
    nut   = 0;
    amp   = 0;
    ss    = 0;
    rat   = 0;
    xcv   = 0;
    wee   = 0;
    ler   = 0;
    mit   = 0;
    ghfcv = " ";
  end;
run;

proc datasets lib=work memtype=data;
  delete ay;
run;
quit;
Tom
Super User

Looks sort of like a transpose.  Perhaps you don't need to do the transpose at all?

How do you use the dataset, 2A? Is it used just once or multiple times?
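
For reference, the reshaping part of the posted step corresponds roughly to this PROC TRANSPOSE sketch (the special 'KK'/ghfcv logic would still need a DATA step, so this is only an illustration of the shape of the problem):

proc transpose data=ay out=wide(drop=_name_);
  by num no no1;
  id ty;       /* each TY value becomes a column */
  var amt;
run;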

DavePrinsloo
Pyrite | Level 9
How many records are in the input table, and how many are in the output table? Showing the log of this excerpt, plus that of the previous proc sort of AY would help a lot.
ballardw
Super User

Which value of TY is the most common?

If the more common values of TY come first in that block of IF/THEN/ELSE, fewer conditions have to be evaluated per record.

For example, if 'JJJ' were the most common value, then as written you execute 9 extra comparisons for each such record compared to a block that starts with

If ty = 'JJJ' then ...
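
A quick way to check the frequencies, assuming AY still exists at that point in the program:

proc freq data=ay order=freq;
  tables ty / nocum;   /* lists TY values from most to least common */
run;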

RichardDeVen
Barite | Level 11

I am guessing you tweaked the code somewhat, because WORK.2A is not a valid data set name as coded, even with OPTION VALIDMEMNAME=EXTEND (which would still require a name literal such as '2a'n).

Things to check include excessively long character variables, such as $4096 when $20 is sufficient.

The output data set uses the (DROP=) option, which can let unneeded 'tag-along' variables into your per-group output records; an explicit (KEEP=) could be helpful.

As for

by num no no1;

How many by groups do you have ?

Is data set AY constructed elsewhere with a JOIN ?  If so, the real world data might be subverting the expectations of the join criteria.

 

 

Do you have any log snippets showing, for each step:

- number of observations read and written

- system memory and cpu

 

Get more info by setting 

OPTIONS FULLSTIMER;

Log snippet should look like

NOTE: There were #### observations read from the data set ********.********
NOTE: The data set ********.******** has #### observations and #### variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      user cpu time       0.00 seconds
      system cpu time     0.01 seconds
      memory              708.40k
      OS Memory           24584.00k
      Timestamp           06/01/2020 06:56:30 PM
      Step Count                        401  Switch Count  0

Can you show the output from 

Proc CONTENTS data=<name of SET data set>;
run;
Proc CONTENTS data=<name of OUTPUT data set>;
run;
mkeintz
PROC Star

After you create dataset 2A (an erroneous data set name), you delete data set AY.

 

So I presume AY is created earlier in the program, i.e. it's not a permanent data set. If so, then you might save some resources by declaring AY as a data set VIEW rather than a data set FILE (i.e. DATA AY / VIEW=AY; instead of DATA AY;). This can speed things up because AY is never written to disk. Instead, its contents are dynamically generated and passed to the DATA 2A step only when the SET AY statement is encountered. If AY is not used multiple times in your program, this will likely save a lot of the resources and time ordinarily needed to write data to disk only to re-read and discard it.
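
A minimal sketch of the idea; the body of the upstream step is hypothetical:

data ay / view=ay;        /* AY becomes a view; nothing is written to disk here */
  set somelib.source;     /* hypothetical source of AY */
  /* ...whatever logic currently builds AY... */
run;

The later DATA 2A step stays unchanged; rows of AY are generated on demand when its SET AY statement runs.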

Patrick
Opal | Level 21

My suspicion is that disk or network I/O is the bottleneck. If this data step runs for 24+ minutes AND source ay is a table and not a view, then it must be high volume: really a lot of rows and variables, and/or some character variables with a big length.

- Turn on option FULLSTIMER and share the log with us.

- Run a Proc Contents on ay and share the result with us.

- Drop all variables you can from input data set ay - in the set statement something like: set ay(keep=....); 

- Set option COMPRESS=YES at the start of your program and test if this improves runtimes (see the short sketch after this list)
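
For example, the option settings and the KEEP= could look like this (the variable list is only a guess based on the posted step):

options fullstimer compress=yes;              /* timing detail in the log; compressed data sets can reduce I/O */

set ay(keep=num no no1 ty amt hhhoi gjhrt);   /* read only the variables the step actually uses */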

 

Below are some tweaks to the code you've posted. Not sure whether this will improve execution time.

data _2a (drop=ty amt _ind);
  set ay;
  by num no no1;

  array a_vars {*} ccb cles nut amp ss rat xcv wee ler mit;
  _ind=WHICHC(ty, 'AA','BB','CCC','DDD','EE','FFF','GGG','HHH','II','JJJ');
  if _ind>0 then a_vars(_ind)=amt;
  else if ty = 'KK' and hhhoi in ('SETL' 'SET1' 'SET2' 'SET3') then
    do;
      if gjhrt = '16' then ghfcv = 'SD';
      else ghfcv = 'ED';
    end;

  retain ccb amp cles nut ler mit ss rat xcv wee 0 ghfcv " ";
  if last.no1 then
    do;
      output;
      ccb  = 0;
      cles = 0;
      nut  = 0;
      amp   = 0;
      ss = 0;
      rat = 0;
      xcv  = 0;
      wee    = 0;
      ler  = 0;
      mit  = 0;
      ghfcv    = " ";
    end;
run;

proc datasets lib=work memtype=data nolist;
  delete ay;
run;

 

