Optimizing Set with By statements.

Accepted Solution Solved
Reply
Contributor
Posts: 62
Accepted Solution

Optimizing Set with By statements.

1Q)   Will Hash objects increase the performance for Data step with set by statements (more than 3 billion records)?

      If so could any one please provide me the code for the below code in Data step Using Hash? (The major concern is in Descending order for few variables).

proc sort data=&work..1234 ;

  by a b c d e f g

     descending h descending i ;

quit ;

2Q) If Hash is not the answer then Is there any alternative to optimize the result?


Accepted Solutions
Solution
‎12-18-2013 01:56 PM
Super User
Super User
Posts: 7,039

Re: Optimizing Set with By statements.

Perhaps you just want to lag some variables when W is missing?  You could do this using POINT= option on a second SET statement.

data &sastemp..yyy

   set &work..yyy ;

    by a c d f;

    previous=_n_ - 1;

   if w=. and not first.f then set &work..yyy(keep=w o p q v r s t u) point=previous;

   b = min(w,w_b);

run ;

View solution in original post


All Replies
Super User
Super User
Posts: 7,039

Re: Optimizing Set with By statements.

It is very difficult to give advice without an actual example of what you are trying to do with the data.

In general if you are processing the whole dataset then hash objects will not improve performance.  Best performance improvement will be to process using PROC's that are optimized to take advantage of multiple cores.

Contributor
Posts: 62

Re: Optimizing Set with By statements.

Hi Tom thank you for the quick reply, Please find the below data step which is getting executed after the sort step. ( The challenge is to optimize the code).

data &sastemp..yyy

     (keep= a b c  d e   f  g

            h    i j k   l   m n

            o p q r s t

            u v w  x    y

            z ......... ) ;

   set &work..yyy ;

   by a c d f w_b g w;

   length aa 8. ;

   array bb {8} 8. o p q v

                        r s t u ;

   array cc {8} 8. _temporary_ ;

   retain cc aa 0 ;

   if first.f then  do ;

      aa=w ;

      do i=1 to dim(cc) ;

         cc{i}=bb{i} ;

      end ;

   end ;

   if w=. then  do ;

      w=aa ;

      do i=1 to dim(bb) ;

         bb{i}=cc{i} ;

      end ;

   end ;

   if w_b < w        then b = w ;

   else if w_b ge w  then b = w_b ;

   output ;

   aa = w ;

   do i=1 to dim(cc) ;

       cc{i}=bb{i} ;

   end ;

run ;

Super User
Super User
Posts: 7,039

Re: Optimizing Set with By statements.

What is the purpose of the code?  Can you describe in words what it is trying to do?

Looks like when W is missing it wants to replace the values of the other listed variables with the values from the first observations for that ID group.

Contributor
Posts: 62

Re: Optimizing Set with By statements.

Thanks Tom, As I have sample data which is not millions of records the performance has decreased but need to check with the actual data. Thanks Again.

Contributor
Posts: 62

Re: Optimizing Set with By statements.

Dear Tom,

Before the  previous code the below sort is running, is there any alternative to reduce the CPU time of this proc sort. As modifying the table with the index is taking huge time (not much difference), Or there is no alternative.

proc sort data=&work..yyy ;

  by a c d f w_b g w

     descending l descending m ;

quit ;

Super User
Super User
Posts: 7,039

Re: Optimizing Set with By statements.

Try this.  First create a composite INDEX for your four key variables.

Here is code to create the index, but you could also create it while you are sorting the dataset by using dataset options.

proc datasets nolist lib=&sastemp;

  modify yyy;

  index create composite=(a c d f);

run; quit;

Then use it to find the first record when there is a missing value for W.  In the KEEP= dataset option on the second SET statement you should list all of the variables used in the index and any other variables that you want to replace.

data &sastemp..yyy

   set &work..yyy ;

   if w=. then set &work..yyy(keep=a c d f w o p q v r s t u) index=composite;

   b = min(w,w_b);

run ;

I am not sure what will happen when there is more than one record with missing values for the current id group.  I worry that it might retrieve the values from the second observation instead of the first.   Try it and see what happens.

Message was edited by: Tom Abernathy

Add the /UNIQUE option after the index name to the second SET statement to always use the first observation for the id group.

Solution
‎12-18-2013 01:56 PM
Super User
Super User
Posts: 7,039

Re: Optimizing Set with By statements.

Perhaps you just want to lag some variables when W is missing?  You could do this using POINT= option on a second SET statement.

data &sastemp..yyy

   set &work..yyy ;

    by a c d f;

    previous=_n_ - 1;

   if w=. and not first.f then set &work..yyy(keep=w o p q v r s t u) point=previous;

   b = min(w,w_b);

run ;

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 279 views
  • 0 likes
  • 2 in conversation