Do Loop within group optimization question

Accepted Solution Solved
Reply
Super Contributor
Posts: 412
Accepted Solution

Do Loop within group optimization question

Hi Everyone,

My problem is that for each record, I want to find the future "time" when future price> current record level.

Since the data has many different object, I want to apply the above rule for each object time series.

     the condition is: if price2 > level and obj=object then found=1;

My code below can handle it. But what I worry is that SAS will go from current record to THE END of FILE if there is no future record that meet the condition Price>current level.

It will be very much inefficient since it only need to check record of the same object.

I wonder if the sort step will help SAS to process more efficiently, meaning when object change, SAS will stop checking but assign grater=0 (meaning condition is not met).

If this sort step doesnt help, is there any way to make it process better?

Thank you,

HHC

data have;

  input object time price level;

  datalines;

1 1     5     9

1 2     2     3

1 3     8     5

1 4     10    25

1 5     12    63

2 1     15     29

2 2     12     3

2 3     18     15

2 4     10    15

2 5     18    6

3 1     15     29

3 2     22     32

3 3     8     52

3 4     11    50

3 100     1200    6

;run;

proc sort; by object time;run;

data want (keep=object time price level greater);

  set have nobs=totalobs;

  i+1;

  found=0;

  do j=i+1 to totalobs until (found=1) ;

    set have(keep=object time price rename=(object =obj time=greater price=price2)) point=j;

    if price2 > level and obj=object then found=1;

  end;

  if found=0 then greater=0;

run;


Accepted Solutions
Solution
‎12-24-2014 06:36 PM
Contributor
Posts: 52

Re: Do Loop within group optimization question

Two proposed solutions, the first one being the faster one:

For solution 1, I assume that for a given object, there will be no more than 10000 records. Of course one can rewrite solution 1 with hash objects replacing the arrays. I am doing this on the fly and doing quick and dirty.

/********************/
/**** solution 1 ****/
/********************/
data want(keep=object time price level greater);
  array prc(10000) _temporary_;
  array tim(10000) _temporary_;
  nn=0;

  call missing (of prc(*));
  call missing (of tim(*));

  do until (last.object);
    set have;
    by object;
    nn+1;
    prc(nn)=price;
    tim(nn)=time;
  end;

  ntot=nn;
  nn=0;

  do until (last.object);
    set have;
    by object;
    nn+1;
    greater=0;

    do i=nn+1 to ntot;
       if (prc(i)>level) then do; greater=tim(i); leave; end;
    end;
    output;
  end;
run;

/********************/
/**** solution 2 ****/
/********************/
proc sql;
create table want as
select a.*, coalesce(
       (select time
        from   have
        where  object=a.object and time gt a.time and price gt a.level
        having time = min(time)),0) as greater
from have a;
quit;


The issue, as stated by hhchenfx, is the issue of the performance of a solution given a very large dataset.

I recalled reading a SUGI paper showing that sorting using hash object is faster than the plain proc sort when the SAS datasets are quite large; but then the hash object method uses a lot of RAM. Proc SQL is often as fast as hash object methods and uses as much resources. For the hash object and SQL methods, the dataset needs not be sorted beforehand.  Also the point= statement is pretty slow. If your system does not have lots of RAM, SQL solutions can be slow.

I made a simulated dataset:

/***************************/
/**** simulated dataset ****/
/***************************/
data have(drop=t);
  do object=1 to 10000;
     time=1;
  do t = 1 to 300;
     time+(1+ceil(4*ranuni(3)));
     price = ceil(200*ranuni(3));
     level = price+ceil(30*ranuni(3));
     output;
  end;
  end;
run;

This table is 158 MB in size and has 3,000,000 rows. My SAS system does not have plentiful resources.

SAS Log for solution 1 (run 1): real time 3.42 seconds cpu time  3.37 seconds.

SAS Log for solution 1 (run 2): real time 3.52 seconds cpu time  3.50 seconds.

With the method of hhchenfx, I had to add code such as:
if (i>= 300) then stop;

SAS Log for method of hhchenfx :  30 obs -> real time 10.18 seconds   cpu time  10.19 seconds
SAS Log for method of hhchenfx : 300 obs -> real time 2:22.43 minutes cpu time  2:22.48 minutes

or other words just to find variable greater for the first 300 records for table have, it took seconds 140. The simulated dataset has 3,000,000 records.
It would have taken some 140*(3000000/300)*(1/2) = 710000 sec = 197 hours.


For solution 2, I had to modify code from
from have a;
to:
from have (obs=30) a;

SAS Log for solution 2 ( 30 records) :  real time   10.96 seconds cpu time    10.97 seconds
SAS Log for solution 2 (300 records) :  real time 1:48.74 minutes cpu time  1:48.78 minutes

View solution in original post


All Replies
Super User
Posts: 10,028

Re: Do Loop within group optimization question

How big table you have ?

 
 
data have;
  input object time price level;
  datalines;
1 1     5     9
1 2     2     3
1 3     8     5
1 4     10    25
1 5     12    63
2 1     15     29
2 2     12     3
2 3     18     15
2 4     10    15
2 5     18    6
3 1     15     29
3 2     22     32
3 3     8     52
3 4     11    50
3 100     1200    6
;run;
 
 proc sql;
  create table want as
   select *,case when (select count(*) from have where object=a.object and time gt a.time and price gt a.level) gt 0 then 1
            else 0 end as found
      from have as a;
quit;

Xia Keshan

Super Contributor
Posts: 412

Re: Do Loop within group optimization question

Thanks for helping.

The file is more than 20 GB (of course I can break it into pieces).

Also, I really wonder how SAS process the DO LOOP code.

HHC

Regular Contributor
Posts: 180

Re: Do Loop within group optimization question

I think the poblem is not the DO LOOP but the SET with point=

CTorres

Super Contributor
Posts: 412

Re: Do Loop within group optimization question

I think I will add a new variable for the last record of each object to control for the LOOP.

data have; set have;
by object;
if last.object then last=1;run;


data want (keep=object time price level greater);
  set have nobs=totalobs;

  i+1;
  found=0;

  do j=i+1 to totalobs until (found=1 or last=1) ;
    set have(keep=object time price rename=(object =obj time=greater price=price2)) point=j;
    if price2 > level and obj=object then found=1;
  end;

  if found=0 then greater=0;
run;

Respected Advisor
Posts: 4,173

Re: Do Loop within group optimization question

How big is the data set with columns "object, time & price" only? Reason I'm asking: If this would fit into memory then we could use quite a simple and well performing approach using a SAS hash table.

Respected Advisor
Posts: 4,173

Re: Do Loop within group optimization question

I believe using a hash approach could be very efficient. Easiest would be if you can fit all the data required for look-up into a single hash at once. If this is not possible then you would need to process your data object group by object group. Still possible using a hash but less efficient. Below the code for both approaches.

data have;

  input object time price level;

  datalines;

1 1 5 9

1 2 2 3

1 3 8 5

1 4 10 25

1 5 12 63

2 1 15 29

2 2 12 3

2 3 18 15

2 4 10 15

2 5 18 6

3 1 15 29

3 2 22 32

3 3 8 52

3 4 11 50

3 100 1200 6

;

run;

/* approach 1: All look-up data fits into a single hash */

data want;

  set have;

  if _n_=1 then

    do;

      if 0 then

        set have(keep=object time price rename=(object=_object time=future_time price=future_price));

      dcl hash h1(dataset:'have(keep=object time price rename=(object=_object time=future_time price=future_price))',ordered:'yes');

      _rc=h1.defineKey('_object','future_time');

      _rc=h1.defineData(all:'yes');

      _rc=h1.defineDone();

      dcl hiter iter('h1');

    end;

  found=0;

  _rc= iter.setcur(keySmiley Surprisedbject, key:time);

  _rc = iter.next();

  do while (_rc = 0);

    if _object ne object then leave;

    if future_price > level then

      do;

        found=1;

        leave;

      end;

    _rc = iter.next();

  end;

  if found =0 then

    do;

      call missing (future_time, future_price);

    end;

run;

/* approach 2: Processing per object group using hash look-up */

proc sort data=have out=inter;

  by object;

run;

data want;

  set inter;

  by object;

  if _n_=1 then

    do;

      if 0 then

        set inter(keep=object time price rename=(time=future_time price=future_price));

      dcl hash h1(ordered:'yes');

      _rc=h1.defineKey('object','future_time');

      _rc=h1.defineData('future_time','future_price');

      _rc=h1.defineDone();

      dcl hiter iter('h1');

    end;

  if first.object then

    do;

      _rc=h1.clear();

      do until(last.object);

        set  inter(keep=object time price rename=(time=future_time price=future_price));

        by object;

        _rc=h1.add();

      end;

    end;

  found=0;

  _rc= iter.setcur(keySmiley Surprisedbject, key:time);

  _rc = iter.next();

  do while (_rc = 0);

    if future_price > level then

      do;

        found=1;

        leave;

      end;

    _rc = iter.next();

  end;

  if found =0 then

    do;

      call missing (future_time, future_price);

    end;

run;

Solution
‎12-24-2014 06:36 PM
Contributor
Posts: 52

Re: Do Loop within group optimization question

Two proposed solutions, the first one being the faster one:

For solution 1, I assume that for a given object, there will be no more than 10000 records. Of course one can rewrite solution 1 with hash objects replacing the arrays. I am doing this on the fly and doing quick and dirty.

/********************/
/**** solution 1 ****/
/********************/
data want(keep=object time price level greater);
  array prc(10000) _temporary_;
  array tim(10000) _temporary_;
  nn=0;

  call missing (of prc(*));
  call missing (of tim(*));

  do until (last.object);
    set have;
    by object;
    nn+1;
    prc(nn)=price;
    tim(nn)=time;
  end;

  ntot=nn;
  nn=0;

  do until (last.object);
    set have;
    by object;
    nn+1;
    greater=0;

    do i=nn+1 to ntot;
       if (prc(i)>level) then do; greater=tim(i); leave; end;
    end;
    output;
  end;
run;

/********************/
/**** solution 2 ****/
/********************/
proc sql;
create table want as
select a.*, coalesce(
       (select time
        from   have
        where  object=a.object and time gt a.time and price gt a.level
        having time = min(time)),0) as greater
from have a;
quit;


The issue, as stated by hhchenfx, is the issue of the performance of a solution given a very large dataset.

I recalled reading a SUGI paper showing that sorting using hash object is faster than the plain proc sort when the SAS datasets are quite large; but then the hash object method uses a lot of RAM. Proc SQL is often as fast as hash object methods and uses as much resources. For the hash object and SQL methods, the dataset needs not be sorted beforehand.  Also the point= statement is pretty slow. If your system does not have lots of RAM, SQL solutions can be slow.

I made a simulated dataset:

/***************************/
/**** simulated dataset ****/
/***************************/
data have(drop=t);
  do object=1 to 10000;
     time=1;
  do t = 1 to 300;
     time+(1+ceil(4*ranuni(3)));
     price = ceil(200*ranuni(3));
     level = price+ceil(30*ranuni(3));
     output;
  end;
  end;
run;

This table is 158 MB in size and has 3,000,000 rows. My SAS system does not have plentiful resources.

SAS Log for solution 1 (run 1): real time 3.42 seconds cpu time  3.37 seconds.

SAS Log for solution 1 (run 2): real time 3.52 seconds cpu time  3.50 seconds.

With the method of hhchenfx, I had to add code such as:
if (i>= 300) then stop;

SAS Log for method of hhchenfx :  30 obs -> real time 10.18 seconds   cpu time  10.19 seconds
SAS Log for method of hhchenfx : 300 obs -> real time 2:22.43 minutes cpu time  2:22.48 minutes

or other words just to find variable greater for the first 300 records for table have, it took seconds 140. The simulated dataset has 3,000,000 records.
It would have taken some 140*(3000000/300)*(1/2) = 710000 sec = 197 hours.


For solution 2, I had to modify code from
from have a;
to:
from have (obs=30) a;

SAS Log for solution 2 ( 30 records) :  real time   10.96 seconds cpu time    10.97 seconds
SAS Log for solution 2 (300 records) :  real time 1:48.74 minutes cpu time  1:48.78 minutes

Super User
Posts: 10,028

Re: Do Loop within group optimization question

Of course, If table has already been sorted,  DOW + temporary array is the fastest solution . But if not I recommend to split this big table into lots of small table .

Xia Keshan

Super Contributor
Posts: 412

Re: Do Loop within group optimization question

Hi Everyone,

The discussion here really helps me a lot.

I am new to Hash process so I will not have any question here. Based on the simulated data, I find the array approach is so efficient.

I would like to make myself clear about it and below, I try to teach myself with explanation of each step.

I have 3 questions and it would be very much helpful if you could help me to clarify them.

Thank you so much and have a nice Holiday.

HHC

**********************************************************************************

*My 1st question is the how SAS process the code:

1. at the benning, SAS goes to section 2 and collect prc and time record from 1 to min(10,000 , last.object)

2. then SAS goes to Section 3, take the first record of have and run the checking with the array price();

*Once is done, SAS redo the 2 section with 2nd record of have

I wonder why SAS doesn't run the section 1 from the beginning to the end before moving onto section 3.

What line of code prevent SAS from doing so?

*My 2nd question is after finishing 1 record of have, why don’t we reset nn=0;

    end;

    output;

nn=0;

  end;

*My erd question is: what is the role of :;

  call missing (of prc(*));

  call missing (of tim(*));

***************************************************************************************************;

data want(keep=object time price level greater exit_prc);

*Section 1: declare array specification;

  array prc(10000) _temporary_;

  array tim(10000) _temporary_;

  kk=0;

  call missing (of prc(*));

  call missing (of tim(*));

*Section 2: gathering records in the array prc() and tim().

These array is Max 10,000 records;

  do until (last.object);

    set have;

    by object;

    kk+1;

    prc(kk)=price;    *start from price(0+1)?;

    tim(kk)=time;

  end;

*Section 3: the main code of checking first record with price above level;

  ntot=kk;    *the ntot is the value of record in the prc()/tim() array, Max 10,000;

  nn=0;    *Reset nn=0 for below do loop;

  do until (last.object);

    set have;    *open file have, start with the first record;

    by object;

    nn+1;        *start with nn=0;

    greater=0;

    do i=nn+1 to ntot; *first I value =1 to the max ntot record above, max 10,000;

       if (prc(i)>level) then do;

            greater=tim(i);

            exit_prc=prc(i);

            leave; *exit the checking loop to OUTPUT;

            end;

    end;

    output;

  end;

run;

Super User
Super User
Posts: 7,050

Re: Do Loop within group optimization question

*My 1st question is the how SAS process the code:

1. at the benning, SAS goes to section 2 and collect prc and time record from 1 to min(10,000 , last.object)

2. then SAS goes to Section 3, take the first record of have and run the checking with the array price();

*Once is done, SAS redo the 2 section with 2nd record of have

I wonder why SAS doesn't run the section 1 from the beginning to the end before moving onto section 3.

What line of code prevent SAS from doing so?

The first DO UNTIL(LAST.OBJECT) loop will cause SAS to read the KK records for current OBJECT group.  This happens all in one execution of the main data step loop.  The LAST.OBJECT automatic variable will be set to true when you have read the last observation in this group in the first SET statement.  Note that the second loop works the same way but it is reading the data set again because you have another SET statement. So it will read the same set of observations again starting with the first one for the current BY group. So the normal data step loop will actually run only once per BY group rather than the normal once per observation.  This happens because you have nested the SET statements inside of DO loops so that they could execute more than once per execution of the main data step loop and so process more than one observation in each execution of the main data step loop.

*My 2nd question is after finishing 1 record of have, why don’t we reset nn=0;

     end;

    output;

nn=0;

  end;

You would not want to reset NN here as NN is working as your record counter within this BY group.  You are using it to set the lower bound of the DO loop that does the look ahead to find the first occurrence where the target price is met.

*My erd question is: what is the role of :;

  call missing (of prc(*));

  call missing (of tim(*));

You actually do not need these lines in the code you posted.  Note that TEMPORARY arrays are automatically retained so CALL MISSING() is a good way to reset them all to missing.  But since section 2 is setting values of all of the values in this particular group there is no need to set the other values to missing (or anything else).  The second loop in section 3 will only reference the array elements that had values assigned in the first do loop in section 2.

Super Contributor
Posts: 412

Re: Do Loop within group optimization question

Thank you all for your help.

HHC

Super Contributor
Posts: 412

Re: Do Loop within group optimization question

Hi,

I try to do the moving Highest and Lowest of price using this array approach.

Somehow I got the error "Array subscript out of range ..."

Could you help me to correct it?

Thank you,

HHC

data have;

  input object time price level;

  datalines;

1 1     5     9

1 2     2     3

1 3     8     5

1 4     10    25

1 5     12    63

2 1     15     29

2 2     12     3

2 3     18     15

2 4     10    15

2 5     18    6

3 1     15     29

3 2     22     32

3 3     8     52

3 4     11    50

3 100     1200    6

;run;


data want2;
array price2(4) _temporary_;

kk=0;

do until (last.object);
set have;
by object;
kk+1;
price2(kk)=price;
end;

max_record=kk;
nn=0;

do until (last.object);
set have;
by object;
nn+1;
highest=0;
lowest=10000;

do i=nn+1 to max_record;
  if price2(i)>highest then highest=price2(i);
  if price2(i)<lowest then lowest=price2(i);
end;
end;
run;

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 763 views
  • 6 likes
  • 6 in conversation