Solved: Re: Find location of Moving max per group

hhchenfx · Posted 09-26-2022 07:52 PM

Hi Everyone,

For each date, I lookback 3 days and find the max value using this proc expand below.

Now I want to find the day that rolling max occurs.

So I use the loop to get it done. However, it is quite slow. I wonder if there is better option.

Thank you,

HHC


data have;
input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;run;

data have; set have;
drop lid;
lid=lag(id);
if lid^=id then change_id=1;
run;

proc sort data=have; by id descending date;run;

data want; 
set have;
drop id_1 date1 value1 count i change_id;
count=0;
i+1;
max_value=value;
max_date=date;
do j =i to nobs until (change_id=1);
	set have(rename=(id=id_1 date=date1 value=value1)) point=j nobs=nobs;
	if count=3 then leave;
	else
	if value1>max_value then do;
		max_value=value1;
		max_date=date1;
		end;
	count=count+1;
end;
run;

Tom · Posted 09-27-2022 09:42 AM

A "circular" array is a simple way to implement a rolling window. Use the MOD() function to determine which variable in the array to place the current values. That way the values roll out of the window automatically when they are replaced by the new value.

data have;
  input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;


data want ;
  set have ;
  by id;
  array values [3] _temporary_ ;
  array dates  [3] _temporary_ ;
  if first.id then call missing(of values[*] dates[*]);
  values[1+mod(_n_,3)]=value;
  dates[1+mod(_n_,3)]=date;
  max_value=max(of values[*]);
  max_date=dates[whichn(max_value,of values[*])];
run;

proc print;
run;

                               max_
Obs    date    id    value    value    max_date

  1      1      1        1       1         1
  2      2      1        9       9         2
  3      3      1        0       9         2
  4      4      1        7       9         2
  5      5      1        5       7         4
  6      1      2      100     100         1
  7      2      2        0     100         1
  8      3      2       -2     100         1
  9      4      2        2       2         4
 10      5      2        0       2         4
 11      6      2        1       2         4

View solution in original post

ballardw · Posted 09-27-2022 12:42 AM

You state "I look back 3 days and find the max value using this proc expand below." I don't see any call to Proc Expand.

How many values/dates do you have for each of your Id values? I have an idea but it may have issues with large numbers of records per id.

Ksharp · Posted 09-27-2022 07:45 AM

/*Assuming the dataset has been sorted by id and date*/
data have;
input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;

data want;
if _n_=1 then do;
 declare hash h();
 h.definekey('d');
 h.definedata('d','v');
 h.definedone();
end;
 set have;
 by id;
if first.id then h.clear();
d=date;v=value;h.add();
do d=date-2 to date;
 if h.find()=0 then do;
  if v>max_value then do;max_value=v;max_date=d;end;
 end;
end;
drop d v;
run;

Tom · Posted 09-27-2022 09:42 AM

A "circular" array is a simple way to implement a rolling window. Use the MOD() function to determine which variable in the array to place the current values. That way the values roll out of the window automatically when they are replaced by the new value.

data have;
  input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;


data want ;
  set have ;
  by id;
  array values [3] _temporary_ ;
  array dates  [3] _temporary_ ;
  if first.id then call missing(of values[*] dates[*]);
  values[1+mod(_n_,3)]=value;
  dates[1+mod(_n_,3)]=date;
  max_value=max(of values[*]);
  max_date=dates[whichn(max_value,of values[*])];
run;

proc print;
run;

                               max_
Obs    date    id    value    value    max_date

  1      1      1        1       1         1
  2      2      1        9       9         2
  3      3      1        0       9         2
  4      4      1        7       9         2
  5      5      1        5       7         4
  6      1      2      100     100         1
  7      2      2        0     100         1
  8      3      2       -2     100         1
  9      4      2        2       2         4
 10      5      2        0       2         4
 11      6      2        1       2         4

hhchenfx · Posted 09-27-2022 05:08 PM

Thanks you all for helping.

Since we have 3 working codes, I am curious which one is faster. I run 3 code on 2 PC (1 fast and 1 slower) on a sample of 37million rows.

I run on lookback 3 days and on lookback 15 days.

ARRAY method is super fast and surprisingly, the it doesn't take much time for lookback 15 than for lookback 3.

The other 2 methods are comparable in time.

Any idea why ARRAY is so good?

Thanks,

HHC

Here is the time needed:

Look back 3	Fast PC	Slow PC
Do Loop	real time 17.67 sec	real time 26.88 sec
	cpu time 17.48 sec	cpu time 26.85 sec

HASH	real time 18.34 sec	real time 26.33 sec
	cpu time 17.25 sec	cpu time 26.20 sec

ARRAY	real time 10.59 sec	real time 10.06 sec
	cpu time 6.34 sec	cpu time 9.84 sec

Do Loop	real time 55.51 sec	real time 1:30.10
	cpu time 55.50 sec	cpu time 1:30.11
Look back 15
HASH	real time 47.34 sec	real time 1:09.37
	cpu time 47.01 sec	cpu time 1:09.39

ARRAY	real time 9.07 sec	real time 10.99 sec
	cpu time 7.17 sec	cpu time 10.68 sec


data want; 
set have;
drop id_1 date1 value1 count i change_id;
count=0;
i+1;
max_value=value;
max_date=date;
do j =i to nobs until (change_id=1);
	set have(rename=(id=id_1 date=date1 value=value1)) point=j nobs=nobs;
	if count=15 then leave;
	else
	if value1>max_value then do;
		max_value=value1;
		max_date=date1;
		end;
	count=count+1;
end;
run;



data want;
if _n_=1 then do;
 declare hash h();
 h.definekey('d');
 h.definedata('d','v');
 h.definedone();
end;
 set have;
 by id;
if first.id then h.clear();
d=date;v=value;h.add();
do d=date-14 to date;
 if h.find()=0 then do;
  if v>max_value then do;max_value=v;max_date=d;end;
 end;
end;
drop d v;
run;



data want ;
  set have ;
  by id;
  array values [15] _temporary_ ;
  array dates  [15] _temporary_ ;
  if first.id then call missing(of values[*] dates[*]);
  values[1+mod(_n_,15)]=value;
  dates[1+mod(_n_,15)]=date;
  max_value=max(of values[*]);
  max_date=dates[whichn(max_value,of values[*])];
run;

ballardw · Posted 09-27-2022 06:42 PM

@hhchenfx wrote:

Thanks you all for helping.

Since we have 3 working codes, I am curious which one is faster. I run 3 code on 2 PC (1 fast and 1 slower) on a sample of 37million rows.

I run on lookback 3 days and on lookback 15 days.

ARRAY method is super fast and surprisingly, the it doesn't take much time for lookback 15 than for lookback 3.

The other 2 methods are comparable in time.

Any idea why ARRAY is so good?

Thanks,

HHC

Arrays use internal fixed storage addresses, usually held in memory, and the technology behind them has been around for a very long time. I can't find a quick reference for the first array use but languages with arrays such as FORTRAN have been around nearly 70 years, FORTRAN developed in 1954 for example. So there has been a lot of opportunity to optimize code. The bit about addresses means that values can be retrieved/written very quickly. The iterated loop, such as Do i= 1 to 10 doesn't require logical comparisons to decide on whether to keep going or not unlike Do While/Until and hence run a bit faster.

I'm sure there are other bits as well.

Ksharp · Posted 09-28-2022 07:44 AM

Tom ,

What if there were a gap between two date?

Your code can't get job down .

Like this :

1 1 1

2 1 9

<-------------

5 1 5

Tom · Posted 09-28-2022 10:18 AM

@Ksharp wrote:

Tom ,

What if there were a gap between two date?

Your code can't get job down .

Like this :

1 1 1

2 1 9

<-------------

5 1 5

Yes. The code assumes there are no gaps in dates. The original poster said they had used PROC EXPAND to fill any date gaps.

hhchenfx · Posted 09-28-2022 11:19 AM

Yeah, my date will be continuous.

HHC

acordes · Posted 09-30-2022 06:47 AM

Thanks for your solution.

For me it's easier to understand if I do a proc transpose before.

Here comes my question. Inside the loop I want to to dynamically use the max(of arr[*]) construct.

But using for example max(of temp(1 --3)) doesn't give the desired result.

I tried as well to construct a command inside the loop "max(temp(1), temp(2))" but I cannot plug this character expression into to max_value line. I thought the resolve and %sysevalf would accomplish this, but I was wrong, at least with my attempts.

proc transpose data=have out=wide(drop=_name_) prefix=_;
id date;
by id;
var value;
run;


data wide1;
set wide;
array temp _:;
do i=1 to dim(temp) - cmiss(of temp(*));
if i=1 then max_value=temp(1);
if i=2 then max_value=max(temp(1), temp(2));
if i ge 3 then max_value=max(temp(i), temp(i-1), temp(i-2));
value=temp(i);
date=i;
max_date=vname(temp(whichn(max_value,of temp[*])));
output;
end;
drop i _:;
run;

Tom · Posted 09-30-2022 08:44 AM

You cannot do that.

The code for a data step is fixed before the data step starts running. You cannot use macro code to change the SAS code after the step has started executing (even if you could figure out a way to pass the value of the dataset variable to the macro processor).

Which is one reason processing the vertical table is so much easier.

Just use a DO loops and keep track of the maximum value seen in the inner loop. Since you stuffed the dates into the variable names you will have to add logic to extract the name back from the variable name.

data wide1;
  set wide;
  array temp _:;
  do start = 1 to dim(temp);
    max_value=.;
    max_date=.;
    do back=max(1,start-2) to start;
      if temp[back] > max_value then do;
        max_value=temp[back];
        max_date=input(substr(vname(temp[back]),2),32.);
      end;
    end;
    output;
  end;
  keep id start max_value max_date ;
run;

You could also still use the circular array trick, by then did you bother with transposing the data?

acordes · Posted 09-30-2022 09:07 AM

@Tom , by the way.

It's you who received an award at the SAS Explore event, right?

Tom · Posted 09-30-2022 09:14 AM

@acordes wrote:

@Tom , by the way.

It's you who received an award at the SAS Explore event, right?

Yes.

https://communities.sas.com/t5/SAS-Explore/2022-SAS-Customer-Recognition-Award-Winners/m-p/835967/em...

acordes · Posted 09-30-2022 09:21 AM

Congrats and many thanks for the many things I've learned from you and other super users.

But I would like to come back to my thought.

Isn't it possible to create a string with logic as for example "max(temp(1), temp(2))" (for i=2) and use it for the max_value as argument?

So I would need to resolve the cat string and use it as input to the function: max_value=resolve(cat_string);

Tom · Posted 09-30-2022 09:30 AM

Not sure what you are asking.

You could use macro code to generate "wall paper" code like:

array values _: ;
array max_values [100] ;
max_values[1] =values[1];
max_values[2] =max(values[1],values[2]);
max_values[3]=max(values[1].values[2],values[3]);
max_values[4]=max(values[2],values[3].values[4]);
....

But why?

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away