BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
hhchenfx
Rhodochrosite | Level 12

Hi Everyone,

For each date, I lookback 3 days and find the max value using this proc expand below.

Now I want to find the day that rolling max occurs. 

So I use the loop to get it done. However, it is quite slow. I wonder if there is better option.

Thank you,

HHC

 


data have;
input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;run;

data have; set have;
drop lid;
lid=lag(id);
if lid^=id then change_id=1;
run;

proc sort data=have; by id descending date;run;

data want; 
set have;
drop id_1 date1 value1 count i change_id;
count=0;
i+1;
max_value=value;
max_date=date;
do j =i to nobs until (change_id=1);
	set have(rename=(id=id_1 date=date1 value=value1)) point=j nobs=nobs;
	if count=3 then leave;
	else
	if value1>max_value then do;
		max_value=value1;
		max_date=date1;
		end;
	count=count+1;
end;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

A "circular" array is a simple way to implement a rolling window.  Use the MOD() function to determine which variable in the array to place the current values.  That way the values roll out of the window automatically when they are replaced by the new value.

data have;
  input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;


data want ;
  set have ;
  by id;
  array values [3] _temporary_ ;
  array dates  [3] _temporary_ ;
  if first.id then call missing(of values[*] dates[*]);
  values[1+mod(_n_,3)]=value;
  dates[1+mod(_n_,3)]=date;
  max_value=max(of values[*]);
  max_date=dates[whichn(max_value,of values[*])];
run;

proc print;
run;
                               max_
Obs    date    id    value    value    max_date

  1      1      1        1       1         1
  2      2      1        9       9         2
  3      3      1        0       9         2
  4      4      1        7       9         2
  5      5      1        5       7         4
  6      1      2      100     100         1
  7      2      2        0     100         1
  8      3      2       -2     100         1
  9      4      2        2       2         4
 10      5      2        0       2         4
 11      6      2        1       2         4

View solution in original post

21 REPLIES 21
ballardw
Super User

You state "I look back 3 days and find the max value using this proc expand below." I don't see any call to Proc Expand.

 

How many values/dates do you have for each of your Id values? I have an idea but it may have issues with large numbers of records per id.

Ksharp
Super User
/*Assuming the dataset has been sorted by id and date*/
data have;
input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;

data want;
if _n_=1 then do;
 declare hash h();
 h.definekey('d');
 h.definedata('d','v');
 h.definedone();
end;
 set have;
 by id;
if first.id then h.clear();
d=date;v=value;h.add();
do d=date-2 to date;
 if h.find()=0 then do;
  if v>max_value then do;max_value=v;max_date=d;end;
 end;
end;
drop d v;
run;
Tom
Super User Tom
Super User

A "circular" array is a simple way to implement a rolling window.  Use the MOD() function to determine which variable in the array to place the current values.  That way the values roll out of the window automatically when they are replaced by the new value.

data have;
  input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;


data want ;
  set have ;
  by id;
  array values [3] _temporary_ ;
  array dates  [3] _temporary_ ;
  if first.id then call missing(of values[*] dates[*]);
  values[1+mod(_n_,3)]=value;
  dates[1+mod(_n_,3)]=date;
  max_value=max(of values[*]);
  max_date=dates[whichn(max_value,of values[*])];
run;

proc print;
run;
                               max_
Obs    date    id    value    value    max_date

  1      1      1        1       1         1
  2      2      1        9       9         2
  3      3      1        0       9         2
  4      4      1        7       9         2
  5      5      1        5       7         4
  6      1      2      100     100         1
  7      2      2        0     100         1
  8      3      2       -2     100         1
  9      4      2        2       2         4
 10      5      2        0       2         4
 11      6      2        1       2         4

hhchenfx
Rhodochrosite | Level 12

Thanks you all for helping. 

Since we have 3 working codes, I am curious which one is faster. I run 3 code on 2 PC (1 fast and 1 slower) on a sample of 37million rows.

I run on lookback 3 days and on lookback 15 days.

ARRAY method is super fast and surprisingly, the it doesn't take much time for lookback 15 than for lookback 3.

The other 2 methods are comparable in time.

Any idea why ARRAY is so good?

Thanks,

HHC

 

Here is the time needed: 

 

Look back 3 Fast PC Slow PC
Do Loop       real time           17.67 sec       real time           26.88 sec
        cpu time            17.48 sec       cpu time            26.85 sec
     
HASH       real time           18.34 sec       real time           26.33 sec
        cpu time            17.25 sec       cpu time            26.20 sec
     
ARRAY       real time           10.59 sec       real time           10.06 sec
        cpu time            6.34 sec       cpu time            9.84 sec
     
Do Loop       real time           55.51 sec       real time           1:30.10
        cpu time            55.50 sec       cpu time            1:30.11
Look back 15    
HASH       real time           47.34 sec       real time           1:09.37
        cpu time            47.01 sec       cpu time            1:09.39
     
ARRAY       real time           9.07 sec       real time           10.99 sec
        cpu time            7.17 sec       cpu time            10.68 sec

 

 

 


data want; 
set have;
drop id_1 date1 value1 count i change_id;
count=0;
i+1;
max_value=value;
max_date=date;
do j =i to nobs until (change_id=1);
	set have(rename=(id=id_1 date=date1 value=value1)) point=j nobs=nobs;
	if count=15 then leave;
	else
	if value1>max_value then do;
		max_value=value1;
		max_date=date1;
		end;
	count=count+1;
end;
run;



data want;
if _n_=1 then do;
 declare hash h();
 h.definekey('d');
 h.definedata('d','v');
 h.definedone();
end;
 set have;
 by id;
if first.id then h.clear();
d=date;v=value;h.add();
do d=date-14 to date;
 if h.find()=0 then do;
  if v>max_value then do;max_value=v;max_date=d;end;
 end;
end;
drop d v;
run;



data want ;
  set have ;
  by id;
  array values [15] _temporary_ ;
  array dates  [15] _temporary_ ;
  if first.id then call missing(of values[*] dates[*]);
  values[1+mod(_n_,15)]=value;
  dates[1+mod(_n_,15)]=date;
  max_value=max(of values[*]);
  max_date=dates[whichn(max_value,of values[*])];
run;

 

ballardw
Super User

@hhchenfx wrote:

Thanks you all for helping. 

Since we have 3 working codes, I am curious which one is faster. I run 3 code on 2 PC (1 fast and 1 slower) on a sample of 37million rows.

I run on lookback 3 days and on lookback 15 days.

ARRAY method is super fast and surprisingly, the it doesn't take much time for lookback 15 than for lookback 3.

The other 2 methods are comparable in time.

Any idea why ARRAY is so good?

Thanks,

HHC

 

 


Arrays use internal fixed storage addresses, usually held in memory, and the technology behind them has been around for a very long time. I can't find a quick reference for the first array use but languages with arrays such as FORTRAN have been around nearly 70 years, FORTRAN developed in 1954 for example. So there has been a lot of opportunity to optimize code.  The bit about addresses means that values can be retrieved/written very quickly. The iterated loop, such as Do i= 1 to 10 doesn't require logical comparisons to decide on whether to keep going or not unlike Do While/Until and hence run a bit faster.

 

I'm sure there are other bits as well.

Ksharp
Super User

Tom ,

What if there were a gap between two date?

Your code can't get job down .

Like this :

1 1 1

2 1 9

<-------------

5 1 5

Tom
Super User Tom
Super User

@Ksharp wrote:

Tom ,

What if there were a gap between two date?

Your code can't get job down .

Like this :

1 1 1

2 1 9

<-------------

5 1 5


Yes. The code assumes there are no gaps in dates.  The original poster said they had used PROC EXPAND to fill any date gaps.

hhchenfx
Rhodochrosite | Level 12

Yeah, my date will be continuous. 

HHC

 

acordes
Rhodochrosite | Level 12

Thanks for your solution. 

 

For me it's easier to understand if I do a proc transpose before. 

Here comes my question. Inside the loop I want to to dynamically use the max(of arr[*]) construct. 

But using for example max(of temp(1 --3)) doesn't give the desired result. 

I tried as well to construct a command inside the loop "max(temp(1), temp(2))" but I cannot plug this character expression into to max_value line. I thought the resolve and %sysevalf would accomplish this, but I was wrong, at least with my attempts.

 

proc transpose data=have out=wide(drop=_name_) prefix=_;
id date;
by id;
var value;
run;


data wide1;
set wide;
array temp _:;
do i=1 to dim(temp) - cmiss(of temp(*));
if i=1 then max_value=temp(1);
if i=2 then max_value=max(temp(1), temp(2));
if i ge 3 then max_value=max(temp(i), temp(i-1), temp(i-2));
value=temp(i);
date=i;
max_date=vname(temp(whichn(max_value,of temp[*])));
output;
end;
drop i _:;
run;
Tom
Super User Tom
Super User

You cannot do that.

The code for a data step is fixed before the data step starts running.  You cannot use macro code to change the SAS code after the step has started executing (even if you could figure out a way to pass the value of the dataset variable to the macro processor).

 

Which is one reason processing the vertical table is so much easier.

 

Just use a DO loops and keep track of the maximum value seen in the inner loop.  Since you stuffed the dates into the variable names you will have to add logic to extract the name back from the variable name.

data wide1;
  set wide;
  array temp _:;
  do start = 1 to dim(temp);
    max_value=.;
    max_date=.;
    do back=max(1,start-2) to start;
      if temp[back] > max_value then do;
        max_value=temp[back];
        max_date=input(substr(vname(temp[back]),2),32.);
      end;
    end;
    output;
  end;
  keep id start max_value max_date ;
run;

You could also still use the circular array trick, by then did you bother with transposing the data?

 

 

acordes
Rhodochrosite | Level 12

@Tom , by the way.

It's you who received an award at the SAS Explore event, right?

Tom
Super User Tom
Super User

@acordes wrote:

@Tom , by the way.

It's you who received an award at the SAS Explore event, right?


Yes.

https://communities.sas.com/t5/SAS-Explore/2022-SAS-Customer-Recognition-Award-Winners/m-p/835967/em...

 

acordes
Rhodochrosite | Level 12

Congrats and many thanks for the many things I've learned from you and other super users. 

But I would like to come back to my thought. 

Isn't it possible to create a string with logic as for example "max(temp(1), temp(2))" (for i=2) and use it for the max_value as argument?

So I would need to resolve the cat string and use it as input to the function: max_value=resolve(cat_string);

 

Tom
Super User Tom
Super User

Not sure what you are asking.

You could use macro code to generate "wall paper" code like:

array values _: ;
array max_values [100] ;
max_values[1] =values[1];
max_values[2] =max(values[1],values[2]);
max_values[3]=max(values[1].values[2],values[3]);
max_values[4]=max(values[2],values[3].values[4]);
....

But why?

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 21 replies
  • 3248 views
  • 15 likes
  • 5 in conversation