Hi Everyone,
For each date, I lookback 3 days and find the max value using this proc expand below.
Now I want to find the day that rolling max occurs.
So I use the loop to get it done. However, it is quite slow. I wonder if there is better option.
Thank you,
HHC
data have;
input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;run;
data have; set have;
drop lid;
lid=lag(id);
if lid^=id then change_id=1;
run;
proc sort data=have; by id descending date;run;
data want;
set have;
drop id_1 date1 value1 count i change_id;
count=0;
i+1;
max_value=value;
max_date=date;
do j =i to nobs until (change_id=1);
set have(rename=(id=id_1 date=date1 value=value1)) point=j nobs=nobs;
if count=3 then leave;
else
if value1>max_value then do;
max_value=value1;
max_date=date1;
end;
count=count+1;
end;
run;
A "circular" array is a simple way to implement a rolling window. Use the MOD() function to determine which variable in the array to place the current values. That way the values roll out of the window automatically when they are replaced by the new value.
data have;
input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;
data want ;
set have ;
by id;
array values [3] _temporary_ ;
array dates [3] _temporary_ ;
if first.id then call missing(of values[*] dates[*]);
values[1+mod(_n_,3)]=value;
dates[1+mod(_n_,3)]=date;
max_value=max(of values[*]);
max_date=dates[whichn(max_value,of values[*])];
run;
proc print;
run;
max_ Obs date id value value max_date 1 1 1 1 1 1 2 2 1 9 9 2 3 3 1 0 9 2 4 4 1 7 9 2 5 5 1 5 7 4 6 1 2 100 100 1 7 2 2 0 100 1 8 3 2 -2 100 1 9 4 2 2 2 4 10 5 2 0 2 4 11 6 2 1 2 4
You state "I look back 3 days and find the max value using this proc expand below." I don't see any call to Proc Expand.
How many values/dates do you have for each of your Id values? I have an idea but it may have issues with large numbers of records per id.
/*Assuming the dataset has been sorted by id and date*/
data have;
input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;
data want;
if _n_=1 then do;
declare hash h();
h.definekey('d');
h.definedata('d','v');
h.definedone();
end;
set have;
by id;
if first.id then h.clear();
d=date;v=value;h.add();
do d=date-2 to date;
if h.find()=0 then do;
if v>max_value then do;max_value=v;max_date=d;end;
end;
end;
drop d v;
run;
A "circular" array is a simple way to implement a rolling window. Use the MOD() function to determine which variable in the array to place the current values. That way the values roll out of the window automatically when they are replaced by the new value.
data have;
input date id value ;
datalines;
1 1 1
2 1 9
3 1 0
4 1 7
5 1 5
1 2 100
2 2 0
3 2 -2
4 2 2
5 2 0
6 2 1
;
data want ;
set have ;
by id;
array values [3] _temporary_ ;
array dates [3] _temporary_ ;
if first.id then call missing(of values[*] dates[*]);
values[1+mod(_n_,3)]=value;
dates[1+mod(_n_,3)]=date;
max_value=max(of values[*]);
max_date=dates[whichn(max_value,of values[*])];
run;
proc print;
run;
max_ Obs date id value value max_date 1 1 1 1 1 1 2 2 1 9 9 2 3 3 1 0 9 2 4 4 1 7 9 2 5 5 1 5 7 4 6 1 2 100 100 1 7 2 2 0 100 1 8 3 2 -2 100 1 9 4 2 2 2 4 10 5 2 0 2 4 11 6 2 1 2 4
Thanks you all for helping.
Since we have 3 working codes, I am curious which one is faster. I run 3 code on 2 PC (1 fast and 1 slower) on a sample of 37million rows.
I run on lookback 3 days and on lookback 15 days.
ARRAY method is super fast and surprisingly, the it doesn't take much time for lookback 15 than for lookback 3.
The other 2 methods are comparable in time.
Any idea why ARRAY is so good?
Thanks,
HHC
Here is the time needed:
Look back 3 | Fast PC | Slow PC |
Do Loop | real time 17.67 sec | real time 26.88 sec |
cpu time 17.48 sec | cpu time 26.85 sec | |
HASH | real time 18.34 sec | real time 26.33 sec |
cpu time 17.25 sec | cpu time 26.20 sec | |
ARRAY | real time 10.59 sec | real time 10.06 sec |
cpu time 6.34 sec | cpu time 9.84 sec | |
Do Loop | real time 55.51 sec | real time 1:30.10 |
cpu time 55.50 sec | cpu time 1:30.11 | |
Look back 15 | ||
HASH | real time 47.34 sec | real time 1:09.37 |
cpu time 47.01 sec | cpu time 1:09.39 | |
ARRAY | real time 9.07 sec | real time 10.99 sec |
cpu time 7.17 sec | cpu time 10.68 sec |
data want;
set have;
drop id_1 date1 value1 count i change_id;
count=0;
i+1;
max_value=value;
max_date=date;
do j =i to nobs until (change_id=1);
set have(rename=(id=id_1 date=date1 value=value1)) point=j nobs=nobs;
if count=15 then leave;
else
if value1>max_value then do;
max_value=value1;
max_date=date1;
end;
count=count+1;
end;
run;
data want;
if _n_=1 then do;
declare hash h();
h.definekey('d');
h.definedata('d','v');
h.definedone();
end;
set have;
by id;
if first.id then h.clear();
d=date;v=value;h.add();
do d=date-14 to date;
if h.find()=0 then do;
if v>max_value then do;max_value=v;max_date=d;end;
end;
end;
drop d v;
run;
data want ;
set have ;
by id;
array values [15] _temporary_ ;
array dates [15] _temporary_ ;
if first.id then call missing(of values[*] dates[*]);
values[1+mod(_n_,15)]=value;
dates[1+mod(_n_,15)]=date;
max_value=max(of values[*]);
max_date=dates[whichn(max_value,of values[*])];
run;
@hhchenfx wrote:
Thanks you all for helping.
Since we have 3 working codes, I am curious which one is faster. I run 3 code on 2 PC (1 fast and 1 slower) on a sample of 37million rows.
I run on lookback 3 days and on lookback 15 days.
ARRAY method is super fast and surprisingly, the it doesn't take much time for lookback 15 than for lookback 3.
The other 2 methods are comparable in time.
Any idea why ARRAY is so good?
Thanks,
HHC
Arrays use internal fixed storage addresses, usually held in memory, and the technology behind them has been around for a very long time. I can't find a quick reference for the first array use but languages with arrays such as FORTRAN have been around nearly 70 years, FORTRAN developed in 1954 for example. So there has been a lot of opportunity to optimize code. The bit about addresses means that values can be retrieved/written very quickly. The iterated loop, such as Do i= 1 to 10 doesn't require logical comparisons to decide on whether to keep going or not unlike Do While/Until and hence run a bit faster.
I'm sure there are other bits as well.
Tom ,
What if there were a gap between two date?
Your code can't get job down .
Like this :
1 1 1
2 1 9
<-------------
5 1 5
@Ksharp wrote:
Tom ,
What if there were a gap between two date?
Your code can't get job down .
Like this :
1 1 1
2 1 9
<-------------
5 1 5
Yes. The code assumes there are no gaps in dates. The original poster said they had used PROC EXPAND to fill any date gaps.
Yeah, my date will be continuous.
HHC
Thanks for your solution.
For me it's easier to understand if I do a proc transpose before.
Here comes my question. Inside the loop I want to to dynamically use the max(of arr[*]) construct.
But using for example max(of temp(1 --3)) doesn't give the desired result.
I tried as well to construct a command inside the loop "max(temp(1), temp(2))" but I cannot plug this character expression into to max_value line. I thought the resolve and %sysevalf would accomplish this, but I was wrong, at least with my attempts.
proc transpose data=have out=wide(drop=_name_) prefix=_;
id date;
by id;
var value;
run;
data wide1;
set wide;
array temp _:;
do i=1 to dim(temp) - cmiss(of temp(*));
if i=1 then max_value=temp(1);
if i=2 then max_value=max(temp(1), temp(2));
if i ge 3 then max_value=max(temp(i), temp(i-1), temp(i-2));
value=temp(i);
date=i;
max_date=vname(temp(whichn(max_value,of temp[*])));
output;
end;
drop i _:;
run;
You cannot do that.
The code for a data step is fixed before the data step starts running. You cannot use macro code to change the SAS code after the step has started executing (even if you could figure out a way to pass the value of the dataset variable to the macro processor).
Which is one reason processing the vertical table is so much easier.
Just use a DO loops and keep track of the maximum value seen in the inner loop. Since you stuffed the dates into the variable names you will have to add logic to extract the name back from the variable name.
data wide1;
set wide;
array temp _:;
do start = 1 to dim(temp);
max_value=.;
max_date=.;
do back=max(1,start-2) to start;
if temp[back] > max_value then do;
max_value=temp[back];
max_date=input(substr(vname(temp[back]),2),32.);
end;
end;
output;
end;
keep id start max_value max_date ;
run;
You could also still use the circular array trick, by then did you bother with transposing the data?
@Tom , by the way.
It's you who received an award at the SAS Explore event, right?
Congrats and many thanks for the many things I've learned from you and other super users.
But I would like to come back to my thought.
Isn't it possible to create a string with logic as for example "max(temp(1), temp(2))" (for i=2) and use it for the max_value as argument?
So I would need to resolve the cat string and use it as input to the function: max_value=resolve(cat_string);
Not sure what you are asking.
You could use macro code to generate "wall paper" code like:
array values _: ;
array max_values [100] ;
max_values[1] =values[1];
max_values[2] =max(values[1],values[2]);
max_values[3]=max(values[1].values[2],values[3]);
max_values[4]=max(values[2],values[3].values[4]);
....
But why?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.